To illustrate our approach, consider the action of a person checking a watch, seen from a
frontal view (Fig. 1). This action can be characterized by the upward movement of the
hand and upper arm during the early part of the action (to bring the watch to a readable
distance) and the downward movement of the same body parts at the end of the action.
We can imagine encoding these body part movements with clusters of flow vectors,
where each cluster explains some portion of the total flow across the video. We refer to
these clusters as flow words. In the check-watch example, the upward hand movement
might be mapped to a single flow word. That word would be present in the first half of
the frames and absent in the second half (when the hand moves downward).
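
The check-watch example can be made concrete with a small sketch. It is an illustration only, not the method developed in this paper: the two-entry codebook, the word names hand_up and hand_down, the convention that dy > 0 means upward motion, and the synthetic four-frame clip are all assumptions introduced here. Each flow vector is assigned to its nearest codebook entry, and a flow word counts as present in a frame if at least one vector in that frame maps to it.

```python
# Hypothetical codebook: each flow word is summarized by a mean flow vector (dx, dy).
# Assumed convention for this sketch: dy > 0 means upward motion.
FLOW_WORDS = {"hand_up": (0.0, 1.0), "hand_down": (0.0, -1.0)}


def nearest_flow_word(vec, codebook=FLOW_WORDS):
    """Return the name of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        codebook,
        key=lambda w: (vec[0] - codebook[w][0]) ** 2 + (vec[1] - codebook[w][1]) ** 2,
    )


def word_presence(frames, word, codebook=FLOW_WORDS):
    """For each frame (a list of flow vectors), report whether any vector maps to `word`."""
    return [
        any(nearest_flow_word(v, codebook) == word for v in frame)
        for frame in frames
    ]


# Synthetic check-watch clip: the hand moves up in the first two frames
# and down in the last two.
frames = [
    [(0.1, 0.9), (-0.1, 1.1)],
    [(0.0, 0.8)],
    [(0.1, -1.0)],
    [(-0.2, -0.7), (0.0, -1.2)],
]

presence = word_presence(frames, "hand_up")
print(presence)  # [True, True, False, False]
```

In this toy clip the hand_up word is present in the first half of the frames and absent in the second half, which is the temporal pattern described above.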