To illustrate our approach, consider the action of a person checking a watch, seen from
a frontal view (Fig. 1). This action can be characterized by the upward movement of the
hand and upper arm during the early part of the action (to bring the watch to a readable
distance) and the downward movement of the same body parts at the end of the action.
We can imagine encoding these body part movements with a cluster of flow vectors,
where each cluster explains some portion of the total flow across the video. We denote
these clusters as flow words. In the check-watch example, the upward hand movement
might be mapped to a single flow word. That word would be present in the first half of
the frames and absent in the other half (when the hand moves downward).
Given a set of extracted flow words, our goal is to represent an action by encoding
the pattern of temporal occurrence of the flow words. In the example of Fig. 1, the green
and cyan words occur early in the action (when the hand and upper arm are raised)
while the blue and magenta words occur later in the action. We construct an MPH for
each flow word, which encodes that word's temporal dynamics.
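As a rough sketch of this idea, the code below clusters flow vectors into flow words by direction and accumulates one per-word temporal histogram. This is not the paper's implementation: it assumes dense flow is already computed, uses scikit-learn's GaussianMixture for the EM step, and the function name `build_mphs` and its parameters are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_mphs(flows, n_words=5, mag_thresh=1.0, seed=0):
    """flows: (T, H, W, 2) dense optical flow per frame.
    Returns an (n_words, T) array holding the MPH h_c of each flow word c."""
    T, H, W, _ = flows.shape
    fx = flows[..., 0].ravel()
    fy = flows[..., 1].ravel()
    mag = np.hypot(fx, fy)
    frame_idx = np.repeat(np.arange(T), H * W)  # frame each vector came from

    keep = mag > mag_thresh  # discard near-static pixels
    ang = np.arctan2(fy[keep], fx[keep])
    # EM clustering on direction only; unit vectors sidestep angle wrap-around
    dirs = np.column_stack([np.cos(ang), np.sin(ang)])
    gmm = GaussianMixture(n_components=n_words, random_state=seed).fit(dirs)
    words = gmm.predict(dirs)

    mph = np.zeros((n_words, T))
    # bin t of h_c accumulates the magnitudes of cluster-c vectors in frame t
    np.add.at(mph, (words, frame_idx[keep]), mag[keep])
    return mph
```

In the check-watch example, the histogram row for the upward-hand word would carry most of its mass in the early bins and be near zero in the later ones.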
We now describe the process of constructing the MPH representation. We assume
that the video is captured using a static camera (we relax this assumption in Section
3.2). First we compute dense optical flow over the video clip. Then, we use EM to
cluster together the flow vectors from all frames based only on the flow direction (we
only consider flow vectors whose magnitudes are above a certain threshold). Each flow
cluster defines a single flow word. In Fig. 1(a)-1(b) the flows are color-coded
according to the five flow words. We then generate an MPH for each of the flow clusters
by binning the flow vectors. Each bin t in the MPH h_c corresponds to frame number t,
and contains the sum of flow magnitudes for all pixel flows f that correspond to cluster
c in that frame. Let m_c denote the set of flow vectors that map to cluster c: