The recognition and retrieval of actions in videos is challenging due to the need to
handle many sources of variation: viewpoint, size and appearance of actors, scene
lighting, video quality, etc. In this paper we introduce a novel action representation
based on motion dynamics that is robust to such variations.
Currently, state-of-the-art performance in action classification is achieved by extracting
dense local features (HOG, HOF, MBH) and grouping them in a bag-of-features
(BOF) framework [26]. The basic BOF representation ignores information about the
spatial and temporal arrangement of the local features by pooling them over the entire
video volume. More recently, it has been shown that considering the spatial and
temporal arrangements (dynamics) of an action (e.g., extracting a separate BOF model for
each subvolume of a video [14,26] or modelling the spatio-temporal arrangements of
the interest points [29]) adds more discriminative power to the representation.
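As a concrete illustration, the following Python sketch contrasts the basic BOF representation with a variant that keeps coarse temporal arrangement by pooling per temporal subvolume. It assumes local descriptors (e.g., HOG/HOF/MBH) have already been extracted; the function names, parameters, and synthetic data are illustrative assumptions, not from [26].

```python
# Minimal BOF sketch: codebook construction, whole-video pooling, and a
# temporal-grid variant. Descriptor data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans


def build_codebook(descriptors, k=100, seed=0):
    """Cluster pooled training descriptors into a k-word visual vocabulary."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(descriptors)


def bof_histogram(codebook, descriptors):
    """Quantize each descriptor to its nearest word and pool into a single
    L1-normalized histogram over the entire video volume (basic BOF)."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)


def temporal_grid_bof(codebook, descriptors, frame_ids, n_cells=3):
    """Keep coarse temporal arrangement: split the video into n_cells
    temporal subvolumes, pool one BOF histogram per cell, concatenate."""
    bounds = np.linspace(frame_ids.min(), frame_ids.max() + 1, n_cells + 1)
    cells = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        mask = (frame_ids >= lo) & (frame_ids < hi)
        cells.append(bof_histogram(codebook, descriptors[mask])
                     if mask.any() else np.zeros(codebook.n_clusters))
    return np.concatenate(cells)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(2000, 96))      # stand-in for HOG/HOF/MBH features
    video = rng.normal(size=(500, 96))
    frames = rng.integers(0, 120, size=500)  # frame index of each descriptor
    cb = build_codebook(train, k=50)
    print(bof_histogram(cb, video).shape)              # (50,)
    print(temporal_grid_bof(cb, video, frames).shape)  # (150,)
```

The temporal-grid variant makes the trade-off discussed above explicit: the representation grows by a factor of n_cells but retains information about when, within the video, each local pattern occurs.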
Our approach is based on the observation that the dynamics of an action provide a
powerful cue for discrimination. In Johansson's moving light display experiment, it was
shown that humans perceive actions by abstracting a coherent structure from the spatio-temporal
pattern of local movements [9]. While humans respond to both spatial and
temporal information, the spatial configuration of movements that comprise an action is
strongly affected by changes in viewpoint. This suggests that representing the temporal
structure of an action could be valuable for reducing the effect of viewpoint changes. Motivated
by this observation, we define human actions as a composition of temporal patterns of
movements.
