The aim of this paper is to address the recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems, one of which is the lack of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action retrieval from scripts and show the benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids and multi-channel non-linear SVMs. The method is shown to improve on state-of-the-art results on the standard KTH action dataset, achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we investigate and demonstrate the high tolerance of our method to annotation errors in the training set. We finally apply the method to learning and classifying challenging action classes in movies and show promising results.
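As a rough illustration of the multi-channel non-linear SVM idea mentioned above, the sketch below combines per-channel histogram distances inside an exponentiated kernel. This is a minimal sketch, not the paper's implementation: the chi-square distance, the mean-distance normalization standing in for a per-channel scale A_c, and the `multichannel_kernel` helper are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms (1-D arrays)."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(X_channels, Y_channels):
    """Combine per-channel chi-square distances into one kernel matrix.

    X_channels, Y_channels: lists with one (n_samples, dim_c) histogram
    matrix per channel. Each channel's distance matrix is normalized by
    its mean before summation; in a real setup this scale would be fixed
    on the training set rather than recomputed per kernel evaluation.
    """
    n_x = X_channels[0].shape[0]
    n_y = Y_channels[0].shape[0]
    D = np.zeros((n_x, n_y))
    for Xc, Yc in zip(X_channels, Y_channels):
        Dc = np.array([[chi2_distance(x, y) for y in Yc] for x in Xc])
        D += Dc / Dc.mean()
    return np.exp(-D)

# Usage with scikit-learn's precomputed-kernel SVM (hypothetical data):
# K_train = multichannel_kernel(train_channels, train_channels)
# clf = SVC(kernel="precomputed").fit(K_train, y_train)
# K_test = multichannel_kernel(test_channels, train_channels)
# y_pred = clf.predict(K_test)
```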
1. Introduction
In the last decade the field of visual recognition has undergone an impressive evolution, from classifying instances of toy objects to recognizing classes of objects and scenes in natural images. Much of this progress has been sparked by the creation of realistic image datasets as well as by new, robust methods for image description and classification. We take inspiration from this progress and aim to transfer previous experience to the domain of video recognition, and the recognition of human actions in particular.
Existing datasets for human action recognition (e.g. [15], see figure 8) provide samples for only a few action classes recorded in controlled and simplified settings. This stands in sharp contrast with the demands of real applications focused on natural video with human actions subjected to individual variations of people in expression, posture, motion and clothing; perspective effects and camera motions; illumination variations; occlusions; and variation in scene surroundings. In this paper we address the limitations of current datasets and collect realistic video samples with human actions, as illustrated in figure 1.

Figure 1. Realistic samples for three classes of human actions: kissing; answering a phone; getting out of a car. All samples have been automatically retrieved from script-aligned movies.