We evaluate alternative methods for action retrieval from scripts and show the benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids and multichannel non-linear SVMs. The method is shown to improve state-of-the-art results on the standard KTH action dataset, achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we particularly investigate and demonstrate the high tolerance of our method to annotation errors in the training set. We finally apply the method to learning and classifying challenging action classes in movies and show promising results.

1. Introduction

In the last decade the field of visual recognition has undergone an outstanding evolution, from classifying instances of toy objects towards recognizing classes of objects and scenes in natural images. Much of this progress has been sparked by the creation of realistic image datasets as well as by new, robust methods for image description and classification. We take inspiration from this progress and aim to transfer previous experience to the domain of video recognition, and to the recognition of human actions in particular.

Existing datasets for human action recognition (e.g. [15], see figure 8) provide samples for only a few action classes recorded in controlled and simplified settings. This stands in sharp contrast with the demands of real applications focused on natural video, where human actions are subject to individual variations of people in expression, posture, motion and clothing; perspective effects and camera motions; illumination variations; occlusions; and variation in scene surroundings.

[Figure 1. Realistic samples for three classes of human actions: kissing; answering a phone; getting out of a car. All samples have been automatically retrieved from script-aligned movies.]

In this paper we address limitations of current
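
To make the multichannel non-linear SVM component mentioned in the abstract concrete, the sketch below shows one common formulation for combining several bag-of-features channels: a precomputed kernel K(x_i, x_j) = exp(-sum_c D_c(x_i, x_j) / A_c), where D_c is the chi-square distance between histograms on channel c and A_c is a per-channel normalization constant (here taken as the mean training distance). This is a minimal illustration under our own assumptions; the channel contents, the toy data, and the bandwidth choice are hypothetical stand-ins rather than the paper's exact setup.

import numpy as np
from sklearn.svm import SVC

def chi2_distance(X, Y):
    # Pairwise chi-square distance between rows of two histogram matrices:
    # D[i, j] = 0.5 * sum_k (X[i, k] - Y[j, k])^2 / (X[i, k] + Y[j, k]).
    diff = X[:, None, :] - Y[None, :, :]
    summ = X[:, None, :] + Y[None, :, :] + 1e-10   # guard against 0/0
    return 0.5 * np.sum(diff ** 2 / summ, axis=-1)

def multichannel_kernel(channels_a, channels_b, bandwidths):
    # channels_a, channels_b: lists holding one (n_samples, n_bins)
    # histogram array per channel; bandwidths: one constant A_c per
    # channel. Returns K = exp(-sum_c D_c / A_c).
    total = np.zeros((channels_a[0].shape[0], channels_b[0].shape[0]))
    for Xa, Xb, A in zip(channels_a, channels_b, bandwidths):
        total += chi2_distance(Xa, Xb) / (A + 1e-10)
    return np.exp(-total)

# Hypothetical toy data standing in for two feature channels of
# L1-normalized bag-of-features histograms (e.g. two grid layouts).
rng = np.random.default_rng(0)
train = [rng.dirichlet(np.ones(64), size=40) for _ in range(2)]
labels = rng.integers(0, 2, size=40)

# A_c: mean chi-square distance per channel over the training set.
bandwidths = [chi2_distance(X, X).mean() for X in train]

K_train = multichannel_kernel(train, train, bandwidths)
clf = SVC(kernel="precomputed").fit(K_train, labels)

test = [rng.dirichlet(np.ones(64), size=5) for _ in range(2)]
K_test = multichannel_kernel(test, train, bandwidths)
print(clf.predict(K_test))

Using a precomputed kernel keeps the non-linear channel combination outside the SVM solver, so channels and their normalizations can be varied without touching the learning machinery; multi-class classification follows directly from the solver's built-in one-vs-one scheme.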