The system then extracts twenty evenly spaced image frames
from each of these five-minute video clips for visual
feature extraction. Additionally, the system splits the
audio channel of each clip into twenty individual fifteen-second
segments for further audio feature extraction.
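
The arithmetic behind this sampling step can be sketched as follows. A five-minute clip lasts 300 seconds, so twenty audio segments of fifteen seconds each cover it exactly. The text does not state where within the clip each frame is taken, so the sketch assumes one frame at the midpoint of each fifteen-second window. The function names `frame_timestamps` and `audio_segments` are hypothetical and serve only as illustration.

```python
# Minimal sketch of the sampling schedule for one clip.
# Assumption: frames are taken at the midpoint of each fifteen-second
# window; the source only says the frames are spread evenly.

CLIP_SECONDS = 5 * 60        # five-minute clip
NUM_FRAMES = 20              # image frames per clip
NUM_AUDIO_SEGMENTS = 20      # audio segments per clip


def frame_timestamps(clip_seconds=CLIP_SECONDS, num_frames=NUM_FRAMES):
    """Return evenly spaced frame timestamps in seconds."""
    step = clip_seconds / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


def audio_segments(clip_seconds=CLIP_SECONDS, num_segments=NUM_AUDIO_SEGMENTS):
    """Return (start, end) boundaries in seconds for each audio segment."""
    length = clip_seconds / num_segments
    return [(i * length, (i + 1) * length) for i in range(num_segments)]


if __name__ == "__main__":
    print(frame_timestamps()[:3])
    print(audio_segments()[:3])
```

Under this assumption each frame falls inside exactly one audio segment, which would make it straightforward to pair visual and audio features by index.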