Text is a high-level semantic feature which has often
been used for content-based information retrieval. In lecture
videos, texts from lecture slides serve as an outline for the
lecture and are very important for understanding. Therefore,
after a video file has been segmented into a set of key frames
(all the unique slides with complete contents), the text
detection procedure is executed on each key frame, and the
extracted text objects are then used in the text recognition
and slide structure analysis processes. In particular, the
extracted structural metadata enables more flexible video
browsing and video search functions.
Speech is one of the most important carriers of information
in video lectures. It is therefore highly advantageous
to apply this information to automatic lecture
video indexing. Unfortunately, most of the existing lecture
speech recognition systems in the reviewed work cannot
achieve sufficient recognition accuracy: the Word Error Rates
(WERs) reported in [1], [2], [3], [4], [5], and
[6] are approximately 40–85 percent. The poor recognition