Therefore, the classification of audio segments into good or
bad quality cannot yet be solved automatically, as the term
“good quality” is very subjective and strongly depends on
one’s personal perception. Nevertheless, the proposed segmentation
method can perform a preselection that speeds
up the manual transcription process significantly. The
experimental results show that the WER decreased by about
19 percent when 7.2 hours of speech data from our
lecture videos were added to the training set (cf. Table 3).
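
For readers who want to reproduce this kind of comparison, the following is a minimal sketch of how a word error rate (WER) and its reduction could be computed. It assumes the standard definition of WER as the word-level edit distance divided by the number of reference words. The function names and the example sentences are hypothetical and are not taken from our evaluation setup. The text above does not state whether the 19 percent is an absolute or a relative reduction, so the sketch provides both.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(wer_before: float, wer_after: float) -> float:
    """Difference between the two WER values (in the same unit as the inputs)."""
    return wer_before - wer_after


def relative_reduction(wer_before: float, wer_after: float) -> float:
    """Reduction expressed as a fraction of the baseline WER."""
    return (wer_before - wer_after) / wer_before


if __name__ == "__main__":
    # Hypothetical example sentences, for illustration only.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
```

Reporting which of the two reductions is meant avoids ambiguity when WER figures from different training sets are compared.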