3. AUTOMATIC SPEECH RECOGNITION
The ASR service implements a transcription engine based on audio segmentation and a multi-pass ASR decoding strategy. A video is first segmented into speech utterances, and the utterances are subsequently transcribed.
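As a minimal sketch of this two-stage flow only, the control structure could look as follows; the helper names (segment_into_utterances, decode_multipass) are placeholders and not part of the described system, and the bodies are stubs.

def transcribe_audio(samples, sample_rate):
    """Segment the audio track into speech utterances, then decode each one."""
    utterances = segment_into_utterances(samples, sample_rate)
    return [decode_multipass(utt, sample_rate) for utt in utterances]

def segment_into_utterances(samples, sample_rate):
    """Placeholder: would return the speech regions found by the segmenter."""
    return []

def decode_multipass(utterance, sample_rate):
    """Placeholder: first pass with baseline models, later passes with
    speaker-adapted models."""
    return ""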
The audio segmentation is based on 64-component Text Independent Gaussian Mixture Models (TIGMM). An ergodic HMM whose state models represent the TIGMMs segments the audio stream into regions of speech, music, noise, or silence by computing the Viterbi alignment. Utterances are then defined as the detected speech regions. The speech utterances are clustered based on full covariance Gaussians on the Perceptual Linear Prediction (PLP) feature stream. Each cluster is forced to contain at least 10 seconds of speech, and utterances in these clusters share adaptation parameters in the subsequent transcription process.
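The segmentation step can be illustrated with the following sketch. It assumes that per-frame log-likelihoods from the four class models (speech, music, noise, silence) have already been computed on the PLP stream; the constant switching penalty is an illustrative stand-in for the ergodic HMM's transition probabilities, not the values used in the described system.

import numpy as np

CLASSES = ["speech", "music", "noise", "silence"]

def viterbi_segment(loglik, switch_penalty=-10.0):
    """loglik: (n_frames, n_classes) per-frame log-likelihoods.
    Returns the best class label per frame under a fully connected
    (ergodic) HMM with a constant penalty for changing class."""
    n_frames, n_classes = loglik.shape
    trans = np.full((n_classes, n_classes), switch_penalty)
    np.fill_diagonal(trans, 0.0)            # no penalty for staying in a class

    score = loglik[0].copy()
    backptr = np.zeros((n_frames, n_classes), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + trans       # cand[prev, cur]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + loglik[t]

    path = np.empty(n_frames, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_frames - 1, 0, -1):    # backtrace the best state sequence
        path[t - 1] = backptr[t, path[t]]
    return path

def speech_regions(path, speech_idx=0):
    """Collapse the frame labels into (start, end) frame ranges of speech."""
    regions, start = [], None
    for t, c in enumerate(path):
        if c == speech_idx and start is None:
            start = t
        elif c != speech_idx and start is not None:
            regions.append((start, t))
            start = None
    if start is not None:
        regions.append((start, len(path)))
    return regions

The resulting speech regions are the utterances; the subsequent clustering with full covariance Gaussians and the 10-second minimum per cluster are not shown here.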
The baseline transcription system was a Broadcast News (BN) system trained on the 1996 and 1997 DARPA Hub4 acoustic model training sets (about 150 hours of data) and the 1996 Hub4 CSR language model training set (128M words). This system uses a Good-Turing smoothed 4-gram language model, pruned with the Seymore-Rosenfeld algorithm [5] to about 8M n-grams for a vocabulary of about 71k words. The baseline acoustic model is trained on PLP cepstra, uses a linear discriminant analysis transform to project 9 consecutive 13-dimensional frames onto a 39-dimensional feature space, and uses Semi-tied Covariances [6]. The acoustic model uses triphone state tying with about 8k distinct distributions. Distributions model emissions with 16-component Gaussian mixture densities. In addition to the baseline acoustic model, a feature space speaker adaptive model is used [6].
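The front-end projection can be sketched as below, assuming 13-dimensional PLP cepstra and a context of 9 frames; the projection matrix is random here purely for illustration, whereas the real system estimates the LDA (and semi-tied covariance) transform from training data.

import numpy as np

FRAME_DIM = 13   # PLP cepstra per frame
CONTEXT = 9      # consecutive frames stacked (4 on each side of the centre)
OUT_DIM = 39     # projected feature dimension

def stack_frames(plp, context=CONTEXT):
    """Stack each frame with its neighbours into a 9*13 = 117-dim supervector,
    padding the edges by repeating the first/last frame."""
    half = context // 2
    padded = np.pad(plp, ((half, half), (0, 0)), mode="edge")
    n = plp.shape[0]
    return np.stack([padded[t:t + context].reshape(-1) for t in range(n)])

def project(stacked, lda_matrix):
    """Project the stacked frames onto the 39-dimensional model space."""
    return stacked @ lda_matrix

if __name__ == "__main__":
    plp = np.random.randn(100, FRAME_DIM)                 # 100 frames of PLP cepstra
    lda = np.random.randn(FRAME_DIM * CONTEXT, OUT_DIM)   # placeholder transform
    feats = project(stack_frames(plp), lda)
    print(feats.shape)                                    # (100, 39)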