Modern general-purpose speech recognition systems
are based on Hidden Markov Models (HMMs). An HMM
is a doubly stochastic process with an underlying
stochastic process that is not observable (it is hidden)
but can be observed only through another set of
stochastic processes that produce the sequence of observed
symbols [7], [8]. HMMs are statistical models
that output a sequence of symbols or quantities, and
are used in speech recognition because a speech signal
can be viewed as a piecewise stationary signal or
a short-time stationary signal. On short time scales
(e.g., 10 milliseconds), speech can be approximated
as a stationary process. Speech can be thought of as a
Markov model for many stochastic purposes [9]. Another
reason HMMs are popular is that they
can be trained automatically and are simple and computationally
feasible to use. In speech recognition, the
hidden Markov model would output a sequence of n-dimensional
real-valued vectors (with n being a small
integer, such as 10), outputting one of these every 10
milliseconds. The vectors would consist of cepstral
coefficients, which are obtained by taking a Fourier
transform of a short time window of speech and decorrelating
the spectrum using a cosine transform,
then taking the first (most significant) coefficients. The
hidden Markov model will tend to have, in each state,
a statistical distribution that is a mixture of diagonal-covariance
Gaussians, which gives a likelihood for
each observed vector. Each word
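The two ingredients just described — cepstral feature extraction (Fourier transform of a short frame, log spectrum, cosine transform, keep the first coefficients) and a diagonal-covariance Gaussian-mixture emission likelihood — can be sketched as follows. This is a minimal NumPy illustration, not a production recognizer: the sine-wave frame and the mixture weights, means, and variances are synthetic values chosen only to make the example run.

```python
import numpy as np

def cepstral_features(frame, n_coeffs=10):
    """Cepstral-style features for one short-time frame:
    FFT -> log magnitude spectrum -> DCT-II (decorrelates
    the spectrum) -> keep the first n_coeffs coefficients."""
    windowed = frame * np.hamming(len(frame))   # taper frame edges
    spectrum = np.abs(np.fft.rfft(windowed))    # magnitude spectrum
    log_spec = np.log(spectrum + 1e-10)         # avoid log(0)
    k = len(log_spec)
    n = np.arange(k)
    # DCT-II of the log spectrum, first n_coeffs terms
    return np.array([
        np.sum(log_spec * np.cos(np.pi * m * (2 * n + 1) / (2 * k)))
        for m in range(n_coeffs)
    ])

def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of observation vector x under a mixture of
    diagonal-covariance Gaussians (an HMM state emission model)."""
    # per-component Gaussian log-densities, diagonal covariance
    log_dens = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                       + np.sum((x - means) ** 2 / variances, axis=1))
    return float(np.logaddexp.reduce(np.log(weights) + log_dens))

# One 10 ms frame at 16 kHz = 160 samples (synthetic 440 Hz tone)
sr = 16000
t = np.arange(160) / sr
frame = np.sin(2 * np.pi * 440 * t)
x = cepstral_features(frame)        # 10-dimensional observation vector

# Toy 2-component mixture in the 10-dimensional feature space
weights = np.array([0.6, 0.4])
means = np.stack([np.zeros(10), np.ones(10)])
variances = np.full((2, 10), 4.0)
print(diag_gmm_loglik(x, weights, means, variances))
```

In a full recognizer, each HMM state would hold its own trained mixture, and these per-state likelihoods would feed the Viterbi or forward algorithm; here a single mixture is evaluated just to show the shape of the computation.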