Several previous works on speech recognition for tonal languages have sought to improve performance by exploiting additional information such as tone features. Earlier experimental results showed that using tone features as additional inputs for training the acoustic model yielded higher accuracy than the baseline system for Thai [1] and for Mandarin news broadcast speech recognition [2]. A context-independent acoustic model for the Thai language has also been investigated [3]. The method used to create the acoustic model largely determines how well the system learns from speech data. The Hidden Markov Model (HMM) is a well-known and popular approach to acoustic modeling, and its parameters can be estimated and adapted automatically to give optimal performance. Although HMMs are effective for acoustic modeling, they suffer from several limitations: they implicitly assume an exponentially decaying state-duration distribution, the transition probability depends only on the origin and destination states, and each observation frame depends only on the state that generated it, not on neighboring observation frames. Gaussian Mixture Models (GMMs) are commonly used to model the state emission statistics within the HMM framework.
Neural networks have also been used in speech recognition, trained on targets generated from forward-backward probabilities [4], [5]. In the connectionist-HMM framework, a neural network replaces the GMM acoustic model and estimates the posterior probability of each phonetic unit given an input vector of context-window frames [6], [7]. This framework has been applied to continuous speech recognition [8] and integrated with fuzzy logic for Arabic speech recognition [9]. For the model topology, first-order left-to-right HMMs with self-loops are generally used as acoustic models. An efficient model for speech utterances is the Continuous Density Hidden Markov Model (CDHMM), which is well suited to describing speech events [10]. In this paper, the state emission probabilities are estimated with an Artificial Neural Network, specifically Multi-Layer Perceptrons (MLPs), in a so-called hybrid MLP-HMM, in order to improve recognition performance over the HMM framework. The state emission probabilities of the phoneme HMMs are estimated from the output nodes of the MLP, and the Viterbi algorithm is then employed as the decoder. For tonal languages, tone features are extracted from the speech signal and classified by an MLP to provide additional features. Throughout the experiments, the proposed system is compared against the baseline under different configurations, such as the use of tone features and the number of hidden layers in the MLP classifier.
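The hybrid decoding step described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: in connectionist-HMM systems the MLP posteriors P(state | frame) are conventionally divided by the state priors P(state) to obtain scaled likelihoods, which then serve as the emission scores in Viterbi decoding. All function names and the toy dimensions here are illustrative assumptions.

```python
import numpy as np

def scaled_log_likelihoods(posteriors, priors, eps=1e-10):
    """Convert MLP posteriors P(state|frame), shape (T, S), into scaled
    likelihoods P(frame|state) proportional to P(state|frame)/P(state),
    computed in the log domain for numerical stability."""
    return np.log(posteriors + eps) - np.log(priors + eps)

def viterbi(log_emission, log_trans, log_init):
    """Viterbi decoding over T frames and S states.
    log_emission: (T, S) per-frame emission log-scores,
    log_trans:    (S, S) transition log-probabilities (from-state x to-state),
    log_init:     (S,)   initial-state log-probabilities.
    Returns the most likely state sequence of length T."""
    T, S = log_emission.shape
    delta = log_init + log_emission[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)          # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # (S, S): score of each transition
        back[t] = np.argmax(scores, axis=0)     # best predecessor per state
        delta = scores[back[t], np.arange(S)] + log_emission[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):               # trace backpointers
        path[t - 1] = back[t, path[t]]
    return path
```

With a left-to-right two-state HMM and posteriors that favor state 0 in early frames and state 1 in later frames, the decoder recovers the expected monotone state path; in the full system, the posteriors would come from the MLP output layer and the priors from the relative state frequencies in the training alignment.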
The rest of this paper is organized as follows. Section 2 reviews the Thai phonetic system. Section 3 introduces the proposed framework, consisting of a hybrid MLP-HMM and tone recognition. Section 4 describes the experiments and results, and Section 5 concludes the paper.