Feature extraction for the phone-based recognizer: 39-dimensional MFCC feature vectors (12 MFCCs plus energy, together with their first- and second-order temporal derivatives) were extracted from the speech signals after pre-emphasis. The speech was analyzed with a frame size of 25 ms and a frame shift of 10 ms using a Hamming window. As a baseline for comparison, a continuous phone-based HMM recognizer was implemented with HTK [11]. Each phone was represented by a 5-state left-to-right model with one Gaussian mixture per state, using diagonal covariances. The acoustic models were trained by maximum likelihood estimation (MLE), which estimates the model parameters from a set of observations of the random variables related to those parameters.
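The front-end steps above (pre-emphasis, 25 ms framing with a 10 ms shift, Hamming windowing, and stacking first- and second-order derivatives onto the 13 static coefficients) can be sketched as follows. This is a minimal NumPy sketch under the assumption of a 16 kHz sampling rate, which the text does not state; the filterbank/DCT stage that produces the 12 MFCCs themselves is omitted.

```python
import numpy as np

SR = 16000                      # assumed sampling rate (not given in the text)
FRAME_LEN = int(0.025 * SR)     # 25 ms analysis window -> 400 samples
FRAME_SHIFT = int(0.010 * SR)   # 10 ms frame shift -> 160 samples

def preemphasize(signal, coeff=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_and_window(signal):
    """Slice the signal into overlapping 25 ms frames, 10 ms apart,
    and apply a Hamming window to each frame."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT
    idx = (np.arange(FRAME_LEN)[None, :]
           + FRAME_SHIFT * np.arange(n_frames)[:, None])
    return signal[idx] * np.hamming(FRAME_LEN)

def add_deltas(static):
    """Append first- and second-order temporal derivatives, turning each
    13-dim static vector (12 MFCCs + energy) into a 39-dim vector."""
    delta = np.gradient(static, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta2])
```

For one second of 16 kHz audio this framing yields 98 frames, and `add_deltas` maps a `(T, 13)` static-feature matrix to the `(T, 39)` vectors used by both the HMM baseline and the MLP.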
Phoneme recognition using the hybrid MLP-HMM framework: the 39-dimensional MFCC feature vectors are fed into an input layer that is fully connected to one hidden layer and to 53 output neurons corresponding to the phonetic units. To provide the MLP with contextual information, 9 consecutive frames of data are given as input.
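The context stacking and the MLP forward pass can be sketched as below. The 9-frame context, 39-dim features, and 53 outputs come from the text; the hidden-layer size and the sigmoid hidden activation are assumptions, as the text does not specify them.

```python
import numpy as np

CONTEXT = 9      # consecutive frames fed to the MLP (from the text)
FEAT_DIM = 39    # per-frame MFCC vector (from the text)
N_PHONES = 53    # one output neuron per phonetic unit (from the text)
HIDDEN = 500     # hidden-layer size: an assumption, not given in the text

def stack_context(feats):
    """Concatenate each frame with its 4 left and 4 right neighbours
    (9 frames in total), repeating edge frames at the boundaries."""
    pad = CONTEXT // 2
    padded = np.pad(feats, ((pad, pad), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(CONTEXT)])

def mlp_posteriors(x, w1, b1, w2, b2):
    """One fully connected sigmoid hidden layer followed by a softmax over
    the 53 phone classes; in the hybrid system these per-frame posteriors
    are converted to scaled likelihoods for HMM decoding."""
    h = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))
    logits = h @ w2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Each stacked input vector has 9 × 39 = 351 dimensions, and each output row is a probability distribution over the 53 phone classes.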
Tone recognition: the input feature set consists of smoothed F0 together with its delta and double-delta values, fed into an input layer fully connected to one hidden layer and to 5 output neurons corresponding to the five tones. In addition, a bigram model is used as the language model for all configurations in the decoder.

1 http://audacity.sourceforge.net/
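Preparing the tone-recognition input (smoothed F0 plus its delta and double-delta) can be sketched as follows. The moving-average smoother and its window width are assumptions; the text only says "smoothed F0" without naming the smoothing method.

```python
import numpy as np

N_TONES = 5   # output neurons, one per tone (from the text)

def smooth_f0(f0, width=5):
    """Moving-average smoothing of the frame-level F0 contour; the
    window width is an assumed value, not stated in the text."""
    kernel = np.ones(width) / width
    return np.convolve(f0, kernel, mode="same")

def tone_features(f0):
    """Build the tone-recognition input: smoothed F0 plus its first
    (delta) and second (double-delta) derivatives, one 3-dim vector
    per frame, fed to the MLP with 5 tone outputs."""
    s = smooth_f0(f0)
    d = np.gradient(s)
    dd = np.gradient(d)
    return np.stack([s, d, dd], axis=1)
```

The resulting `(T, 3)` matrix plays the same role for the tone MLP that the 39-dim MFCC vectors play for the phone MLP.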