In speech recognition, a 5-state HMM is commonly used to model phonemes. A 5-state HMM might not be suitable for classifying the emotion of a whole sentence, which contains several phonemes. We therefore conducted this experiment to find the optimal number of HMM states and Gaussian mixtures. The number of HMM states was varied from 8 to 32 and the number of Gaussian mixtures from 1 to 16. In general, the number of Gaussian mixtures corresponds to the number of clusters used to model the feature distribution. More clusters can improve accuracy, but they also increase processing time. In some cases, the training data are insufficient to estimate reliable statistics for the distinguishing features in every cluster, and increasing the number of clusters then decreases accuracy. This problem is known as over-fitting [6].
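To make the over-fitting risk concrete, the following minimal sketch counts the free parameters of a GMM-HMM as the numbers of states and mixtures grow. It is an illustration only, not the configuration used in this work: the function name, the ergodic (fully connected) transition matrix, the diagonal covariances, and the 39-dimensional feature vector are all assumptions.

```python
def gmm_hmm_free_parameters(n_states, n_mixtures, n_features, diagonal_cov=True):
    """Count free parameters of a GMM-HMM (assumed ergodic topology)."""
    # Initial state distribution: probabilities sum to 1.
    initial = n_states - 1
    # Each row of the transition matrix sums to 1.
    transitions = n_states * (n_states - 1)
    # Mixture weights in each state sum to 1.
    weights = n_states * (n_mixtures - 1)
    # One mean vector per mixture component per state.
    means = n_states * n_mixtures * n_features
    if diagonal_cov:
        # One variance per feature dimension.
        covariances = n_states * n_mixtures * n_features
    else:
        # Symmetric full covariance matrix.
        covariances = n_states * n_mixtures * n_features * (n_features + 1) // 2
    return initial + transitions + weights + means + covariances


# Hypothetical feature dimension (not stated in the text).
N_FEATURES = 39

for states, mixtures in [(16, 2), (16, 16), (32, 16)]:
    count = gmm_hmm_free_parameters(states, mixtures, N_FEATURES)
    print(f"{states} states, {mixtures} mixtures: {count} parameters")
```

Under these assumptions, the Gaussian means and variances dominate the count and grow in proportion to the product of states and mixtures, so a fixed amount of training data is spread over many more parameters as the number of mixtures increases.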
The feature used in this research was MFCC, since it showed the best accuracy in speech recognition. The results of this experiment are shown in Table 4. The accuracy peaked at 50.75% with two Gaussian mixtures, for both 16 and 32 HMM states. We therefore used 16 HMM states and two Gaussian mixtures for the remaining experiments.
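The selection procedure can be sketched as a simple grid search. This is a hedged illustration rather than the code used in the experiment: `grid_search` and the `evaluate` callback are hypothetical names, and the tie-breaking rule (prefer fewer states, then fewer mixtures) is an assumption that happens to be consistent with choosing 16 states over 32 when both reach the same accuracy.

```python
from itertools import product


def grid_search(evaluate, state_counts, mixture_counts):
    """Evaluate every (states, mixtures) pair and return the best one.

    `evaluate` is a caller-supplied function (hypothetical here) that trains
    and tests a model with the given configuration and returns its accuracy.
    """
    results = {
        (states, mixtures): evaluate(states, mixtures)
        for states, mixtures in product(state_counts, mixture_counts)
    }
    # Highest accuracy first; ties go to the smaller model
    # (fewer states, then fewer mixtures). This tie-break is an assumption.
    best = min(results, key=lambda cfg: (-results[cfg], cfg[0], cfg[1]))
    return best, results
```

A call such as `grid_search(evaluate, (8, 16, 32), (1, 2, 4, 8, 16))` would cover the stated ranges; the intermediate grid values shown here are placeholders, since only the range endpoints are given in the text.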