This paper presents a systematic performance comparison
of acoustic modeling units at various levels for DNN- and
GMM-based Chinese speech recognition. The introduction of
DNN-based acoustic models changes many conclusions
drawn from GMM-based systems, because a DNN is a discriminative
model whereas a GMM is a generative model. For the context
independent acoustic modeling units, syllable-based models
perform better than initial/final- or phone-based
models, especially in the DNN-based ASR systems. This
advantage comes mainly from the more detailed acoustic
description that syllable units provide. In addition, the best performance
is obtained with context-dependent phones in the
DNN systems. When context dependency information is
introduced, the performance of both initial/finals and phones
improves markedly. For the DNN-based
systems, the impact of the number of senones is also discussed.
Unlike in the GMM-based systems, when the number of senones
increases, the classification performance of the DNNs
decreases, but the ASR performance of the DNN-based acoustic
models is less affected. Notably, with
DNNs, the context-independent syllable-based systems
achieve performance similar to that of the context-dependent
initial/final-based systems. With a multi-task learning
strategy, multiple modeling units can be combined to train a better
acoustic model, since the information from different levels of
modeling units can help the feature learning of the DNN.
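
As a rough illustration of the multi-task idea, the sketch below feeds one
shared hidden layer into two softmax output layers, one per modeling unit,
and forms the training loss as a weighted sum of the two cross-entropies.
It is a minimal sketch under stated assumptions, not the configuration used
in this paper: the layer sizes, the random weights, the choice of senones as
the primary task and syllables as the auxiliary task, and the auxiliary
weight of 0.3 are all hypothetical.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def affine(W, b, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_matrix(rows, cols, rng):
    """Small random weights; purely illustrative initialisation."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def multitask_loss(params, x, senone_label, syllable_label, aux_weight=0.3):
    """Return (total, senone_ce, syllable_ce) for a single frame.

    Both output layers read the same hidden activation, so gradients of
    either cross-entropy would update the shared layer during training.
    """
    h = [math.tanh(v) for v in affine(params["W_shared"], params["b_shared"], x)]
    p_senone = softmax(affine(params["W_senone"], params["b_senone"], h))
    p_syllable = softmax(affine(params["W_syllable"], params["b_syllable"], h))
    ce_senone = -math.log(p_senone[senone_label])
    ce_syllable = -math.log(p_syllable[syllable_label])
    return ce_senone + aux_weight * ce_syllable, ce_senone, ce_syllable


# Hypothetical toy dimensions: feature size, hidden size, output sizes.
FEAT, HID, N_SENONE, N_SYLLABLE = 8, 16, 10, 5
rng = random.Random(0)
params = {
    "W_shared": init_matrix(HID, FEAT, rng),
    "b_shared": [0.0] * HID,
    "W_senone": init_matrix(N_SENONE, HID, rng),
    "b_senone": [0.0] * N_SENONE,
    "W_syllable": init_matrix(N_SYLLABLE, HID, rng),
    "b_syllable": [0.0] * N_SYLLABLE,
}

x = [rng.uniform(-1.0, 1.0) for _ in range(FEAT)]  # one synthetic feature frame
total, ce_a, ce_b = multitask_loss(params, x, senone_label=3, syllable_label=1)
print(f"senone CE={ce_a:.4f}  syllable CE={ce_b:.4f}  total={total:.4f}")
```

In such a setup only the primary output layer would typically be kept for
decoding, while the auxiliary layer serves to shape the shared
representation during training.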