I. INTRODUCTION
Yet another approach that is gaining traction in the literature is to sidestep the network adaptation problem altogether and train networks on speaker-adapted features instead. Such features can be extracted using the speaker normalization machinery readily available for GMM-HMMs, such as vocal tract length normalization (VTLN) and feature-space MLLR (FMLLR). This approach works well despite the fact that the VTLN and FMLLR transforms are estimated under a GMM-HMM acoustic model and are then used in conjunction with a DNN-HMM.
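As a rough illustration of this pipeline, the following Python sketch (names such as fmllr_transforms and load_features are hypothetical, not from this paper) applies a per-speaker affine FMLLR transform y_t = A x_t + b, estimated on the GMM-HMM side, to raw frames before DNN training:

import numpy as np

def apply_fmllr(frames, A, b):
    # frames: (T, d) raw acoustic feature vectors for one utterance
    # A: (d, d), b: (d,) -- affine fMLLR transform estimated for this
    # speaker with the GMM-HMM system
    # Returns the speaker-normalized frames y_t = A x_t + b.
    return frames @ A.T + b

# Hypothetical usage: 'fmllr_transforms' maps speaker IDs to (A, b) pairs
# produced by a GMM-HMM adaptation pass; the DNN is then trained on y.
# A, b = fmllr_transforms[speaker_id]
# y = apply_fmllr(load_features(utterance), A, b)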
A better way might be to provide the network with untransformed features and let it learn the speaker normalization during training. To do so, the network must be told which features belong to which speaker. This can be accomplished by creating two sets of time-synchronous inputs: one set of acoustic features for phonetic discrimination, and another set of features that characterize the speaker who produced the audio for the first set. This idea is similar to [3], with one important difference: in our proposed work, the features that characterize a speaker are the same for all of that speaker's data. Another work relevant to ours is [4], where the authors learn speaker codes that are fed to a speaker adaptation network; that network produces speaker-adapted features which form the input to a regular DNN. The main difference in our proposed work (besides using i-vectors instead of speaker codes) is that we train a single network that performs speaker adaptation and phone classification simultaneously, instead of two separate networks. Lastly, the noise-aware DNNs proposed in [6], which take uncompensated features and time-dependent noise estimates as input, are also relevant to our work.
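A minimal sketch of this input arrangement, assuming the speaker-characterizing vector is fixed per speaker (the function name and shapes below are illustrative): each time-synchronous acoustic frame is concatenated with that speaker's vector before entering the network.

import numpy as np

def augment_with_speaker_vector(frames, spk_vec):
    # frames:  (T, d_acoustic) time-synchronous acoustic features
    # spk_vec: (d_speaker,) vector characterizing the speaker; by design
    #          it is identical for every frame of that speaker's data
    # Returns: (T, d_acoustic + d_speaker) input to a single DNN that
    #          learns speaker normalization and phone classification jointly.
    tiled = np.tile(spk_vec, (frames.shape[0], 1))  # repeat across time
    return np.hstack([frames, tiled])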
Why speaker recognition features should be helpful can be shown through a simple thought experiment. Imagine that there are two types of speakers, say A and B, that differ in the way they pronounce the phone /AA/. Speaker type A uses the canonical pronunciation /AA/, whereas speaker type B systematically pronounces it as /AE/. A DNN without speaker features will tend to classify B's /AA/ as /AE/ because, statistically, there will be more /AE/'s with canonical pronunciations in the training data. A DNN with speaker identity features, however, will learn to significantly increase the output score for /AA/ when presented with /AE/ acoustics for speakers of type B (but not for speakers of type A). In other words, the network can learn speaker-dependent transforms of the acoustic features that map them into a canonical phone classification space in which inter-speaker variability is significantly reduced.
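The thought experiment can be made concrete with a toy example (not from this paper; all numbers are illustrative). Type-B speech is modeled as a systematic +1 shift along a one-dimensional acoustic axis, so B's /AA/ lands exactly on the canonical /AE/ region, and a logistic regression stands in for the DNN:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000  # frames per (phone, speaker-type) group

def acoustics(phone, spk_type):
    # Canonical 1-D "acoustics": /AA/ centered at 0.0, /AE/ at 1.0.
    # Type-B speakers shift everything by +1, so their /AA/ collides
    # with the canonical /AE/ region.
    mean = (0.0 if phone == "AA" else 1.0) + (1.0 if spk_type == "B" else 0.0)
    return rng.normal(mean, 0.2, size=(n, 1))

X, S, y = [], [], []
for phone, label in [("AA", 0), ("AE", 1)]:
    for spk_type, s in [("A", 0.0), ("B", 1.0)]:
        X.append(acoustics(phone, spk_type))
        S.append(np.full((n, 1), s))   # speaker-identity input
        y.append(np.full(n, label))
X, S, y = np.vstack(X), np.vstack(S), np.concatenate(y)

plain = LogisticRegression().fit(X, y)
aware = LogisticRegression().fit(np.hstack([X, S]), y)
print("acoustics only     :", plain.score(X, y))                  # ~0.75
print("with speaker input :", aware.score(np.hstack([X, S]), y))  # ~0.99

Without the speaker input, B's /AA/ frames are indistinguishable from A's /AE/ frames and roughly a quarter of the data is misclassified; with it, the classifier learns the speaker-dependent shift and the confusion disappears.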
I-vectors [7] are a popular technique for speaker verification and speaker recognition because they encapsulate all the relevant information about a speaker's identity in a low-dimensional, fixed-length representation [8]. This makes them an attractive tool for speaker adaptation techniques for ASR. A concatenation of i-vectors and ASR features is used in [9] for discriminative speaker adaptation with region-dependent linear transforms. I-vectors are also employed in [10], [11] for clustering speakers or utterances on mobile devices for more efficient adaptation. The attractiveness of i-vectors and these recent applications motivated us to investigate them for speaker adaptation of DNN acoustic models.
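As background, the total variability model underlying i-vectors [7] can be summarized in one equation (standard notation from the i-vector literature, not taken from this paper):

\[
  M(s) \;=\; m \;+\; T\,w(s), \qquad w(s) \sim \mathcal{N}(0, I)
\]

where M(s) is the GMM mean supervector for speaker s, m is the speaker-independent supervector of the universal background model, T is a low-rank total variability matrix, and the i-vector is the MAP point estimate of the latent factor w(s). Its dimension is fixed by the rank of T, independent of the amount of speech available for the speaker, which is what makes it a convenient fixed-length speaker input for a DNN.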