Abstract—We propose to adapt deep neural network (DNN)
acoustic models to a target speaker by supplying speaker identity
vectors (i-vectors) as input features to the network in parallel with
the regular acoustic features for ASR. For both training and test,
the i-vector for a given speaker is concatenated to every frame
belonging to that speaker; it is constant within a speaker and varies across speakers.
Experimental results on the 300-hour Switchboard corpus show
that DNNs trained on speaker-independent features and i-vectors
achieve a 10% relative improvement in word error rate
(WER) over networks trained on speaker-independent features
only. These networks are comparable in performance to DNNs
trained on speaker-adapted features (with VTLN and FMLLR),
with the advantage that only one decoding pass is needed.
Furthermore, networks trained on speaker-adapted features and
i-vectors achieve a 5-6% relative improvement in WER after
Hessian-free sequence training over networks trained on speaker-adapted
features only.
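
As an illustration of the per-frame concatenation described above (a minimal NumPy sketch, not taken from the paper; the function name and feature/i-vector dimensions below are hypothetical):

import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Concatenate a speaker's i-vector to every acoustic frame.

    frames:  (T, D) matrix of per-frame acoustic features.
    ivector: (K,) speaker identity vector, constant for that speaker.
    Returns a (T, D + K) matrix used as the DNN input.
    """
    T = frames.shape[0]
    tiled = np.tile(ivector, (T, 1))   # repeat the same i-vector for each frame
    return np.hstack([frames, tiled])  # per-frame concatenation

# Hypothetical usage: 40-dim acoustic features, 100-dim i-vector
frames = np.random.randn(500, 40)
ivec = np.random.randn(100)
net_input = append_ivector(frames, ivec)  # shape (500, 140)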