Our Active Appearance Model was built using 44,000 frames from
200 of the source sequences, with 110 facial landmarks identified
for each frame, 32 of them around the mouth. After shape normalisation
and principal component analysis (PCA), the 10 largest PCA parameters
were retained, as they contained over 98% of the energy. The corresponding audio
data was sampled at 44100 Hz and parameterised using 13 mel-frequency
cepstral coefficients (MFCCs). Finally, the audio-video dual
HMM was built in the joint audio-video space. This joint
model is used to produce photorealistic videos from audio-only input,
as described by D. Cosker [Cosker 2006].
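The energy-based truncation described above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code: the function name, the use of SVD rather than an explicit covariance eigendecomposition, and the cap of 10 components are assumptions made for the sketch.

```python
import numpy as np

def retain_by_energy(shapes, energy=0.98, max_components=10):
    """Centre the shape data and keep the leading principal components.

    shapes: (n_frames, 2 * n_landmarks) matrix of normalised landmark
    coordinates. Returns the mean shape, the retained component basis,
    and the per-frame PCA parameters.
    """
    mean = shapes.mean(axis=0)
    X = shapes - mean
    # SVD of the centred data gives the principal components directly;
    # the squared singular values are proportional to the variance
    # ("energy") captured by each component.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    cum_energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest k whose cumulative energy reaches the threshold,
    # capped at max_components (10 in the paper's setup).
    k = min(int(np.searchsorted(cum_energy, energy)) + 1, max_components)
    return mean, Vt[:k], X @ Vt[:k].T
```

Projecting each frame onto the retained basis (`X @ Vt[:k].T`) yields the compact parameter vectors that feed the joint audio-video model.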
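For the audio side, the standard MFCC pipeline (windowing, power spectrum, mel filterbank, log, DCT) can be sketched as below. This is a minimal from-scratch version for illustration only; the paper does not specify its extraction toolchain, and the filterbank size and frame length here are assumed defaults.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters with centres spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, centre, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, centre):
            fb[i - 1, k] = (k - lo) / max(centre - lo, 1)
        for k in range(centre, hi):
            fb[i - 1, k] = (hi - k) / max(hi - centre, 1)
    return fb

def mfcc(frame, sr=44100, n_mfcc=13, n_filters=26):
    """Compute n_mfcc MFCCs for one windowed audio frame."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    fb = mel_filterbank(n_filters, n_fft, sr)
    log_energies = np.log(fb @ power + 1e-10)  # avoid log(0)
    # DCT decorrelates the log filterbank energies; keep the first 13.
    return dct(log_energies, type=2, norm='ortho')[:n_mfcc]
```

Applied frame-by-frame at 44100 Hz, this produces the 13-dimensional audio vectors that are concatenated with the 10 PCA shape parameters in the joint audio-video space.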