In the literature, two main approaches to acoustic feature
modelling are used: frame-level low-level descriptors (LLDs) used directly,
and statistical functionals computed across the LLDs that
make up an utterance. Frame sizes are usually around 20-50 ms,
and LLDs are typically MFCCs, pitch, and energy
measures. Combining acoustic features with other sources
of information usually takes either the `early' or `late' fusion
approach. Concatenating feature vectors extracted
from different data types for direct classification is referred
to as `early fusion'. `Late fusion' uses a separate model for
each source of data, and a later decision stage to obtain the
final result. Early fusion is known to be superior when mutual
information is present between the modes of data [11],
although feature selection is usually necessary to deal with
the high dimensionality of these feature vectors.
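
As an illustration of the terms above, the following minimal Python sketch shows utterance-level functionals computed over frame-level LLDs, together with the two fusion strategies. It is not taken from any cited system. The function names (`functionals`, `early_fusion`, `late_fusion`), the particular functionals (mean, standard deviation, minimum, maximum), and the weighted-average decision stage are all assumptions chosen for brevity.

```python
from statistics import mean, pstdev


def functionals(lld_frames):
    """Collapse per-frame LLD vectors into one utterance-level vector.

    lld_frames: list of frames, each a list of LLD values (e.g. MFCCs,
    pitch, energy). Returns [mean, std, min, max] for each LLD in turn.
    The choice of these four functionals is an assumption.
    """
    per_lld = zip(*lld_frames)  # regroup: one tuple of values per LLD
    out = []
    for values in per_lld:
        out.extend([mean(values), pstdev(values), min(values), max(values)])
    return out


def early_fusion(acoustic_vec, other_vec):
    """Early fusion: concatenate feature vectors for a single classifier."""
    return list(acoustic_vec) + list(other_vec)


def late_fusion(scores_per_model, weights=None):
    """Late fusion: combine per-model class scores in a decision stage.

    scores_per_model: one list of class scores per data source.
    The decision stage here is a weighted average followed by argmax,
    which is only one possible choice (an assumption of this sketch).
    Returns (predicted class index, fused scores).
    """
    n_models = len(scores_per_model)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(scores_per_model[0])
    fused = [
        sum(w * s[c] for w, s in zip(weights, scores_per_model))
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


# Two frames, two LLDs per frame (toy values).
frames = [[1.0, 10.0], [3.0, 10.0]]
acoustic = functionals(frames)

# Early fusion with a hypothetical one-dimensional feature from another source.
fused_features = early_fusion(acoustic, [0.5])

# Late fusion of class scores from two hypothetical per-source models.
label, fused_scores = late_fusion([[0.6, 0.4], [0.2, 0.8]])
```

In this sketch the early-fusion vector grows with every added source, which is the dimensionality issue noted above, whereas the late-fusion decision stage only ever sees one score per class from each model.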