Metallinou et al. [6] presented a high-performing system
that used both visual and acoustic features to classify the
spontaneous content of IEMOCAP along the continuous valence
and activation descriptors. They also investigated various
hierarchical system designs that exploit contextual
information at the conversation level. In the context-free
case, HMMs were trained on early-fused frame-level acoustic/
visual features. Facial features proved better suited to
valence classification, while acoustic features performed
better for activation; combining the two led to improvements
on the valence task.
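As a concrete illustration of such a context-free, early-fusion pipeline, the sketch below concatenates frame-aligned acoustic and visual features and trains one Gaussian HMM per class, labeling a test utterance by maximum log-likelihood. This is a minimal sketch using the hmmlearn library, not the implementation of [6]; the feature dimensions, state count, class names, and random stand-in features are all assumptions for illustration.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # pip install hmmlearn

# Assumed dimensions, for illustration only (not taken from [6]).
N_ACOUSTIC, N_VISUAL, N_STATES = 13, 10, 3

def early_fuse(acoustic, visual):
    """Concatenate frame-aligned acoustic and visual features
    into one observation vector per frame (early fusion)."""
    return np.hstack([acoustic, visual])   # (n_frames, N_ACOUSTIC + N_VISUAL)

def train_class_hmms(sequences_by_class):
    """Fit one HMM per class label on its fused training sequences."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                # all frames, stacked
        lengths = [len(s) for s in seqs]   # per-sequence frame counts
        m = GaussianHMM(n_components=N_STATES,
                        covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, fused_seq):
    """Pick the class whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(fused_seq))

# Toy usage: random features stand in for real acoustic/visual descriptors.
rng = np.random.default_rng(0)
data = {
    lab: [early_fuse(rng.normal(size=(50, N_ACOUSTIC)),
                     rng.normal(loc=mu, size=(50, N_VISUAL)))
          for _ in range(5)]
    for lab, mu in [("low_valence", -1.0), ("high_valence", 1.0)]
}
models = train_class_hmms(data)
test = early_fuse(rng.normal(size=(50, N_ACOUSTIC)),
                  rng.normal(loc=1.0, size=(50, N_VISUAL)))
print(classify(models, test))  # expected: "high_valence"
```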
The existing literature indicates that combining acoustic,
visual, and linguistic features is a viable and important
direction for SER.