In the context-free case, HMMs were trained on early-fused, frame-level
acoustic/visual features. Facial features were shown to be better
for valence classification, and acoustic features performed
better for activation. Their combination led to improvements
in the valence task.
The existing literature indicates that the combination of acoustic,
visual, and linguistic features is a viable and important
direction for SER.
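
As a rough illustration of the frame-level early fusion mentioned above, the sketch below concatenates the acoustic and visual feature vectors of each frame into a single vector before any model sees them. The function name `early_fuse`, the toy feature values, and the assumption that both streams are already aligned to the same frame rate are ours for illustration; they are not taken from the cited work, and the HMM training itself is not shown.

```python
from typing import List, Sequence


def early_fuse(
    acoustic: Sequence[Sequence[float]],
    visual: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate per-frame acoustic and visual features (early fusion).

    Assumes both streams are already aligned, i.e. frame t of one stream
    corresponds to frame t of the other.
    """
    if len(acoustic) != len(visual):
        raise ValueError("streams must contain the same number of frames")
    # One fused vector per frame: acoustic dimensions first, then visual.
    return [list(a) + list(v) for a, v in zip(acoustic, visual)]


# Toy input: 3 frames, 2 acoustic and 1 visual dimension (made-up values).
acoustic_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
visual_frames = [[1.0], [2.0], [3.0]]

fused = early_fuse(acoustic_frames, visual_frames)
print(len(fused), len(fused[0]))
```

The fused sequence would then serve as the observation sequence for a single classifier, in contrast to late fusion, where each modality is modeled separately and only the decisions are combined.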