This is not surprising: the eliciting videos are longer than the sentences in the corpus and can therefore build emotional states in the viewer more easily. In addition, the absence of eyes, facial texture, and the rest of the body makes the renderings of the tracked faces less effective at conveying emotions.