To evaluate and understand these results, we have to look at this outcome in detail. Figure 5 shows the scores achieved by the expert (one of the co-authors of this paper) for the visual tasks, the textual ones, and both combined. It can be seen that we actually won the expert round; solving 13 of the 16 tasks correctly (the systems coming in 2nd and 3rd each found ten correct ones). This first place is mostly due to the high performance in the textual tasks (Fig. 5, center), which in turn does not come that unexpected. Bridging the semantic gap [13], i.e., the mapping between the abstract, generally feature-based representation processed by a machine and the human interpretation of the presented information is considered the Holy Grail in video indexing and retrieval.