observing a pattern of certain NCAs, CAs and physical states combined with certain argumentative structures in the data. The training can be done with machine learning techniques such as Hidden Markov Models [19] or Dynamic Bayesian Networks, which search for statistical correlations.
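To illustrate the kind of computation such a model involves, the sketch below decodes a sequence of observed lower-level labels into the most likely hidden sequence of meeting actions with a Viterbi decoder over a small hand-specified HMM. This is a minimal sketch: the state names, observation symbols and all probabilities are invented for illustration and would in practice be learned from annotated data.

```python
import numpy as np

# Hypothetical hidden meeting actions and observable low-level labels;
# all names and probabilities below are invented for this example.
states = ["monologue", "discussion"]
observations = ["speaker-change", "no-change", "gesture"]

start_p = np.array([0.6, 0.4])          # P(state at t=0)
trans_p = np.array([[0.8, 0.2],         # P(state_t | state_{t-1})
                    [0.3, 0.7]])
emit_p = np.array([[0.1, 0.8, 0.1],     # P(observation | state)
                   [0.5, 0.2, 0.3]])

def viterbi(obs_seq):
    """Return the most likely hidden state sequence (log-space Viterbi)."""
    T, N = len(obs_seq), len(states)
    delta = np.zeros((T, N))                 # best log-prob ending in each state
    backptr = np.zeros((T, N), dtype=int)    # best predecessor state
    delta[0] = np.log(start_p) + np.log(emit_p[:, obs_seq[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans_p)
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emit_p[:, obs_seq[t]])
    # Follow back-pointers from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return [states[s] for s in reversed(path)]

obs = [observations.index(o) for o in
       ["no-change", "no-change", "speaker-change", "gesture"]]
print(viterbi(obs))  # ['monologue', 'monologue', 'discussion', 'discussion']
```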
This level requires a lexicon of meeting activity types; an example of such a lexicon can be found in [20]. One approach to obtaining such a lexicon is to apply unsupervised clustering techniques to the lower-level annotation elements, as mentioned in [27]. The resulting clusters can be indicative of possible meeting activity types.
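A minimal sketch of this clustering idea, assuming each meeting segment has been reduced to a vector of counts of lower-level annotation elements (the features and numbers here are invented): a standard clustering algorithm such as k-means groups the segments, and each resulting cluster can then be inspected and, if coherent, named as a candidate activity type.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row describes one meeting segment by counts of lower-level
# annotation elements; all features and values are invented.
# columns: [speaker changes, gestures, whiteboard events, questions]
segments = np.array([
    [12, 1, 0, 6],   # lively multi-party exchange
    [10, 2, 1, 7],
    [ 1, 5, 8, 0],   # one person presenting at the whiteboard
    [ 0, 6, 9, 1],
    [ 2, 0, 0, 0],   # near-silent stretch
])

# Group segments into a small number of clusters; each cluster is a
# candidate meeting activity type to be inspected and labelled manually.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(segments)
print(labels)  # cluster index per segment
```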
5 Annotations
As meetings can be structured in layers and we wish to label or annotate chunks of data in accordance with these layers, there is a need for an annotation language that supports these structures. An annotation format can be seen as an instantiation of a model. A model describes what the annotation should look like, which annotation structures are possible and what these structures mean. This implies, however, that if the model changes, the annotations are affected as well, and vice versa.
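As a concrete sketch of such a layered structure, the fragment below shows a stand-off style of annotation in which higher layers point into lower ones by identifier rather than duplicating the data. The layer names, identifiers and labels are invented for illustration and do not reproduce any particular annotation format.

```python
# A minimal stand-off sketch of layered annotation; all names are invented.
annotation = {
    "words": [                          # lowest layer: time-aligned tokens
        {"id": "w1", "start": 0.0, "end": 0.4, "text": "shall"},
        {"id": "w2", "start": 0.4, "end": 0.7, "text": "we"},
        {"id": "w3", "start": 0.7, "end": 1.1, "text": "start"},
    ],
    "dialogue_acts": [                  # middle layer: spans of words
        {"id": "da1", "type": "suggest", "words": ["w1", "w2", "w3"]},
    ],
    "meeting_actions": [                # top layer: spans of dialogue acts
        {"id": "ma1", "type": "opening", "dialogue_acts": ["da1"]},
    ],
}

# Changing the model (e.g. adding an "addressee" field to dialogue acts)
# changes what a valid instance looks like, illustrating the dependency
# between model and annotations noted above.
```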
The choice of annotation schemas and structures for the separate boxes should in most applications be inspired by explanatory models of human interaction and by the application goals. Different models, or different uses of the same models, may lead to distinct annotation schemas for the information in the boxes.
5.1 Manual Annotations
The annotations discussed above are not necessarily produced automatically: corpus-based work always involves a large amount of manual annotation as well. There are several reasons for creating manual annotations of corpus material. In the first place, ground-truth knowledge is needed in order to evaluate new techniques for automatic annotation. In the second place, high-quality annotations are needed for social psychology research on the corpus data. As long as the quality of automatic annotation results is not high enough, only manual annotations provide information of sufficient quality to analyze certain aspects of human behaviour.
It is a well-known problem that manual annotation of human interaction is extremely expensive in terms of effort. Annotating a stretch of video with not-too-complicated aspects may easily take ten times the duration of that video. Shriberg et al. report an efficiency of 18xRT (18 times the duration of the video is spent on annotating) for the annotation of dialogue act boundaries, types and adjacency pairs on meeting recordings [28]. Simple manual transcription of speech usually takes 10xRT. For more complicated speech transcription, such as prosody, 100-200xRT has been reported by Syrdal et al. [29]. The cost of syntactic annotation of text (PoS tagging and annotating syntactic structure with labels for nodes and edges) may run to an average of 50 seconds per sentence at an average sentence length of 17.5 tokens (cf. Brants et al. [30], which describes the syntactic annotation of a German newspaper corpus). As a final example, Lin et al. [31] report an annotation efficiency of 6.8xRT for annotating MPEG-7 metadata on video using the VideoAnnEx tool. The annotation described there consists of correcting shot boundaries, selecting salient regions in shots and assigning semantic labels from a controlled lexicon. It may be obvious that more complex annotation of video will further increase the cost.
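To make these real-time factors concrete, the back-of-the-envelope sketch below estimates the total effort for a hypothetical 100-hour meeting corpus using the xRT figures cited above; the corpus size is invented, the factors are taken from the cited work.

```python
# Annotation budget for a hypothetical 100-hour meeting corpus,
# using the real-time factors cited in the text above.
corpus_hours = 100  # invented corpus size

costs_xrt = {
    "speech transcription":             10,   # 10xRT
    "dialogue acts (Shriberg et al.)":  18,   # 18xRT
    "prosody (Syrdal et al., low end)": 100,  # 100-200xRT
}

for task, xrt in costs_xrt.items():
    print(f"{task}: {corpus_hours * xrt} person-hours")
# Prosodic annotation alone already costs 10,000 person-hours
# at the low end of the reported range.
```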