When validating a method, it is very unlikely that different evaluators will agree exactly, giving identical results across all evaluation sessions. We therefore investigated the consistency between our method evaluations using two statistical measures (see Table 1). First results of our method validation can be found in [24].
The any-two agreement measure captures the extent to which pairs of evaluators agree on which problems the system contains [20]. For each comparison, the numbers of agreements, disagreements and single points were recorded, with observations falling within a margin of 4 seconds counted as the same observation point. The average score over all pairs of evaluations was 44%, exceeding the average of 38.5% reported in [14]. Based on the 64% agreement obtained by applying Cohen's Kappa [21], a measurement that included evaluators with distinct experience levels, we can assume that even inexperienced evaluators will be able to apply the method after receiving appropriate training. However, both assumptions need to be investigated further.
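As a rough illustration of the two measures, the sketch below computes (a) the average any-two agreement over evaluators' problem sets, defined for each pair as the number of problems found by both divided by the number found by either, and (b) Cohen's Kappa for two raters labelling the same items. This is only a minimal sketch of the standard formulas; the function names and example data are hypothetical, and it omits the 4-second matching step used to decide when two observations count as the same point.

```python
from collections import Counter
from itertools import combinations

def any_two_agreement(problem_sets):
    """Average any-two agreement: for each pair of evaluators,
    |Pi & Pj| / |Pi | Pj| (problems found by both over problems
    found by either), averaged over all pairs."""
    scores = [len(a & b) / len(a | b)
              for a, b in combinations(problem_sets, 2)]
    return sum(scores) / len(scores)

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters over the same n items:
    (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from the marginals."""
    n = len(rater_a)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: problem IDs found by three evaluators,
# and two raters' yes/no judgements on four observation points.
sets = [{1, 2, 3}, {2, 3, 4}, {1, 2}]
print(any_two_agreement(sets))
print(cohens_kappa(["yes", "yes", "no", "no"],
                   ["yes", "no", "no", "no"]))
```

Note that any-two agreement is computed over sets of detected problems, while Kappa requires both raters to have judged the same fixed list of items, which is why the two measures take different input shapes.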