Many evaluation issues for grammatical error detection have previously been overlooked,
making it hard to draw meaningful comparisons between different approaches, even when
they are evaluated on the same corpus. To begin with, the three-way contingency between a
writer’s sentence, the annotator’s correction, and the system’s output makes evaluation more
complex than in some other NLP tasks, which we address by presenting an intuitive evaluation
scheme. Of particular importance to error detection is the skew of the data – the low frequency
of errors as compared to non-errors – which distorts some traditional measures of performance
and limits their usefulness, leading us to recommend the reporting of raw measurements (true
positives, false negatives, false positives, true negatives). Other issues that are particularly
vexing for error detection concern how these raw measurements are defined: specifying the size or
scope of an error, properly treating errors as graded rather than discrete phenomena, and
counting non-errors. We recommend best practices for reporting the results of system evaluation
in these cases, recommendations which depend upon making clear one's assumptions and intended
applications for error detection. By highlighting these problems with current error detection
evaluation, we aim to help the field move forward.
KEYWORDS: grammatical error detection, system evaluation, evaluation metrics.
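As a minimal sketch of the skew problem described above (all counts are hypothetical and not taken from the paper), the following Python snippet shows how accuracy barely distinguishes a system that flags nothing from one that catches half the errors, while the raw counts and precision/recall make the difference plain:

```python
# Hypothetical illustration: under heavy class skew, accuracy looks excellent
# even for a system that detects nothing, which is why reporting the raw
# counts (TP, FN, FP, TN) directly is informative.

def metrics(tp, fn, fp, tn):
    """Compute common measures from the four raw counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Suppose 1,000 tokens, of which only 20 are genuine errors (2% error rate).
# System A flags nothing; System B finds half the errors with a few false alarms.
systems = {
    "A (flags nothing)": (0, 20, 0, 980),
    "B (finds half)":    (10, 10, 5, 975),
}
for name, (tp, fn, fp, tn) in systems.items():
    acc, p, r = metrics(tp, fn, fp, tn)
    print(f"{name}: TP={tp} FN={fn} FP={fp} TN={tn} "
          f"accuracy={acc:.3f} precision={p:.3f} recall={r:.3f}")
```

Running this, both systems score roughly 0.98 accuracy, yet only the raw counts (and the precision/recall derived from them) reveal that System A detects no errors at all.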