Many evaluation issues for grammatical error detection have previously been overlooked,
making it hard to draw meaningful comparisons between different approaches, even when
they are evaluated on the same corpus. To begin with, the three-way contingency between a
writer’s sentence, the annotator’s correction, and the system’s output makes evaluation more
complex than in some other NLP tasks, which we address by presenting an intuitive evaluation
scheme. Of particular importance to error detection is the skew of the data – the low frequency
of errors as compared to non-errors – which distorts some traditional measures of performance
and limits their usefulness, leading us to recommend the reporting of raw measurements (true
positives, false negatives, false positives, true negatives). Other issues that are particularly
vexing for error detection concern the definition of these raw measurements: specifying the size
or scope of an error, properly treating errors as graded rather than discrete phenomena, and
counting non-errors.
counting non-errors. We discuss recommendations for best practices with regard to reporting
the results of system evaluation for these cases, recommendations which depend upon making
clear one’s assumptions and applications for error detection. By highlighting the problems with
current error detection evaluation, we aim to help the field move forward.
KEYWORDS: grammatical error detection, system evaluation, evaluation metrics.
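The skew argument above can be illustrated with a minimal sketch (the counts and the 2% error rate below are hypothetical, chosen only for illustration): when errors are rare, a trivial system that flags nothing still attains high accuracy, while the raw counts (true positives, false negatives, false positives, true negatives) expose its uselessness.

```python
def confusion_counts(gold, pred):
    """Return (tp, fn, fp, tn) for binary error labels (1 = error)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    return tp, fn, fp, tn

# Hypothetical skewed data: 2 errors among 100 tokens.
gold = [1, 1] + [0] * 98
# A trivial "system" that never flags an error.
pred = [0] * 100

tp, fn, fp, tn = confusion_counts(gold, pred)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(tp, fn, fp, tn)  # 0 2 0 98 -- raw counts show no error was detected
print(accuracy)        # 0.98 -- accuracy alone looks deceptively good
```

Reporting the four raw counts, as the abstract recommends, lets readers recompute any derived measure and see immediately that this system detects nothing.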
