it was verified whether the space was a sentence break or not.
The system was trained and tested on subsets of the ORCHID corpus
(Charoenporn et al., 1997), achieving an 80% break-detection rate and
a 9% false-break rate.
An extension of the algorithm was proposed by Charoenpornsawat and Sornlertlamvanich (2001).
In addition to the POSs of the surrounding words,
collocations of the surrounding words and the
lengths of the surrounding tokens were used as features
for determining whether a space character was a sentence
boundary.
These features were confirmed to be useful.
In Charoenpornsawat and Sornlertlamvanich (2001),
these features were extracted automatically by machine learning with Winnow,
which was also used for sentence break detection.
Compared to the POS n-gram model,
a 1.7% improvement in the break-detection rate
and a 79% reduction in the false-break rate were achieved.
Although these gains are substantial,
the algorithms depend strongly on word segmentation and POS tagging.
A larger POS-tagged corpus is needed to improve all of these components.
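
To make the Winnow-based approach described above more concrete, the following is a minimal sketch under stated assumptions. It uses the textbook Winnow update (multiplicative promotion and demotion of the weights of active binary features), which is not necessarily the exact variant used by Charoenpornsawat and Sornlertlamvanich (2001). The feature names, the romanized toy tokens and POS tags, the threshold and the promotion factor are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def space_features(tokens, i):
    """Binary features for the space token at index i.

    `tokens` is a list of (text, pos) pairs. The feature names are
    hypothetical; they mimic the POS, word and token-length cues
    described in the text.
    """
    (lw, lp), (rw, rp) = tokens[i - 1], tokens[i + 1]
    return {
        f"pos_left={lp}",
        f"pos_right={rp}",
        f"word_left={lw}",
        f"word_right={rw}",
        f"len_left={min(len(lw), 5)}",   # lengths are capped at 5 (assumption)
        f"len_right={min(len(rw), 5)}",
    }


class Winnow:
    """Textbook Winnow over sparse binary features (illustrative only)."""

    def __init__(self, alpha=2.0, theta=6.0):
        self.alpha = alpha                    # promotion/demotion factor
        self.theta = theta                    # decision threshold
        self.w = defaultdict(lambda: 1.0)     # every weight starts at 1

    def predict(self, feats):
        # Predict "sentence break" if the summed weights of the active
        # features reach the threshold.
        return sum(self.w[f] for f in feats) >= self.theta

    def update(self, feats, is_break):
        # Mistake-driven: weights change only on a wrong prediction.
        if self.predict(feats) == is_break:
            return
        factor = self.alpha if is_break else 1.0 / self.alpha
        for f in feats:
            self.w[f] *= factor


def train(examples, epochs=5):
    """Run several passes of mistake-driven updates over the examples."""
    model = Winnow()
    for _ in range(epochs):
        for feats, is_break in examples:
            model.update(feats, is_break)
    return model


# Toy data: one space that is a sentence break, one that is not.
examples = [
    (space_features([("pai", "VERB"), (" ", "SPACE"), ("khao", "PRON")], 1), True),
    (space_features([("nam", "NOUN"), (" ", "SPACE"), ("som", "NOUN")], 1), False),
]
model = train(examples)
print([model.predict(feats) for feats, _ in examples])
```

Because the features are computed from the words and POS tags on either side of the space, any error in word segmentation or POS tagging propagates directly into the classifier's input, which is the dependency noted above.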
