The algorithm considered two consecutive strings with a space in between.
The strings were first segmented into word sequences, and each word was tagged with its POS.
A POS n-gram model was then used
to decide whether the space was a sentence break.
The system was trained and tested with subsets of the ORCHID corpus
(Charoenporn et al., 1997), achieving a break-detection rate of 80% and
a false-break rate of 9%.
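
The decision step described above can be illustrated with a minimal sketch. It assumes a bigram model with add-one smoothing, in which each space is itself tagged as a sentence-break space ("SB") or a non-break space ("NSB"); the tag names, the toy training data, and the function names are hypothetical and are not taken from the cited work, whose exact n-gram order and scoring are not given here. The space is classified by comparing the probability of the POS sequence under the two candidate space tags.

```python
import math
from collections import Counter

# Hypothetical toy training data: POS sequences in which every space is
# tagged "SB" (sentence-break space) or "NSB" (non-break space).
TRAIN = [
    ["NOUN", "VERB", "NOUN", "SB", "PRON", "VERB", "NSB", "NOUN", "SB"],
    ["PRON", "VERB", "NSB", "ADV", "SB", "NOUN", "VERB", "NOUN", "SB"],
    ["NOUN", "VERB", "ADV", "SB", "PRON", "VERB", "NOUN", "NSB", "ADV", "SB"],
]


def train_bigrams(sequences):
    """Count tag unigrams (as left contexts) and tag bigrams."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + seq
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    vocab = {tag for seq in sequences for tag in seq} | {"<s>"}
    return unigrams, bigrams, vocab


def log_prob(seq, model):
    """Log probability of a tag sequence under the smoothed bigram model."""
    unigrams, bigrams, vocab = model
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        # Add-one smoothing (an assumption of this sketch).
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab)))
    return total


def is_sentence_break(left_pos, right_pos, model):
    """Compare the two hypotheses for the space between two tagged strings."""
    with_break = left_pos + ["SB"] + right_pos
    without_break = left_pos + ["NSB"] + right_pos
    return log_prob(with_break, model) > log_prob(without_break, model)


MODEL = train_bigrams(TRAIN)
```

For example, `is_sentence_break(["VERB", "NOUN"], ["PRON", "VERB"], MODEL)` scores the space between a string ending in a noun and a string starting with a pronoun. The input is assumed to be already segmented and POS tagged, which is exactly the dependency on upstream components that this kind of approach carries.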
An extension of the algorithm was proposed by Charoenpornsawat and Sornlertlamvanich (2001).
In addition to the POSs of the surrounding words,
collocations of the surrounding words and
the lengths of the surrounding token texts were used as features
for determining whether a space character was a sentence
boundary.
These features were confirmed to be useful.
Charoenpornsawat and Sornlertlamvanich (2001) extracted
these features automatically by machine learning using Winnow,
which was also used for the sentence break detection itself.
Compared to the POS n-gram model,
a 1.7% improvement in the break-detection rate
and a 79% reduction in the false-break rate were achieved.
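
As a rough illustration of how a Winnow-style classifier could use such features, the sketch below implements the textbook form of Winnow: a mistake-driven linear classifier over binary features with multiplicative weight updates. The feature names, the toy examples, the promotion factor, and the threshold are all hypothetical; the exact variant and feature encoding used in the cited work are not specified here.

```python
class Winnow:
    """Textbook Winnow with multiplicative promotion and demotion."""

    def __init__(self, features, alpha=2.0):
        # All weights start at 1; the threshold equals the number of features.
        self.weights = {f: 1.0 for f in features}
        self.threshold = float(len(features))
        self.alpha = alpha

    def predict(self, active):
        """Return True (sentence break) if the active weights reach the threshold."""
        score = sum(self.weights[f] for f in active if f in self.weights)
        return score >= self.threshold

    def update(self, active, label):
        """Change weights only when the current prediction is wrong."""
        if self.predict(active) == label:
            return
        # Promote on a missed break, demote on a false break.
        factor = self.alpha if label else 1.0 / self.alpha
        for f in active:
            if f in self.weights:
                self.weights[f] *= factor

    def fit(self, examples, epochs=10):
        for _ in range(epochs):
            for active, label in examples:
                self.update(active, label)


# Hypothetical binary features describing the context of one space:
# POS of neighbouring words, a collocation, and a length test.
FEATURES = [
    "prev_pos=NOUN", "prev_pos=VERB", "next_pos=PRON",
    "next_pos=ADV", "left_len>=10", "colloc=then",
]

# Hypothetical labelled spaces: (active features, is_sentence_break).
EXAMPLES = [
    ({"prev_pos=NOUN", "next_pos=PRON", "left_len>=10"}, True),
    ({"prev_pos=VERB", "next_pos=ADV"}, False),
    ({"prev_pos=NOUN", "next_pos=PRON"}, True),
    ({"prev_pos=VERB", "next_pos=ADV", "left_len>=10"}, False),
]

model = Winnow(FEATURES)
model.fit(EXAMPLES)
```

After training on these toy examples, `model.predict({"prev_pos=NOUN", "next_pos=PRON"})` classifies a new space from its active features. Because weights change only for features that are active when a mistake occurs, irrelevant features keep low weights, which is one reason this family of algorithms suits large, sparse feature sets.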
Although these gains are substantial,
the algorithms depend strongly on word segmentation and POS tagging.
A larger POS-tagged corpus is needed to improve all these components