We applied two techniques to Thai sentence breaking: the
number of words in a chunk, a feature that encodes the
contextual behaviour of an entire sentence, and a rule-based
method, which potentially reduces false breaks. These features
outperformed the other features and improved some measures.
Moreover, integrating these features with the rule-based
approach yields strong performance. However, high-quality
pre-processing (correct CG tagging and word segmentation)
is required for use in an application. This work only investigated
the performance on a specific domain because of the lack of
tagged data, so a general-domain corpus is necessary before the
method can be used in an application. In addition, using spaces
as a clue to define sentence boundaries does not clearly cover
all kinds of sentences, especially in modern or informal writing.
In future work, we will investigate clause boundary detection,
because in Thai grammar the clause is a grammatical unit that
represents the natural structure of Thai and its intended
meaning more closely than the sentence does. Moreover, clause
boundaries may be defined more clearly and with less ambiguity
than sentence boundaries, which cause much inconsistency.