Sentence-breaking is a fundamental task in Natural Language
Processing (NLP). Most NLP applications, such as machine
translation, information retrieval, and text summarization,
require input text as sentences rather than whole paragraphs.
Even in languages with explicit sentence markers, such as
English, the markers can be ambiguous for a machine. Thus,
many approaches have been proposed to determine sentence
boundaries in English [1], [2].
In Thai, the problem is even more pronounced because there is
no explicit marker for the end of a sentence. A space is
generally placed at the end of a sentence in the Thai writing
system; however, a space does not always indicate the end of a
sentence [3], [4]. It is also used for other purposes, such as
indicating a clause or phrase break within a sentence or
setting off numerals. Therefore, Thai sentence-breaking is in
practice treated as the task of classifying each space as
either sentence-breaking or non-sentence-breaking.
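The space-classification view described above can be sketched in a few lines of Python. This is only an illustration of the problem framing, not the method of any cited work: the names `space_candidates` and `break_sentences` are hypothetical, and the `is_break` argument stands in for whatever rule or statistical classifier decides each space.

```python
import re
from typing import Callable, List, Tuple

# (offset of the space run, chunk to its left, chunk to its right)
Candidate = Tuple[int, str, str]


def space_candidates(text: str) -> List[Candidate]:
    """List every run of whitespace inside the text as a candidate boundary."""
    text = text.strip()  # after stripping, every space run has text on both sides
    candidates = []
    for m in re.finditer(r"\s+", text):
        left = text[:m.start()].split()[-1]
        right = text[m.end():].split()[0]
        candidates.append((m.start(), left, right))
    return candidates


def break_sentences(text: str, is_break: Callable[[Candidate], bool]) -> List[str]:
    """Cut the text at each space that is_break labels as sentence-breaking."""
    text = text.strip()
    sentences, start = [], 0
    for cand in space_candidates(text):
        if is_break(cand):  # placeholder for a rule or a trained classifier
            sentences.append(text[start:cand[0]].strip())
            start = cand[0]
    sentences.append(text[start:].strip())
    return sentences
```

Every space is a candidate, and the output depends entirely on the binary decision made for each one, which is why the quality of the features and rules behind that decision matters.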
In Thai text analysis, there are only a few studies on sentence-breaking [5]-[9], and they do not yet give acceptable results.
Several solutions have been proposed for Thai
sentence-breaking. A rule-based method that considered main
verbs and conjunctions was used to identify sentence
boundaries [5]. A trigram model over part-of-speech (POS) tags
was then applied to the problem [6]. However, it considered
only POS tags within a restricted range of context, so some
information may not have been taken into account. The Winnow
algorithm, using the two POS tags and words on the left and
the two on the right as features, was then applied to improve
on the previous method and gave better results [7]. A maximum
entropy algorithm with surrounding words was used for Thai
sentence-breaking in large-scale machine translation [8].
Processing time was a primary concern in that study, so
simple features and a large amount of training data were used
to achieve high accuracy. It gave a higher space-correct score
than the previous methods; however, its false-break score was
still unsatisfactory. More recently, a study showed that using
Categorial Grammar (CG) as the main feature in Thai
sentence-breaking yielded a slight improvement over the
previous methods [9]. Nevertheless, contextual behaviour
features that consider an entire sentence, rather than only
the context around a space, were not taken into account.
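To make the notion of local context concrete, the following minimal sketch builds features from the two words and POS tags on each side of a candidate space. It is an assumption-laden illustration rather than the feature set of any cited study: the function name `window_features`, the feature keys, and the `<PAD>` symbol are all hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag); tag names are placeholders

PAD: Token = ("<PAD>", "<PAD>")  # hypothetical padding for short contexts


def window_features(left: List[Token], right: List[Token],
                    size: int = 2) -> Dict[str, str]:
    """Collect words and POS tags from `size` tokens on each side of a space."""
    padded_left = [PAD] * size + left
    padded_right = right + [PAD] * size
    feats: Dict[str, str] = {}
    for i in range(1, size + 1):
        word, tag = padded_left[-i]      # i-th token to the left of the space
        feats[f"w-{i}"] = word
        feats[f"pos-{i}"] = tag
        word, tag = padded_right[i - 1]  # i-th token to the right of the space
        feats[f"w+{i}"] = word
        feats[f"pos+{i}"] = tag
    return feats
```

A window of this kind sees nothing beyond `size` tokens from the space, so properties of the sentence as a whole cannot be expressed by it.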
In previous work, statistical approaches have been widely
used in many NLP tasks. However, the performance of
statistical approaches generally depends on the training data
and the features. Thus, appropriate features that reflect the
natural usage relevant to each specific NLP task are
considered key to improving performance, and finding features
that reflect the contextual behaviour of an entire sentence
remains a challenge in enhancing accuracy. On the other hand,
studies on Thai sentence extraction [5], [10] suggest that
rule-based methods are competitive candidates in NLP tasks.
The major advantage of rules is their independence from
inconsistent training data. To exploit the strengths of both
the statistical and the rule-based approach, this study
proposes to combine a statistical approach using contextual
behaviour features with appropriate grammar rules to improve
the accuracy of Thai sentence-breaking.
This paper is structured as follows. Section II explains
Thai grammar rules. Section III describes the methodology.
Section IV presents the experimental settings and results.
Section V provides a discussion. Section VI draws conclusions.