The Thai writing system has rules for the use of spaces, defined
by The Royal Institute [11]. Spaces fall into two types: a major
space, which marks the end of a sentence, and a minor space, which
serves all other purposes. Thai grammar has no rule for identifying
a major space. Fortunately, it has several rules that clearly
identify a minor space.
978-1-4799-0545-4/13/$31.00 ©2013 IEEE
A. Rules of Minor Space
There are many rules for using minor space, although they do
not cover every use. This section lists the rules that had a
major impact on accuracy; the remaining rules were also applied
in the experiments. The rules of minor space are as follows:
1. One minor space after a comma (,).
2. One space before and after a pair of single or double
quotation marks (‘ ’, “ ”).
3. One space before and after a pair of parentheses.
4. One space before and after the repetition mark (ๆ).
5. One space before and after mathematical symbols.
6. One space before and after digits and times.
7. One space before and after foreign words or phrases.
8. One space before and after a noun classifier (ลักษณะนาม).
9. One space after a minor omission mark (ฯ).
10. One space between นาย (Mr.), นาง (Mrs.), นางสาว (Miss)
and the name.
Although the official rule specifies no space between นาย, นาง,
or นางสาว and the name, many native writers insert one, even in
formal writing. Because this pattern occurs so frequently, this
rule was also applied.
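As a rough illustration of how the listed rules could be checked, the following sketch tests the characters on either side of a space. It is not the implementation used in the experiments: the function names, the character inventories, and the choice to inspect only the adjacent characters are all assumptions made for this example. Rule 8 is omitted because it would need a classifier lexicon or tagger.

```python
from typing import List

# Illustrative character inventories; these exact sets are assumptions.
OPENERS = "(‘“"
CLOSERS = ")’”"
QUOTES = "\"'"  # straight quotes can open or close a pair
MATH = "+-×÷=<>"
TITLES = ("นาย", "นาง", "นางสาว")


def is_latin(ch: str) -> bool:
    """Treat ASCII letters as evidence of a foreign word (rule 7)."""
    return ch.isascii() and ch.isalpha()


def is_minor_space(left: str, right: str) -> bool:
    """Return True if the space between two chunks matches a listed rule."""
    if not left or not right:
        return False
    before, after = left[-1], right[0]
    return (
        before == ","                                # rule 1
        or before in CLOSERS + QUOTES                # rules 2-3, after a pair
        or after in OPENERS + QUOTES                 # rules 2-3, before a pair
        or "ๆ" in (before, after)                    # rule 4
        or before in MATH or after in MATH           # rule 5
        or before.isdigit() or after.isdigit()       # rule 6 (Thai digits too)
        or is_latin(before) or is_latin(after)       # rule 7
        or before == "ฯ"                             # rule 9
        or left.endswith(TITLES)                     # rule 10 (over-permissive)
    )


def label_spaces(text: str) -> List[str]:
    """Label each single space as 'minor' or 'unknown' (left to a classifier)."""
    chunks = text.split(" ")
    return ["minor" if is_minor_space(a, b) else "unknown"
            for a, b in zip(chunks, chunks[1:])]
```

For instance, `label_spaces("ราคา 100 บาท")` marks both spaces as minor because each touches a digit. Spaces labeled "unknown" are the ones that would still need a statistical decision.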
B. Advantage of Rules for Sentence-Breaking
The major advantage of rules is their independence from
training data, which sometimes does not provide useful clues
for reliable prediction. For example, a digit is tagged as np
in the corpus with CG tagging, and it generally has a minor
space before and after it. Such a space may therefore be
incorrectly classified as a sentence-break, since the tag gives
no indication that the token is a digit. In this case, applying
rules can minimize predictable mistakes.
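The digit example can be made concrete with a toy sketch in which a rule inspects the surface form before a tag-based predictor is consulted. Everything here is hypothetical: `tag_only_model` is a stand-in stub, not the classifier from the experiments, and the label strings are chosen only for illustration.

```python
from typing import Callable, Tuple

Token = Tuple[str, str]  # (surface form, CG tag)


def touches_digit(token: str) -> bool:
    """True if the token starts or ends with a digit (covers '100', '10:30')."""
    return token[:1].isdigit() or token[-1:].isdigit()


def tag_only_model(prev_tag: str, next_tag: str) -> str:
    """Stub predictor that sees only tags, so a digit tagged np looks like any noun."""
    return "sentence-break" if (prev_tag, next_tag) == ("np", "np") else "minor"


def classify_space(prev: Token, nxt: Token,
                   model: Callable[[str, str], str]) -> str:
    """Apply the digit rule first; fall back to the model otherwise."""
    if touches_digit(prev[0]) or touches_digit(nxt[0]):
        return "minor"
    return model(prev[1], nxt[1])
```

With these definitions, the stub alone labels the space between two np tokens as a sentence-break, whereas `classify_space(("ราคา", "np"), ("100", "np"), tag_only_model)` returns "minor" because the rule sees the digit in the surface form.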