III. METHODOLOGY
This study extends our previous work, so most of the
experimental settings and data are the same as in that study.
When statistical and rule-based methods are used together, we
must consider not only appropriate algorithms and features but
also a practical way to integrate statistics and rules. There
are therefore three main considerations, as follows.
A. Learning Algorithm
In the previous study [12], Classification and
Regression Tree (CART) gave the best performance among the
algorithms compared for phrase boundary prediction, a task
closely related to sentence-breaking. In addition, it has
achieved excellent performance in Thai NLP tasks. In this
paper, we therefore use CART as the learning algorithm.
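As an illustration of how a CART-style classifier picks a decision rule, the following minimal sketch selects the threshold on one numeric feature that minimises the weighted Gini impurity of the two resulting groups. The paper does not state which CART implementation or split criterion was used, so the Gini criterion, the function names and the toy data are all assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule 'value <= threshold'.

    If no split improves on the unsplit impurity, the threshold is None.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical values cannot be separated by a threshold
        threshold = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data (not from the paper): chunk length in words,
# and whether the space after the chunk is a sentence break (1) or not (0).
chunk_lengths = [2, 3, 4, 9, 10, 12]
is_break = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(chunk_lengths, is_break)
```

A full CART learner applies this search to every feature and recurses on each resulting group until a stopping rule is met; the sketch shows a single split only.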
B. Design of Features for Thai Sentence-Breaking
As described in the introduction, good features should
represent the contextual behaviour of a whole sentence. In
this study, six kinds of features were considered, as shown
in Table I.
CG and POS have been widely used to represent Thai natural
language and have shown satisfactory results. We used CG as
the main feature because of its success in the previous study [9].
From a native speaker's point of view, identifying a sentence
boundary is not only a matter of considering the context around
a space but also of judging the chunk between the previous
break and the space under consideration. Moreover, a study [13]
showed that in phrase break prediction, which is closely related
to this study, using features linked to a chunk can increase
performance. Therefore, three features reflecting the contextual
behaviour of a chunk (NWrd_SB, NWrd_End and V) were proposed to
capture essential information. NWrd_SB is based on the idea that
a sentence normally has a typical length. Thus, a chunk of about
that length should have a high probability of being predicted
as ending at a sentence break.
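A chunk-length feature of this kind can be sketched as follows. The text above does not give the exact definition of NWrd_SB, so this example assumes it is the number of words between the previous sentence break and the space under consideration. The token representation, the function name nwrd_sb and the placeholder tokens are hypothetical.

```python
SPACE = " "


def nwrd_sb(tokens, break_indices):
    """Map each space position to the word count since the previous break.

    tokens        -- word-segmented text in which spaces are kept as tokens
    break_indices -- positions of spaces already known to be sentence breaks
    (Assumed definition; the paper's exact formulation may differ.)
    """
    counts = {}
    words = 0
    for i, token in enumerate(tokens):
        if token == SPACE:
            counts[i] = words
            if i in break_indices:
                words = 0  # a new chunk starts after a sentence break
        else:
            words += 1
    return counts


# Placeholder tokens stand in for segmented Thai words.
tokens = ["w1", "w2", SPACE, "w3", SPACE, "w4", "w5", "w6", SPACE]
features = nwrd_sb(tokens, break_indices={4})
# features == {2: 2, 4: 3, 8: 3}
```

The counter resets only at spaces marked as sentence breaks, so a space that is not a break still contributes its preceding words to the length of the current chunk.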
