I. INTRODUCTION
Most Thai people are familiar with Thai herbs, which have
been used to relieve many simple symptoms, such as
indigestion, stomachache, and fever. Many organizations
[1-4] have published useful Thai herb information on their
websites. Although this information is similar and largely
consistent across sites, no single site gives complete information
on the names of Thai herbs, the symptoms they treat, and the
parts used. For example, one webpage gives only the scientific
name and omits the common name, while another indicates
which Thai herb treats which symptom but does not say which
part of the herb to use.
This paper proposes a Thai herb information extraction
process that recognizes useful Thai herb information and extracts
it from multiple websites. The process employs an open-source
HTML parser called JSOUP [5] and several template
files. The overall process has two main phases: a symptom name
collection phase and a treatment information extraction phase.
The output of the first phase is a file containing symptom names,
which are collected from multiple websites using synonyms of the
word 'treat'. The second phase extracts Thai herb treatment
information, comprising Thai herb names and medicinal uses.
Names include the scientific name, common name, and family name.
Medicinal uses include the part used and the symptom name.
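The first phase can be illustrated with a minimal sketch. It is not the
authors' implementation: the paper uses JSOUP and template files, whereas
this example uses only the Python standard library and works on plain English
sentences that are assumed to have been extracted from a page already. The
synonym list TREAT_SYNONYMS and the function collect_symptoms are
hypothetical names introduced for illustration.

```python
import re

# Hypothetical synonym list for the word 'treat'; the paper's own list
# is not given in this section.
TREAT_SYNONYMS = ["treat", "treats", "cure", "cures", "relieve", "relieves"]


def collect_symptoms(sentences):
    """Return a sorted list of symptom names that follow a 'treat' synonym.

    Assumption: symptom names appear right after the verb, separated by
    commas or the word 'and', and end at a period or semicolon.
    """
    verbs = "|".join(TREAT_SYNONYMS)
    pattern = re.compile(r"\b(?:" + verbs + r")\b\s+([^.;]+)", re.IGNORECASE)
    symptoms = set()
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            # Split the captured run into individual symptom names.
            for name in re.split(r",|\band\b", match.group(1)):
                name = name.strip().lower()
                if name:
                    symptoms.add(name)
    return sorted(symptoms)
```

The sorted list returned here plays the role of the symptom name file that
the second phase consults.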
The challenge of Thai herb information extraction is that
each website has a different HTML structure; some are
simple and some are complicated. When one Thai herb can
treat many symptoms, the symptom names are listed
consecutively, with or without a delimiter. However, each
symptom may be treated using a different part of the herb. The
process therefore needs to recognize which content is a Thai herb
part and which content is a symptom.
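One way to separate herb parts from symptoms in an undelimited run is a
greedy longest-match lookup against known word lists. The sketch below is an
illustration under stated assumptions, not the method of the paper: it
assumes whitespace-separated English text, a symptom lexicon such as the one
produced by the first phase, and a hypothetical part list PARTS. The function
name label_tokens is also hypothetical.

```python
# Hypothetical list of herb part names, for illustration only.
PARTS = {"root", "leaf", "fruit", "bark", "flower", "seed"}


def label_tokens(text, symptom_lexicon, part_lexicon=PARTS):
    """Label each recognized phrase in text as a 'symptom' or a 'part'.

    Symptoms are matched greedily, longest phrase first, so that a
    multi-word name such as 'sore throat' is kept as one unit.
    Tokens found in neither lexicon are skipped.
    """
    tokens = text.lower().replace(",", " ").split()
    max_len = max((len(s.split()) for s in symptom_lexicon), default=1)
    labelled = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in symptom_lexicon:
                labelled.append(("symptom", phrase))
                i += n
                matched = True
                break
        if matched:
            continue
        if tokens[i] in part_lexicon:
            labelled.append(("part", tokens[i]))
        i += 1
    return labelled
```

Because matching relies on the lexicon and not on punctuation, the same call
handles lists written with or without delimiters.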
Not only does Thai herb information from multiple sources
follow no fixed content pattern, but the medicinal-use topic of
each page also has an inherently nested, list-like structure.
That is, a medicinal-use topic can contain many
part-used topics, and each part-used topic can list
many symptoms that the part can treat.
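The nesting described above can be modelled with a small record type. The
following is a sketch of one possible representation, not the data format
used in the paper; the class names HerbRecord and PartUse and the helper
flatten are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PartUse:
    """One part-used topic: a herb part and the symptoms it treats."""
    part: str
    symptoms: List[str] = field(default_factory=list)


@dataclass
class HerbRecord:
    """One herb: its three names plus a nested list of part-used topics."""
    scientific_name: str
    common_name: str
    family_name: str
    medicinal_uses: List[PartUse] = field(default_factory=list)


def flatten(record: HerbRecord) -> List[Tuple[str, str, str]]:
    """Expand the nested record into (common name, part, symptom) triples."""
    return [
        (record.common_name, use.part, symptom)
        for use in record.medicinal_uses
        for symptom in use.symptoms
    ]
```

Flattening to triples makes it straightforward to compare extracted output
against a hand-labelled list when measuring precision and recall.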
The rest of this paper is divided into three parts. The next section
presents related work on web information extraction. Section 3
describes the proposed process for extracting Thai herb
information from multiple websites. The last section describes
the experiment and evaluates the precision and recall of the
proposed process.
