We have introduced a new hybrid approach
for finding abbreviations and their definitions in
unstructured texts. The problem of abbreviation
processing has attracted relatively little attention
in the NLP field. However, technical documents use
many abbreviations to represent domain-specific
knowledge. Thus, the ability to find
correct abbreviations and their definitions is
essential to making use of the
information contained in those documents. It is
also very useful for many NLP applications such
as information retrieval [1] and glossary
extraction [4, 9, 11].
The proposed method has the following
advantages:
(1) It is simple and fast.
A small number of formation rules can
describe many abbreviations. By keeping
these rules in the rulebase, this system can
process most abbreviations by simple
pattern matches. Furthermore, the
abbreviation matcher consists of five simple
match routines, each dedicated to a
certain type of abbreviation. Thus, it is
conceptually simple and fast.
(2) It shows high recall and precision rates.
(3) It provides for flexible user customization.
For example, users can specify rule
thresholds for updating the rulebase.
(4) It is trainable. The rulebase may be
automatically refined as the system
processes new documents.
(5) It is adaptable to new styles and editorial
conventions. It can process new types of
abbreviations by inserting appropriate rules
in the rulebase without modifying the
system. Rules are symbolic, so users can
easily add, modify, or delete the rules by
hand.
(6) It can be adapted to new technical domains.
The dictionary, set of replacement matches,
stopword list, and prefix list can be tailored
for new domains.
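As a rough illustration of advantages (1), (3), and
(4), the sketch below shows how a rulebase keyed by
simple patterns, together with a user-set threshold,
might behave. It is a minimal sketch under stated
assumptions: the pattern encoding, the `Rulebase`
class, the `patterns` helper, the stopword set, and
the meaning given to the threshold are all
hypothetical and are not taken from the system
described here.

```python
from collections import Counter

# Hypothetical stopword list; a real one would be tailored per domain.
STOPWORDS = {"of", "the", "and", "for", "in"}


def patterns(abbr, definition):
    """Encode an abbreviation/definition pair as two pattern strings.

    Assumed encoding: 'c' for each abbreviation character,
    'w' for each content word and 's' for each stopword in the definition.
    """
    abbr_pat = "c" * len(abbr)
    def_pat = "".join(
        "s" if word.lower() in STOPWORDS else "w"
        for word in definition.split()
    )
    return abbr_pat, def_pat


class Rulebase:
    """Counts how often each pattern pair has been confirmed."""

    def __init__(self, threshold=2):
        # User-set: confirmations required before a pattern pair is trusted.
        self.threshold = threshold
        self.counts = Counter()

    def observe(self, abbr, definition):
        """Record one confirmed abbreviation/definition pair."""
        self.counts[patterns(abbr, definition)] += 1

    def is_trusted(self, abbr, definition):
        """True if this pair's pattern has reached the threshold."""
        return self.counts[patterns(abbr, definition)] >= self.threshold
```

In this sketch, a pattern pair becomes trusted only
after it has been observed `threshold` times, so
raising the threshold makes rulebase updates more
conservative, and later candidates with a trusted
pattern pair can be accepted by a simple lookup.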
In addition to the lacunae mentioned in
Section 5, we are aware that there are classes of
abbreviations which our current method does not
handle adequately. These are typically written
with all lower-case characters and are almost
never introduced with text markers or cue
words. Examples are:
· cu – customer
· hw – hardware
· mgr – manager
· pgm – program
· sw – software
Mechanisms for processing these abbreviations,
which tend to occur in informal text such as
email, chat rooms, or customer service call
records, are the subject of ongoing research in
our project.