Information extraction is the task of pulling information out of documents and
recording it in a form that is useful in other applications. This might
mean identifying the answer to a specific question or recording the information
in a form from which questions can be answered at a later date. One such form
is known as a frame, which is essentially a template in which specifics are
recorded. For example, consider a system for reading a newspaper. The system
might make use of a variety of frames, one for each type of article that might
appear in a newspaper. If the system identifies an article as reporting on a burglary,
it would proceed by trying to fill in the slots in the burglary frame. This
frame would probably have slots for such items as the address of the burglary, the
time and date at which it occurred, the items taken, and so on. In contrast, if the system
identifies an article as reporting on a natural disaster, it would fill in the natural
disaster frame, which would lead the system toward identifying the type of disaster,
amount of damage, and so on.
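
As a rough illustration, a frame can be sketched as a record whose fields are the slots to be filled. The following is a minimal sketch, not a description of any particular system: the class name `BurglaryFrame`, its slot names, the helper `unfilled_slots`, and the sample values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BurglaryFrame:
    """Hypothetical burglary frame; each field is one slot to be filled."""
    address: Optional[str] = None
    date: Optional[str] = None
    time: Optional[str] = None
    items_taken: List[str] = field(default_factory=list)

    def unfilled_slots(self) -> List[str]:
        # A slot counts as unfilled while it is still None or an empty list.
        return [name for name, value in vars(self).items()
                if value is None or value == []]


# Simulate an extractor that has found only some of the details so far.
frame = BurglaryFrame()
frame.address = "12 Elm Street"        # invented sample value
frame.items_taken.append("television")  # invented sample value

print(frame.unfilled_slots())  # ['date', 'time']
```

A natural disaster frame would be a second record of the same kind with different slots. The list of unfilled slots tells the system what it should still be looking for in the article.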

Another form in which information extractors record information is known as
a semantic net. This is essentially a large linked data structure in which pointers
are used to indicate associations among the data items. Figure 11.3 shows part of
a semantic net in which the information obtained from the sentence
Mary hit John.
has been highlighted.
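
To make the idea of a linked structure concrete, the sketch below stores a semantic net as a set of labeled links between nodes. It is only one plausible encoding and is not taken from the figure: the class `SemanticNet`, the event node `hit-1`, and the relation labels `isa`, `agent`, and `object` are assumptions chosen for this example.

```python
from collections import defaultdict


class SemanticNet:
    """Hypothetical semantic net: nodes joined by labeled, directed links."""

    def __init__(self):
        # Maps each node to a list of (relation, target node) pairs.
        self.links = defaultdict(list)

    def add(self, source, relation, target):
        self.links[source].append((relation, target))

    def related(self, source, relation):
        # Follow every link with the given label that leaves the source node.
        return [target for rel, target in self.links.get(source, [])
                if rel == relation]


# One assumed way to record "Mary hit John." as an event node with two roles.
net = SemanticNet()
net.add("hit-1", "isa", "hit")
net.add("hit-1", "agent", "Mary")
net.add("hit-1", "object", "John")

print(net.related("hit-1", "agent"))   # ['Mary']
print(net.related("hit-1", "object"))  # ['John']
```

Because the event is itself a node, later sentences can attach further links to it, such as a time or a place, without disturbing what has already been recorded.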