In collaboration with Ed Hovy, Jerry Hobbs and their NLP group at ISI, we have taken a first step toward building a learning-by-reading system: cobbling together a prototype, analyzing its performance and identifying the major obstacles to success. We built the prototype by assembling three off-the-shelf systems for the tasks of parsing, semantic elaboration and knowledge integration. The system's starting knowledge is our Component Library, which contains formal representations of about 700 general concepts - such as the events Penetrate and Enter, and the entities Barrier and Container. We applied the system to the domain of heart biology, giving it numerous paragraphs on the structure and function of the human heart. The texts were unrestricted in their use of English, and were roughly at the level of Wikipedia articles. To help the system get started, we extended its general knowledge with ten concepts - such as Pump and Muscle - that are domain-general but important for understanding heart texts.
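The three-stage pipeline can be sketched roughly as follows. This is a toy illustration under invented interfaces - the function names, rules and data shapes are our assumptions, not the actual APIs of the off-the-shelf systems used in the project.

```python
# Hypothetical sketch of the parse -> elaborate -> integrate pipeline.
# All stage logic here is a deliberately simplified stand-in.

def parse(sentence: str) -> dict:
    """Syntactic parsing: return a toy analysis of the sentence."""
    tokens = sentence.rstrip(".").split()
    return {"tokens": tokens}

def elaborate(analysis: dict) -> list:
    """Semantic elaboration: map the parse to candidate triples.
    Toy rule: treat subject-verb-object as concept-relation-concept."""
    t = analysis["tokens"]
    return [(t[0], t[1], t[2])] if len(t) >= 3 else []

def integrate(triples: list, kb: set) -> set:
    """Knowledge integration: merge new triples into the knowledge base."""
    return kb | set(triples)

kb = set()
kb = integrate(elaborate(parse("heart pumps blood.")), kb)
print(kb)  # {('heart', 'pumps', 'blood')}
```

The real systems are far richer at each stage; the sketch only shows how the stages compose, with each stage's output feeding the next.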
By reading texts, the system attempts to learn a knowledge base of concept-relation-concept triples for the information conveyed by the text. In addition, it attempts to formulate hypotheses (also triples) for inferences it draws from the text but cannot confirm. To evaluate its performance, we compare the system's recall and precision with those of human readers, thereby establishing a performance baseline for evaluating future systems.
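Scoring learned triples against a human-produced answer key can be sketched as below. The gold and learned triples are invented examples, not actual project data, and the relation names are illustrative assumptions.

```python
# Hedged sketch: precision and recall of a set of learned triples
# against a hypothetical human "gold" key of correct triples.

def precision_recall(learned: set, gold: set):
    """Precision = fraction of learned triples that are correct;
    recall = fraction of gold triples the system actually found."""
    correct = learned & gold
    precision = len(correct) / len(learned) if learned else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

gold = {("Heart", "has-part", "Ventricle"),
        ("Ventricle", "agent-of", "Pump"),
        ("Blood", "object-of", "Pump")}
learned = {("Heart", "has-part", "Ventricle"),
           ("Heart", "has-part", "Valve"),   # unconfirmed hypothesis
           ("Ventricle", "agent-of", "Pump")}

p, r = precision_recall(learned, gold)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```

In this toy case the system found two of three gold triples (recall 0.67), and two of its three learned triples were correct (precision 0.67).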
With funding from DARPA and an expanded team of researchers - including Peter Clark and Ralph Weischedel - and project management by Noah Friedland and David Israel, we're now in Phase II of the project. Our group will continue to focus on the key research challenge: Knowledge Integration.