Text mining appears to embrace the whole of automatic natural language processing and, arguably,
far more besides—for example, analysis of linkage structures such as citations in the academic
literature and hyperlinks in the Web literature, both useful sources of information that lie outside
the traditional domain of natural language processing. But, in fact, most text mining efforts
consciously shun the deeper, cognitive, aspects of classic natural language processing in favor of
shallower techniques more akin to those used in practical information retrieval.
The reason is best understood in the context of the historical development of the subject of natural
language processing. The field’s roots lie in automatic translation projects in the late 1940s and
early 1950s, whose aficionados assumed that strategies based on word-for-word translation would
provide decent and useful rough translations that could easily be honed into something more
accurate using techniques based on elementary syntactic analysis. But the sole outcome of these
high-profile, heavily-funded projects was the sobering realization that natural language, even at an
illiterate child’s level, is an astonishingly sophisticated medium that does not succumb to
simplistic techniques. It depends crucially on what we regard as “common-sense” knowledge,
which despite—or, more likely, because of—its everyday nature is exceptionally hard to encode
and utilize in algorithmic form [Lenat, 1995].
As a result of these embarrassing and much-publicized failures, researchers withdrew into “toy
worlds”—notably the “blocks world” of geometric objects, shapes, colors, and stacking
operations—whose semantics are clear and possible to encode explicitly. But it gradually became
apparent that success in toy worlds, though initially impressive, does not translate into success on
realistic pieces of text. Toy-world techniques deal well with artificially-constructed sentences of
what one might call the “Dick and Jane” variety after the well-known series of eponymous
children’s stories. But they fail dismally when confronted with real text, whether painstakingly
constructed and edited (like this article) or produced under real-time constraints (like informal
conversation).
Meanwhile, researchers in other areas simply had to deal with real text, with all its vagaries,
idiosyncrasies, and errors. Compression schemes, for example, must work well with all
documents, whatever their contents, and avoid catastrophic failure even when processing
outrageously deviant files (such as binary files, or completely random input). Information retrieval
systems must index documents of all types and allow them to be located effectively whatever their
subject matter or linguistic correctness. Key-phrase extraction and text summarization algorithms
have to do a decent job on any text file. Practical, working systems in these areas are topic-independent,
and most are language-independent. They operate by treating the input as though it
were data, not language.
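To make the "data, not language" point concrete, the following is a minimal illustrative sketch (not drawn from any of the systems mentioned above; the function name top_terms and its thresholds are invented for illustration). It extracts the most frequent terms from a document purely by counting tokens, so it works regardless of topic and, for any language that separates words with whitespace, regardless of language.

from collections import Counter

def top_terms(text: str, n: int = 10) -> list[tuple[str, int]]:
    # Lowercase, split on whitespace, and strip surrounding punctuation;
    # no parsing, no grammar, no assumption about which language the text is in.
    tokens = [t.strip(".,;:!?\"'()[]").lower() for t in text.split()]
    tokens = [t for t in tokens if len(t) > 3]  # crude filter for very short tokens
    return Counter(tokens).most_common(n)

print(top_terms("Text mining treats text as data; data, not language, is what such systems see."))

Real key-phrase extraction and indexing systems are of course more elaborate, with term weighting, phrase detection, and stopword handling, but they share this character of operating on the text as data.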
Text mining is an outgrowth of this “real text” mindset. Accepting that the answer is probably “not much,” what can be done with unrestricted input? Can the ability to process huge amounts of text
compensate for relatively simple techniques? Natural language processing, dominated in its
infancy by unrealistic ambitions and swinging in childhood to the other extreme of unrealistically
artificial worlds and trivial amounts of text, has matured and now embraces both viewpoints:
relatively shallow processing of unrestricted text and relatively deep processing of domain-specific
material.
It is interesting that data mining also evolved out of a history of difficult relations between
disciplines, in this case machine learning—rooted in experimental computer science, with ad hoc
evaluation methodologies—and statistics—well-grounded theoretically, but based on a tradition of
testing explicitly-stated hypotheses rather than seeking new information. Early machine learning
researchers knew little of statistics and cared less; early researchers on structured statistical hypotheses
remained ignorant of parallel work in machine learning. The result was that similar techniques (for
example, decision-tree building and nearest-neighbor learners) arose in parallel in both fields.