looking for patterns in text. However, the superficial similarity between the two conceals real
differences. Data mining can be more fully characterized as the extraction of implicit, previously
unknown, and potentially useful information from data [Witten and Frank, 2000]. The information
is implicit in the input data: it is hidden, unknown, and could hardly be extracted without recourse
to automatic techniques of data mining. With text mining, however, the information to be extracted
is clearly and explicitly stated in the text. It’s not hidden at all—most authors go to great pains to
make sure that they express themselves clearly and unambiguously—and, from a human point of
view, the only sense in which it is “previously unknown” is that limits on human time and attention
make it infeasible for people to read the text themselves. The problem, of course, is that the information
is not couched in a manner that is amenable to automatic processing. Text mining strives to bring
it out of the text in a form that is suitable for consumption by computers directly, with no need for
a human intermediary.
Though there is a clear difference philosophically, from the computer’s point of view the problems
are quite similar. Text is just as opaque as raw data when it comes to extracting information—
probably more so.
Another requirement that is common to both data and text mining is that the information extracted
should be “potentially useful.” In one sense, this means actionable—capable of providing a basis
for actions to be taken automatically. In the case of data mining, this notion can be expressed in a
relatively domain-independent way: actionable patterns are ones that allow non-trivial predictions
to be made on new data from the same source. Performance can be measured by counting
successes and failures, and statistical techniques can be applied to compare different data mining
methods on the same problem. However, in many text mining situations it is far harder
to characterize what “actionable” means in a way that is independent of the particular domain at
hand. This makes it difficult to find fair and objective measures of success.
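For the data mining case, the evaluation described above (counting successes and failures on new data, then comparing two methods statistically) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the toy labels, and the two sets of predictions are invented here, and the exact sign test is one possible choice of statistical technique, not one prescribed by the text.

```python
from math import comb


def count_outcomes(predictions, truth):
    """Return (successes, failures) for one method on held-out data."""
    successes = sum(p == t for p, t in zip(predictions, truth))
    return successes, len(truth) - successes


def paired_sign_test(pred_a, pred_b, truth):
    """Exact two-sided sign test on the cases the two methods disagree about.

    Only instances where exactly one method is correct carry information
    about which method is better; ties (both right or both wrong) are dropped.
    Returns (a_only, b_only, p_value).
    """
    a_only = sum(a == t and b != t for a, b, t in zip(pred_a, pred_b, truth))
    b_only = sum(b == t and a != t for a, b, t in zip(pred_a, pred_b, truth))
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0
    k = min(a_only, b_only)
    # Probability of a split at least this lopsided if both methods were equally good.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


# Hypothetical held-out labels and predictions from two methods (toy data).
truth    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
method_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
method_b = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

print(count_outcomes(method_a, truth))              # (9, 1)
print(count_outcomes(method_b, truth))              # (5, 5)
print(paired_sign_test(method_a, method_b, truth))  # (4, 0, 0.125)
```

Even though the first method is right on four instances where the second is wrong and never the reverse, ten test instances are too few for the difference to be significant at conventional levels. The sketch also relies on every instance having a single agreed correct answer, which is exactly what is hard to establish in many text mining settings.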
In many data mining applications, “potentially useful” is given a different interpretation: the key
to success is that the information extracted must be comprehensible, in that it helps to explain the
data. This is necessary whenever the result is intended for human consumption rather than (or as
well as) a basis for automatic action. This criterion is less applicable to text mining because, unlike
data mining, the input itself is comprehensible. Text mining with comprehensible output is
tantamount to summarizing salient features from a large body of text, which is a subfield in its
own right: text summarization.