might thus describe the issue as one of problem uncertainty.
To take a familiar example, which we do not explore in detail here because it has been treated thoroughly elsewhere, the relative costs of different kinds of misclassification may differ and may be unknown. A very common resolution is to assume
equal costs (Jamain and Hand [24] found that most
comparative studies of classification rules made this
assumption) and to use straightforward error rate
as the performance criterion. However, equality is but one choice, an arbitrary one at that, and one which we suspect is rarely appropriate in practice.
In assuming equal costs, one adopts a particular problem formulation, which may not be the problem actually to be solved. Indeed, things are even worse than
this might suggest, because relative misclassification
costs may change over time. Provost and Fawcett
[36] have described such situations: “Comparison often
is difficult in real-world environments because
key parameters of the target environment are not
known. The optimal cost/benefit tradeoffs and the
target class priors seldom are known precisely, and
often are subject to change (Zahavi and Levin [47];
Friedman and Wyatt [8]; Klinkenberg and Thorsten
[29]). For example, in fraud detection we cannot ignore
misclassification costs or the skewed class distribution,
nor can we assume that our estimates are
precise or static (Fawcett and Provost [6]).”
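To make the point concrete, here is a small numerical sketch (the confusion counts and cost ratios are hypothetical, chosen purely for illustration): two classifiers swap rank the moment the two kinds of error stop being equally expensive.

```python
# Hypothetical illustration: ranking two classifiers by error rate
# versus by expected misclassification cost.

def average_cost(fp, fn, c_fp, c_fn, n=1000):
    """Average misclassification cost per case, given false-positive
    and false-negative counts on n test cases."""
    return (fp * c_fp + fn * c_fn) / n

# Assumed confusion counts on 1000 test cases (not from any real study).
clf_a = {"fp": 10, "fn": 40}   # fewer errors overall (50)
clf_b = {"fp": 55, "fn": 5}    # more errors (60), but far fewer misses

for c_fp, c_fn in [(1, 1), (1, 10)]:  # equal costs, then misses 10x costlier
    cost_a = average_cost(clf_a["fp"], clf_a["fn"], c_fp, c_fn)
    cost_b = average_cost(clf_b["fp"], clf_b["fn"], c_fp, c_fn)
    winner = "A" if cost_a < cost_b else "B"
    print(f"costs (FP, FN) = ({c_fp}, {c_fn}): "
          f"A = {cost_a:.3f}, B = {cost_b:.3f} -> {winner} preferred")
```

Under equal costs classifier A is preferred (average cost 0.050 versus 0.060), but once a missed case costs ten times a false alarm the ranking reverses (0.410 versus 0.105); an evaluation that fixed the costs in advance would have chosen a classifier for a problem other than the one eventually faced.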
Our fourth argument is that classification methods are typically evaluated by reporting their performance on a variety of real data sets.
However, such empirical comparisons, while superficially
attractive, have major problems which are often
not acknowledged. In general, we suggest in Section
5 that no method will be universally superior
to other methods: relative superiority will depend
on the type of data used in the comparisons, the
particular data sets used, the performance criterion
and a host of other factors. Moreover, relative performance will depend on the experience of the person making the comparison with each method, and this experience may differ between methods:
researcher A may find that his favorite method
is best, merely because he knows how to squeeze the
best performance from this method.
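The data-dependence of relative superiority is easy to exhibit in simulation. The sketch below is purely illustrative and is not drawn from any study cited here: it uses synthetic data and two off-the-shelf scikit-learn methods, and shows the two methods exchanging ranks when the data-generating process changes from a linear boundary to an interaction (XOR-style) boundary.

```python
# Illustrative simulation (assumed setup): the same two methods swap
# ranks when the structure of the data changes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000

# Problem 1: classes separated by a linear boundary.
X1 = rng.normal(size=(n, 2))
y1 = (X1[:, 0] + X1[:, 1] > 0).astype(int)

# Problem 2: XOR-style interaction; no single linear boundary helps.
X2 = rng.normal(size=(n, 2))
y2 = ((X2[:, 0] > 0) ^ (X2[:, 1] > 0)).astype(int)

for label, (X, y) in [("linear boundary", (X1, y1)),
                      ("XOR boundary", (X2, y2))]:
    for model in (LogisticRegression(), DecisionTreeClassifier(max_depth=5)):
        acc = cross_val_score(model, X, y, cv=5).mean()
        print(f"{label}: {type(model).__name__}: accuracy = {acc:.3f}")
```

On the first problem the linear method dominates; on the second it does no better than chance, while the tree recovers the interaction. Neither method is "best": the ranking is a property of the data, not of the methods alone.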
These various arguments together suggest that an
apparent superiority in classification accuracy, obtained
in “laboratory conditions,” may not translate
to a superiority in real-world conditions and, in
particular, the apparent superiority of highly sophisticated
methods may be illusory, with simple methods
often being equally effective or even superior in
classifying new data points.