largest and most striking aspects of the data structure,
and then turns to progressively smaller aspects
(stopping, one hopes, before the process begins to
model idiosyncrasies of the observed sample of data
rather than aspects of the true underlying distribution).
In Section 2 we show that this means that the
large gains in predictive accuracy in classification
are won using relatively simple models at the start of
the process, leaving potential gains which decrease
in size as the modeling process is taken further. All
of this means that the extra accuracy of the more
sophisticated approaches, beyond that attained by
simple models, is achieved from “minor” aspects of
the distributions and classification problems.
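This diminishing-returns pattern can be illustrated with a small simulation. The problem below is hypothetical and purely illustrative (not an example from Section 2): the class depends strongly on one variable through a simple linear effect, and weakly on a second, "minor" aspect of the distribution. A single threshold captures most of the attainable accuracy; exploiting the weak effect as well adds comparatively little.

```python
import random

random.seed(0)

# Hypothetical synthetic problem: a strong, simple signal in x1
# and a weak, "minor" signal in x2.
def draw_point():
    y = random.randint(0, 1)
    x1 = random.gauss(2.0 * y, 1.0)                     # strong linear effect
    x2 = random.gauss(0.4 * (y - 0.5) * abs(x1), 1.0)   # weak secondary effect
    return x1, x2, y

data = [draw_point() for _ in range(20000)]

def accuracy(rule):
    return sum(rule(x1, x2) == y for x1, x2, y in data) / len(data)

base = accuracy(lambda x1, x2: 0)                    # ignore the data entirely
simple = accuracy(lambda x1, x2: int(x1 > 1.0))      # one threshold on x1
# A more "sophisticated" rule that also exploits the weak x2 effect:
complex_ = accuracy(lambda x1, x2: int(x1 + 0.3 * x2 > 1.0))

print(base, simple, complex_)
```

On data of this form the gain from ignorance to the simple threshold is large, while the further gain from modeling the minor aspect is small.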
Second, in Section 3 we argue that in many, perhaps
most, real classification problems the data points
in the design set are not, in fact, randomly drawn
from the same distribution as the data points to
which the classifier will be applied. There are many
reasons for this discrepancy, and some are illustrated.
It goes without saying that statements about classifier
accuracy which rest on the false assumption that the
design set and the future points to be classified share
the same distribution may well be inaccurate.
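A minimal sketch of this discrepancy, under one assumed form of population drift (a shift in the class-conditional means between design time and deployment; the particular shift is invented for illustration): a rule tuned on the design set looks accurate when evaluated on data like the design set, but performs worse on the drifted population it is actually applied to.

```python
import random

random.seed(1)

def draw(shift):
    # Class-conditional means drift by `shift` between design and deployment.
    y = random.randint(0, 1)
    x = random.gauss(2.0 * y + shift, 1.0)
    return x, y

design = [draw(0.0) for _ in range(10000)]
future = [draw(0.8) for _ in range(10000)]   # hypothetical drifted population

# Rule tuned to the design set: threshold halfway between its class means.
rule = lambda x: int(x > 1.0)

def acc(points):
    return sum(rule(x) == y for x, y in points) / len(points)

print(acc(design), acc(future))   # design-set accuracy overstates future accuracy
```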
Third, when constructing classification rules, various
other assumptions and choices are often made
which may not be appropriate and which may give
misleading impressions of future classifier performance.
For example, it is typically assumed that the classes
are objectively defined, with no arbitrariness or uncertainty
about the class labels, but this is sometimes
not the case. Likewise, parameters are often
estimated by optimizing criteria which are not relevant
to the real aim of classification accuracy. Such
issues are described in Section 4; once again, they
cast doubt on how the claimed classifier performance
will generalize to real problems.
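The point about estimation criteria can be made concrete with a synthetic sketch (again invented for illustration, not an example from Section 4). On an imbalanced problem, a classification threshold obtained by minimizing squared error, a criterion unrelated to error rate, can sit far from the threshold that the error rate itself would select.

```python
import random

random.seed(2)

# Hypothetical imbalanced problem: a common class near 0, a rare,
# more dispersed class near 3.
data = []
for _ in range(5000):
    if random.random() < 0.9:
        data.append((random.gauss(0.0, 1.0), 0))
    else:
        data.append((random.gauss(3.0, 3.0), 1))

xs = [x for x, _ in data]
n = len(data)

# Least-squares fit of the 0/1 label on x (a criterion irrelevant to accuracy).
mx = sum(xs) / n
my = sum(y for _, y in data) / n
b = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
t_ls = (0.5 - a) / b   # classify as class 1 where the fitted value exceeds 0.5

def acc(t):
    return sum((x > t) == (y == 1) for x, y in data) / n

# Choose the threshold by the criterion we actually care about: error rate.
candidates = [i / 100 for i in range(-200, 801)] + [t_ls]
t_acc = max(candidates, key=acc)

print(t_ls, acc(t_ls))
print(t_acc, acc(t_acc))
```

The least-squares threshold is pulled away from the accuracy-optimal one by the class imbalance and unequal spreads, so optimizing the convenient criterion costs real classification accuracy.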
The phenomena with which we are concerned in
Sections 3 and 4 are related to the phenomenon of
overfitting. A model overfits when it models the design
sample too closely rather than modeling the distribution
from which this sample is drawn. In Sections
3 and 4 we are concerned with situations in
which the models may accurately reflect the design
distributions (so they neither underfit nor overfit), but
where they fail to recognize that these distributions,
and the apparent classification problem they describe,
are in fact merely a single draw from a notional
distribution of problems. The real aim
might be to solve a rather different problem. One