in improvement in classifier accuracy are made in
the simple first steps. This is a phenomenon which
has been noticed by others (e.g., Rendell and Seshu
[37]; Shavlik, Mooney and Towell [41]; Mingers
[34]; Weiss, Galen and Tadepalli [45]; Holte [22]).
Holte [22], in particular, investigated this phenomenon. His “simple classifier” (called
1R) consists of a partition of a single variable, with
each cell of the partition possibly being assigned to
a different class: it is a multiple-split single-level tree
classifier. A search through the variables is used to
find the one yielding the best predictive accuracy.
Holte compared this simple rule with C4.5, a more
sophisticated tree algorithm, finding that “on most
of the datasets studied, 1R’s accuracy is about 3
percentage points lower than C4’s.”
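As an illustration, a minimal sketch of a 1R-style rule is given below. This is hypothetical code, not Holte's implementation, and it assumes the features have already been discretized (1R itself also chooses the discretization): for each variable, each observed value forms a cell, each cell is assigned its majority class, and the variable whose rule best fits the training data is retained.

```python
from collections import Counter, defaultdict

def one_r(X, y):
    """Fit a 1R-style rule: for each variable, map each observed value
    (cell) to its majority class, then keep the single variable whose
    rule classifies the training data most accurately.
    X: list of discrete feature vectors; y: list of class labels.
    Returns (variable index, {cell value: predicted class})."""
    best_j, best_rule, best_acc = None, None, -1.0
    for j in range(len(X[0])):
        # Class frequencies within each cell (distinct value) of variable j.
        cells = defaultdict(Counter)
        for xi, yi in zip(X, y):
            cells[xi[j]][yi] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in cells.items()}
        acc = sum(rule[xi[j]] == yi for xi, yi in zip(X, y)) / len(y)
        if acc > best_acc:
            best_j, best_rule, best_acc = j, rule, acc
    return best_j, best_rule

def predict(j, rule, x, default=None):
    """Apply the single-variable rule; unseen values fall back to a default."""
    return rule.get(x[j], default)
```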
We carried out a similar analysis. Perhaps the
earliest classification method formally developed is
Fisher’s linear discriminant analysis [7]. Table 1 shows
misclassification rates for this method and for the
best performing method we could find in a search
of the literature (these data were abstracted from
the data accumulated by Jamain [23] and Jamain
and Hand [24]) for a randomly selected sample of
ten data sets. The first numerical column shows the
misclassification rate of the best method we found
($m_T$), the second shows that of linear discriminant
analysis ($m_L$), the third shows that of the default rule
of assigning every point to the majority class ($m_0$),
and the final column shows the proportion of the difference
between the default rule and the best rule
which is achieved by linear discriminant analysis,
$(m_0 - m_L)/(m_0 - m_T)$. It is likely that the best
rules, being the best of the rules which many researchers
have applied, produce results near the Bayes
error rate.
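For concreteness, the final column can be read as the fraction of the achievable reduction in error (from the default rule $m_0$ down to the best known rule $m_T$) that linear discriminant analysis captures. A small sketch with purely illustrative values, not taken from Table 1:

```python
def improvement_fraction(m0, mL, mT):
    """Proportion of the gap between the default rule (m0) and the
    best known rule (mT) that is closed by a classifier with
    misclassification rate mL."""
    return (m0 - mL) / (m0 - mT)

# Illustrative values only: default rule errs 30% of the time,
# linear discriminant analysis 12%, best known method 10%.
print(improvement_fraction(0.30, 0.12, 0.10))  # 0.9 -> 90% of the gain
```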