In this paper we argue that, under a broad range of circumstances, all data reduction techniques will result in some decrease in tree size with little impact on accuracy. Section 2 offers detailed empirical evidence for the validity of this claim, but an intuitive feeling for why it might be true can be grasped by looking at Figure 1. The figure shows plots of tree size and accuracy as a function of training set size for the UC Irvine australian dataset. c4.5 was used to generate the trees (Quinlan 1993), and each plot corresponds to a different pruning mechanism: error-based (ebp, the c4.5 default) (Quinlan 1993), reduced error (rep) (Quinlan 1987), minimum description length (mdl) (Quinlan & Rivest 1989), cost-complexity with the 1se rule (ccp1se) (Breiman et al. 1984), and cost-complexity without the 1se rule (ccp0se). On the left-hand side of the graphs, no training instances are available and the best one can do with test instances is to assign them a class label at random. On the right-hand side of the graphs, the entire dataset (excluding test instances) is available to the tree building process. Movement from the left to the right corresponds to the addition of randomly selected instances to the training set. Alternatively, moving from the right to the left corresponds to removing randomly selected instances from the training set.
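To make the experimental setup concrete, the following is a minimal sketch of such a learning-curve experiment. It is illustrative only: the study above used c4.5 with the pruning mechanisms just listed, none of which is directly available in scikit-learn, so the sketch substitutes a CART-style tree with minimal cost-complexity pruning (roughly analogous to ccp0se), and a synthetic dataset stands in for the UC Irvine australian data.

```python
# Sketch of a learning-curve experiment: tree size and accuracy as a
# function of training set size. Assumptions: scikit-learn's
# DecisionTreeClassifier (CART with ccp_alpha pruning) replaces c4.5,
# and a synthetic dataset replaces the UCI australian data (which has
# 690 instances and 14 attributes).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=690, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A fixed random permutation of the training set: walking down this
# order corresponds to adding randomly selected instances, and walking
# back up corresponds to removing them.
rng = np.random.default_rng(0)
order = rng.permutation(len(X_train))

for frac in np.linspace(0.1, 1.0, 10):
    n = int(frac * len(order))
    idx = order[:n]  # the first n randomly ordered training instances
    clf = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
    clf.fit(X_train[idx], y_train[idx])
    size = clf.tree_.node_count      # tree size (number of nodes)
    acc = clf.score(X_test, y_test)  # accuracy on the held-out test set
    print(f"n={n:4d}  tree size={size:3d}  accuracy={acc:.3f}")
```

Plotting tree size and accuracy against n produces curves of the kind shown in Figure 1; the pattern the argument turns on is that accuracy tends to level off well before the full training set is used, while tree size continues to grow.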
