The central problem investigated here is minimizing the cost of classification
when tests are expensive. We argued that this requires assigning a cost to classification
errors. We also argued that a decision tree is the natural form of knowledge representation
for this type of problem. We then presented a general method for calculating the average cost
of classification, given a decision tree, information on how test costs are calculated,
a classification cost matrix, and a set of testing data. This method is applicable to
standard classification decision trees, without regard to how the decision tree is generated.
The method is sensitive to test costs, sensitive to classification error costs, capable of handling
conditional test costs, and capable of handling delayed tests.
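The core of the cost calculation can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the names (TEST_COSTS, COST_MATRIX, the tree node fields) and the particular costs are assumptions, and conditional and delayed test costs are omitted for brevity.

```python
# Hypothetical per-test costs (illustrative values, not from the paper).
TEST_COSTS = {"temperature": 1.0, "blood_test": 7.27, "x_ray": 20.0}

# COST_MATRIX[actual][predicted]: penalty for predicting `predicted`
# when the true class is `actual` (zero on the diagonal).
COST_MATRIX = {
    "healthy": {"healthy": 0.0, "sick": 10.0},
    "sick":    {"healthy": 50.0, "sick": 0.0},
}

def classify_with_cost(node, case):
    """Walk the tree for one case, accumulating the cost of every test
    performed along the path; return (predicted_class, test_cost)."""
    cost = 0.0
    while not node["leaf"]:
        test = node["test"]
        cost += TEST_COSTS[test]
        branch = "yes" if case[test] else "no"
        node = node[branch]
    return node["class"], cost

def average_cost(tree, cases):
    """Average, over the testing set, of test costs plus the
    classification error cost taken from the cost matrix."""
    total = 0.0
    for case in cases:
        predicted, test_cost = classify_with_cost(tree, case)
        total += test_cost + COST_MATRIX[case["class"]][predicted]
    return total / len(cases)
```

Because the calculation only walks an already-built tree, it applies to any standard classification tree, regardless of the induction algorithm that produced it.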
We introduced ICET, a hybrid genetic decision tree induction algorithm. ICET uses a
genetic algorithm to evolve a population of biases for a decision tree induction algorithm.
Each individual in the population represents one set of biases. The fitness of an individual is
determined by using it to generate a decision tree with a training dataset, then calculating the
average cost of classification for the decision tree with a testing dataset.
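The outer loop of this scheme can be sketched as a standard genetic algorithm over bias vectors. The sketch below is an assumption-laden simplification: `induce_tree` and `average_cost` stand in for the underlying induction algorithm and the cost-evaluation method above, and the GA operators (truncation selection, one-point crossover, Gaussian mutation) are deliberately simple placeholders for whatever GA package is actually used.

```python
import random

def fitness(biases, induce_tree, average_cost, train, test):
    """Lower is better: induce a tree from the training data under the
    given biases, then measure its average cost on the testing data."""
    tree = induce_tree(train, biases)
    return average_cost(tree, test)

def evolve(pop, induce_tree, average_cost, train, test,
           generations=20, mutation_rate=0.1, rng=random):
    """Evolve a population of bias vectors (lists of floats)."""
    for _ in range(generations):
        scored = sorted(pop, key=lambda b: fitness(b, induce_tree,
                                                   average_cost, train, test))
        parents = scored[:max(2, len(pop) // 2)]   # truncation selection
        children = []
        while len(parents) + len(children) < len(pop):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(a))          # one-point crossover
            child = [g + rng.gauss(0, mutation_rate)  # Gaussian mutation
                     for g in a[:cut] + b[cut:]]
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda b: fitness(b, induce_tree,
                                          average_cost, train, test))
```

Note that each fitness evaluation requires inducing and evaluating a full decision tree, which is why the search is expensive but also trivially parallelizable across individuals.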
We analyzed the behavior of ICET in a series of experiments, using five real-world medical
datasets. Three groups of experiments were performed. The first group looked at the
baseline performance of the five algorithms on the five datasets. ICET was found to have significantly lower costs than the other algorithms. Although it executes more slowly, an average
time of 23 minutes (for a typical dataset) is acceptable for many applications, and there
is the possibility of much greater speed on a parallel machine. The second group of experiments
studied the robustness of ICET under a variety of modifications to its input. The
results show that ICET is robust. The third group of experiments examined ICET’s search in
bias space. We discovered that the search could be improved by seeding the initial population
of biases.
In general, our research is concerned with pragmatic constraints on classification problems
(Provost & Buchanan, in press). We believe that many real-world classification problems
involve more than merely maximizing accuracy (Turney, in press). The results
presented here indicate that, in certain applications, a decision tree that merely maximizes
accuracy (e.g., trees generated by C4.5) may be far from the performance that is possible
with an algorithm that considers such realistic constraints as test costs, classification error
costs, conditional test costs, and delayed test results. These are just a few of the pragmatic
constraints that are faced in real-world classification problems.