As Fig. 3.4 shows, the information gain is 0.107012. This is calculated by subtracting the new overall entropy (0.839836) from the old overall entropy (0.946848). Note that all persons are still classified as young. However, we gained information by splitting on attribute smoker. The information gain, i.e., a reduction in entropy, was obtained because we were able to find a group of persons for which there is less variability; most smokers die young. The goal is to maximize the information gain by selecting a particular attribute to split on. Maximizing the information gain corresponds to minimizing the entropy, and hence the heterogeneity, in the leaf nodes. We could also have chosen the attribute drinker first. However, this would have resulted in a smaller information gain.
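To make the calculation concrete, the following sketch shows how entropy and information gain could be computed for a candidate split. The function names and the toy data are illustrative and not taken from Fig. 3.4; the same helpers apply equally to the split on attribute drinker discussed below.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy E = -sum_i p_i * log2(p_i) over the class fractions p_i."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Old overall entropy minus the weighted entropy after splitting on one attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    new_entropy = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - new_entropy

# Toy data (made up, not the data behind Fig. 3.4): class label per person and
# the value of the candidate split attribute "smoker" for the same persons.
died = ["young", "young", "old", "old", "old", "young"]
smoker = ["yes", "yes", "no", "no", "no", "yes"]
print(information_gain(died, smoker))
```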
The lower part of Fig. 3.4 shows what happens if we split the set of nonsmokers based on attribute drinker. This results in two new leaf nodes. The node that corresponds to persons who do not smoke and do not drink has a low entropy value
(E = 0.198234). This can be explained by the fact that indeed most of the people
associated with this leaf node live long; there are only two exceptions to this rule.
The entropy of the other new leaf node (people who drink but do not smoke) is again close to one. However, the overall entropy is clearly reduced. The information gain is 0.076468. Since we abstract from the weight attribute, we cannot further split the leaf node corresponding to people who drink but do not smoke. Moreover, it makes no sense to split the leaf node with smokers because little can be gained as the entropy is already low.
Note that splitting nodes will always reduce the overall entropy. In the extreme case, all the leaf nodes correspond to single individuals (or to individuals having exactly the same attribute values). The overall entropy is then by definition zero. However, the resulting tree is not very useful and probably has little predictive value. It is vital to realize that the decision tree is learned based on examples. For instance, if in the data set no customer ever ordered six muffins, this does not imply that this is impossible. A decision tree is “overfitting” if it depends too much on the particularities of the data used to learn it (see also Sect. 3.6). An overfitting decision tree is overly complex and performs poorly on unseen instances. Therefore, it is important to select the right attributes and to stop splitting when little can be gained.
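One simple way to stop splitting when little can be gained is a minimum-gain threshold. The sketch below grows a tree recursively under that assumption; it reuses the entropy/information_gain helpers from the earlier sketch, and the threshold value is a hypothetical choice rather than a rule prescribed here.

```python
from collections import Counter

def build_tree(instances, labels, attributes, min_gain=0.01):
    """Grow a tree recursively; stop when a node is pure, no attributes remain,
    or no candidate split yields at least `min_gain` information gain.
    Uses information_gain() from the sketch above; instances are dicts."""
    if len(set(labels)) <= 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]       # majority class as leaf
    gains = {a: information_gain(labels, [inst[a] for inst in instances])
             for a in attributes}
    best = max(gains, key=gains.get)
    if gains[best] < min_gain:                            # little can be gained: stop
        return Counter(labels).most_common(1)[0][0]
    subtree = {}
    for value in {inst[best] for inst in instances}:
        idx = [i for i, inst in enumerate(instances) if inst[best] == value]
        subtree[value] = build_tree([instances[i] for i in idx],
                                    [labels[i] for i in idx],
                                    [a for a in attributes if a != best],
                                    min_gain)
    return {best: subtree}
```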
Entropy is just one of several measures that can be used to measure the diversity in a leaf node. Another measure is the Gini index of diversity that measures the
“impurity” of a data set: G = 1 − \sum_{i=1}^{k} (p_i)^2. If all classifications are the same, then G = 0. G approaches 1 as there is more and more diversity. Hence, an approach can
be to select the attribute that maximizes the reduction of the G value (rather than
the E value).
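As a small illustration, a Gini impurity function might look as follows; this is a minimal sketch and the function name is ours:

```python
from collections import Counter

def gini_index(labels):
    """Gini index of diversity: G = 1 - sum over classes of (p_i)^2."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_index(["young"] * 10))               # 0.0: all classifications the same
print(gini_index(["young"] * 5 + ["old"] * 5))  # 0.5: maximal diversity for two classes
```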
See [5, 15, 52, 129] for more information (and pointers to the extensive literature) on the different strategies to build decision trees.
Decision tree learning is unrelated to process discovery; however, it can be used in combination with process mining techniques. For example, process discovery techniques such as the α-algorithm help to locate all decision points in the process (e.g., the XOR/OR-splits discussed in Chap. 2). Subsequently, we can analyze each decision point using decision tree learning. The response variable is the path taken and the attributes are the data elements known at or before the decision point.
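A minimal sketch of such a decision-point analysis is given below, assuming scikit-learn is available; the decision point, attribute names, and case data are made up for illustration and do not come from the text.

```python
# Hypothetical decision-point analysis: the response variable is the branch taken
# at one XOR-split, the features are data attributes known at or before that point.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# One row per case that passed the decision point (illustrative values only).
cases = [
    {"amount": 350.0, "customer_type": "gold"},
    {"amount": 120.0, "customer_type": "silver"},
    {"amount": 900.0, "customer_type": "gold"},
    {"amount": 80.0,  "customer_type": "silver"},
]
branch_taken = ["examine thoroughly", "examine casually",
                "examine thoroughly", "examine casually"]

vectorizer = DictVectorizer(sparse=False)   # one-hot encodes categorical attributes
X = vectorizer.fit_transform(cases)
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, branch_taken)
print(export_text(tree, feature_names=list(vectorizer.get_feature_names_out())))
```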