Next, Fig. 3.4 shows what happens if we split the data set based on the attribute smoker. Now there are two leaf nodes, both bearing the label young. Of the 195 people that smoke, most (184) die young. Hence, the entropy of this leaf node is very small: E = −((184/195) log2(184/195) + (11/195) log2(11/195)) =
0.313027. This means that the variability within this leaf node is much smaller. The other leaf node is more heterogeneous: about half of the 665 non-smokers (362, to be precise) die young. Indeed, its entropy E = −((362/665) log2(362/665) + (303/665) log2(303/665)) = 0.994314 is much higher. However, the overall entropy is still lower than the entropy of the data set before the split (E = 0.9468 for the 546 young and 314 old people in total). The overall entropy can be found by simply taking the average of the leaf entropies, weighted by leaf size, i.e., E = (195/860) × 0.313027 + (665/860) × 0.994314 = 0.839836.
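These numbers are easy to check. The following short Python sketch (an illustration, not part of the original example) recomputes the two leaf entropies, their weighted average, and the entropy of the unsplit data set from the counts given above.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

e_smokers = entropy([184, 11])      # 195 smokers: 184 young, 11 old
e_nonsmokers = entropy([362, 303])  # 665 non-smokers: 362 young, 303 old

# Overall entropy after the split: leaf entropies weighted by leaf size.
e_split = (195 / 860) * e_smokers + (665 / 860) * e_nonsmokers

# Entropy before the split: 546 young and 314 old in the full data set.
e_root = entropy([546, 314])

print(f"smokers:      E = {e_smokers:.6f}")     # 0.313027
print(f"non-smokers:  E = {e_nonsmokers:.6f}")  # 0.994314
print(f"after split:  E = {e_split:.6f}")       # 0.839836
print(f"before split: E = {e_root:.6f}")        # 0.946850
```

The guard `if c > 0` skips empty classes, for which the convention 0 log2 0 = 0 applies.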