Attribute Selection
Now we investigate which subset of attributes produces the best cross-validated
classification accuracy for the IBk algorithm on the glass dataset. Weka contains
automated attribute selection facilities, which are examined in a later section, but it
is instructive to do this manually.
Performing an exhaustive search over all possible subsets of the attributes is
infeasible (why?), so we apply the backward elimination procedure described in
Section 7.1 (page 311). To do this, first consider dropping each attribute individually
from the full dataset (the glass data has nine attributes, so this yields nine candidate
eight-attribute subsets), and run a cross-validation for each reduced version. Once you
have determined the best eight-attribute dataset, repeat the procedure with this
reduced dataset to find the best seven-attribute dataset, and so on.
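The manual procedure above can also be sketched programmatically. The following is a rough Python sketch using scikit-learn rather than Weka: `KNeighborsClassifier` with k = 1 plays the role of IBk, and a synthetic dataset stands in for the glass data, so the resulting subsets and accuracies will not match what you record in Table 17.1.

```python
# Backward elimination by 10-fold cross-validation (sketch).
# Assumptions: scikit-learn's KNeighborsClassifier(n_neighbors=1)
# stands in for Weka's IBk, and make_classification produces a
# synthetic stand-in for the nine-attribute glass dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=9,
                           n_informative=5, random_state=1)
clf = KNeighborsClassifier(n_neighbors=1)

def cv_acc(cols):
    # Mean 10-fold cross-validated accuracy on the given attribute subset.
    return cross_val_score(clf, X[:, cols], y, cv=10).mean()

cols = list(range(X.shape[1]))
results = {}  # size of subset -> (best subset of that size, its accuracy)
while len(cols) > 1:
    # Try dropping each remaining attribute in turn; keep the deletion
    # that gives the highest cross-validated accuracy.
    score, drop = max((cv_acc([c for c in cols if c != d]), d)
                      for d in cols)
    cols.remove(drop)
    results[len(cols)] = (tuple(cols), score)

# Report the best subset found over all iterations.
best_size = max(results, key=lambda k: results[k][1])
print(best_size, results[best_size])
```

Each entry of `results` corresponds to one row of Table 17.1: the best attribute subset of that size and its accuracy. Note that this simple loop always deletes one attribute per iteration; the exercise likewise continues all the way down rather than stopping at the first non-improving step.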
Exercise 17.2.4. Record in Table 17.1 the best attribute set and the greatest
accuracy obtained in each iteration. The best accuracy obtained in this process
is quite a bit higher than the accuracy obtained on the full dataset.
Exercise 17.2.5. Is this best accuracy an unbiased estimate of accuracy on
future data? Be sure to explain your answer. (Hint: To obtain an unbiased
estimate of accuracy on future data, we must not look at the test data at all.)