Not all attributes (i.e., terms) are important when classifying documents, because many words are irrelevant to determining an article's topic. Weka's
AttributeSelectedClassifier, using InfoGainAttributeEval with the Ranker
search method, can eliminate less useful attributes. As before, FilteredClassifier should be
used to transform the data before passing it to AttributeSelectedClassifier.
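The setup just described can be sketched in Weka's Java API as follows. This is a minimal, hedged sketch: the dataset filename is a placeholder, the class attribute is assumed to be the last one, and 100 is just one candidate value for numToSelect.

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class AttributeSelectionSketch {
  public static void main(String[] args) throws Exception {
    // Load a text dataset in ARFF format; "articles.arff" is a placeholder.
    Instances data = DataSource.read("articles.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // Rank attributes by information gain and keep the top 100;
    // vary this value via setNumToSelect for Exercise 17.5.11.
    Ranker ranker = new Ranker();
    ranker.setNumToSelect(100);

    // Attribute selection wrapped around the multinomial Naive Bayes classifier.
    AttributeSelectedClassifier selected = new AttributeSelectedClassifier();
    selected.setEvaluator(new InfoGainAttributeEval());
    selected.setSearch(ranker);
    selected.setClassifier(new NaiveBayesMultinomial());

    // As before, StringToWordVector (default options) converts the string
    // attribute to word counts before the data reaches the classifier.
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector());
    fc.setClassifier(selected);

    fc.buildClassifier(data);
  }
}
```

The same configuration can be assembled in the Explorer by choosing FilteredClassifier, setting its classifier to AttributeSelectedClassifier, and editing the nested components from there.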
Exercise 17.5.11. Experiment with this, using default options for
StringToWordVector and NaiveBayesMultinomial as the classifier. Vary
the number of the most informative attributes that are selected from the
information gain–based ranking by changing the value of the numToSelect
field in the Ranker. Record the AUC values you obtain. How many attributes
give the best AUC for the two datasets discussed before? What are the best
AUC values you managed to obtain?
17.6 MINING ASSOCIATION RULES