Data with String Attributes
The StringToWordVector filter assumes that the document text is stored in an attribute
of type String, that is, a nominal attribute without a prespecified set of values. In the filtered
data, this string attribute is replaced by a fixed set of numeric attributes, and the class attribute
is moved to the beginning, becoming the first attribute.
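The transformation described above can be illustrated with a short Python sketch. This is not Weka's implementation: the tokenizer (runs of letters and digits), the binary word-presence encoding, and the two sample documents are all assumptions made for illustration. It shows only the shape of the result, with one numeric attribute per word and the class attribute first.

```python
import re

def to_word_vectors(docs):
    """Turn (text, label) pairs into rows of [label, word indicators...].

    Simplified illustration only; Weka's StringToWordVector has its own
    tokenizer and many options that are not modeled here.
    """
    # Assumed tokenizer: maximal runs of ASCII letters and digits.
    tokenized = [(re.findall(r"[A-Za-z0-9]+", text), label) for text, label in docs]
    # The vocabulary is fixed once, from all documents, in sorted order.
    vocab = sorted({word for words, _ in tokenized for word in words})
    # The class attribute comes first, followed by one attribute per word.
    header = ["class"] + vocab
    rows = []
    for words, label in tokenized:
        present = set(words)
        rows.append([label] + [1 if word in present else 0 for word in vocab])
    return header, rows

# Hypothetical mini-documents, not taken from Table 17.4.
header, rows = to_word_vectors([("oil prices rise", "yes"),
                                ("rain delays match", "no")])
print(header)
for row in rows:
    print(row)
```

Each row has the same length as the header, which is what makes the filtered data usable by learners that expect a fixed set of attributes.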
To perform document classification, first create an ARFF file with a string attribute
that holds the document’s text. Declare it in the header of the ARFF file using
@attribute document string, where document is the name of the attribute. A nominal
attribute is also needed to hold the document’s classification.
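As a minimal sketch of such a file, the following Python script builds ARFF text with one string attribute and one nominal class attribute. The function names make_arff and quote, the relation name, the attribute name class, the labels yes and no, and the sample documents are assumptions chosen for illustration; only the @attribute document string declaration comes from the text above.

```python
def quote(value):
    """Wrap a string value in single quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

def make_arff(relation, docs, labels):
    """Build ARFF text from (text, label) pairs and the list of class labels."""
    lines = [
        "@relation " + relation,
        "",
        "@attribute document string",                   # holds the document text
        "@attribute class {" + ",".join(labels) + "}",  # nominal class attribute
        "",
        "@data",
    ]
    for text, label in docs:
        lines.append(quote(text) + "," + label)
    return "\n".join(lines) + "\n"

# Hypothetical documents; substitute the labeled mini-documents you want to use.
arff = make_arff(
    "documents",
    [("The price of oil is rising", "yes"), ("It's raining today", "no")],
    ["yes", "no"],
)
print(arff)
```

Writing the returned text to a file with the .arff extension gives a dataset that can be loaded and then passed through StringToWordVector.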
Exercise 17.5.1. Make an ARFF file from the labeled mini-documents in Table
17.4 and run StringToWordVector with default options on this data. How many
attributes are generated? Now change the value of the option minTermFreq to
2. What attributes are generated now?
Exercise 17.5.2. Build a J48 decision tree from the last version of the data you
generated.
Exercise 17.5.3. Classify the new documents in Table 17.5 based on the
decision tree generated from the documents in Table 17.4. To apply the same