Exploring the StringToWordVector Filter
By default, the StringToWordVector filter creates one attribute for each
single-word term and sets its value in the transformed dataset to 1 or 0,
depending on whether the word appears in the document. However, as mentioned
in Section 11.3 (page 439), there are many options:
• outputWordCounts causes actual word counts to be output.
• IDFTransform and TFTransform, when both are set to true, cause term
frequencies to be transformed into TF × IDF values.
• stemmer gives a choice of different word-stemming algorithms.
• useStopList lets you determine whether or not stopwords are deleted.
• tokenizer allows different tokenizers for generating terms, such as one that
produces word n-grams instead of single words.
There are several other useful options. For more information, click on More in the
Generic Object Editor window.
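To make the difference between these representations concrete, the following
minimal sketch computes, for a toy three-document corpus, the default 0/1
presence value, the raw word count, and a TF × IDF weight for each term, and
also shows what word bigrams look like. It is plain Java and does not call
Weka. The class TermWeightingSketch, its method names, the toy documents, and
the simple tokenizer are all hypothetical, and the weighting used,
log(1 + count) × log(numDocs / docFrequency), is an assumed common form of
TF × IDF; the values the filter itself produces may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermWeightingSketch {

    // Lower-case the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Join each run of n consecutive tokens into a single term (word n-grams).
    static List<String> wordNGrams(List<String> tokens, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    // Raw count of each term in one document, sorted by term.
    static Map<String, Integer> counts(List<String> terms) {
        Map<String, Integer> c = new TreeMap<>();
        for (String t : terms) {
            c.merge(t, 1, Integer::sum);
        }
        return c;
    }

    // Number of documents in which the term occurs at least once.
    static int docFrequency(String term, List<List<String>> docs) {
        int df = 0;
        for (List<String> d : docs) {
            if (d.contains(term)) {
                df++;
            }
        }
        return df;
    }

    // Assumed weighting: log(1 + count) * log(numDocs / docFrequency).
    static double tfIdf(int count, int numDocs, int docFreq) {
        if (count == 0 || docFreq == 0) {
            return 0.0;
        }
        return Math.log(1.0 + count) * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // Toy corpus, invented for illustration only.
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenize("the cat sat on the mat"));
        docs.add(tokenize("the dog chased the cat"));
        docs.add(tokenize("a dog barked"));

        for (int d = 0; d < docs.size(); d++) {
            System.out.println("Document " + d);
            for (Map.Entry<String, Integer> e : counts(docs.get(d)).entrySet()) {
                int df = docFrequency(e.getKey(), docs);
                // Every term listed here occurs in the document, so presence is 1.
                System.out.printf("  %-8s presence=1 count=%d tfidf=%.3f%n",
                        e.getKey(), e.getValue(),
                        tfIdf(e.getValue(), docs.size(), df));
            }
        }
        System.out.println("Bigrams of document 0: " + wordNGrams(docs.get(0), 2));
    }
}
```

Under this weighting, a term that occurs in every document gets weight zero
regardless of its count, which is why switching from the default
representation to TF × IDF can change which attributes a classifier relies on.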
Exercise 17.5.10. Experiment with the options that are available. Which options
give a good AUC value for the two datasets above, using NaiveBayesMultinomial
as the classifier?