The spambase data set was analysed by Sean Kelly.
This dataset contains roughly 4000 instances of 58 attributes each, representing
e-mail messages. One attribute is binary, registering whether or not the message
is spam, three are numbers describing the longest, average, and total length
of strings of capital letters in the message, and the other 54 are values describing
the frequency with which certain key words are used in the message. The data
were first evaluated by all ranking methods, but the volume of data made the resulting
charts unreadable. To counter this, the dataset was split into multiple
subsets of 10 attributes each: the spam flag, the three capital string attributes, and 6
word-specific attributes. The smaller datasets were evaluated using GainRatio, ReliefF and
outlier ranking, but the resultant museum was still too large to display, so outlier
ranking was dropped. In each subset, the capital string attributes rank lower than
at least one word-specific attribute. Many word-specific attributes, however, take
the value 0 in the majority of instances in the data file, so box plots of the data
are heavily weighted toward zero.
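
The subset construction described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the analysis: the attribute names are hypothetical placeholders (the text does not give the real column names), and the helper functions split_attributes and zero_fraction are introduced here only for the example. The first builds subsets of 10 attributes (spam flag, three capital string attributes, 6 word-specific attributes); the second measures how often an attribute takes the value 0, which is the property that skews the box plots.

```python
from typing import List, Sequence


def split_attributes(word_attrs: Sequence[str],
                     fixed_attrs: Sequence[str],
                     words_per_subset: int = 6) -> List[List[str]]:
    """Return subsets made of the fixed attributes plus one chunk of word attributes."""
    subsets = []
    for start in range(0, len(word_attrs), words_per_subset):
        chunk = list(word_attrs[start:start + words_per_subset])
        subsets.append(list(fixed_attrs) + chunk)
    return subsets


def zero_fraction(values: Sequence[float]) -> float:
    """Fraction of instances in which an attribute takes the value 0."""
    if not values:
        raise ValueError("values must be non-empty")
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical placeholder names; the real column names are not given in the text.
fixed = ["is_spam", "capital_a", "capital_b", "capital_c"]
words = [f"word_{i}" for i in range(54)]

subsets = split_attributes(words, fixed)
print(len(subsets))     # 9 subsets, since 54 / 6 = 9
print(len(subsets[0]))  # 10 attributes in each subset
```

Under these assumptions, each subset could then be passed separately to the ranking methods, and zero_fraction applied to a column of values would indicate which word-specific attributes are dominated by zeros.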