4. Experiments
4.1. Experimental setup
To compare the performance of the three big data
mining procedures, four large-scale datasets covering different
problem domains are used: the KDD Cup^2 2004 (protein
homology prediction) and 2008 (breast cancer prediction) datasets, and the covertype^3
and person activity^4 datasets. Table 2 lists the basic information
for these four datasets. The first two datasets (i.e., KDD
Cup 2004 and 2008) are two-class classification problems, and
the last two (i.e., covertype and person activity) are multi-class
classification problems.
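
As a rough illustration of the evaluation protocol described in the next paragraph (10-fold cross-validation, with training and testing timed separately), a minimal Python sketch is given below. It is not the authors' implementation. The names k_fold_indices, cross_validate, fit_majority and predict_majority are hypothetical, and the majority-class classifier is only a stand-in so that the sketch runs without external libraries; in the actual experiments an SVM would take its place.

```python
import random
import time
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slicing gives k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return (mean accuracy, total training seconds, total testing seconds)."""
    accuracies, train_time, test_time = [], 0.0, 0.0
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        # Time the training phase on roughly 90% of the data (for k=10).
        start = time.perf_counter()
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        train_time += time.perf_counter() - start

        # Time the testing phase on the held-out 10%.
        start = time.perf_counter()
        predictions = predict(model, [X[i] for i in test_idx])
        test_time += time.perf_counter() - start

        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k, train_time, test_time


# Placeholder classifier (NOT an SVM): always predicts the majority training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, X_test):
    return [model] * len(X_test)


if __name__ == "__main__":
    # Toy data (assumed for illustration only): 70 samples of class 0, 30 of class 1.
    X = [[float(i)] for i in range(100)]
    y = [0] * 70 + [1] * 30
    acc, t_train, t_test = cross_validate(X, y, fit_majority, predict_majority)
    print(f"mean accuracy={acc:.3f}, train={t_train:.6f}s, test={t_test:.6f}s")
```

Passing the classifier in as a pair of fit and predict callables keeps the fold logic independent of the learner, so the same loop could wrap an SVM from any library without changing how accuracy and the two times are accumulated.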
In addition, each dataset is evaluated with 10-fold cross-validation (Kohavi,
1995): in each fold, 90% of the data is used to train the SVM classifier
and the remaining 10% is used to test it. The
classification accuracy of the SVM and the times for training and