In this section, the multi-label Naïve Bayes classifier is used to detect engineering students' problems in the Purdue data set. The Purdue tweet collection contains 35,598 unique tweets. We took a random sample of 1,000 tweets and found that no more than 5 percent of them discussed engineering problems. Our purpose here was to detect the small number of tweets that reflect engineering
students’ problems. The difference between the #engineeringProblems data set and the Purdue data set is that the latter contains a much smaller proportion of positive samples to be detected, and its “others” category has more diverse content. Therefore, to better adapt the training set to the
Purdue data set, we took a random sample of 5,000 tweets from the Purdue data set, added them to the 2,785 #engineeringProblems tweets, and labeled them as “others”. The fewer than 5 percent positive samples mixed into this category do not noticeably affect the effectiveness of the trained model. We thus used the 7,785 tweets as input to train the multi-label Naïve Bayes classifier. Since no extra human effort is needed, and the Naïve Bayes classifier is very efficient in terms of computation time, training the model here incurred almost no extra cost. Table 5 shows the most probable words in each category, ranked by the conditional probability p(w|c) as in (2).
Since our purpose is to detect the small number of tweets reflecting the five problems in the large Purdue data set, we do not discuss the “others” category in this section.