3.3. Data Annotation
After defining the coding schema, we randomly sample a subset of 5000 tweets and manually annotate them by theme. During this initial annotation, we notice that most tweets fall into the "others" category, while some categories contain only a very small number of tweets. To ensure that enough tweets are available to build a classification model for the predefined categories, more tweets from each category need to be included in the sampling sets used for the subsequent model training and validation. We therefore develop an automatic program that uses a simple text-matching approach to assign the remaining tweets to themes: a tweet is attributed to a category if it contains keywords associated with that category in Table 1. We then inspect the tweets in each of these initial categories, except for the "others" category, annotate those whose true categories we are confident of, and add them to our sampling sets. To reduce the influence of duplicated tweets on the classifier, all retweets are discarded. In the end, 8807 tweets are included to train and test the multi-label classifier presented in the following section.
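The keyword-matching step described above can be illustrated with a minimal sketch. It is not the program used in this study: the category names and keywords are placeholders standing in for Table 1, the function names are hypothetical, and treating any tweet that starts with "RT @" as a retweet is an assumption about how retweets are marked in the data.

```python
import re

# Placeholder categories and keywords; the actual lists are those in Table 1.
CATEGORY_KEYWORDS = {
    "theme_a": ["alpha", "beta"],
    "theme_b": ["gamma delta"],
}


def is_retweet(text):
    """Assumption: a retweet is marked by a leading 'RT @'."""
    return text.lstrip().startswith("RT @")


def match_categories(text, category_keywords=CATEGORY_KEYWORDS):
    """Return every category whose keywords occur in the tweet as whole words.

    A tweet may match several categories (multi-label); a tweet matching
    none of them is assigned to "others".
    """
    lowered = text.lower()
    matched = []
    for category, keywords in category_keywords.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
            if re.search(pattern, lowered):
                matched.append(category)
                break  # one keyword hit is enough for this category
    return matched or ["others"]


def preannotate(tweets, category_keywords=CATEGORY_KEYWORDS):
    """Drop retweets, then pair each remaining tweet with its candidate categories."""
    return [
        (tweet, match_categories(tweet, category_keywords))
        for tweet in tweets
        if not is_retweet(tweet)
    ]
```

The labels produced this way are only candidates. As described above, tweets in each keyword-matched category are still checked by hand before being added to the sampling sets.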
