However, these measures apply only to data that belong to mutually exclusive categories (single-label classification). Because we were dealing with a multi-label classification problem with non-mutually exclusive categories, these measures were inapplicable to our study. We therefore used the F1 measure [44], which is the harmonic mean of precision and recall computed between two sets of data. The F1 score is 1 when the two sets are identical and 0 when they have nothing in common. Here it represents how closely the label sets assigned to a tweet by two researchers agree.
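
As an illustration, the per-tweet computation can be sketched as follows. This is a minimal sketch and not necessarily the exact procedure used in the study: it assumes that one researcher's label set is treated as the reference and the other's as the prediction (the result is the same if the roles are swapped), and that two empty label sets count as full agreement. The function name and the example labels are hypothetical.

```python
def f1_between_label_sets(labels_a, labels_b):
    """F1 agreement between two label sets assigned to the same item."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        # Assumption: two empty label sets are treated as full agreement.
        return 1.0
    overlap = len(a & b)
    if overlap == 0:
        # No shared labels: precision and recall are both 0.
        return 0.0
    precision = overlap / len(b)  # share of b's labels also chosen by a
    recall = overlap / len(a)     # share of a's labels also chosen by b
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels assigned to one tweet by two researchers.
researcher_1 = {"complaint", "question"}
researcher_2 = {"question", "praise"}
score = f1_between_label_sets(researcher_1, researcher_2)
```

In this example the two researchers share one of their two labels each, so precision and recall are both 0.5 and the F1 score is 0.5. An overall agreement figure could then be obtained by averaging the per-tweet scores, although the averaging scheme is likewise an assumption here.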