We calculated the F1 scores between the label sets given by any two researchers to a tweet, and then averaged over all the tweets to represent the agreement. Assume there are a total number of N tweets categorized by two researchers. For the ith tweet, x1i represents the number of labels given to this tweet by researcher A, x2i represents the number of labels given to this tweet by researcher B, and si represents the number of labels that are common between researcher A and researcher B (the agreed number of labels). Let p1i ¼ si=x1i, and p2i ¼ si=x2i, then