Crawling and Dataset
To build a training dataset, we collected 6 million public
English-language tweets using the Twitter Search API1
from November 19th, 2012 to December 19th, 2012 within a
15 km radius of the city centers of Seattle, WA and
Memphis, TN. For labeling the tweets, we first extracted
tweets containing incident-related keywords and hyponyms
of these keywords. The hyponyms were extracted using WordNet2.
We defined four classes in our training set: “car crash”,
“fire”, “shooting”, and “not incident related”. We randomly
selected 20,000 tweets from the initial set, which were manually
labeled by researchers from our departments. The final
training set consists of 875 tweets: 213 car-accident-related
tweets, 212 fire-incident-related tweets, 231 shooting-incident-related
tweets, and 219 non-incident-related tweets.
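
The keyword-based pre-filtering step described above can be illustrated with a minimal sketch. The seed keywords, the hyponym entries, and all function names below are assumptions made for illustration; the actual keyword lists are not given here. To keep the example self-contained, a small hand-written dictionary stands in for WordNet. With NLTK installed, a comparable lookup could be built from `wordnet.synsets(word)` and `Synset.hyponyms()`.

```python
import re
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical seed keywords per incident class (not taken from the paper).
SEED_KEYWORDS: Dict[str, List[str]] = {
    "car crash": ["crash", "accident"],
    "fire": ["fire"],
    "shooting": ["shooting"],
}

# Toy stand-in for a WordNet hyponym lookup (illustrative entries only).
TOY_HYPONYMS: Dict[str, List[str]] = {
    "crash": ["collision", "wreck"],
    "fire": ["blaze", "wildfire"],
    "shooting": ["gunfire"],
}


def toy_hyponyms(word: str) -> List[str]:
    """Return the hyponyms of `word` from the toy dictionary."""
    return TOY_HYPONYMS.get(word, [])


def expand_keywords(seeds: Iterable[str],
                    hyponyms_of: Callable[[str], List[str]]) -> Set[str]:
    """Return the seed keywords together with their hyponyms, lowercased."""
    expanded: Set[str] = set()
    for seed in seeds:
        expanded.add(seed.lower())
        expanded.update(h.lower() for h in hyponyms_of(seed))
    return expanded


def build_matcher(keywords: Iterable[str]) -> "re.Pattern[str]":
    """Compile a case-insensitive whole-word pattern over all keywords."""
    # Longest alternatives first so that longer terms are preferred.
    alternatives = sorted((re.escape(k) for k in keywords),
                          key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b",
                      re.IGNORECASE)


def candidate_tweets(tweets: Iterable[str],
                     matcher: "re.Pattern[str]") -> List[str]:
    """Keep only the tweets that mention at least one keyword."""
    return [t for t in tweets if matcher.search(t)]


if __name__ == "__main__":
    seeds = [w for words in SEED_KEYWORDS.values() for w in words]
    matcher = build_matcher(expand_keywords(seeds, toy_hyponyms))
    # Made-up example texts, not tweets from the dataset.
    examples = [
        "Huge blaze downtown",
        "Coffee was great today",
        "Two-car collision on I-5",
    ]
    for tweet in candidate_tweets(examples, matcher):
        print(tweet)
```

Matching on whole words keeps a keyword such as "fire" from selecting unrelated tokens like "firefighter". A filter of this kind only produces candidates; assigning the final class still requires the manual labeling step.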