B. Data Cleaning
We downloaded the tweets as raw text files and
cleaned them by performing the following steps:
• Removed the word “RT” (denotes re-tweet)
• Removed usernames, written as “@username”
• Removed all URLs
• Removed the following punctuation marks: , " ' ? ! ; : #
$ % & ( ) * + - / < > = [ ] ^ _ { } | ~
• Removed all dates and timestamps of tweets from the
tweet content, but stored them in a separate record
associated with the tweet.
• Replaced double and triple spaces between words
with a single space.
• Removed stop words, although this step later proved
unnecessary when constructing the attribute matrix
(detailed in the next section).
After executing these steps, we obtained a normalized
version of most tweets. For instance, the tweet “RT
@NadeemfParacha: A defiant press conference in Karachi by
ANP, MQM and PPP. Tue Apr 30 19:39:16 +0000 2013” was
transformed to “A defiant press conference in Karachi by ANP
MQM and PPP” after cleaning.
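The cleaning steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the function name clean_tweet, the regular expressions, and the small stop-word list are all hypothetical. The full stop is added to the punctuation set as an assumption, because the example above shows it being removed although it does not appear in the listed marks. Stop-word removal is optional and off by default, since that step later proved unnecessary.

```python
import re

# Twitter-style timestamp, e.g. "Tue Apr 30 19:39:16 +0000 2013" (assumed format).
TIMESTAMP_RE = re.compile(
    r"\b[A-Z][a-z]{2} [A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2} [+-]\d{4} \d{4}\b"
)
RETWEET_RE = re.compile(r"\bRT\b")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USERNAME_RE = re.compile(r"@\w+")

# Punctuation marks listed in the text, plus "." (an assumption: the worked
# example drops the full stop even though the list does not include it).
PUNCTUATION = ',"\'?!;:#$%&()*+-/<>=[]^_{}|~' + "."
PUNCT_TABLE = str.maketrans("", "", PUNCTUATION)

# Tiny illustrative stop-word list (hypothetical; the actual list is not given).
STOP_WORDS = frozenset({"a", "an", "the", "and", "by", "in", "of", "to"})


def clean_tweet(raw, remove_stop_words=False):
    """Return (cleaned_text, timestamp_or_None) for one raw tweet."""
    # Extract the timestamp first so it can be stored intact in a separate field.
    match = TIMESTAMP_RE.search(raw)
    timestamp = match.group(0) if match else None
    text = TIMESTAMP_RE.sub(" ", raw)

    # URLs and usernames are removed before punctuation, because they
    # contain characters such as ":", "/" and "_" that the next step deletes.
    text = RETWEET_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = USERNAME_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)

    # Splitting on whitespace and re-joining collapses repeated spaces.
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return " ".join(tokens), timestamp
```

Applied to the example tweet above, clean_tweet returns the cleaned text “A defiant press conference in Karachi by ANP MQM and PPP” together with the timestamp “Tue Apr 30 19:39:16 +0000 2013” as a separate value.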
IV. EXPERIMENTAL METHODOLOGY
As mentioned in Section I, our goal is to predict the
winning political party of the Pakistan 2013 elections. For
this, we train predictive models for each party and test them to
predict the winning party. Our prediction (classification)
labels are Pro and Anti; Pro represents a positive sentiment
favoring the party and Anti represents a negative one. We
constructed three separate attribute matrices for each political
party, and used them to construct predictive models from
tweets posted between 1st January 2013 and 7th May 2013.
Specifically, we manually reviewed the tweets and extracted, for
each party, those containing attributes that represent either a
Pro or Anti opinion. We rejected tweets with a neutral