A Retrospective and Predictive Analysis on U.S. Midterm Elections on Twitter with Recurrent Neural Networks

Marco Brambilla, Feb 20

Despite the skeptics and detractors of a few years ago, I guess that in recent months everyone has perceived the impact that social media can have on public debate, spanning social issues, policymaking, politics, and elections.

However, this is often experienced only through news outlets, which report highly controversial claims of causality or correlation involving economic and political lobbies that aim at specific political impacts.

Sometimes these reports are based on objective analysis, other times just on witness accounts or even on perceived outcomes, so the claims remain hard to demonstrate.

On the other hand, technical analyses that run large-scale studies on the matter are popping up in the data science, computer science, and web science communities.

These, however, are frequently very general and do not deliver actual insights into what’s happening.

What I want to report here instead is a simple data science exercise in the political context, but with a very focused purpose, aiming at responding to non-obvious questions.

We implemented an analysis (meaning both a method and a system) that aims to gauge local support for the two major US political parties in the 68 most competitive House of Representatives districts during the 2018 midterm elections.

SPOILER: The whole analysis implementation and results are available.

See below for materials and links.

Objectives

The analysis attempts to mirror the “Generic Ballot” poll, i.e., a survey of voters of a particular district which aims to measure local popularity of national parties by querying participants on the likelihood they would vote for a “generic” Democrat or Republican candidate.

We collect tweets that mention national parties and politicians and are posted from the 68 most competitive districts.

By most competitive we mean districts rated as toss up, 50%-50%, or lean by the Cook Political Report.

This means we are addressing an extremely challenging analysis and prediction problem, while disregarding the simpler cases (everyone is good at predicting the obvious!).

Aside from this Generic Ballot model, we attempted to capture the “On Ballot” model, aiming at the popularity of local candidates, i.e., the ones actually on the ballot running for Congress.

However, this model featured clear bias and could not be validated before the election in the same way the Generic Ballot model could.

So, we report only the first analysis.

Data Collection

The methodology for querying and extracting tweets is sometimes overlooked in this kind of analysis.

However, the querying process may itself introduce biases that are difficult to detect in later stages, so it requires careful design.

Our solution employs the Twitter Search API to query for tweets mentioning a national leader or party, posted from a limited geographic region (i.e., each specific congressional district).

For example, the following query extracts tweets on Republicans:

TRUMP OR REPS OR Republicans OR Republican OR MCCCONNELL OR ‘MIKE PENCE’ OR ‘PAUL RYAN’ OR #Republicans OR #REPS OR @realDonaldTrump OR @SpeakerRyan OR @senatemajldr OR @VP OR GOP OR @POTUS

To limit the search to each congressional district, we use the geocode field in the search query of the API, which queries a circular area based on the coordinates of the center and the radius.

Because of the irregular shape of the congressional districts, multiple queries are needed for each of them. We therefore built a custom set of bubbles that approximates each district’s shape (see figure for District PA-10 and its approximation).

Other approaches that do not limit the search to geo-tagged tweets have been proposed by others, but they are not precise enough for our purpose.
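The bubble-based querying described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the `Bubble` type, the `build_search_params` helper, the shortened query, and the coordinates are all hypothetical, and the only detail taken from the Twitter Search API is the `geocode` value format of latitude, longitude, and radius.

```python
from typing import NamedTuple


class Bubble(NamedTuple):
    """One circular search area: center coordinates and radius in km."""
    lat: float
    lon: float
    radius_km: float


# Shortened version of the Republican query quoted in the post.
REPUBLICAN_QUERY = "TRUMP OR Republicans OR Republican OR GOP OR @realDonaldTrump"

# Hypothetical bubbles for illustration; NOT the real approximation of any district.
EXAMPLE_BUBBLES = [
    Bubble(40.27, -76.88, 15.0),
    Bubble(40.05, -76.95, 12.5),
]


def build_search_params(query: str, bubbles: list[Bubble]) -> list[dict[str, str]]:
    """Return one search parameter set per bubble.

    The geocode value follows the "latitude,longitude,radius" format,
    with the radius expressed in km.
    """
    return [
        {"q": query, "geocode": f"{b.lat},{b.lon},{b.radius_km}km"}
        for b in bubbles
    ]


for params in build_search_params(REPUBLICAN_QUERY, EXAMPLE_BUBBLES):
    print(params)
```

Since bubbles that approximate an irregular shape will usually overlap, the same tweet may be returned by more than one query, so a real pipeline would probably need to deduplicate results by tweet ID.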

Analysis

For the analysis of the tweets, we adopted a Recurrent Neural Network, namely an RNN-LSTM binary classifier trained on tweets.

This architecture has proved accurate in a number of NLP tasks, including sentiment analysis, thanks to its ability to capture dependencies over sequences of words.

To build training and testing data we collected tweets of users with clear political affiliation, including candidates, political activists, and also lesser-known users well versed in the political vernacular.

The accounts selected yielded around 280,000 tweets in 6 months before election day, labeled based on the author’s political affiliation.

 After labeling, pre-processing, and tokenization, we keep the 15,000 most frequent words to be used for further analysis.
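As a minimal sketch of the vocabulary step, assuming tweets are already tokenized into lists of words: the helper names `build_vocabulary` and `encode`, and the choice of index 0 for out-of-vocabulary words, are assumptions made for illustration, not details from the project.

```python
from collections import Counter

VOCAB_SIZE = 15_000  # number of most frequent words kept, as stated in the post
UNK_ID = 0           # assumption: index reserved for out-of-vocabulary words


def build_vocabulary(tokenized_tweets, max_words=VOCAB_SIZE):
    """Map the max_words most frequent tokens to integer ids starting at 1."""
    counts = Counter(token for tweet in tokenized_tweets for token in tweet)
    most_frequent = counts.most_common(max_words)
    return {word: idx for idx, (word, _count) in enumerate(most_frequent, start=1)}


def encode(tweet, vocab):
    """Replace each token with its id, or UNK_ID if it is not in the vocabulary."""
    return [vocab.get(token, UNK_ID) for token in tweet]


# Toy corpus with a tiny vocabulary limit, just to show the mechanics.
tweets = [["vote", "blue", "vote"], ["vote", "red"], ["blue", "wave"]]
vocab = build_vocabulary(tweets, max_words=2)
print(vocab)
print(encode(["vote", "red", "blue"], vocab))
```

With the limit set to 2, only "vote" and "blue" receive their own ids, and every other word collapses to the out-of-vocabulary index.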

The processed data trains a 128-wide Word2Vec embedding layer to learn abstract representations of the tokens.

The results are fed to an RNN composed of two layers, whose dimensions after hyperparameter optimization are 256 and 64, respectively.
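To get a feeling for the size of such a network, here is a rough back-of-the-envelope parameter count. It assumes standard LSTM cells (four gates, each with input weights, recurrent weights, and a bias), reads 256 and 64 as the number of units per layer, and assumes a single sigmoid output unit for the binary decision; none of these details are confirmed by the post.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Trainable weights of a standard LSTM layer.

    Each of the 4 gates has an input matrix (units x input_dim),
    a recurrent matrix (units x units), and a bias vector (units).
    """
    return 4 * (units * (input_dim + units) + units)


VOCAB_SIZE = 15_000     # most frequent words kept
EMBEDDING_DIM = 128     # width of the embedding layer
LSTM_UNITS = (256, 64)  # assumed to be the units of the two recurrent layers

embedding = VOCAB_SIZE * EMBEDDING_DIM
first_lstm = lstm_params(EMBEDDING_DIM, LSTM_UNITS[0])
second_lstm = lstm_params(LSTM_UNITS[0], LSTM_UNITS[1])
output = LSTM_UNITS[1] + 1  # assumed single sigmoid unit: weights plus bias

total = embedding + first_lstm + second_lstm + output
print(f"embedding:   {embedding:>9,}")
print(f"first LSTM:  {first_lstm:>9,}")
print(f"second LSTM: {second_lstm:>9,}")
print(f"output:      {output:>9,}")
print(f"total:       {total:>9,}")
```

Under these assumptions the model has roughly 2.4 million parameters, most of them in the embedding layer.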

A number of different techniques are used to tune the network, including dropout and the use of a second layer, which helps with abstraction.

Validation

We first validated our method before the elections on a set of 10 districts for which traditional polls gave confident predictions (a lead of 10% or more for one party).

Over this set, our method always predicted the correct outcome, i.e., it obtained 100% success.

Overall, our tweet classifier reached 83% accuracy on a random validation set.

After training, a particularly challenging testing set was gathered during the week of the Kavanaugh hearings; despite the peculiar content, the model still classified over 72% of the tweets correctly.

By imposing a confidence threshold of 75%, the accuracy grew to around 80%.
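One plausible reading of the confidence threshold is sketched below: a tweet is kept only if the probability of its predicted class reaches the threshold, and accuracy is computed on the kept tweets alone. The function name, the interpretation of "confidence", and the toy numbers are assumptions for illustration, not the project's data or code.

```python
def accuracy_above_confidence(probabilities, labels, threshold=0.75):
    """Accuracy restricted to confident predictions.

    probabilities: predicted probability of class 1 for each tweet.
    labels: true class (0 or 1) for each tweet.
    Returns (accuracy on kept tweets, fraction of tweets kept);
    accuracy is None when no prediction reaches the threshold.
    """
    kept = 0
    correct = 0
    for p, y in zip(probabilities, labels):
        confidence = max(p, 1.0 - p)  # probability of the predicted class
        if confidence < threshold:
            continue  # discard low-confidence predictions
        kept += 1
        predicted = 1 if p >= 0.5 else 0
        correct += int(predicted == y)
    if kept == 0:
        return None, 0.0
    return correct / kept, kept / len(probabilities)


# Toy values only, to show the mechanics.
probs = [0.95, 0.60, 0.10, 0.80, 0.45, 0.20]
truth = [1, 0, 0, 0, 1, 0]
accuracy, coverage = accuracy_above_confidence(probs, truth)
print(accuracy, coverage)
```

The trade-off is that accuracy is measured on fewer tweets, so the fraction of tweets kept is worth reporting alongside it.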

Finally, at the district prediction level, our solution classified more than 60% of the 68 most competitive districts correctly, recognizing very close races in 10% of the others.
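The post does not spell out how tweet-level classifications are turned into a district-level call, so the following is only a hypothetical aggregation rule: compare the shares of tweets classified as Democratic and Republican, and flag the district as a close race when the margin falls under an arbitrary threshold. The function name, the label encoding, and the 5% margin are all assumptions.

```python
from collections import Counter

CLOSE_RACE_MARGIN = 0.05  # hypothetical threshold, not taken from the post


def predict_district(tweet_labels, close_margin=CLOSE_RACE_MARGIN):
    """Turn per-tweet labels ("D" or "R") into a district-level call."""
    counts = Counter(tweet_labels)
    dem, rep = counts["D"], counts["R"]
    total = dem + rep
    if total == 0:
        return "no data"
    margin = (dem - rep) / total  # positive means a Democratic lead
    if abs(margin) < close_margin:
        return "close race"
    return "D" if margin > 0 else "R"


print(predict_district(["D"] * 60 + ["R"] * 40))
print(predict_district(["D"] * 51 + ["R"] * 49))
print(predict_district(["D"] * 30 + ["R"] * 70))
```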

This post is based on a short scientific paper presented at the IEEE Big Data Conference in Seattle, WA in December 2018 and on a previous Medium post by Antonio Lopardo.

Further details can be found in the paper.

In case you want to cite the work, you can do it in this way:

A. Lopardo and M. Brambilla, “Analyzing and Predicting the US Midterm Elections on Twitter with Recurrent Neural Networks,” 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 2018, pp. 5389–5391. doi: 10.1109/BigData.2018.8622441. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8622441&isnumber=8621858

The online running prototype, the full description of the project, its results, and source code are available at http://www.twitterpoliticalsentiment.com/USA/.

Notice that the method is a general-purpose, language-independent political analysis framework that can be applied to any national or local context.
