Stock Prediction Using Twitter

Ever wondered what you could do if you could predict the stock market? A lot :)

Khan Saad Bin Hasan, Jan 3

Many economists have argued that the stock market is random because it is driven by unpredictable events; this view is reflected in the Efficient Market Hypothesis and Random Walk Theory.

But is it really? Researchers have put this to the test, trying to predict the stock market to show that it is possible to get a sense of where the market will go, and they seem to have made their point with some accuracy.

One of the landmark papers on this topic was by Bollen et al. [1]. In this blog I will try to explain in simple terms how they did it.

Bollen used public opinion (taken from tweets) and showed that there was a correlation between the public mood expressed on Twitter and the way the stock market performs.

Before understanding in detail what Bollen did and how he did it, we need to understand a few things:

1. What is the stock market and how does it work?
2. How do we find the mood expressed in tweets?
3. How do we find or show a correlation between two time series?
4. Given that two time series are related, how do we predict the future values of one from the other?

What is a Stock Market?

A company is a large entity.

Most companies are not owned by a single person or even a single organization; ownership is shared among a large number of people, called shareholders.

Each of them owns a part of the company called a stock, so they are also called stockholders.

A stock market is a place where you can buy or sell stocks of a company.

More formally, a stock can be defined as:

The stock of a corporation is all of the shares into which ownership of the corporation is divided. In American English, the shares are commonly called stocks. A single share of the stock represents fractional ownership of the corporation in proportion to the total number of shares. This typically entitles the stockholder to that fraction of the company’s earnings, proceeds from liquidation of assets (after discharge of all senior claims such as secured and unsecured debt), or voting power, often dividing these up in proportion to the amount of money each stockholder has invested.

-Wikipedia

And a stock market can be defined as:

A stock market is the aggregation of buyers and sellers (a loose network of economic transactions, not a physical facility or entity) of stocks (also called shares), which represent ownership claims on businesses; these may include securities listed on a public stock exchange, as well as stock that is only traded privately.

-Wikipedia

If you are interested in knowing more, please refer to this link.

Now, there are a lot of stock markets, but our focus will be the New York Stock Exchange.

Why? Because it is the largest stock market and most of the research has focused on it.

Now there is a term we need to understand before proceeding and it is “The Dow Jones Industrial Average” or simply “The Dow”.

The Dow Jones Industrial Average (DJIA) is a price-weighted average of 30 significant stocks traded on the New York Stock Exchange (NYSE) and the Nasdaq.

The DJIA was invented by Charles Dow in 1896.

Often referred to as “the Dow,” the DJIA is one of the oldest, single most-watched indices in the world and includes companies such as the General Electric Company, the Walt Disney Company, Exxon Mobil Corporation and Microsoft Corporation.

When the TV networks say “the market is up today,” they are generally referring to the Dow.

-Investopedia

So the Dow Jones (or DJIA) gives us a good idea of whether the stock market closed high or low. So what does the DJIA measure exactly? It is a price-weighted average of the stock prices of 30 large companies: stocks with a higher share price carry more weight, and the result is normalized (through a divisor) so that one-time events such as stock splits do not distort it.

Hence it is just a price in itself.
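
To make the idea of a price-weighted average concrete, here is a minimal sketch. The tickers, prices and divisor are made up for illustration, and the divisor handling is a simplified picture of how such an index is kept continuous across a stock split, not the official DJIA methodology.

```python
def price_weighted_index(prices, divisor):
    """Sum of the component share prices divided by the index divisor."""
    return sum(prices.values()) / divisor


# Hypothetical three-stock index (made-up tickers and prices).
prices = {"AAA": 300.0, "BBB": 60.0, "CCC": 40.0}
divisor = 3.0
index_before = price_weighted_index(prices, divisor)

# A 10% rise in the high-priced stock moves the index more than
# a 10% rise in the low-priced one.
up_aaa = price_weighted_index({**prices, "AAA": 330.0}, divisor)
up_ccc = price_weighted_index({**prices, "CCC": 44.0}, divisor)

# After a 2-for-1 split in AAA the share price halves; the divisor is
# rescaled so that the index value itself does not jump.
split_prices = {**prices, "AAA": 150.0}
new_divisor = sum(split_prices.values()) / index_before
index_after = price_weighted_index(split_prices, new_divisor)

print(index_before, up_aaa, up_ccc, index_after)
```

Note that the weight of a stock here depends only on its share price, not on the size of the company.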

So with DJIA, we now have a reliable way to see how the market fared on a day.

What we need now is a way to mine public opinion, and we turn to Twitter for that.

Twitter Mood Analysis

Can machines understand emotions? Not yet, at least not perfectly: machines can now recognize a wide range of emotions, but not reliably, and they are easily fooled or confused (this is a topic for another blog post).

With machine learning algorithms it is possible to gauge the mood (or sentiment) expressed in a piece of text, but the accuracy leaves a lot to be desired.

If the sentiment is expressed explicitly and the text is not twisted, however, we can build a reliable sentiment analysis model.

This is the idea behind the tools used by Bollen to get the mood from tweets.

Sentiment analysis is a very important application of machine learning, so it is no wonder that many different algorithms (and by many I mean a lot) have been applied to get sentiment from text. Let's take one of the easiest and most intuitive ones.

Let's consider the text: “I am pretty impressed by Elon Musk’s personality and his philosophy towards life no wonder Tesla and spaceX have been such great endeavors”.

Our algorithm will look only at the important words like “pretty”, “impressed”, etc. (and not at words like “I”, “am”, etc.; our algorithm also may not know “spaceX”, “Elon” or “Musk”, so it will probably just ignore them).

Now, since the algorithm has previously seen what positive text looks like (while training), it has already figured out that words like “pretty”, “impressed” and “great” are mostly associated with positive emotions.

Hence it is likely to label the text as positive.

This is a very simple approach, and it is similar to what TextBlob (a tool for text analysis) does. TextBlob offers a Naive Bayes analyzer alongside its default lexicon-based one; Naive Bayes is a very simple algorithm that gives great results considering its simplicity. The idea is that each word is given a score based on the type of documents it is associated with. A word that occurs more often in positive documents than in negative ones will likely have a larger (more positive) score, so a document in which it appears is more likely to be positive.

So each word is given its score and then these are averaged to get the sentiment of the document.
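
The word-score averaging idea can be sketched in a few lines. This is a toy illustration only: the word scores below are invented by hand, whereas a real model would learn them from labelled text, and this is not TextBlob's actual implementation.

```python
import re

# Hypothetical word scores in [-1, 1]; in practice these would be
# learned from documents labelled as positive or negative.
WORD_SCORES = {
    "pretty": 0.3,
    "impressed": 0.8,
    "great": 0.9,
    "terrible": -0.9,
    "boring": -0.6,
}


def sentiment(text):
    """Average the scores of the known words; unknown words are ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [WORD_SCORES[w] for w in words if w in WORD_SCORES]
    if not scores:
        return 0.0  # no known words: treat the text as neutral
    return sum(scores) / len(scores)


example = (
    "I am pretty impressed by Elon Musk's personality and his philosophy "
    "towards life no wonder Tesla and spaceX have been such great endeavors"
)
score = sentiment(example)
label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
print(score, label)
```

Only “pretty”, “impressed” and “great” are found in the toy lexicon, so the example sentence comes out as positive.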

Needless to say, this approach is not ideal and gives poor results on complex documents.

Hence most researchers use more sophisticated classifiers (like SVMs) to build sentiment analysis models.

Bollen used two tools for opinion mining: one was OpinionFinder and the other was the Google-Profile of Mood States (GPOMS).

I haven’t seen many people use OpinionFinder these days, and it is not really important for this blog either, so we will leave it out and focus on GPOMS.

GPOMS is a tool that can help you detect the mood expressed in a piece of text with good accuracy.

It is based on the Profile of Mood States (POMS) questionnaire, which consists of 65 or 37 questions depending on the version you choose.

For each question you indicate how you are feeling on a scale: “Not at all”, “A little”, “Moderately”, “Quite a Lot”, “Extremely”.

So, for example, for the item “regretful” you pick one of the above responses; it is converted to a score using a standard scoring key, and your mood is calculated from your responses.

Here is a link to the test.

So how does GPOMS use POMS to predict mood from text? Here comes the Google connection.

Bollen used one of the datasets released by Google.

The dataset consists of frequency counts for n-grams extracted from 1 trillion words of English Web text.

You can try it here.

So what Bollen did is this: he associated each term in the POMS questionnaire with n-grams in the Google n-gram dataset, then split the most frequently co-occurring n-grams into tokens. Each of these words now has a mood associated with it and can be given a weighted score (based on co-occurrence), and depending on how these words occur in a piece of text, the text can be labelled with the corresponding mood.

The above is my understanding of what Bollen explains in his paper as:

The enlarged lexicon of 964 terms thus allows GPOMS to capture a much wider variety of naturally occurring mood terms in Tweets and map them to their respective POMS mood dimensions.

We match the terms used in each tweet against this lexicon.

Each tweet term that matches an n-gram term is mapped back to its original POMS terms (in accordance with its co-occurence weight) and via the POMS scoring table to its respective POMS dimension.

The score of each POMS mood dimension is thus determined as the weighted sum of the co-occurence weights of each tweet term that matched the GPOMS lexicon.
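
The scoring step described in the quote, a weighted sum per mood dimension over the matched terms, can be sketched as follows. The lexicon entries and weights are hypothetical placeholders; the real GPOMS lexicon of 964 terms is not public, so this only illustrates the mechanics.

```python
from collections import defaultdict

# Hypothetical lexicon: tweet term -> (mood dimension, co-occurrence weight).
LEXICON = {
    "relaxed": ("Calm", 0.9),
    "peaceful": ("Calm", 0.7),
    "happy": ("Happy", 0.8),
    "focused": ("Alert", 0.6),
}


def mood_scores(tweet):
    """Sum the weights of the matched terms for each mood dimension."""
    scores = defaultdict(float)
    for token in tweet.lower().split():
        token = token.strip(".,!?")  # drop trailing punctuation
        if token in LEXICON:
            dimension, weight = LEXICON[token]
            scores[dimension] += weight
    return dict(scores)


scores = mood_scores("Feeling relaxed and peaceful today, so happy!")
print(scores)
```

Aggregating such per-tweet scores over all tweets of a day would give one value per mood dimension per day, which is the kind of time series used in the rest of this post.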

Unfortunately, GPOMS is no longer publicly available; it is now a closed-source tool.

Goel and Mittal [2] built a similar (though less accurate) model with a much simpler approach: they used synonyms of the words occurring in the POMS questionnaire and mapped them to the text.

Using GPOMS and OpinionFinder, Bollen measured the public mood, and here are the results:

[Figure: Results from GPOMS and OpinionFinder]

As you can see, two events stand out on the graph: the 2008 Presidential Election and Thanksgiving. Bollen seems to have chosen this period for exactly this reason: people’s mood on these occasions is easy to anticipate, so the period can be used to check that the model really does gauge the public’s mood. The bump should be significant in both the public’s mood and the stock market, and hence easy to observe.

We are now done with two of the four parts, we know how to gauge the mood of the public and the mood of the market.

Now we can move on to prove that these two are indeed correlated.

Correlation Between Time Series

As shown in the above figure, we have obtained the time series of the people’s mood; a similar time series for the DJIA can easily be obtained.

Now we need a way to show that these are correlated, or that the people’s mood (mined from Twitter) helps predict changes in the stock market.

We use Granger Causality to do this.

So what exactly is Granger causality? According to this Scholarpedia article:

Granger causality is a statistical concept of causality that is based on prediction.

According to Granger causality, if a signal X1 “Granger-causes” (or “G-causes”) a signal X2, then past values of X1 should contain information that helps predict X2 above and beyond the information contained in past values of X2 alone.

Its mathematical formulation is based on linear regression modeling of stochastic processes (Granger 1969).

More complex extensions to nonlinear cases exist, however these extensions are often more difficult to apply in practice.

So, suppose we have two time series, say X1 and X2, and we can show that X2 depends on its previous values, i.e.,

$$X_2(t) = a_0 + a_1 X_2(t-1) + a_2 X_2(t-2) + \dots + a_p X_2(t-p)$$

Here, if at least one of the constants $a_1, a_2, \dots, a_p$ is not zero, we can say that $X_2(t)$ depends on its previous values. After showing this, if the following relation holds with at least one of $b_1, b_2, \dots, b_p$ not zero:

$$X_2(t) = a_0 + a_1 X_2(t-1) + a_2 X_2(t-2) + \dots + a_p X_2(t-p) + b_1 X_1(t-1) + b_2 X_1(t-2) + \dots + b_p X_1(t-p)$$

then we can say that X1 Granger-causes X2, or that it is possible for X1 to forecast X2.

Please refer to this video to get a better understanding of Granger causality.
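
One common way to check the two regressions above is to fit both by least squares and compare their residuals with an F-statistic. The sketch below does that with NumPy on synthetic data; the series, the coefficients and the lag of 3 are invented for illustration, and this is not Bollen's actual data or code.

```python
import numpy as np


def lagged_matrix(series, p):
    """Columns hold the series lagged by 1..p, aligned with series[p:]."""
    n = len(series)
    return np.column_stack([series[p - k : n - k] for k in range(1, p + 1)])


def residual_sum_of_squares(design, target):
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)


def granger_f_stat(x1, x2, p):
    """F-statistic for 'the past of x1 helps predict x2' using p lags."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    target = x2[p:]
    intercept = np.ones((len(target), 1))
    restricted = np.hstack([intercept, lagged_matrix(x2, p)])  # own lags only
    full = np.hstack([restricted, lagged_matrix(x1, p)])       # plus x1 lags
    rss_restricted = residual_sum_of_squares(restricted, target)
    rss_full = residual_sum_of_squares(full, target)
    dof = len(target) - full.shape[1]
    return ((rss_restricted - rss_full) / p) / (rss_full / dof)


# Synthetic example: "market" depends on "mood" from three steps earlier.
rng = np.random.default_rng(0)
mood = rng.normal(size=300)
noise = 0.1 * rng.normal(size=300)
market = np.zeros(300)
for t in range(3, 300):
    market[t] = 0.3 * market[t - 1] + 0.8 * mood[t - 3] + noise[t]

f_forward = granger_f_stat(mood, market, p=3)  # mood -> market
f_reverse = granger_f_stat(market, mood, p=3)  # market -> mood
print(f_forward, f_reverse)
```

A large F-statistic means the lagged X1 terms reduce the prediction error substantially. In the synthetic data the mood-to-market direction gives a far larger value than the reverse direction, as constructed.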

After doing a bivariate Granger causality analysis, Bollen found that, out of the six mood states (i.e., Calm, Alert, Sure, Vital, Kind and Happy), “Calm” had the highest Granger causality relation with the stock market, for lags ranging from 2 to 6 days. The other mood dimensions did not show a significant causal relation with the stock market.

So, Bollen plotted the “Calm” time series (lagged by 3 days) and the DJIA time series together to show the correlation between the two. The shaded portions of that plot show the periods with significant correlation.

We should keep in mind that the calm graph is lagged by 3 days, so the Twitter data is not tracking the market on the same day but anticipating it 3 days ahead.

If we look carefully, we can see a good deal of correlation in this graph, which supports the claim that the two time series are correlated.
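
As a small illustration of what “lagged by 3 days” means, the sketch below shifts a mood series by a given lag and computes the Pearson correlation with a market series. The numbers are invented so that the market follows the mood exactly 3 days later; real data would of course be much noisier.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def lagged_correlation(mood, market, lag):
    """Correlate mood[t - lag] with market[t]."""
    if lag == 0:
        return pearson(mood, market)
    return pearson(mood[:-lag], market[lag:])


# Made-up daily values: the market mirrors the mood from 3 days earlier.
mood = [0.1, 0.5, 0.3, 0.9, 0.2, 0.7, 0.4, 0.8, 0.6, 0.0]
lag = 3
market = [100.0] * lag + [100.0 + 10.0 * m for m in mood[:-lag]]

same_day = lagged_correlation(mood, market, 0)
three_days = lagged_correlation(mood, market, lag)
print(same_day, three_days)
```

The same-day correlation is weak here, while the correlation at a lag of 3 days is essentially perfect, which is the pattern one looks for when aligning the two curves.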

We can leverage this information to predict the stock market and see how accurately we were able to predict it.

Predicting the Stock Market

Now for the litmus test: can we predict previously unseen stock market trends on the basis of the available tweets? Can we predict the future?

To predict the stock market, Bollen used a Self-Organizing Fuzzy Neural Network (SOFNN): a five-layer hybrid SOFNN model, which gave pretty impressive results.

They tried different combinations of the mood data, e.g., calm only, calm with happy, etc., and their best accuracy was 87.6% (ouch!!!), which they obtained after combining calm with happy in a non-linear manner.

So, what are SOFNNs exactly? Well, according to this Scholarpedia article, they combine the best of fuzzy logic and neural networks to create a very good model for tasks like these.

Both of these topics are beyond the scope of this article, so we shall discuss each only briefly.

When dealing with computers we mostly use Boolean or binary logic, i.e., any entity can be either 0 or 1. But this type of logic does not fit many real-world scenarios, where we usually deal with more than two outcomes. For example, the result of a game is mostly a win or a loss, but it can also be a draw, or the margin of victory may be taken into consideration. Hence there can be many more states between 0 and 1. This feels more natural to us than a binary (black and white) approach and is more helpful for modelling real-world situations. This fuzzy approach is the idea behind fuzzy logic; you can read more about it here.
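
To illustrate “states between 0 and 1”, here is a toy sketch of fuzzy membership for the game example. The fuzzy sets and their thresholds (a 3-goal margin counts as a full win, a 2-goal margin is no longer a draw at all) are arbitrary choices for illustration and are unrelated to the SOFNN used in the paper.

```python
def clamp(value, low=0.0, high=1.0):
    return max(low, min(high, value))


def game_memberships(goal_diff):
    """Degree (0 to 1) to which a result counts as a win, draw or loss."""
    return {
        "win": clamp(goal_diff / 3.0),            # full win from +3 goals
        "loss": clamp(-goal_diff / 3.0),          # full loss from -3 goals
        "draw": clamp(1.0 - abs(goal_diff) / 2.0),
    }


narrow_win = game_memberships(1)
level = game_memberships(0)
print(narrow_win, level)
```

A one-goal victory is partly a “win” and partly a “draw” at the same time, which is exactly what binary logic cannot express.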

Neural networks are a buzzword these days; if you haven’t heard of them, it’s time to come out from under the rock and read some blogs.

Basically, they are mathematical models that try to mimic (without much success so far) the neurons inside the human brain.

Here is a good explanation from Scholarpedia of what hybrid fuzzy neural networks are:

Hybrid neuro-fuzzy systems are homogeneous and usually resemble neural networks.

Here, the fuzzy system is interpreted as special kind of neural network.

The advantage of such hybrid NFS is its architecture since both fuzzy system and neural network do not have to communicate any more with each other.

They are one fully fused entity.

These systems can learn online and offline.

The rule base of a fuzzy system is interpreted as a neural network.

Fuzzy sets can be regarded as weights whereas the input and output variables and the rules are modeled as neurons.

Neurons can be included or deleted in the learning step.

Finally, the neurons of the network represent the fuzzy knowledge base.

Obviously, the major drawbacks of both underlying systems are thus overcome.

The Big Picture

The four parts above explain the building blocks of the model proposed by Bollen.

Now we shall look at the final model, to better appreciate how it can be used for predicting the stock market:

[Figure: The Final Model]

First, the raw data from Twitter and the DJIA are extracted and processed. Then the Twitter data is passed through the mood analysis tools, OpinionFinder and GPOMS. A Granger causality analysis is then performed to show that the mood from Twitter does have some correlation with the DJIA values. Once that is out of the way, we can start predicting the stock market with the SOFNN model.

[1] Bollen, J., Mao, H., Zeng, X.: Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1–8 (2011)

[2] Mittal, Anshul, and Arpit Goel. “Stock prediction using twitter sentiment analysis.” Stanford CS229 (2011), http://cs229.stanford.edu/proj2011/GoelMittal-StockMarketPredictionUsingTwitterSentimentAnalysis.pdf (2012).
