Sentiment Analysis using Deep Learning techniques with India Elections 2019 — A Case study

The prominent parties standing in the elections, their leaders and representatives have a busy schedule organizing campaigns and convincing people to vote.

While the media is busy capturing every event, from press conferences to public gatherings, and putting it in front of the public, the public is deeply engrossed in the latest news and developments.

The phenomenal growth in real-time data tracking and analysis techniques has inspired data scientists to visualize and predict sentiments, build real-time models to predict the winners, and more.

Trust me, the most exciting part of it is capturing the information online from all sources and predicting in real time with the highest accuracy.

The great challenge in this scenario is accuracy, given the ever-increasing volume of data flooding in from all sources every second.

With these challenges in view, I decided to use a few deep learning techniques to predict moods using Twitter data.

Note that this article assumes a basic knowledge of data science and NLP (Natural Language Processing).

But if you are a newcomer to this world, I have provided links throughout the article to help you out.

This blog is structured as follows:

Describe deep learning algorithms: LSTM, Bi-directional LSTM, Bi-directional GRU and CNN.

Train these algorithms using a contextual election corpus as well as pre-trained word embeddings to predict sentiments towards the contesting parties.

Compare the accuracy and log loss of the different models.

Glove Pre-trained Word Embeddings

Glove: Pre-trained Word Embeddings, Source: https://nlp.stanford.edu/projects/glove/

We started our sentiment classification technique with Google's pre-trained Word2Vec model that represents words as vectors, built on the basis of aggregated global word-word co-occurrence statistics from a corpus.

The Word2Vec model trained by Google predicts words close to the target word with a neural network, representing linear substructures of the word vector space.

As we represent each word with a vector and a sentence (tweet) as the average of its word vectors to illustrate its sentiment, it becomes natural to train the word vectors on different moods to aid the classification and prediction process.

As such, Word2Vec is trained with different RNN models.
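Before moving to the RNN models, here is a quick illustration of the averaging idea: a minimal sketch, not code from the election repository, assuming the standard Google News Word2Vec binary is available locally.

# Hypothetical sketch: represent a tweet as the average of its word vectors
import numpy as np
from gensim.models import KeyedVectors

# Assumption: the pre-trained Google News vectors have been downloaded locally
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def tweet_vector(tokens, kv, dim=300):
    # Average the vectors of the tokens found in the embedding vocabulary
    vecs = []
    for tok in tokens:
        try:
            vecs.append(kv[tok])
        except KeyError:
            continue  # skip out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(tweet_vector(['election', 'rally', 'crowd'], w2v).shape)  # (300,)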

Recurrent Neural Networks

A recurrent neural network (RNN) is a sequence of inter-linked artificial neural networks where connections between nodes form a directed graph along a sequence.

They are particularly known for processing sequential data: text, time series, videos, etc., where the output at any given instant t is affected by the output at the previous instant t-1 along with the input at t.

Source: https://www.analyticsvidhya.com/blog/2017/12/introduction-to-recurrent-neural-networks/

We will see how RNN-based models (LSTM, GRU, Bi-directional LSTM) perform with an external embedding which has been trained and distilled on a very large corpus of data, as well as with an internal embedding, where a part of the contextual corpus has been considered for training.
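As a minimal sketch of the recurrence just described (illustrative only, not repository code), the hidden state at instant t is computed from the input at t and the hidden state at t-1:

# Hypothetical numpy sketch of a vanilla RNN step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

d, n = 3, 5                                   # assumed input and hidden dimensions
rng = np.random.RandomState(42)
W_x, W_h, b = rng.randn(n, d), rng.randn(n, n), np.zeros(n)

h = np.zeros(n)
for x_t in rng.randn(4, d):                   # a toy sequence of 4 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b)         # output at t depends on input at t and state at t-1
print(h.shape)                                # (5,)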

Basic RNNs suffer from vanishing and exploding gradient problems; LSTM-based networks have evolved to handle this.

Auto-encoder

Auto-encoders are a special type of RNN known for compressing a relatively long sequence into a limited, fixed-size, dense vector.

They are well known for classifying textual sentiments and hence are used here to train on and predict mood categories for election tweets.

An auto-encoder attempts to copy its input to its output through an encoder and decoder architecture.

The dimension of the middle-hidden layer is lower than that of the input data.

Thus, the neural network is designed to represent the input in a smart and compact way in order to reconstruct it successfully.

The auto-encoders used here follow a simple Sequence2Sequence architecture built from an input layer followed by an encoding LSTM layer, an embedding layer, a decoding LSTM layer, and a softmax layer.

Both the input and the output of the entire architecture are vectorized representations of the tweets and their labelled sentiments.

Finally, the output of the LSTM is passed through softmax activation to represent the sentiment category.
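A minimal Keras sketch of such a Sequence2Sequence set-up is shown below. It is only an approximation of the architecture described above, not the exact code from the repository; the vocabulary size, sequence length and layer sizes are assumptions that mirror values used elsewhere in this post.

# Hypothetical sketch: embedding -> encoding LSTM -> decoding LSTM -> softmax over moods
from keras.layers import Input, Embedding, LSTM, RepeatVector, Dense
from keras.models import Model

MAX_SEQUENCE_LENGTH = 2000   # assumed, as used later in this post
VOCAB_SIZE = 20000           # assumed
NO_CLASSES = 8

seq_in = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded = Embedding(VOCAB_SIZE, 128)(seq_in)              # embedding layer
encoded = LSTM(64)(embedded)                               # encoding LSTM compresses the tweet into a fixed-size vector
repeated = RepeatVector(MAX_SEQUENCE_LENGTH)(encoded)      # hand the fixed-size code to the decoder at every timestep
decoded = LSTM(64)(repeated)                               # decoding LSTM
preds = Dense(NO_CLASSES, activation='softmax')(decoded)   # softmax over sentiment categories

autoenc = Model(seq_in, preds)
autoenc.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
autoenc.summary()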

Auto-Encoder, Source: https://www.eurekalert.org/multimedia/pub/129766.php

Auto-Encoder Training with Pre-trained Glove

LSTM

LSTMs, a kind of Recurrent Neural Network, possess internal contextual state cells that act as long-term or short-term memory cells.

LSTMs solve many problems of vanilla Recurrent Neural Networks by helping to preserve a constant error through continuous learning and back-propagation through time and layers.

LSTMs contain gated cells that control the flow of information.

Gated cells are responsible for reading, writing and storing information.

They are the primary decision makers for retaining cell state information (input gate), determining how much of the cell state to pass on to the next neural network layers (output gate) and how much existing information in memory can be forgotten (forget gate).

Gates in LSTMs carry analog values ranging from 0 to 1, produced by sigmoid activation functions.

This analog information flow through the gates allows back-propagation to happen through multiple bounded nonlinearities.

LSTMs address the vanishing gradient problem by keeping the gradients steep enough, thereby keeping training relatively short and accuracy high.
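To make the role of the three gates concrete, here is a gate-level numpy sketch of a single LSTM step (illustrative only; it is not the Keras implementation used later in this post).

# Hypothetical sketch of one LSTM step with input, forget, output and candidate gates
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the stacked weights for the four gates
    z = W @ x_t + U @ h_prev + b
    n = h_prev.shape[0]
    i = sigmoid(z[0*n:1*n])        # input gate: how much new information to write
    f = sigmoid(z[1*n:2*n])        # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2*n:3*n])        # output gate: how much of the cell state to expose
    g = np.tanh(z[3*n:4*n])        # candidate cell state
    c_t = f * c_prev + i * g       # new cell state
    h_t = o * np.tanh(c_t)         # new hidden state
    return h_t, c_t

# Tiny usage example with random weights
n, d = 4, 3
rng = np.random.RandomState(0)
h, c = np.zeros(n), np.zeros(n)
W, U, b = rng.randn(4*n, d), rng.randn(4*n, n), np.zeros(4*n)
h, c = lstm_step(rng.randn(d), h, c, W, U, b)
print(h)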

The figure below shows how a word embedding can feed an input sentence to an LSTM.

The LSTM layer takes into consideration the previous hidden state to extract the key feature vectors that determine the sentiment of the sentence.

The source code below shows how to build a word embedding with a single hidden LSTM layer of 128 neurons and classify tweets into predefined classes using a "softmax" classifier and the "Adam" optimizer.

Source code available at https://github.com/sharmi1206/elections-2019

#fileName classifyw2veclstm.py
NO_CLASSES = 8
embedded_sequences = embedding_layer(sequence_input)
l_lstm = LSTM(128)(embedded_sequences)
preds = Dense(NO_CLASSES, activation='softmax')(l_lstm)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
model.fit(x_train, y_train, nb_epoch=15, batch_size=64)
output_test = model.evaluate(x_test, y_test, verbose=0)

Model Summary with single Layer LSTM

GRU

GRU is just a slightly modified version of the LSTM that captures the dependencies between time instances adaptively.

The absence of a memory unit like the LSTM's makes it unable to control the flow of information as the LSTM unit does.

A GRU functions with "reset" and "update" gates.

The reset gate sits between the previous activation and the next candidate activation, allowing the network to forget parts of the previous state.

The update gate decides how much of the candidate activation to use in updating the cell state.

A GRU possesses fewer parameters and thus may train a bit faster or need less data to generalize, as illustrated in the short sketch below.

It falls short of the LSTM on larger datasets, where LSTMs have been shown to perform better.
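To illustrate the "fewer parameters" point, here is a small sketch (assumed layer sizes, not repository code) comparing the parameter counts of equally sized LSTM and GRU layers in Keras.

# Hypothetical sketch: count trainable parameters of an LSTM vs a GRU of the same size
from keras.layers import Input, LSTM, GRU
from keras.models import Model

inp = Input(shape=(100, 128))                        # 100 timesteps of 128-dim embeddings (assumed)
lstm_params = Model(inp, LSTM(128)(inp)).count_params()
gru_params = Model(inp, GRU(128)(inp)).count_params()
print('LSTM parameters:', lstm_params)               # four gates' worth of weights
print('GRU parameters :', gru_params)                # three gates' worth of weights, roughly 25% fewer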

The source code below shows how to build a GRU with a single hidden layer and classify tweets using a "softmax" classifier and the "Adam" optimizer.

#fileName classifyw2veclstm.py at https://github.com/sharmi1206/elections-2019
NO_CLASSES = 8
embedded_sequences = embedding_layer(sequence_input)
l_lstm = GRU(128)(embedded_sequences)
preds = Dense(NO_CLASSES, activation='softmax')(l_lstm)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
model.fit(x_train, y_train, nb_epoch=15, batch_size=64)
output_test = model.evaluate(x_test, y_test, verbose=0)

Model Summary with single Layer GRU

Bi-directional LSTM

Bidirectional Recurrent Neural Networks (BRNN) connect two hidden layers of opposite directions to the same output, thus increasing the amount of input information available to the network.

This architecture facilitates the output layer to get information from past (backwards) and future (forward) states simultaneously.

BRNNs have been used to analyze public sentiment towards the elections: the election context is fed as input, and a BRNN shows increased performance when knowledge of the words preceding and following the most polarized word is taken into consideration from both directions.

BRNN aims to divide the neurons of a regular RNN into two directions, one for the positive time direction (forward states), and another for the negative time direction (backward states).

This facilitates information inclusion from both past and future of the current time frame.

The outputs of the two states are not connected to the inputs of the opposite-direction states.

BRNNs can be trained using similar algorithms to RNNs, because the training process does not involve any interactions between both the directional neurons.

The training involves three steps: a forward pass, a backward pass and weight updates.

In the forward pass, the forward states and backward states are passed first, then the output neurons.

In the backward pass, the output neurons are passed first, then the forward states and backward states. After the forward and backward passes are done, the weights are updated.
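For completeness, here is a hedged sketch of the Bi-directional LSTM classifier summarized below, written in the same Keras style as the other snippets. The layer sizes mirror those used elsewhere in this post; the exact code in the repository may differ.

# Hypothetical sketch: Bi-directional LSTM over word embeddings with a softmax output
from keras.layers import Input, Embedding, LSTM, Bidirectional, Dense
from keras.models import Model

MAX_SEQUENCE_LENGTH = 2000   # assumed
VOCAB_SIZE = 20000           # assumed
NO_CLASSES = 8

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = Embedding(VOCAB_SIZE, 128)(sequence_input)
l_bilstm = Bidirectional(LSTM(128))(embedded_sequences)   # forward and backward LSTMs, outputs concatenated
preds = Dense(NO_CLASSES, activation='softmax')(l_bilstm)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()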

Bi-directional LSTM model summary

Convolutional Neural Networks (CNN)

The CNN used for sentiment prediction with pre-trained word embeddings is composed of a 1D convolution layer with 128 filters and a 1D global max-pooling layer.

The 1D convolution layer in the network performs convolutions (feature mapping) over the ordered embedded word vectors in a sentence using a filter size of 5, sliding over 5 words at a time.

Single layer CNN with 128 filters

#fileName classifygloveattlstm.py at https://github.com/sharmi1206/elections-2019
model = Sequential()
model.add(layers.Embedding(len(word_index) + 1, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=True))
model.add(layers.Conv1D(128, 5, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(Dense(8, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
history = model.fit(x_train, y_train, nb_epoch=15, batch_size=64, validation_data=(x_test, y_test))
loss, accuracy = model.evaluate(x_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(x_test, y_test, verbose=False)
print("Testing Accuracy: {:.4f}".format(accuracy))

LSTM, Bi-directional LSTM, Bi-directional GRU with Attention Mechanism

LSTM with Attention Layer, Source: https://skymind.ai/wiki/attention-mechanism-memory-network

Attention mechanisms allow neural networks to decide which vectors (or words) from the past are important for future decisions by considering them in context to the word in question.

In this process, it filters out the important and relevant chunks of information and forces hops over parts of the sequence that are not relevant to the final goal or task.

Such relationships among words and connection to neighboring words can be represented by directed arcs of a semantic dependency graph.

Further, an attention mechanism takes into account the input from several time steps, distributes attention over the hidden states by assigning different weights, or degrees of importance, to those inputs.

For a fixed target word, the first task is to loop over all of the encoder's states, comparing the target and source states to generate a score for each encoder state.

A softmax is then introduced to normalize all scores, which generates the probability distribution conditioned on target states.

Finally, the weights are introduced to make the context vector easy to train.

The principal advantage of the attention mechanism lies in the context vector's ability to take all cells' outputs as input to compute the probability distribution of the source, providing the decoder with the ability to represent global information rather than a single hidden state.
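The AttLayer used in the snippet below is a custom Keras layer (referenced from the richliao/textClassifier repository). As a rough idea of what such an additive attention layer looks like, here is a hedged sketch adapted from that general recipe; it is not necessarily identical to the class used in the election repository.

# Hypothetical sketch of a simple additive attention layer over RNN hidden states
from keras import backend as K
from keras.layers import Layer

class SimpleAttention(Layer):
    def __init__(self, attention_dim, **kwargs):
        self.attention_dim = attention_dim
        super(SimpleAttention, self).__init__(**kwargs)

    def build(self, input_shape):
        # input_shape: (batch, timesteps, features)
        self.W = self.add_weight(name='W', shape=(input_shape[-1], self.attention_dim),
                                 initializer='glorot_uniform', trainable=True)
        self.b = self.add_weight(name='b', shape=(self.attention_dim,),
                                 initializer='zeros', trainable=True)
        self.u = self.add_weight(name='u', shape=(self.attention_dim, 1),
                                 initializer='glorot_uniform', trainable=True)
        super(SimpleAttention, self).build(input_shape)

    def call(self, x):
        # Score each hidden state, normalize the scores with softmax,
        # and return the attention-weighted sum (the context vector)
        uit = K.tanh(K.bias_add(K.dot(x, self.W), self.b))   # (batch, timesteps, attention_dim)
        ait = K.squeeze(K.dot(uit, self.u), axis=-1)         # (batch, timesteps)
        a = K.softmax(ait)                                   # attention weights
        return K.sum(x * K.expand_dims(a), axis=1)           # (batch, features)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])

# Usage mirrors the snippet below, e.g. l_att = SimpleAttention(64)(l_gru)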

Bi-directional GRU and LSTM networks with Attention mechanism

Model Summary Bi-directional LSTM/GRU with Attention layer

The source code below shows how to build a single Bi-directional GRU layer with an Attention layer of 64 neurons, and classify tweets into predefined classes using a "softmax" classifier and the "Adam" optimizer.

Source code available at https://github.com/sharmi1206/elections-2019

#fileName classifygloveattlstm.py at https://github.com/sharmi1206/elections-2019
import numpy as np
from keras.layers import Input, Dense
from keras.layers import GRU, Bidirectional, Embedding
from keras.models import Model
from sklearn.metrics import log_loss, accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix

NO_CLASSES = 8
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=True)
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
l_gru = Bidirectional(GRU(100, return_sequences=True))(embedded_sequences)
# AttLayer is the custom attention layer, Refr: https://github.com/richliao/textClassifier/issues/28
l_att = AttLayer(64)(l_gru)
preds = Dense(NO_CLASSES, activation='softmax')(l_att)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
model.fit(x_train, y_train, nb_epoch=15, batch_size=64)

# Evaluate model Accuracy
output_test = model.predict(x_test)
final_pred = np.argmax(output_test, axis=1)
org_y_label = [np.where(r==1)[0][0] for r in y_test]
results = confusion_matrix(org_y_label, final_pred)
precisions, recall, f1_score, true_sum = metrics.precision_recall_fscore_support(org_y_label, final_pred)
pred_indices = np.argmax(output_test, axis=1)
classes = np.array(range(0, NO_CLASSES))
preds = classes[pred_indices]
print('Log loss: {}'.format(log_loss(classes[np.argmax(y_test, axis=1)], output_test)))
print('Accuracy: {}'.format(accuracy_score(classes[np.argmax(y_test, axis=1)], preds)))

Accuracy with Pre-trained Word Embeddings

Accuracy and Log Loss for sentiment prediction BJP vs Congress

Word Embeddings with Convolutional Neural Networks (CNN) on Election Tweets

Convolution Neural Networks with Word2Vec Models with Gensim by building the election corpus

The word2vec tool takes a text corpus (a list of tweets) as input and produces the word vectors as output.

It first constructs a unique vocabulary set from the training text data (a list of tokenized tweets) and then learns vector representations of words, representing n-gram features that aid the sentiment classification process.

The process is the same word embedding used with the pre-trained embeddings, the only difference being that training takes place on election tweets instead of pre-trained data.

We used Keras to convert positive integer representations of words into a word embedding through an Embedding layer.

#fileName classifyw2veccnn.py at https://github.com/sharmi1206/elections-2019
num_words = 20000
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(combined_df['tweet'].values)
word_index = tokenizer.word_index

# Pad the tweet data
X = tokenizer.texts_to_sequences(combined_df['tweet'].values)
X = pad_sequences(X, maxlen=2000)
Y = pd.get_dummies(combined_df['mood']).values

word2vec = Word2Vec(sentences=tokenized_corpus, size=vector_size, window=window_size, iter=500, seed=300, workers=multiprocessing.cpu_count())
# Copy word vectors
X_vecs = word2vec.wv

The CNN used for sentiment prediction is composed of 1D convolution layers and 1D pooling layers over a series of 4 layers, with 32, 64, 128 and 256 filters respectively in each layer.

Each 1D convolution layer in the network performs convolutions (feature mapping) over the ordered embedded word vectors in a sentence using a filter size of 3, sliding over 3 words at a time.

This amounts to considering 3-grams to understand how words contribute to sentiment in the context of those around them.

After each convolution, we add a max-pool layer to extract the most significant elements and turn them into a feature vector.

Further, we also add dropout of 20% to ensure the model does not overfit.

The resultant tensor of varying shape is flattened into one long, single-column feature vector.

The long feature vector is then used by a dense layer with softmax activation to yield the resultant classified output.

#fileName classifyw2veccnn.py at https://github.com/sharmi1206/elections-2019
from keras.layers.core import Dense, Dropout, Flatten
from keras.layers.convolutional import Conv1D, MaxPooling1D
from keras.optimizers import Adam
from keras.models import Sequential

batch_size = 64
nb_epochs = 20
vector_size = 512
max_tweet_length = 100

model = Sequential()
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same', input_shape=(max_tweet_length, vector_size)))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Conv1D(64, kernel_size=3, activation='elu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Conv1D(128, kernel_size=3, activation='elu', padding='same'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Conv1D(256, kernel_size=3, activation='elu', padding='same'))
model.add(Dropout(0.2))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(8, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.001, decay=1e-6), metrics=['accuracy'])

# Fit the model
model.fit(X_train, Y_train, batch_size=batch_size, shuffle=True, epochs=nb_epochs)

Model Summary Convolution Neural Networks

Word Embeddings with Recurrent Neural Networks (LSTM/GRU/Bi-directional LSTMs) on Election Tweets

The neural network architecture (each of LSTM, GRU, Bi-directional LSTM/GRU) is modeled on the 20000 most frequent words, where each tweet is padded to a maximum length of 2000.

The first layer is the Embedding layer, which uses 128-length vectors (each word is tokenized with Keras's Tokenizer) to represent each word.

The next layer is the LSTM layer with 256 memory neurons.

Finally, the results are fed to a single output Dense layer with 8 neurons and a softmax activation function to predict the associated mood.

#fileName classifyw2veclstm.py at https://github.com/sharmi1206/elections-2019
NO_CLASSES = 8
embed_dim = 128
lstm_out = 256

model = Sequential()
model.add(Embedding(num_words, embed_dim, input_length=X.shape[1]))
model.add(LSTM(lstm_out, recurrent_dropout=0.2, dropout=0.2))
model.add(Dense(NO_CLASSES, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_crossentropy'])
print(model.summary())

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

# Fit the model
model.fit(X_train, Y_train, batch_size=batch_size, shuffle=True, epochs=nb_epochs)
output_test = model.predict(X_test)

The model yields 99.58% accuracy over 5 epochs with a batch size of 128.

Epoch 5/5
  64/7344 [..............................] – ETA: 58:45 – loss: 0.0218 – acc: 1.0000
 128/7344 [..............................] – ETA: 54:28 – loss: 0.0259 – acc: 1.0000
 192/7344 [..............................] – ETA: 57:35 – loss: ...
...
7232/7344 [============================>.] – ETA: 58s – loss: 0.0328 – acc: 0.9960
7296/7344 [============================>.] – ETA: 24s – loss: 0.0330 – acc: 0.9959
7344/7344 [==============================] – 3811s 519ms/step – loss: 0.0331 – acc: 0.9958

Conclusion

In this post, we reviewed deep learning methods for creating vector representations of sentences with RNNs and CNNs, and presented their effectiveness in solving a supervised sentiment prediction task.

With GloVe pre-trained word embeddings, the Bi-directional LSTM and Bi-directional GRU with an Attention layer perform the best, while the Auto-encoder model performs the worst, both in the case of BJP and of Congress.

A word-embedding matrix trained solely on election-context tweets increases the accuracy of the models (LSTM, GRU, Bi-directional LSTM/GRU) to almost 99.5%.

But the CNN model performs the worst, with 50% accuracy.

However, each of these models can be further improved through extensive tuning of hyper-parameters, different numbers of epochs, different learning rates and the addition of more labelled data for minority classes; a small tuning sketch is given below.

Further, altering the neural network architecture by increasing or decreasing the number of neurons and hidden layers might give added improvements.
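As a sketch of what such hyper-parameter tuning could look like (hypothetical helper code, not part of the repository; build_model is assumed to return a fresh, uncompiled Keras model), a simple grid over learning rates and epoch counts might be:

# Hypothetical sketch: brute-force grid over a couple of hyper-parameters
from itertools import product
from keras.optimizers import Adam

results = {}
for lr, epochs in product([1e-2, 1e-3, 1e-4], [5, 10, 15]):
    m = build_model()                                  # assumption: returns a fresh, uncompiled model
    m.compile(loss='categorical_crossentropy', optimizer=Adam(lr=lr), metrics=['acc'])
    hist = m.fit(X_train, Y_train, epochs=epochs, batch_size=64,
                 validation_data=(X_test, Y_test), verbose=0)
    results[(lr, epochs)] = hist.history['val_acc'][-1]

best = max(results, key=results.get)
print('Best (lr, epochs):', best, 'with validation accuracy', results[best])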

References

https://www.researchgate.net/figure/The-architecture-of-sentence-representation-learning-network_fig2_325642880
https://blog.myyellowroad.com/unsupervised-sentence-representation-with-deep-learning-104b90079a93
https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
https://code.google.com/archive/p/word2vec/

Please let me know if there were any mistakes; suggestions and feedback are welcome.

The election repository is available at https://github.com/sharmi1206/elections-2019. Please feel free to follow me at linkedin…
