Sentiment Analysis with Word Bags and Word Sequences

Does the sequence of words in a review matter for classifying its sentiment, or is a bag of words enough? That is the question we explore here.

We start with a simpler binary classification task in this post and consider a multilabel classification task in a later post.

We use Support Vector Machines (SVM) with tf-idf vectors as a proxy for the bag-of-words approach, and LSTM for the sequence-respecting approach.

SVM is implemented via SciKit and LSTM is implemented via Keras.

While we go through some code snippets here, the full code for reproducing the results can be downloaded from github.

1. Tokenize the Movie Reviews

The text corpus, Large Movie Reviews from Stanford, is often used for binary sentiment classification, i.e. is the movie good or bad based on the review. The positive and negative reviews are downloaded to disk in separate directories. Here is the code snippet to ‘clean’ the documents and tokenize them for analysis.

Lines #10–11: Tokenization. Remove all punctuation and NLTK stop words, make sure all words/tokens start with a letter, and retain only those between 3 and 15 characters long.

Lines #15–24: Loop through the movie review files in each folder and tokenize.

Line #25: Taking note of the number of words in each document helps us choose a reasonable sequence length for LSTM later.
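The snippet itself lives on github rather than being reproduced here; below is a minimal sketch of the same cleaning and tokenizing logic. The directory layout, variable names, and the tokenize helper are illustrative assumptions, so its line numbers will not match the references above.

```python
# A sketch of the clean & tokenize step; paths and names are assumptions
import os
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

nltk_stopw = set(stopwords.words('english'))

def tokenize(text):
    # Keep only words that start with a letter and are 3 to 15 characters
    # long; punctuation is dropped as a side effect
    words = RegexpTokenizer(r'\b[a-zA-Z][a-zA-Z0-9]{2,14}\b').tokenize(text)
    return [w.lower() for w in words if w.lower() not in nltk_stopw]

docs, labels, nTokens = [], [], []
for label in ['pos', 'neg']:                       # one directory per class
    folder = os.path.join('movie-reviews', label)  # hypothetical path
    for fname in os.listdir(folder):
        with open(os.path.join(folder, fname), encoding='utf-8') as f:
            tokens = tokenize(f.read())
        docs.append(tokens)
        labels.append(label)
        nTokens.append(len(tokens))                # word counts for the stats below

nTokens = np.array(nTokens)
print('Token Summary:', np.amin(nTokens), np.mean(nTokens), np.median(nTokens),
      np.std(nTokens), np.percentile(nTokens, 86), np.amax(nTokens))
```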

The percentile stats on nTokens show that over 86% of the documents have fewer than 200 words in them.

```
Token Summary: min avg median std 85/86/87/90/95/99 max
               3   116 86     88  189/195/203/230/302/457 1388
```

2. Pack Bags and Sequences

LSTM works with word sequences as input, while the traditional classifiers work with word bags such as tf-idf vectors. Having each document in hand as a list of tokens, we are ready for either.

2.1 Tf-Idf Vectors for SVM

We use Scikit’s Tf-Idf Vectorizer to build the vocabulary and the document vectors from the tokens.
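Because the documents are already tokenized, the token lists can be handed straight to the vectorizer. A minimal sketch, with docs carried over from the tokenization sketch above:

```python
# A minimal sketch; `docs` is the list of token lists built in Section 1
from sklearn.feature_extraction.text import TfidfVectorizer

# A callable analyzer bypasses Scikit's own tokenization, since the
# tokens are already clean
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
Xencoded = vectorizer.fit_transform(docs)   # sparse, shape (nDocs, vocabSize)
```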

2.2 Sequences for LSTM

The text processor in Keras turns each document into a sequence of integers, where each integer value indicates the actual word as per the {word:index} dictionary that the same processing generates. We use 200-long sequences, as the stats on the tokens show that over 86% of the documents have fewer than 200 words. In Line #8 in the code below, documents with fewer than 200 words are ‘post’ padded with the index value 0, which is ignored by the embedding layer (mask_zero=True is set in the definition of the embedding layer in Section 3).

3. Models

LSTM is implemented via Keras while SVM is implemented via SciKit. Both work with the same train/test split, so the comparison is fair. Twenty percent of the overall corpus (i.e. 10,000 documents) is set aside for test, while training is done on the remaining 40,000 documents.
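One way to share the identical split between the two models is to split the document indices once; a sketch, where the random_state value is an assumption:

```python
# One split of document indices shared by both models, so that LSTM
# (padded_docs) and SVM (Xencoded) see the same train/test documents.
# The random_state value is an assumption.
import numpy as np
from sklearn.model_selection import train_test_split

labels_arr = np.array(labels)                  # 'neg' / 'pos' per document
train_idx, test_idx = train_test_split(
    np.arange(len(labels_arr)), test_size=0.2,
    random_state=1, stratify=labels_arr)
```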

3.1 LSTM

As in the earlier article, we use the simplest possible LSTM model, with an embedding layer, one LSTM layer, and the output layer.

Figure 1. A simple LSTM model for binary classification.

The embedding layer in Figure 1 reduces the number of features from 98089 (the number of unique words in the corpus) to 300. The LSTM layer outputs a 150-long vector that is fed to the output layer for classification. The model itself is defined quite simply below.

Line #4: The embedding layer is trained to convert the 98089-long 1-hot vectors to dense 300-long vectors.

Line #6: The dropout fields help with preventing overfitting.

Training is done with early stopping to prevent overtraining (Line #6 in the training code below). The final output layer yields a vector that is as long as the number of labels, and the argmax of that vector is the predicted class label.
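A sketch of that model and its training follows. The layer sizes match the model summary printed in Section 4.1, but the dropout rates, optimizer, batch size, and the patience of 5 (consistent with stopping at epoch 6 in the logs) are assumptions:

```python
# A sketch of the Figure 1 model; layer sizes match the printed summary,
# while dropout rates, optimizer, batch size, and patience are assumptions
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical

vocabSize = len(kerasTokenizer.word_index) + 1   # 98089 words + 1 for padding
model = Sequential()
# Trained to turn 1-hot word vectors into dense 300-long vectors;
# mask_zero=True skips the 0-padding added in Section 2.2
model.add(Embedding(input_dim=vocabSize, output_dim=300,
                    input_length=200, mask_zero=True))
model.add(LSTM(units=150, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))        # one output per class label
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
print(model.summary())

# 'neg'/'pos' mapped to 0/1 and one-hot encoded for the softmax output
onehot = to_categorical((labels_arr == 'pos').astype(int))

# Stop training when val_loss stops improving, to prevent overtraining
early_stop = EarlyStopping(monitor='val_loss', patience=5, verbose=1)
model.fit(padded_docs[train_idx], onehot[train_idx],
          validation_data=(padded_docs[test_idx], onehot[test_idx]),
          epochs=50, batch_size=32, verbose=2, callbacks=[early_stop])
```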

3.2 SVM

The model for SVM is much less involved as there are far fewer moving parts and parameters to decide upon. That is always a good thing of course.
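A sketch using LinearSVC; the [LibLinear] tag in the logs of Section 4.2 points to this solver, but the tolerance setting here is an assumption:

```python
# A sketch with LinearSVC, consistent with the [LibLinear] log output;
# the tol value is an assumption
from sklearn.svm import LinearSVC

svm = LinearSVC(tol=1.0e-6, verbose=1)
svm.fit(Xencoded[train_idx], labels_arr[train_idx])
predicted_labels = svm.predict(Xencoded[test_idx])
```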

4. Simulations

The confusion matrix and the F1-scores obtained are what we are interested in. With the predicted labels in hand from either approach, we use SciKit’s API to compute them.
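A sketch of that computation; digits=4 matches the four-decimal reports shown below, and for LSTM the predicted labels come from the argmax of the softmax output:

```python
# A sketch of the evaluation; `predicted_labels` comes from either model
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(labels_arr[test_idx], predicted_labels))
print(classification_report(labels_arr[test_idx], predicted_labels, digits=4))
```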

While we have gone through some snippets in a different order, the complete code (lstm_movies.py for running LSTM and svm_movies.py for running SVM) is on github. As indicated in the previous article, various random seeds are initialized for repeatability.
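A sketch of the kind of seeding involved; the specific seed values are assumptions, and PYTHONHASHSEED itself has to be set before the interpreter starts, which is why the run commands below set it in the shell:

```python
# A sketch of the seeding; the seed values are assumptions.
# PYTHONHASHSEED must be set in the environment before Python starts,
# hence the shell commands in the run scripts below.
import random
import numpy as np
import tensorflow as tf

random.seed(1)
np.random.seed(1)
tf.set_random_seed(2)   # TF 1.x API, matching the 2019-era stack in the logs
```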

4.1 LSTM

Running LSTM with:

```
#!/bin/bash
echo "PYTHONHASHSEED=0 ; pipenv run python ./lstm_movies.py"
PYTHONHASHSEED=0 ; pipenv run python ./lstm_movies.py
```

yields about 0.87 as the F1-score, converging in 6 epochs due to early stopping.

```
Using TensorFlow backend.
Token Summary: min/avg/median/std 85/86/87/88/89/90/95/99/max:
3 116.47778 86.0 88.1847205941687 189.0 195.0 203.0 211.0 220.0 230.0 302.0 457.0 1388
X, labels #classes classes 50000 (50000,) 2 ['neg', 'pos']
Vocab padded_docs 98089 (50000, 200)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 200, 300)          29427000
_________________________________________________________________
lstm_1 (LSTM)                (None, 150)               270600
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 302
=================================================================
Total params: 29,697,902
Trainable params: 29,697,902
Non-trainable params: 0
_________________________________________________________________
None
Train on 40000 samples, validate on 10000 samples
Epoch 1/50
 - 1197s - loss: 0.3744 - acc: 0.8409 - val_loss: 0.3081 - val_acc: 0.8822
Epoch 2/50
 - 1195s - loss: 0.1955 - acc: 0.9254 - val_loss: 0.4053 - val_acc: 0.8337
...
Epoch 6/50
 - 1195s - loss: 0.0189 - acc: 0.9938 - val_loss: 0.5673 - val_acc: 0.8707
Epoch 00006: early stopping
[[4506  494]
 [ 799 4201]]
              precision    recall  f1-score   support

         neg     0.8494    0.9012    0.8745      5000
         pos     0.8948    0.8402    0.8666      5000

   micro avg     0.8707    0.8707    0.8707     10000
   macro avg     0.8721    0.8707    0.8706     10000
weighted avg     0.8721    0.8707    0.8706     10000

Time Taken: 7279.333829402924
```

4.2 SVM

Running SVM with:

```
#!/bin/bash
echo "PYTHONHASHSEED=0 ; pipenv run python ./svm_movies.py"
PYTHONHASHSEED=0 ; pipenv run python ./svm_movies.py
```

yields 0.90 as the F1-score.

```
Token Summary: min/avg/median/std 85/86/87/88/89/90/95/99/max:
3 116.47778 86.0 88.1847205941687 189.0 195.0 203.0 211.0 220.0 230.0 302.0 457.0 1388
X, labels #classes classes 50000 (50000,) 2 ['neg', 'pos']
Vocab sparse-Xencoded 98089 (50000, 98089)
..*
optimization finished, #iter = 59
Objective value = -6962.923784
nSV = 20647
[LibLinear]
[[4466  534]
 [ 465 4535]]
              precision    recall  f1-score   support

         neg     0.9057    0.8932    0.8994      5000
         pos     0.8947    0.9070    0.9008      5000

   micro avg     0.9001    0.9001    0.9001     10000
   macro avg     0.9002    0.9001    0.9001     10000
weighted avg     0.9002    0.9001    0.9001     10000

Time Taken: 0.7256226539611816
```

5. Conclusions

Clearly, both SVM at 0.90 as the F1-score and LSTM at 0.87 have done very well on this binary classification task. The confusion matrices show excellent diagonal dominance, as expected.

Figure 2. Both LSTM and SVM have done very well for this binary sentiment classification exercise.

While they are equal on the quality side, LSTM takes much longer to train: about 2 hours as opposed to less than a second for SVM. That is too big a difference to ignore.

With that we conclude this post. In the next post we go over the results for a multilabel classification exercise and the impact of external word embeddings such as fastText.

Originally published at xplordat.com on January 28, 2019.
