Sentiment Analysis using LSTM (Step-by-Step Tutorial)

```python
print('Number of reviews :', len(reviews_split))
```

— Output —

```
Number of reviews : 25001
```

5) Tokenize — Create Vocab to Int mapping dictionary

In most NLP tasks, you will create an index mapping dictionary in such a way that your frequently occurring words are assigned lower indexes.

One of the most common ways of doing this is to use the Counter class from the collections library.
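Note that `words`, used below, is the flat list of all words across all reviews, produced during the earlier pre-processing steps (not shown in this excerpt). It would have been built along these lines:

```python
# Assumed shape of the earlier pre-processing step:
# join all reviews into one string, then split into a flat word list
all_text = ' '.join(reviews_split)
words = all_text.split()
```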

```python
from collections import Counter

count_words = Counter(words)
total_words = len(words)
sorted_words = count_words.most_common(total_words)
```

Let's have a look at the objects we have created:

```python
print(count_words)
```

— Output —

```
Counter({'the': 336713, 'and': 164107, 'a': 163009, 'of': 145864, ...})
```

In order to create a vocab-to-int mapping dictionary, you would simply do this:

```python
vocab_to_int = {w: i for i, (w, c) in enumerate(sorted_words)}
```

There is a small trick here: with this mapping, the indexing will start from 0, i.e. the mapping of 'the' will be 0. But later on we are going to pad shorter reviews, and the conventional choice for padding is 0. So we need to start the indexing from 1:

```python
vocab_to_int = {w: i + 1 for i, (w, c) in enumerate(sorted_words)}
```

Let's have a look at this mapping dictionary. We can see that the mapping for 'the' is 1 now:

```python
print(vocab_to_int)
```

— Output —

```
{'the': 1, 'and': 2, 'a': 3, 'of': 4, ...}
```

6) Tokenize — Encode the words

So far we have created a) a list of reviews and b) an index mapping dictionary using the vocab from all our reviews.

All this was to create an encoding of reviews (replace the words in our reviews by integers):

```python
reviews_int = []
for review in reviews_split:
    r = [vocab_to_int[w] for w in review.split()]
    reviews_int.append(r)

print(reviews_int[0:3])
```

— Output —

```
[[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, ...], [5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, ...], [1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, ...]]
```

Note: what we have created now is a list of lists. Each individual review is a list of integer values, and all of them are stored in one huge list.

7) Tokenize — Encode the labels

This is simple because we only have 2 output labels.

So, we will just label 'positive' as 1 and 'negative' as 0:

```python
import numpy as np

encoded_labels = [1 if label == 'positive' else 0 for label in labels_split]
encoded_labels = np.array(encoded_labels)
```

8) Analyze Reviews Length

```python
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

reviews_len = [len(x) for x in reviews_int]
pd.Series(reviews_len).hist()
plt.show()
pd.Series(reviews_len).describe()
```

[Figure: Review Length Analysis — histogram of review lengths]

Observations:
a) Mean review length = 240.
b) Some reviews have a length of 0. Keeping these reviews won't make any sense for our analysis.
c) Most of the reviews are less than 500 words long.
d) There are quite a few reviews that are extremely long; we can manually investigate them to check whether we need to include or exclude them from our analysis.

9) Removing Outliers — Getting rid of extremely long or short reviews

```python
# keep only reviews with non-zero length
reviews_int = [reviews_int[i] for i, l in enumerate(reviews_len) if l > 0]
# wrap in np.array so the labels remain a numpy array for the later split
encoded_labels = np.array([encoded_labels[i] for i, l in enumerate(reviews_len) if l > 0])
```

10) Padding / Truncating the remaining data

To deal with both short and long reviews, we will pad or truncate all our reviews to a specific length.

We call this length the sequence length. This sequence length is the same as the number of time steps for the LSTM layer.

For reviews shorter than seq_length, we will pad with 0s. For reviews longer than seq_length, we will truncate them to the first seq_length words.

```python
def pad_features(reviews_int, seq_length):
    '''
    Return features of reviews_int, where each review is padded
    with 0's or truncated to the input seq_length.
    '''
    features = np.zeros((len(reviews_int), seq_length), dtype=int)

    for i, review in enumerate(reviews_int):
        review_len = len(review)

        if review_len <= seq_length:
            # left-pad short reviews with zeros
            zeroes = list(np.zeros(seq_length - review_len))
            new = zeroes + review
        else:
            # truncate long reviews to the first seq_length words
            new = review[0:seq_length]

        features[i, :] = np.array(new)

    return features
```

Note: We are creating/maintaining a 2D array structure, as we created for reviews_int.
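With the function defined, we build the features array. The sequence length of 200 is the value the batching section below refers to:

```python
seq_length = 200  # matches the sequence length referenced in the batching section
features = pad_features(reviews_int, seq_length)
```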

The output will look like this:

```python
print(features[:10, :])
```

11) Training, Validation, Test Dataset Split

Once we have got our data into nice shape, we will split it into training, validation, and test sets: train = 80% | valid = 10% | test = 10%

```python
split_frac = 0.8
len_feat = len(features)  # total number of reviews

train_x = features[0:int(split_frac * len_feat)]
train_y = encoded_labels[0:int(split_frac * len_feat)]

remaining_x = features[int(split_frac * len_feat):]
remaining_y = encoded_labels[int(split_frac * len_feat):]

valid_x = remaining_x[0:int(len(remaining_x) * 0.5)]
valid_y = remaining_y[0:int(len(remaining_y) * 0.5)]

test_x = remaining_x[int(len(remaining_x) * 0.5):]
test_y = remaining_y[int(len(remaining_y) * 0.5):]
```

12) Dataloaders and Batching

After creating our training, test, and validation data, the next step is to create dataloaders for this data. We could use a generator function for batching our data, but instead we will use a TensorDataset. This is one of the very useful utilities in PyTorch for using our data with DataLoaders, with exactly the same ease as the torchvision datasets.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(valid_x), torch.from_numpy(valid_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
```

In order to obtain one batch of training data for visualization purposes, we will create a data iterator:

```python
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # dataiter.next() on older PyTorch versions

print('Sample input size: ', sample_x.size())  # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size())  # batch_size
print('Sample label: \n', sample_y)
```

Here, 50 is the batch size and 200 is the sequence length that we have defined.

Now our data prep step is complete, and next we will look at the LSTM network architecture to start building our model.

13) Define the LSTM Network Architecture

[Figure: LSTM Architecture for Sentiment Analysis]

The layers are as follows:

0. Tokenize: this is not a layer for the LSTM network but a mandatory step of converting our words into tokens (integers)
1. Embedding Layer: converts our word tokens (integers) into embeddings of a specific size
2. LSTM Layer: defined by hidden state dims and number of layers
3. Fully Connected Layer: maps the output of the LSTM layer to a desired output size
4. Sigmoid Activation Layer: turns all output values into a value between 0 and 1
5. Output: the sigmoid output from the last timestep is considered the final output of this network

14) Define the Model Class
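The model-definition code itself is not reproduced in this excerpt. Below is a minimal sketch of what the class could look like, reconstructed from the layer list above and from the print(net) output shown in the next section; the dropout probabilities (0.5 inside the LSTM, 0.3 before the fully connected layer) come from that output, while the forward pass and hidden-state initialization are assumptions in the spirit of the architecture described. (The printed output names the class SentimentRNN, while the instantiation code calls it SentimentLSTM; the sketch follows the latter.)

```python
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """LSTM sentiment classifier (sketch; details are assumptions)."""

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers):
        super().__init__()
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        # embedding layer: word tokens (integers) -> dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # stacked LSTM; dropout=0.5 between layers, per the printed model
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=0.5, batch_first=True)
        # dropout before the fully connected layer
        self.dropout = nn.Dropout(0.3)
        # fully connected layer maps LSTM output to the desired output size
        self.fc = nn.Linear(hidden_dim, output_size)
        # sigmoid squashes outputs to (0, 1)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        batch_size = x.size(0)
        embeds = self.embedding(x)                    # (batch, seq_len, embedding_dim)
        lstm_out, hidden = self.lstm(embeds, hidden)  # (batch, seq_len, hidden_dim)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

        out = self.dropout(lstm_out)
        out = self.fc(out)
        out = self.sig(out)

        # keep only the sigmoid output of the LAST timestep of each review
        out = out.view(batch_size, -1)
        out = out[:, -1]
        return out, hidden

    def init_hidden(self, batch_size):
        # initialize hidden and cell states to zeros
        weight = next(self.parameters()).data
        h0 = weight.new_zeros(self.n_layers, batch_size, self.hidden_dim)
        c0 = weight.new_zeros(self.n_layers, batch_size, self.hidden_dim)
        return (h0, c0)
```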

15) Training the Network

Instantiate the network:

```python
# Instantiate the model w/ hyperparams
vocab_size = len(vocab_to_int) + 1  # +1 for the 0 padding
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

net = SentimentLSTM(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)
print(net)
```

— Output —

```
SentimentRNN(
  (embedding): Embedding(74073, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
```

Training Loop

Most of the code in the training loop is pretty standard deep learning training code that you will often see in implementations using the PyTorch framework.
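The training-loop code is likewise not reproduced here. A minimal sketch, assuming BCELoss with the Adam optimizer and gradient clipping (common choices for this setup, not confirmed by the text), could look like this:

```python
import torch
import torch.nn as nn

# Hypothetical training hyperparameters (not specified in this excerpt)
lr = 0.001
epochs = 4
clip = 5  # gradient clipping threshold

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

net.train()
for epoch in range(epochs):
    for inputs, labels in train_loader:
        # re-initialize the hidden state for each batch
        # (the last batch can be smaller than batch_size)
        h = net.init_hidden(inputs.size(0))

        net.zero_grad()
        output, h = net(inputs, h)
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # clip gradients to mitigate exploding gradients in LSTMs
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

    print(f'Epoch {epoch + 1}/{epochs}  Loss: {loss.item():.6f}')
```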

16) Testing

We will test the network in two ways: on the held-out test data, and on user-generated data.

First, we will define a tokenize function that will take care of the pre-processing steps, and then we will create a predict function that will give us the final output after parsing the user-provided review. A sketch of both is given below.
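This sketch assumes the same lowercasing and punctuation stripping used during training, and reuses pad_features plus the init_hidden helper from the model sketch above. Mapping unseen words to 0 (the padding index) is one simple choice, not necessarily the author's:

```python
import torch
from string import punctuation

def tokenize_review(test_review):
    # same preprocessing as training: lowercase, strip punctuation
    test_review = test_review.lower()
    test_text = ''.join([c for c in test_review if c not in punctuation])
    test_words = test_text.split()
    # map to integers; unseen words fall back to 0 (the padding index)
    test_ints = [[vocab_to_int.get(word, 0) for word in test_words]]
    return test_ints

def predict(net, test_review, sequence_length=200):
    net.eval()
    test_ints = tokenize_review(test_review)
    # pad/truncate exactly as done for the training data
    features = pad_features(test_ints, sequence_length)
    feature_tensor = torch.from_numpy(features)

    with torch.no_grad():
        h = net.init_hidden(feature_tensor.size(0))
        output, h = net(feature_tensor, h)

    # round sigmoid output to the nearest integer: 1 = positive, 0 = negative
    pred = torch.round(output.squeeze())
    if pred.item() == 1:
        print('Positive review detected')
    else:
        print('Negative review detected')
```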

Results:

```python
test_review = 'This movie had the best acting and the dialogue was so good. I loved it.'

seq_length = 200  # good to use the length that was trained on
predict(net, test_review, seq_length)
```

— Output —

```
Positive review detected
```

Closing thoughts: I have tried to detail the process involved in building a sentiment analysis classifier based on an LSTM architecture using the PyTorch framework. Please feel free to share your thoughts / suggestions / feedback.

