Multi-Class Text Classification with LSTM

How to develop LSTM recurrent neural network models for text classification problems in Python using the Keras deep learning library

Automatic text classification, or document classification, can be done in many different ways in machine learning, as we have seen before.

This article aims to provide an example of how a Recurrent Neural Network (RNN) using the Long Short-Term Memory (LSTM) architecture can be implemented with Keras.

We will use the same data source as we did in Multi-Class Text Classification with Scikit-Learn: the Consumer Complaints data set that originates from data.gov.

The Data

We will use a smaller version of the data set; you can also find the data on Kaggle.

In this task, given a consumer complaint narrative, the model attempts to predict which product the complaint is about.

This is a multi-class text classification problem.

Let’s roll!

```python
import pandas as pd

df = pd.read_csv('consumer_complaints_small.csv')
df.info()
```

Figure 1

```python
df.Product.value_counts()
```

Figure 2

Label Consolidation

After a first glance at the labels, we realize there are a few things we can do to make our lives easier.

- Consolidate “Credit reporting” into “Credit reporting, credit repair services, or other personal consumer reports”.
- Consolidate “Credit card” into “Credit card or prepaid card”.
- Consolidate “Payday loan” into “Payday loan, title loan, or personal loan”.
- Consolidate “Virtual currency” into “Money transfer, virtual currency, or money service”.

“Other financial service” has a very small number of complaints and does not tell us much, so I decided to remove it.

```python
df.loc[df['Product'] == 'Credit reporting', 'Product'] = 'Credit reporting, credit repair services, or other personal consumer reports'
df.loc[df['Product'] == 'Credit card', 'Product'] = 'Credit card or prepaid card'
df.loc[df['Product'] == 'Payday loan', 'Product'] = 'Payday loan, title loan, or personal loan'
df.loc[df['Product'] == 'Virtual currency', 'Product'] = 'Money transfer, virtual currency, or money service'
df = df[df.Product != 'Other financial service']
```

After consolidation, we have 13 labels:

```python
# .iplot is provided by the cufflinks library.
df['Product'].value_counts().sort_values(ascending=False).iplot(kind='bar', yTitle='Number of Complaints', title='Number complaints in each product')
```

Figure 3

Text Pre-processing

Let’s have a look at how dirty the texts are:

```python
def print_plot(index):
    example = df[df.index == index][['Consumer complaint narrative', 'Product']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Product:', example[1])

print_plot(10)
```

Figure 4

```python
print_plot(100)
```

Figure 5

Pretty dirty, huh! Our text pre-processing will include the following steps (a code sketch follows the list):

- Convert all text to lower case.

- Replace REPLACE_BY_SPACE_RE symbols with a space in the text.
- Remove symbols that are in BAD_SYMBOLS_RE from the text.
- Remove “x” in the text (the data set masks personal information with runs of “x”).
- Remove stop words.
- Remove digits in the text.
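A minimal sketch of these steps might look like the following, assuming NLTK’s English stop-word list; the two regex patterns and the clean_text helper are illustrative choices, not the exact code from the text_preprocessing_LSTM.py gist below.

```python
import re
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

# Illustrative patterns -- assumptions, not necessarily the gist's exact regexes.
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """Apply the pre-processing steps listed above to a single complaint."""
    text = text.lower()                        # convert to lower case
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace these symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)        # remove bad symbols
    text = text.replace('x', '')               # drop every "x" masking character
    text = ' '.join(w for w in text.split() if w not in STOPWORDS)  # remove stop words
    text = re.sub(r'\d+', '', text)            # remove digits
    return text

df['Consumer complaint narrative'] = df['Consumer complaint narrative'].apply(clean_text)
```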

text_preprocessing_LSTM.py

Now go back and check the quality of our text pre-processing:

```python
print_plot(10)
```

Figure 6

```python
print_plot(100)
```

Figure 7

Nice! We are done with text pre-processing.

LSTM Modeling

Vectorize the consumer complaints text by turning each complaint into either a sequence of integers or a vector.

Limit the data set to the top 50,000 words.

Set the maximum number of words in each complaint to 250.

```python
from keras.preprocessing.text import Tokenizer

# The maximum number of words to be used (most frequent).
MAX_NB_WORDS = 50000
# Max number of words in each complaint.
MAX_SEQUENCE_LENGTH = 250
# This is fixed.
EMBEDDING_DIM = 100

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[]^_`{|}~', lower=True)
tokenizer.fit_on_texts(df['Consumer complaint narrative'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
```

Truncate and pad the input sequences so that they are all the same length for modeling.

```python
from keras.preprocessing.sequence import pad_sequences

X = tokenizer.texts_to_sequences(df['Consumer complaint narrative'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
```

Convert the categorical labels to numbers.

```python
Y = pd.get_dummies(df['Product']).values
print('Shape of label tensor:', Y.shape)
```

Train test split.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.10, random_state=42)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
```

The first layer is the embedding layer, which uses 100-length vectors to represent each word.

SpatialDropout1D performs variational dropout in NLP models.

The next layer is the LSTM layer with 100 memory units.

The output layer must create 13 output values, one for each class.

The activation function is softmax for multi-class classification.

Because it is a multi-class classification problem, categorical_crossentropy is used as the loss function.
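The full model code is in the consumer_complaint_lstm.py gist below; here is a minimal sketch consistent with the description above. The dropout rates, the adam optimizer, and the epoch, batch-size, and early-stopping settings are assumptions for the sketch, not values confirmed in the text.

```python
from keras.models import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense
from keras.callbacks import EarlyStopping

model = Sequential()
# Embedding layer: 100-length vectors for each of the top 50,000 words.
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))  # dropout rate is an assumption
# LSTM layer with 100 memory units; dropout values are assumptions.
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
# 13 output values, one per class, with softmax activation.
model.add(Dense(13, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Training settings below are assumptions for the sketch.
history = model.fit(X_train, Y_train,
                    epochs=5, batch_size=64,
                    validation_split=0.1,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
```

EarlyStopping halts training once the validation loss stops improving, which fits the observation below that more epochs will not help.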

consumer_complaint_lstm.py

Figure 8

```python
accr = model.evaluate(X_test, Y_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}'.format(accr[0], accr[1]))
```

```python
import matplotlib.pyplot as plt

plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
plt.show()
```

Figure 9

```python
plt.title('Accuracy')
plt.plot(history.history['acc'], label='train')
plt.plot(history.history['val_acc'], label='test')
plt.legend()
plt.show()
```

Figure 10

The plots suggest that the model has a slight overfitting problem; more data may help, but more epochs will not help with the current data.
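As a quick usage sketch, a trained model can score a brand-new complaint. The sample text below is made up for illustration, and recovering the label names assumes that pd.get_dummies created the 13 columns in sorted order.

```python
# A made-up complaint, for illustration only.
new_complaint = ['I have been contacted repeatedly about a debt that is not mine.']
seq = tokenizer.texts_to_sequences(new_complaint)
padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
pred = model.predict(padded)
# pd.get_dummies orders its columns alphabetically, so sorted labels align with pred.
labels = sorted(df['Product'].unique())
print(labels[pred.argmax()])
```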

The Jupyter notebook can be found on GitHub.

Enjoy the rest of the week!
