Applying Sentiment Analysis to E-commerce Classification Using Recurrent Neural Networks in Keras: Theory and Implementation

To get a better understanding of the architecture, let’s look at how an RNN layer operates on a sequence.

Figure 1 below displays an abstraction of an RNN layer built around a gated recurrent unit, which can be summarized as a hidden layer with the ability to propagate information across a sequence.

Suppose that we are feeding the network a series of inputs in a time-step based sequence (X0, X1, …, Xt). It then follows that:

Figure 1. Abstraction of an RNN architecture (source)

1. X0 is fed into the layer, producing a hypothesis H0 as well as an activation value A0, which is stored in memory. (The prediction for X0 itself is influenced by a randomized initial activation value.)
2. X1 is fed into the layer together with A0 (retrieved from memory) to produce a hypothesis H1. An activation value A1 is also produced and stored in memory, which now contains information related to X0 and X1.
3. X2 is fed into the layer together with A1 (retrieved from memory) to produce a hypothesis H2. An activation value A2 is also produced and stored in memory, which now contains information related to X0, X1, and X2.
4. The process is repeated across the sequence until a pre-defined stop token is encountered (let's assume a period exists after Xt).
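To make this recurrence concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. The dimensions and weight values are made up purely for illustration; the weight names Wax, Waa, and Way correspond to the three weight sets discussed below.

import numpy as np

n_x, n_a, n_y = 10, 16, 5          # input, hidden, and output sizes (illustrative)
Wax = np.random.randn(n_a, n_x)    # input -> hidden weights
Waa = np.random.randn(n_a, n_a)    # hidden -> hidden weights (across timesteps)
Way = np.random.randn(n_y, n_a)    # hidden -> output weights

def rnn_step(x_t, a_prev):
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t)   # new activation, stored for the next step
    h_t = Way @ a_t                           # hypothesis for this timestep
    return h_t, a_t

a = np.random.randn(n_a)             # randomized initial activation (influences H0)
for x_t in np.random.randn(7, n_x):  # a mock sequence X0..X6
    h, a = rnn_step(x_t, a)          # each step sees the activation stored by the previous one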

Essentially, an RNN layer allows for information from one end of the sequence to influence the prediction processes at the other end.

While the exact training process is beyond the scope of this tutorial, it is based on backpropagation through time, where a loss is calculated at each step of the sequence prediction process and these losses are summed into an overall loss function.

Training affects three separate sets of weights, namely:

- between the input and the hidden layer (Wax)
- between the hidden layer and the output layer (Way)
- between the hidden layer's activations across timesteps (Waa)

You may have noticed that our architecture is unidirectional, meaning that while the earlier elements of a sequence can influence later predictions, later elements cannot be used to influence earlier predictions.

This can be addressed with the bidirectional variant of recurrent neural networks (BRNNs), which you can read about here.

While the RNN above is an example of a many-to-many predictive architecture, sentiment classifiers are usually based on many-to-one architectures, where a hypothesis is only generated at the end of the sequence.

As the aforementioned activation-value propagation is still present during forward propagation, this final prediction contains information on the sequence as a whole.
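In Keras, this distinction boils down to the return_sequences flag on the recurrent layer. A minimal sketch:

from keras.layers.recurrent import GRU

# many-to-many: emit a hypothesis at every timestep
many_to_many = GRU(256, return_sequences=True)

# many-to-one: emit a single hypothesis after the final timestep,
# which still carries information propagated from the whole sequence
many_to_one = GRU(256, return_sequences=False)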

Implementation

While the Shopee datasets are private and proprietary, I've taken the liberty of creating some mock data in the style of the fashion dataset for illustrative purposes.

Each listing has its own title, path to an image, and attribute categories corresponding to a feature specific to that class of data.

We’ll be focusing on the attribute and title feature columns for our RNN.

Note that as we can’t release the datasets for your use, you won’t be able to replicate these results.

However, you may be inspired to try something similar on your own problems!

Mock data in the same style as the official fashion dataset. NaN here stands for "Not a Number".

You may wonder why we aren't planning to use the image data.

This was a conscious decision by the team, as we found that the raw images were inconsistent in terms of layout, lighting, or content — some images were simply stock photos, others badly lit, and others had their contents partially cropped.

As all of these images were submitted by thousands of individual sellers, noisy data is understandable, and contrasts sharply with the standardized datasets commonly found in deep learning tutorials.

In light of this, we decided to focus on the title feature column (containing the raw text of each listing) and the attribute categories as our network inputs and target labels, respectively.

Have no fear, we’ll be covering an image-based approach, and how to deal with noisy data, in a later tutorial.

All of our work was performed on Google’s Compute Engine, using the free sign-up credit granted upon registration.

As we don’t have access to the original dataset, we’ll skip any data importing steps and move straight into preprocessing — for the length of this tutorial, assume that the mock data is an excellent imitation of the actual training and validation datasets.

Essentially, our approach boils down to treating each attribute feature value as an independent sentiment value.

To execute this, feature values across all attribute categories would have to be collected and converted into one-hot-encoded vectors.

The initial step of preprocessing was to annotate all target label entries of the training data with their textual representations, using the provided dataset JSON dictionary.

This can be done via pandas's merge command, repeated for each attribute category:

import numpy as np
import pandas as pd

# Attach the textual attribute representation for each category, then rename
# the generic "Attribute" column to a category-specific one.
fashion_trainval = pd.merge(fashion_trainval,
                            fashion_ref.loc[~np.isnan(fashion_ref["Pattern"]), ["Pattern", "Attribute"]],
                            on="Pattern", how="left")
fashion_trainval = fashion_trainval.rename(columns={"Attribute": "Pattern_Attr"})

fashion_trainval = pd.merge(fashion_trainval,
                            fashion_ref.loc[~np.isnan(fashion_ref["Collar Type"]), ["Collar Type", "Attribute"]],
                            on="Collar Type", how="left")
fashion_trainval = fashion_trainval.rename(columns={"Attribute": "Collar_Type_Attr"})

fashion_trainval = pd.merge(fashion_trainval,
                            fashion_ref.loc[~np.isnan(fashion_ref["Fashion Trend"]), ["Fashion Trend", "Attribute"]],
                            on="Fashion Trend", how="left")
fashion_trainval = fashion_trainval.rename(columns={"Attribute": "Fashion_Trend_Attr"})

fashion_trainval = pd.merge(fashion_trainval,
                            fashion_ref.loc[~np.isnan(fashion_ref["Clothing Material"]), ["Clothing Material", "Attribute"]],
                            on="Clothing Material", how="left")
fashion_trainval = fashion_trainval.rename(columns={"Attribute": "Clothing_Material_Attr"})

fashion_trainval = pd.merge(fashion_trainval,
                            fashion_ref.loc[~np.isnan(fashion_ref["Sleeves"]), ["Sleeves", "Attribute"]],
                            on="Sleeves", how="left")
fashion_trainval = fashion_trainval.rename(columns={"Attribute": "Sleeves_Attr"})
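Since the same merge-and-rename pattern repeats five times, an equivalent loop keeps things tidier. This is just a sketch, assuming the same column conventions as above:

for col in ["Pattern", "Collar Type", "Fashion Trend", "Clothing Material", "Sleeves"]:
    fashion_trainval = pd.merge(
        fashion_trainval,
        fashion_ref.loc[~np.isnan(fashion_ref[col]), [col, "Attribute"]],
        on=col, how="left")
    # e.g. "Collar Type" -> "Collar_Type_Attr"
    fashion_trainval = fashion_trainval.rename(
        columns={"Attribute": col.replace(" ", "_") + "_Attr"})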

Next, we replaced all NaN values with a universal "default" sentiment class as a placeholder, and applied a function to collect all attribute entries into a single tuple, which we stored in a new [Label] column.

fashion_trainval = fashion_trainval.fillna("default")
fashion_trainval["Label"] = fashion_trainval[["Pattern_Attr", "Collar_Type_Attr", "Fashion_Trend_Attr",
                                              "Clothing_Material_Attr", "Sleeves_Attr"]].apply(
    lambda x: tuple([attr for attr in x.values]), axis=1)
fashion_trainval.head()

An example data row would now carry the original columns plus the new [Label] column holding a tuple of all five attribute values.

Next, we one-hot-encoded our target labels using scikit-learn's MultiLabelBinarizer class. To be more specific, MultiLabelBinarizer converts all of our features (across all attribute categories) into a single binary array indicating the presence of each feature for a given row of data.

from sklearn.preprocessing import MultiLabelBinarizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(fashion_trainval['Label'])

A representation of a row of data possessing only two labels, say "floral" and "colorful", would contain ones in exactly those two positions.
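Here is a toy illustration with made-up labels (not the real attribute vocabulary):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform([("floral", "colorful"), ("striped",)])
print(mlb.classes_)  # ['colorful' 'floral' 'striped']
print(encoded)       # [[1 1 0]
                     #  [0 0 1]]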

With our labels ready, we cleaned up our title sequences by standardizing their format and removing any special characters. To reduce the number of inflectional and derivationally related forms of the same base word, our sequences also underwent lemmatization, which aims to map words of the same meaning to a single token.

This process acts as a form of normalization for the text sequences in our dataset.
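A quick illustration of what the lemmatizer does:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dresses"))  # dress
print(lemmatizer.lemmatize("sleeves"))  # sleeve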

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

lemmatizer = WordNetLemmatizer()
strip_special_chars = re.compile("[^A-Za-z0-9 ]+")
stop_words = set(stopwords.words("english"))

def cleanUpSentence(r, stop_words=None):
    # standardize case and strip HTML line breaks and special characters
    r = r.lower().replace("<br />", " ")
    r = re.sub(strip_special_chars, "", r.lower())
    if stop_words is not None:
        # tokenize, lemmatize, and drop stop words
        words = word_tokenize(r)
        filtered_sentence = []
        for w in words:
            w = lemmatizer.lemmatize(w)
            if w not in stop_words:
                filtered_sentence.append(w)
        return " ".join(filtered_sentence)
    else:
        return r

totalX = []
totalY = np.array(fashion_trainval['Label'])
totalY = multilabel_binarizer.fit_transform(totalY)
for i, doc in enumerate(fashion_trainval['title']):
    totalX.append(cleanUpSentence(doc, stop_words))

Finally, we ensured that our tokens fit within a maximum dictionary size of 50,000 words, while padding our sequences to a uniform length of 150 elements.

This was done using Keras’s internal methods.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

maxLength = 150
max_vocab_size = 50000
input_tokenizer = Tokenizer(max_vocab_size)
input_tokenizer.fit_on_texts(totalX)
input_vocab_size = len(input_tokenizer.word_index) + 1
print("input_vocab_size:", input_vocab_size)
totalX = np.array(pad_sequences(input_tokenizer.texts_to_sequences(totalX), maxlen=maxLength))

Now that all of our preprocessing is done, let's build our sequential network model.

As we were building an RNN, we replaced the traditional layers with the aforementioned gated recurrent units, before feeding the results into a densely connected sigmoid layer designed to give probabilities across all of our feature sentiment categories.

We defined our model's optimizer and loss function as Adam and binary crossentropy, respectively. The use of binary crossentropy means that our predictions follow a one-vs-all format. You could also use categorical crossentropy for a purer multiclass approach, but we felt that the high number of "sentiments" would make that approach less accurate.
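Concretely, the two options differ only in the output activation and loss function. A sketch of the contrast (our actual model below uses the multi-label head):

from keras.layers import Dense

# one-vs-all, multi-label (what we used): independent per-feature probabilities
head_multilabel = Dense(num_categories, activation='sigmoid')
loss_multilabel = 'binary_crossentropy'

# pure multiclass alternative: probabilities compete and sum to one
head_multiclass = Dense(num_categories, activation='softmax')
loss_multiclass = 'categorical_crossentropy'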

Finally, we trained our model across 10 epochs using a 10% validation set, primarily due to a lack of resources.

Ideally, you’re going to want to train for more epochs, but even with the limited training time you’ll notice that our validation accuracies are still high.

As in our previous tutorials, we used Matplotlib to inspect the change in our training and validation accuracies and loss values.

from keras.models import Sequential, load_model
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import GRU
from keras.layers import Dense

embedding_dim = 256
y = multilabel_binarizer.classes_  # label names, used below to map indices back to features
num_categories = len(y)

model = Sequential()
model.add(Embedding(input_vocab_size, embedding_dim, input_length=maxLength))
model.add(GRU(256, dropout=0.9, return_sequences=True))  # GRU layers in place of traditional dense layers
model.add(GRU(256, dropout=0.9))
model.add(Dense(num_categories, activation='sigmoid'))   # per-feature probabilities
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# train for 10 epochs with a 10% validation split, as described above
history = model.fit(totalX, totalY, validation_split=0.1, epochs=10)

model.save("fashion_text_model.h5")
model = load_model("fashion_text_model.h5")

import matplotlib.pyplot as plt
%matplotlib inline

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

Figure 3. Accuracy and loss values over 10 training epochs using the RNN model.

Our validation accuracy approaches 99%! Wow! You may be lulled into thinking that our model performs great, but accuracies this high on both the training and validation sets suggest that our model is overfitting to the dataset.

It could be argued, however, that this is simply due to the data not exhibiting significant variation.

As all of our feature probabilities are returned in a single array (see below), we need to enumerate through them. First, we set a cut-off of 50%, and then separated the resulting features into their original attribute categories.

Let's demonstrate a quick prediction using row 220 of the fashion dataset (which we sadly cannot show you!):

textArray = np.array(pad_sequences(input_tokenizer.texts_to_sequences([input_x_220]), maxlen=maxLength))
predicted = model.predict(textArray)[0]
print(predicted)

# print every feature whose probability clears the 50% cut-off
for i, prob in enumerate(predicted):
    if prob > 0.5:
        print(y[i])

# keep the three highest-scoring features, then drop the "default" placeholder
predicted_top = y[sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)[:3]]
if 'default' in predicted_top:
    predicted_top = [i for i in predicted_top if i != 'default']
else:
    predicted_top = predicted_top[:2]
predicted_top

Your raw feature prediction array is simply a vector of per-feature probabilities, while your final two predictions (with the default class removed) may look something like this:

['floral', 'dress']

To make a submission for the challenge, we had to extract the attributes from the raw prediction array into their corresponding attribute categories.

To summarize, we kept the top two predicted features for each attribute category, duplicating the single feature whenever a category had only one prediction available.

def process_prediction(data):
    # order all features by predicted probability, dropping the "default" placeholder
    predicted_ordered = y[sorted(range(len(data)), key=lambda i: data[i], reverse=True)]
    if 'default' in predicted_ordered:
        predicted_ordered = [i for i in predicted_ordered if i != 'default']
    # get the attribute category of each predicted feature
    predicted_attribute = []
    for i, temp_predict in enumerate(predicted_ordered):
        predicted_attribute.append(get_attribute(temp_predict))
    # keep the top 2 features for each attribute category,
    # duplicating the single feature when only one is available
    return_prediction = []
    for i, attr in enumerate(list(set(predicted_attribute))):
        temp_predicted_attr = [predicted_ordered[j] for j in
                               [index for index, value in enumerate(predicted_attribute) if value == attr]]
        if len(temp_predicted_attr) > 1:
            temp_predicted_attr = temp_predicted_attr[:2]
        else:
            temp_predicted_attr = temp_predicted_attr + temp_predicted_attr
        return_prediction.append([attr] + temp_predicted_attr)
    return return_prediction

# predicted_new holds the raw prediction array for each test row
predicted_new_top2 = []
for i, predicted_item in enumerate(predicted_new):
    predicted_new_top2.append(process_prediction(predicted_item))

Our overall MAP@K score (K=2) was 0.45262, compared to the overall winner's 0.46869.
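For reference, MAP@K is the mean over all rows of the average precision of the top K predictions. A minimal sketch of the metric, following the common Kaggle formulation (not Shopee's official scorer):

def apk(actual, predicted, k=2):
    # average precision at k for a single example
    predicted = predicted[:k]
    score, num_hits = 0.0, 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    return score / min(len(actual), k) if actual else 0.0

def mapk(actual_lists, predicted_lists, k=2):
    # mean average precision at k over all examples
    return sum(apk(a, p, k) for a, p in zip(actual_lists, predicted_lists)) / len(actual_lists)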

In fact, Shopee’s own reference solution utilized a multimodal approach, where CNN-based image prediction results supplemented heuristic algorithms and densely connected classifiers.

Some of the issues and weaknesses with our approach include:

1. We assumed that the title sequences carry intrinsic meaning, when in practice titles tend to be arranged in the style of a bag of words. While certain rules of grammar can help separate adjectives from nouns in our titles, this is not guaranteed to be consistent.
2. Our model lacked sufficient hyperparameter and probability cut-off tuning, possibly hurting accuracy.
3. We observed a strong tendency to overfit to our training dataset, which would require a large amount of regularization to overcome (see the sketch after this list).
4. Our dictionary may be too large, holding too many words that are under-represented in the dataset or simply filler.
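On point 3, one direction we could have explored is heavier regularization of the GRU stack. A hypothetical sketch with illustrative values, not settings we actually validated:

from keras.regularizers import l2

# L2 weight penalties plus dropout on both the input and recurrent connections
model.add(GRU(256, dropout=0.5, recurrent_dropout=0.5,
              kernel_regularizer=l2(1e-4), return_sequences=True))
model.add(GRU(256, dropout=0.5, recurrent_dropout=0.5,
              kernel_regularizer=l2(1e-4)))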

In the end, this challenge was a great learning experience, and if nothing else, taught us four important data science lessons that I hope you'll take with you!

- Don't overengineer your solutions!
- Don't neglect any available data!
- Don't put all of your eggs in a unimodal basket!
- Don't forget to allocate enough resources for training!

References

- Andrew Ng, Recurrent Neural Networks
- Colah, Understanding LSTMs
- WildML, Recurrent Neural Networks
