[NLP] Performance of Different Word Embeddings on Text Classification

Compared among word2vec, TF-IDF weighted, GloVe and doc2vec

Tom Lin, Jul 10

The Incentive

It’s been a while since I was last able to write a new post, but I am finally back to share some of the knowledge I’ve just acquired.

This time is about NLP.

As a fresh rookie in NLP, I’d like to play around and test how different methods of creating doc vectors perform on text classification.

This post focuses heavily on the feature engineering side, that is, word vectorization, and less on modeling.

So, without further ado, let’s get started.

Brief Introduction

The word embeddings investigated here are word2vec, TF-IDF weighted word2vec, pre-trained GloVe word vectors and doc2vec.

The packages needed are Gensim, spaCy and Scikit-Learn.

spaCy is used for doc preprocessing, including stop word removal and custom token selection based on part of speech.

Gensim is used for training word2vec and doc2vec, and Scikit-Learn for building and training the classifier.

Quick Summary

After comparing several word embedding and averaging methods, it turns out that custom-trained word embeddings combined with averaging, either simple mean or TF-IDF weighted, perform best, while GloVe word embeddings and custom-trained doc2vec perform slightly worse.

Moreover, concatenating word2vec and doc2vec into one feature set performs about the same as using averaged word embeddings alone.

In other words, there is no need to use both word2vec and doc2vec at the same time.

Special Credits to the Following Posts and Authors

In creating the Python classes used for text preprocessing, I referred to these well-written posts.

The post “Text Classification with Word2vec” by nadbor demonstrates how to write your own class to compute the average word embedding for a doc, either by simple averaging or TF-IDF weighted.

“Multi-Class Text Classification Model Comparison and Selection” by Susan Li taught me how to write a clean averaging function for word embeddings.

The tutorial “Gensim Doc2vec Tutorial on the IMDB Sentiment Dataset” gives step-by-step guidance on how to create doc2vec via Gensim.

“Distributed Representations of Sentences and Documents” by Le & Mikolov presents a clear and easy-to-understand explanation of what goes on under the hood of doc2vec.

Data Preparation

The dataset I am going to use here is the consumer complaints dataset on financial products and services, as referred to in the post [1].

The dataset is collected and published by the U.S. Consumer Financial Protection Bureau (CFPB), and it can also be downloaded from Kaggle.

The original dataset contains more than 500 thousand records, and its columns include product, sub_product, issue, consumer_complaint_narrative and company_response_to_consumer, among others.

We will use only product as the text label and consumer_complaint_narrative as the text itself.

After dropping rows with a missing consumer complaint narrative, we are left with around 60 thousand records.

To lessen the computing load, I will experiment on the first 25 thousand records only.
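A minimal pandas sketch of the selection steps above. The two column names come from the post; the helper name `prepare_complaints`, its `n_rows` parameter and the toy rows are illustrative assumptions standing in for the real CSV.

```python
import pandas as pd


def prepare_complaints(df: pd.DataFrame, n_rows: int = 25_000) -> pd.DataFrame:
    """Keep the label and text columns, drop rows without a narrative,
    and keep only the first n_rows remaining records."""
    cols = ["product", "consumer_complaint_narrative"]
    out = df[cols].dropna(subset=["consumer_complaint_narrative"])
    return out.head(n_rows).reset_index(drop=True)


# Toy stand-in for the real dataset (hypothetical rows).
toy = pd.DataFrame({
    "product": ["Mortgage", "Debt collection", "Mortgage"],
    "sub_product": ["FHA mortgage", None, None],
    "consumer_complaint_narrative": ["My loan was sold twice.", None, "Escrow error."],
})
small = prepare_complaints(toy, n_rows=2)
print(small)
```

Dropping the missing narratives before taking the first rows follows the order described in the post, so the 25 thousand records all contain text.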

Now, let’s see how the records are distributed among the labels.

Figure: Distribution of Each Label in the Dataset

We can tell that it’s a highly imbalanced dataset, where Debt Collection and Mortgage account for half of the total records, while the scarcest classes, Prepaid Card and Other Financial Service, account for less than 1% of the dataset.
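The share of each label can be checked with a few lines of standard-library Python. This is only a sketch: the function name `label_shares` and the toy labels are assumptions, not the code behind the chart in the post.

```python
from collections import Counter


def label_shares(labels):
    """Return (label, share) pairs sorted from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, count / total) for label, count in counts.most_common()]


# Hypothetical labels, only to show the output shape.
toy_labels = ["Mortgage"] * 5 + ["Debt collection"] * 4 + ["Prepaid card"]
for label, share in label_shares(toy_labels):
    print(f"{label:<16} {share:.1%}")
```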

Below is a demo of (label, text) examples.

Figure: Demo of Product (Label), Consumer Complaints (Text)

Document Preprocessing

Now comes the first step: doc preprocessing.

Before we create our own word embeddings from the input texts, we need to preprocess the texts so that they comply with the input format Gensim requires.

This involves multiple steps, including word tokenization, bi-gram detection and lemmatization.

Here, I wrote a Python class called DocPreprocess.

This class handles all the nitty-gritty jobs mentioned above under the hood:

First, the class takes in a series of texts, tokenizes each text and removes all punctuation.

It has an option build_bi that controls whether to build bi-grams, using a function adopted from Gensim. The default is False; if build_bi is set to True, the class trains a bi-gram detector and creates bi-gram words for the text.

Next, all the processed tokens are joined back together to form a sentence again.

The texts are then tokenized once more, but this time stop words and tokens whose part of speech is not allowed are removed, and all remaining tokens are lemmatized. These tokens are stored as self.doc_words, a list of tokens for each text (doc).

Finally, the self.doc_words lists are wrapped into TaggedDocument, an object type in Gensim, for later use in doc2vec training. These are stored in self.tagdocs.

Figure: Snippet of Class “DocPreprocess”

With the class, I can easily run doc preprocessing with just one line.

    from UtilWordEmbedding import DocPreprocess
    import spacy

    nlp = spacy.load('en_core_web_md')
    stop_words = spacy.lang.en.stop_words.STOP_WORDS
    all_docs = DocPreprocess(nlp, stop_words, df['consumer_complaint_narrative'], df['product'])

Now let’s inspect what the output of the doc preprocessing looks like.

Figure: The Content Stored in DocPreprocess Class

From the above, we can see that the class conveniently stores the tokenized words, labels and tagged documents, all ready for later use.
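To make the flow of the preprocessing steps concrete, here is a simplified, dependency-free sketch. It is not the author's DocPreprocess class: it swaps Gensim's bi-gram detector for a plain pair-count threshold, skips lemmatization and part-of-speech filtering (which need spaCy), and uses a tiny hand-written stop word set. `TaggedDoc` is a stand-in namedtuple with the same two fields (`words`, `tags`) as Gensim's TaggedDocument; all function names and the `min_count` value are assumptions.

```python
import re
from collections import Counter, namedtuple

# Stand-in with the same two fields as Gensim's TaggedDocument.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Tiny illustrative stop word set (assumption, not spaCy's list).
STOP_WORDS = {"the", "a", "an", "my", "was", "is", "to", "of", "and", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs, which drops punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def merge_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least min_count times with '_'."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


def preprocess(texts, build_bi=False):
    """Tokenize, optionally merge bi-grams, drop stop words, and tag each doc."""
    docs = [tokenize(t) for t in texts]
    if build_bi:
        docs = merge_bigrams(docs)
    doc_words = [[w for w in doc if w not in STOP_WORDS] for doc in docs]
    tagdocs = [TaggedDoc(words, [i]) for i, words in enumerate(doc_words)]
    return doc_words, tagdocs


texts = ["My credit card was charged twice.", "The credit card fee is wrong!"]
doc_words, tagdocs = preprocess(texts, build_bi=True)
print(doc_words)
```

With build_bi=True, the pair "credit card" appears in both toy texts and is merged into a single token, which is the effect the bi-gram option is meant to have.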

Word Model: Word2vec Training

Now that the texts are properly processed, we’re ready to train our word2vec model via Gensim.

Here I chose a dimension size of 100 for each word embedding and a window size of 5.

The training iterates 100 times. (The size and iter arguments below are the Gensim 3.x names; Gensim 4.x renamed them to vector_size and epochs.)

    from gensim.models import Word2Vec
    word_model = Word2Vec(all_docs.doc_words, min_count=2, size=100, window=5, workers=workers, iter=100)

It’s break time; we’ll continue shortly…

Averaging Word Embedding for Each Doc

OK! Now that we have the word embeddings at hand, we’ll use them to compute a representative vector for the whole text.

This vector then serves as the feature input for the text classification model.

There are various ways to come up with a doc vector.

Let’s start with the simplest one.

(1) Simple Averaging on Word Embedding

This is a rather straightforward method.

It directly averages the embeddings of all words that occur in the text.

Here I adapted the code from two posts [2][3] and created the class MeanEmbeddingVectorizer.

Figure: Class of MeanEmbeddingVectorizer

It has both self.fit() and self.transform() methods so that it is compatible with other functionality in scikit-learn.

What the class does is rather simple.

Once initialized with the word model (the trained word embeddings), it transforms all tokens in a text into vectors and averages them to produce a representative doc vector.

If the doc has no tokens, it returns a zero vector.

One reminder: the input to self.transform() must be a list of doc tokens, not the doc text itself.
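The behaviour described above can be sketched with NumPy alone. This is an illustration, not the author's class: it takes a plain dict of token to vector instead of a Gensim model, and skipping tokens that have no vector is an assumption (the post only states that a doc without tokens yields a zero vector). The class name `MeanEmbeddingSketch` is hypothetical.

```python
import numpy as np


class MeanEmbeddingSketch:
    """Average the vectors of the known tokens in each doc."""

    def __init__(self, word_vectors, dim):
        self.word_vectors = word_vectors  # mapping: token -> 1-D array of length dim
        self.dim = dim

    def fit(self, docs, y=None):
        return self  # nothing to learn; kept for scikit-learn compatibility

    def transform(self, docs):
        rows = []
        for tokens in docs:
            vecs = [self.word_vectors[t] for t in tokens if t in self.word_vectors]
            # Fall back to a zero vector when no token has an embedding.
            rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(self.dim))
        return np.vstack(rows)


toy_vectors = {"loan": np.array([1.0, 0.0]), "fee": np.array([0.0, 1.0])}
vectorizer = MeanEmbeddingSketch(toy_vectors, dim=2)
doc_vecs = vectorizer.fit(None).transform([["loan", "fee"], ["unknown"], []])
print(doc_vecs)
```

Note that each input doc is already a list of tokens, matching the reminder above about what transform() expects.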

    from UtilWordEmbedding import MeanEmbeddingVectorizer

    mean_vec_tr = MeanEmbeddingVectorizer(word_model)
    doc_vec = mean_vec_tr.transform(all_docs.doc_words)

(2) TF-IDF Weighted Averaging on Word Embedding

Not satisfied with simple averaging? We can further adopt TF-IDF as the weight for each word embedding.

This amplifies the role of significant words in computing the doc vector.

Here, the whole process is implemented in the class TfidfEmbeddingVectorizer.

Again, the code is adapted from the same source posts.

One thing worth noting is that term frequency is already accounted for when we average over the text, since a word that appears several times is counted several times, but inverse document frequency is not. The weight is therefore literally the IDF, and an unseen word is assigned the maximum IDF by default.

The code snippet can be found in this gist.

The other thing to note is that we need to fit the class with the tokens first, because it must loop through all the words beforehand in order to compute the IDF.
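A small NumPy sketch of IDF-weighted averaging, under stated assumptions: the smoothed formula ln((1 + N) / (1 + df)) + 1 is the one scikit-learn uses by default, and the author's class may compute its weights differently. The function names and the toy vectors are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def fit_idf(docs):
    """Smoothed IDF per token: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def idf_weighted_mean(docs, word_vectors, idf, dim):
    """Average IDF-scaled vectors; tokens without an IDF get the maximum IDF."""
    max_idf = max(idf.values())
    rows = []
    for tokens in docs:
        vecs = [word_vectors[t] * idf.get(t, max_idf) for t in tokens if t in word_vectors]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.vstack(rows)


toy_vectors = {"loan": np.array([1.0, 0.0]), "fee": np.array([0.0, 1.0])}
train_docs = [["loan", "fee"], ["loan"], ["loan", "loan"]]
idf = fit_idf(train_docs)  # must be fitted before transforming
weighted = idf_weighted_mean([["loan", "fee"]], toy_vectors, idf, dim=2)
print(idf, weighted)
```

In the toy corpus "loan" appears in every doc and "fee" in only one, so "fee" gets the larger weight and pulls the doc vector toward its own embedding.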

    from UtilWordEmbedding import TfidfEmbeddingVectorizer

    tfidf_vec_tr = TfidfEmbeddingVectorizer(word_model)
    tfidf_vec_tr.fit(all_docs.doc_words)  # fit tfidf model first
    tfidf_doc_vec = tfidf_vec_tr.transform(all_docs.doc_words)

(3) Leverage Pre-trained GloVe Word Embedding

Let’s include another option: leveraging existing pre-trained word embeddings to see how they perform in text classification.

Here I follow the instructions from the Stanford NLP course (CS224N) notebook, importing GloVe word embeddings into Gensim to compute the average word embedding of each text.

As a side note, I also tried applying the TF-IDF weighted method to the GloVe vectors, but found that the result is basically the same as that of the TF-IDF weighted averaging doc vector.

Thus, I omit that demonstration and include only simple averaging on GloVe word vectors here.
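The post loads GloVe through Gensim. As a dependency-light alternative, the sketch below parses the GloVe text format directly: each line holds a token followed by its vector components, separated by spaces. The function name `load_glove_text` and the two fake entries are assumptions for illustration; a real file would be opened with `open(path, encoding="utf-8")`.

```python
import io

import numpy as np


def load_glove_text(handle):
    """Parse lines of 'word v1 v2 ... vd' into a dict of word -> vector."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


# Two made-up 3-dimensional entries in place of a real GloVe file.
fake_file = io.StringIO("loan 0.1 0.2 0.3\nfee -0.5 0.0 0.5\n")
glove = load_glove_text(fake_file)
print(glove["fee"])
```

The resulting dict can be averaged per doc in the same way as custom-trained vectors, as long as the vectorizer only needs a token-to-vector lookup.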

    # Apply word averaging on GloVe word vectors.
    glove_mean_vec_tr = MeanEmbeddingVectorizer(glove_word_model)
    glove_doc_vec = glove_mean_vec_tr.transform(all_docs.doc_words)

(4) Apply Doc2vec Training Directly

Last but not least, we have one more option: train doc2vec directly, with no need to average word embeddings.

Here I chose the PV-DM model to train my doc2vec.

The script mostly follows the Gensim tutorial [4].

Again, to save labor, I created a class DocModel for it.

The class just needs to take in the TaggedDocument list; when we then call the self.custom_train() method, the doc model trains itself.

Figure: Class of DocModel

Note that self.custom_train() has an option to use a fixed learning rate.

It’s said that a fixed learning rate reaches a better result [5], as quoted here:

1. randomizing the order of input sentences, or
2. manually controlling the learning rate over the course of several iterations.

But that didn’t happen in my experiment.

When I manually decreased the learning rate (code below), I found that the doc2vec model was not able to infer the most similar doc correctly.

That is, if I feed in the doc vector inferred from a doc, self.test_orig_doc_infer() didn’t return that same doc as the most similar doc, although it’s supposed to.

As a side note, the self.test_orig_doc_infer() method tests whether the doc predicted from the vector of an original doc really is that same doc, that is, whether each doc comes back as its own most similar doc.

If so, we can fairly judge that the model successfully captures the hidden meaning of the whole doc and thus gives a representative doc vector.
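The code of test_orig_doc_infer() is not shown in the post, so the following is only a guess at what such a sanity check could look like. It uses random NumPy arrays in place of a trained model: `trained_vecs` would be the doc vectors learned during training and `inferred_vecs` the vectors inferred again for the same docs. The function name `self_match_rate` is hypothetical.

```python
import numpy as np


def self_match_rate(trained_vecs, inferred_vecs):
    """Fraction of docs whose inferred vector is closest (by cosine) to its own trained vector."""
    a = trained_vecs / np.linalg.norm(trained_vecs, axis=1, keepdims=True)
    b = inferred_vecs / np.linalg.norm(inferred_vecs, axis=1, keepdims=True)
    sims = b @ a.T  # sims[i, j]: inferred doc i vs trained doc j
    best = sims.argmax(axis=1)
    return float(np.mean(best == np.arange(len(best))))


rng = np.random.default_rng(0)
trained = rng.normal(size=(5, 8))
inferred = trained + 0.01 * rng.normal(size=(5, 8))  # small noise around each trained vector
print(self_match_rate(trained, inferred))
```

A rate close to 1 would suggest that re-inferred vectors land near their trained counterparts; a low rate is the kind of failure described above.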

    # Failed attempt (did not achieve a better result).
    # Assumption: the Doc2Vec model is held in self.model and shuffle comes from sklearn.utils.
    for _ in range(fixed_lr_epochs):
        self.model.train(utils.shuffle([x for x in self.docs]), total_examples=len(self.docs), epochs=1)
        self.model.alpha -= 0.002
        self.model.min_alpha = self.model.alpha  # fixed learning rate

Therefore, simply leaving the default setting suffices to achieve a better result.

Here, the learning rate is set to 0.025, the number of training epochs is 100, and negative sampling is applied.
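To see what the manual decay does to the learning rate, the arithmetic can be worked through in a few lines. The starting rate 0.025 and the step 0.002 come from the post; the number of passes (15 here) is an assumption, since the value of fixed_lr_epochs is not given.

```python
def manual_decay(alpha=0.025, step=0.002, epochs=15):
    """Learning rate used in each pass when it is lowered by step after every pass."""
    rates = []
    for _ in range(epochs):
        rates.append(alpha)
        alpha -= step
    return rates


rates = manual_decay()
# Index (0-based) of the first pass that would run with a negative rate.
first_negative = next(i for i, r in enumerate(rates) if r < 0)
print([round(r, 3) for r in rates])
print("first pass with a negative rate:", first_negative)
```

Starting at 0.025 and subtracting 0.002 per pass, the rate drops below zero from the 14th pass onward, so this schedule only makes sense for a small number of passes. Whether that played a part in the poor result reported above is not something the post confirms.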

    from UtilWordEmbedding import DocModel

    # Configure keyword arguments for the Doc2Vec model.
    dm_args = {
        'dm': 1, 'dm_mean': 1, 'vector_size': 100, 'window': 5,
        'negative': 5, 'hs': 0, 'min_count': 2, 'sample': 0,
        'workers': workers, 'alpha': 0.025, 'min_alpha': 0.025,
        'epochs': 100, 'comment': 'alpha=0.025',
    }

    # Instantiate a PV-DM model.
    dm = DocModel(docs=all_docs.tagdocs, **dm_args)
    dm.custom_train()

(5) Labels

And finally, don’t forget the labels!

    target_labels = all_docs.labels

Prepare the Classification Model

Now we’ve prepared all the necessary ingredients: the different types of features.

Let’s experiment to observe their effect on classification performance.

Here, I’ll use a basic logistic regression model as the base model and feed in the different kinds of features created earlier, in order to compare their effectiveness.

In addition to comparing the effect of each word embedding averaging method, I also tried concatenating word2vec and doc2vec features to see if that boosts the performance even more.

I used the TF-IDF weighted word embedding and the PV-DM doc2vec together.

The result shows that this increases the accuracy on the training dataset (perhaps a sign of over-fitting?), but brings no significant improvement on the testing dataset compared with using TF-IDF weighted word2vec alone.
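The comparison can be sketched with scikit-learn as follows. Everything here is an assumption made for illustration: the feature matrices are synthetic stand-ins rather than the real embeddings, and the 70/30 stratified split, max_iter value and helper name `evaluate` are not taken from the post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def evaluate(features, labels, seed=0):
    """Fit a logistic regression and return (train accuracy, test accuracy)."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=seed, stratify=labels
    )
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_tr, y_tr), clf.score(x_te, y_te)


# Synthetic stand-ins for the real feature matrices (hypothetical data).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 60)
word_feats = rng.normal(size=(180, 10)) + labels[:, None]  # informative features
doc_feats = rng.normal(size=(180, 10))                     # pure noise
combined = np.hstack([word_feats, doc_feats])              # concatenate per doc

for name, feats in [("word only", word_feats), ("word + doc", combined)]:
    train_acc, test_acc = evaluate(feats, labels)
    print(f"{name:<11} train={train_acc:.2f} test={test_acc:.2f}")
```

Reporting both train and test accuracy side by side is what makes an over-fitting pattern, such as the one suspected above, visible.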

Reflections

Let’s inspect which word embeddings perform the worst.

Surprisingly, the pre-trained GloVe word embedding and doc2vec perform relatively worse on text classification, with accuracies of 0.73 and 0.78 respectively, while the others score higher.

Perhaps this is because the custom-trained word2vec is fitted specifically to this dataset and thus provides the most relevant information for the docs at hand.

This doesn’t necessarily mean that we should stop using GloVe word embeddings or doc2vec: at inference time, we may well run into new words that have no embedding in our word model.

In that case, GloVe word embeddings would be a great help because of their wide vocabulary coverage.

As for doc2vec, we could say that it can assist the trained word embeddings in further boosting the performance of the text classification model, though the gain is pretty small and it is fine to opt out as well.

Table: Classification Performance over Different Word Embedding

The full Jupyter notebook can be found under this link.

References

[1] Susan Li, Multi-Class Text Classification with Doc2Vec & Logistic Regression (2018), Towards Data Science

[2] nadbor, Text Classification With Word2Vec (2016), DS lore

[3] Susan Li, Multi-Class Text Classification Model Comparison and Selection (2018), Towards Data Science

[4] Gensim Doc2Vec Tutorial on the IMDB Sentiment Dataset (2018), GitHub

[5] Radim Řehůřek, Doc2vec tutorial (2014), Rare Technologies
