Building a Recommendation System using Word2vec: A Unique Tutorial with Case Study in Python

Imagine the buying history of a consumer as a sentence and the products as its words. Taking this idea further, let's work on online retail data and build a recommendation system using word2vec embeddings.

Case Study: Using word2vec in Python for Online Product Recommendation

Let's set up and understand our problem statement.

We are asked to create a system that automatically recommends a certain number of products to consumers on an e-commerce website, based on their past purchase behavior.

We are going to use an Online Retail Dataset that you can download from this link.

Let’s fire up our Jupyter Notebook and quickly import the required libraries and load the dataset.


Here is the description of the fields in this dataset:

- InvoiceNo: Invoice number, a unique number assigned to each transaction
- StockCode: Product/item code, a unique number assigned to each distinct product
- Description: Product description
- Quantity: The quantity of each product per transaction
- InvoiceDate: Invoice date and time, the day and time when each transaction was generated
- CustomerID: Customer number, a unique number assigned to each customer

df.shape

Output: (541909, 8)

The dataset contains 541,909 transactions. That is a pretty good number for us to build our model.

Treat Missing Data

# check for missing values
df.isnull().sum()

Since we have sufficient data, we will drop all the rows with missing values.

# remove missing values
df.dropna(inplace=True)

Data Preparation

Let's convert StockCode to the string datatype:

df['StockCode'] = df['StockCode'].astype(str)

Let's check out the number of unique customers in our dataset:

customers = df["CustomerID"].unique().tolist()
len(customers)

Output: 4372

There are 4,372 customers in our dataset.

For each of these customers, we will extract their buying history.

In other words, we can have 4,372 sequences of purchases.

It is a good practice to set aside a small part of the dataset for validation purposes.

Therefore, I will use the data of 90% of the customers to create word2vec embeddings.

Let’s split the data.

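The split code is on Gist. Here is a minimal sketch of the idea, with a tiny synthetic frame standing in for the real df (column names as in the dataset):

```python
import random

import pandas as pd

# tiny stand-in for the real df built earlier
df = pd.DataFrame({'CustomerID': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
                   'StockCode':  list('ABABCACBAB')})
customers = df['CustomerID'].unique().tolist()

# shuffle the customer IDs, then keep 90% of them for training
random.shuffle(customers)
customers_train = customers[:round(0.9 * len(customers))]

# split the transactions by customer
train_df = df[df['CustomerID'].isin(customers_train)]
validation_df = df[~df['CustomerID'].isin(customers_train)]
```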

We will create sequences of purchases made by the customers in the dataset for both the train and validation set.

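Both snippets are on Gist. A minimal sketch of the idea: one "sentence" per customer, namely the sequence of StockCodes they bought, built here with tiny stand-ins for train_df and validation_df:

```python
import pandas as pd

# tiny stand-ins for the train/validation split from the previous step
train_df = pd.DataFrame({'CustomerID': [1, 1, 2, 2, 3],
                         'StockCode': ['A', 'B', 'A', 'C', 'B']})
validation_df = pd.DataFrame({'CustomerID': [4, 4],
                              'StockCode': ['C', 'A']})

def purchase_sequences(frame):
    # one "sentence" per customer: the ordered list of product codes they bought
    return [group['StockCode'].tolist()
            for _, group in frame.groupby('CustomerID')]

purchases_train = purchase_sequences(train_df)
purchases_val = purchase_sequences(validation_df)
```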

Build word2vec Embeddings for Products

Since we are not planning to train the model any further, we call init_sims() here. This makes the model much more memory-efficient (note that init_sims() is deprecated in gensim 4.x):

model.init_sims(replace=True)

Let's check out the summary of "model":

print(model)

Output: Word2Vec(vocab=3151, size=100, alpha=0.03)

Our model has a vocabulary of 3,151 unique words, and their vectors have a size of 100 each.

Next, we will extract the vectors of all the words in our vocabulary and store it in one place for easy access.

Output: (3151, 100)

Visualize word2vec Embeddings

It is always quite helpful to visualize the embeddings that you have created.

Here, we have 100-dimensional embeddings. We can't even visualize 4 dimensions, let alone 100. What in the world can we do? We are going to reduce the dimensions of the product embeddings from 100 to 2 by using the UMAP algorithm, which is popularly used for dimensionality reduction.


Every dot in this plot is a product.

As you can see, there are several tiny clusters of these data points.

These are groups of similar products.

Start Recommending Products

Congratulations! We are finally ready with the word2vec embeddings for every product in our online retail dataset.

Now, our next step is to suggest similar products for a certain product or a product’s vector.

Let’s first create a product-ID and product-description dictionary to easily map a product’s description to its ID and vice versa.

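That dictionary's code is on Gist. One plausible construction, with a tiny stand-in for train_df (column names as in the dataset):

```python
import pandas as pd

# tiny stand-in for train_df from the earlier split
train_df = pd.DataFrame({'StockCode': ['84029E', '84029E', '22423'],
                         'Description': ['RED WOOLLY HOTTIE WHITE HEART.',
                                         'RED WOOLLY HOTTIE WHITE HEART.',
                                         'REGENCY CAKESTAND 3 TIER']})

# map each product code to its list of descriptions
products = train_df[['StockCode', 'Description']].drop_duplicates()
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()
```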

# test the dictionary
products_dict['84029E']

Output: ['RED WOOLLY HOTTIE WHITE HEART.']

I have defined the function below. It takes a product's vector (v) as input and returns the top 6 similar products:

Let's try out our function by passing the vector of the product '90019A' ('SILVER M.O.P ORBIT BRACELET'):

similar_products(model['90019A'])

Output:

[('SILVER M.O.P …', 0.766798734664917),
 ('PINK HEART OF GLASS BRACELET', 0.7607438564300537),
 ('AMBER DROP EARRINGS W LONG BEADS', 0.7573930025100708),
 ('GOLD/M.O.P …', 0.7413625121116638),
 ('ANT COPPER RED BOUDICCA BRACELET', 0.7289256453514099),
 ('WHITE VINT ART DECO CRYSTAL NECKLAC', 0.7265784740447998)]

Cool! The results are pretty relevant and match well with the input product.

However, this output is based on the vector of a single product only.

What if we want to recommend products to a user based on the multiple purchases he or she has made in the past? One simple solution is to take the average of all the vectors of the products the user has bought so far and use this resultant vector to find similar products.

We will use the function below, which takes in a list of product IDs and gives out a 100-dimensional vector that is the mean of the vectors of the products in the input list:

Recall that we have already created a separate list of purchase sequences for validation purposes.

Now let’s make use of that.

len(purchases_val[0])

Output: 314

The length of the first list of products purchased by a user is 314. We will pass this purchase sequence from the validation set to the function aggregate_vectors:

aggregate_vectors(purchases_val[0]).shape

Output: (100,)

Well, the function has returned an array of 100 dimensions. It means the function is working fine.

Now we can use this result to get the most similar products:

similar_products(aggregate_vectors(purchases_val[0]))

Output:

[('PARTY BUNTING', 0.661663293838501),
 ('ALARM CLOCK BAKELIKE RED ', 0.640213131904602),
 ('ALARM CLOCK BAKELIKE IVORY', 0.6287959814071655),
 ('ROSES REGENCY TEACUP AND SAUCER ', 0.6286610960960388),
 ('SPOTTY BUNTING', 0.6270893216133118),
 ('GREEN REGENCY TEACUP AND SAUCER', 0.6261675357818604)]

As it turns out, our system has recommended 6 products based on the entire purchase history of a user.

Moreover, if you want product suggestions based only on the last few purchases, you can use the same set of functions. Below, I pass only the last 10 products purchased as input:

similar_products(aggregate_vectors(purchases_val[0][-10:]))

Output:

[('PARISIENNE KEY CABINET ', 0.6296610832214355),
 ('FRENCH ENAMEL CANDLEHOLDER', 0.6204789876937866),
 ('VINTAGE ZINC WATERING CAN', 0.5855435729026794),
 ('CREAM HANGING HEART T-LIGHT HOLDER', 0.5839680433273315),
 ('ENAMEL FLOWER JUG CREAM', 0.5806118845939636)]

Feel free to play around with this code and try to get product recommendations for more sequences from the validation set.

I would be thrilled if you can further optimize this code or make it better.

End Notes

I had a great time writing this article and sharing my experience of working with word2vec for making product recommendations.

You can try to implement this code on similar non-textual sequence data.

Music recommendation can be a good use case, for example.

This experiment has inspired me to try other NLP techniques and algorithms to solve more non-NLP tasks.

Feel free to use the comments section below if you have any doubts or want to share your feedback.
