Building a Collaborative Filtering Recommender System with ClickStream Data

How to implement a recommendation algorithm based on prior implicit feedback.

Susan Li · Apr 19

Recommender systems are everywhere, helping you find everything from books to romantic dates, hotels to restaurants.

There are all kinds of recommender systems for all sorts of situations, depending on your needs and available data.

Explicit vs. Implicit

Let's face it, explicit feedback is hard to collect, as it requires additional input from the users.

The users give explicit feedback only when they choose to do so.

As a result, most of the time, people don't provide ratings at all (I myself am totally guilty of this on Amazon!).

Therefore, explicit data is extremely scarce.

On the other hand, implicit data is easy to collect in large quantities without any effort from the users.

The goal is to convert observed user behavior into user preferences that indirectly reflect opinion.

For example, a user that bookmarked many articles by the same author probably likes that author.

The Data

Our goal today is to develop a recommender system using implicitly collected data, which in our case is clickstream data.

It is very hard to find publicly available data for this project.

I am using the "Articles sharing and reading from CI&T DeskDrop" dataset.

DeskDrop is an internal communications platform that allows companies' employees to share relevant articles with their peers and collaborate around them.

The data contains about 73k user interactions on more than 3k public articles shared on the platform. More importantly, it contains rich implicit feedback: different interaction types were logged, making it possible to infer the user's level of interest in the articles.

And we will be using the implicit library, a fast Python collaborative filtering package for implicit feedback datasets, for our matrix factorization.

Data Pre-processing

- Remove columns that we do not need.
- Remove eventType == 'CONTENT REMOVED' from articles_df.
- Merge interactions_df with articles_df.
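The steps above might look like this (a sketch only: the column names follow the DeskDrop dataset, while the toy frames stand in for the real users_interactions.csv and shared_articles.csv files):

```python
import pandas as pd

# Toy stand-ins for the real interaction and article CSVs (hypothetical rows)
interactions_df = pd.DataFrame({
    'personId': [1, 1, 2],
    'contentId': [10, 11, 10],
    'eventType': ['VIEW', 'LIKE', 'VIEW'],
})
articles_df = pd.DataFrame({
    'contentId': [10, 11, 12],
    'title': ['A', 'B', 'C'],
    'eventType': ['CONTENT SHARED', 'CONTENT SHARED', 'CONTENT REMOVED'],
})

# Remove articles that were later deleted from the platform
articles_df = articles_df[articles_df['eventType'] != 'CONTENT REMOVED']

# Keep only the columns we need and merge interactions with articles
df = pd.merge(interactions_df[['contentId', 'personId', 'eventType']],
              articles_df[['contentId', 'title']],
              how='inner', on='contentId')
```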

implicit_rec_preprocess.py

This is the data set that will get us started:

Table 1

This tells us what event type each person had with each piece of content.

There are many duplicated records and we will remove them shortly.

df['eventType'].value_counts()

Figure 1

The eventType values are:

- VIEW: The user has opened the article. A page view on a content site can mean many things: the user may be interested, or may just be lost or clicking randomly.
- LIKE: The user has liked the article.
- BOOKMARK: The user has bookmarked the article for easy return in the future. This is a strong indication that the user found something of interest.
- COMMENT CREATED: The user left a comment on the article.
- FOLLOW: The user chose to be notified of any new comments on the article.

We are going to associate each eventType with a weight or strength.

It is reasonable to assume, for example, that bookmarking an article indicates a higher level of interest in that article than liking it.

event_type_strength = {
    'VIEW': 1.0,
    'LIKE': 2.0,
    'BOOKMARK': 3.0,
    'FOLLOW': 4.0,
    'COMMENT CREATED': 5.0,
}

df['eventStrength'] = df['eventType'].apply(lambda x: event_type_strength[x])

Table 2

Drop duplicated records.

Group eventStrength by person and content.

df = df.drop_duplicates()
grouped_df = df.groupby(['personId', 'contentId', 'title']).sum().reset_index()
grouped_df.sample(10)

We get the final result of grouped eventStrength:

Table 3

Alternating Least Squares Recommender Model Fitting

Instead of representing an explicit rating, the eventStrength can represent a "confidence" in how strong the interaction was.

Articles with a larger total eventStrength from a person carry more weight in our ratings matrix.

To get around “negative integer” warning, I will have to create numeric person_id and content_id columns.
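One way to create those numeric columns (a sketch; the toy grouped_df and its hashed ids are hypothetical stand-ins for the real data) is with pandas categorical codes:

```python
import pandas as pd

# Toy grouped_df with the original hashed ids (hypothetical values)
grouped_df = pd.DataFrame({
    'personId': [-9223121837663643404, -9223121837663643404, 1032019229384696495],
    'contentId': [-8949113594875411859, 310515487419366995, -8949113594875411859],
    'eventStrength': [1.0, 3.0, 2.0],
})

# Map the large hashed ids to small non-negative integer codes
grouped_df['person_id'] = grouped_df['personId'].astype('category').cat.codes
grouped_df['content_id'] = grouped_df['contentId'].astype('category').cat.codes
```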

- Create two matrices: one for fitting the model (content-person) and one for making recommendations (person-content).
- Initialize the Alternating Least Squares (ALS) recommendation model.
- Fit the model using the sparse content-person matrix.

We set the type of our matrix to double for the ALS function to run properly.

implicit_als_model.py

Finding the Similar Articles

We are going to find the top 10 most similar articles to content_id = 450, titled "Google's fair use victory is good for open source". This article appears to be about Google and open source.

- Get the person and content vectors from our trained model.
- Calculate the vector norms.
- Calculate the similarity scores.
- Get the top 10 contents.
- Create a list of content-score tuples of the articles most similar to this one.
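The steps above can be sketched as follows (an illustration only: content_vecs stands in for the trained model's content factors, here replaced by random vectors so the snippet runs standalone, and the article count of 3047 is a hypothetical value):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the trained content factors from the fitted ALS model
content_vecs = rng.normal(size=(3047, 20))

content_id = 450  # the "Google's fair use victory..." article

# Cosine similarity: dot products scaled by the vector norms
content_norms = np.sqrt((content_vecs * content_vecs).sum(axis=1))
scores = content_vecs.dot(content_vecs[content_id]) / (
    content_norms * content_norms[content_id])

# Indices of the top 10 most similar articles (the first is the article itself)
top_idx = np.argpartition(scores, -10)[-10:]
similar = sorted(zip(top_idx, scores[top_idx]), key=lambda x: -x[1])
```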

similar_content.py

Figure 2

The 1st article is the queried article itself.

The other 9 articles are about Google, open-source software, the cloud, AI, or other tech companies. I am sure you will agree with me that they are all somewhat similar to the first one!

Recommend Articles to Persons

The following function will return the top 10 recommendations for any given person, chosen based on the person/content vectors and restricted to contents the person has never interacted with.

- Get the interaction scores from the sparse person-content matrix.
- Add 1 to everything, so that articles with no interaction yet become equal to 1.
- Make articles already interacted with zero.
- Get the dot product of the person vector and all content vectors.
- Scale this recommendation vector between 0 and 1.
- Multiply the recommendations for content already interacted with by zero.
- Sort the indices of the content into order of best recommendations.
- Start an empty list to store titles and scores.
- Append titles and scores to the list.
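The function described by the steps above can be sketched like this (a self-contained illustration: the toy factor matrices, titles, and interaction matrix are hypothetical stand-ins for the trained model's outputs, and min-max scaling is done by hand rather than with a library scaler):

```python
import numpy as np
import pandas as pd
import scipy.sparse as sparse

def recommend(person_id, sparse_person_content, person_vecs, content_vecs,
              titles, num_contents=10):
    # Interaction scores for this person; add 1 so that articles with
    # no interaction yet become equal to 1
    interactions = sparse_person_content[person_id, :].toarray().ravel() + 1
    # Articles already interacted with (score > 1) become zero
    interactions[interactions > 1] = 0
    # Dot product of the person vector with all content vectors
    rec_vector = content_vecs.dot(person_vecs[person_id])
    # Scale the recommendation vector between 0 and 1
    scaled = (rec_vector - rec_vector.min()) / (rec_vector.max() - rec_vector.min())
    # Content already interacted with has its recommendation multiplied by zero
    recommend_vector = interactions * scaled
    # Sort content indices, best recommendations first
    content_idx = np.argsort(recommend_vector)[::-1][:num_contents]
    return pd.DataFrame({'content_id': content_idx,
                         'title': [titles[i] for i in content_idx],
                         'score': recommend_vector[content_idx]})

# Toy demonstration data (random stand-ins for the trained ALS factors)
rng = np.random.default_rng(1)
person_vecs, content_vecs = rng.normal(size=(3, 8)), rng.normal(size=(5, 8))
titles = [f'article {i}' for i in range(5)]
# Person 0 has already interacted with contents 0 and 2
sparse_person_content = sparse.csr_matrix(
    ([3.0, 2.0], ([0, 0], [0, 2])), shape=(3, 5))

recs = recommend(0, sparse_person_content, person_vecs, content_vecs, titles)
```

Note how already-seen articles end up with a score of exactly zero, so they can never appear at the top of the ranking.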

Get the trained person and content vectors and convert them to csr matrices. Then create recommendations for the person with id 50.

implicit_rec_als_id_50.py

Figure 3

Here we have the top 10 recommendations for person_id = 50.

Do they make sense? Let's get the top 10 articles this person has interacted with.

grouped_df.loc[grouped_df['person_id'] == 50].sort_values(by=['eventStrength'], ascending=False)[['title', 'person_id', 'eventStrength']].head(10)

Table 4

Apparently, this person is interested in articles on open-source CMSs such as Drupal; she also reads software development and business-related articles about "Google", "Slack" or "Johnson & Johnson".

The articles we recommended to her include Drupal for digital experience, information technology vs. humanity, software development, and business articles about Google. Pretty impressive!

Let's try one more.

We recommended the following articles to person_id = 1:

person_id = 1
recommendations = recommend(person_id, sparse_person_content, person_vecs, content_vecs)
print(recommendations)

Figure 4

The following are the articles person_id = 1 has interacted with:

grouped_df.loc[grouped_df['person_id'] == 1].sort_values(by=['eventStrength'], ascending=False)[['title', 'eventStrength', 'person_id']]

Table 5

Apparently, this person has interacted with only 5 articles and seems to have very limited interests.

The articles she interacted with were about learning the Japanese language and/or Android development. The articles we recommended to her include learning the Japanese language, Android development, and user interface design. Cool!

The Jupyter notebook can be found on Github.

Happy Easter!

References:

- AlternatingLeastSquares – Implicit 0.3.8 documentation: A Recommendation Model based off the algorithms described in the paper 'Collaborative Filtering for Implicit Feedback…' (implicit.readthedocs.io)
- ALS Implicit Collaborative Filtering: Continuing on the collaborative filtering theme from my collaborative filtering with binary data example i'm going to… (medium.com)
- Recommender Systems in Python 101: Using data from Articles sharing and reading from CI&T DeskDrop (www.kaggle.com)
