Building A Movie Recommendation Engine Using Pandas

Exploring the basic intuition behind the recommendation engines.

Nishit Jain, Apr 20

Overview

Recommendation engines are programs that compute the similarity between two entities and use it to produce targeted output.

At their root, all recommendation engines try to measure how similar two entities are.

The computed similarities can then be used to derive various kinds of recommendations and relationships between the entities.

Recommendation engines are mostly based on the following techniques:

Popularity Based Filtering.

Collaborative Filtering (User Based / Item Based).

Hybrid User-Item Based Collaborative Filtering.

Content Based Filtering.

Popularity Based Filtering

The most basic form of a recommendation engine recommends the most popular items to all users.

The recommendations are generalized: everyone gets the same ones, because nothing is personalized.

Engines of this kind are based on Popularity Based Filtering.

A use case for this model is the ‘Top News’ section of a news website, where the most popular news of the day is the same for everyone, irrespective of each user’s interests. This makes sense because top news is general and does not depend on an individual user’s interests.
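As a minimal sketch of popularity based filtering, the snippet below ranks movies by their mean rating on a small, hypothetical ratings table. The column names mirror the ratings dataset used later in this article; the `most_popular` helper and its `min_ratings` cutoff are assumptions made for illustration, not part of the original code.

```python
import pandas as pd

# Hypothetical toy ratings; column names mirror the ratings dataset used later.
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3, 3, 3],
    'movie_id': [10, 20, 10, 30, 10, 20, 30],
    'rating':   [5, 3, 4, 2, 5, 4, 1],
})

def most_popular(ratings_df, top_n=2, min_ratings=2):
    """Rank movies by mean rating, ignoring movies with too few ratings."""
    stats = ratings_df.groupby('movie_id')['rating'].agg(['count', 'mean'])
    stats = stats[stats['count'] >= min_ratings]
    ranked = stats.sort_values(['mean', 'count'], ascending=False)
    return ranked.head(top_n).index.tolist()

# Every user would receive this same list.
popular = most_popular(toy_ratings)
print(popular)
```

The same list is returned regardless of who asks, which is exactly the lack of personalization described above.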

Collaborative Filtering

In collaborative filtering, two entities collaborate to deduce recommendations on the basis of certain similarities between them.

These filtering techniques are broadly of two types:

User Based Collaborative Filtering: In user based collaborative filtering, we compute a similarity score between two users.

On the basis of that score, we recommend the items bought or liked by one user to the other user, assuming that similar users will like similar items.

This will become clearer when we go ahead and implement it.

Netflix, a major online streaming service, bases its recommendation engine on user based collaborative filtering.

Item Based Collaborative Filtering: In item based collaborative filtering, we compute the similarity between a new item and the items that existing users already consume.

Then, on the basis of that similarity, if user X likes item A and a new item P is most similar to item A, it makes sense to recommend item P to user X.
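The item based idea can be sketched on toy data as follows. The ratings, the item labels A, B and P, and the `most_similar_item` helper are all hypothetical; Pearson correlation between item columns is used here as one possible similarity measure.

```python
import pandas as pd

# Hypothetical ratings of three items (A, B, P) by four users.
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'movie_id': ['A', 'B', 'P'] * 4,
    'rating':   [5, 1, 5, 4, 2, 4, 2, 5, 1, 1, 4, 2],
})

# Rows are users, columns are items, cells are ratings.
user_item = toy_ratings.pivot_table(index='user_id', columns='movie_id', values='rating')

# Pearson correlation between every pair of item columns.
item_similarity = user_item.corr()

def most_similar_item(item):
    """Return the item whose ratings correlate most strongly with `item`."""
    return item_similarity[item].drop(item).idxmax()

print(most_similar_item('A'))
```

In this toy table, users who rate A highly also rate P highly, so P is the natural item to recommend to someone who likes A.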

Hybrid User-Item Based Collaborative Filtering: This technique is a mixture of the two techniques above, in which the recommendations are not based solely on either one.

E-commerce websites like Amazon employ this technique to recommend items to their customers.

Content Based Filtering: In this technique, users are recommended content similar to what they have used, watched, or liked the most before.

For example, if a user has mostly been listening to songs of a similar type (bit rate, bps, tunes, etc.), he or she will be recommended songs falling under the same category, where the category is decided based on certain features.

The best example of this category is Pandora Radio, a music streaming and automated music recommendation internet radio service.
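A rough sketch of content based filtering, assuming each item is described by a vector of features: the user profile is the average feature vector of the items the user liked, and candidates are ranked by cosine similarity to that profile. The feature table, the item names, and both helper functions are hypothetical.

```python
from math import sqrt
import pandas as pd

# Hypothetical one-hot genre features for four items.
item_features = pd.DataFrame(
    {'Action': [1, 1, 0, 0], 'Comedy': [0, 0, 1, 1], 'Sci-Fi': [1, 0, 0, 1]},
    index=['Movie W', 'Movie X', 'Movie Y', 'Movie Z'],
)

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = float((a * b).sum())
    norm = sqrt(float((a * a).sum())) * sqrt(float((b * b).sum()))
    return dot / norm if norm else 0.0

def recommend_similar(liked_items, top_n=1):
    """Rank items the user has not liked yet by similarity to the user's profile."""
    profile = item_features.loc[liked_items].mean()
    candidates = item_features.drop(index=liked_items)
    scores = candidates.apply(lambda row: cosine_similarity(row, profile), axis=1)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

content_picks = recommend_similar(['Movie W'])
print(content_picks)
```

Unlike collaborative filtering, nothing here depends on what other users did; only the features of the items matter.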

Coding & Implementation

We have a MovieLens dataset, and our objective is to apply various recommendation techniques from scratch using pandas: finding similarities between users, the most popular movies, and personalized recommendations for a targeted user based on user based collaborative filtering.

(We are exploring only one of the types because this article is about getting the basic intuition behind recommendation engines.)

We import pandas and some basic mathematical functions from the math library, and load the dataset into a dataframe object.

# Importing the required libraries.
import pandas as pd
from math import pow, sqrt

# Reading ratings dataset into a pandas dataframe object.
# engine='python' is set explicitly because the C parser does not support multi-character separators.
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv('data/ratings.dat', sep='::', names=r_cols, encoding='latin-1', engine='python')

# Getting number of users and movies from the dataset.
user_ids = ratings.user_id.unique().tolist()
movie_ids = ratings.movie_id.unique().tolist()
print('Number of Users: {}'.format(len(user_ids)))
print('Number of Movies: {}'.format(len(movie_ids)))

Output:
Number of Users: 6040
Number of Movies: 3706

Here is what the first 5 rows of our dataset look like.

(Figure: Ratings Dataset)

In this dataset, we have 4 columns and around 1M rows.

Except for unix_timestamp, all the columns are self-explanatory.

We won’t be using that column in our code anyway.
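For readers who want to try the loading step without the data files, here is a small sketch that parses a hypothetical three-line sample in the same `::`-separated layout and runs a few sanity checks. The sample rows and the variable names are assumptions; the last line shows how the unused unix_timestamp column could be converted to dates if it were ever needed.

```python
import io
import pandas as pd

# Hypothetical sample in the user::movie::rating::timestamp layout.
sample = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "2::1193::4::978298413\n"
)

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
# A multi-character separator requires the Python parsing engine.
sample_ratings = pd.read_csv(sample, sep='::', names=r_cols, engine='python')

# Quick sanity checks on the parsed frame.
n_users = sample_ratings['user_id'].nunique()
n_movies = sample_ratings['movie_id'].nunique()
rating_range = (sample_ratings['rating'].min(), sample_ratings['rating'].max())

# Seconds since the Unix epoch converted to timestamps.
rated_at = pd.to_datetime(sample_ratings['unix_timestamp'], unit='s')
print(n_users, n_movies, rating_range)
```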

Next, let’s see what our movies dataset looks like.

# Reading movies dataset into a pandas dataframe object.
m_cols = ['movie_id', 'movie_title', 'genre']
movies = pd.read_csv('data/movies.dat', sep='::', names=m_cols, encoding='latin-1', engine='python')

(Figure: Movie Dataset)

All the column names are self-explanatory.

As seen in the above dataframe, the genre column holds pipe-separated values, which cannot be processed for recommendations as they are.

Hence, we need to generate a column for every genre such that its value is 1 if the movie belongs to that genre and 0 otherwise (a sort of one-hot encoding).

Also, we need to split the release year out of the movie_title column into a new column, which gives us another important feature.
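As an alternative sketch of these two steps, pandas string methods can do the encoding without explicit loops. The two-row `toy_movies` frame is hypothetical, and the regular expression assumes that every title ends with a four-digit year in parentheses.

```python
import pandas as pd

# Hypothetical rows in the same layout as the movies dataset.
toy_movies = pd.DataFrame({
    'movie_id': [1, 2],
    'movie_title': ['Toy Story (1995)', 'Heat (1995)'],
    'genre': ["Animation|Children's|Comedy", 'Action|Crime|Thriller'],
})

# One 0/1 column per distinct genre found in the pipe-separated strings.
genre_flags = toy_movies['genre'].str.get_dummies(sep='|')

# Pull the trailing "(YYYY)" out of the title, then strip it from the title.
toy_movies['release_year'] = toy_movies['movie_title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
toy_movies['movie_title'] = toy_movies['movie_title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)

encoded = pd.concat([toy_movies.drop(columns='genre'), genre_flags], axis=1)
print(encoded)
```

Anchoring the pattern at the end of the title keeps any earlier parentheses in the title intact.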

# Getting series of lists by applying split operation.
movies.genre = movies.genre.str.split('|')

# Getting distinct genre types for generating columns of genre type.
genre_columns = list(set([j for i in movies['genre'].tolist() for j in i]))

# Iterating over every list to create and fill values into columns.
for j in genre_columns:
    movies[j] = 0
for i in range(movies.shape[0]):
    for j in genre_columns:
        if(j in movies['genre'].iloc[i]):
            movies.loc[i,j] = 1

# Separating movie title and year part using split function.
# Note: this splits at the first "(", so a title that itself contains parentheses is cut there.
split_values = movies['movie_title'].str.split("(", n = 1, expand = True)

# setting 'movie_title' values to title part.
movies.movie_title = split_values[0]

# creating 'release_year' column.
movies['release_year'] = split_values[1]

# Cleaning the release_year series (regex=False treats ')' as a literal character).
movies['release_year'] = movies.release_year.str.replace(')','', regex=False)

# dropping 'genre' columns as it has already been one hot encoded.
movies.drop('genre',axis=1,inplace=True)

Here’s how the dataframe looks after processing:

(Figure: Data Frame View for Movies Dataset After Pre-Processing)

Now, let us write a few getter functions that will be used frequently in our code, so that we do not need to repeat the same lookups; this also improves the readability and reusability of the code.

# Getting the rating given by a user to a movie.
def get_rating_(userid,movieid):
    return (ratings.loc[(ratings.user_id==userid) & (ratings.movie_id == movieid),'rating'].iloc[0])

# Getting the list of all movie ids the specified user has rated.
def get_movieids_(userid):
    return (ratings.loc[(ratings.user_id==userid),'movie_id'].tolist())

# Getting the movie titles against the movie id.
def get_movie_title_(movieid):
    return (movies.loc[(movies.movie_id == movieid),'movie_title'].iloc[0])

Similarity Scores

In this implementation, the similarity between two users will be calculated in two ways: on the basis of the distance between them (i.e. Euclidean distance) and by calculating the Pearson correlation between them.

We will write two functions, one to calculate the similarity on the basis of Euclidean distance and the other on the basis of Pearson correlation; the reason for writing both will become clear shortly.

def distance_similarity_score(user1,user2):
    '''
    user1 & user2 : user ids of two users between which similarity score is to be calculated.
    '''
    # Count of movies watched by both the users.
    both_watch_count = 0
    for element in ratings.loc[ratings.user_id==user1,'movie_id'].tolist():
        if element in ratings.loc[ratings.user_id==user2,'movie_id'].tolist():
            both_watch_count += 1
    if both_watch_count == 0 :
        return 0

    # Calculating distance based similarity between both the users.
    distance = []
    for element in ratings.loc[ratings.user_id==user1,'movie_id'].tolist():
        if element in ratings.loc[ratings.user_id==user2,'movie_id'].tolist():
            rating1 = get_rating_(user1,element)
            rating2 = get_rating_(user2,element)
            distance.append(pow(rating1 - rating2, 2))
    total_distance = sum(distance)

    # Adding one to the denominator to avoid divide by zero error.
    return 1/(1+sqrt(total_distance))

print('Distance based similarity between user ids 1 & 310: {}'.format(distance_similarity_score(1,310)))

Output:
Distance based similarity between user ids 1 & 310: 0.14459058185587106

Calculating similarity scores based on distance has an inherent problem.

The raw distance has no natural scale: it grows with the number of movies both users have rated, so we have no obvious threshold for deciding whether two users are close enough or far enough apart.

The Pearson correlation method resolves this problem, as it always returns a value between -1 and 1, which gives us clear bounds for closeness.
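To make the contrast concrete, here is a minimal sketch on hypothetical data in which one user rates every movie exactly one star lower than the other. The list-based helpers `distance_score` and `pearson_score` are illustrative stand-ins that follow the same formulas as the article's functions but do not touch the ratings dataframe.

```python
from math import sqrt

# Hypothetical ratings by two users for the same five movies.
# user_b rates every movie exactly one star lower than user_a.
user_a = [5, 4, 3, 5, 4]
user_b = [4, 3, 2, 4, 3]

def distance_score(a, b):
    """Similarity from Euclidean distance: 1 / (1 + distance)."""
    return 1 / (1 + sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))

def pearson_score(a, b):
    """Pearson correlation computed from sums, returning 0 if undefined."""
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    numerator = sum(x * y for x, y in zip(a, b)) - sum_a * sum_b / n
    denominator = sqrt((sum(x * x for x in a) - sum_a ** 2 / n)
                       * (sum(y * y for y in b) - sum_b ** 2 / n))
    return numerator / denominator if denominator else 0

d_score = distance_score(user_a, user_b)
p_score = pearson_score(user_a, user_b)
print(round(d_score, 3), round(p_score, 3))
```

The two users have identical tastes apart from a constant offset, so the correlation is at its maximum of 1, while the distance based score comes out at roughly 0.31 and would keep shrinking as more co-rated movies are added.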

def pearson_correlation_score(user1,user2):
    '''
    user1 & user2 : user ids of two users between which similarity score is to be calculated.
    '''
    # A list of movies watched by both the users.
    both_watch_count = []

    # Finding movies watched by both the users.
    for element in ratings.loc[ratings.user_id==user1,'movie_id'].tolist():
        if element in ratings.loc[ratings.user_id==user2,'movie_id'].tolist():
            both_watch_count.append(element)

    # Returning '0' correlation for no common movies.
    if len(both_watch_count) == 0 :
        return 0

    # Calculating Co-Variances.
    rating_sum_1 = sum([get_rating_(user1,element) for element in both_watch_count])
    rating_sum_2 = sum([get_rating_(user2,element) for element in both_watch_count])
    rating_squared_sum_1 = sum([pow(get_rating_(user1,element),2) for element in both_watch_count])
    rating_squared_sum_2 = sum([pow(get_rating_(user2,element),2) for element in both_watch_count])
    product_sum_rating = sum([get_rating_(user1,element) * get_rating_(user2,element) for element in both_watch_count])

    # Returning pearson correlation between both the users.
    numerator = product_sum_rating - ((rating_sum_1 * rating_sum_2) / len(both_watch_count))
    denominator = sqrt((rating_squared_sum_1 - pow(rating_sum_1,2) / len(both_watch_count)) * (rating_squared_sum_2 - pow(rating_sum_2,2) / len(both_watch_count)))

    # Handling 'Divide by Zero' error.
    if denominator == 0:
        return 0
    return numerator/denominator

print('Pearson Correlation between user ids 11 & 30: {}'.format(pearson_correlation_score(11,30)))

Output:
Pearson Correlation between user ids 11 & 30: 0.2042571684752679

Most Similar Users

The objective is to find the users most similar to the targeted user.

We have two metrics for the score, i.e. distance and correlation.

Now, we will write a function for this.

def most_similar_users_(user1,number_of_users,metric='pearson'):
    '''
    user1 : Targeted User
    number_of_users : number of most similar users you want to user1.
    metric : metric to be used to calculate inter-user similarity score. ('pearson' or else)
    '''
    # Getting distinct user ids.
    user_ids = ratings.user_id.unique().tolist()

    # Getting similarity score between targeted and every other user in the list(or subset of the list).
    if(metric == 'pearson'):
        similarity_score = [(pearson_correlation_score(user1,nth_user),nth_user) for nth_user in user_ids[:100] if nth_user != user1]
    else:
        similarity_score = [(distance_similarity_score(user1,nth_user),nth_user) for nth_user in user_ids[:100] if nth_user != user1]

    # Sorting in descending order.
    similarity_score.sort()
    similarity_score.reverse()

    # Returning the top most 'number_of_users' similar users.
    return similarity_score[:number_of_users]

print(most_similar_users_(23,5))

Output:
[(0.936585811581694, 61), (0.7076731463403717, 41), (0.6123724356957956, 21), (0.5970863767331771, 25), (0.5477225575051661, 64)]

As we can see, the output is a list of tuples, each pairing a similarity score with the user id of one of the top 5 users most similar to the targeted user.

The metric used here is Pearson correlation.

You may have noticed that the most-similar-users logic can be strengthened further by considering other factors as well, such as age, sex, occupation, etc.

Here, we have built our logic on only one feature, i.e. rating.
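The same ranking can also be sketched in a vectorized way, assuming the ratings fit comfortably in a user-by-movie table. The toy ratings, the `top_similar_users` helper, and its `min_common` threshold are hypothetical; `DataFrame.corr` computes Pearson correlation over the movies that both users have rated.

```python
import pandas as pd

# Hypothetical toy ratings; user 4 has rated only two of the three movies.
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    'movie_id': [10, 20, 30, 10, 20, 30, 10, 20, 30, 10, 20],
    'rating':   [5, 3, 1, 4, 3, 2, 1, 3, 5, 5, 4],
})

def top_similar_users(ratings_df, target, n=2, min_common=3):
    """Rank other users by Pearson correlation with `target`."""
    user_item = ratings_df.pivot_table(index='user_id', columns='movie_id', values='rating')
    # After transposing, each column is a user; pairs with fewer than
    # `min_common` co-rated movies get NaN and are dropped below.
    user_corr = user_item.T.corr(min_periods=min_common)
    scores = user_corr[target].drop(target).dropna()
    return scores.sort_values(ascending=False).head(n)

top = top_similar_users(toy_ratings, target=1)
print(top)
```

Requiring a minimum number of co-rated movies is one simple way to avoid trusting a correlation that rests on only one or two shared ratings.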

Getting Movie Recommendations for Targeted User

The concept is very simple.

First, we iterate over only those movies not watched (or rated) by the targeted user, restricting the candidates to movies rated by users who are positively correlated with the targeted user.

Here, we use a weighted similarity approach: we take the product of rating and similarity score, so that highly similar users affect the recommendations more than less similar ones.

Then, we sort the list of movie ids by score and return the movie titles for those movie ids.

Let us write a function for the same.

def get_recommendation_(userid):
    user_ids = ratings.user_id.unique().tolist()
    total = {}
    similarity_sum = {}

    # Iterating over subset of user ids.
    for user in user_ids[:100]:
        # not comparing the user to itself (obviously!)
        if user == userid:
            continue

        # Getting similarity score between the users.
        score = pearson_correlation_score(userid,user)

        # not considering users having zero or less similarity score.
        if score <= 0:
            continue

        # Getting weighted similarity score and sum of similarities between both the users.
        for movieid in get_movieids_(user):
            # Only considering not watched/rated movies
            if movieid not in get_movieids_(userid) or get_rating_(userid,movieid) == 0:
                # Note: as written, the two '= 0' assignments reset the running sums for every user,
                # so only the last contributing user's rating survives for each movie.
                # Using total.setdefault(movieid, 0) and similarity_sum.setdefault(movieid, 0)
                # instead would accumulate across users.
                total[movieid] = 0
                total[movieid] += get_rating_(user,movieid) * score
                similarity_sum[movieid] = 0
                similarity_sum[movieid] += score

    # Normalizing ratings
    ranking = [(tot/similarity_sum[movieid],movieid) for movieid,tot in total.items()]
    ranking.sort()
    ranking.reverse()

    # Getting movie titles against the movie ids.
    recommendations = [get_movie_title_(movieid) for score,movieid in ranking]
    return recommendations[:10]

print(get_recommendation_(32))

Output:
['Invisible Man, The ', 'Creature From the Black Lagoon, The ', 'Hellraiser ', 'Almost Famous ', 'Way of the Gun, The ', 'Shane ', 'Naked Gun 2 1/2: The Smell of Fear, The ', "Kelly's Heroes ", 'Official Story, The ', 'Everything You Always Wanted to Know About Sex ']

As we can see in the output, we get the top 10 recommended movies for the user with user id 32, using Pearson correlation as the metric.

You can do the same exercise with Euclidean distance as the metric, and the results will most likely differ.
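The weighted-average step described in this section can be sketched on toy data as follows. The neighbour names, similarity scores, ratings, and the `weighted_recommendations` helper are all hypothetical; the point is only to show how each movie's score is the similarity-weighted mean of the neighbours' ratings, accumulated across all neighbours.

```python
# Hypothetical inputs: similarity of each neighbour to the target user,
# and each neighbour's ratings for movies the target has not rated.
neighbour_scores = {'u1': 0.9, 'u2': 0.5}
neighbour_ratings = {
    'u1': {'Movie A': 5, 'Movie B': 2},
    'u2': {'Movie A': 3, 'Movie B': 4, 'Movie C': 5},
}

def weighted_recommendations(scores, ratings_by_user, top_n=3):
    """Rank movies by the similarity-weighted mean of neighbour ratings."""
    total = {}
    similarity_sum = {}
    for user, score in scores.items():
        # Skip neighbours with zero or negative similarity.
        if score <= 0:
            continue
        for movie, rating in ratings_by_user[user].items():
            # Accumulate across neighbours instead of overwriting.
            total[movie] = total.get(movie, 0.0) + rating * score
            similarity_sum[movie] = similarity_sum.get(movie, 0.0) + score
    ranking = [(total[m] / similarity_sum[m], m) for m in total]
    ranking.sort(reverse=True)
    return ranking[:top_n]

picks = weighted_recommendations(neighbour_scores, neighbour_ratings)
print([movie for _, movie in picks])
```

Dividing by the sum of similarities keeps a movie rated by many neighbours from outranking a better-rated movie simply because more people saw it.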

Learning & Conclusion

We implemented a movie recommendation engine using just pandas and basic math library functions.

Along the way, we got to know the basic intuition behind recommendation engines.

Obviously, there is a lot more to recommendation engines than this, as multiple features and factors influence the recommendations, not just the ratings.

Furthermore, in the next blog we will derive recommendations based on other features of the users and the movies, and also explore a famous technique for recommendation engines, i.e. Matrix Factorization, using the turicreate library.

The GitHub repository for the code in the blog can be found here.
