What is the ratio of reordered to not-reordered products? The code below answers that.

# Find out how many products have been reordered before.
print(len(order_products_total[order_products_total.reordered == 1]), 'products have reordered before')
print(len(order_products_total[order_products_total.reordered == 0]), "products haven't reordered before")

# Find out the ratio.
print(len(order_products_total[order_products_total.reordered == 1]) / order_products_total.shape[0], 'have reordered before')
print(len(order_products_total[order_products_total.reordered == 0]) / order_products_total.shape[0], "haven't reordered before")

Output:
19955360 products have reordered before
13863746 products haven't reordered before
0.59 have reordered before
0.41 haven't reordered before
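The same two shares can also be read off in one step. This is a minimal sketch on a made-up five-row `reordered` column (the `toy` frame is a stand-in, not the real `order_products_total` table):

```python
import pandas as pd

# Toy stand-in for order_products_total (values are made up for illustration).
toy = pd.DataFrame({"reordered": [1, 1, 1, 0, 0]})

# value_counts(normalize=True) returns the share of each value directly.
shares = toy["reordered"].value_counts(normalize=True)
print(shares)

# Because the flag is 0/1, its mean is the reordered share.
reordered_share = toy["reordered"].mean()
print(round(reordered_share, 2))
```

On the toy data, three of the five rows are reorders, so both approaches give a share of 0.6.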
How many items are in each product category, and what are the top ten product categories by number of items? To answer this, we will explore the aisles, departments and products tables.

[Figure: snapshot of the aisles, departments and products tables]

# Merging tables together.
products_departments = products.merge(departments, on='department_id', how='left')
products_departments_aisles = products_departments.merge(aisles, on='aisle_id', how='left')
products_departments_aisles.head()

# Counting how many items are in each product category.
products_departments_aisles.groupby('department')['product_id'].count().reset_index().sort_values(by='product_id', ascending=False).head(10)

[Figure: top 10 product categories with the most items]
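A left merge keeps every product even when its key has no match, so it can be worth checking for unmatched rows before counting. The sketch below uses small made-up tables (`products_toy`, `departments_toy`) rather than the real data; the `validate` and `indicator` arguments are standard `DataFrame.merge` options.

```python
import pandas as pd

# Toy stand-ins for the products and departments tables (made-up rows).
products_toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Banana", "Milk", "Soap"],
    "department_id": [4, 16, 99],   # 99 has no match on purpose
})
departments_toy = pd.DataFrame({
    "department_id": [4, 16],
    "department": ["produce", "dairy eggs"],
})

merged = products_toy.merge(
    departments_toy,
    on="department_id",
    how="left",
    validate="many_to_one",   # raises if department_id repeats in departments_toy
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

# Products whose department_id found no partner in departments_toy.
unmatched = merged[merged["_merge"] == "left_only"]

# Rows with a missing department are dropped by groupby by default.
items_per_department = merged.groupby("department")["product_id"].count()

print(unmatched["product_name"].tolist())
print(items_per_department.sort_values(ascending=False))
```

If `unmatched` is non-empty on real data, the per-department counts will silently leave those products out.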
Which items do people purchase the most? To explore this question, we need to merge the products_departments_aisles and order_products_total tables.

# Merging products_departments_aisles and order_products_total.
df = order_products_total.merge(products_departments_aisles, on='product_id', how='left')
df.head()

# Find out the top 15 items people purchased the most.
top_15_products = df.product_name.value_counts(ascending=False).reset_index().head(15)
top_15_products.columns = ['product_name', 'count']
top_15_products

[Figure: top 15 items people purchased the most]

The 15 most purchased items are shown above. Most of them are organic fruits/veggies, and all of them are fruits/veggies.
Using the same logic, we can find the top 15 aisles and the top 15 departments with the most purchases.

# Finding top 15 aisles.
top_15_aisles = df.aisle.value_counts(ascending=False).reset_index().head(15)
top_15_aisles.columns = ['aisle_name', 'count']
top_15_aisles

# Finding top 15 departments.
top_15_department = df.department.value_counts(ascending=False).reset_index().head(15)
top_15_department.columns = ['department_name', 'count']
top_15_department

[Figure: top 15 aisles and top 15 departments with the most purchases]
What is the reorder ratio per department and per aisle?

# Find out reorder ratio per department.
reorder_ratio_per_dep = df.groupby('department')['reordered'].mean().reset_index()
reorder_ratio_per_dep.columns = ['department', 'reorder_ratio']
reorder_ratio_per_dep.sort_values(by='reorder_ratio', ascending=False)

# Find out reorder ratio per aisle.
reorder_ratio_per_aisle = df.groupby('aisle')['reordered'].mean().reset_index()
reorder_ratio_per_aisle.columns = ['aisle', 'reorder_ratio']
reorder_ratio_per_aisle.sort_values(by='reorder_ratio', ascending=False)

Among departments, dairy eggs has the highest reorder ratio and personal care has the lowest. Among aisles, milk has the highest reorder ratio and spices seasonings has the lowest.
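The per-group ratio works because `reordered` is a 0/1 flag, so its mean within a group is the share of reordered rows. A small sketch on made-up rows, which also reports how many rows each ratio rests on (the `n_rows` column is an addition for illustration, not part of the original analysis):

```python
import pandas as pd

# Made-up rows; department names are borrowed from the text only as labels.
toy = pd.DataFrame({
    "department": ["dairy eggs", "dairy eggs", "dairy eggs",
                   "personal care", "personal care"],
    "reordered":  [1, 1, 0, 0, 1],
})

# Named aggregation: one column for the ratio, one for the group size.
summary = (
    toy.groupby("department")["reordered"]
       .agg(reorder_ratio="mean", n_rows="count")
       .sort_values("reorder_ratio", ascending=False)
)
print(summary)
```

Keeping the row count next to the ratio makes it easier to spot groups whose ratio is based on very few purchases.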
Define Metrics

I want to define a metric for my shopping cart recommender. There are several common metrics that people use for evaluation, for example precision and recall.

[Figure: precision vs. recall]

For my project, I decided to use recall as my evaluation metric. Here that means: what percentage of the items I recommend does the customer actually purchase? For example, if I recommend 5 items to a customer and he/she bought 4 of them, then recall = 0.80. In other words, 80% of my recommended items are in the customer's shopping cart, and the remaining 20% are new items that I am recommending to the customer.
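The 4-out-of-5 example can be worked through in a few lines. This is a sketch with made-up item lists; `hits_at_k` is a hypothetical helper, not something from the project code. It computes the score two ways, because the denominator matters. In the usual information-retrieval vocabulary, dividing the hits by the number of recommendations is often called precision at k, and dividing by the number of purchased items is called recall at k. The score used in the rest of this post is the first one.

```python
def hits_at_k(recommended, purchased):
    """Return the recommended items that the customer also purchased."""
    purchased_set = set(purchased)
    return [item for item in recommended if item in purchased_set]

# Made-up data: 5 recommendations and a purchase history of 8 distinct items.
recommended = ["banana", "milk", "spinach", "avocado", "soap"]
purchased = ["banana", "milk", "spinach", "avocado",
             "eggs", "bread", "rice", "tea"]

hits = hits_at_k(recommended, purchased)

# Share of the recommendations that were purchased (the 4-out-of-5 score).
score_over_recommendations = len(hits) / len(recommended)

# Share of the purchase history that the recommendations cover.
score_over_purchases = len(hits) / len(set(purchased))

print(score_over_recommendations)
print(score_over_purchases)
```

With these lists the first score is 4/5 and the second is 4/8, so the same four hits give different numbers depending on the denominator.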
Shopping Cart Recommender

There are different ways to build a recommendation system, for example content-based filtering, collaborative (item-based or user-based) filtering, and a hybrid of both. In this project, I will explore how to build a shopping cart recommender using cosine similarity, a method used in collaborative filtering. The design of the recommender is to first find the customers who have reordered before and the items that have been reordered before, then calculate the cosine similarity between those users based on the products they purchased, and finally generate a list of 5 recommendations. Below is the code that finds the customers who have reordered before and calculates the cosine similarity.
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# get the list of order lines that have been reordered before
reorders = order_products_total[order_products_total['reordered'] == 1]
orders2 = orders[['order_id', 'user_id']]

# merge to get user_id and product_id
user_orders = reorders.merge(orders2, on='order_id')

# attach product_name, which the matrix below is built on
# (assumes order_products_total does not already carry a product_name column)
user_orders = user_orders.merge(products[['product_id', 'product_name']], on='product_id')

# keep only the high volume products, i.e. products reordered more than once
# (each row's product_id is mapped to its count so the flag lines up row by row)
product_counts = user_orders['product_id'].value_counts()
user_orders['high_volume'] = user_orders['product_id'].map(product_counts) > 1
high_volume = user_orders[user_orders['high_volume']]

# get a user x product matrix of how often each user purchased each high volume item
high_volume_users = high_volume.groupby(['user_id', 'product_name']).size().sort_values(ascending=False).unstack().fillna(0)

# calculate similarity between each pair of users
cosine_dists = pd.DataFrame(cosine_similarity(high_volume_users), index=high_volume_users.index, columns=high_volume_users.index)
cosine_dists.head()

[Figure: snapshot of the user similarity matrix]
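To make the similarity step concrete, here is a toy sketch in plain NumPy. The three users, four products and all counts are made up, and the sketch assumes every user has at least one purchase (otherwise the row length would be zero). It computes the cosine similarity between users and then scores products for one user as a similarity-weighted sum of everyone's purchase counts.

```python
import numpy as np

# Toy user x product count matrix (rows: users, columns: products); values are made up.
users = ["u1", "u2", "u3"]
products_toy = ["banana", "milk", "spinach", "soap"]
counts = np.array([
    [3.0, 1.0, 0.0, 0.0],   # u1
    [2.0, 1.0, 1.0, 0.0],   # u2
    [0.0, 0.0, 0.0, 4.0],   # u3
])

# Cosine similarity: dot product of two rows divided by the product of their lengths.
norms = np.linalg.norm(counts, axis=1, keepdims=True)
unit_rows = counts / norms
similarity = unit_rows @ unit_rows.T

# Score every product for u1: similarity-weighted sum of all users' counts.
target = users.index("u1")
scores = similarity[target] @ counts
ranking = [products_toy[i] for i in np.argsort(-scores)]

print(np.round(similarity, 3))
print(ranking)
```

In this sketch a user's similarity to themselves is 1, so their own purchases carry the largest weight in the score. Items they already buy therefore tend to rank near the top, ahead of items contributed only by similar users.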
Next, I define a function for the recommendation system. It recommends 5 items based on users whose purchase history is similar to that of the target customer.

def Recommender_System(user_id):
    '''
    Enter a user_id and return a list of 5 recommendations.
    '''
    # product x user purchase matrix (transpose of the user x product matrix)
    p = high_volume_users.T
    # score each product by the similarity-weighted purchases of all users
    recommendations = pd.Series(np.dot(p.values, cosine_dists[user_id]), index=p.index)
    # head() returns the 5 highest-scoring products
    return recommendations.sort_values(ascending=False).head()

# recommendation for customer id 175965.
Recommender_System(175965)

[Figure: 5 recommendations generated for customer id 175965]

Evaluation

To better illustrate, I used myself as an example:

[Figure: purchase history and recommendations for user Ka]
On the left is a list of the top 20 items that the user, Ka, has purchased before. On the right is the list of 5 recommendations that my recommender generated. As you can see, 4 out of 5 of my recommendations match Ka's top 20 purchase history. That means that, in this case, 80% of my recommended items are in this user's list of top 20 items. I use the same logic as the evaluation metric: to calculate the recall score for the recommender, I define a function to do the job, and the code is below. The dataframe is too large to process at once, so I divided it into 7 parts for the metric calculation.
# filter 1000 users for calculation
# because the dataframe is too large
users = high_volume.user_id.unique().tolist()

# calculate recall for the first 1000 users
def how_match():
    res = []
    for user in sorted(users)[:1000]:
        recommendations = Recommender_System(user)
        # purchase history is taken from high_volume here (an assumption;
        # the notebook used the previous cell's output, `_`)
        top_20_items = high_volume[high_volume.user_id == user].product_name.value_counts().head(20)
        recommendations_list = recommendations.index.tolist()
        top_20_items_list = top_20_items.index.tolist()
        # share of the 5 recommendations found in the user's top 20 items
        res.append(len(set(recommendations_list) & set(top_20_items_list)) / 5)
    return np.mean(res)

# get metric for the first 1000 users
how_match()

# calculate the mean of the metrics from all the Recommender notebooks.
print('The Final Score for Metric is', (0.531 + 0.522 + 0.519 + 0.530 + 0.523 + 0.519 + 0.526) / 7)

The final recall score is 0.524, which means that on average about 52% of my recommended items are in the user's list of top 20 items. That is a lot better than randomly guessing from more than 50k products. About half of the recommended items will be new items for the customer.
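Averaging per-part scores gives the overall mean only when every part contains the same number of users. The sketch below shows one way to split a user list into near-equal parts and to combine per-part scores with weights. Both helpers (`chunked`, `weighted_mean`) are hypothetical, and since the post does not state the part sizes, equal sizes are assumed purely for illustration.

```python
def chunked(items, n_chunks):
    """Split items into n_chunks consecutive slices of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def weighted_mean(chunk_scores, chunk_sizes):
    """Average per-chunk scores, weighting each by the number of users in it."""
    total = sum(chunk_sizes)
    return sum(s * n for s, n in zip(chunk_scores, chunk_sizes)) / total

# The seven scores reported above; the sizes are an assumption (not given in the post).
scores = [0.531, 0.522, 0.519, 0.530, 0.523, 0.519, 0.526]
sizes = [1000] * 7
final_score = weighted_mean(scores, sizes)
print(round(final_score, 3))
```

With equal sizes the weighted mean reduces to the plain average of the seven scores. With unequal sizes the two would differ, and the weighted version is the one that matches a single pass over all users.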
To better understand how effective my recommender is, I think the best approach is to run an A/B test experiment.
Lessons Learned

There are many studies, types and approaches for building a recommender. You really need to spend a lot of time on research and study. There is no shortcut; you need to try out different approaches on your own data and build several recommenders for comparison.
Expect your dataframe to become huge once you create a dummy column for each customer and item. You will need to explore ways, for example a matrix representation, to make your input more manageable and more time-efficient for your machine to process.
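One possible matrix representation is a sparse matrix, which stores only the non-zero purchase counts. The sketch below assumes SciPy is available and uses made-up (user, product, count) triples. It is an illustration of the idea, not the approach used in this project.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

# Made-up (user, product, count) triples standing in for the purchase table.
user_idx = np.array([0, 0, 1, 1, 1, 2])
product_idx = np.array([0, 1, 0, 1, 2, 3])
counts = np.array([3.0, 1.0, 2.0, 1.0, 1.0, 4.0])

# Only the non-zero cells are stored.
matrix = csr_matrix((counts, (user_idx, product_idx)), shape=(3, 4))

# Length of each user row, computed without densifying the matrix.
row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())

# Scale each row to unit length; one sparse product then gives all cosine similarities.
normalised = diags(1.0 / row_norms) @ matrix
similarity = (normalised @ normalised.T).toarray()

print(matrix.nnz, "stored values out of", matrix.shape[0] * matrix.shape[1])
print(np.round(similarity, 3))
```

scikit-learn's `cosine_similarity` also accepts sparse input, so a matrix built this way could be passed to it directly. Note that the user-by-user result is dense either way, so its size still grows with the square of the number of users.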
In The Future

Create a Flask application and deploy it online for people to experience the recommender.
Do a project that exercises A/B testing using this recommender.
Try using the dataset to build different types of recommenders.
Explore Turi Create, an open source ML tool now owned by Apple, for building recommenders, and compare it with other types of recommenders.

Thank you so much for reading. If you are interested in exploring my code and the resources I used, this project is on my GitHub repo.