Market Basket Analysis with recommenderlab

My take on Market Basket Analysis — Part 2 of 3

Diego Usai · Mar 25

Overview

Recently I wanted to learn something new and challenged myself to carry out an end-to-end Market Basket Analysis.

To continue to challenge myself, I’ve decided to put the results of my efforts before the eyes of the data science community.

This is the second of three posts, arranged as follows:

- Part 1 (which can be found here): explore and cleanse a dataset suitable for modelling with recommendation algorithms
- Part 2: apply various Product Recommendation models with the recommenderlab R package
- Part 3: implement the best performing model in a Shiny Web Application

Loading the Packages

```r
# Importing libraries
library(data.table)
library(tidyverse)
library(knitr)
library(recommenderlab)
```

Data

For the analysis I will be using the retail dataset prepared and cleansed in Part 1.

If you want to follow along with this post, make sure you get the dataset and run the R code for Part 1, which you can find on my GitHub profile.

```r
glimpse(retail)
## Observations: 528,148
## Variables: 10
## $ InvoiceNo   <dbl> 536365, 536365, 536365, 536365, ...
## $ StockCode   <chr> "85123A", "71053", "84406B", ...
## $ Description <fct> WHITE HANGING HEART T-LIGHT HOLDER, ...
## $ Quantity    <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, ...
## $ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12...
## $ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, ...
## $ CustomerID  <dbl> 17850, 17850, 17850, 17850, 17850, ...
## $ Country     <fct> United Kingdom, United Kingdom, ...
## $ Date        <date> 2010-12-01, 2010-12-01, 2010-12-01, ...
## $ Time        <fct> 08:26:00, 08:26:00, 08:26:00, 08:26:00, ...
```

Modelling

For the analysis part of this project I am using recommenderlab, an R package which provides a convenient framework to evaluate and compare various recommendation algorithms and quickly establish the best-suited approach.

Create the Rating Matrix

Before I can start, I need to arrange the purchase history in a rating matrix, with orders in rows and products in columns. This format is often called a user-item matrix because "users" (e.g. customers or orders) tend to be on the rows and "items" (e.g. products) on the columns.

recommenderlab accepts two types of rating matrix for modelling:

- a real rating matrix, consisting of actual user ratings, which requires normalisation;
- a binary rating matrix, consisting of 0s and 1s, where a 1 indicates that the product was purchased. This is the matrix type needed for this analysis, and it does not require normalisation (see the toy sketch below).
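To make the distinction concrete, here is a minimal toy sketch of the two classes. This snippet is my addition, not part of the original analysis; normalize() and binarize() are the recommenderlab helpers for each case.

```r
library(recommenderlab)

# Toy 3-user x 4-item ratings matrix (NA = not rated)
m <- matrix(c(5,  3, NA,  1,
              NA, 2,  4, NA,
              4, NA,  5,  2),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("u", 1:3), paste0("i", 1:4)))

real_m <- as(m, "realRatingMatrix")

# Real ratings: centre each user's ratings before modelling
normalize(real_m)

# Binary ratings: 1 = rated/purchased at or above minRating, no normalisation needed
binarize(real_m, minRating = 1)
```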

However, when creating the rating matrix, it becomes apparent that some orders contain the same item more than once, as shown in the example below.

```r
# Filtering by an order number which contains the same stock code more than once
retail %>% 
  filter(InvoiceNo == 557886 & StockCode == 22436) %>% 
  select(InvoiceNo, StockCode, Quantity, UnitPrice, CustomerID)
## # A tibble: 2 x 5
##   InvoiceNo StockCode Quantity UnitPrice CustomerID
##       <dbl> <chr>        <dbl>     <dbl>      <dbl>
## 1    557886 22436            1      0.65      17799
## 2    557886 22436            3      0.65      17799
```

It is possible that the company that donated this dataset to the UCI Machine Learning Repository had an order processing system that allowed for an item to be added multiple times to the same order.

For this analysis, I only need to know if an item was included in an order, so these duplicates need to be removed.

```r
# Create unique identifier
retail <- retail %>% 
  mutate(InNo_Desc = paste(InvoiceNo, Description, sep = ' ')) 

# Filter out duplicates and drop unique identifier
retail <- retail[!duplicated(retail$InNo_Desc), ] %>% 
  select(-InNo_Desc)

# CHECK: total row count - 517,354
```

I can now create the rating matrix.

```r
ratings_matrix <- retail %>%
  # Select only needed variables
  select(InvoiceNo, Description) %>% 
  # Add a column of 1s
  mutate(value = 1) %>%
  # Spread into user-item format
  spread(Description, value, fill = 0) %>% 
  select(-InvoiceNo) %>%
  # Convert to matrix
  as.matrix() %>%
  # Convert to recommenderlab class 'binaryRatingMatrix'
  as("binaryRatingMatrix")

ratings_matrix
## 19792 x 4001 rating matrix of class 'binaryRatingMatrix' with 517354 ratings.
```
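Before moving on, it is worth a quick sanity check on the new object. This snippet is my addition; rowCounts(), colCounts() and the image() method are standard recommenderlab accessors.

```r
# Distribution of items per order and orders per item
summary(rowCounts(ratings_matrix))
summary(colCounts(ratings_matrix))

# Visual spot-check of a corner of the sparse matrix
image(ratings_matrix[1:100, 1:100])
```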

Evaluation Scheme and Model Validation

In order to establish the models' effectiveness, recommenderlab implements a number of evaluation schemes. In this scheme, I split the data into a train set and a test set by selecting train = 0.8 for an 80/20 train/test split. I have also set method = "cross" and k = 5 for a 5-fold cross-validation. This means that the data is divided into k subsets of equal size, with 80% of the data used for training and the remaining 20% used for evaluation. The models are recursively estimated 5 times, each time using a different train/test split, which ensures that all users and items are considered for both training and testing. The results can then be averaged to produce a single evaluation set. Selecting given = -1 means that for each test user all but 1 randomly selected item are given to the recommender, with the remaining item withheld for evaluation.

```r
scheme <- ratings_matrix %>% 
  evaluationScheme(method = "cross",
                   k      = 5, 
                   train  = 0.8,  
                   given  = -1)

scheme
## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 5 run(s).
## Good ratings: NA
## Data set: 19792 x 4001 rating matrix of class 'binaryRatingMatrix' with 517354 ratings.
```
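The scheme exposes its train/test pieces through getData(), which I will use later when fitting the final model. A quick sketch of the accessors (my addition):

```r
getData(scheme, "train")    # training users for the current fold
getData(scheme, "known")    # items given to the test users (all but 1)
getData(scheme, "unknown")  # the single withheld item per test user, used for scoring
```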

Set up a List of Algorithms

One of recommenderlab's main features is the ability to estimate multiple algorithms in one go. First, I create a list with the algorithms I want to estimate, specifying all the model parameters. Here, I consider schemes which evaluate on a binary rating matrix, and I include the random items algorithm for benchmarking purposes.

```r
algorithms <- list(
  "association rules" = list(name = "AR",      param = list(supp = 0.01, conf = 0.01)),
  "random items"      = list(name = "RANDOM",  param = NULL),
  "popular items"     = list(name = "POPULAR", param = NULL),
  "item-based CF"     = list(name = "IBCF",    param = list(k = 5)),
  "user-based CF"     = list(name = "UBCF",    param = list(method = "Cosine", nn = 500))
)
```

Estimate the Models

All I have to do now is pass scheme and algorithms to the evaluate() function, select type = topNList to evaluate a Top N List of product recommendations, and specify how many recommendations to calculate with the parameter n = c(1, 3, 5, 10, 15, 20).

Please note that the CF-based algorithms take a few minutes each to estimate.

```r
results <- recommenderlab::evaluate(scheme, 
                                    algorithms, 
                                    type = "topNList", 
                                    n    = c(1, 3, 5, 10, 15, 20))
## AR run fold/sample [model time/prediction time]
##   1  [0.32sec/73.17sec] 
##   2  [0.24sec/72.72sec] 
##   3  [0.23sec/72.27sec] 
##   4  [0.24sec/72.82sec] 
##   5  [0.24sec/72.69sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/20.08sec] 
##   2  [0sec/19.01sec] 
##   3  [0sec/18.69sec] 
##   4  [0sec/19.26sec] 
##   5  [0.02sec/19.41sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.01sec/15.94sec] 
##   2  [0sec/16.34sec] 
##   3  [0sec/15.91sec] 
##   4  [0.02sec/16.02sec] 
##   5  [0.01sec/15.86sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [515.11sec/3.11sec] 
##   2  [513.94sec/2.88sec] 
##   3  [509.98sec/3.05sec] 
##   4  [513.94sec/3.13sec] 
##   5  [512.58sec/2.81sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/296.54sec] 
##   2  [0sec/291.54sec] 
##   3  [0sec/292.68sec] 
##   4  [0sec/293.33sec] 
##   5  [0sec/300.35sec]
```

The output is stored as a list containing all the evaluations.

```r
results
## List of evaluation results for 5 recommenders:
## Evaluation results for 5 folds/samples using method 'AR'.
## Evaluation results for 5 folds/samples using method 'RANDOM'.
## Evaluation results for 5 folds/samples using method 'POPULAR'.
## Evaluation results for 5 folds/samples using method 'IBCF'.
## Evaluation results for 5 folds/samples using method 'UBCF'.
```

Visualise the Results

recommenderlab has a basic plot function that can be used to compare model performance.
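For reference, the built-in comparison is a couple of one-liners; the annotate and legend arguments below follow the recommenderlab vignette:

```r
# Built-in comparison charts: ROC (default) and precision-recall
plot(results, annotate = c(1, 3), legend = "bottomright")
plot(results, "prec/rec", annotate = 3)
```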

However, I prefer to sort out the results in a tidy format for added flexibility and charting customisation.

First, I arrange the confusion matrix output for one model in a convenient format.

```r
# Pull into a list all confusion matrix information for one model 
tmp <- results$`user-based CF` %>% 
  getConfusionMatrix() %>% 
  as.list() 

# Calculate average value of 5 cross-validation rounds 
as.data.frame(Reduce("+", tmp) / length(tmp)) %>% 
  # Add a column to mark the number of recommendations calculated
  mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
  # Select only columns needed and sort out the order 
  select('n', 'precision', 'recall', 'TPR', 'FPR')
##    n  precision     recall        TPR          FPR
## 1  1 0.06858938 0.07420981 0.07420981 0.0002327780
## 2  3 0.04355442 0.14137351 0.14137351 0.0007171045
## 3  5 0.03354715 0.18148235 0.18148235 0.0012076795
## 4 10 0.02276376 0.24627561 0.24627561 0.0024423093
## 5 15 0.01762715 0.28605934 0.28605934 0.0036827205
## 6 20 0.01461690 0.31627924 0.31627924 0.0049253407
```

Then, I put the previous steps into a function.

```r
avg_conf_matr <- function(results) {
  tmp <- results %>%
    getConfusionMatrix() %>% 
    as.list() 
  as.data.frame(Reduce("+", tmp) / length(tmp)) %>% 
    mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
    select('n', 'precision', 'recall', 'TPR', 'FPR') 
}
```

Next, I use the map() function from the purrr package to get all results in a tidy format, ready for charting.

```r
# Using map() to iterate function across all models
results_tbl <- results %>%
  map(avg_conf_matr) %>% 
  # Turn into an unnested tibble
  enframe() %>%
  # Unnest to have all variables on the same level
  unnest()

results_tbl
## # A tibble: 30 x 6
##    name                  n precision   recall      TPR      FPR
##    <chr>             <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
##  1 association rules     1  0.0428   0.0380   0.0380   0.000197
##  2 association rules     3  0.0306   0.0735   0.0735   0.000579
##  3 association rules     5  0.0266   0.0979   0.0979   0.000944
##  4 association rules    10  0.0224   0.139    0.139    0.00179 
##  5 association rules    15  0.0202   0.162    0.162    0.00255 
##  6 association rules    20  0.0188   0.176    0.176    0.00325 
##  7 random items          1  0.000202 0.000219 0.000219 0.000250
##  8 random items          3  0.000253 0.000820 0.000820 0.000750
##  9 random items          5  0.000242 0.00131  0.00131  0.00125 
## 10 random items         10  0.000222 0.00241  0.00241  0.00250 
## # ... with 20 more rows
```

ROC curve

Classification model performance can be compared using the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR).

The item-based collaborative filtering model is the clear winner as it achieves the highest TPR for any given level of FPR. This means that the model produces the highest number of relevant recommendations (true positives) for the same level of non-relevant recommendations (false positives).

Note that using fct_reorder2() arranges the plot legend entries by best final FPR and TPR, aligning them with the curves and making the plot easier to read.
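As a numeric complement to eyeballing the curves, one could approximate the area under each sampled ROC segment with the trapezoidal rule. This snippet is my addition and yields partial AUCs only, since each curve covers just the n values evaluated:

```r
# Partial AUC per model via the trapezoidal rule on the sampled (FPR, TPR) points
results_tbl %>%
  group_by(name) %>%
  arrange(FPR, .by_group = TRUE) %>%
  summarise(partial_AUC = sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2)) %>%
  arrange(desc(partial_AUC))
```

The ROC curves themselves are plotted below.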

```r
results_tbl %>%
  ggplot(aes(FPR, TPR, 
             colour = fct_reorder2(as.factor(name), FPR, TPR))) +
  geom_line() +
  geom_label(aes(label = n)) +
  labs(title = "ROC curves", colour = "Model") +
  theme_grey(base_size = 14)
```

Precision-Recall curve

Another common way to compare classification model performance is with Precision vs Recall curves.

Precision shows how sensitive models are to False Positives (i.e. recommending an item not very likely to be purchased), whereas Recall (which is just another name for the TPR) looks at how sensitive models are to False Negatives (i.e. not suggesting an item which is highly likely to be purchased).

Normally, we care about accurately predicting which items are more likely to be purchased, because this would have a positive impact on sales and revenue. In other words, we want to maximise Recall (or minimise FNs) for the same level of Precision.

The plot confirms that the item-based Collaborative Filter (IBCF) is the best model, because it has higher Recall for any given level of Precision. This means that IBCF minimises FNs (i.e. not suggesting an item highly likely to be purchased) for all levels of FPs.
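To tie these definitions back to raw counts, the getConfusionMatrix() output also carries TP, FP, FN and TN columns, so precision = TP / (TP + FP) and recall = TP / (TP + FN) can be recomputed by hand. A minimal sketch (my addition), reusing the tmp list built earlier for the user-based CF model:

```r
# Recompute precision and recall from raw counts for the first fold
cm <- as.data.frame(tmp[[1]])
data.frame(n                 = c(1, 3, 5, 10, 15, 20),
           precision_by_hand = cm$TP / (cm$TP + cm$FP),
           recall_by_hand    = cm$TP / (cm$TP + cm$FN))
```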

```r
results_tbl %>%
  ggplot(aes(recall, precision, 
             colour = fct_reorder2(as.factor(name), precision, recall))) +
  geom_line() +
  geom_label(aes(label = n)) +
  labs(title = "Precision-Recall curves", colour = "Model") +
  theme_grey(base_size = 14)
```

Predictions For a New User

The final step is to generate a prediction with the best performing model.

To do that, I need to create a made-up purchase order.

First, I create a character vector containing a handful of products selected at random.

```r
customer_order <- c("GREEN REGENCY TEACUP AND SAUCER",
                    "SET OF 3 BUTTERFLY COOKIE CUTTERS",
                    "JAM MAKING SET WITH JARS",
                    "SET OF TEA COFFEE SUGAR TINS PANTRY",
                    "SET OF 4 PANTRY JELLY MOULDS")
```

Next, I put this order in a format that recommenderlab accepts.

```r
new_order_rat_matrx <- retail %>% 
  # Select item descriptions from retail dataset
  select(Description) %>% 
  unique() %>% 
  # Add a 'value' column with 1s for customer order items
  mutate(value = as.numeric(Description %in% customer_order)) %>% 
  # Spread into sparse matrix format
  spread(key = Description, value = value) %>% 
  # Change to a matrix
  as.matrix() %>% 
  # Convert to recommenderlab class 'binaryRatingMatrix'
  as("binaryRatingMatrix")
```

Now, I can create a Recommender. I use getData to retrieve the training data and set method = "IBCF" to select the best performing model ("item-based collaborative filtering").

```r
recomm <- Recommender(getData(scheme, 'train'), 
                      method = "IBCF",  
                      param = list(k = 5))

recomm
## Recommender of type 'IBCF' for 'binaryRatingMatrix' 
## learned using 15832 users.
```

Finally, I can pass the Recommender and the made-up order to the predict function to create a top 10 recommendation list for the new customer.

```r
pred <- predict(recomm, 
                newdata = new_order_rat_matrx, 
                n       = 10)
```

Lastly, the suggested items can be inspected as a list:

```r
as(pred, 'list')
## $`1`
##  [1] "ROSES REGENCY TEACUP AND SAUCER"   
##  [2] "PINK REGENCY TEACUP AND SAUCER"    
##  [3] "SET OF 3 HEART COOKIE CUTTERS"     
##  [4] "REGENCY CAKESTAND 3 TIER"          
##  [5] "JAM MAKING SET PRINTED"            
##  [6] "RECIPE BOX PANTRY YELLOW DESIGN"   
##  [7] "SET OF 3 CAKE TINS PANTRY DESIGN"  
##  [8] "GINGERBREAD MAN COOKIE CUTTER"     
##  [9] "3 PIECE SPACEBOY COOKIE CUTTER SET"
## [10] "SET OF 6 SPICE TINS PANTRY DESIGN"
```

Comments

This brings to an end the modelling and evaluation part of this project, which I found straightforward and quite enjoyable.

recommenderlab is intuitive and easy to use, and I particularly appreciated its ability to estimate and compare several classification algorithms at the same time.

In summary, I have learned how to carry out Market Basket Analysis with recommenderlab in R, how to interpret the results, and how to choose the best performing model.

Code Repository

The full R code can be found on my GitHub profile.

References

- For the recommenderlab package see: https://cran.r-project.org/package=recommenderlab
- For the recommenderlab package vignette see: https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
