Clean a complex dataset for modelling with recommendation algorithms

My take on Market Basket Analysis — Part 1 of 3

Diego Usai · Mar 13

Overview

Recently I wanted to learn something new and challenged myself to carry out an end-to-end Market Basket Analysis.

To continue to challenge myself, I’ve decided to put the results of my efforts before the eyes of the data science community.

And what better forum for my first ever series of posts than one of my favourite data science blogs!

This is the first of three posts, arranged as follows:

Part 1: explore and cleanse a dataset suitable for modelling with recommendation algorithms
Part 2: apply various Product Recommendation models with the recommenderlab R package
Part 3: implement the best performing model in a Shiny Web Application

Introduction

Market Basket Analysis, or MBA (also referred to as Affinity Analysis), is a set of data mining and data analysis techniques used to understand customers' shopping behaviours and uncover relationships among the products that they buy.

By analysing the likes of retail basket composition, transactional data, or browsing history, these methods can be used to suggest to customers items they might be interested in.

Such recommendations can promote customer loyalty, increase conversion rate by helping customers to find relevant products faster and boost cross-selling and up-selling by suggesting additional items or services.
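To make the idea concrete, here is a minimal toy sketch (not part of the analysis that follows) of the co-occurrence counting that sits behind a "customers who bought this also bought that" suggestion. The baskets tibble and all names in it are made up purely for illustration:

```r
library(tidyverse)

# Made-up transactions: one row per item per basket
baskets <- tribble(
  ~InvoiceNo, ~Item,
           1, "torch",
           1, "batteries",
           2, "torch",
           2, "batteries",
           2, "milk",
           3, "milk",
           3, "cereals"
)

# Count how often each unordered pair of items shares a basket
# (newer dplyr versions may warn about the intentional many-to-many join)
baskets %>% 
  inner_join(baskets, by = "InvoiceNo", suffix = c("_a", "_b")) %>% 
  filter(Item_a < Item_b) %>%     # keep each pair once
  count(Item_a, Item_b, sort = TRUE)
```

Here the pair batteries/torch tops the list with two shared baskets, which is exactly the kind of signal a recommendation engine builds on.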

Recommendation systems are widely employed in several industries.

Online retailer Amazon uses these methods to recommend to their customers products that other people often buy together (‘customers who bought this item also bought…’).

Every Monday since 2015, entertainment company Spotify suggests to their users a tailored playlist of 30 songs they've never listened to, based on their consumption history.

In 2009 Netflix famously ran a competition (The Netflix Prize) which awarded a $1M Grand Prize to the team that improved the predictions of their movie recommendation system.

Market Basket Analysis also finds applications in areas such as:

Promotions: MBA may give indications on how to structure promotion calendars. When two items (say, a torch and batteries) are often bought together, a retailer may decide to put only one of them on promotion to boost sales of the other.

Store Design: the store layout can be shaped to place closer together items which are more often bought at the same time (e.g. milk & cereals, shampoo & conditioner, meat & vegetables).

Loyalty Programmes: when a customer has been inactive for some time, they can be prompted to re-engage with customised offers based on their purchase history.

Data

The data for this project comes from the UCI Machine Learning Repository, an online archive of large datasets which includes a wide variety of data types, analysis tasks, and application areas.

In this project I’m using the Online Retail dataset, donated to UCI in 2015 by the School of Engineering at London South Bank University.

This dataset contains transactions occurring between 01/Dec/2010 and 09/Dec/2011 for a UK-based and registered online retail company.

The company sells mainly unique all-occasion gifts and many of their customers are wholesalers.

I've picked this particular dataset because of its "real life" nature, which shows in the many manually entered annotations and adjustment codes I found in the Description field.

There are even personal comments left by the company’s employees, who likely used the database on a day-to-day basis.

It's reasonable to assume that the dataset came straight from the company's database with very little alteration. In my experience, this is the state in which an analyst is most likely to receive data from a client commissioning a piece of analysis.

Data Preparation

```r
# Importing R libraries
library(data.table)
library(readxl)
library(tidyverse)
library(lubridate)
library(skimr)
library(knitr)
library(treemap)
```

Loading and Inspecting the Data

```r
# import raw data file and trim leading and trailing whitespaces
retail <- read_excel("./00_Data/Online Retail.xlsx", trim_ws = TRUE)
```

The dataset consists of just over 540,000 observations spread across 8 variables.

A few Descriptions and several CustomerIDs are missing, and there are also some odd negative Quantity and UnitPrice values that would be worth investigating.

It's also worth noting that InvoiceDate is in POSIXct format, from which information about the Date and Time of purchase can be extracted.
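For instance, a throwaway illustration (not part of the cleaning pipeline; ts is just an example timestamp) of what can be pulled out of a POSIXct value with base R and lubridate:

```r
library(lubridate)

ts <- as.POSIXct("2010-12-01 08:26:00", tz = "UTC")

as.Date(ts)              # "2010-12-01" - the Date component
hour(ts)                 # 8 - the hour of purchase
format(ts, "%H:%M:%S")   # "08:26:00" - the Time component
```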

```r
# First glance at the data
retail %>% skim()
## Skim summary statistics
##  n obs: 541909 
##  n variables: 8 
## 
## -- Variable type:character ------------------
##     variable missing complete      n min max empty n_unique
##      Country       0   541909 541909   3  20     0       38
##  Description    1454   540455 541909   1  35     0     4211
##    InvoiceNo       0   541909 541909   6   7     0    25900
##    StockCode       0   541909 541909   1  12     0     4070
## 
## -- Variable type:numeric --------------------
##    variable missing complete      n     mean      sd        p0   p25
##  CustomerID  135080   406829 541909 15287.69 1713.6   12346    13953
##    Quantity       0   541909 541909     9.55  218.08 -80995        1
##   UnitPrice       0   541909 541909     4.61   96.76 -11062.06     1.25
##    p50   p75  p100
##  15152 16791 18287
##      3    10 80995
##   2.08  4.13 38970
## 
## -- Variable type:POSIXct --------------------
##     variable missing complete      n        min        max     median n_unique
##  InvoiceDate       0   541909 541909 2010-12-01 2011-12-09 2011-07-19    23260
```

Cancellations

The very handy Attribute Information tells us that if the InvoiceNo starts with the letter 'C', it indicates a cancellation.

```r
retail %>% 
  filter(grepl("C", retail$InvoiceNo)) %>% 
  summarise(Total = n())
## # A tibble: 1 x 1
##   Total
##   <int>
## 1  9288
```

Cancellations are not needed for the analysis, so they can be removed.

```r
retail <- retail %>% 
  filter(!grepl("C", retail$InvoiceNo))
# CHECK: total row count – 532,621
```

Negative Quantities

When filtering by non-positive Quantity, the Description field shows what looks like a series of manually entered comments (e.g. "thrown away", "Unsaleable", "damaged", "?"). Given that UnitPrice is also set to zero for all of these rows, it is safe to assume that they were adjustment codes.

```r
retail %>% 
  filter(Quantity <= 0) %>% 
  group_by(Description, UnitPrice) %>% 
  summarise(count = n()) %>%
  arrange(desc(count)) %>% 
  ungroup()
## # A tibble: 139 x 3
##    Description            UnitPrice count
##    <chr>                      <dbl> <int>
##  1 <NA>                           0   862
##  2 check                          0   120
##  3 damages                        0    45
##  4 damaged                        0    42
##  5 ?                              0    41
##  6 sold as set on dotcom          0    20
##  7 Damaged                        0    14
##  8 thrown away                    0     9
##  9 Unsaleable, destroyed.         0     9
## 10 ??                             0     7
## # ... with 129 more rows
```

Were this a live project, I would typically sense-check this assumption with whoever provided the data.

In this case, I will simply remove all rows with non-positive Quantity.

```r
retail <- retail %>% 
  filter(Quantity > 0)
# CHECK: total row count – 531,285
```

Non-Product StockCodes

Turning my attention to StockCode, I notice a handful of non-product related codes ('Postage', 'Bank Charges', 'Gift Vouchers', etc.).

```r
# Non-product related codes
stc <- c('AMAZONFEE', 'BANK CHARGES', 'C2', 'DCGSSBOY', 'DCGSSGIRL',
         'DOT', 'gift_0001_', 'PADS', 'POST')

retail %>% 
  filter(grepl(paste(stc, collapse = "|"), StockCode)) %>% 
  group_by(StockCode, Description) %>% 
  summarise(count = n()) %>%
  arrange(desc(count)) %>% 
  ungroup()
## # A tibble: 19 x 3
##    StockCode    Description                        count
##    <chr>        <chr>                              <int>
##  1 POST         POSTAGE                             1126
##  2 DOT          DOTCOM POSTAGE                       708
##  3 C2           CARRIAGE                             141
##  4 DCGSSGIRL    GIRLS PARTY BAG                       13
##  5 BANK CHARGES Bank Charges                          12
##  6 DCGSSBOY     BOYS PARTY BAG                        11
##  7 gift_0001_20 Dotcomgiftshop Gift Voucher £20.00     9
##  8 gift_0001_10 Dotcomgiftshop Gift Voucher £10.00     8
##  9 gift_0001_30 Dotcomgiftshop Gift Voucher £30.00     7
## 10 gift_0001_50 Dotcomgiftshop Gift Voucher £50.00     4
## 11 PADS         PADS TO MATCH ALL CUSHIONS             4
## 12 POST         <NA>                                   4
## 13 gift_0001_40 Dotcomgiftshop Gift Voucher £40.00     3
## 14 AMAZONFEE    AMAZON FEE                             2
## 15 C2           <NA>                                   1
## 16 DOT          <NA>                                   1
## 17 gift_0001_10 <NA>                                   1
## 18 gift_0001_20 to push order througha s stock was     1
## 19 gift_0001_30 <NA>                                   1
```

These can all be removed.
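As an aside, the filter above works by collapsing the vector of codes into a single regular expression with alternation. A quick standalone illustration (stc_demo is a shortened copy of the codes vector, used purely for demonstration):

```r
stc_demo <- c('AMAZONFEE', 'BANK CHARGES', 'C2')

# paste() with collapse = "|" joins the codes into one "A|B|C" pattern
paste(stc_demo, collapse = "|")
## [1] "AMAZONFEE|BANK CHARGES|C2"

# grepl() then flags any StockCode matching at least one of the codes
grepl(paste(stc_demo, collapse = "|"), c("POST", "C2", "85123A"))
## [1] FALSE  TRUE FALSE
```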

```r
retail <- filter(retail, !grepl(paste(stc, collapse = "|"), StockCode))
# CHECK: total row count – 529,228
```

Description

Focusing now on the Description field, there are an additional 50 manually entered annotations that need removing.

In one case an employee has even vented their frustration at one of their co-workers ("alan hodge cant mamage this section"), misspelling and all!

```r
# Additional adjustment codes to remove
descr <- c("check", "check?", "?", "??", "damaged", "found", 
           "adjustment", "Amazon", "AMAZON", "amazon adjust", 
           "Amazon Adjustment", "amazon sales", "Found", "FOUND",
           "found box", "Found by jackie ", "Found in w/hse", "dotcom", 
           "dotcom adjust", "allocate stock for dotcom orders ta", "FBA", 
           "Dotcomgiftshop Gift Voucher £100.00", "on cargo order",
           "wrongly sold (22719) barcode", "wrongly marked 23343",
           "dotcomstock", "rcvd be air temp fix for dotcom sit", 
           "Manual", "John Lewis", "had been put aside", 
           "for online retail orders", "taig adjust", "amazon", 
           "incorrectly credited C550456 see 47", "returned", 
           "wrongly coded 20713", "came coded as 20713", 
           "add stock to allocate online orders", "Adjust bad debt", 
           "alan hodge cant mamage this section", "website fixed",
           "did a credit and did not tick ret", "michel oops",
           "incorrectly credited C550456 see 47", "mailout", "test",
           "Sale error", "Lighthouse Trading zero invc incorr", 
           "SAMPLES", "Marked as 23343", "wrongly coded 23343",
           "Adjustment", "rcvd be air temp fix for dotcom sit", 
           "Had been put aside.")
```

Filtering out the unwanted entries.

```r
retail <- retail %>% 
  filter(!Description %in% descr)
# CHECK: total row count – 528,732
```

Last but not least, there are also some 600 NAs in Description.

```r
sum(is.na(retail$Description))
## [1] 584
```

Given their small number (around 0.1% of the total), I will just remove them.

```r
retail <- retail %>% 
  filter(!is.na(Description))
# CHECK: total row count – 528,148
```

Customer ID

There is still a significant number of NAs in CustomerID, which I will leave as they are.

```r
retail$CustomerID %>% skim()
## Skim summary statistics
##
## -- Variable type:numeric --------------------
##  variable missing complete      n    mean      sd    p0   p25   p50   p75  p100
##         .  131778   396370 528148 15301.6 1709.98 12346 13975 15159 16803 18287
```

As I will discuss in the second post, for the analysis I need to arrange the data in a user-item format, where "users" can be either customers or orders.

Given that there are almost 5 times as many Orders as there are Customers, I am going to use InvoiceNo for orders in the analysis, which should make for a richer information set.
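To give a feel for what that user-item arrangement looks like, here is a minimal preview sketch; the model-ready matrix is actually built in Part 2. It assumes the recommenderlab package is installed, and the names sample_orders, orders, items, m, and ratings, as well as the 100-order slice (used to keep the dense intermediate matrix small), are all my own choices for illustration:

```r
library(recommenderlab)

# Small slice: the first 100 orders, one row per order-item pair
sample_orders <- retail %>% 
  distinct(InvoiceNo, Description) %>% 
  filter(InvoiceNo %in% head(unique(InvoiceNo), 100))

orders <- as.character(unique(sample_orders$InvoiceNo))
items  <- as.character(unique(sample_orders$Description))

# Binary orders x items matrix: TRUE where an order contains an item
m <- matrix(FALSE, nrow = length(orders), ncol = length(items),
            dimnames = list(orders, items))
m[cbind(match(as.character(sample_orders$InvoiceNo), orders),
        match(as.character(sample_orders$Description), items))] <- TRUE

# Coerce to the sparse rating-matrix class recommenderlab works with
ratings <- as(m, "binaryRatingMatrix")
```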

```r
sapply(retail[, c('InvoiceNo', 'CustomerID')], function(x) length(unique(x)))
##  InvoiceNo CustomerID 
##      19792       4336
```

Final touches

There are a couple of housekeeping tasks to sort out and I am ready to go!

```r
retail <- retail %>%
  # Setting 'Description' and 'Country' as factors
  mutate(Description = as.factor(Description)) %>%
  mutate(Country = as.factor(Country)) %>% 
  # Changing 'InvoiceNo' type to numeric
  # (safe now that the 'C'-prefixed cancellations have been removed)
  mutate(InvoiceNo = as.numeric(InvoiceNo)) %>% 
  # Extracting 'Date' and 'Time' from 'InvoiceDate'
  mutate(Date = as.Date(InvoiceDate)) %>% 
  mutate(Time = as.factor(format(InvoiceDate, "%H:%M:%S")))

glimpse(retail)
## Observations: 528,148
## Variables: 10
## $ InvoiceNo   <dbl> 536365, 536365, 536365, 536365, 536365, ...
## $ StockCode   <chr> "85123A", "71053", "84406B", "84029G", ...
## $ Description <fct> WHITE HANGING HEART T-LIGHT HOLDER, ...
## $ Quantity    <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, ...
## $ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, ...
## $ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, ...
## $ CustomerID  <dbl> 17850, 17850, 17850, 17850, 17850, ...
## $ Country     <fct> United Kingdom, United Kingdom, ...
## $ Date        <date> 2010-12-01, 2010-12-01, 2010-12-01, ...
## $ Time        <fct> 08:26:00, 08:26:00, 08:26:00, 08:26:00, ...
```

Exploring the Dataset

I am now ready to take a look at the dataset's different features.

What items do people buy most often?

```r
retail %>% 
  group_by(Description) %>% 
  summarize(count = n()) %>% 
  top_n(10, wt = count) %>%
  arrange(desc(count)) %>% 
  ggplot(aes(x = reorder(Description, count), y = count)) +
  geom_bar(stat = "identity", fill = "royalblue", colour = "blue") +
  labs(x = "", y = "Top 10 Best Sellers", title = "Most Ordered Products") +
  coord_flip() +
  theme_grey(base_size = 12)
```

The heart-shaped tea light holder is the most popular item.

The top 10 best-selling products represent around 3% of the total items sold by the company.

```r
retail %>% 
  group_by(Description) %>% 
  summarize(count = n()) %>% 
  mutate(pct = (count / sum(count)) * 100) %>% 
  arrange(desc(pct)) %>% 
  ungroup() %>% 
  top_n(10, wt = pct)
## # A tibble: 10 x 3
##    Description                        count   pct
##    <fct>                              <int> <dbl>
##  1 WHITE HANGING HEART T-LIGHT HOLDER  2327 0.441
##  2 JUMBO BAG RED RETROSPOT             2115 0.400
##  3 REGENCY CAKESTAND 3 TIER            2019 0.382
##  4 PARTY BUNTING                       1707 0.323
##  5 LUNCH BAG RED RETROSPOT             1594 0.302
##  6 ASSORTED COLOUR BIRD ORNAMENT       1489 0.282
##  7 SET OF 3 CAKE TINS PANTRY DESIGN    1399 0.265
##  8 PACK OF 72 RETROSPOT CAKE CASES     1370 0.259
##  9 LUNCH BAG  BLACK SKULL.             1328 0.251
## 10 NATURAL SLATE HEART CHALKBOARD      1263 0.239
```

What time of day do people buy most often?

```r
retail %>% 
  ggplot(aes(hour(hms(Time)))) + 
  geom_histogram(stat = "count", fill = "#E69F00", colour = "red") +
  labs(x = "Hour of Day", y = "") +
  theme_grey(base_size = 12)
```

Lunchtime is the preferred time for shopping online, with the majority of orders placed between 12 noon and 3pm.

What day of the week do people buy most often?

```r
retail %>% 
  # week_start = 1 makes Monday the first day of the week
  ggplot(aes(wday(Date, week_start = getOption("lubridate.week.start", 1)))) + 
  geom_histogram(stat = "count", fill = "forest green", colour = "dark green") +
  labs(x = "Day of Week", y = "") +
  scale_x_continuous(breaks = c(1, 2, 3, 4, 5, 6, 7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) +
  theme_grey(base_size = 14)
```

Orders peak on Thursdays, with no orders processed on Saturdays.

How many items does each customer buy?

```r
retail %>% 
  group_by(InvoiceNo) %>% 
  summarise(n = mean(Quantity)) %>%
  ggplot(aes(x = n)) +
  geom_histogram(bins = 100000, fill = "purple", colour = "black") + 
  coord_cartesian(xlim = c(0, 100)) +
  scale_x_continuous(breaks = seq(0, 100, 10)) +
  labs(x = "Average Number of Items per Purchase", y = "") +
  theme_grey(base_size = 14)
```

The large majority of customers typically purchase between 2 and 15 items, with a peak at 2.

What is the average value per order?

```r
retail %>% 
  mutate(Value = UnitPrice * Quantity) %>% 
  group_by(InvoiceNo) %>% 
  summarise(n = mean(Value)) %>%
  ggplot(aes(x = n)) +
  geom_histogram(bins = 200000, fill = "firebrick3", colour = "sandybrown") + 
  coord_cartesian(xlim = c(0, 100)) +
  scale_x_continuous(breaks = seq(0, 100, 10)) +
  labs(x = "Average Value per Purchase", y = "") +
  theme_grey(base_size = 14)
```

The bulk of orders have a value below £20, with the distribution showing a double peak, one at £6 and a more pronounced one at £17.

Which countries do they sell their goods to?

```r
treemap(retail,
        index = c("Country"),
        vSize = "Quantity",
        title = "",
        palette = "Set2",
        border.col = "grey40")
```

Five sixths of orders come from the United Kingdom.

Comments

This concludes the data preparation and visualisation part of the project.

So far I have shown how to approach a "real life" dataset in order to clean it, get rid of unwanted elements, change variable types, and create new variables from existing ones.

In summary, I have removed Cancellations, eliminated negative Quantity and UnitPrice values, got rid of NAs in Description, and created two new variables, Date and Time.

A total of 13,761 rows (roughly 2.5% of the initial count) were discarded, and the dataset now has 528,148 observations.
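The discarded-row figure agrees with the before-and-after counts:

```r
# initial observations minus final observations
541909 - 528148
## [1] 13761
```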

Code Repository

The full R code can be found on my GitHub profile.

References

For the recommenderlab package see: https://cran.r-project.org/package=recommenderlab
For the recommenderlab package vignette see: https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
