Predicting Next Purchase Day

Data Driven Growth with Python

A machine learning model to predict when customers will make their next purchase

Barış Karaman · Jun 2

This series of articles was designed to explain how to use Python in a simple way to fuel your company’s growth by applying a predictive approach to all your actions.

It will be a combination of programming, data analysis, and machine learning.

I will cover all the topics in the following ten articles:

1- Know Your Metrics
2- Customer Segmentation
3- Customer Lifetime Value Prediction
4- Churn Prediction
5- Predicting Next Purchase Day
6- Demand Prediction with Time-Series Method
7- Market Response Models
8- Statistical Simulations
9- A/B Testing Design and Execution
10- Automations

The first three articles are live, and the rest will be published weekly.

Each article has its own code snippets so you can easily apply them.

If you are super new to programming, you can find a good introduction to Python and Pandas (a popular library that we will use for everything) here.

Even without a coding background, you can learn the concepts, how to use your data, and start generating value from it:

Sometimes you gotta run before you can walk — Tony Stark

As a prerequisite, be sure Jupyter Notebook and Python are installed on your computer.

The code snippets will run on Jupyter Notebook only.

Alright, let’s start.

Part 5: Predicting Next Purchase Day

Most of the actions we explained in the Data Driven Growth series have the same mentality behind them: treat your customers the way they deserve before they expect it (e.g., LTV prediction), and act before something bad happens (e.g., churn).

Predictive analytics helps us a lot on this one.

One of the many opportunities it can provide is predicting the next purchase day of the customer.

What if you knew whether a customer is likely to make another purchase in 7 days? We can build our strategy on top of that and come up with lots of tactical actions like:

- No promotional offer to this customer, since s/he will make a purchase anyway
- Nudge the customer with inbound marketing if there is no purchase in the predicted time window (or fire the person who did the prediction)

In this article, we will be using the Online Retail dataset and follow the steps below:

- Data Wrangling (creating previous/next datasets and calculating purchase day differences)
- Feature Engineering
- Selecting a Machine Learning Model
- Multi-Classification Model
- Hyperparameter Tuning

Data Wrangling

Let’s start by importing our data and doing the preliminary data work:

Importing CSV file and date field transformation

We have imported the CSV file, converted the date field from string to datetime to make it workable, and filtered out countries other than the UK.
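The import step above was shared as an image in the original post; it might look like this minimal sketch. The file name and column names follow the UCI Online Retail dataset layout and are assumptions here:

```python
import pandas as pd

def load_uk_transactions(path):
    # read the raw transactions CSV
    tx_data = pd.read_csv(path, encoding='utf8')
    # convert the InvoiceDate field from string to datetime
    tx_data['InvoiceDate'] = pd.to_datetime(tx_data['InvoiceDate'])
    # keep only UK customers
    return tx_data.query("Country == 'United Kingdom'").reset_index(drop=True)

# tx_uk = load_uk_transactions('OnlineRetail.csv')
```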

To build our model, we should split our data into two parts:

Data structure for training the model

We use six months of behavioral data to predict customers’ first purchase date in the next three months.

If there is no purchase, we will predict that too.

Let’s assume our cutoff date is Sep 1st ’11 and split the data:

from datetime import date

tx_6m = tx_uk[(tx_uk.InvoiceDate < date(2011,9,1)) & (tx_uk.InvoiceDate >= date(2011,3,1))].reset_index(drop=True)
tx_next = tx_uk[(tx_uk.InvoiceDate >= date(2011,9,1)) & (tx_uk.InvoiceDate < date(2011,12,1))].reset_index(drop=True)

tx_6m represents the six months of performance, whereas we will use tx_next to find out the days between the last purchase date in tx_6m and the first one in tx_next.

Also, we will create a dataframe called tx_user to hold a user-level feature set for the prediction model:

tx_user = pd.DataFrame(tx_6m['CustomerID'].unique())
tx_user.columns = ['CustomerID']

Using the data in tx_next, we need to calculate our label (days between the last purchase before the cutoff date and the first purchase after it):

Now, tx_user looks like below:

As you can easily notice, we have NaN values because those customers haven’t made any purchases yet.

We fill NaN with 999 to quickly identify them later.
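The label calculation itself was shown only as an image in the original; one way to compute it is sketched below. The function and intermediate names are mine, not the article’s:

```python
import pandas as pd

def next_purchase_day_labels(tx_6m, tx_next, tx_user):
    # last purchase date per customer before the cutoff
    tx_last = tx_6m.groupby('CustomerID').InvoiceDate.max().reset_index()
    tx_last.columns = ['CustomerID', 'MaxPurchaseDate']
    # first purchase date per customer after the cutoff
    tx_first = tx_next.groupby('CustomerID').InvoiceDate.min().reset_index()
    tx_first.columns = ['CustomerID', 'MinPurchaseDate']
    # days between the two dates becomes the NextPurchaseDay label
    tx_dates = pd.merge(tx_last, tx_first, on='CustomerID', how='left')
    tx_dates['NextPurchaseDay'] = (tx_dates['MinPurchaseDate']
                                   - tx_dates['MaxPurchaseDate']).dt.days
    out = pd.merge(tx_user, tx_dates[['CustomerID', 'NextPurchaseDay']],
                   on='CustomerID', how='left')
    # customers with no purchase in tx_next get 999
    return out.fillna(999)
```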

We have customer ids and corresponding labels in a dataframe.

Let’s enrich it with our feature set to build our machine learning model.

Feature Engineering

For this project, we have selected our feature candidates as below:

- RFM scores & clusters
- Days between the last three purchases
- Mean & standard deviation of the difference between purchases in days

After adding these features, we need to deal with the categorical features by applying the get_dummies method.

For RFM, to avoid repeating Part 2, we share the code block and move forward:

RFM Scores & Clustering

Let’s focus on how we can add the next two features.
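For readers who skipped Part 2, the RFM block referenced above could be condensed into a sketch like this; the cluster-ordering step from Part 2 is omitted for brevity, and the helper name is mine:

```python
import pandas as pd
from sklearn.cluster import KMeans

def add_rfm_features(tx_user, tx_6m):
    # Recency: days since each customer's last purchase
    last = tx_6m.groupby('CustomerID').InvoiceDate.max().reset_index()
    last['Recency'] = (last['InvoiceDate'].max() - last['InvoiceDate']).dt.days
    tx_user = pd.merge(tx_user, last[['CustomerID', 'Recency']], on='CustomerID')
    # Frequency: number of purchase records per customer
    freq = tx_6m.groupby('CustomerID').InvoiceDate.count().reset_index()
    freq.columns = ['CustomerID', 'Frequency']
    tx_user = pd.merge(tx_user, freq, on='CustomerID')
    # Revenue: Quantity * UnitPrice summed per customer
    tx_6m = tx_6m.copy()
    tx_6m['Revenue'] = tx_6m['UnitPrice'] * tx_6m['Quantity']
    rev = tx_6m.groupby('CustomerID').Revenue.sum().reset_index()
    tx_user = pd.merge(tx_user, rev, on='CustomerID')
    # cluster each metric with KMeans, as in Part 2
    for col in ['Recency', 'Frequency', 'Revenue']:
        kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
        tx_user[col + 'Cluster'] = kmeans.fit_predict(tx_user[[col]])
    # overall score as the sum of the cluster labels
    tx_user['OverallScore'] = (tx_user['RecencyCluster']
                               + tx_user['FrequencyCluster']
                               + tx_user['RevenueCluster'])
    return tx_user
```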

We will be using the shift() method a lot in this part.

First, we create a dataframe with CustomerID and Invoice Day (not datetime).

Then we remove the duplicates, since customers can make multiple purchases in a day and the difference would become 0 for those.

#create a dataframe with CustomerID and Invoice Day
tx_day_order = tx_6m[['CustomerID','InvoiceDate']]

#convert Invoice Datetime to day
tx_day_order['InvoiceDay'] = tx_6m['InvoiceDate'].dt.date

tx_day_order = tx_day_order.sort_values(['CustomerID','InvoiceDate'])

#drop duplicates
tx_day_order = tx_day_order.drop_duplicates(subset=['CustomerID','InvoiceDay'],keep='first')

Next, by using shift, we create new columns with the dates of the last 3 purchases and see how our dataframe looks:

#shifting last 3 purchase dates
tx_day_order['PrevInvoiceDate'] = tx_day_order.groupby('CustomerID')['InvoiceDay'].shift(1)
tx_day_order['T2InvoiceDate'] = tx_day_order.groupby('CustomerID')['InvoiceDay'].shift(2)
tx_day_order['T3InvoiceDate'] = tx_day_order.groupby('CustomerID')['InvoiceDay'].shift(3)

Output:

Let’s begin calculating the difference in days for each invoice date:

tx_day_order['DayDiff'] = (tx_day_order['InvoiceDay'] - tx_day_order['PrevInvoiceDate']).dt.days
tx_day_order['DayDiff2'] = (tx_day_order['InvoiceDay'] - tx_day_order['T2InvoiceDate']).dt.days
tx_day_order['DayDiff3'] = (tx_day_order['InvoiceDay'] - tx_day_order['T3InvoiceDate']).dt.days

Output:

For each customer ID, we use the agg() method to find the mean and standard deviation of the difference between purchases in days:

tx_day_diff = tx_day_order.groupby('CustomerID').agg({'DayDiff': ['mean','std']}).reset_index()
tx_day_diff.columns = ['CustomerID', 'DayDiffMean','DayDiffStd']

Now we are going to make a tough decision.

The calculation above is quite useful for customers who have many purchases.

But we can’t say the same for the ones with 1–2 purchases.

For instance, it is too early to tag a customer who has made only 2 back-to-back purchases as frequent.

We only keep customers who have more than 3 purchases, using the following line:

tx_day_order_last = tx_day_order.drop_duplicates(subset=['CustomerID'],keep='last')

Finally, we drop NA values, merge the new dataframes with tx_user, and apply get_dummies() to convert categorical values:

#dropna removes customers without 3 previous purchases
tx_day_order_last = tx_day_order_last.dropna()
tx_day_order_last = pd.merge(tx_day_order_last, tx_day_diff, on='CustomerID')
tx_user = pd.merge(tx_user, tx_day_order_last[['CustomerID','DayDiff','DayDiff2','DayDiff3','DayDiffMean','DayDiffStd']], on='CustomerID')

#create tx_class as a copy of tx_user before applying get_dummies
tx_class = tx_user.copy()
tx_class = pd.get_dummies(tx_class)

Our feature set is ready for building a classification model.

But there are many different models; which one should we use?

Selecting a Machine Learning Model

Before jumping into choosing the model, we need to take two actions.

First, we need to identify the classes in our label.

Generally, percentiles are the right guide for that. Let’s use the describe() method to see the distribution of NextPurchaseDay:

Deciding the boundaries is a question of both statistics and business needs: the classes should make sense statistically and be easy to act on and communicate.

Considering these two, we will have three classes:

- 0–20: customers that will purchase in 0–20 days (class name: 2)
- 21–49: customers that will purchase in 21–49 days (class name: 1)
- ≥ 50: customers that will purchase in more than 50 days (class name: 0)

tx_class['NextPurchaseDayRange'] = 2
tx_class.loc[tx_class.NextPurchaseDay>20,'NextPurchaseDayRange'] = 1
tx_class.loc[tx_class.NextPurchaseDay>50,'NextPurchaseDayRange'] = 0

The last step is to see the correlation between our features and label.

The correlation matrix is one of the cleanest ways to show this:

corr = tx_class[tx_class.columns].corr()
plt.figure(figsize = (30,20))
sns.heatmap(corr, annot = True, linewidths=0.2, fmt=".2f")

It looks like Overall Score has the highest positive correlation (0.45) and Recency has the highest negative correlation.


For this particular problem, we want to use the model which gives the highest accuracy.

Let’s split the data into train and test sets and measure the accuracy of different models:

Selecting the ML model for the best accuracy

Accuracy for each model:

From this result, we see that XGBoost is the best performing one.
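The comparison step above was an image in the original; it might be sketched roughly as below. The model list, fold count, and random seed are assumptions based on the text, and xgboost’s XGBClassifier would be appended to the list in the same way:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def compare_models(X, y):
    # candidate models; xgb.XGBClassifier() would be added here too
    models = [('LR', LogisticRegression(max_iter=1000)),
              ('NB', GaussianNB()),
              ('RF', RandomForestClassifier()),
              ('DT', DecisionTreeClassifier())]
    results = {}
    kfold = KFold(n_splits=2, shuffle=True, random_state=22)
    for name, model in models:
        # cross-validated accuracy: mean and std across the folds
        cv = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
        results[name] = (cv.mean(), cv.std())
    return results
```

A low standard deviation across folds is what the article means by a "stable" model.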

But before moving forward, let’s look at what we did. We applied a fundamental concept in machine learning: cross-validation. How can we be sure of the stability of our machine learning model across different datasets? And what if there is noise in the test set we selected? Cross-validation is a way of measuring this. It provides the score of the model by selecting different test sets. If the deviation is low, it means the model is stable. In our case, XGB has two scores, 0.574 and 0.582, which look alright.

Let’s move forward with XGBoost and build our multi-classification model.

Multi-Classification Model

To build our model, we will follow the steps in the previous articles. But to improve it further, we’ll do hyperparameter tuning: programmatically, we will find the best parameters for our model so that it provides the best accuracy.

Let’s start with coding our model first:

xgb_model = xgb.XGBClassifier().fit(X_train, y_train)
print('Accuracy of XGB classifier on training set: {:.2f}'.format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'.format(xgb_model.score(X_test[X_train.columns], y_test)))

In this version, our accuracy on the test set is 58%:

XGBClassifier has many parameters.

You can find the list of them here.

For this example, we will select max_depth and min_child_weight.

The code below will find the best values for these parameters:

from sklearn.model_selection import GridSearchCV

param_test1 = {'max_depth':range(3,10,2), 'min_child_weight':range(1,6,2)}
gsearch1 = GridSearchCV(estimator = xgb.XGBClassifier(), param_grid = param_test1, scoring='accuracy', n_jobs=-1, cv=2)
gsearch1.fit(X_train, y_train)
gsearch1.best_params_, gsearch1.best_score_

The algorithm says the best values are 3 and 5 for max_depth and min_child_weight, respectively.

Check out how it improves accuracy:

Our score increased from 58% to 62%; quite an improvement.

Knowing the next purchase day is a good indicator for predicting demand as well.

We will be doing a deep dive on this topic in Part 6.

You can find the Jupyter Notebook for this article here.

