Predicting Customer Lifetime Value with “Buy ‘Til You Die” probabilistic models in Python

And above all, how long should we expect a customer to be “alive”? While these are very common questions among Marketing, Product, VC and Corporate Finance professionals, it is always hard to answer them with accurate numbers.

In non-contractual business settings, where customers can end their relationship with a retailer at any moment and without notice, this can be even trickier.

Amazon for books (or any other of its product categories without subscription), Zalando for clothing, and Booking.com for hotels are all examples of non-contractual business settings.

For all three of these e-commerce businesses, we cannot look at the end date of a customer’s contract to know whether he is “alive” (will purchase in the future) or “dead” (will never purchase again).

We can only rely on a customer’s past purchases and other less characterizing events (website visits, reviews, etc.).

But how do we decide in this scenario whether a customer is going to come back or is gone for good?

“Buy ‘Til You Die” probabilistic models help us quantify the lifetime value of a customer by assessing the expected number of his future transactions and his probability of being “alive”.

The BG/NBD Model

To understand how Buy ’Til You Die models work, we focus on our best choice to predict real-life data: the BG/NBD model.

The Beta Geometric/Negative Binomial Distribution model was introduced in 2004 in P. Fader’s paper as an improvement of the Pareto/NBD model (the first BTYD model), developed by Schmittlein et al. in 1987.

In particular, to predict future transactions the model treats the customer purchasing behaviour as a coin tossing game.

Each customer has 2 coins: a buy coin that controls the probability that the customer purchases, and a die coin that controls the probability that the customer quits and never purchases again.

Let’s go through the model assumptions to understand how everything works out.

Assumption 1: while active, the number of transactions made by a customer follows a Poisson Process with transaction rate λ (=expected number of transactions in a time interval).

Figure: a customer’s purchasing behavior observed over a period of 12 months, where the number of transactions is distributed as a Poisson process with unobserved transaction rate λ.

At every sub-period (1 month) of a specific time interval (12 months), each customer tosses his buy coin and, depending on the result, purchases or not.

The number of transactions (heads) we observe in the period depends on each customer’s probability distribution around λ.

Let’s look at a customer’s Poisson probability distribution to visualize what we just said.

Figure: Poisson probability mass function of a customer with λ = 4.3.

Here we assume that our random customer has a transaction rate of λ = 4.3.

As a consequence, he has a 19% probability of purchasing 4 times in a random 12-month period, a 4% probability of purchasing 8 times, and so on.
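The 19% and 4% figures follow directly from the Poisson probability mass function, P(X = k) = e^(-λ) λ^k / k!. Here is a minimal sketch using only the standard library; the helper name poisson_pmf is ours and not part of any package.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k transactions when the rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 4.3  # transaction rate of the example customer
for k in (4, 8):
    print(f"P(X = {k}) = {poisson_pmf(k, lam):.2f}")
```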

Assumption 2: heterogeneity in transaction rates among customers follows a Gamma distribution.

This is equivalent to saying that each customer has his own buy coin (with its very own probability of heads and tails).

To better understand the assumption, we simulate the Poisson distributions of 100 customers, where each λ is drawn from a Gamma distribution with parameters shape = 9 and scale = 0.5.

Figure: simulation of 100 customers’ Poisson probability distributions, where each customer’s λ is drawn from a Gamma distribution with shape = 9 and scale = 0.5.

As stated in the assumption, each customer has his own probability of purchasing x times in a given time interval.
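The heterogeneity step can be sketched in a few lines: draw one λ per customer from a Gamma distribution with shape 9 and scale 0.5. This is only an illustration, not the code behind the original plot; the seed is arbitrary, and random.gammavariate takes the shape and the scale as its two arguments.

```python
import random
import statistics

random.seed(42)  # arbitrary seed, only for reproducibility

SHAPE, SCALE = 9, 0.5  # Gamma parameters used in the simulation described in the text

# One transaction rate (lambda) per simulated customer.
lambdas = [random.gammavariate(SHAPE, SCALE) for _ in range(100)]

print(f"mean lambda: {statistics.mean(lambdas):.2f} (theoretical mean = {SHAPE * SCALE})")
print(f"min / max:   {min(lambdas):.2f} / {max(lambdas):.2f}")
```

Each of these rates would then define that customer’s own Poisson distribution.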

Assumption 3: after any transaction, a customer becomes inactive with probability p.

 Therefore the point at which the customer “drops out” is distributed across transactions according to a (shifted) Geometric distribution.

After every transaction, each customer will toss the second coin, the die coin.

Given that p is the probability of “dying” after a transaction, the probability of staying alive after that transaction is P(Alive) = 1 - p.

Once again, let’s look at a random customer’s probability distribution to better grasp the meaning of this assumption.

Figure: shifted Geometric probability mass function for a customer with p = 0.52.

Assuming that our customer becomes inactive with probability p = 0.52, the probability that he becomes inactive after the 2nd transaction is 25%, and the probability that he becomes inactive after the 3rd transaction is 12%.

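As you can see, the probability of dropping out exactly after the k-th transaction shrinks as k grows, because the customer must first survive the previous k - 1 tosses of the die coin.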
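The 25% and 12% figures come from the shifted Geometric probability mass function, P(drop out after transaction k) = p (1 - p)^(k - 1). A minimal sketch, with a helper name of our own choosing:

```python
def dropout_pmf(k: int, p: float) -> float:
    """Probability that the customer becomes inactive right after transaction k."""
    return p * (1 - p) ** (k - 1)

p = 0.52  # dropout probability of the example customer
for k in (1, 2, 3):
    print(f"P(drop out after transaction {k}) = {dropout_pmf(k, p):.2f}")
```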

Assumption 4: heterogeneity in p follows a Beta distribution.

As for the buy coin, each customer has his own die coin with its own probability of being alive after a specific amount of transactions.

We can see below how that would look for a simulation of 10 customers where p follows a Beta distribution with α = 2 and β = 3.

Figure: simulation of 10 random customers’ Geometric probability distributions, where p follows a Beta distribution with parameters α = 2 and β = 3.

Assumption 5: the transaction rate λ and the dropout probability p vary independently across customers.
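Taken together, the five assumptions describe a simple generative process, which can be sketched as a small simulation. The function name, the seed and the parameter values are illustrative assumptions (the Gamma and Beta parameters mirror the ones used in the simulations above); the first purchase is taken to happen at time 0 and is not counted among the repeat transactions.

```python
import random

def simulate_customer(r, alpha, a, b, T, rng):
    """Simulate one customer's repeat purchases over an observation window of length T.

    Returns (x, t_x, T): number of repeat transactions, time of the last one, and age.
    """
    lam = rng.gammavariate(r, 1 / alpha)  # Assumption 2: lambda ~ Gamma(shape r, rate alpha)
    p = rng.betavariate(a, b)             # Assumptions 4 and 5: p ~ Beta(a, b), drawn independently
    t, x, t_x = 0.0, 0, 0.0
    while True:
        t += rng.expovariate(lam)         # Assumption 1: Poisson purchases = exponential waiting times
        if t > T:
            break                         # the next purchase falls outside the window
        x, t_x = x + 1, t
        if rng.random() < p:              # Assumption 3: toss the die coin after each transaction
            break
    return x, t_x, T

rng = random.Random(0)  # arbitrary seed
# Illustrative parameter values, not fitted to any real dataset.
customers = [simulate_customer(r=9, alpha=2, a=2, b=3, T=12, rng=rng) for _ in range(5)]
for x, t_x, T in customers:
    print(f"x = {x:2d}, t_x = {t_x:5.2f}, T = {T}")
```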

Model Outputs

Eventually, by fitting the previously mentioned distributions to the historical customer data, we are able to derive a model that provides, for each customer:

P(X(t) = x | λ, p): the probability of observing x transactions in a time period of length t

E(X(t) | λ, p): the expected number of transactions in a time period of length t

P(τ > t): the probability that the customer is still active at time t, where τ is the (unobserved) time at which he becomes inactive

The fitted distribution parameters are then used in the forward-looking customer-base analysis to find the expected number of transactions in a future period of length t for an individual with past observed behavior defined by x, tₓ and T, where x = number of historical transactions, tₓ = time of last purchase and T = age of the customer.
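For reference, this conditional expectation is usually written as follows in the BG/NBD literature. Treat it as a sketch to be checked against the paper’s appendix: ₂F₁ denotes the Gaussian hypergeometric function, and δ_{x>0} equals 1 when x > 0 and 0 otherwise.

```latex
E\bigl(Y(t) \mid X = x, t_x, T, r, \alpha, a, b\bigr)
  = \frac{\dfrac{a + b + x - 1}{a - 1}
          \left[ 1 - \left( \dfrac{\alpha + T}{\alpha + T + t} \right)^{r + x}
          {}_2F_1\!\left( r + x,\; b + x;\; a + b + x - 1;\; \dfrac{t}{\alpha + T + t} \right) \right]}
         {1 + \delta_{x > 0}\, \dfrac{a}{b + x - 1}
          \left( \dfrac{\alpha + T}{\alpha + t_x} \right)^{r + x}}
```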

For the math lovers, a careful derivation of the final formula is provided in the Appendix of P. Fader’s paper.

Formula: the expected number of transactions in a future period of length t for an individual with past observed behavior (X = x, tₓ, T, where x = number of historical transactions, tₓ = time of last purchase and T = age of the customer), given the fitted model parameters r, α, a, b.

Implementing the CLV Model in Python

Now that we understand how Buy ‘Til You Die models work, we are finally ready to pass from theory to practice and apply the BG/NBD model to real customer data.

Among the various alternatives available to implement the model I highly recommend the Lifetimes package in Python (used here) and the BTYD library in R.

These packages did a great job in wrapping the model’s equations into handy functions that make our life incredibly easy.
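To make concrete what such a helper wraps, here is a minimal plain-Python sketch of the expected number of future transactions for one customer, following the closed-form expression from the BG/NBD paper. The function names, the customer history and the parameter values are illustrative assumptions, and the hypergeometric function is evaluated with a simple power series that is only suitable for |z| < 1. A more robust implementation would rely on a tested routine such as scipy.special.hyp2f1.

```python
def gauss_hyp2f1_series(a, b, c, z, tol=1e-12, max_terms=10_000):
    """Gauss hypergeometric function 2F1(a, b; c; z) via its power series (|z| < 1)."""
    term = total = 1.0
    for n in range(max_terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def expected_transactions(t, x, t_x, T, r, alpha, a, b):
    """Expected purchases in the next t periods, given history (x, t_x, T), under BG/NBD."""
    growth = ((alpha + T) / (alpha + T + t)) ** (r + x)
    hyp = gauss_hyp2f1_series(r + x, b + x, a + b + x - 1, t / (alpha + T + t))
    numerator = (a + b + x - 1) / (a - 1) * (1 - growth * hyp)
    # The dropout term only applies to customers with at least one repeat purchase.
    dropout = a / (b + x - 1) * ((alpha + T) / (alpha + t_x)) ** (r + x) if x > 0 else 0.0
    return numerator / (1 + dropout)

# Illustrative parameters and customer history (time measured in weeks).
params = dict(r=0.243, alpha=4.414, a=0.793, b=2.426)
print(expected_transactions(t=39, x=2, t_x=30.43, T=38.86, **params))
```

Note that the expression is undefined for a = 1, so a real implementation needs to guard against that case.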

The Shape of Data

As seen before, the BG/NBD model fits several distributions to historical customer purchasing data.

To do that, we need to build a dataset that provides the following three pieces of information for each customer:

Recency (derived from tₓ): the age of the customer at the moment of his last purchase, which is equal to the duration between the customer’s first purchase and his last purchase.

Frequency (x): the number of periods in which the customer has made a repeat purchase.

Age of the customer (T): the age of the customer at the end of the period under study, which is equal to the duration between the customer’s first purchase and the last day in the dataset.

Table: example of how your data should look, with one row per customer.

Whether you choose days, months or years depends on your typical customer buying cycle.

Food delivery businesses for example tend to experience customer repeats even within the same week, so they might go for days.

In the example above I used months because it was more appropriate.
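The three quantities can be derived from a plain transaction log. Below is a minimal sketch in days, using only the standard library; the function name and the sample log are hypothetical. Lifetimes also ships a helper for this transformation (summary_data_from_transaction_data in lifetimes.utils), which is the more convenient route in practice.

```python
from datetime import date

def summarize(transactions, period_end):
    """Build {customer_id: (frequency, recency, T)} in days from (customer_id, date) pairs."""
    purchase_days = {}
    for customer_id, day in transactions:
        # Several orders on the same day count as a single purchase period.
        purchase_days.setdefault(customer_id, set()).add(day)

    summary = {}
    for customer_id, days in purchase_days.items():
        first, last = min(days), max(days)
        frequency = len(days) - 1          # repeat purchase periods only
        recency = (last - first).days      # age at last purchase
        T = (period_end - first).days      # age at the end of the study
        summary[customer_id] = (frequency, recency, T)
    return summary

# Hypothetical transaction log.
log = [
    ("A", date(2019, 1, 1)), ("A", date(2019, 1, 1)), ("A", date(2019, 3, 2)),
    ("B", date(2019, 2, 10)),
]
print(summarize(log, period_end=date(2019, 12, 31)))
```

A customer with a single purchase, like B, ends up with frequency 0 and recency 0.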

Fitting the Model

Once we have created the dataset, we can pass it to the model and print out the summary.

Feel free to use the Online Retail Dataset provided by the UCI ML repository if you need some real customer data (I will use a fictitious dataset so as not to disclose any sensitive information about the company I analyzed).

Cool! What is this? We fitted the distributions from our assumptions to the historical data and derived the model parameters: alpha and r for the Gamma distribution (Assumption 2), and a and b for the Beta distribution (Assumption 4).

In the summary we also have a confidence interval for each parameter that we can use to compute a confidence interval of the expected future transactions for each customer.
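Under the hood, fitting a model of this kind typically means searching for the values of r, alpha, a and b that maximise the likelihood of the observed (frequency, recency, T) rows. The sketch below writes out the BG/NBD log-likelihood in plain Python and evaluates it for two parameter guesses. The data rows and guesses are hypothetical, and a real fit would hand this function to a numerical optimizer rather than compare guesses by hand.

```python
from math import lgamma, log

def log_likelihood(x, t_x, T, r, alpha, a, b):
    """BG/NBD log-likelihood of one customer's history (x, t_x, T)."""
    ln_a1 = lgamma(r + x) - lgamma(r) + r * log(alpha)
    ln_a2 = lgamma(a + b) + lgamma(b + x) - lgamma(b) - lgamma(a + b + x)
    a3 = (alpha + T) ** -(r + x)
    # The second term only exists for customers with at least one repeat purchase.
    a4 = a / (b + x - 1) * (alpha + t_x) ** -(r + x) if x > 0 else 0.0
    return ln_a1 + ln_a2 + log(a3 + a4)

def total_log_likelihood(customers, r, alpha, a, b):
    """Sum over customers; fitting means finding the parameters that maximise this."""
    return sum(log_likelihood(x, t_x, T, r, alpha, a, b) for x, t_x, T in customers)

# Hypothetical (frequency, recency, T) rows and two illustrative parameter guesses.
customers = [(2, 30.4, 38.9), (0, 0.0, 38.9), (5, 24.0, 27.0)]
for guess in ({"r": 0.2, "alpha": 4.0, "a": 0.8, "b": 2.5},
              {"r": 2.0, "alpha": 1.0, "a": 5.0, "b": 0.5}):
    print(guess, round(total_log_likelihood(customers, **guess), 2))
```

The guess with the higher (less negative) value explains the data better.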

Assessing the Model Fit

Now that we have built a model, we can check whether it really makes sense.

A first way to do this is to artificially generate customers whose expected purchasing behavior depends on the fitted model parameters, and compare them to the real data.

From what we see, the distribution of artificial customers closely resembles the real data.

At this step, I would also suggest computing the overall percentage error (= predicted transactions / actual transactions - 1) and the percentage error per number of transactions made in the calibration period.

In this way you can quantify how close to reality the model is, and whether it is a better fit for some customers than for others.

For example, the model might place fewer customers in the 5, 6 and 7 calibration transactions buckets than there are in reality, which could eventually result in a strong overall under-prediction.
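The percentage error is a one-liner. The sketch below applies it per bucket and overall; the bucket totals are made-up numbers for illustration only.

```python
def pct_error(predicted: float, actual: float) -> float:
    """Percentage error as defined in the text: predicted / actual - 1."""
    return predicted / actual - 1

# Hypothetical totals per bucket: calibration transactions -> (predicted, actual).
buckets = {0: (880, 1000), 1: (595, 500), 2: (300, 310), 3: (180, 200)}

for k, (pred, act) in buckets.items():
    print(f"{k} calibration transactions: {pct_error(pred, act):+.1%}")

total_pred = sum(pred for pred, _ in buckets.values())
total_act = sum(act for _, act in buckets.values())
print(f"overall: {pct_error(total_pred, total_act):+.1%}")
```

A negative value means the model under-predicts, a positive one that it over-predicts. Errors of opposite sign in different buckets can partly cancel out in the overall figure.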

Visualizing the Model Frequency/Recency Matrix

Now that we have a fitted model, we can look at its Frequency/Recency Matrix to inspect the expected relationship (based on our fitted model parameters) between a customer’s recency (age at last purchase), frequency (the number of repeat transactions made) and the expected number of transactions in the next time period (left graph).

We can also visualize the expected probability of a customer being alive depending on their recency and frequency (right graph).

Figure: expected number of transactions in the next time period (left) and probability of being alive (right), by frequency and recency.

Indeed we see that if a customer has purchased more than 25 times and their latest purchase was when they were more than 25 months old (bottom right), then they are your hottest customers with the highest probability of being alive and purchasing.

On the contrary, your coldest customers are in the top right corner: they bought a lot quickly, and we haven’t seen them in months.
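The hot and cold corners can be reproduced with the BG/NBD expression for the probability of being alive, P(alive) = 1 / (1 + a / (b + x - 1) · ((α + T) / (α + tₓ))^(r + x)) for x > 0. The sketch below compares two customers with the same frequency but very different recency. The parameter values are illustrative assumptions, not the fitted ones from this article.

```python
def probability_alive(x, t_x, T, r, alpha, a, b):
    """P(alive | x, t_x, T) under BG/NBD; x = frequency, t_x = recency, T = age."""
    if x == 0:
        return 1.0  # the model only lets customers drop out right after a purchase
    ratio = a / (b + x - 1) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return 1 / (1 + ratio)

# Illustrative parameters; time in months, both customers are 30 months old.
params = dict(r=0.25, alpha=4.5, a=0.8, b=2.5)
hot = probability_alive(x=26, t_x=28, T=30, **params)   # many purchases, last one recent
cold = probability_alive(x=26, t_x=5, T=30, **params)   # many purchases, then silence
print(f"hot customer:  {hot:.3f}")
print(f"cold customer: {cold:.3f}")
```

The long silence after a burst of purchases is what drives the cold customer’s probability towards zero.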

Cross Validation

Once you have verified that the model is close enough to the actual data, you can see how good it is at predicting future purchases.

Thanks to Lifetimes’ calibration_and_holdout_data() function we can quickly split a simple transactions dataset into calibration and holdout periods.

We will first fit the model to a calibration period of 2 years, then predict the next year’s transactions, and finally compare predicted vs holdout transactions.

Table: what the cal_hold dataframe looks like.

By comparing the average actual and predicted purchases in the plot, we can see that the predictions and the actuals are very close for customers with 0 to 3 repeat transactions in the calibration period, while they increasingly diverge for customers with more repeats.

Also in this case, the graph alone can be misleading, since small differences in buckets with a lot of customers can result in big errors. To properly evaluate the prediction we should look at the overall percentage error (Prediction Error in the graphs), which in my scenario is -6.3% (the model predicted 6.3% fewer transactions than in reality).

By looking at the percentage error by repeats in the calibration period, we notice that we under-predict for customers with 3 or more repeats (-7.3% to -30.3%), and for those with 0 repeats (roughly -12%).

We also strongly over-predict for 1-repeat customers (roughly +19%).

Although we mispredict at a more granular level, this is a pretty good result since we are interested in the overall predicted number of transactions.

However, cross-validating over just one period doesn’t tell us what to expect in the future.

Is -6.3% a reasonable error to expect in the future, or was this an incredibly lucky shot?

To better evaluate the model we will run cross validation over several periods and then check each period’s error.

For this we simply build a for-loop that iterates the cross validation over several periods.

In particular, we run it on a dataset with 6 years of transactions, where for each iteration we sample a subset of customers in the selected 2-year calibration period, and predict 1 year of future transactions.

Thus, we end up with 4 cross-validated periods.

We then plot the results.
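The loop itself boils down to sliding a 2-year calibration window plus a 1-year holdout window across the dataset. A minimal sketch of the window generation follows; the function name and the start and end years are hypothetical. Inside each iteration one would fit the model on the calibration years and compute the percentage error on the holdout year.

```python
def rolling_windows(first_year, last_year, calibration_years=2, holdout_years=1):
    """Yield (calibration, holdout) year tuples that slide forward one year at a time."""
    span = calibration_years + holdout_years
    for start in range(first_year, last_year - span + 2):
        cal = tuple(range(start, start + calibration_years))
        hold = tuple(range(start + calibration_years, start + span))
        yield cal, hold

# Hypothetical 6-year dataset.
windows = list(rolling_windows(2014, 2019))
for cal, hold in windows:
    print(f"calibrate on {cal}, predict {hold}")
```

With 6 years of data this yields the 4 cross-validated periods mentioned in the text.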

As you can see, the prediction error for each year always fell in the range 4.1% to -7.9%, with 2018 being the best predicted year (-2.3% Prediction Error).

This is pretty good, especially when compared to common cohort models with an absolute prediction error larger than 10%.

Customer Predictions and Probability Histories

Once you have built the model and verified its validity, you can easily look at single customer predictions and their probability of being alive.

This is incredibly valuable because you can then use the CLV prediction for marketing activities, forecasting or, more generally, churn prevention.

Here, for example, we plot a customer’s historical probability of being alive.

From the graph we can observe that each time the customer makes an additional purchase, his probability of being alive increases and then starts to drop again, but at a slower rate, because each new transaction increases his frequency and his recency.
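A history of this kind can be traced by re-evaluating the BG/NBD probability of being alive at each point in time, using only the purchases observed up to that point. The sketch below does this for a hypothetical customer with illustrative parameter values; it is not the plotting code behind the original graph.

```python
def alive_history(purchase_times, horizon, r, alpha, a, b):
    """P(alive) at every integer time 0..horizon; the first purchase happens at t = 0."""
    history = []
    for now in range(horizon + 1):
        repeats = [t for t in purchase_times if 0 < t <= now]
        x, t_x = len(repeats), max(repeats, default=0)
        if x == 0:
            history.append(1.0)  # BG/NBD: no dropout is possible before the first repeat purchase
            continue
        ratio = a / (b + x - 1) * ((alpha + now) / (alpha + t_x)) ** (r + x)
        history.append(1 / (1 + ratio))
    return history

# Hypothetical customer: repeat purchases in months 3, 5 and 12, observed for 18 months.
history = alive_history([3, 5, 12], horizon=18, r=0.25, alpha=4.5, a=0.8, b=2.5)
for month, prob in enumerate(history):
    print(f"month {month:2d}: {prob:.2f}")
```

The printed series jumps up in the months with a purchase and decays in between.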

Conclusion

To summarize, predicting CLV is always a tricky task, and historical frequency models usually fail to differentiate among customers with a similar number of past purchases.

Buy ’Til You Die probabilistic models come to our rescue by allowing us to build rather accurate predictions using only 3 pieces of information per customer (frequency, recency and age).

In this article we intentionally didn’t mention some important topics linked to CLV, but before letting you go, let me leave a short comment on three of them:

We predicted future transactions, but we left out the “value” part of the CLV equation. Usually the Gamma-Gamma submodel is used on top of the BG/NBD model to estimate the monetary value of transactions.

We used the BG/NBD model to predict aggregate future transactions, but if you are going to take action at user level you should properly measure the accuracy of single customer predictions.

If the model doesn’t properly fit your customers (and the assumptions are reasonable for your business), consider fitting it on customer cohorts (e.g. split by user country), and/or combining it with a linear model with additional features (e.g. a customer’s website visits, time since last visit, product reviews, channel of acquisition, etc.).
Thank you for reading, and keep transforming the world with Data!

Constructive feedback and stimulating talks are always welcome.

Feel free to connect and say hi on LinkedIn!
