Churn Prediction and Prevention in Python

And how would you know what to focus on if you wanted to keep them, and how much you could spend to keep them before retaining that customer turned into a loss? Let's try taking a different perspective.

Survival Analysis

What Logistic Regression is doing (under the hood) is assigning a probability to each observation that describes how likely it is to belong to the positive class.

In the case of churn vs no-churn, and any kind of classification, this can seem like a minor annoyance that we have to overcome by selecting a cutoff and rounding the output (<0.5 means 0, >=0.5 means 1).

However, if you think about it, this probability is exactly what we want.

In any large enough group of customers there are going to be people who have the same attributes/features (that's the pigeonhole principle at work).

Some of those will churn and some of them won’t, and what you’d ideally like to know is the probability of churn for each group.

This is what Logistic Regression gives you (before you cut off and round), but using Logistic Regression in this scenario does have one problem: it's not clear what time scale it's predicting for.
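To make that distinction concrete, here's a minimal sketch (on hypothetical toy data, not the telco dataset) of the rounded labels versus the underlying probabilities in Scikit-Learn:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for customer features and churn labels
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))        # hard 0/1 labels, after the 0.5 cutoff and rounding
print(clf.predict_proba(X))  # the per-observation probabilities we actually want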

Enter Survival Analysis.

Survival Analysis is a set of methods used in the life sciences (mostly Epidemiology and Pharma research) to determine the probability of patient cohort survival over time.

It’s a very large body of work with a great many intricate and statistically sophisticated tools, but we’ll only be using one of them — The Cox Proportional Hazards Model.

The Cox Proportional Hazards Model

The Cox PH Model is a regression-based model that analyses the covariates (features) of a dataset with regard to how long a patient (or customer) survives.

It is called the Proportional Hazards model because it describes how each feature proportionally scales risk relative to the baseline hazard for a cohort.
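In standard notation (this is the general Cox PH formula, not anything specific to our dataset), the hazard for a customer with features $x_1, \dots, x_p$ is modelled as

$$h(t \mid x) = h_0(t) \, \exp(\beta_1 x_1 + \dots + \beta_p x_p)$$

where $h_0(t)$ is the baseline hazard and each coefficient $\beta_i$ describes how much feature $x_i$ proportionally scales the risk.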

Thanks to the very good and well-documented lifelines package in Python, it is simple to get started using the Cox PH model.

Applying the CoxPH Model

In most cases, the first thing you'd have to do to your dataset to get it ready for the Cox regression model is create two new features:

- The 'age' of an observation: the difference in time between a patient starting a course of medication and the most recent observation of their status, or in our case the time between a customer joining a service and the most recent observation of whether or not they have churned.
- The 'event' flag: a binary flag representing whether an event, such as death or churn, has occurred.

Luckily for us, the telco dataset has both of these features already engineered, in the tenure and Churn columns.
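If a dataset didn't come with these, a sketch of that feature engineering might look like the following (the signup_date, last_observed_date and churned columns here are hypothetical, purely for illustration):

import pandas as pd

# Hypothetical raw data; the telco dataset already provides 'tenure' and 'Churn'
df = pd.DataFrame({
    'signup_date': pd.to_datetime(['2017-01-01', '2018-06-15']),
    'last_observed_date': pd.to_datetime(['2019-01-01', '2019-01-01']),
    'churned': ['Yes', 'No'],
})

# The 'age' of each observation, here in approximate months
df['tenure'] = (df['last_observed_date'] - df['signup_date']).dt.days // 30
# The binary 'event' flag
df['Churn_Yes'] = (df['churned'] == 'Yes').astype(int)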

An important thing to note when it comes to doing any kind of matrix-based regression is that singular matrices will throw an error in Python (as they should).

All this means is that when you create dummy variables, you must throw one of the columns away.

(Don't worry, we'll still be able to infer the missing category from the remaining variables.)

Here's the first 5 rows of our dummied and reduced dataset:

You can see that Gender_Male has disappeared, as have Partner_No, Dependents_No and so on.
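That reduction is a one-liner in pandas. A minimal sketch (note that pd.get_dummies drops the first category alphabetically, so the exact columns it discards may differ from the table above):

import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female'], 'Partner': ['Yes', 'No']})

# drop_first=True throws away one dummy column per category,
# which keeps the design matrix non-singular
print(pd.get_dummies(df, drop_first=True))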

Now that we have our dataset in the right format, let's fit the Cox model:

from lifelines import CoxPHFitter
from sklearn.model_selection import train_test_split

cph = CoxPHFitter()
cph_train, cph_test = train_test_split(data[x_select], test_size=0.2)
cph.fit(cph_train, 'tenure', 'Churn_Yes')  # duration column, then event column

There are a few unique things about the lifelines package which can trip up frequent users of Scikit-Learn. The first is that the column containing the event observations (churn) needs to be in the single dataset that's passed to the fit call.

As such, we can't split our dataset into four pieces (train and test for X and y) as we would for Logistic Regression; we have to split it in two.

For anyone who’s curious, this is very similar to the R syntax, where you specify the relevant columns within a single dataset and the algorithm takes care of removing them from the training data.
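For contrast, the familiar Scikit-Learn pattern looks like this (a sketch with hypothetical X and y arrays):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # hypothetical feature matrix
y = np.arange(10) % 2             # hypothetical labels

# Four pieces: train and test for both X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)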

In the cph.fit call, you have to pass in three different arguments. The first is the dataset that we've created using train_test_split, the second is the 'age' column (in our case tenure) and the third is the 'event' column (Churn_Yes in our case).

The next unique thing about the lifelines package is the .print_summary method that can be used on models (another thing borrowed from R).
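On our fitted model that's a single call:

# prints n, the number of events, coefficients with p-values, and the concordance
cph.print_summary()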

Here's our model summary:

There are a few important things to notice about this output:

- We can see the number of observations listed as n=5634 right at the top of the output; next to that we have our number of events (churned customers).
- We get the coefficients of our model. These are very important: a positive coefficient means that attribute makes a customer more likely to churn, while a negative one means customers with that feature are less likely to churn.
- We get significance codes for our features. A very nice addition!
- We get the concordance.

Concordance

Similar to how you compare Logistic Regression models using accuracy, you can compare different Cox PH models using the concordance.

Simply put, concordance is an assessment of how internally consistent the model is: if it says a particular feature increases risk, the observations that have that feature should be more at risk. If they are, the concordance goes up; if they're not, it decreases.

Our model has a concordance of 0.929 out of 1, so it's a very good Cox model.
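If you'd like to compute that number on the held-out data yourself, lifelines exposes a helper; here's a sketch using the cph and cph_test objects from the fitting step:

from lifelines.utils import concordance_index

# Negate the partial hazard so higher scores mean longer predicted survival
print(concordance_index(
    cph_test['tenure'],
    -cph.predict_partial_hazard(cph_test),
    cph_test['Churn_Yes']))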

Plotting the Cox model

Calling the basic .plot function on the model gives us this:

A handy visualisation of the significance and risk effects of the various features.

Something else we can do at this point is investigate how the features impact survival over time, like this:

cph.plot_covariate_groups('TotalCharges', groups=[0, 4000])

plot_covariate_groups is a method from the lifelines package which takes a feature name as its first input and a range of groupings as its second.

So here we’ll be looking at the different survival curves for customers whose TotalCharges are near zero compared to those whose TotalCharges are closer to 4000.

Which looks like this:

You can see that customers whose TotalCharges are closer to zero are at a much higher risk of churning (their survival curve drops off faster) than those whose TotalCharges are closer to 4000.

Churn Prediction

We have a good, working model. Now what? The point of this exercise was to get some usable information to help us make decisions about how to intervene to reduce and prevent churn.

So let’s make some predictions.

Unfortunately, there's not much we can do for the customers who have already left, so let's select only the remaining customers from our dataset:

censored_subjects = data.loc[data['Churn_Yes'] == 0]

In Survival Analysis lingo, a censored observation is one which is yet to have an 'event', so we're selecting all those customers who have yet to churn.

Now to predict their survival curves we use the handy method .predict_survival_function, like so:

unconditioned_sf = cph.predict_survival_function(censored_subjects)

You can see that we've called these 'unconditioned' survival functions; that's because we know that some of these curves will predict churn before the customer's current tenure time.

We have to condition the predictions on the basis that we know the customers were still with us when the data was collected:

# divide each curve by its value at the customer's current tenure, capped at 1
conditioned_sf = unconditioned_sf.apply(
    lambda c: (c / c.loc[data.loc[c.name, 'tenure']]).clip(upper=1))

Now we can investigate individual customers and see how the conditioning has affected their survival over the baseline rate:

import matplotlib.pyplot as plt

subject = 12
unconditioned_sf[subject].plot(ls="--", color="#A60628", label="unconditioned")
conditioned_sf[subject].plot(color="#A60628", label="conditioned on $T>58$")
plt.legend()

As you can see, the fact that we know customer 12 is still a customer after 58 months means their survival curve drops more slowly than the baseline curve for similar customers without that condition.

That very handy predict_survival_function method has created a matrix containing a survival probability for each remaining customer at each point in time.

What we need to do next is use that to select a single number as a prediction for how long the customer will last, which we can use to determine intervention value.

Depending on the use case you could choose any percentile, but for our case we’ll use the median.

from lifelines.utils import median_survival_times, qth_survival_times

predictions_50 = median_survival_times(conditioned_sf)
# This is the same, but you can change the fraction to get other %tiles:
# predictions_50 = qth_survival_times(.50, conditioned_sf)

What that gives us is a single row (in a pandas DataFrame) that has the month number (tenure) where the customer has a 50% likelihood of churning.

We can use this single row, and by joining it to our data DataFrame we can investigate the predicted remaining value a customer has for the business:

values = predictions_50.T.join(data[['MonthlyCharges', 'tenure']])
values['RemainingValue'] = values['MonthlyCharges'] * (values[0.5] - values['tenure'])

Here's the first 5 rows of this new DataFrame:

The column named 0.5 contains the predictions we received from our median_survival_times call.

If you selected a different percentile, this column would be named differently.

What we see here is that by multiplying the MonthlyCharges by the difference between the customer's predicted median cancellation time and their current tenure, we can get a sense of which customers would inflict the most damage to our top line.
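For example, a quick way to surface those customers from the values DataFrame we just built:

# Customers whose churn would cost us the most remaining revenue
print(values.sort_values('RemainingValue', ascending=False).head(10))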

Churn Prevention

Okay, so now we know which customers are at the highest risk of churn, but what can we do to keep them? If we go back to our coefficient chart from earlier, we can see that the most significant features which impact survival positively are:

- Having a 2 year contract
- Having a 1 year contract
- Paying by Credit Card
- Paying by Bank Transfer

Beyond these four, the increases in survival become minimal and the results aren't significant anyway. So let's focus on these four.

What we need to do to understand how much we can spend to keep customers is compare their survival rates with those of similar customers who instead have each of these four features present:

import pandas as pd

upgrades = ['PaymentMethod_Credit card (automatic)',
            'PaymentMethod_Bank transfer (automatic)',
            'Contract_One year',
            'Contract_Two year']

results_dict = {}
for customer in values.index:
    actual = data.loc[[customer]]
    change = data.loc[[customer]]
    results_dict[customer] = [cph.predict_median(actual)]
    for upgrade in upgrades:
        # flip the upgrade flag on...
        change[upgrade] = 1 if list(change[upgrade]) == [0] else 0
        results_dict[customer].append(cph.predict_median(change))
        # ...and flip it back before trying the next one
        change[upgrade] = 1 if list(change[upgrade]) == [0] else 0

results_df = pd.DataFrame(results_dict).T
results_df.columns = ['baseline'] + upgrades
actions = values.join(results_df).drop([0.5], axis=1)

What we're doing here is looping through the customers, changing one feature at a time and storing the predicted median survival for a customer who has that change.

What we're left with at the end is this:

We can see from this that if we managed to get the first customer to use a Credit Card to make payments, we could increase their survival time by 4 months (25 vs the 21-month baseline), and so on.

This is a great result which really helps us to see how we can move the needle on keeping customers, but let's go a step further and see what impact that has financially:

actions['CreditCard Diff'] = (
    actions['PaymentMethod_Credit card (automatic)'] - actions['baseline']
) * actions['MonthlyCharges']
actions['BankTransfer Diff'] = (
    actions['PaymentMethod_Bank transfer (automatic)'] - actions['baseline']
) * actions['MonthlyCharges']
actions['1yrContract Diff'] = (
    actions['Contract_One year'] - actions['baseline']
) * actions['MonthlyCharges']
actions['2yrContract Diff'] = (
    actions['Contract_Two year'] - actions['baseline']
) * actions['MonthlyCharges']

Now we can see that moving the customer in the first row to paying by Credit Card could be worth up to £119.40. That's much more helpful than a simple count of months.

Accuracy and Calibration

Okay, we're nearly done. We have monetary values that we can use to judge whether a particular churn intervention is feasible, as well as solid predictions regarding when a customer will churn. But how accurate is all of this?

We know our Cox model is a good one (92.9% concordance), but what does that mean in real terms? How accurate is it? When you take a probabilistic view of events like churn (or fraud or theft), it's important to check for calibration even more than accuracy.

Calibration is the propensity of a model to get probabilities right over time.

Think of it like this: a weather forecasting service is well calibrated if, out of all the times it said there was a 40% chance of rain, it actually rained 40% of the time.

In Scikit-Learn, we can use the calibration_curve method to derive values from probabilistic predictions and the true (binary) values of our dataset:

import numpy as np
from sklearn.calibration import calibration_curve

plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")

# churn probability at month 13 is 1 minus the survival probability at month 13
probs = 1 - np.array(cph.predict_survival_function(cph_test).loc[13])
actual = cph_test['Churn_Yes']
fraction_of_positives, mean_predicted_value = calibration_curve(
    actual, probs, n_bins=10, normalize=False)

ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
         label="%s" % ("CoxPH",))
ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots (reliability curve)')

Which gives us this:

You read this chart by examining the various probabilities against the fractions that are present in the dataset (in our case the test set).

You can see that it's pretty close to the diagonal line which represents perfect calibration. However, our model seems to underpredict risk at the low end (< 50% chance of churn) and slightly overpredict risk at the high end (> 50% chance of churn).

To get a numeric understanding of how far away the line is from perfect calibration, we can use the brier_score_loss from the Scikit-Learn package:

from sklearn.metrics import brier_score_loss

brier_score_loss(
    cph_test['Churn_Yes'],
    1 - np.array(cph.predict_survival_function(cph_test).loc[13]),
    pos_label=1)

Those of you with a keen eye may have noticed that I keep indexing at tenure = 13.

Because our model works over a range of time periods, we have to inspect the calibration at each step in order to get a feel for accuracy.

Let's do that all in one go:

loss_dict = {}
for i in range(1, 73):
    score = brier_score_loss(
        cph_test['Churn_Yes'],
        1 - np.array(cph.predict_survival_function(cph_test).loc[i]),
        pos_label=1)
    loss_dict[i] = [score]

loss_df = pd.DataFrame(loss_dict).T

fig, ax = plt.subplots()
ax.plot(loss_df.index, loss_df)
ax.set(xlabel='Prediction Time', ylabel='Calibration Loss',
       title='Cox PH Model Calibration Loss / Time')
ax.grid()
plt.show()

Which gives us this:

So we can see that our model is pretty well calibrated between 5 and 25 months and then gets less and less so as we start to predict further out.

The only remaining thing to do to make our analysis more realistic is to account for this poor calibration.

Let's create upper and lower bounds for the expected return on investment from getting customers to make changes:

loss_df.columns = ['loss']

# look up the calibration loss at the predicted median survival time
# for each upgrade, then widen the Diff estimate by that loss
temp_df = actions.reset_index().set_index(
    'PaymentMethod_Credit card (automatic)').join(loss_df)
temp_df = temp_df.set_index('index')
actions['CreditCard Lower'] = temp_df['CreditCard Diff'] - (
    .5 * temp_df['loss'] * temp_df['CreditCard Diff'])
actions['CreditCard Upper'] = temp_df['CreditCard Diff'] + (
    .5 * temp_df['loss'] * temp_df['CreditCard Diff'])

temp_df = actions.reset_index().set_index(
    'PaymentMethod_Bank transfer (automatic)').join(loss_df)
temp_df = temp_df.set_index('index')
actions['BankTransfer Lower'] = temp_df['BankTransfer Diff'] - (
    .5 * temp_df['loss'] * temp_df['BankTransfer Diff'])
actions['BankTransfer Upper'] = temp_df['BankTransfer Diff'] + (
    .5 * temp_df['loss'] * temp_df['BankTransfer Diff'])

temp_df = actions.reset_index().set_index('Contract_One year').join(loss_df)
temp_df = temp_df.set_index('index')
actions['1yrContract Lower'] = temp_df['1yrContract Diff'] - (
    .5 * temp_df['loss'] * temp_df['1yrContract Diff'])
actions['1yrContract Upper'] = temp_df['1yrContract Diff'] + (
    .5 * temp_df['loss'] * temp_df['1yrContract Diff'])

temp_df = actions.reset_index().set_index('Contract_Two year').join(loss_df)
temp_df = temp_df.set_index('index')
actions['2yrContract Lower'] = temp_df['2yrContract Diff'] - (
    .5 * temp_df['loss'] * temp_df['2yrContract Diff'])
actions['2yrContract Upper'] = temp_df['2yrContract Diff'] + (
    .5 * temp_df['loss'] * temp_df['2yrContract Diff'])

Here I'm discounting the value we came up with before to account for the uncertainty around calibration. We look up how well calibrated the model is at the time period each specific upgrade is predicted to produce, and create a lower and upper bound around the estimated return on investment.
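For example, to review one intervention with its uncertainty band (a sketch over the actions DataFrame we just built):

# Point estimate plus calibration-discounted bounds for the credit card upgrade
cols = ['CreditCard Diff', 'CreditCard Lower', 'CreditCard Upper']
print(actions[cols].sort_values('CreditCard Diff', ascending=False).head())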

The actions DataFrame now looks something like this:

With that, we've ended our whirlwind tour of Survival Analysis, Concordance and Calibration. What we've got for our efforts is an actionable set of data that we can use to keep customers signed up for longer, which is the point of churn prediction!

Thank you for reading! If you have any questions, please email me.

Originally published at machinelearningphd.com on March 7, 2019.
