Predicting Customer Churn with Spark

Célestin Hermez, Jan 21

For many companies, churn is a major concern.

It is natural that some people stop using the service, but if this proportion becomes too large it can hinder growth, regardless of revenue sources (ad sales, subscriptions or a mix of both).

With that in mind, the ability to predict churn by identifying at-risk customers is crucial for firms, because it enables them to take action, such as personalized offers or discounts, to mitigate the loss of customers.

Machine learning models built on historical data can give us insight into signals of churn and help us predict it before it happens.

For this example, we use log data from a fictitious music app company, called Sparkify.

This is a toy dataset, relatively small in size so it could be processed by a single computer.

Nonetheless, in order to mimic the real world, we use Spark (in local mode) to process the data and build the model.

Spark is one of the leading solutions for Big Data processing and modeling, getting its speed from in-memory processing and lazy evaluation of computations through DAGs.

You can click here to learn more.

By using Spark for this project, although it is not strictly necessary, we build an extensible framework to analyze churn for data of any size, since the code and the analysis could easily scale up provided they were deployed on a cluster (such as AWS, IBM Cloud or GCP) which can handle the computations required.

In this blog post I will summarize the analysis contained in this GitHub repository.

After a brief overview of the data at hand, we will present the model, its results and what it means for Sparkify.

The Data

After starting our Spark session in local mode, we can load our dataset.

It contains information on 226 distinct users between 10/01/2018 and 12/03/2018.

It is in JSON format (more information on the JSON format here), and can easily be loaded with the following commands:

    path = "mini_sparkify_event_data.json"
    df = spark.read.json(path)

This data captures actions as varied as listening to a song, giving a “thumbs up”, hitting the homepage, changing the settings of the account or adding a song to a playlist.

As a result, although it covers only a small subset of users, the dataset still includes 278,251 rows.

Of the 226 users present in the dataset, 52 ended up churning.

In order to properly train and assess our model, we address this imbalance to make sure the model can accurately predict both categories and does not overly lean towards predicting the absence of churn.

We do so with a technique called upsampling, i.e. sampling with replacement from the population of users who churned until we get two groups of comparable size.
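As a minimal sketch of the idea, the snippet below upsamples a minority group in plain Python. The user IDs are made up, the 52/174 split mirrors the counts above, and the helper name `upsample` is hypothetical. This variant keeps every original churned user and adds random draws with replacement until the two groups match in size; on a Spark DataFrame, sampling with replacement is available through `DataFrame.sample(withReplacement=True, ...)`.

```python
import random

def upsample(minority, target_size, seed=42):
    """Keep all minority items and add random draws (with replacement)
    until the group reaches target_size."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=max(0, target_size - len(minority)))
    return list(minority) + extra

# Hypothetical user IDs: 52 churned users and 174 users who stayed (226 in total).
churned = [f"churn_{i}" for i in range(52)]
stayed = [f"stay_{i}" for i in range(174)]

churned_upsampled = upsample(churned, target_size=len(stayed))
print(len(churned_upsampled), len(stayed))
```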

The Model

In order to build features for this model, I let data exploration dictate my approach, along with domain knowledge I acquired while working for a very similar (real) music app company.

In particular, I was looking for features whose values seem to vary significantly between users who churned and those who did not.

Looking for such features highlighted the importance of the type of the account (free vs. paid) as well as other account-related information, such as the state and the registration date, to give us some insight into who the users are.

Figure: Users who churned are more likely to have a free account

Another group of features is centered around the actions users take on the platform.

These elements, such as session length, number of songs per session, thumbs up/thumbs down, adding a song to a playlist or adding a friend, provide us with additional insight.

The intuitive interpretation prior to modeling is that these features capture a latent variable related to user engagement, with a lower engagement being linked to a higher likelihood of churning.

For instance, users who ended up churning visited the platform 9 times per month, while users who remained visited 14 times.

We will see later on whether this pre-conception was confirmed by the data.

Figure: The distribution in terms of session length and number of items in session differs between both groups

Those of you who are interested in the technical details of creating these features can refer back to the code linked in the introduction, but thanks to Spark’s pipelines we can efficiently process and transform all these features into a form suitable for analysis.
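To give a flavor of what such feature engineering involves, here is a minimal plain-Python sketch that aggregates raw events into per-user features. The event fields (`userId`, `sessionId`, `page`), the page labels and the output feature names are assumptions for illustration, not taken from the linked code, and the actual analysis performs this kind of aggregation on Spark DataFrames.

```python
from collections import defaultdict

# Hypothetical log events; field names and page labels are assumptions.
events = [
    {"userId": "1", "sessionId": 10, "page": "NextSong"},
    {"userId": "1", "sessionId": 10, "page": "Thumbs Up"},
    {"userId": "1", "sessionId": 11, "page": "NextSong"},
    {"userId": "2", "sessionId": 20, "page": "NextSong"},
    {"userId": "2", "sessionId": 20, "page": "Add Friend"},
    {"userId": "2", "sessionId": 20, "page": "NextSong"},
]

def build_user_features(events):
    """Aggregate raw events into one feature dict per user."""
    sessions = defaultdict(set)                      # userId -> set of session IDs
    counts = defaultdict(lambda: defaultdict(int))   # userId -> page -> event count
    for e in events:
        sessions[e["userId"]].add(e["sessionId"])
        counts[e["userId"]][e["page"]] += 1

    features = {}
    for user, user_sessions in sessions.items():
        songs = counts[user]["NextSong"]
        features[user] = {
            "numSessions": len(user_sessions),
            "songsPerSession": songs / len(user_sessions),
            "thumbsUp": counts[user]["Thumbs Up"],
            "addFriend": counts[user]["Add Friend"],
        }
    return features

features = build_user_features(events)
print(features["1"])
print(features["2"])
```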

You can see below the first few rows of the dataset with all its features.

Figure: First few rows of features dataset

Examining the distribution of these features, we can see that itemInSession, thumbsUp, addFriend and addToPlaylist are very spread out around the mean.

This is important as models rely on variability to learn and make predictions.

On the other hand, features such as daily sessions or length have less variation, so I expect those to carry a lesser weight in predictions.

Once this is done, we test out three different classification models (Random Forest, Logistic Regression and Gradient Boosting) and assess their accuracy and F1 score on a test set.

It is important to consider both rather than accuracy alone, because the F1 score accounts for the class imbalance present in the test set and, by extension, in the real world.
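A small worked example shows why accuracy alone can mislead on imbalanced data. The confusion-matrix counts below are hypothetical (a 35-user test set with 8 churners), not results from this analysis, and the F1 score is computed for the churn class only.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical model A: always predicts "no churn".
always_no = dict(tp=0, fp=0, fn=8, tn=27)
# Hypothetical model B: catches most churners at the cost of some false alarms.
balanced = dict(tp=6, fp=7, fn=2, tn=20)

for name, cm in [("always_no", always_no), ("balanced", balanced)]:
    acc = accuracy(**cm)
    f1 = f1_score(cm["tp"], cm["fp"], cm["fn"])
    print(f"{name}: accuracy={acc:.2f}, f1={f1:.2f}")
```

In this made-up case, the first model is slightly more accurate (0.77 vs. 0.74) yet never identifies a churner, so its F1 is 0, while the second reaches an F1 of about 0.57.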

Due to the lack of striking differences in results between the three models, we choose to further tune a random forest model given its greater interpretability.

We do this through cross validation, leveraging a GridSearch algorithm to find the best combination of parameters.

In particular, we test out the following values:

minInfoGain (the minimum information gain for a split to be considered at a tree node): 0, 1
maxDepth (maximum depth of the tree): 5, 10
numTrees (the number of trees): 20, 50

I chose these parameters specifically because they are related to preventing overfitting.
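To make the size of this search concrete, the sketch below enumerates the grid in plain Python. The fold count of 3 is an assumption: the post does not state it, and 3 is simply the default of Spark's `CrossValidator`. In Spark ML, a grid like this would typically be built with `ParamGridBuilder` and passed to `CrossValidator`.

```python
from itertools import product

# Candidate values listed above.
param_grid = {
    "minInfoGain": [0, 1],
    "maxDepth": [5, 10],
    "numTrees": [20, 50],
}

# Every combination of one value per parameter.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

num_folds = 3  # assumption: not stated in the post
total_fits = len(combinations) * num_folds

for combo in combinations:
    print(combo)
print(f"{len(combinations)} combinations x {num_folds} folds = {total_fits} model fits")
```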

The Results

After tuning, the best hyperparameters are 50 trees, a minimum information gain of 0 and a maximum depth of 10.

We have a model which reaches 73% accuracy on the test set, with an F1 score of 0.


These two metrics together are very encouraging: with data on only 191 users (in our training set) we are able to effectively classify users into these two categories, without a drastic difference in performance in predicting one or the other.

Interestingly, the performance of our model did not improve after grid search, most likely due to the small size of our dataset.

We even assessed the robustness of the model by training and predicting with varying random states, and found that the accuracy of the model was very consistent across them.

Looking at feature importance, our earlier intuition was confirmed: both static variables (registration month, geographical location) and behaviors (adding a friend) bear a heavy weight in our predictions.

This should encourage Sparkify to log as much information as possible, as all signals are important when trying to predict churn.

Conclusion

Spark provides us with a generalizable framework to predict churn.

It can handle big data for any company provided it is deployed on a cluster that can handle the computations required.

Should this analysis be applied to a larger dataset with more computing power available, I expect even better accuracy and F1 score to be reached, since we would be able to search a larger hyperparameter space and train on more users.

We could even combine the two approaches: first run a random search over a very large hyperparameter space to identify a promising subset, then run a grid search over that subset to find the best combination, in order to further speed up computations and improve performance.
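This two-stage idea can be sketched as follows in plain Python. The `score` function is a toy stand-in with a known optimum, used only so the example runs; in practice it would be a cross-validated metric such as the F1 score. The parameter ranges, the number of trials and the helper names are likewise hypothetical.

```python
import random
from itertools import product

def score(params):
    """Toy objective (higher is better), peaking at maxDepth=12, numTrees=80."""
    return -((params["maxDepth"] - 12) ** 2) - ((params["numTrees"] - 80) / 10) ** 2

def random_search(space, n_iter, rng):
    """Stage 1: score n_iter random combinations drawn from a large space."""
    trials = [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]
    return sorted(trials, key=score, reverse=True)

def grid_search(grid):
    """Stage 2: exhaustively score every combination of a small grid."""
    names = list(grid)
    combos = [dict(zip(names, vals)) for vals in product(*grid.values())]
    return max(combos, key=score)

rng = random.Random(0)
large_space = {"maxDepth": list(range(2, 31)), "numTrees": list(range(10, 201, 10))}

# Keep the parameter values seen in the 5 best random trials...
top = random_search(large_space, n_iter=30, rng=rng)[:5]
small_grid = {k: sorted({t[k] for t in top}) for k in large_space}

# ...and search exhaustively over that much smaller grid.
best = grid_search(small_grid)
print(small_grid)
print(best)
```

Because the best random trial is itself one of the grid's combinations, the second stage can only match or improve on the first.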

Finally, to gain more insights into the model we could leverage SHAP values or permutation importance to understand how individual features influence model predictions.
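As an illustration of the second option, here is a minimal permutation-importance sketch in plain Python. The toy data and the rule-based `predict` function are hypothetical stand-ins for a trained model; the idea is simply to shuffle one feature at a time and measure how much accuracy drops.

```python
import random

def accuracy(predict, X, y):
    """Share of rows for which the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: feature 0 = sessions per month, feature 1 = noise the "model" ignores.
X = [[i, (7 * i) % 5] for i in range(20)]
y = [1 if row[0] < 10 else 0 for row in X]  # 1 = churned

def predict(row):
    """Stand-in for a trained model: few sessions -> predict churn."""
    return 1 if row[0] < 10 else 0

importances = permutation_importance(predict, X, y)
print(importances)
```

Shuffling the ignored feature leaves accuracy unchanged (importance of 0), while shuffling the feature the model relies on degrades it.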

With historical data on a small subset of customers, we built a model that can identify users at risk of churning with 73% accuracy.

It can be applied somewhat regularly (every day/week depending on the computing infrastructure in place) to the user base, and flag users who may leave the service soon.

With this information in mind, Sparkify can take mitigating action, such as sending a personalized message or offering a monthly discount.

All of this could be automated and would have a great impact on revenue and growth.

Which specific action to take should be determined through A/B testing.

Finally, it is important that this model be re-trained regularly as mitigating actions are implemented and the user base grows and evolves, in order to make this model adapt to changing conditions.

