Big data analytics: Predicting customer churn with PySpark

They could be subscribing to a competitor’s business, or abandoning the service altogether.

Design by Artpunk101

In case you missed it, the cauldron is a metaphor for the service, and the leak is a metaphor for churn.

In a more literal sense, if our cauldron is leaking faster than it is being filled, it will inevitably dry up.

This is the main question of our current project: which users are likely to cancel their subscription to an online music streaming service and, even more importantly, why?

Let’s start by examining the datasets we have at our disposal.

The full dataset is a very granular user log stored as a 12 GB JSON file on AWS. At this size, only big data frameworks like Spark are feasible to use.

In order to understand the available features and build our model, we will start by using a small subset of the data (~128 MB).

This sample will be used on a single machine for exploratory data analysis purposes.

The first step is to load our data as follows:

Let us bask in the richness of our data by printing the schema:

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)

Exploratory analysis

Just from a cursory look at the data, we noticed that there were rows where the userId was missing.

Upon further investigation, it looks like only the following pages have no userId:

+-------------------+
|               page|
+-------------------+
|               Home|
|              About|
|Submit Registration|
|              Login|
|           Register|
|               Help|
|              Error|
+-------------------+

This leads us to believe that the page hits from these ID-less rows come from visitors who have not registered, have not signed in yet, or are not yet playing music.

All these pages will eventually lead to playing music.

However, since these rows contain mostly empty values, we will drop them all before starting our analysis.
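As a rough illustration of that cleaning step, here is a plain-Python sketch over a few made-up log rows. It is not the project's Spark code, and it assumes that a missing userId shows up either as `None` or as an empty string, which the article does not specify.

```python
# Hypothetical sample rows; only the column names come from the schema.
rows = [
    {"userId": "30", "page": "NextSong"},
    {"userId": "", "page": "Home"},
    {"userId": None, "page": "Login"},
    {"userId": "9", "page": "Thumbs Up"},
]


def has_user_id(row):
    """Return True when the row carries a usable userId."""
    user_id = row.get("userId")
    # Assumption: both None and blank strings count as "missing".
    return user_id is not None and user_id.strip() != ""


# Keep only the rows that can be attributed to a user.
clean_rows = [row for row in rows if has_user_id(row)]
print([row["userId"] for row in clean_rows])
```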

Just from the list of feature names in the previous section, we can get an idea of how rich the data is.

It’s safe to say that every single interaction a user has with the app is logged.

Without going any deeper down the data rabbit hole, we first need to define churn.

For our project’s sake, a churned user is someone who has visited the page for “Cancellation Confirmation”.

This is where users confirm their cancellation request.

Henceforth, any user who has visited said page will be classified as churned, otherwise they will be considered non-churned.
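The labeling rule can be sketched in a few lines of plain Python. The events below are invented for illustration, and `label_churn` is a hypothetical helper name; the actual project applies the same idea to a Spark DataFrame.

```python
CHURN_PAGE = "Cancellation Confirmation"

# Hypothetical events as (userId, page) pairs.
events = [
    ("30", "NextSong"),
    ("30", "Cancel"),
    ("30", "Cancellation Confirmation"),
    ("9", "NextSong"),
    ("9", "Downgrade"),
]


def label_churn(events):
    """Map each userId to 1 if they ever reached the churn page, else 0."""
    labels = {}
    for user_id, page in events:
        hit = 1 if page == CHURN_PAGE else 0
        # Once a user is flagged as churned, the flag stays at 1.
        labels[user_id] = max(labels.get(user_id, 0), hit)
    return labels


churn = label_churn(events)
print(churn)
```

Note that visiting "Cancel" alone does not count under this rule; only the confirmation page does.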

Luckily for us, printing the categories in the page feature shows that it has just such a value.

+-------------------------+
|page                     |
+-------------------------+
|About                    |
|Add Friend               |
|Add to Playlist          |
|Cancel                   |
|Cancellation Confirmation|
|Downgrade                |
|Error                    |
|Help                     |
|Home                     |
|Login                    |
|Logout                   |
|NextSong                 |
|Register                 |
|Roll Advert              |
|Save Settings            |
|Settings                 |
|Submit Downgrade         |
|Submit Registration      |
|Submit Upgrade           |
|Thumbs Down              |
|Thumbs Up                |
|Upgrade                  |
+-------------------------+

We can gain better insight by looking at some selected columns of the activity of a random user.

+--------------------+-------------+---------+-----+---------+
|              artist|itemInSession|   length|level|     page|
+--------------------+-------------+---------+-----+---------+
|         Cat Stevens|            0|183.19628| paid| NextSong|
|        Simon Harris|            1|195.83955| paid| NextSong|
|         Tenacious D|            2|165.95546| paid| NextSong|
|          STEVE CAMP|            3|201.82159| paid| NextSong|
|             DJ Koze|            4|208.74404| paid| NextSong|
|           Lifehouse|            5|249.18159| paid| NextSong|
|Usher Featuring L...|            6|250.38322| paid| NextSong|
|                null|            7|     null| paid|Thumbs Up|
|            Harmonia|            8|655.77751| paid| NextSong|
|           Goldfrapp|            9|251.14077| paid| NextSong|
+--------------------+-------------+---------+-----+---------+

We can also observe when songs are played the most during the day, which turns out to be around noon, during lunch hour.

xkcd-like scatter plot

Feature Engineering

Now that we have clearly identified the churned users, it’s time to put our data science hats on and try to identify the contributing factors to the churn rate.

How does a user behave when they are unsatisfied with the service? I will try to answer that question with the 6 engineered features below.

Average hour when a user plays a song

Maybe people who aren’t enjoying the service tend to play music at different times of day.

It’s clear that the difference between these two groups is negligible.

The actual values are 11.74 and 11.71 for churned and non-churned users respectively, a measly difference of 1 minute and 48 seconds.

However, we will keep it in our model since it might prove to be useful once we combine it with the other features we are about to create.
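For readers who want to check the arithmetic, the two averages are expressed in fractional hours, so the gap converts to minutes and seconds as follows. The variable names are illustrative; only the two values come from the article.

```python
churned_avg_hour = 11.74      # average play hour, churned users (from the text)
non_churned_avg_hour = 11.71  # average play hour, non-churned users (from the text)

# Convert the difference in fractional hours to whole seconds.
diff_seconds = round(abs(churned_avg_hour - non_churned_avg_hour) * 3600)
minutes, seconds = divmod(diff_seconds, 60)
print(f"{minutes} min {seconds} s")
```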


Gender

Our music streaming service might be more appealing to one gender over the other.

It’s worth investigating if one gender is over-represented among the churned users.

We will create a dummy variable for gender with a value of 1 for male and 0 for female.

By grouping the users by gender and churn, we can clearly see an over-representation of males in the churned group.

62% of churned users are male, versus only 51% of non-churned users.

So males are more likely to cancel their subscription.

+-----+------------------+
|churn|   avg(gender_dum)|
+-----+------------------+
|    1|0.6153846153846154|
|    0|           0.[...]|
+-----+------------------+
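Because the dummy is 1 for male users, its average within a group is simply the share of males in that group. A minimal plain-Python sketch of that grouping, using invented users and assuming gender is encoded as "M"/"F" (the article does not show the raw values):

```python
from collections import defaultdict

# Hypothetical users as (gender, churn flag); not the project's data.
users = [("M", 1), ("M", 1), ("F", 1), ("M", 0), ("F", 0), ("F", 0), ("M", 0), ("F", 0)]


def male_share_by_churn(users):
    """Average the 0/1 gender dummy within each churn group."""
    totals = defaultdict(lambda: [0, 0])  # churn -> [sum of dummies, user count]
    for gender, churn in users:
        gender_dum = 1 if gender == "M" else 0  # assumed encoding
        totals[churn][0] += gender_dum
        totals[churn][1] += 1
    return {churn: dummy_sum / count for churn, (dummy_sum, count) in totals.items()}


shares = male_share_by_churn(users)
print(shares)
```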


Days of activity

People who have just signed up for the service might not have formed a real opinion on the service yet.

For starters, they haven’t had enough time to try all the different features.

People also need time to get comfortable with any new software.

In theory, this should make newer users, who haven’t been subscribed for long, more likely to cancel, or churn.

It turns out that our theory was accurate: new users are significantly more likely to cancel than older users.
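One plausible way to compute such a feature is the time between a user's registration and their latest logged event. The sketch below rests on two assumptions that the article does not confirm: that `registration` and `ts` are Unix timestamps in milliseconds, and that "days of activity" is measured this way. `days_active` is a hypothetical helper.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_active(registration_ms, last_ts_ms):
    """Days between registration and the user's latest logged event."""
    return (last_ts_ms - registration_ms) / MS_PER_DAY


# Hypothetical user: registered, then last seen 36 hours later.
registration = 1_538_352_000_000
last_event = registration + 36 * 60 * 60 * 1000
print(days_active(registration, last_event))
```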


Number of sessions

It makes sense that users who log on more are enjoying the service since they are using it more often.

The more times someone logs on, the more likely they should be to keep using our service.

That turned out to be accurate.

Similar to the Days of activity feature, people who are either new subscribers, or don’t use the service that often are more likely to cancel.


Average number of songs per session

It’s not just the quantity of sessions that matters, but the quality as well.

Even users who stream music sporadically can still consistently listen to plenty of songs per session, which would indicate a good experience despite the low number of times they’ve used the service.

While the difference is small, it is certainly visible.

The effect might not be as significant as the previous two calculated features, but it’s visible enough to keep in our analysis and find out how relevant it is.
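Both session-based features can be derived from the same grouping of the log. The plain-Python sketch below uses invented events and assumes that a played song corresponds to a row whose page is "NextSong", as the sample activity suggests; `session_features` is a hypothetical helper, not the project's Spark code.

```python
from collections import defaultdict

# Hypothetical events as (userId, sessionId, page).
events = [
    ("30", 1, "NextSong"), ("30", 1, "NextSong"), ("30", 1, "Thumbs Up"),
    ("30", 2, "NextSong"),
    ("9", 7, "NextSong"), ("9", 7, "NextSong"), ("9", 7, "NextSong"), ("9", 7, "NextSong"),
]


def session_features(events):
    """Return {userId: (n_sessions, avg_songs_per_session)}."""
    songs = defaultdict(lambda: defaultdict(int))  # userId -> sessionId -> song count
    for user_id, session_id, page in events:
        # Every event registers the session; only NextSong adds to the song count.
        songs[user_id][session_id] += 1 if page == "NextSong" else 0
    return {
        user_id: (len(per_session), sum(per_session.values()) / len(per_session))
        for user_id, per_session in songs.items()
    }


features = session_features(events)
print(features)
```

In this sketch a session without any songs still counts as a session and pulls the average down, which is a design choice rather than something the article states.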


Number of errors encountered

One of the most annoying messages we can receive is an error message, especially one which interferes with something we are engrossed in, like a streaming movie or music.

Users who experience more errors on average should be more likely to cancel their service, since they’ve had a buggier experience.

Let’s look at the verdict.

It turns out that churned users experience fewer errors.

This might be explained by the longer time-frame in which non-churned users are active.

So the longer we use the service, the more likely we are to run into errors.

All things considered, the error rate is low.

Building our model

With our features properly engineered, we will train three models and pick the best one:

Logistic Regression
Random Forest
Gradient-boosted Tree

In our attempt to build modular code, we’ve written a function which builds a pipeline for each model and runs it.

It also takes a parameter grid if we want to train our model on the data with different hyper-parameters.

After finding the best fitting model from the combination of hyper-parameters, it is returned by our function.
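The selection logic described here can be sketched without Spark. In the example below, `best_params` is a hypothetical helper, `toy_score` stands in for "fit the pipeline and score it", and the parameter names `max_depth` and `num_trees` are illustrative only; none of them are taken from the project's code or from Spark's API.

```python
from itertools import product


def best_params(param_grid, train_and_score):
    """Try every combination in param_grid and return the best (params, score)."""
    names = list(param_grid)
    best = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        # Keep the first combination that achieves the highest score.
        if best is None or score > best[1]:
            best = (params, score)
    return best


def toy_score(params):
    """Toy scorer standing in for training a model and evaluating it."""
    return 1.0 - abs(params["max_depth"] - 5) * 0.1 - abs(params["num_trees"] - 20) * 0.01


grid = {"max_depth": [3, 5, 7], "num_trees": [10, 20]}
params, score = best_params(grid, toy_score)
print(params, score)
```

The cost of this approach grows with the product of the grid sizes, since every combination means another full training run.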

We can then apply each model on our test data and calculate how accurate it is by using a standard evaluation metric.

In this case we will be using the F1 score.

The F1 score is the harmonic mean of precision and recall.

It’s a good general-purpose measure since it takes both false positives and false negatives into account, which is especially useful when we have an uneven class distribution, as is the case here.
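As a small worked example, the F1 score for a single positive class can be computed from confusion-matrix counts. The counts below are invented and are not results from this project; the evaluator used on the cluster may also compute a class-weighted variant rather than this binary form.

```python
def f1_from_counts(tp, fp, fn):
    """Precision, recall and F1 for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a churn classifier.
precision, recall, f1 = f1_from_counts(tp=8, fp=2, fn=4)
print(precision, recall, f1)
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F1 by maximizing precision or recall alone.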

The end result:

The F1 score for the logistic regression model is: 0.825968664977952
The F1 score for the random forest model is: 0.8823529411764707
The F1 score for the gradient-boosted tree is: 0.8298039215686274

By using the same metric, we can easily compare all three models and conclude that on this ‘smaller’ dataset, the most accurate model to predict user churn is the Random Forest classifier.

Training on the full dataset

The next step was to take our functioning code and run it on a cluster on AWS.

We ran into several issues while trying to run the analysis on the full dataset.

While the issues seemed small in nature, it took a very long time to resolve them given how much time it takes the Spark cluster to process the data.

Firstly, the version of Pandas on the cluster was outdated.

We circumvented the issue by simply changing the problematic method from ‘toPandas()’ to ‘show()’.

While the output isn’t as visually appealing, it serves its purpose.

Secondly, Spark was timing out after a certain amount of time which made the results prior to the timeout useless.

The only viable solution I could find was to increase the cluster size in order to shorten the time required to process our code.

This was a surprising problem since Spark is a big data framework, so long processing times are to be expected.

Thirdly, performing a grid search on hyper-parameters proved especially challenging given the timeout issue and just how much time it takes to process the data for a single model.

So for the sake of our own sanity, we used the default parameters for our classifier of choice.

Lastly, the biggest issue happened at the model evaluation step.

Even though the model was trained without any apparent problems, I kept getting the following error message and could not find a solution for it.

Exception in thread cell_monitor-17:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.[...]py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.[...]py", line 864, in run
    self.[...]_args, **self._kwargs)
  File "/opt/conda/lib/python3.[...]py", line 178, in cell_monitor
    job_binned_stages[job_id][stage_id] = all_stages[stage_id]
KeyError: 1256

I was not able to properly gauge the accuracy or find the confusion matrix of the random forest classifier.

Unfortunately, we can only presume that the model would perform even better when trained on the full dataset using a cluster on AWS.

Feature importance

From a business standpoint, model accuracy is not the only thing decision makers are interested in.

They also want to know exactly what pushes people to cancel or unsubscribe from their service.

The following graph shows exactly how big of a role each feature plays in that decision.

+--------------+----------+
|feature       |importance|
+--------------+----------+
|days_active   |0.4959    |
|n_sessions    |0.2146    |
|avg_play_hour |0.1355    |
|avg_sess_songs|0.1009    |
|n_errors      |0.0386    |
|gender_dum    |0.0145    |
+--------------+----------+

It turns out that the number of active days and the number of sessions streaming music are the most important features when predicting user churn.
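To put those numbers in perspective, the reported importances can be ranked and accumulated. The values below are copied from the table; the code itself is only an illustrative sketch. Together, the top two features account for roughly 71% of the total importance.

```python
# Importances copied from the feature importance table.
importances = {
    "days_active": 0.4959,
    "n_sessions": 0.2146,
    "avg_play_hour": 0.1355,
    "avg_sess_songs": 0.1009,
    "n_errors": 0.0386,
    "gender_dum": 0.0145,
}

# Rank features from most to least important and track the running total.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
cumulative = 0.0
for feature, importance in ranked:
    cumulative += importance
    print(f"{feature:<15}{importance:>8.4f}{cumulative:>8.4f}")

top_two_share = round(ranked[0][1] + ranked[1][1], 4)
```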

Conclusion

To recap, we utilized PySpark, or Spark for Python.

With this framework, we built an end-to-end machine learning workflow.

This workflow can identify potential customer churning for a music streaming service by analyzing the interactions of each user with said service.

The data exploration step was fascinating since we could clearly see the behavior of the users from their habits.

Seeing just how long some sessions were, how late they went, and how the genre of music changed during different times of the day was interesting to say the least.

The engineered features focus on a variety of behaviors.

They include interactions with the service and listening habits that focus on time and personality rather than taste in music.

These behaviors also include measures of satisfaction with the service, such as the total number of errors users encounter.

Before training our models, the selected features are scaled to the range [0,1] so that the larger values do not distort the model.
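Scaling to [0,1] is commonly done with min-max scaling, which is assumed here; the article does not name the exact scaler. A minimal plain-Python sketch with invented values, where `min_max_scale` is a hypothetical helper:

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0.0 by convention here.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical days_active values for four users.
days_active = [2.0, 10.0, 50.0, 98.0]
scaled = min_max_scale(days_active)
print(scaled)
```

Without this step, a feature measured in days or session counts would span a far wider range than a 0/1 dummy such as gender.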

In the end, we found our best model in the group to be the random forest classifier.

And we attempted to train it on the full dataset to achieve even better results.

Big data analytics has its own set of unique challenges.

Debugging such a long process proved to be painfully time consuming.

But in the end, it was a very interesting exercise and the results were even more interesting.

Future improvements

Even though some models performed surprisingly well, I would be very curious to see the results of a cross-validation on the most successful models.

Fine tuning the hyper-parameters might yield an optimized version of the model which could be robust enough to be used in production.

But to achieve that result, we will either need a lot more PCs, or a lot more coffee.

