Predicting Kickstarter Campaign Success with Gradient Boosted Decision Trees: A Machine Learning Classification Problem

Riley Predum · Feb 2

I was surfing data.world the other day and I came across a dataset that looked at Kickstarter campaigns that succeeded and failed. I thought, ‘What would it take, i.e. what features are important, in predicting whether or not a Kickstarter campaign succeeds?’ Thus, this project was born.

In this article, I’ll walk you through the more exciting highlights of the project.

For the full code, see the link to the repo at the bottom!

Exploratory Data Analysis to Data Cleaning and Back Again

Jumping right in, I looked at the dataset’s structure to see what I could make of it.

In data science, this stage of the workflow is called exploratory data analysis (EDA).

EDA usually comes after data cleaning, but it’s not necessarily a linear process.

Sometimes, through EDA, you realize you need to clean something differently or more than you had before.

You’ll see what I mean as we proceed.

The dataset contained 20,632 observations of 67 variables, also called features in machine learning problems.

Calling .info() on the dataframe showed me what data type each feature was. From that information I found a few useless variables and promptly removed them.
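In code, that first pass looks roughly like this (the file name and the dropped columns are placeholders, not the actual ones):

import pandas as pd

# Hypothetical file name; the dataset came from data.world
df = pd.read_csv('kickstarter.csv')

# Show each column's data type and non-null count
df.info()

# Drop columns that carry no predictive signal (column names here are placeholders)
df = df.drop(columns=['id', 'urls', 'photo'])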

Oftentimes in data cleaning, we deal with missing values.

If your dataset isn’t too large and multidimensional, you can do an ingenious little trick with the seaborn library (sns) to visualize exactly where they are using sns.heatmap(df.isnull()), where your dataframe is df. By doing that, only the values that are null will show up in a contrasting color to the rest, like so.
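In code, that one-liner is simply (assuming your dataframe is named df):

import seaborn as sns
import matplotlib.pyplot as plt

# Missing values show up in a contrasting color; everything else is one solid block
sns.heatmap(df.isnull(), cbar=False)
plt.show()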

As we can see, the profile variable is the only remaining variable that contains missing values.

I next looked at that variable in more detail and saw that it was mostly metadata, with each element already having its own variable elsewhere in the data, so I decided to remove it.

It’s essential in a classification problem to visualize your target variable (the one you are trying to predict using machine learning algorithms).

For this problem, 0 = campaign failed and 1 = campaign succeeded.
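A quick way to plot that split, assuming the target column is named state (the real column name may differ):

import seaborn as sns
import matplotlib.pyplot as plt

# 'state' is an assumed name for the 0/1 target column; adjust to your dataframe
sns.countplot(x='state', data=df)
plt.title('Failed (0) vs. successful (1) campaigns')
plt.show()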

We can see that it is roughly a 60–40 split, with more failures than successes.

This intuitively makes sense.

Throughout your workflow, it’s important to always ask yourself if something intuitively makes sense.

Classification models are trained on a subset of the data and then tested on held-out data whose target labels the model has not seen.

By measuring how many of the predictions were correctly labeled, we can get a sense of the model’s performance.
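A minimal sketch of that split with sklearn’s train_test_split, again assuming a target column named state:

from sklearn.model_selection import train_test_split

# 'state' is an assumed target column name
X = df.drop(columns=['state'])
y = df['state']

# Hold out 25% of the rows; their labels are only used to score predictions, never for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)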

It’s important to know the distribution of your target variable in a classification problem: this sample is imbalanced, and that bias carries over into what the model learns.

Looking at the plot above, there are fewer successful campaigns to train on, so take predictions with a grain of salt (and then some more salt, as you’ll later see).

The Outliers

Dealing with outliers is crucial too.

Not only will they throw off your plots’ scaling, but they will also muddy your data and thus your predictive model’s performance.

If your model is taking into account values that are, say, 5 standard deviations away from the mean (assuming a normal distribution), those are hugely unlikely values for a new observation to take on.

Your model will be skewed/influenced by a data point that is not representative of the likely population!

While few of my variables were normally distributed, I still felt it necessary to take the interquartile range and remove outliers that fell 3 standard deviations below or above the mean.
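A minimal sketch of an IQR-based filter on a single column (the exact cutoff used for this project may differ):

# Filter one column, e.g. backers_count, using the interquartile range
q1 = df['backers_count'].quantile(0.25)
q3 = df['backers_count'].quantile(0.75)
iqr = q3 - q1

# Keep rows within 1.5 * IQR of the quartiles (a common convention)
df = df[df['backers_count'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]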

Here’s how that operation altered the backers_count variable (the number of backers a campaign had).

Initial backers_count

In the box plot above, you can’t even see the box.

The line at the bottom is the edge of the 4th quartile (the top 25% of the data).

So everything after that is extremely extreme.

The dots are individual outliers.

Not only are there many, but one single campaign had 100,000 backers, whereas the mean was something like 12.

That’s just too much of an outlier.

So I applied the transformation and got something far more reasonable:

There is still some right-skewness, as the majority of the values lie in Q3 and Q4.

But this is a much more reasonable box plot! I did the same for others like the goal variable and the pledged variable, as well as the create_to_launch_days variable, which is the number of days between the date the campaign was created and the date it launched.

Correlation, Feature Engineering, and Feature Selection

I next needed to find correlation.

How closely associated is each of these variables with the target variable? This gives us our first indication of the explanatory power/influence that each variable may have on the target.

This is important for two reasons. First, we will know more about which variables provide the most explanatory horsepower, so to speak, to the model. Second, we will see if any variables correlate with the target in nearly identical ways, a hint that they are collinear with each other (multicollinearity if there are more than two).
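A quick way to rank features by their correlation with the target, again assuming the target column is named state:

# Pearson correlation of every numeric feature with the target ('state' is an assumed name)
correlations = df.corr(numeric_only=True)['state'].sort_values(ascending=False)
print(correlations)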

A notable finding from this step of the project was realizing that if a campaign was in the spotlight it had a perfect positive correlation of 1.0 with campaign success.

This is where we need our intuition/thinking caps/serious detective faces.

Nothing is perfect in statistics…at least I’m 95% sure of it.

My suspicion is that this dataset was collected only on campaigns that were in the spotlight.

Maybe Kickstarter wanted to share data on their favorite campaigns, and the only ones that succeeded happened to be in the spotlight.

So we cannot assume that a campaign is guaranteed to succeed just because it’s in the spotlight, though I bet it helps gain exposure!

Also, not all relationships between variables are linear, so one technique to deal with this and check correlation for non-linear relationships is to take the log and square root of the variables and check correlations again.

Doing this, I saw that the log of the campaign goal was significantly more correlated with the target than the original value. The log of the pledged amount received a boost as well.
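A sketch of that transformation step, using log1p to sidestep zero values (the exact transform used here may differ slightly):

import numpy as np

# Log and square-root versions of skewed variables, then recheck their correlation with the target
df['log_goal'] = np.log1p(df['goal'])
df['log_pledged'] = np.log1p(df['pledged'])
df['sqrt_goal'] = np.sqrt(df['goal'])

# 'state' is an assumed target column name
print(df[['log_goal', 'log_pledged', 'sqrt_goal', 'state']].corr(numeric_only=True)['state'])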

Runner-up variables that made the cut into the final feature space include:

launch_to_deadline_days — how many days between launch and the deadline
staff_pick — whether or not Kickstarter staff picked the campaign

Fitting the models, evaluating performance, choosing a final model, and predicting on a new (totally real) campaign

Another common thing in the data science workflow is trying out multiple models.

There are ways to minimize the effort at this stage based on what you want to accomplish and on what kind of dataset and problem you have (you wouldn’t try regression models, for example, since this is a classification problem).

We know this is a classification problem (did the campaign succeed or fail?) because the outcome or target variable takes on binary, discrete values (0 or 1 and nothing in between).

On to the models I tried.

I used the awesome machine learning library sklearn and defined a function that took in a model, fit that model on the training features and target variable, and returned the classification report and confusion matrix for that model.
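Here is a minimal sketch of what such a helper might look like, assuming the train/test splits are named X_train, X_test, y_train, and y_test (not the exact code from the repo):

from sklearn.metrics import classification_report, confusion_matrix

def evaluate(model):
    # Fit on the training set, then score predictions on the held-out test set
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    return y_pred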

Feel free to check out those links if you don’t know what those two evaluation methods are, but I will explain them in brief below too.

Classification Report for K-Nearest-Neighbors Classifier (KNN) on the Kickstarter campaign dataset

Let’s take a look at the results of the KNN algorithm above.

On the left you have the classes: 0 and 1.

Along the top you have precision, recall, f1-score, and support.

Precision basically means: of all observations classified as positive, what percent were correct? Recall is: of all observations that were actually positive, what percent were classified correctly?

F₁ score is the harmonic mean of precision and recall. It should be used to compare models, rather than global accuracy on the problem in question.

Finally, support is the number of instances of each class in the evaluated dataset.

We knew from the EDA that there were more failures than successes, and we can see that here.

We also have the confusion matrix, which utilizes the same concepts as above, but shows individual instances as opposed to percentages.

Let’s take a look at the confusion matrix for the KNN classifier below.

Confusion Matrix for KNN Classifier

OK, so left to right, top to bottom we have (a sketch after this list shows how to pull these counts out in code):

The number of true negatives: the number of times the classifier correctly guessed that a campaign failed.

The number of false positives: the number of times the classifier wrongly guessed a campaign’s success when it was actually a failed campaign.

The number of false negatives: the number of times the classifier wrongly guessed a campaign was a failure when it was actually a success.

The number of true positives: the number of times the classifier correctly guessed that a campaign succeeded.
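As promised above, a small sketch that unpacks those four counts from sklearn’s confusion matrix, assuming y_test and y_pred from the helper earlier:

from sklearn.metrics import confusion_matrix

# sklearn lays the 2x2 matrix out as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"True negatives: {tn}, False positives: {fp}")
print(f"False negatives: {fn}, True positives: {tp}")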

As you can see, both the classification report and the confusion matrix are super useful evaluation tools for classification problems.

The classification report gives you more information, including the number of observations of each class.

But I also like the simplicity of the confusion matrix.

It’s informative to check them both out, and each is just one line of code.

As the title suggests, I went with the Gradient Boosted Decision Tree classifier, which actually had 100% accuracy. That’s concerning, because again, perfection is something to be wary of in statistics and machine learning.
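The exact model setup isn’t shown here, but a sketch with sklearn’s GradientBoostingClassifier and the helper from earlier would look something like this:

from sklearn.ensemble import GradientBoostingClassifier

# Default-ish hyperparameters shown; the settings used for this project may differ
gradient_boosted = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
evaluate(gradient_boosted)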

Predicting a (totally real) campaign’s success

OK, it’s not real. I thought up the numbers, but here it is:

gradient_boosted.predict_proba([[45, 0, 6, 1, 15000, 9.62, 6.91]])

What was the result when this was called? Well, first let me explain what each number is.

Each number corresponds to one of our features, so read the next paragraph carefully to understand what each of them is.

I will go in order from the first one, starting at [0] for Pythonic reasons.

A campaign with 45 days until the deadline [0], that was not staff picked [1], backed by 6 backers [2], is in the spotlight [3], has a goal of 15000 USD [4], which is a log_goal of 9.62 [5], and has a log_pledged amount of 6.91 (equal to 1000 USD) [6], has a 99.9% chance of succeeding!

Conclusion

I hope you enjoyed reading! Want to see the source code? Want to help me figure out why my Gradient Boosted classifier achieved 100%? Check out the repo!

Happy coding,
Riley
