Machine learning classification: the success of Kickstarter tech projects

If you max out your project duration by setting a deadline 60 days from your launch date, are you more likely to hit your target?

To find out, I had to split the “launched” variable from its date-time format into two separate variables: “launch_date” and “launch_time”.

From here, I could simply subtract “launch_date” from “deadline” and create a new variable that tracks project duration.

    # split "launched" variable into date and time variables
    library(tidyr)
    tech_projects <- separate(data = tech_projects, col = launched,
                              into = c("launch_date", "launch_time"), sep = " ")

    # subtract from one another
    tech_projects$date_diff <- as.Date(as.character(tech_projects$deadline), format = "%Y-%m-%d") -
      as.Date(as.character(tech_projects$launch_date), format = "%Y-%m-%d")

    # group by sub-category before plotting
    grouped_data_tech <- tech_projects %>%
      group_by(category) %>%
      summarise(
        target = mean(date_diff),
        success_rate = mean(success)
      )

Using ggplot2, the scatter diagram below could be plotted:

[Figure: Relationship between campaign duration & success rate]

Running a quick linear regression model on these data revealed that, unsurprisingly considering the plot, there’s no relationship between success and project duration.
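For anyone who wants to reproduce the regression check, here is a rough sketch of how it might look. The data frame below is a made-up stand-in for grouped_data_tech (the numbers are invented for illustration and are not taken from the Kickstarter data), and the formula success_rate ~ target is my assumption about what "a quick linear regression" means here.

```r
# Toy stand-in for grouped_data_tech: one row per sub-category.
# These values are invented for illustration only.
toy_grouped <- data.frame(
  target       = c(31.2, 33.5, 34.1, 35.0, 36.4, 37.8),  # mean duration in days
  success_rate = c(0.21, 0.18, 0.25, 0.17, 0.23, 0.20)   # share of successful projects
)

# Regress success rate on mean campaign duration
duration_model <- lm(success_rate ~ target, data = toy_grouped)

# The slope estimate and its p-value indicate whether duration
# explains any of the variation in success rate
slope_row <- summary(duration_model)$coefficients["target", ]
print(slope_row)
```

A slope close to zero with a large p-value would be consistent with the flat-looking scatter plot. On the real data, it may be safer to wrap the duration in as.numeric() first, since subtracting two dates produces a difftime rather than a plain number.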

Use classification models to predict project success

The insights obtained so far are interesting, but I wanted to go further and build a couple of classification models that look at a project and determine whether or not the project is likely to be successful.

I’m going to compare the output of a support-vector machine (SVM) model with that of a logistic regression model.

Before starting, I need to reconstruct my categorical “category” variable into multiple dummy variables that the models can understand.

    # select() comes from dplyr; dummy.data.frame() comes from the dummies package
    library(dplyr)
    library(dummies)

    # get the category variable
    tech_projects_dummy <- select(tech_projects, category)

    # split into multiple dummy variables
    tech_projects_dummy <- dummy.data.frame(tech_projects_dummy, sep = ".")

    # reattach variables of interest to tech_projects_dummy
    tech_projects_dummy$goal <- tech_projects$usd_goal_real
    tech_projects_dummy$success <- tech_projects$success

    # encode dependent variable as a factor
    tech_projects_dummy$success <- factor(tech_projects_dummy$success, levels = c(0, 1))

With the above code, we have a data frame containing one dummy variable for each sub-category of tech project, the fundraising target of each project, and the binary dependent variable, “success”.

The last section simply ensures our dependent variable is encoded correctly as a factor.
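If the dummy-encoding step feels opaque, the same idea can be sketched in base R with model.matrix(). This is a swapped-in technique for illustration and not the function used above, and the toy data frame and its category names are made up.

```r
# Toy data with a categorical column; the category names are examples only
toy_projects <- data.frame(
  category = c("Apps", "Hardware", "Web", "Apps", "Robots"),
  success  = c(0, 1, 0, 0, 1)
)

# "~ category - 1" drops the intercept so every level gets its own 0/1 column
toy_dummies <- as.data.frame(
  model.matrix(~ category - 1, data = toy_projects)
)

# Reattach the outcome and encode it as a factor with explicit levels
toy_dummies$success <- factor(toy_projects$success, levels = c(0, 1))

str(toy_dummies)
```

Each row has exactly one dummy column set to 1. One thing to keep in mind: when every level keeps its own dummy and the model also has an intercept, one column is redundant, so glm should report an NA coefficient for it.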

The next step is to split the data frame in two: one set for training the models and another for testing their results.

This can easily be done in a few lines of code:

    # splitting dataset into training and test set
    library(caTools)
    set.seed(123)
    split <- sample.split(tech_projects_dummy$success, SplitRatio = 0.75)
    training_set <- subset(tech_projects_dummy, split == TRUE)
    test_set <- subset(tech_projects_dummy, split == FALSE)

As with any project like this, we need to apply feature scaling to our data so that each variable carries equal weight and doesn’t distort the model output.

    # feature scaling everything but dependent variable
    training_set[-18] = scale(training_set[-18])
    test_set[-18] = scale(test_set[-18])

From here, all that’s left to do is to apply our classifiers to our training sets and let them do their work!

    # fitting logistic regression to the training set
    l_classifier = glm(formula = success ~ ., family = binomial, data = training_set)

    # fitting SVM to the training set
    library(e1071)
    svm_classifier = svm(formula = success ~ ., data = training_set,
                         type = 'C-classification', kernel = 'linear')

    # predicting the logistic regression test set results
    l_pred = predict(l_classifier, type = 'response', newdata = test_set[-18])

    # create binary variable from the logistic regression predictions
    l_pred = ifelse(l_pred > 0.5, 1, 0)

    # predicting the SVM test set results
    svm_pred = predict(svm_classifier, newdata = test_set[-18])

Now that we have our predictions l_pred and svm_pred we can build a confusion matrix which essentially compares our predictions with the actual results from the success variable in our test set data.

    # logistic regression confusion matrix
    l_cm = table(test_set[,18], l_pred)

    # SVM confusion matrix
    svm_cm = table(test_set[,18], svm_pred)

Of the 6,762 projects in our test set, the logistic regression model correctly predicted 5,166 (76.4% accuracy) and the SVM model 5,131 (75.9% accuracy).
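To show where an accuracy figure like this comes from, here is a small sketch that goes from predicted probabilities to a confusion matrix to an accuracy score. The eight outcomes and probabilities are made up, and the variable names are only for illustration.

```r
# Made-up outcomes and predicted probabilities for eight projects
actual     <- factor(c(0, 0, 1, 1, 0, 1, 0, 0), levels = c(0, 1))
pred_probs <- c(0.10, 0.62, 0.81, 0.35, 0.20, 0.55, 0.05, 0.48)

# Same 0.5 cut-off as used for the logistic regression predictions
pred_class <- factor(ifelse(pred_probs > 0.5, 1, 0), levels = c(0, 1))

# Rows are actual outcomes, columns are predictions
toy_cm <- table(actual, pred_class)

# Accuracy = correct predictions (the diagonal) / all predictions
toy_accuracy <- sum(diag(toy_cm)) / sum(toy_cm)

print(toy_cm)
print(toy_accuracy)
```

Here six of the eight toy predictions are correct, giving an accuracy of 0.75. The same arithmetic applied to the figures above gives 5,166 / 6,762 ≈ 0.764 for logistic regression and 5,131 / 6,762 ≈ 0.759 for the SVM.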

The logistic regression model narrowly comes out on top, but there is clearly still more fine-tuning to do here in order to build a more reliable indicator of tech project success on Kickstarter.
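One refinement that might be worth trying (a suggestion on my part, not something tested in this post): the scaling step earlier standardises the test set with its own means and standard deviations, whereas a common convention is to reuse the training set's centre and spread so that both sets end up on the same scale. A minimal sketch on made-up numbers, with hypothetical column names:

```r
# Made-up numeric features for a training and a test set
toy_train <- data.frame(goal   = c(1000, 5000, 20000, 50000),
                        is_app = c(1, 0, 0, 1))
toy_test  <- data.frame(goal   = c(2500, 30000),
                        is_app = c(0, 1))

# Scale the training set and keep the centre and spread it used
train_scaled <- scale(toy_train)
train_center <- attr(train_scaled, "scaled:center")
train_spread <- attr(train_scaled, "scaled:scale")

# Apply the training set's parameters to the test set
test_scaled <- scale(toy_test, center = train_center, scale = train_spread)

print(test_scaled)
```

Whether this changes the accuracy figures above is something I haven't checked; it simply keeps the test data out of the preprocessing.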

Thanks so much for reading! I’m no expert, so I welcome all feedback and constructive criticism in the comments.
