Predicting Titanic Survivors (A Kaggle Competition)

We’ll find out! Let’s get started!

1.0 Importing the Data

The first step in the process is always to load in the data as well as the necessary packages.

In R, the programming language I am using, packages are collections of functions that allow users to perform specialized tasks.

There are packages for creating beautiful plots, building stock portfolios and pretty much anything else you can imagine.

Here, I loaded a number of packages that allow me to utilize a handful of powerful machine learning (ML) models.

# splitstackshape added to the list here: it provides stratified(), used in section 4.0
lapply(c("caret", "h2o", "pROC", "randomForest", "readr", "splitstackshape", "tidyverse", "xgboost"), library, character.only = TRUE)

h2o.init()
h2o.no_progress()
set.seed(123)

train <- read_csv("~/Downloads/train.csv")
test <- read_csv("~/Downloads/test.csv")

The competition dataset came in two files: train and test.

As you may be able to guess, the former is used to train the ML models, while the latter is used to make the predictions that are ultimately submitted.

2.0 Exploring the Data

After getting my data into R, it’s time to explore the shape of the data.

train %>% glimpse()

## Observations: 891
## Variables: 12
## $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1…
## $ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male"…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39…
## $ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0…
## $ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51…
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S…

test %>% glimpse()

## Observations: 418
## Variables: 11
## $ PassengerId <dbl> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 90…
## $ Pclass      <dbl> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3…
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs…
## $ Sex         <chr> "male", "female", "male", "male", "female", "male", …
## $ Age         <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0…
## $ SibSp       <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0…
## $ Parch       <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298", "…
## $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6…
## $ Cabin       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B45…
## $ Embarked    <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S…

What the output above shows is that the training data has 891 rows with 12 different variables.

These variables include the passenger’s name, sex and age, how much they paid for their ticket and where they boarded, among other things.

The most important variable here is the one named “Survived,” a list of 1’s and 0’s that indicate whether a passenger lived or died, respectively.
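If you’re curious how those 1’s and 0’s balance out before any modeling, a quick tabulation shows the split (a small check of my own, not from the competition code):

table(train$Survived)

# should print 549 zeroes and 342 ones, i.e. roughly 38% of the
# passengers in the training file survived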

The testing data features 418 rows and lacks the “Survived” variable as that is what the competition asks us to predict.

While there is plenty of information contained within the data as it exists now, not all of it is usable in its current form.

To extract as much usable information as possible, I will have to transform some of these variables.

3.0 Transforming the Data

The first variable I will look at is “Name.” As far as I know, the iceberg that sank the Titanic didn’t have a personal vendetta against any of the passengers, so simply using the full names of the passengers won’t provide any useful information. What may, however, is the passengers’ titles. Titles like “Mr.,” “Mrs.” or “Countess” might help us determine whether passengers’ social status (i.e., were they commoners or nobility?) had any impact on their survival. To get these titles, I have to extract them from “Name,” which is what the code below does.

titles <- c(unique(str_extract(str_extract(train$Name, ",\\s[A-Za-z]+"), "[A-Za-z]+")))
titles <- replace(titles, titles == "the", "Countess")
titles

##  [1] "Mr"       "Mrs"      "Miss"     "Master"   "Don"      "Rev"
##  [7] "Dr"       "Mme"      "Ms"       "Major"    "Lady"     "Sir"
## [13] "Mlle"     "Col"      "Capt"     "Countess" "Jonkheer"

If you were wondering, “Jonkheer” was an honorific used by Dutch nobility. There was one Jonkheer on board the Titanic, Johan George Reuchlin, and, spoiler alert, he died. ¯\_(ツ)_/¯

In addition to using this list of titles to create a new variable, I am also going to extract the deck from “Cabin” and create a variable named “Family_Size” that is simply the sum of “SibSp,” a count of siblings and spouses on board, and “Parch,” a count of parents and children on board.

I also will clean up a few of the other variables to make them easier for the ML models to understand.

train <- train %>%
  mutate(Survived = factor(Survived),
         Sex = factor(recode(Sex, male = 1, female = 0)),
         Pclass = factor(Pclass),
         Embarked = factor(Embarked),
         Deck = factor(replace_na(substr(train$Cabin, 1, 1), "Unknown")),
         Title = factor(str_extract(train$Name, paste(titles, collapse = "|"))),
         Family_Size = SibSp + Parch) %>%
  select(-c(Cabin, Name, Ticket))

test <- test %>%
  mutate(Sex = factor(recode(Sex, male = 1, female = 0)),
         Pclass = factor(Pclass),
         Embarked = factor(Embarked),
         Deck = factor(replace_na(substr(test$Cabin, 1, 1), "Unknown")),
         Title = factor(str_extract(test$Name, paste(titles, collapse = "|"))),
         Family_Size = SibSp + Parch,
         # the test file has one missing fare; fill it with the mean fare
         Fare = ifelse(is.na(Fare), mean(Fare, na.rm = TRUE), Fare)) %>%
  select(-c(Cabin, Name, Ticket))

3.1 Predicting Passenger Ages

One thing that I didn’t point out earlier when examining the data was just how many passengers’ ages were not recorded.

Of the 1309 passengers whose data we have, 266 had no age.

That missing information will be problematic later, so I feel it’s necessary to replace those empty values with a guess.
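You can verify the extent of the problem with a quick count of the NA values (my own snippet, not part of the original analysis):

sum(is.na(train$Age))  # missing ages in the training file
sum(is.na(test$Age))   # missing ages in the testing file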

To keep things brief, what the below code does is combine the training and testing data, pull out the records that have ages and fit a Random Forest (RF) model that determines the relationship between a passenger’s age and the other variables.

Finally, it will fill each of the missing ages with the model’s best guess.

Side note: to go into a description of what an RF model is would completely derail this case study.

If you’re interested in learning more about RF models and how they work, check out this website.

# Combining the training and test data and selecting only the rows with ages
age <- train %>%
  select(-Survived) %>%
  rbind(test) %>%
  filter(!is.na(Age)) %>%
  filter(!is.na(Embarked))

# Building a prediction model
age_rf_model <- randomForest(Age ~ . - PassengerId, age, ntree = 5000, mtry = 9, na.action = na.exclude)

# Determining the accuracy of the model
age %>%
  select(Age) %>%
  add_column(Pred = predict(age_rf_model, age)) %>%
  na.omit() %>%
  mutate(Error = Age - Pred, Error_Pct = Error / Age) %>%
  summarize(RMSE = sqrt(mean(Error^2)), MAPE = mean(abs(Error_Pct)))

## # A tibble: 1 x 2
##    RMSE  MAPE
##   <dbl> <dbl>
## 1  7.30 0.302

# Using the model to predict passenger age
train <- train %>%
  mutate(Age = ifelse(is.na(Age), round(predict(age_rf_model, train)), Age))

# temporarily prepending a training row so the test factor levels match the model's
test <- rbind(train[1, c(1, 3:12)], test)
test <- test[-1, ]
test <- test %>%
  mutate(Age = ifelse(is.na(Age), round(predict(age_rf_model, test)), Age))

To check how accurate the RF model’s predictions are, I calculated the Root Mean Squared Error (RMSE) and Mean Absolute Percent Error (MAPE) of its predictions.

See my earlier article for a brief description of these two metrics.
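For a quick refresher, the two metrics can also be written as one-line R functions (my own sketch; the calculation above does the same thing with dplyr):

# Root Mean Squared Error: penalizes large misses more heavily
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

# Mean Absolute Percent Error: the average miss as a share of the true value
# (only meaningful when no true value is 0, which holds for passenger ages)
mape <- function(actual, pred) mean(abs((actual - pred) / actual))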

Long story short: the errors are manageable. The RMSE says the typical prediction misses by about 7.3 years, and the MAPE says the average prediction is off by roughly 30% of the true age (0.302). While not perfect, I feel that’s acceptable for my ultimate goal of predicting survival.

4.0 Training the Models

Now that the data is clean, it is time for me to start training the ML models.

Since I want to make sure that my models perform well on unseen data, I am going to divide my training data up into a smaller set of training and test data.

This way, I can assess the accuracy of the models before taking them to the actual, Kaggle-provided testing data (which, remember, I can’t score myself because it lacks the “Survived” variable).

train_1 <- stratified(train, c("Survived", "Deck", "Title"), size = 0.7, replace = FALSE)
train_2 <- setdiff(train, train_1)

I’m going to use four different models, each with its own way of making predictions: a linear model, a Random Forest model, an XGBoost (eXtreme Gradient Boosting) model and H2O’s AutoML.

Again, feel free to click the hyperlinks for a description of what these models are and what they’re doing.

To make what’s going on below easier to understand, imagine that instead of predicting Titanic survivors, we wanted to win a Mixed Martial Arts tournament.

We only have enough time to master one martial art before the tournament begins, so we need to figure out which we should study to have the best chance of winning.

We know who the competitors will be (i.e., our testing data), but we’re not sure which martial art will serve us best. So, in the code below, we run simulations in which we learn four different martial arts (let’s say boxing, jujutsu, muay thai and tae kwon do) and see how we might fare against competitors similar to those we’ll face in the tournament (i.e., our training data).

# Linear Model ----
glm_model <- glm(Survived ~ . - PassengerId, family = "binomial", train_1)

# Random Forest ----
rf_model <- randomForest(Survived ~ . - PassengerId, train_1, ntree = 10000, mtry = 3, na.action = na.exclude)

# XGBoost ----
dummy_1 <- dummyVars(Survived ~ ., train_1[, 2:12])
train_1_dummy <- predict(dummy_1, train_1)
dummy_2 <- dummyVars(Survived ~ ., train_2[, 2:12])
train_2_dummy <- predict(dummy_2, train_2)

dtrain <- xgb.DMatrix(data = train_1_dummy, label = as.vector(train_1$Survived))
dtest <- xgb.DMatrix(data = train_2_dummy, label = as.vector(train_2$Survived))
watchlist <- list(train = dtrain, test = dtest)

xgb_model <- xgb.train(
  data = dtrain,
  watchlist = watchlist,
  booster = "gbtree",
  max.depth = 3,
  nthread = 2,
  nrounds = 5000,
  objective = "binary:logistic",
  early_stopping_rounds = 500,
  print_every_n = 500
)

# H2O ----
train_1_h2o <- train_1 %>%
  select(-PassengerId) %>%
  mutate(Pclass = factor(Pclass, ordered = FALSE)) %>%
  as.h2o()

train_2_h2o <- train_2 %>%
  select(-PassengerId) %>%
  mutate(Pclass = factor(Pclass, ordered = FALSE)) %>%
  as.h2o()

y <- "Survived"
x <- setdiff(colnames(train_1_h2o), y)

split <- h2o.splitFrame(train_1_h2o, ratios = c(.70, .15))
t1 <- split[[1]]
t2 <- split[[2]]
t3 <- split[[3]]

h2o_model <- h2o.automl(
  x = x,
  y = y,
  train = t1,
  validation_frame = t2,
  leaderboard_frame = t3,
  nfolds = 5,
  stopping_metric = "AUC",
  max_runtime_secs = 120
)

h2o_leader <- h2o_model@leader

5.0 Comparing the Models

To continue the metaphor from above: no one martial art is going to be best against every competitor, so we’re going to try to find the one that performs best.

For martial arts, the metric you’d measure that with would probably be the number of wins.

For Titanic predictions, I’m going to measure that with accuracy (mostly because that’s what Kaggle uses to score this competition).

To determine that accuracy, I will generate what is called a confusion matrix.

Simply put, this is a 2×2 box that shows the actual values (called “Reference” in the output) along one axis and the predicted values along the other.

This allows you to see four quantities:

· True Positives: Prediction = 1, Actual = 1
· True Negatives: Prediction = 0, Actual = 0
· False Positives: Prediction = 1, Actual = 0
· False Negatives: Prediction = 0, Actual = 1

Accuracy is a measure of how many True Positives and True Negatives there are out of all the predictions.
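In formula form, Accuracy = (TP + TN) / (TP + TN + FP + FN). As a worked check against the GLM matrix in the output below: (141 + 75) / (141 + 21 + 23 + 75) = 216 / 260 ≈ 0.831, which matches the reported accuracy.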

compare_set <- train_2 %>%
  add_column(GLM_Pred = predict(glm_model, train_2, type = "response")) %>%
  add_column(RF_Pred = predict(rf_model, train_2)) %>%
  add_column(XGB_Pred = predict(xgb_model, train_2_dummy)) %>%
  add_column(H2O_Pred = h2o.predict(h2o_leader, newdata = train_2_h2o) %>%
               as_tibble() %>%
               pull(predict)) %>%
  mutate_at(vars(GLM_Pred, XGB_Pred), list(~ factor(as.numeric(. > 0.5))))

for (i in 13:16) {
  conmat <- confusionMatrix(compare_set$Survived, compare_set[[i]], positive = "1")
  print(colnames(compare_set[, i]))
  print(conmat$table)
  print(conmat$overall[1])
}

## [1] "GLM_Pred"
##           Reference
## Prediction   0   1
##          0 141  21
##          1  23  75
##  Accuracy
## 0.8307692
## [1] "RF_Pred"
##           Reference
## Prediction   0   1
##          0 149  13
##          1  26  72
## Accuracy
##     0.85
## [1] "XGB_Pred"
##           Reference
## Prediction   0   1
##          0 147  15
##          1  20  79
##  Accuracy
## 0.8659004
## [1] "H2O_Pred"
##           Reference
## Prediction   0   1
##          0 151  11
##          1  38  61
##  Accuracy
## 0.8122605

As you can see, in terms of pure accuracy, the XGBoost model performs the best, correctly predicting the fate of 86.6% of the passengers in the held-out training split.

However, accuracy isn’t always the best measure.

If you look at the confusion matrix for XGBoost, you will see that there were 15 False Negatives.

The RF model, while not performing quite as well in terms of accuracy, had only 13 False Negatives.

Why this may be important depends on the situation.

Imagine you are a doctor tasked with determining whether a patient has a particular disease.

Assume that the treatment is harmless if given to someone who doesn’t have the disease, but without the treatment, people with the disease are guaranteed to die.

Given the numbers above, the RF model would have saved two more lives than the XGBoost model.

The takeaway here is that one should never simply look at the accuracy and make the final judgment based on that.
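Sticking with the disease example, one more informative measure here is sensitivity: the share of actual positives that a model catches. Computing it from the confusion matrices above (my own back-of-the-envelope check, not part of the original workflow):

# sensitivity = TP / (TP + FN): the share of actual survivors each model caught
79 / (79 + 15)  # XGBoost: ~0.840
72 / (72 + 13)  # Random Forest: ~0.847

By this measure, the RF model edges out XGBoost, mirroring the two-lives argument above.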

And so I won’t here! Instead, I will make predictions using both the RF and XGBoost models.

And since a third submission to the competition costs me nothing, I’ll also make a prediction with the linear model because its accuracy isn’t far behind the other two.

6.0 Making Final Predictions

Now that I have my trained models (or fighters, if you prefer that metaphor), it’s time to put them to work.

I’m going to use them on the testing data, data that the models have never seen before.

# XGBoost
dummy_test <- dummyVars(PassengerId ~ ., test)
test_dummy <- predict(dummy_test, test)

submission_xgboost <- test %>%
  add_column(Survived = predict(xgb_model, test_dummy)) %>%
  mutate(Survived = as.numeric(Survived > 0.5)) %>%
  select(PassengerId, Survived)

# Random Forest
submission_rf <- test %>%
  add_column(Survived = predict(rf_model, test)) %>%
  select(PassengerId, Survived)

# Linear Model
# note: without type = "response", predict() returns log-odds here, so the
# 0.5 cutoff is applied on the link scale rather than to probabilities
submission_glm <- test %>%
  add_column(Survived = predict(glm_model, test)) %>%
  mutate(Survived = as.numeric(Survived > 0.5)) %>%
  select(PassengerId, Survived)

Let’s take a look at how each model predicted the survival of the first 10 passengers in the testing data.

submission_xgboost %>%
  left_join(submission_rf, by = "PassengerId") %>%
  left_join(submission_glm, by = "PassengerId") %>%
  rename(XGBoost = Survived.x, RF = Survived.y, Linear = Survived) %>%
  head(10)

## # A tibble: 10 x 4
##    PassengerId XGBoost RF    Linear
##          <dbl>   <dbl> <fct>  <dbl>
##  1         892       0 0          0
##  2         893       0 0          0
##  3         894       0 0          0
##  4         895       0 0          0
##  5         896       0 0          1
##  6         897       0 0          0
##  7         898       0 0          1
##  8         899       0 0          0
##  9         900       1 1          1
## 10         901       0 0          0

As you can see, all three models predicted that Passenger 900 survived.

The linear model also predicted that Passengers 896 and 898 survived.
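If you’re curious how often the three submissions agree across all 418 passengers, a quick join answers that (a snippet of my own; the exact agreement rate will depend on the trained models):

submission_xgboost %>%
  rename(XGB = Survived) %>%
  left_join(rename(submission_rf, RF = Survived), by = "PassengerId") %>%
  left_join(rename(submission_glm, GLM = Survived), by = "PassengerId") %>%
  mutate(RF = as.numeric(as.character(RF))) %>%  # the RF predictions are a factor
  summarize(agreement = mean(XGB == RF & RF == GLM))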

7.0 Submitting Predictions

Now that I have my predictions, it’s time to submit them to Kaggle and see how they did.

First, I have to export these predictions to a CSV file so that I can upload them.

write_csv(submission_xgboost, "~/Downloads/Submission XGBoost.csv")
write_csv(submission_rf, "~/Downloads/Submission RF.csv")
write_csv(submission_glm, "~/Downloads/Submission GLM.csv")

After uploading the CSVs, Kaggle generates a final score for each submission. So, let’s see how I did.

Wow! Look at that dark horse victory! Totally unexpected! Despite placing third of the three models on the training data, the linear model actually performed the best of all of the models on the testing data.

I honestly did not see that coming.

It just goes to show that you can do all of the training in the world and sometimes the win simply comes down to luck.

To be objective, a score of 78.9% isn’t all that impressive considering there are other submissions that got a perfect score. But given that this was my first competition and I came in 3149th out of 11098 competitors (better than 71.6% of other participants), I feel that this was a satisfactory effort.

Thank you for reading along.

I hope to see you in the next case study.
