Predicting Who’s Going to Survive on the Titanic Dataset

The purpose of this article is to whet your appetite for Data Science with the famous Kaggle challenge for beginners: “Titanic: Machine Learning from Disaster.”

In this article, you are going to embark on your first Exploratory Data Analysis (EDA) and Machine Learning project to predict the survival of Titanic passengers.

This is the genesis challenge for most aspiring data scientists and will set you up for success.

I hope this article inspires you.

All aboard!!!

Technical Prerequisites

We are going to use Jupyter Notebook with several data science Python libraries.

If you haven’t already, please install Anaconda on your Windows or Mac machine.

Alternatively, you can follow my Notebook and enjoy this guide!

The Steps

If you are a pure beginner, don’t worry: I will guide you through the article for an end-to-end data analysis.

Here are a few milestones to explore for today:

Figure: What we are going to learn

Whew, that is a long way until we reach our first destination.

Let’s get sailing, shall we?

Import

Figure: Importing Library. Source: Unsplash

Import the Libraries

A Python library is a collection of functions and methods that allows you to perform many actions without writing the code yourself (Quora).

This will help you run several data science functions with no hassle.

If you do not have a library, you can run pip install <library name> in your command prompt to install it.

Here is the list of libraries we are using:

Numpy: Multidimensional Array and Matrix Representation Library

Pandas: Python Data Analysis Library for Data Frame, CSV File I/O

Matplotlib: Data Visualization Library

Seaborn: Data Visualization Library built on top of Matplotlib. This gives you cleaner visualizations and an easier interface to call.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

Import the Data

We can use pandas to read train.csv from the Titanic Data Page.

This code will create a DataFrame object, a two-dimensional table that makes the data exploration process easier.

Think of it as an Excel sheet in Python, with rows and columns.

    maindf = pd.read_csv('dataset/Titanic/train.csv')

Explore the data

head() or tail()

We will start by viewing the first few rows of the data.

This is to take a sneak peek at the data and make sure you extract the data properly.

    maindf.head()

Figure: Data Frame Head

describe()

The describe() method gives you a statistical description of the numerical columns.

If you set the include parameter to object, it describes the non-numerical columns instead.

This is a very useful method for getting a quick statistical understanding of the data.
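As a minimal sketch of the two calls, the example below runs describe() on a small made-up frame. The rows, the name toydf, and the explicit dtype=object are assumptions for illustration only; they are not taken from train.csv.

```python
import pandas as pd

# Made-up rows; dtype=object mirrors how text columns have traditionally
# been loaded from a CSV file.
toydf = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 28.0],
    "Sex": pd.Series(["male", "female", "female", "male", "female"], dtype=object),
})

# Default call: only the numerical columns (count, mean, std, quartiles).
numeric_summary = toydf.describe()

# include=["object"]: only the non-numerical columns (count, unique, top, freq).
text_summary = toydf.describe(include=["object"])

print(numeric_summary)
print(text_summary)
```

The numerical summary is where quartiles and extremes show up; the object summary instead reports how many distinct values a column has and which one is the most frequent.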

Figure: describe() to explore Titanic Data

From this description, we can make the following observations:

The mean and the distribution of the variables.

Most passengers bought their tickets at a relatively low price, but a few paid a high price, indicating possible VIPs.

The Parch distribution is highly skewed: the quartiles are all 0 while the maximum is 6. This means that most people did not bring parents or children on board, while a few travelled with up to 6 parents or children.

info()

The info() method helps you figure out the data types and the existence of empty values.

Here, we find that the columns Age, Cabin, and Embarked have missing values.

This is good: we have identified a few paths to explore.

But first, let us clean the data.
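Before cleaning, it can help to count the missing values per column directly, as a cross-check on what info() reports. The sketch below uses a made-up frame (toydf and its values are assumptions for illustration); on the real data the same call would be applied to maindf.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for train.csv; only the column names follow the article.
toydf = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() marks each missing cell; sum() counts the marks per column.
missing_counts = toydf.isna().sum()

# Show only the columns that actually have gaps.
print(missing_counts[missing_counts > 0])
```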

Clean

Figure source: Unsplash

Handle missing values in Age

There are several ways to replace the missing values in the Age column:

Not recommended: Replace by the mean of ages. This is not a good approach, because the mean is pulled by extreme values: most of the passengers are between 20 and 30 years old, while the oldest is 80 years old and the youngest is 0.42 years old (an infant).

Recommended: Replace by the median of ages. This is a better approach, as it would safely assign our missing values to the 20–30 age range, which sits comfortably inside the interquartile range.

Most recommended: Replace the ages with the median for each salutation. This is the best approach, as the salutation hints at the typical age of the passengers being imputed (e.g. Sir, Mdm, etc.).

Conclusion: Let us take the third approach.

In case you do not know what a lambda is, you can think of it as an inline function.

This makes the code much simpler to read.
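To make the "inline function" idea concrete, here is a small sketch that writes the same logic twice, once as a named function and once as a lambda. The helper names title_of and title_of_inline are hypothetical, and the split logic assumes names follow the "Surname, Title. Given names" pattern.

```python
# A named function and the equivalent lambda; both are hypothetical helpers
# written only for this illustration.
def title_of(name):
    # Take the part after the comma, then the part before the first period.
    return name.split(",")[1].split(".")[0].strip()

title_of_inline = lambda name: name.split(",")[1].split(".")[0].strip()

sample = "Braund, Mr. Owen Harris"
print(title_of(sample), title_of_inline(sample))
```

Both versions behave identically; the lambda is simply convenient when the function is short and only needed once, for example inside apply().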

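    # Extract the salutation from names such as "Braund, Mr. Owen Harris".
    # The lambda body is missing from the original text; this split is an assumed reconstruction.
    maindf['Salutation'] = maindf.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())
    # Fill missing ages with the median age of each (Salutation, Pclass) group.
    group = maindf.groupby(['Salutation', 'Pclass'])
    maindf['Age'] = group.Age.transform(lambda x: x.fillna(x.median()))
    # Fall back to the overall median for groups that have no known age.
    maindf['Age'] = maindf.Age.fillna(maindf.Age.median())

Drop Irrelevant columns

To simplify the analysis, let’s drop some columns which might not be relevant to survival, such as passenger ID and name.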

However, you should be very careful when dropping columns, as this will limit the assumptions you can test.

For example, there might be a surprisingly higher probability of survival for passengers with the name “John”.

Upon closer inspection, this could be because the name “John” is usually reserved for Englishmen who have high Social Economic Status (SES).

Therefore, if we did not have an SES column, we might need to include names in our analysis.

    # .copy() makes cleandf independent of maindf so new columns can be added safely.
    cleandf = maindf.loc[:, ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Embarked']].copy()

Engineer the features

In this section, we are going to transform some of the features (columns) into more sensible and meaningful values for data analysis.

SocioEconomicStatus (SES)

We will classify the SES feature based on the numerical value of Pclass.

We simply encode 1 as upper, 2 as middle, and 3 as lower.

    cleandf['socioeconomicstatus'] = cleandf.Pclass.map({1: 'upper', 2: 'middle', 3: 'lower'})

Port of Embarkation

We will map the letter codes (‘C’, ‘Q’, and ‘S’) to their respective ports.

We can then split the number of passengers according to their port of embarkation and survival status.

We will use a pie chart to compare the percentages based on port of embarkation and survival status.
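The split by port and survival status can be sketched on a small made-up frame. The rows and the names toydf, counts, and shares are assumptions for illustration; only the column names and port codes follow the article.

```python
import pandas as pd

# Made-up passengers; only the column names and port codes follow the article.
toydf = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S", "S", "S", "Q"],
    "Survived": [1, 0, 0, 1, 0, 0, 1, 1],
})
toydf["embarkedport"] = toydf.Embarked.map(
    {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
)

# Count passengers for every (port, survival) pair, then turn the counts
# into shares of all passengers: these are the slices of the pie.
counts = toydf.groupby(["embarkedport", "Survived"]).size()
shares = counts / counts.sum()
print(shares)

# With matplotlib installed, shares.plot(kind="pie", autopct="%.0f%%")
# would draw the chart.
```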

    cleandf['embarkedport'] = cleandf.Embarked.map({'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'})

Figure: Ratios of survivals based on Port of Embarkation

Age

We will generate the histogram of age and derive the following binning.

Binning is a great way to turn a skewed distribution of continuous data into discrete categories.

Each bin represents a range of the grouped numeric values.
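How pandas assigns values to bins is easiest to see on a handful of numbers. The sketch below uses made-up ages, edges, and labels (all assumptions for illustration, not the article's bins) to show what happens to values that sit exactly on a bin edge.

```python
import pandas as pd

ages = pd.Series([0.42, 10, 10.5, 18, 30, 80])  # made-up sample ages

# Illustrative edges and labels only; they are not the article's bins.
edges = [0, 10, 18, 90]
labels = ["0-10", "10-18", "18-90"]

# By default each bin is open on the left and closed on the right,
# so an age equal to an edge (10, 18) falls into the lower bin.
binned = pd.cut(ages, bins=edges, labels=labels)
print(binned.tolist())
```

One consequence of the right-closed default is that a value exactly equal to the lowest edge is left unassigned unless include_lowest=True is passed.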

    agesplit = [0, 10, 18, 25, 40, 90]
    agestatus = ['Adolescent', 'Teenager', 'Young Adult', 'Adult', 'Elder']
    cleandf['agegroup'] = pd.cut(cleandf.Age, agesplit, labels=agestatus)

Figure: Binned Age based on Age Split and Age Status Group

Family Relationships

“The lifeboats stayed afloat, however, and thus the legend of the ‘Birkenhead drill’, the protocol that prioritizes women and children during maritime disasters, was born.” (Source: History)

The Birkenhead Drill raises the question of whether the presence of your children or wife raises your survival rate.

Therefore, we want to engineer the number of siblings/spouses (SibSp) and parents/children (Parch) aboard into whether each person travels with a family member: hasfamily.

    cleandf['familymembers'] = cleandf.SibSp + cleandf.Parch
    hasfamily = (cleandf.familymembers > 0) * 1
    cleandf['hasfamily'] = hasfamily

The result

Figure: The result of feature engineering

Congrats! You have done your feature engineering.

Now we can use these new features to analyse.

Analyse

There are many analyses that we could run on the new cleaned dataset.

Now, let us work through a few questions.

Feel free to access my Python Notebook for more analysis.

Would survival rate differ by gender?

Would survival rate differ by SES?

Would survival rate differ by gender and SES?

Just a quick tip for data scientists: your role is to keep asking questions and answering them in a statistical manner.

This is not going to be a one-time waterfall process but a continuous and iterative one.

As I would say… this is just the tip of the iceberg!

Would survival rate differ by gender?

    maindf.groupby(['Survived', 'Sex']).count().Name
    maindf.groupby(['Survived', 'Sex']).count().Name.plot(kind='bar')

In total, 342 passengers survived and 549 did not.

Of those who survived, 233 are female and 109 are male, whereas of those who did not survive, 81 are female and 468 are male.

It seems that females were more likely to survive than males.
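Raw counts are hard to compare because there were far more men than women on board, so it helps to convert them into survival rates. The sketch below does this with the counts quoted above; the variable names counts and rates are only illustrative.

```python
import pandas as pd

# Counts quoted in the text above.
counts = pd.DataFrame(
    {"survived": [233, 109], "died": [81, 468]},
    index=["female", "male"],
)

# Survival rate = survivors / all passengers of that sex.
rates = counts["survived"] / counts.sum(axis=1)
print(rates.round(3))
```

On these counts, roughly 74% of the women survived against roughly 19% of the men.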

Would survival rate differ by SES?

We can use a cross tab to clearly generate the counts for two categorical features (SES, survival).

    survived = pd.crosstab(index=cleandf.Survived, columns=cleandf.socioeconomicstatus, margins=True)
    survived.columns = ['lower', 'middle', 'upper', 'rowtotal']
    survived.index = ['died', 'survived', 'coltotal']

From a quick look, it seems that SES matters greatly to survival: the upper class survived more often than the lower class.

However, let’s further test this hypothesis using the Chi Square method.

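Figure: Chi square run on the cross tab

Let us try to understand what this means.

The Chi2 Stat is the chi-square statistic, and the degrees of freedom of a contingency table are (rows − 1) × (columns − 1).

With 3 levels of SES (lower, middle, upper) and 2 outcomes of survival (died, survived), that is (2 − 1) × (3 − 1) = 2 degrees of freedom. A value of 6 would only arise if the rowtotal and coltotal margins were passed to the test as well, giving (3 − 1) × (4 − 1) = 6; the margins should be left out of the test.

The degrees of freedom do not make a result more or less significant on their own; they determine which chi-square distribution the statistic is compared against.

The p value will determine the significance of SES towards survival.

We will reject our null hypothesis if our p value is below alpha (0.01).

A p value always lies between 0 and 1, so the 6.25 in the output cannot be the p value as it stands; it is most likely the leading digits of a number printed in scientific notation (6.25 followed by a negative exponent), which would be far below our alpha. Read that way, the result is statistically significant, which agrees with what the cross tab suggested.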
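The degrees of freedom and the range of the p value can be checked on a small table with SciPy's chi2_contingency. This is a minimal sketch under stated assumptions: the counts are made up, and the names observed and with_margins are hypothetical; it is not the test run from the author's notebook.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts; rows are survival outcomes, columns are SES levels.
observed = pd.DataFrame(
    {"lower": [30, 10], "middle": [20, 20], "upper": [10, 30]},
    index=["died", "survived"],
)

chi2, p, dof, expected = chi2_contingency(observed)
print(dof)            # (2 - 1) * (3 - 1)
print(0 <= p <= 1)    # a p value is a probability

# Appending totals, as margins=True does, inflates the degrees of freedom.
with_margins = observed.copy()
with_margins["rowtotal"] = observed.sum(axis=1)
with_margins.loc["coltotal"] = with_margins.sum(axis=0)
_, _, dof_margins, _ = chi2_contingency(with_margins)
print(dof_margins)    # (3 - 1) * (4 - 1)
```

SciPy prints very small p values in scientific notation, so the exponent has to be read together with the leading digits before comparing against alpha.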

Maybe we could include sex and observe whether there is a significant difference?

Would survival rate differ by SES and Gender?

This crosstab allows us to generate a combined feature (SES, Sex) with a total of 6 possible events.

It seems that there are big differences in survival between high-SES females and low-SES males.

    survivedcrosstabsex = pd.crosstab(index=cleandf.Survived, columns=[cleandf['socioeconomicstatus'], cleandf['Sex']], margins=True)

Let’s do the same thing as before and insert this crosstab of values in a Chi2 Test.

Figure: Chi square run on the cross tab (SES and Sex)

The p value is lower, which points to stronger evidence of an association than when we analyse SES alone.

As before, compare the p value with our alpha (0.01) on its full scale, including any exponent printed in scientific notation: a value below alpha means that SES and gender together are statistically significant for inferring survival status.

For now we can explore these features with our machine learning model :).

Model

Training Decision Tree Model

Let’s model our findings with a decision tree.

A Decision Tree is a machine learning model that provides rule-based classification based on information gain.

This provides a neat way to choose the critical features and rules that best discriminate our dependent variable.

We will fit the tree on our training data, then prune it to avoid overfitting.

Finally, we will use the GraphViz library to visualize our tree as follows.

As the code is long, feel free to skim it in my Python Notebook.
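For readers who want something runnable before opening the notebook, here is a minimal sketch of fitting a small tree with scikit-learn. It is not the notebook's code: the data is made up, the hyperparameters are assumptions, and limiting max_depth is used here as a simple stand-in for pruning. Only the name clftree follows the article.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up, already-encoded passengers (Sex: 0 = female, 1 = male).
toydf = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1, 1, 1],
    "Pclass":   [1, 2, 3, 3, 1, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})
X = toydf[["Sex", "Pclass"]]
y = toydf["Survived"]

# criterion="entropy" splits on information gain; max_depth limits the
# tree's size, which is one simple way to keep it from overfitting.
clftree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clftree.fit(X, y)

# Plain-text view of the learned rules (GraphViz would draw the same tree).
print(export_text(clftree, feature_names=["Sex", "Pclass"]))
```

Each printed rule corresponds to one branch of the tree, so the text view can be read the same way as the GraphViz diagram.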

Figure: Decision Tree for Titanic Kaggle Challenge

From this graph, we can see the beauty of the decision tree as follows:

Understanding the Distribution and Profiles of Survivors: The array indicates [number of deaths, number of survivors]. In each node we can see a different split of the array and the classification rule used to branch out to the lower-level nodes. We can see that if sex = female (≤0.5), the array shows more survivors [69, 184]. We can follow the leaf-level nodes in the same way.

Understanding Which Critical Features Separate Survivors: At the top levels, we see Sex and SES as our major features. The third level, with Port of Embarkation and Age group, may add further discriminating power.

Congratulations, you have just created your first machine learning model!

Evaluation of Decision Tree

    from sklearn.metrics import accuracy_score, log_loss
    # Predict on the held-out test split and compare against the true labels.
    test_predictions = clftree.predict(X_test)
    acc = accuracy_score(y_test, test_predictions)

From here we find that our accuracy is 77.65%.

This means that out of 100 passengers, the model predicts the survival status of about 77 correctly.

To evaluate this model further, we can add a confusion matrix and ROC curve to our evaluation, which I will cover in my subsequent articles.
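As a small preview of the confusion matrix, the sketch below scores a set of made-up predictions. The labels y_true and y_pred are invented for illustration and are not the model's actual output.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for eight passengers (1 = survived, 0 = died).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# Rows are the true classes and columns the predicted classes, in the
# order [0, 1]: [[true negatives, false positives],
#                [false negatives, true positives]].
cm = confusion_matrix(y_true, y_pred)
print(acc)
print(cm)
```

Unlike a single accuracy figure, the matrix shows which kind of mistake the model makes: predicting survival for someone who died, or the reverse.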

Submit

You will generate the CSV of your predictions using the following function and commands.

We will name it “titanic_submission_tree.csv” and save it in your local directory.

    def titanic_kaggle_submission(filename, predictions):
        # Kaggle expects one row per test passenger: PassengerId and the predicted Survived value.
        submission = pd.DataFrame({'PassengerId': testdf['PassengerId'], 'Survived': predictions})
        submission.to_csv(filename, index=False)

    titanic_kaggle_submission("titanic_submission_tree.csv", predictions_tree)

Once you have submitted titanic_submission_tree.csv, you will receive a result like the following.

Here, I got a rank of 6535 out of all the relevant entries.

How do we improve our analysis?

Hyperparameter tuning and model selection: Tune the model’s hyperparameters and try other models to improve the submission score. Other submissions rely heavily on ensemble models such as Random Forest and XGBoost.

Feature engineering: Find out what other features could be generated. For example, maybe parents with kids were less likely to be saved. Or maybe people whose names start with early letters of the alphabet were saved first, as their room allocations were ordered by ticket type and alphabetical order? Creating features based on such questions will further improve the accuracy of our analysis.

Experiment, experiment, and experiment: Have fun and keep tuning your findings. No ideas are too absurd in this challenge!

Congrats! You have submitted your first data analysis and Kaggle submission.

Now, embark on your own Data Science Journey!!

Conclusion

In this article, we learnt one method to design an Exploratory Data Analysis (EDA) on the famous Titanic data.

We learnt how to import, explore, clean, engineer, analyse, model, and submit.

Apart from that, we also learnt useful techniques:

Explore → describe, plot, histogram

Clean → filling in missing numerical values, removing irrelevant columns

Engineer → binning, labelling

Analyse → survival plots, crosstab table, Chi2 test

Model → decision tree visualization, prediction and evaluation

Finally…

Figure source: Unsplash

I really hope this has been a great read and a source of inspiration for you to develop and innovate.

Please comment below with your suggestions and feedback.

Just like you, I am still learning how to become a better Data Scientist and Engineer.

Please help me improve so that I could help you better in my subsequent article releases.

Thank you and Happy coding :)

About the Author

Vincent Tatan is a Data and Technology enthusiast with relevant working experience from Visa Inc. and Lazada in implementing microservice architectures, business intelligence, and analytics pipeline projects.

Vincent is a native Indonesian with a record of accomplishments in problem solving with strengths in Full Stack Development, Data Analytics, and Strategic Planning.

He has been actively consulting SMU BI & Analytics Club, guiding aspiring data scientists and engineers from various backgrounds, and opening up his expertise for businesses to develop their products.

Please reach out to Vincent via LinkedIn, Medium, or his YouTube channel.
