A beginner’s guide to Kaggle’s Titanic problem

Sumit Mukhija, Jun 22

Image source: Flickr

Since this is my first post, here’s a brief introduction of what I’ve been doing: I am a software developer turned data enthusiast.

I have recently started learning the nitty-gritty of Data Science. One of the most prominent challenges when I started learning through videos and courses on websites like Udemy, Coursera, etc. was that they made me passive: I did more listening and less of, well, doing. I had no hands-on practice, even though I could understand most of the theory.

At that point, I came across Kaggle, a website with a set of Data Science problems and competitions, many hosted by major technology companies like Google.

Around the world, Kaggle is known for its problems being interesting, challenging and very, very addictive.

One of these problems is the Titanic Dataset.

So summing it up, the Titanic Problem is based on the sinking of the ‘Unsinkable’ ship Titanic in early 1912.

It gives you information about multiple people, like their ages, sexes, sibling counts, embarkation points and whether or not they survived the disaster.

Based on these features, you have to predict if an arbitrary passenger on Titanic would survive the sinking.

Sounds easy, right? Nope.

The problem statement is merely the tip of the iceberg.

Libraries Used: Pandas, Seaborn, Sklearn, WordCloud

Laying the land

The initial phase dealt with the characteristics of the complete dataset.

Here, I did not try to shape the features or draw anything from them; I merely observed their qualities.
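For reference, here is a minimal sketch of the imports the snippets below assume. The post does not show its import cell, so treat this as an approximation rather than the author’s exact code.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
# from wordcloud import WordCloud  # listed under Libraries Used; only needed for the word clouds in the full notebook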

1. Aggregation

I initially aggregated the data from the training and test datasets. The resulting dataset had 1309 rows and 12 columns.

Each row represented a unique traveler on RMS Titanic, and each column described a different attribute of that traveler.

trd = pd.read_csv('train.csv')
tsd = pd.read_csv('test.csv')
td = pd.concat([trd, tsd], ignore_index=True, sort = False)

2. Missing values

The dataset had a couple of columns that were missing values.

The ‘Cabin’ attribute had 1014 missing values.

The column ‘Embarked’ that depicted a commuter’s boarding point had a total of 2 missing values.

The property ‘Age’ had 263 missing values, and the column ‘Fare’ had one.

td.isnull().sum()
sns.heatmap(td.isnull(), cbar = False).set_title("Missing values heatmap")

3. Categories

Further, to understand the categorical and non-categorical features, I had a look at the number of unique values each column had.

The attributes ‘Sex’ and ‘Survived’ had two possible values, while ‘Embarked’ and ‘Pclass’ had three.

td.nunique()

PassengerId    1309
Survived          2
Pclass            3
Name           1307
Sex               2
Age              98
SibSp             7
Parch             8
Ticket          929
Fare            281
Cabin           186
Embarked          3
dtype: int64

Features

After getting a better perception of the different aspects of the dataset, I started exploring the features and the part they played in the survival or demise of a traveler.

1. Survived

The first feature reported if a traveler lived or died.

A comparison revealed that more than 60% of the passengers had died.
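To reproduce that comparison, counting the values of ‘Survived’ is enough; a minimal sketch, assuming the combined frame td from above (the column is only populated for the 891 training passengers, so the NaN test rows drop out automatically):

td.Survived.value_counts(normalize=True) * 100   # rough percentage of casualties (0) vs. survivors (1)
sns.countplot(x='Survived', data=td)             # bar chart of the two outcomes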

2. Pclass

This feature gives the passenger’s travel class.

The tourists could opt from three distinct sections, namely class-1, class-2, class-3.

The third class had the highest number of commuters, followed by class-2 and class-1.

The number of tourists in the third class was more than the number of passengers in the first and second class combined.

The survival chances of a class-1 traveler were higher than those of class-2 and class-3 travelers.
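A hedged sketch of how such a class-wise comparison could be computed, not necessarily the author’s exact plot:

td.Pclass.value_counts()                      # passengers per class
td.groupby('Pclass').Survived.mean()          # survival rate per class (NaN test rows are ignored)
sns.countplot(x='Pclass', hue='Survived', data=td)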

3. Sex

Approximately 65% of the tourists were male while the remaining 35% were female.

Nonetheless, the percentage of female survivors was higher than the percentage of male survivors. More than 80% of male commuters died, while around 70% of female commuters survived.
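These proportions can be checked with a cross-tabulation; a small illustrative sketch:

pd.crosstab(td.Sex, td.Survived, normalize='index')   # share of casualties and survivors per sex
sns.countplot(x='Sex', hue='Survived', data=td)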

4. Age

The youngest traveler onboard was around two months old and the oldest traveler was 80 years old.

The average age of tourists onboard was just under 30 years.

Clearly, a larger fraction of children under 10 survived than died. For every other age group, the number of casualties was higher than the number of survivors. More than 140 people in the 20 to 30 age group died, compared to only around 80 survivors in the same range.
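One way to reproduce such an age-group comparison is to bin the ages and count outcomes per bin; a sketch assuming ten-year bins (the post does not show the exact binning it plotted):

age_bins = pd.cut(td.Age, bins=range(0, 90, 10))         # (0, 10], (10, 20], ... (70, 80]
td.groupby(age_bins).Survived.value_counts().unstack()   # casualties vs. survivors per age group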

5. SibSp

SibSp is the number of siblings or spouses a passenger had onboard.

One traveler had as many as 8 siblings and spouses aboard. More than 90% of people traveled alone or with one sibling or spouse. The chances of survival dropped drastically if someone traveled with more than 2 siblings or spouses.
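A quick, hedged way to verify both observations:

td.SibSp.value_counts(normalize=True)   # how common each SibSp count is
td.groupby('SibSp').Survived.mean()     # survival rate for each SibSp count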

6. Parch

Similar to SibSp, this feature gives the number of parents or children each passenger was traveling with. One traveler had as many as 9 parents/children aboard.

I added the ‘Parch’ and ‘SibSp’ values and stored the sum in a new column named ‘Family’.

td['Family'] = td.Parch + td.SibSp

Moreover, the chances of survival skyrocketed when a traveler traveled alone, so I created another column, Is_Alone, and assigned True if the value in the ‘Family’ column was 0.

td['Is_Alone'] = td.Family == 0
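To see how these two engineered columns relate to survival, a small illustrative check (not part of the original post):

td.groupby('Is_Alone').Survived.mean()   # survival rate for solo vs. accompanied travelers
td.groupby('Family').Survived.mean()     # survival rate by total family size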

7. Fare

By splitting the fare amount into four categories, it was obvious that there was a strong association between the fare paid and survival.

The more a tourist paid, the better their chances of survival.

I stored the segregated fare in a new column, Fare_Category.

td['Fare_Category'] = pd.cut(td['Fare'], bins=[0, 7.90, 14.45, 31.28, 120], labels=['Low', 'Mid', 'High_Mid', 'High'])
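A hedged check of that association, using the Fare_Category column just created:

td.groupby('Fare_Category').Survived.mean()   # survival rate within each fare band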

8. Embarked

Embarked indicates where the traveler boarded from.

There are three possible values for Embarked: Southampton, Cherbourg, and Queenstown.

More than 70% of the people boarded from Southampton.

Just under 20% boarded from Cherbourg and the rest boarded from Queenstown.

People who boarded from Cherbourg had a higher chance of survival than people who boarded from Southampton or Queenstown.
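Both the boarding distribution and the survival gap can be verified with a couple of one-liners; a minimal sketch:

td.Embarked.value_counts(normalize=True)   # share of passengers per embarkation port
td.groupby('Embarked').Survived.mean()     # survival rate per port (C, Q, S)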

It is worth noting that we did not use the ‘Ticket’ column.

Data Imputation

Data imputation is the practice of replacing missing data with substituted values. There is a multitude of substitution methods that can be used.

I used some of them for the missing values.

1. Embarked

Since ‘Embarked’ had only two missing values and the largest number of commuters embarked from Southampton, the missing entries most probably correspond to Southampton as well.

So, we fill the missing values with Southampton.

However, instead of manually putting in Southampton, we would find the mode of the Embarked column and substitute missing values with it.

 The mode is the most frequently occurring element in a series.

td.Embarked.fillna(td.Embarked.mode()[0], inplace = True)

2. Cabin

As the column ‘Cabin’ had a lot of missing data, I decided to categorize all the missing entries as a separate class, which I named ‘NA’, and assigned that value to every missing cell.

td.Cabin = td.Cabin.fillna('NA')

3. Age

Age was the most intricate column to be filled.

Age had 263 missing values.

I initially categorized the people on the basis of their salutations.

A basic Python string split was enough to extract the title from each name.

There were 18 different titles.

td['Salutation'] = td.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())

I then grouped the passengers by ‘Sex’ and ‘Pclass’.

grp = td.groupby(['Sex', 'Pclass'])

The median age of each group was then substituted into the missing rows.

# fill each missing age with the median age of the passenger's Sex/Pclass group
# (transform keeps the result aligned with the original rows)
td['Age'] = grp.Age.transform(lambda x: x.fillna(x.median()))
td.Age.fillna(td.Age.median(), inplace = True)  # fall back to the overall median for anything still missing

Encoding

Since string data does not go well with machine learning algorithms, I needed to convert the non-numeric data to numeric data.

I used LabelEncoder to encode the ‘Sex’ column.

The label encoder would substitute ‘male’ values with some number and ‘female’ values with some different number.

td['Sex'] = LabelEncoder().fit_transform(td['Sex'])

For the other categorical data, I used Pandas’ get_dummies. It adds a column for each possible value. So, if there are three embarkation values (Q, C, S), get_dummies creates separate indicator columns and assigns 0 or 1 depending on the boarding point; drop_first=True drops the first of them, since it is implied by the others. The indicator columns are then joined back onto the dataframe.

td = pd.concat([td, pd.get_dummies(td.Embarked, prefix="Emb", drop_first = True)], axis=1)

Dropping columns

Further, I dropped the columns that I did not need for the prediction and the columns that I had encoded by creating their dummies.

# 'Deck' and 'Age_Range' are engineered in the full notebook (not shown in this post)
td.drop(['Pclass', 'Fare', 'Cabin', 'Fare_Category', 'Name', 'Salutation', 'Deck', 'Ticket', 'Embarked', 'Age_Range', 'SibSp', 'Parch', 'Age'], axis=1, inplace=True)

Prediction

This was a classification problem, and I tried predicting with two algorithms: Random Forest and Gaussian Naive Bayes. I was surprised at the results.

The Gaussian Naive Bayes algorithm performed poorly, while the Random Forest consistently predicted with an accuracy of more than 80%.

# Data to be predicted
X_to_be_predicted = td[td.Survived.isnull()]
X_to_be_predicted = X_to_be_predicted.drop(['Survived'], axis = 1)
# X_to_be_predicted[X_to_be_predicted.Age.isnull()]
# X_to_be_predicted.dropna(inplace = True) # 417 x 27

# Training data
train_data = td
train_data = train_data.dropna()
feature_train = train_data['Survived']                  # target column (naming kept from the original)
label_train = train_data.drop(['Survived'], axis = 1)   # feature columns

## Gaussian Naive Bayes
clf = GaussianNB()
x_train, x_test, y_train, y_test = train_test_split(label_train, feature_train, test_size=0.2)
clf.fit(x_train, np.ravel(y_train))
print("NB Accuracy: " + repr(round(clf.score(x_test, y_test) * 100, 2)) + "%")
result_nb = cross_val_score(clf, x_train, y_train, cv=10, scoring='accuracy')
print('The cross validated score for Gaussian Naive Bayes is:', round(result_nb.mean() * 100, 2))
y_pred = cross_val_predict(clf, x_train, y_train, cv=10)
sns.heatmap(confusion_matrix(y_train, y_pred), annot=True, fmt='3.0f', cmap="summer")
plt.title('Confusion_matrix for NB', y=1.05, size=15)

## Random Forest
clf = RandomForestClassifier(criterion='entropy', n_estimators=700, min_samples_split=10,
                             min_samples_leaf=1, max_features='auto',  # 'auto' equals 'sqrt' here; newer sklearn versions expect 'sqrt'
                             oob_score=True, random_state=1, n_jobs=-1)
x_train, x_test, y_train, y_test = train_test_split(label_train, feature_train, test_size=0.2)
clf.fit(x_train, np.ravel(y_train))
print("RF Accuracy: " + repr(round(clf.score(x_test, y_test) * 100, 2)) + "%")
result_rf = cross_val_score(clf, x_train, y_train, cv=10, scoring='accuracy')
print('The cross validated score for Random forest is:', round(result_rf.mean() * 100, 2))
y_pred = cross_val_predict(clf, x_train, y_train, cv=10)
sns.heatmap(confusion_matrix(y_train, y_pred), annot=True, fmt='3.0f', cmap="summer")
plt.title('Confusion_matrix for RF', y=1.05, size=15)

RF Accuracy: 78.77%
The cross validated score for Random forest is: 84.56

Lastly, I created a submission file to store the predicted results.

result = clf.predict(X_to_be_predicted)
submission = pd.DataFrame({'PassengerId': X_to_be_predicted.PassengerId, 'Survived': result})
submission.Survived = submission.Survived.astype(int)
print(submission.shape)
filename = 'Titanic Predictions.csv'
submission.to_csv(filename, index=False)
print('Saved file: ' + filename)

The line of code below is particularly important, as Kaggle would rate the predictions wrong if the Survived values are not of the int data type.

submission.Survived = submission.Survived.astype(int)

Submission result

The complete implementation Jupyter Notebook can be found on my GitHub or Kaggle.

The submission got me to the top 8% of the contestants.

It wasn’t easy and it took me more than 20 attempts to get there.

I would say the key is to be analytical, play around with analysis, be intuitive and try everything, no matter how absurd it sounds.
