Analyzing the Titanic with a Business Analyst mindset using R (ggplot2)

Coincidentally, I decided to watch the Titanic movie a second time, but this time with the mindset of a business analyst inspired by the power of data analysis.

The first time I watched the movie, I had a couple of questions in mind about how things played out on the Titanic, but I didn't get around to finding the answers back then.

Well, this time I got inspired by the solution-driven nature of data analysis and decided to find the answers to my own questions by pulling the ubiquitous Titanic dataset from Google.

I began my analysis with a couple of probe questions (BAs ask lots of questions, I guess you all know this already :)) regarding the events that unfolded in the Titanic shipwreck.

The visualization dashboard below materialized during my analysis.

Figure: Titanic Dashboard

Probe Questions

1. What was the survival rate on the Titanic?
2. How could I use data to visualize the "women and children first" approach adopted by rescuers on the Titanic?
3. What was the age distribution on the Titanic (both survivors and fatalities)?
4. What was the survivor age distribution by ticket class on the Titanic?
5. Given that the Titanic was the most expensive ship over a century ago, how does the fare value compare across all ticket classes?

If you've seen the movie, then you're in the right place, and if you haven't, then you are the very target of my findings.

Perhaps you may decide to watch it right away.

Now, let’s get into it.

Loading packages and exploring data

Let's start by loading the packages we'll use to create the visualizations for the analysis.

The tidyverse package will help with data processing and graphing.

Load the relevant libraries and import the Titanic dataset saved on your computer into RStudio.

Note: The source data frame does not contain information for the crew, but it does contain actual and estimated ages for almost 80% of the passengers.

    library(tidyverse)

    # Opens a file picker; choose the Titanic CSV saved on your machine
    titanic <- read.csv(file.choose())

Examine the structure of your dataset (variable names and variable types).

This step is essential to determining the suitability of your variables for plotting.

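    # Check out the structure of the dataset
    summary(titanic)
    str(titanic)
    names(titanic)
    head(titanic, n = 10)  # titanic.df does not exist yet, so inspect titanic

    # Remove rows where the survival value is missing (NA or an empty string)
    titanic.df <- filter(titanic, !is.na(survived), survived != "")

First, the dataset was cleaned up to remove or replace missing values.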
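Before dropping rows, it can help to see how much is actually missing in each column. The following is a minimal sketch in base R (so it runs without the tidyverse), using a small made-up data frame called `toy` rather than the real dataset; the values are invented, and only the lowercase column names mirror the ones used in this post.

```r
# Toy stand-in for the Titanic data; all values are invented for illustration.
toy <- data.frame(
  survived = c(1, 0, NA, 1),
  age      = c(29, NA, 40, 2),
  fare     = c(211.3, 7.9, NA, 151.6)
)

# Count the missing values in each column.
missing.counts <- colSums(is.na(toy))
print(missing.counts)

# Keep only the rows where the survival flag is known.
toy.clean <- toy[!is.na(toy$survived), ]
print(nrow(toy.clean))
```

Note that rows with a missing age are kept here; they are only skipped later, when a summary is computed with `na.rm = TRUE`.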

The columns in the dataset are listed below:

- Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Survived: Survival (0 = No, 1 = Yes)
- Name: Passenger name
- Sex: Gender (Male or Female)
- Age: Passenger age
- SibSp: Number of siblings and/or spouses aboard
- Parch: Number of parents and/or children aboard
- Ticket: Ticket number
- Fare: Fare price (British pounds)
- Cabin: Cabin
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- Boat: Lifeboat
- Body: Body identification number
- home.dest: Address of passenger's home or destination

The next step is to decide which variables we need and the appropriate scale to visualize our data.

I used the table classification below.

Visualization using ggplot2

To answer our first question: What was the survival rate on the Titanic?

In order of class (1st, 2nd and 3rd), the percentage of females that survived was 97%, 89% and 49%.

In order of class (1st, 2nd and 3rd), the percentage of males that survived was 34%, about 15% (14.6%) and about 15% (15.2%).

Run the code below to generate the corresponding visualizations:

    ggplot(data = titanic.df) +
      # factor() treats survival as a category, so each outcome gets its own fill colour
      aes(x = age, fill = factor(survived)) +
      geom_histogram(bins = 30, colour = "#1380A1") +
      # scale_fill_brewer(palette = "Accent") +
      labs(title = "Survival rate on the Titanic",
           y = "Survived",
           fill = "Survived",
           subtitle = "Distribution By Age, Gender and class of ticket",
           caption = "Author: etoma.egot") +
      theme_tomski() +  # a custom theme used for my visualizations (not defined here)
      # theme_bw() +    # use this built-in ggplot2 theme for your practice
      facet_grid(sex ~ pclass, scales = "free")

    # Proportion of 1st, 2nd and 3rd class women and men who survived
    mf.perc.survived <- titanic.df %>%
      group_by(pclass, sex) %>%
      summarise(Survived = sum(survived == 1),
                Died = sum(survived != 1),
                Perc.survived = Survived / (Survived + Died) * 100)
    mf.perc.survived

Figure: Survival Rate on the Titanic

Results Interpretation

This graph helps identify survival patterns across all three variables (age, sex and ticket class).

In order of class (1st, 2nd and 3rd), the percentage of females that survived was 97%, 89% and 49%.

In order of class (1st, 2nd and 3rd), the percentage of males that survived was 34%, about 15% (14.6%) and about 15% (15.2%).

Within 1st and 2nd class, all children survived except one female child from 1st class.

There were more child fatalities in 3rd class.
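The survival percentages quoted above are group-wise rates: the share of survivors within each combination of class and sex. As a rough illustration of that calculation, here is a sketch in base R on a made-up data frame (`toy`, with invented values, not the real dataset). It relies on the fact that the mean of a 0/1 column equals the proportion of 1s.

```r
# Toy passenger records; invented for illustration only.
toy <- data.frame(
  pclass   = c(1, 1, 1, 1, 3, 3, 3, 3),
  sex      = c("female", "female", "male", "male",
               "female", "female", "male", "male"),
  survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rates <- aggregate(survived ~ pclass + sex, data = toy, FUN = mean)
rates$perc.survived <- 100 * rates$survived
print(rates)
```

On the real data, the same idea applies with `titanic.df` in place of `toy`, assuming `survived` is stored as 0/1.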

On to our second question: How could I use data to confirm the "women and children first" approach adopted by rescuers on the Titanic?

Children, women and men were considered by rescuers in order of ticket class, with priority being given to women, children and older adults (at least 60 yrs) across all classes.

Run the code to get the visualization below:

    titanic.df %>%
      filter(fare <= 300) %>%  # drop the outlier fares
      ggplot(mapping = aes(x = age, y = fare)) +
      # alpha is a fixed setting rather than a mapping, so it goes outside aes()
      geom_point(aes(colour = survived, size = fare), alpha = 0.7) +
      geom_smooth(se = FALSE) +
      facet_grid(sex ~ pclass, scales = "free") +
      labs(title = "Priority and pattern of rescue on the Titanic",
           x = "Age (yrs)",
           y = "Fare (£)",
           subtitle = "Children and women in order of ticket class were\nconsidered first in the rescue plan with priority being\nwomen, children and older adults >= 60yrs",
           caption = "Author: etoma.egot") +
      theme_tomski() +  # using a custom theme (not defined here)
      # placed after the custom theme so the subtitle styling is not overridden
      theme(plot.subtitle = element_text(colour = "#17c5c9", size = 14))

Figure: Priority and pattern of rescue on the Titanic

Results Interpretation

Going by the distribution in the figure above (fares being proportional to ticket class), and aside from the fact that children (<= 12) on the Titanic were charged separate boarding fares, the fares for children and teens seem unusually high compared with the average fares for the other age groups.

Let me know if you know why this is so.

Nonetheless, the bubble chart gives some other clues regarding the pattern of rescue operations.

Evidently, a "women and children first" approach in order of ticket class was adopted by rescuers, with priority being given to women, children and older adults (at least 60 yrs) across all classes.

Apparently, little or no priority was given to male passengers by rescuers.

Note: I removed the outlier fares (£500) from the bubble chart.

The males and females who paid these fares were rescued anyway.

I used boxplots to visualize the next three questions.

Moving on to the 3rd question: What was the age distribution on the Titanic (both survivors and fatalities)?

Generally, the males on the Titanic were older than the females by an average of 3 yrs across all ticket classes.

    titanic.df %>%
      # factor() treats ticket class as a category, giving one box per class
      ggplot(mapping = aes(x = factor(pclass), y = age)) +
      geom_point(colour = "#1380A1", size = 1) +
      geom_jitter(aes(colour = survived)) +  # this generates multiple colours
      geom_boxplot(alpha = 0.7, outlier.colour = NA) +
      labs(title = "Age Distribution by Class on the Titanic",
           x = "Ticket Class",
           y = "Age (Yrs)",
           subtitle = "The males on the titanic were older than the females by an average of 3yrs across all ticket classes",
           caption = "Author: etoma.egot") +
      theme_tomski() +  # using my own custom theme (not defined here)
      theme(plot.subtitle = element_text(size = 18)) +
      facet_wrap(. ~ sex)

    # Calculating mean and median age by class and gender
    titanic.df %>%
      group_by(pclass, sex) %>%
      summarise(n = n(),  # count of passengers
                Average.age = mean(age, na.rm = TRUE),
                Median.age = median(age, na.rm = TRUE))

Figure: Age Distribution by Class on the Titanic

Results Interpretation

Negatively skewed: the boxplot will show the median closer to the upper quartile.

Positively skewed: the boxplot will show the median closer to the lower quartile.

FEMALE

About 75% of the females in order of class (1st, 2nd, 3rd) were at least 22, 20 and 17 yrs old.

The median age was 36 yrs (roughly symmetric), 28 yrs (negatively skewed) and 22 yrs (positively skewed).

MALE

About 75% of the males in order of class (1st, 2nd, 3rd) were at least 30, 24 and 20 yrs old.

The median age was 42 yrs (negatively skewed), 30 yrs (positively skewed) and 25 yrs (positively skewed).

Summary: Generally, the males on the Titanic were older than the females by an average of 3 yrs across all ticket classes.
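The skew reading used above (where the median sits between the quartiles) can also be checked numerically instead of by eye. Below is a small sketch in base R; `boxplot_skew` and its `tol` tolerance are hypothetical names introduced only for this illustration, and the age vectors are invented.

```r
# Classify skew from the quartiles, mirroring how a boxplot is read.
# `tol` is a hypothetical tolerance (in the data's units) for calling
# the two halves of the box "about equal".
boxplot_skew <- function(x, tol = 0.5) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  lower.gap <- q[2] - q[1]  # distance from lower quartile to median
  upper.gap <- q[3] - q[2]  # distance from median to upper quartile
  if (abs(upper.gap - lower.gap) <= tol) {
    "roughly symmetric"
  } else if (upper.gap > lower.gap) {
    "positively skewed"  # median sits closer to the lower quartile
  } else {
    "negatively skewed"  # median sits closer to the upper quartile
  }
}

# Invented ages for illustration only.
ages.right <- c(18, 19, 20, 21, 22, 30, 45, 60, NA)
ages.left  <- c(2, 15, 30, 38, 39, 40, 41, 42)
ages.even  <- 1:9

print(boxplot_skew(ages.right))
print(boxplot_skew(ages.left))
print(boxplot_skew(ages.even))
```

This only looks at the box itself, not the whiskers, so it is a quick check rather than a formal skewness test.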

Next, for the 4th question: What was the survivor age distribution by ticket class on the Titanic?

- The median age of male and female survivors in 1st class was the same (36 yrs).
- The female survivors in 2nd class were about 1.5 times as old as the males.
- The male survivors in 3rd class were older than the females by 2 yrs.

    titanic.df %>%
      filter(survived == 1) %>%
      # factor() treats ticket class as a category, giving one box per class
      ggplot(mapping = aes(x = factor(pclass), y = age)) +
      geom_point(size = 1) +
      geom_jitter(colour = "#1380A1") +
      geom_boxplot(alpha = 0.7, outlier.colour = NA) +
      labs(title = "Survivors Age Distribution by Class on the Titanic",
           x = "Ticket Class",
           y = "Age (Yrs)",
           subtitle = "The median age of male and female survivors in 1st class was the same (36 yrs).\nThe females in 2nd class were 1.5 times older than the males.\nThe males in 3rd class were older than the females by 2yrs",
           caption = "Author: etoma.egot") +
      theme_tomski() +  # using my own custom theme (not defined here)
      theme(plot.subtitle = element_text(colour = "#1380A1", size = 18)) +
      facet_wrap(. ~ sex)

    # Calculating mean and median age of survivors by class and gender
    titanic.df %>%
      filter(survived == 1) %>%
      group_by(pclass, sex) %>%
      summarise(n = n(),  # count of passengers
                Average.age = mean(age, na.rm = TRUE),
                Median.age = median(age, na.rm = TRUE))

Figure: Survivor age distribution by class on the Titanic

Results Interpretation

FEMALE

- The median age was 36 yrs (roughly symmetric), 28 yrs (negatively skewed) and 22 yrs (positively skewed).

MALE

- The median age was 36 yrs (positively skewed), 19 yrs (negatively skewed) and 25 yrs (negatively skewed).

Summary:

- The median age of male and female survivors in 1st class was the same (36 yrs).
- The female survivors in 2nd class were about 1.5 times as old as the males.
- The male survivors in 3rd class were older than the females by 2 yrs.

Finally, for the last question: Given that the Titanic was the most expensive ship over a century ago, how does the fare value compare across all ticket classes?

A 1st class ticket cost about 4 times a 2nd class ticket, and a 2nd class ticket was worth about twice that of 3rd class.

    # Prepare data: remove outliers in fare
    titanic.df %>%
      filter(fare < 300) %>%
      # factor() treats ticket class as a category, giving one box per class
      ggplot(mapping = aes(x = factor(pclass), y = fare)) +
      # geom_point(colour = "#1380A1", size = 1) +
      # geom_jitter(aes(colour = survived)) +
      geom_boxplot(colour = "#1380A1", outlier.colour = NA) +
      labs(title = "Fare Value by Class",
           x = "Ticket Class",
           y = "Ticket Fare (£)",
           subtitle = "1st class ticket was worth 4 times a 2nd class ticket\nand 2nd class ticket was worth almost twice that of 3rd class",
           caption = "Author: etoma.egot") +
      theme_tomski() +  # using my own custom theme (not defined here)
      theme(plot.subtitle = element_text(colour = "#1380A1", size = 18)) +
      # a plot keeps only one coord, so the zoom limits are passed to coord_flip()
      coord_flip(ylim = c(0, 125))

    # Calculating mean and median fare by class
    titanic.df %>%
      filter(fare < 300) %>%
      group_by(pclass) %>%
      summarise(Average.fare = mean(fare, na.rm = TRUE),
                Median.fare = median(fare, na.rm = TRUE))

    # Calculating mean and median fare by class for children
    titanic.df %>%
      filter(fare < 300, age <= 12) %>%
      group_by(pclass) %>%
      summarise(n = n(),
                Average.fare = mean(fare, na.rm = TRUE),
                Median.fare = median(fare, na.rm = TRUE))

    # Calculating mean and median fare by class for adults
    # (age > 12, so 12-year-olds are not counted in both groups)
    titanic.df %>%
      filter(fare < 300, age > 12) %>%
      group_by(pclass) %>%
      summarise(n = n(),
                Average.fare = mean(fare, na.rm = TRUE),
                Median.fare = median(fare, na.rm = TRUE))

Figure: Fare Value By Class on the Titanic

Results Interpretation

The box plot confirms that the ticket fare rises with the ticket class.

Pretty much intuitive.

The distribution is skewed to the right.

The median fares for 1st, 2nd and 3rd class are £59.4, £15 and £8.05.

The mean fares for 1st, 2nd and 3rd class are £82.2, £21.2 and £13.3 (the mean fares are greater than the median fares).

Hence, a better measure of the center for this distribution is the median.

Thus, going by the medians, a 1st class ticket cost about 4 times a 2nd class ticket (£59.4 vs £15), and a 2nd class ticket was worth about twice that of 3rd class (£15 vs £8.05).

The average and median fares for children are higher than those for adults in the same class.
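To see why the median is the safer summary here, it helps to look at how a few large values pull the mean upward. The sketch below, in base R, uses an invented right-skewed `fares` vector (not the real data) and then computes the class-to-class ratios from the median fares quoted in this post.

```r
# Invented right-skewed fares: most are cheap, a few are very expensive.
fares <- c(7, 8, 8, 9, 10, 12, 15, 60, 120)

mean.fare   <- mean(fares)    # pulled upward by the two large fares
median.fare <- median(fares)  # stays with the bulk of the values
print(c(mean = mean.fare, median = median.fare))

# Median fares by class as quoted in the text (in pounds).
median.by.class <- c(first = 59.4, second = 15, third = 8.05)

# How many times more expensive is each class than the next one down?
ratio.1st.2nd <- median.by.class[["first"]] / median.by.class[["second"]]
ratio.2nd.3rd <- median.by.class[["second"]] / median.by.class[["third"]]
print(round(c(ratio.1st.2nd, ratio.2nd.3rd), 2))
```

Running the same ratios on the mean fares instead is a quick way to check how much the choice of summary changes the comparison.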

Note: For a symmetrical distribution, the mean is in the middle, so the mean is an appropriate measure to use for comparisons. But if a distribution is skewed, the mean is usually not in the middle, so the median is the more appropriate measure for comparisons.

I am quite thrilled about my first write-up on TDS. Nonetheless, if you have done a similar analysis, I still need to clarify whether children really paid more than some adults across different age groups, as my bubble chart results seem to suggest.

Thanks for reading!
