Data Analysis and Visualisations using R

If yes, then this tutorial is meant for you!

Overview & Purpose

In this article, we’ll learn how to do basic exploratory analysis on a data set, create visualisations and draw inferences.

What we’ll be covering

Getting Started with R

Understanding your Data Set

Analysing & Building Visualisations

1. Getting Started with R

1.1 Download and Install R | R Studio

R offers a set of inbuilt libraries that help build visualisations flexibly and with minimal code.

You can download R easily from the R Project Website.

While downloading you would need to choose a mirror.

Choose R depending on your operating system, such as Windows, Mac or Linux.

It is super easy to install R.

Just follow through the basic installation steps and you’d be good to go.

For an easy way to write scripts, I recommend using R Studio.

It is an open source environment which is known for its simplicity and efficiency.

Launch Screen after starting R Studio

1.2 Install R packages

Packages are the fundamental units of reproducible R code, created by the community.

These include reusable R functions, documentation that describes how to use them and sample data.

The directory where packages are stored is called the library.

R comes with a standard set of packages.

Others are available for download and installation.

Once installed, they have to be loaded into the session to be used.

To install a package in R, we simply use the command

install.packages("Name of the Desired Package")
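As a rough sketch of the install-then-load cycle, the snippet below checks whether a package is available before loading it. The package name ggplot2 is only an example, and the install call is left commented out so that the snippet does not contact CRAN on its own.

```r
# Directories where installed packages live (the "library").
lib_dirs <- .libPaths()
print(lib_dirs)

# Example package name; swap in whichever package you need.
pkg <- "ggplot2"

# TRUE if the package is installed and its namespace can be loaded.
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  # install.packages(pkg)   # uncomment to download and install from CRAN
  message("Package '", pkg, "' is not installed yet.")
} else {
  # Load the installed package into the current session.
  library(pkg, character.only = TRUE)
}
```

Installing is a one-time step per machine, while library() has to be called in every new session.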

1.3 Loading the Data set

There are some data sets that come pre-installed in R.

Here, we shall be using the Titanic data set, which is available in R through the titanic package.

When using an external data source, we can use the read commands to load the files (Excel, CSV, HTML, text files etc.).

This data set is also available at Kaggle.

You may download the data set, both train and test files.

In this tutorial, we’ll be using just the train data set.

titanic <- read.csv("C:/Users/Desktop/titanic.csv", header=TRUE, sep=",")

The above code reads the file titanic.csv into a data frame called titanic.

With header=TRUE we are specifying that the data includes a header (column names), and sep="," specifies that the values in the data are comma separated.
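To try read.csv() without depending on a particular file path, here is a small sketch that writes a few made-up rows to a temporary file and reads them back. The rows are invented for illustration and are not real passenger records; only the column names follow the Kaggle file.

```r
# A tiny, made-up sample in the same layout as the Kaggle file (not real records).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "PassengerId,Survived,Pclass,Sex,Age",
  "1,0,3,male,22",
  "2,1,1,female,38",
  "3,1,3,female,"
), csv_path)

# header = TRUE takes the column names from the first line;
# sep = "," splits each line on commas.
sample_df <- read.csv(csv_path, header = TRUE, sep = ",")

dim(sample_df)     # number of rows and columns read
names(sample_df)   # column names taken from the header line
sample_df$Age      # the empty Age field is read as NA
```

If the titanic package from CRAN is installed, titanic::titanic_train should give you the train data directly, without reading a file.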

2. Understanding the Data set

We have used the Titanic data set, which contains historical records of passengers who boarded the Titanic.

Below is a brief description of the 12 variables in the data set:

PassengerId: Serial number

Survived: Binary values of 0 and 1 (0 = passenger did not survive, 1 = passenger survived)

Pclass: Ticket class (1st, 2nd or 3rd class ticket)

Name: Name of the passenger

Sex: Male or female

Age: Age in years

SibSp: Number of siblings/spouses aboard (brothers, sisters and/or husband/wife)

Parch: Number of parents/children aboard (mother/father and/or daughter/son)

Ticket: Ticket serial number

Fare: Passenger fare

Cabin: Cabin number

Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

2.1 Peek at your Data

Before we begin working on the data set, let’s have a good look at the raw data.

View(titanic)

This helps us familiarise ourselves with the data set.

head(titanic, n) | tail(titanic, n)

In order to have a quick look at the data, we often use head()/tail().

Top 10 rows of the data set.

Bottom 5 rows of the data set.

In case we do not explicitly pass a value for n, it takes the default value of 6 and displays 6 rows.
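A quick way to check this behaviour is to run head() and tail() on a small stand-in data frame. The one below is made up purely for illustration; with base R’s head() and tail(), omitting n returns six rows.

```r
# Made-up data frame with 20 rows, standing in for the Titanic data.
toy <- data.frame(PassengerId = 1:20,
                  Survived    = rep(c(0, 1), times = 10))

head(toy)        # first 6 rows (the default when n is omitted)
head(toy, 10)    # first 10 rows
tail(toy, 5)     # last 5 rows

nrow(head(toy))  # number of rows returned by default
```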

names(titanic)

This helps us check all the variables in the data set.

Familiarising with all the Variables/Column Names

str(titanic)

This helps in understanding the structure of the data set, the data type of each attribute and the number of rows and columns present in the data.

summary(titanic)

A cursory look at the data

summary() is one of the most important functions that help in summarising each attribute in the data set.

It gives a set of descriptive statistics, depending on the type of variable:

In case of a numerical variable -> gives the minimum, maximum, mean, median and quartiles.

In case of a factor variable -> gives a table with the frequencies.

In case of factor + numerical variables -> also gives the number of missing values.

In case of character variables -> gives the length and the class.
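The differences between these cases are easiest to see side by side. The sketch below calls summary() on one made-up vector of each type; the values are invented and only the behaviour of summary() matters here.

```r
# Made-up values, one vector per variable type.
age  <- c(22, 38, NA, 35, 54)                                  # numeric, one missing value
sex  <- factor(c("male", "female", "female", "male", "male"))  # factor
name <- c("A", "B", "C", "D", "E")                             # character

age_summary  <- summary(age)    # min, quartiles, median, mean, max and NA count
sex_summary  <- summary(sex)    # frequency of each level
name_summary <- summary(name)   # length, class and (storage) mode

age_summary
sex_summary
name_summary
```

Note that the "Mode" reported for a character vector is its storage mode, not the statistical mode.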

In case we just need the summary statistics for a particular variable in the data set, we can use summary(datasetName$VariableName), for example summary(titanic$Pclass).

as.factor(dataset$ColumnName)

There are times when some of the variables in the data set are factors but might get interpreted as numeric.

For example, Pclass (Passenger Class) takes the values 1, 2 and 3; however, we know that these are not to be treated as numbers, as they are just levels.

In order to have such variables treated as factors and not as numbers, we need to explicitly convert them to factors using the function as.factor().
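To illustrate the difference the conversion makes, here is a minimal sketch on a made-up vector of ticket classes. The values are invented for illustration.

```r
# Made-up ticket classes stored as plain numbers.
pclass <- c(3, 1, 3, 2, 3, 1)

summary(pclass)           # treated as numeric: mean, median and so on

# Convert to a factor so that 1, 2 and 3 are treated as levels.
pclass_factor <- as.factor(pclass)

levels(pclass_factor)     # the distinct classes
summary(pclass_factor)    # counts per class instead of a mean
```

On the real data frame, the equivalent step would be titanic$Pclass <- as.factor(titanic$Pclass).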

3. Analysis & Visualisations

Data visualisation is the art of turning data into insights that can be easily interpreted.

In this tutorial, we’ll analyse the survival patterns and check for factors that affected the same.

Points to think about

Now that we have an understanding of the data set and the variables, we need to identify the variables of interest.

Domain knowledge and the correlation between variables help in choosing these variables.

To keep it simple, we have chosen only 3 such variables, namely Age, Gender (the Sex column) and Pclass.

What was the survival rate?

When talking about the Titanic data set, the first question that comes up is “How many people survived?”.

Let’s have a simple Bar Graph to demonstrate the same.

library(ggplot2)
ggplot(titanic, aes(x=Survived)) + geom_bar()

On the X-axis we have the Survived variable, 0 representing the passengers who did not survive and 1 representing the passengers who survived.

The Y-axis represents the number of passengers.

Here we see that around 550 passengers did not survive and ~340 passengers survived.

Let’s make it clearer by checking the percentages.

prop.table(table(titanic$Survived))

Only 38.38% of the passengers who boarded the Titanic survived.
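To see where this percentage comes from, the sketch below rebuilds a Survived column from the overall counts and runs the same two functions on it. The counts 549 and 342 are assumed here as the totals of the Kaggle train file; please check them against your own copy.

```r
# Assumed counts for the Kaggle train file: 549 did not survive, 342 survived.
survived <- rep(c(0, 1), times = c(549, 342))

counts <- table(survived)       # absolute counts per outcome
shares <- prop.table(counts)    # each count divided by the total

round(100 * shares, 2)          # shares expressed as percentages
```

table() does the counting and prop.table() divides each count by the total.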

Survival rate basis Gender

It is believed that in rescue operations during disasters, women’s safety is prioritised.

Did the same happen back then?

We see that the survival rate amongst women was significantly higher than amongst men.

The survival ratio amongst women was around 75%, whereas for men it was less than 20%.
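The article does not show the code behind this comparison, so the following is only one possible way to reproduce it. The per-gender counts are assumed to match the Kaggle train file (314 women of whom 233 survived, 577 men of whom 109 survived); the plot is drawn only if ggplot2 is installed.

```r
# Assumed counts from the Kaggle train file (check against your own copy).
sex      <- rep(c("female", "male"), times = c(314, 577))
survived <- c(rep(c(1, 0), times = c(233, 81)),    # women: survived, did not survive
              rep(c(1, 0), times = c(109, 468)))   # men: survived, did not survive
toy <- data.frame(Sex = sex, Survived = survived)

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rate_by_sex <- tapply(toy$Survived, toy$Sex, mean)
round(100 * rate_by_sex, 1)

# Stacked bar chart showing the share of survivors within each gender.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = Sex, fill = factor(Survived))) +
    geom_bar(position = "fill") +
    labs(y = "Share of passengers", fill = "Survived")
  print(p)
}
```

With the real data frame, the same calls work with titanic in place of toy.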

Survival Rate basis Class of tickets (Pclass)

There were 3 segments of passengers, depending upon the class they were travelling in, namely 1st class, 2nd class and 3rd class.

We see that over 50% of the passengers were travelling in the 3rd class.

Survival Rate basis Passenger Class

1st and 2nd class passengers disproportionately survived, with a survival rate of over 60% for 1st class passengers, around 45–50% for 2nd class, and less than 25% for those travelling in 3rd class.
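Both figures in this part, the share of passengers per class and the survival rate within each class, can be computed with table() and prop.table(). The sketch below assumes the per-class counts of the Kaggle train file (216, 184 and 491 passengers, with 136, 87 and 119 survivors); treat them as an assumption to verify.

```r
# Assumed counts from the Kaggle train file, per class.
pclass   <- rep(c(1, 2, 3), times = c(216, 184, 491))
survived <- c(rep(c(1, 0), times = c(136, 80)),    # 1st class
              rep(c(1, 0), times = c(87, 97)),     # 2nd class
              rep(c(1, 0), times = c(119, 372)))   # 3rd class

# Share of passengers travelling in each class.
class_share <- prop.table(table(pclass))

# margin = 1 gives row-wise proportions: within each class,
# the share that did not survive (0) and that survived (1).
survival_by_class <- prop.table(table(pclass, survived), margin = 1)

round(100 * class_share, 1)
round(100 * survival_by_class, 1)
```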

I’ll leave you with a thought: was it because of preferential treatment for the passengers travelling in the elite class, or because of proximity, as the 3rd class compartments were on the lower decks?

Survival Rate basis Class of tickets and Gender (Pclass)

We see that females in the 1st and 2nd class had a very high survival rate.

The survival rate for females travelling in 1st and 2nd class was 96% and 92% respectively, compared with 37% and 16% for men.

The survival rate for men travelling 3rd class was less than 15%.
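As a rough check of these percentages, the sketch below starts from group totals and survivor counts per gender and class. The counts are assumed to match the Kaggle train file and should be verified against your own copy.

```r
# Assumed counts from the Kaggle train file: passengers and survivors per group.
groups <- data.frame(
  Sex      = rep(c("female", "male"), each = 3),
  Pclass   = rep(1:3, times = 2),
  Total    = c(94, 76, 144, 122, 108, 347),
  Survived = c(91, 70, 72, 45, 17, 47)
)

# Survival rate per gender and class, in percent.
groups$Rate <- round(100 * groups$Survived / groups$Total, 1)
groups

# On the passenger-level data frame, a comparable table could be obtained with:
# aggregate(Survived ~ Sex + Pclass, data = titanic, FUN = mean)
```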

So far, it is evident that gender and passenger class had a significant impact on the survival rates.

Let’s now check the impact of passenger’s Age on Survival Rate.

Survival rates basis age

Looking at the age < 10 years section of the graph, we see that the survival rate is high.

Beyond the age of 45, the survival rate is low and keeps dropping.

Here we have used a bin width of 5; you may try out different values and see how the graph changes.
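The plotting code is not shown in the article, so here is one possible sketch of a histogram with a bin width of 5. The ages and outcomes are randomly generated stand-ins, not the real data, and the plot is drawn only if ggplot2 is installed.

```r
# Made-up ages and outcomes, purely to make the snippet runnable.
set.seed(1)
toy <- data.frame(
  Age      = round(runif(200, min = 1, max = 80)),
  Survived = rbinom(200, size = 1, prob = 0.4)
)

# Group ages into 5-year bins and compute the survival rate in each bin.
toy$AgeBin  <- cut(toy$Age, breaks = seq(0, 80, by = 5))
rate_by_bin <- tapply(toy$Survived, toy$AgeBin, mean)
round(rate_by_bin, 2)

# Histogram of age, coloured by outcome, with the same bin width.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = Age, fill = factor(Survived))) +
    geom_histogram(binwidth = 5) +
    labs(fill = "Survived")
  print(p)
}
```

Changing binwidth (and the by argument of seq()) to 2 or 10 shows how the picture shifts. In the real data, Age has missing values, which ggplot2 drops with a warning.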

Survival Rate basis Age, Gender and Class of tickets

This graph helps identify the survival patterns considering all three variables.

The top 3 sections depict the female survival patterns across the three classes, while the bottom 3 represent the male survival patterns across 3 classes.

On the x-axis we have the Age.

It is evident that the survival rate of children across 1st and 2nd class was the highest.

Except for 1 girl child, all children travelling 1st and 2nd class survived.

The survival rates were lowest for men travelling 3rd class.
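A layout with genders in rows and classes in columns can be sketched with facet_grid(). This is an assumption about how such a chart could be built, not the author’s own code, and the data below are randomly generated stand-ins. The plot is drawn only if ggplot2 is installed.

```r
# Made-up passengers: 20 per combination of gender and class.
set.seed(42)
toy <- data.frame(
  Sex      = rep(c("female", "male"), each = 60),
  Pclass   = rep(rep(1:3, each = 20), times = 2),
  Age      = round(runif(120, min = 1, max = 70)),
  Survived = rbinom(120, size = 1, prob = 0.4)
)

# One cell per panel: confirms every gender/class combination has data.
panel_counts <- with(toy, table(Sex, Pclass))
panel_counts

# Rows = gender (female on top), columns = class, x-axis = age.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = Age, fill = factor(Survived))) +
    geom_histogram(binwidth = 5) +
    facet_grid(Sex ~ Pclass) +
    labs(fill = "Survived")
  print(p)
}
```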

I hope you found this article helpful.

Keep learning, keep growing!