Exploratory Data Analysis: An Illustration in Python

The ability to build and employ a sophisticated machine learning model is great. But it must be noted that some parts of that process can be (and are already being) automated. I cannot make a case for which part of the process should receive the largest weight (think of the common 0.8 vs 0.2 rule). However, I will present a problem and provide a basic walk-through for tackling similar problems you might encounter in your journey.

A few years ago I worked at a company that maintained 200+ databases and provided time series data to a number of clients. I got to see first-hand the process of data collection and entry, and I quickly realized that it was far more complicated than I initially imagined. The process, even when done by a machine or a very meticulous human, is error-prone. As a result, consumers of these datasets (even after they undergo a quality control process) have in their possession data that potentially conveys inaccurate or wrong information. This can have damaging consequences for decision-makers and stakeholders.

It is, therefore, imperative that a data scientist “vet the data” before fitting any model to it. Here, I present a basic exploratory data analysis (EDA) that could be performed before engaging with the “fun” stuff.

Import the Toolkit

We begin by importing some Python packages. These will serve as your toolkit for an effective EDA:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

%config InlineBackend.figure_format = 'retina'
%matplotlib inline
```

In this example, we will use the Boston housing dataset (practice with it afterward and convince yourself).

Let’s load the data into our workspace and view the first five rows (remember that Python indexes starting with 0).

Load and Inspect the Data

```python
df = pd.read_csv(boston_file)  # boston_file points to the Boston housing dataset CSV
df.head()
```

First five rows of the Boston housing data.

Next, you want to inspect the structure and format of the data:

```python
df.isnull().sum()  # returns the number of missing values in each column
df.shape           # returns a tuple of (rows, columns)
df.dtypes
```

Addressing Anomalies

We immediately notice a few things from this inspection.

Firstly, the first column appears to be a repetition of the index.

In the absence of a data dictionary, we can safely assume this to be the case.

Let’s drop this column:

```python
df.drop(columns=['Unnamed: 0'])
```

Include inplace=True among the arguments of the drop() method to permanently apply the change to the data.
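For instance, a minimal sketch of the permanent version (assuming the stray column really is named 'Unnamed: 0', as it appears above):

```python
# Drop the redundant index column and modify df itself rather than returning a copy
df.drop(columns=['Unnamed: 0'], inplace=True)
```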

Secondly, the DIS and RAD variables are not in their correct formats. We know this from the table displayed above.

Let’s begin by addressing the most obvious: the DIS feature. The values in this column contain the string character ',', which causes Python to read the numeric data as a pandas object datatype. We can deal with this issue using the map function:

```python
df['DIS'] = df['DIS'].map(lambda dis_cell: dis_cell.replace(',', '.'))
df['DIS'] = df['DIS'].astype(float)
df.dtypes
```

Voila, problem solved! The RAD variable, on the other hand, requires a little more probing.

It is not immediately obvious what caused these values to be read in this way.

```python
df['RAD'].sort_values(ascending=False).head(8)
```

There, we have the culprit.

We can deal with this using a process similar to the one we used for the DIS case. Note that after we replace this string character, we will have some missing data. These values can either be dropped or imputed. This is one of the many times the data scientist will need to make a judgment call based on her domain knowledge, the problem she seeks to address, or her gut.
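As a minimal sketch of both options (a hypothetical illustration: it assumes the offending entries are strings that cannot be parsed as numbers, and uses pd.to_numeric with coercion instead of the map/replace approach we used for DIS):

```python
# Coerce RAD to numeric; any entry that cannot be parsed becomes NaN
df['RAD'] = pd.to_numeric(df['RAD'], errors='coerce')

# Option 1: drop the rows with missing RAD values
# df = df.dropna(subset=['RAD'])

# Option 2: impute the missing values, e.g. with the column median
df['RAD'] = df['RAD'].fillna(df['RAD'].median())
```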

Changing Column Names

Now imagine that you are working on a team whose members are tasked with presenting the findings to a non-technical audience, or who have no idea what each column represents. You may want to make the column names more descriptive. Here’s how:

```python
new_columns_names = {
    'CRIM': 'rate_of_crime',
    'ZN': 'residential_zone_pct',
    'INDUS': 'business_zone_pct',
    'CHAS': 'borders_river',
    'NOX': 'oxide_concentration',
    'RM': 'average_rooms',
    'AGE': 'owner_occup_pct',
    'DIS': 'dist_to_work',
    'RAD': 'access_to_highway',
    'TAX': 'property_tax',
    'PTRATIO': 'student_teacher_ratio',
    'LSTAT': 'pct_underclass',
    'MEDV': 'home_median_value'
}

df.rename(columns=new_columns_names, inplace=True)
df.head()
```

8 out of 13 columns displayed here.

Hang in there, we’re approaching the end of this basic walkthrough! Now we want to dive a little deeper by looking at a more succinct description of the data:

```python
df.describe().T  # returns the summary statistics as a transposed matrix
```

This simple method returns a nice summary of the data. We now have access to useful statistics for each numerical column, such as the mean, the median (50%), the min and max values (useful for observing outliers), the count (useful for spotting missing values), and so on.

Plotting the Data

Graphs excel where numbers struggle. Face it, numbers can be boring. Listening to a presenter who only tells a story with numbers is difficult, even for our quantitative friends. Instead of talking through statistical jargon, show them! A good place to start could be looking at the distribution of the data using a histogram:

```python
df.hist(figsize=(14, 14))
```

Finally, depending on the problem you seek to address, you may want to check for the correlation between variables.

This is particularly important when performing linear regressions.

```python
plt.figure(figsize=(12, 12))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
```

We have a (Pearson) correlation matrix, presented with a visually appealing heatmap. Again, don’t use numbers alone!

P.S. You may want to plot other kinds of graphs to examine relationships, spot outliers, and identify underlying distributions of your target and predictor variables.
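For example, here is a minimal sketch (using the renamed columns from above) of a box plot to spot outliers in the target and a scatter plot to examine its relationship with a predictor:

```python
# Box plot of the target variable to surface potential outliers
plt.figure(figsize=(8, 4))
sns.boxplot(x=df['home_median_value'])
plt.show()

# Scatter plot of average rooms against the target to examine their relationship
plt.figure(figsize=(8, 6))
sns.scatterplot(x='average_rooms', y='home_median_value', data=df)
plt.show()
```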

Concluding Remarks

As noted earlier, the EDA component, like the rest of a data scientist’s workflow, is not linear. It involves multiple iterations; my experience has shown me it is a cyclical process. This demonstration is a simple EDA that can be implemented on most datasets; it certainly is not exhaustive, but it is useful. Get your hands dirty with the data! Assume that no dataset is clean until you have combed through it. This part of the process is indispensable if your goal is to produce high-performing models. As George Fuechsel said, “garbage in, garbage out”.
