Pandas, Plotting, Pizzazz, Brazil?

That’ll soon change.

We’ll take you from a passive panda perceiver to a full-on Pandas professional, in just a few short minutes.

This article assumes some basic knowledge of Python; a general idea of what the Pandas library is will be helpful, but isn’t necessary.

I’ll go over some basic, introductory concepts for a general overview, but the focus of this article will be on applying Matplotlib, pyplot, Seaborn, and Pandas to exploratory data analysis of a messy(ish) dataset.

As we go through, I’ll suggest explorations to try on your own for practice.

Some of the questions I’ll answer here:

- What is Pandas?
- When is Pandas used?
- How do you clean a dataset using Pandas?
- How can visualizations aid in exploratory data analysis (EDA)?
- What does exploratory data analysis look like?

What is Pandas?

Pandas is one of the premier packages for managing and cleaning data in the Python data science space.

It allows for the neat containerization of data into Pandas objects called dataframes, is compatible with the widely used Python computing and data manipulation package NumPy, and can be combined with a ton of other common data tools, such as SQL databases.

Pandas also contains many functions for cleaning, manipulating, indexing, and performing operations on data that are optimized to run significantly faster than standard Python loops, which especially comes in handy with the very large datasets we can sometimes encounter.
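To make that concrete, here’s a tiny sketch (made-up numbers, not from any real dataset) of how a single vectorized expression replaces an explicit Python loop:

```python
import pandas as pd

# toy data: two numeric columns
df = pd.DataFrame({"pop": [1000, 2500, 400], "cars": [120, 800, 30]})

# one vectorized expression computes the ratio for every row at once,
# instead of looping over rows in Python
df["cars_per_capita"] = df["cars"] / df["pop"]
print(df["cars_per_capita"].tolist())  # [0.12, 0.32, 0.075]
```

Under the hood this dispatches to compiled NumPy code, which is where the speedup comes from.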

In its most basic form, a Pandas dataframe is a two-dimensional organization of data with rows and columns.

Though rows and columns are referred to in many ways across many contexts throughout the data science world, I’ll try to stick to either rows and columns or datapoints and features.

Dataframe columns can be individually selected as a Series, the other common Pandas datatype.

The Pandas series and dataframe are the foundation of the Pandas library, and the documentation does a good job explaining these datatypes, so you can read more here if you’re unfamiliar.
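As a quick illustration, here’s a toy dataframe (made-up, just for demonstration) showing the dataframe/series relationship:

```python
import pandas as pd

# a small two-column dataframe: each row is a datapoint, each column a feature
df = pd.DataFrame({
    "CITY": ["Sao Paulo", "Rio de Janeiro", "Recife"],
    "CAPITAL": [1, 1, 1],
})

print(type(df))          # <class 'pandas.core.frame.DataFrame'>
print(type(df["CITY"]))  # selecting one column gives <class 'pandas.core.series.Series'>
print(df.shape)          # (3, 2)
```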

When is Pandas Used?

This is an easy one.

Are you working with data in Python? Use Pandas.

There are cases where a dataset is simply too large for a local runtime and additional strategies must be employed, though past a certain point (limited either by your patience or your machine) you’ll need to switch to a tool designed for larger datasets regardless.
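One such strategy is streaming a CSV in chunks rather than loading it all at once; the in-memory "file" and the aggregation below are made up purely for illustration:

```python
import io
import pandas as pd

# a small in-memory CSV standing in for a file too big to load at once
csv_data = io.StringIO("\n".join(["x"] + [str(i) for i in range(10)]))

total = 0
# chunksize yields dataframes of 4 rows at a time instead of one big frame
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["x"].sum()  # aggregate per chunk, keep only the running total

print(total)  # 45
```

Each chunk is a normal dataframe, so everything else in this article applies to it unchanged.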

Pandas is used when working with CSVs, data scraped from the web, datasets from Kaggle or other sources, or pretty much any other time you have data which takes the form of datapoints with multiple features.

How do you clean a dataset using Pandas?

The short answer is: it depends.

There are myriad strategies that can be employed, and you’ll often have to look up examples specific to your situation, such as dealing with categorical variables, strings, etc.

That’s not a very useful answer though.

In an attempt to write an actually helpful article, I’ll highly recommend visiting Kaggle and simply noting the strategies experts and masters make use of, and the situations in which they’re used, building a data science cheatsheet of sorts where you can note how to do basic tasks.

I’ll link a rough one I made a while back here.

A helpful process can actually be writing down end goals you want, attempting to figure out a few substeps to get there, and then searching Stack Overflow or Pandas documentation for the implementation.

Data cleaning is much more about understanding the mindset of how to manipulate data into useful forms than it is about memorizing an algorithm.
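For example, here’s what that goal-to-substeps mindset might look like on a toy column (hypothetical data and goal, not from the dataset below): the goal is a numeric, gap-free AGE column; the substeps are coercing to numbers, then filling the gaps.

```python
import pandas as pd

# messy column: mixed strings and missing values
df = pd.DataFrame({"AGE": ["31", "na", "45", None]})

# substep 1: coerce to numeric, turning anything unparseable into NaN
df["AGE"] = pd.to_numeric(df["AGE"], errors="coerce")

# substep 2: fill the gaps with the column median
df["AGE"] = df["AGE"].fillna(df["AGE"].median())

print(df["AGE"].tolist())  # [31.0, 38.0, 45.0, 38.0]
```

Each substep maps to a single documented Pandas call, which is exactly what a Stack Overflow or documentation search will turn up.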

What Does EDA Look Like?

I’ll go through an example here, and post the Kaggle Kernel so you can take an even closer look, directly at the code.

I highly recommend following along on your own, or in the notebook, looking up documentation as you go along.

Brazilian Cities: A collection of 79 attributes from Brazilian cities (www.kaggle.com)

Brazil Data Exploration: Using data from multiple data sources (www.kaggle.com)

This is a fairly large dataset with a TON of features.

Let’s see if we can employ Pandas and some creative visualizations to clean this up.

As a disclaimer, Seaborn is built on top of Matplotlib and its pyplot interface, hence their mention in the intro, but I prefer Seaborn’s functionality and style, so you may not see pyplot or Matplotlib explicitly here.

```python
# import pandas and numpy, as well as seaborn for visualizations
import numpy as np
import pandas as pd
import seaborn as sns

# import the os package for Kaggle
import os
print(os.listdir("../input"))

# read in the data and create a dataframe
df = pd.read_csv("../input/brazilian-cities/BRAZIL_CITIES.csv",
                 sep=";", decimal=",")

# view a random sample of the dataframe
df.sample(25)
```

Good stuff! Let’s also look at the shape to get a sense of the magnitude of the set we’re looking at.

```python
# get dataframe shape
df.shape
```

Out: (5576, 81)

Whenever we do any type of complicated work, it can often be helpful to have a deep copy of our dataframe on hand so that we can more easily revert back if and when we mess up our primary dataframe.

```python
# create a deep copy of our dataframe
df_copy = df.copy(True)
```

Now that we’ve done this, let’s get to work! This dataset has a fairly large number of features, which might make it hard to explore.

Since we’re not looking to create a model, we don’t have to perform feature selection, so we can simply select a subset of columns we’d like to explore.

Feel free to make a different list than my own!

```python
columns = ['CITY', 'STATE', 'CAPITAL', 'IBGE_RES_POP', 'IBGE_RES_POP_BRAS',
           'IBGE_RES_POP_ESTR', 'IBGE_DU', 'IBGE_DU_URBAN', 'IBGE_DU_RURAL',
           'IBGE_POP', 'IBGE_1', 'IBGE_1-4', 'IBGE_5-9', 'IBGE_10-14',
           'IBGE_15-59', 'IBGE_60+', 'IBGE_PLANTED_AREA', 'IDHM',
           'LONG', 'LAT', 'ALT', 'ESTIMATED_POP', 'GDP_CAPITA',
           'Cars', 'Motorcycles', 'UBER', 'MAC', 'WAL-MART', 'BEDS']

# create reduced dataframe and check shape
r_df = df[columns]
r_df.shape
```

Out: (5576, 29)

Awesome! Much more manageable now.

A really helpful tool for initial exploration is the Seaborn pairplot function.

It graphs every variable against every other variable in one method call! Let’s see if it can help us here.

```python
# create a seaborn pairplot
pp = sns.pairplot(r_df)
```

[Figure: Seaborn pairplot of the data]

Wow.

Umm, okay.

That’s huge, a little overwhelming, and not particularly helpful.

Imagine if we had left all the features in! A good next step, especially when the pairplot is unhelpful, is trying a correlation matrix just to see if there’s any linear relationship among the variables.

```python
import matplotlib.pyplot as plt  # needed for plt.subplots below

corr = r_df.corr()

# I prefer one-sided matrices, so create a mask
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# set up figure
f, ax = plt.subplots(figsize=(15, 15))
cmap = sns.diverging_palette(220, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
```

[Figure: Correlation matrix of all features]

The UBER column is completely blank; let’s look directly at the data to see what’s going on.

```python
r_df.UBER
```

Out:
0       1.0
1       NaN
2       1.0
3       1.0
4       1.0
        ...
5571    NaN
5572    NaN
5573    NaN
5574    NaN
5575    NaN

Lots of NaNs and no zeros: a possible error in the set.

```python
# there are a lot of nans, possibly in place of zeros, let's check
df_copy.UBER.value_counts()
```

Out:
1.0    125

Yeah, no zeros in the whole column and only 125 values out of over 5,000. Let’s replace the NaNs with zeros and try again.

```python
# assign back rather than using inplace, since r_df is a slice of df
r_df['UBER'] = r_df['UBER'].replace({np.nan: 0})
r_df['UBER'].value_counts()
```

Out:
0.0    5451
1.0     125

Success! Let’s try the correlation matrix again.

Running the same code, we get:

[Figure: Corrected correlation matrix]

Now let’s start looking at the relationships!

```python
# let's investigate the strongest correlation first
sns.set_style('dark')
sns.scatterplot(x=r_df.IDHM, y=r_df.LAT)
```

[Figure: Relationship of latitude and IDHM (Human Development Index)]

This shows what we expected from the correlation matrix, but it doesn’t supply much meaning to most people: without an idea of the latitudes of Brazil or a better idea of what IDHM is, they’re kinda just meaningless points.

Let’s contextualize within the geography of the country using latitude and longitude.

```python
# map of lat and long with IDHM determining size
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM)
```

[Figure: Attempt #2 at contextualizing IDHM trends]

Here we can see a rough outline of the nation of Brazil.

Overlaying this over a map may be even more helpful, but we’re going to skip that here.

We can see a bit of a trend, but with so many points it’s hard to distinguish the sizes.

Let’s try adding a color encode.

```python
# it's hard to see any trends here, let's add color to get a better idea
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM, hue=r_df.IDHM)
```

[Figure: Attempt #3 at contextualizing IDHM trends, much better!]

Fantastic! Here we can clearly see a trend of higher IDHM towards the center and south of the country.

In your own exploration, maybe you can find characteristics about these parts of the country which may cause this.

Let’s see if we can identify the state capital cities in this plot.

```python
# let's see if we can spot any capitals in there
f, ax = plt.subplots(figsize=(8, 8))
markers = {0: 'o', 1: 's'}
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM, hue=r_df.IDHM,
                style=r_df.CAPITAL, markers=markers)
```

[Figure: Attempt #4 at contextualizing IDHM trends, with markers added for capital cities]

Here we’re facing the same problem as in attempt #2.

We can’t see the additionally encoded data! Let’s try an overlay and see if that’s clearer.

```python
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT, size=r_df.IDHM, hue=r_df.IDHM)
sns.scatterplot(x=r_df[r_df.CAPITAL == 1].LONG,
                y=r_df[r_df.CAPITAL == 1].LAT, s=100)
```

[Figure: Attempt #5 at contextualizing IDHM trends, with an overlay added for capital cities]

Great.

Now we can see the capital cities as well, and we can see that they very strongly trend towards the east of the country, towards the coast.

Not surprising, yet very interesting to see all the same.

Let’s see if we can give GDP per capita a similar treatment, kind of reusing our code.

```python
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df.LONG, y=r_df.LAT,
                size=r_df.GDP_CAPITA, hue=r_df.GDP_CAPITA)
```

[Figure: GDP per capita encoded with latitude and longitude]

Hmm.

It looks like there aren’t enough color bins to show all the trends in GDP here, likely meaning the data has a large spread and/or skew.

Let’s investigate.

```python
# let's take a look at the distribution, after taking care of nans
f, ax = plt.subplots(figsize=(12, 8))
gdp = r_df.GDP_CAPITA.dropna()
sns.distplot(gdp)
```

[Figure: Distribution of GDP per capita]

```python
# it looks like gdp is heavily right skewed with a massive tail.
# it seems likely that those massive outliers are errors,
# and could be removed in some cases
gdp.describe()
```

Out:
count      5573.000000
mean      21129.767244
std       20327.836119
min        3190.570000
25%        9061.720000
50%       15879.960000
75%       26156.990000
max      314637.690000

A huge tail, and a significant skew.

This will make the data much more difficult to encode with color.

If we wanted to, we could remove outliers or try a different scale on the data, potentially log, but we’ll skip that for now and investigate one last variable.
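Before moving on, here’s a quick sketch of that log-scale idea on synthetic right-skewed data (generated here, not the actual GDP column):

```python
import numpy as np

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=10, sigma=1, size=5000)  # long right tail

logged = np.log(skewed)  # the log-transform compresses the tail

# relative spread (std / mean) shrinks dramatically after the transform
print(skewed.std() / skewed.mean())  # large
print(logged.std() / logged.mean())  # much smaller
```

With the tail compressed like this, a color encode has a much easier time covering the full range of values.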

Uber! Let’s see how they’re doing.

```python
f, ax = plt.subplots(figsize=(12, 8))
sns.countplot(r_df['UBER'])
```

[Figure: Countplot of Uber in Brazilian cities]

We can see here that the vast majority of Brazilian cities don’t have Uber.

Let’s see where they are.

```python
f, ax = plt.subplots(figsize=(8, 8))
sns.scatterplot(x=r_df[r_df.UBER == 0].LONG, y=r_df[r_df.UBER == 0].LAT)
sns.scatterplot(x=r_df[r_df.UBER == 1].LONG, y=r_df[r_df.UBER == 1].LAT)
```

[Figure: Distribution of cities, with the presence of Uber encoded in orange]

We can see similar trends to IDHM here, with clustering by the coast and the south of the country.

Try checking out IDHM and Uber on your own! Let’s look at Uber’s relationship with cars.

```python
f, ax = plt.subplots(figsize=(16, 12))
sns.boxplot(y=r_df['Cars'], x=r_df['UBER'])
```

[Figure: Box plot of the distribution of the number of cars in Brazilian cities, separated by the presence of Uber]

Oof.

That’s not pretty or useful.

Let’s remove those giant outliers and get a better look.

```python
ubers = r_df[r_df.Cars < 100000].UBER
car_vals = r_df[r_df.Cars < 100000].Cars
sns.boxplot(ubers, car_vals)
```

[Figure: Box plot of the distribution of the number of cars (Cars < 100,000) in Brazilian cities, separated by the presence of Uber]

This is a really interesting distribution.

The minimum bound for cities with Uber is above the maximum bound for cities without.

Try investigating other variables related to cars on your own to see why this is.

My guess would be something to do with population or GDP.

Also, the presence of outliers in the non-Uber cities suggests a large skew/tail.

Let’s take a look.

```python
f, ax = plt.subplots(figsize=(16, 12))
sns.distplot(r_df[(r_df.Cars < 100000) & (r_df.UBER == 0)].Cars)
sns.distplot(r_df[(r_df.Cars < 100000) & (r_df.UBER == 1)].Cars, bins=20)
```

As expected! The majority of cities either have no cars or a very small number.

Good work!

What we learned:

- What is Pandas?
- When is Pandas used?
- How do you clean a dataset using Pandas?
- How can visualizations aid in exploratory data analysis (EDA)?
- What does exploratory data analysis look like?
