Practical Statistics & Visualization With Python & Plotly

How to use Python and Plotly for statistical visualization, inference, and modeling

Susan Li, May 14

One day last week, I was googling “statistics with Python”, and the results were somewhat unfruitful.

Most literature, tutorials and articles focus on statistics with R, because R is a language dedicated to statistics and has more statistical analysis features than Python.

In two excellent statistics books, “Practical Statistics for Data Scientists” and “An Introduction to Statistical Learning”, the statistical concepts were all implemented in R.

Data science is a fusion of multiple disciplines, including statistics, computer science, information technology, and domain-specific fields.

And we use powerful, open-source Python tools daily to manipulate, analyze, and visualize datasets.

And I would certainly recommend that anyone interested in becoming a Data Scientist or Machine Learning Engineer develop a deep understanding of statistical learning theory and practice it constantly.

This prompted me to write a post on the subject.

I will use one dataset to review as many statistical concepts as I can. Let's get started!

The Data

The data is the house prices data set that can be found here.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from plotly.offline import init_notebook_mode, iplot
    import plotly.figure_factory as ff
    import cufflinks
    cufflinks.go_offline()
    cufflinks.set_config_file(world_readable=True, theme='pearl')
    import plotly.graph_objs as go
    # Note: plotly.plotly and set_credentials_file are plotly 3.x APIs;
    # from plotly 4 on they live in the separate chart_studio package.
    import plotly.plotly as py
    import plotly
    from plotly import tools
    plotly.tools.set_credentials_file(username='XXX', api_key='XXX')
    init_notebook_mode(connected=True)
    pd.set_option('display.max_columns', 100)

    df = pd.read_csv('house_train.csv')
    df.drop('Id', axis=1, inplace=True)
    df.head()

Univariate Data Analysis

Univariate analysis is perhaps the simplest form of statistical analysis, and the key fact is that only one variable is involved.

Describing Data

A statistical summary of numeric data includes things like the mean, min, and max. It is useful for getting a feel for how large some of the variables are and which variables may be the most important.

    df.describe().T

Table 2

The statistical summary for categorical or string variables shows “count”, “unique”, “top”, and “freq”.

    table_cat = ff.create_table(df.describe(include=['O']).T, index=True, index_title='Categorical columns')
    iplot(table_cat)

Table 3

Histogram

Plot a histogram of SalePrice of all the houses in the data.

    df['SalePrice'].iplot(
        kind='hist',
        bins=100,
        xTitle='price',
        linecolor='black',
        yTitle='count',
        title='Histogram of Sale Price')

Figure 1

Boxplot

Plot a boxplot of SalePrice of all the houses in the data.

Boxplots do not show the shape of the distribution, but they can give us a better idea about the center and spread of the distribution as well as any potential outliers that may exist.

Boxplots and Histograms often complement each other and help us understand more about the data.

    df['SalePrice'].iplot(kind='box', title='Box plot of SalePrice')

Figure 2

Histograms and Boxplots by Groups

Plotting by groups, we can see how a variable changes in response to another. For example, we can check whether SalePrice differs between houses with and without central air conditioning, or whether it varies with the size of the garage.

Boxplot and histogram of house sale price, grouped by presence of air conditioning

boxplot.aircon.py

Figure 3

histogram_aircon.py

Figure 4

    df.groupby('CentralAir')['SalePrice'].describe()

Table 4

The mean and median sale prices of houses with no air conditioning are clearly much lower than those of houses with air conditioning.
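The comparison above rests on a per-group summary. As a minimal sketch, assuming a small made-up table (the values below are hypothetical and not taken from house_train.csv), the same kind of summary can be reduced to just the statistics being compared:

```python
import pandas as pd

# Hypothetical toy data, not taken from house_train.csv.
toy = pd.DataFrame({
    "CentralAir": ["Y", "Y", "Y", "Y", "N", "N", "N"],
    "SalePrice": [180000, 210000, 165000, 320000, 98000, 120000, 105000],
})

# One row per group, with only the statistics we want to compare side by side.
summary = toy.groupby("CentralAir")["SalePrice"].agg(["count", "mean", "median"])

# Difference in medians between houses with and without air conditioning.
median_gap = summary.loc["Y", "median"] - summary.loc["N", "median"]

print(summary)
print("Median gap (Y - N):", median_gap)
```

Keeping the group count next to the mean and median helps judge how much weight a difference deserves when one group is small.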

Boxplot and histogram of house sale price grouped by garage size

boxplot_garage.py

Figure 5

The larger the garage, the higher the median house price, but only up to 3-car garages. Houses with 3-car garages have the highest median price, higher even than houses with 4-car garages.

Histogram of house sale price with no garage

    df.loc[df['GarageCars'] == 0, 'SalePrice'].iplot(
        kind='hist', bins=50, xTitle='price', linecolor='black', yTitle='count',
        title='Histogram of Sale Price of houses with no garage')

Figure 6

Histogram of house sale price with 1-car garage

    df.loc[df['GarageCars'] == 1, 'SalePrice'].iplot(
        kind='hist', bins=50, xTitle='price', linecolor='black', yTitle='count',
        title='Histogram of Sale Price of houses with 1-car garage')

Figure 7

Histogram of house sale price with 2-car garage

    df.loc[df['GarageCars'] == 2, 'SalePrice'].iplot(
        kind='hist', bins=100, xTitle='price', linecolor='black', yTitle='count',
        title='Histogram of Sale Price of houses with 2-car garage')

Figure 8

Histogram of house sale price with 3-car garage

    df.loc[df['GarageCars'] == 3, 'SalePrice'].iplot(
        kind='hist', bins=50, xTitle='price', linecolor='black', yTitle='count',
        title='Histogram of Sale Price of houses with 3-car garage')

Figure 9

Histogram of house sale price with 4-car garage

    df.loc[df['GarageCars'] == 4, 'SalePrice'].iplot(
        kind='hist', bins=10, xTitle='price', linecolor='black', yTitle='count',
        title='Histogram of Sale Price of houses with 4-car garage')

Figure 10

Frequency Table

Frequency tells us how often something happened.

Frequency tables give us a snapshot of the data to allow us to find patterns.
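As a small sketch of how such a table can be built, pandas can return relative frequencies directly through `value_counts(normalize=True)`, which is equivalent to dividing the counts by their sum. The garage sizes below are made-up values used only for illustration:

```python
import pandas as pd

# Hypothetical garage sizes for ten houses (toy values).
garage_cars = pd.Series([2, 2, 1, 0, 2, 3, 1, 2, 2, 1], name="GarageCars")

counts = garage_cars.value_counts()                     # absolute frequencies
proportions = garage_cars.value_counts(normalize=True)  # relative frequencies

# Sorting by category makes an ordinal variable easier to read.
freq_table = pd.DataFrame({"count": counts, "proportion": proportions}).sort_index()

print(freq_table)
```

Showing counts and proportions side by side makes it easy to see when a large proportion is based on only a handful of rows.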

Overall quality frequency table

    x = df.OverallQual.value_counts()
    x/x.sum()

Table 5

Garage size frequency table

    x = df.GarageCars.value_counts()
    x/x.sum()

Table 6

Central air conditioning frequency table

    x = df.CentralAir.value_counts()
    x/x.sum()

Table 7

Numerical Summaries

A quick way to get a set of numerical summaries for a quantitative variable is to use the describe method.

    df.SalePrice.describe()

Table 8

We can also calculate individual summary statistics of SalePrice.

    print("The mean of sale price - Pandas method: ", df.SalePrice.mean())
    print("The mean of sale price - Numpy function: ", np.mean(df.SalePrice))
    print("The median sale price: ", df.SalePrice.median())
    print("50th percentile, same as the median: ", np.percentile(df.SalePrice, 50))
    print("75th percentile: ", np.percentile(df.SalePrice, 75))
    print("Pandas method for quantiles, equivalent to 75th percentile: ", df.SalePrice.quantile(0.75))

Calculate the proportion of the houses with sale price between the 25th percentile (129975) and the 75th percentile (214000).

    print('The proportion of the houses with prices between 25th percentile and 75th percentile: ',
          np.mean((df.SalePrice >= 129975) & (df.SalePrice <= 214000)))

Calculate the proportion of the houses with total square feet of basement area between the 25th percentile (795.75) and the 75th percentile (1298.25).

    print('The proportion of houses with total square feet of basement area between 25th percentile and 75th percentile: ',
          np.mean((df.TotalBsmtSF >= 795.75) & (df.TotalBsmtSF <= 1298.25)))

Lastly, we calculate the proportion of houses that meet either condition. Since some houses meet both criteria, the proportion below is less than the sum of the two proportions calculated above.
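The reasoning here is the inclusion-exclusion rule, P(A or B) = P(A) + P(B) - P(A and B). A minimal sketch on made-up numbers (the arrays and thresholds below are hypothetical, chosen only so the arithmetic is easy to follow):

```python
import numpy as np

# Hypothetical values for eight houses (toy data).
price = np.array([100, 140, 150, 200, 210, 250, 130, 300])
bsmt = np.array([700, 800, 1300, 900, 1000, 1200, 600, 1100])

a = (price >= 130) & (price <= 214)  # condition A: price in its middle band
b = (bsmt >= 796) & (bsmt <= 1298)   # condition B: basement area in its middle band

p_a, p_b = a.mean(), b.mean()        # mean of a boolean array is a proportion
p_both = (a & b).mean()
p_either = (a | b).mean()

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
assert np.isclose(p_either, p_a + p_b - p_both)

print(p_a, p_b, p_both, p_either)
```

The gap between `p_a + p_b` and `p_either` is exactly the share of houses that satisfy both conditions.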

    a = (df.SalePrice >= 129975) & (df.SalePrice <= 214000)
    b = (df.TotalBsmtSF >= 795.75) & (df.TotalBsmtSF <= 1298.25)
    print(np.mean(a | b))

Calculate sale price IQR for houses with no air conditioning.

    q75, q25 = np.percentile(df.loc[df['CentralAir'] == 'N', 'SalePrice'], [75, 25])
    iqr = q75 - q25
    print('Sale price IQR for houses with no air conditioning: ', iqr)

Calculate sale price IQR for houses with air conditioning.

    q75, q25 = np.percentile(df.loc[df['CentralAir'] == 'Y', 'SalePrice'], [75, 25])
    iqr = q75 - q25
    print('Sale price IQR for houses with air conditioning: ', iqr)

Stratification

Another way to get more information out of a dataset is to divide it into smaller, more uniform subsets, and analyze each of these “strata” on its own.

We will create a new HouseAge column, then partition the data into HouseAge strata, and construct side-by-side boxplots of the sale price within each stratum.

    df['HouseAge'] = 2019 - df['YearBuilt']
    # Create age strata based on these cut points
    df["AgeGrp"] = pd.cut(df.HouseAge, [9, 20, 40, 60, 80, 100, 147])
    plt.figure(figsize=(12, 5))
    sns.boxplot(x="AgeGrp", y="SalePrice", data=df);

Figure 11

The older the house, the lower the median price; that is, house price tends to decrease with age, until the house reaches 100 years old. The median price of houses over 100 years old is higher than the median price of houses aged between 80 and 100 years.
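One detail of the age strata is worth checking: by default `pd.cut` builds intervals that are open on the left and closed on the right, so a value exactly equal to the lowest cut point is left unbinned. The sketch below uses made-up ages (not the real HouseAge column) to show the difference that `include_lowest=True` makes. Whether this matters for the actual data depends on whether any house has an age of exactly 9.

```python
import pandas as pd

# Hypothetical house ages in years (toy values).
ages = pd.Series([9, 10, 20, 21, 100, 147, 150])
cut_points = [9, 20, 40, 60, 80, 100, 147]

# Default bins are open on the left, closed on the right: (9, 20], (20, 40], ...
default_bins = pd.cut(ages, cut_points)

# include_lowest=True makes the first interval also contain its left edge.
inclusive_bins = pd.cut(ages, cut_points, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": default_bins, "inclusive": inclusive_bins}))
```

Values outside the outermost cut points (150 in this toy example) are unbinned in both variants.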

    plt.figure(figsize=(12, 5))
    sns.boxplot(x="AgeGrp", y="SalePrice", hue="CentralAir", data=df)
    plt.show()

Figure 12

We learned earlier that house prices tend to differ between houses with and without air conditioning. From the graph above, we also find that recent houses (9–40 years old) are all equipped with air conditioning.

    plt.figure(figsize=(12, 5))
    sns.boxplot(x="CentralAir", y="SalePrice", hue="AgeGrp", data=df)
    plt.show()

Figure 13

We now group first by air conditioning, and then, within each air conditioning group, by age band.

Each approach highlights a different aspect of the data.

We can also stratify jointly by house age and air conditioning to explore how building type varies by both of these factors simultaneously.

    df1 = df.groupby(["AgeGrp", "CentralAir"])["BldgType"]
    df1 = df1.value_counts()
    df1 = df1.unstack()
    df1 = df1.apply(lambda x: x/x.sum(), axis=1)
    print(df1.to_string(float_format="%.3f"))

Table 9

For all house age groups, the vast majority of dwellings in the data are of type 1Fam.

The older the house, the more likely it is to have no air conditioning.

However, a 1Fam house over 100 years old is a little more likely to have air conditioning than not.

There were neither very new nor very old duplex houses.

A 40–60 year old duplex house is more likely to have no air conditioning.
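The row-wise division used for the stratified table can also be expressed with `value_counts(normalize=True)` on the grouped column. This is a sketch on a made-up frame: the column names mirror the ones above, but the rows and the "new"/"old" labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the post.
toy = pd.DataFrame({
    "AgeGrp": ["new", "new", "new", "old", "old", "old", "old"],
    "CentralAir": ["Y", "Y", "Y", "Y", "N", "N", "N"],
    "BldgType": ["1Fam", "1Fam", "Duplex", "1Fam", "1Fam", "Duplex", "Duplex"],
})

# normalize=True returns within-group proportions directly,
# so no separate row-wise division is needed.
props = (
    toy.groupby(["AgeGrp", "CentralAir"])["BldgType"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

print(props.round(3))
```

Each row of `props` sums to 1, so it reads as the building-type mix within one combination of age group and air conditioning.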

Multivariate Analysis

Multivariate analysis is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time.

Scatter plot

A scatter plot is a very common and easily understood visualization of quantitative bivariate data. Below we make a scatter plot of sale price against above ground living area in square feet. The relationship appears to be linear.

    df.iplot(
        x='GrLivArea',
        y='SalePrice',
        xTitle='Above ground living area square feet',
        yTitle='Sale price',
        mode='markers',
        title='Sale Price vs Above ground living area square feet')

Figure 14

2D Density Joint plot

In the following plot, the two margins show the densities of sale price and above ground living area separately, while the center shows their joint density.

price_GrLivArea.py

Figure 15

Heterogeneity and stratification

We continue exploring the relationship between SalePrice and GrLivArea, stratifying by BldgType.

stratify.py

Figure 16

In almost all the building types, SalePrice and GrLivArea show a positive linear relationship. In the results below, we see that the correlation between SalePrice and GrLivArea is highest for the 1Fam building type at 0.74, and lowest for the Duplex building type at 0.49.

    print(df.loc[df.BldgType=="1Fam", ["GrLivArea", "SalePrice"]].corr())
    print(df.loc[df.BldgType=="TwnhsE", ["GrLivArea", "SalePrice"]].corr())
    print(df.loc[df.BldgType=="Duplex", ["GrLivArea", "SalePrice"]].corr())
    print(df.loc[df.BldgType=="Twnhs", ["GrLivArea", "SalePrice"]].corr())
    print(df.loc[df.BldgType=="2fmCon", ["GrLivArea", "SalePrice"]].corr())

Table 10

Categorical bivariate analysis

We create a contingency table, counting the number of houses in each cell defined by a combination of building type and the general zoning classification.

    x = pd.crosstab(df.MSZoning, df.BldgType)
    x

Table 11

Below we normalize within rows. This gives us the proportion of houses in each zoning classification that fall into each building type.

    x.apply(lambda z: z/z.sum(), axis=1)

Table 12

We can also normalize within the columns. This gives us the proportion of houses within each building type that fall into each zoning classification.

    x.apply(lambda z: z/z.sum(), axis=0)

Table 13

One step further, we will look at the proportion of houses in each zoning class, for each combination of the air conditioning and building type variables.

    df.groupby(["CentralAir", "BldgType", "MSZoning"]).size().unstack().fillna(0).apply(lambda x: x/x.sum(), axis=1)

Table 14

The highest proportion of houses in the data are those with zoning RL, air conditioning, and the 1Fam building type.

Among houses with no air conditioning, the highest proportion are those in zoning RL with the Duplex building type.
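The row and column normalizations used for the contingency tables above can also be requested from `pd.crosstab` itself through its `normalize` argument. This is a sketch on a small made-up frame (the rows are hypothetical, not the real zoning and building-type counts):

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the post.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RL", "RM", "RM", "RL"],
    "BldgType": ["1Fam", "1Fam", "Duplex", "1Fam", "Duplex", "1Fam"],
})

counts = pd.crosstab(toy.MSZoning, toy.BldgType)

# normalize="index": each row sums to 1 (building-type mix within a zoning class).
by_row = pd.crosstab(toy.MSZoning, toy.BldgType, normalize="index")

# normalize="columns": each column sums to 1 (zoning mix within a building type).
by_col = pd.crosstab(toy.MSZoning, toy.BldgType, normalize="columns")

print(counts, by_row, by_col, sep="\n\n")
```

Which direction to normalize in depends on the question: row proportions answer "given the zoning class, what kinds of buildings are there?", while column proportions answer the reverse.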

Mixed categorical and quantitative data

To get fancier, we are going to plot a violin plot to show the distribution of SalePrice for houses in each building type category.

price_violin_plot.py

Figure 17

We can see that the SalePrice distribution of the 1Fam building type is slightly right-skewed, while the SalePrice distributions of the other building types are nearly normal.
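A visual impression of skewness can be backed up with a number. As a rough sketch, assuming two made-up samples (the values are hypothetical, not the real SalePrice data), `Series.skew()` gives the sample skewness:

```python
import pandas as pd

# Hypothetical prices (toy values): one long right tail, one symmetric sample.
right_tailed = pd.Series([100, 110, 120, 130, 140, 150, 400])
symmetric = pd.Series([100, 110, 120, 130, 140, 150, 160])

# Series.skew() returns the sample skewness: values above 0 suggest a longer
# right tail, values near 0 suggest rough symmetry.
skew_right = right_tailed.skew()
skew_sym = symmetric.skew()

# A mean above the median points the same way.
mean_above_median = right_tailed.mean() > right_tailed.median()

print("right-tailed:", skew_right)
print("symmetric:", skew_sym)
print("mean > median:", mean_above_median)
```

Applied per building type, for example after a `groupby`, the same measure would give a numeric counterpart to what the violin plot shows.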

The Jupyter notebook for this post can be found on GitHub, and there is an nbviewer version as well.

Reference:

Statistics with Python | Coursera