Automated Data Profiling Using Python

Generating an MS Word document with data profiling outputs using Python.

Anandakumar Varatharajah · Jun 10

This blog is about automating the data profiling stage of the Exploratory Data Analysis (EDA) process.

We will automate the data profiling process using Python and produce a Microsoft Word document as the output with the results of data profiling.

The key advantage of producing an MS Word document with the data profiling results is that it can be used to capture the discussions and decisions with the domain experts regarding data quality, data transformations and feature engineering for further modelling and visualisations.

[Image: a sample page for numeric column data profiling]

The advantage of the Python code is that it is kept generic, so a user who wants to modify it can easily add further functionality or change the existing functionality.

For example, changing the types of graphs produced for the numeric column data profile, or loading the data from an Excel file.

You can download the complete code from my GitHub site AnalyticsInsightsNinja.

Download the following files to a folder to execute the Python code: 01_16_av_automated_data_profiling_MS_WORD.ipynb and SAMPLE_FULL_DPD_Image_MSWORD.PNG.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis refers to a set of techniques originally developed by John Tukey to display data in such a way that interesting features will become apparent.

Unlike classical methods which usually begin with an assumed model for the data, EDA techniques are used to encourage the data to suggest models that might be appropriate.

[Source: http://www.statgraphics.com/exploratory-data-analysis]

Data profiling is the first stage of the Exploratory Data Analysis process.

What is the Data Profiling process?

Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data.

The purpose of data profiling is to find out whether existing data can be easily used for other purposes.

Data profiling utilizes methods of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, variation, aggregates such as count and sum, and additional metadata information obtained during data profiling such as data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition.

[Source: https://en.wikipedia.org/wiki/Data_profiling]

Data Profile Dataframe (DPD) — The Game Changer!

Python and Pandas provide many functions to profile the data.

But these commands have to be issued individually and the information will be dispersed.

This is a repetitive and tedious manual process.

The Data Profile Dataframe is a game changing way to conveniently access all data profiling information in a single dataframe, which is generated by combining many of the data profiling functionalities of Pandas package and generic statistics functions of Python.

Creating the Data Profile Dataframe

The code to create the DPD can be downloaded from my GitHub site AnalyticsInsightsNinja, using the files listed above. As the code can be downloaded freely, I will cover only the key aspects of creating the DPD in this article.

First, a dataframe is created with as many rows as there are columns in the dataset. This is because each row in the DPD is the profile of one source column, so the number of rows in the DPD equals the number of columns in the dataset.

The required columns are added to the dataframe.

The columns in the source data are added as rows in the DPD along with details such as value count, unique value count, etc.
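The skeleton DPD described above can be sketched as follows. This is a minimal illustration, not the article's actual code: the sample dataset and the exact DPD column names (other than column_name, which the article's merge step uses) are assumptions.

```python
import pandas as pd

# Hypothetical source dataset; in the notebook this would be loaded from a file.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "salary": [50000.0, 64000.0, 120000.0, 98000.0],
    "city": ["NY", "LA", "NY", "SF"],
})

# One DPD row per source column: each column's profile becomes a row.
data_qlt_df = pd.DataFrame({
    "column_name": df.columns,
    "col_data_type": [str(t) for t in df.dtypes],
    "non_null_values": df.notnull().sum().values,
    "unique_values_count": df.nunique().values,
})

print(data_qlt_df)
```

Note that the DPD has exactly df.shape[1] rows, one per source column, which is what makes the later merge with describe() output straightforward.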

The next step is to generate a dataframe of the source dataframe's profile using the command:

df.describe().T.round(2)

This creates a source data profile dataframe, transposes it (T) and rounds the numeric values to two decimals.

The key step in creating the DPD is merging the initially created DPD with the dataframe resulting from the above describe() function.

pd.merge(data_qlt_df, raw_num_df, how='left', left_on='column_name', right_index=True)

Then the remaining columns are added to the DPD.
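Putting the two steps together, a minimal end-to-end sketch of the merge might look like this (the sample data is illustrative; data_qlt_df and raw_num_df follow the names used above):

```python
import pandas as pd

# Hypothetical source data; in the notebook df would already be loaded.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "salary": [50000.0, 64000.0, 120000.0, 98000.0],
    "city": ["NY", "LA", "NY", "SF"],
})

# Skeleton DPD: one row per source column.
data_qlt_df = pd.DataFrame({"column_name": df.columns})

# Profile of the numeric columns, transposed and rounded as in the article.
raw_num_df = df.describe().T.round(2)

# Left merge on the column name against the describe() index; the object
# column 'city' is kept as a row with NaN for the numeric statistics.
data_qlt_df = pd.merge(
    data_qlt_df, raw_num_df,
    how="left", left_on="column_name", right_index=True,
)
print(data_qlt_df)
```

The how='left' join is what keeps non-numeric columns in the DPD even though describe() only profiles the numeric ones.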

Generating plots/graphs

At various points in the code, different plots/graphs are generated and saved to disk.

These images will be inserted in the MS Word document at the appropriate locations.

Plots/graphs for numerical columns

For numerical columns the following graphs are generated:

- Boxplot for all the values of the column
- Histogram for all the values of the column
- Boxplot of quartiles for all the values of the column
- Boxplot without outliers
- Violin plot for values less than the 95th percentile
- Histogram for values less than the 95th percentile

In the above diagram, the histogram for values less than the 95th percentile is a custom version of a histogram, with the count of values on the bars, bar range values, and coloured bars for different percentile ranges.

The code for this histogram can be downloaded from my GitHub site AnalyticsInsightsNinja — PythonVisualAnalytics.
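As a minimal sketch of how a few of these plots might be generated and saved to disk (the sample values, figure layout and filename are illustrative; the article's actual code produces all six plots per column):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

# Illustrative skewed numeric column.
values = np.random.default_rng(0).lognormal(mean=10, sigma=1, size=1000)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Boxplot for all values of the column.
axes[0].boxplot(values)
axes[0].set_title("Boxplot - all values")

# Histogram for all values of the column.
axes[1].hist(values, bins=30)
axes[1].set_title("Histogram - all values")

# Histogram for values below the 95th percentile.
clipped = values[values < np.percentile(values, 95)]
axes[2].hist(clipped, bins=30)
axes[2].set_title("Histogram - below 95th percentile")

fig.savefig("col_profile.png")  # saved to disk, later inserted into the Word doc
plt.close(fig)
```

Saving each figure to a file rather than displaying it is the key step, since the Word generation code later picks these images up from disk.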

Plots/graphs for object columns

For 'object' type columns, the normalised counts of the unique values of the column are plotted for the first twenty-five (25) values.
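This can be sketched with pandas' value_counts; the sample column and output filename are illustrative:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical 'object' type column.
s = pd.Series(["NY", "LA", "NY", "SF", "NY", "LA"])

# Normalised frequencies of the unique values, limited to the first 25.
top_vals = s.value_counts(normalize=True).head(25)

ax = top_vals.plot(kind="bar")
ax.set_ylabel("fraction of rows")
plt.tight_layout()
plt.savefig("object_col_profile.png")
plt.close()
```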

Correlation plot

A correlation plot of the numerical columns is useful to determine the relationship between columns and/or to understand how columns are affected by one another.
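A minimal sketch of a correlation plot using pandas' corr() and a matplotlib heatmap (the data is illustrative, and the article's code may render the plot differently):

```python
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Illustrative data: y is strongly correlated with x, z is independent.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=200),
    "z": rng.normal(size=200),
})

corr = df.corr()  # pairwise Pearson correlation of the numeric columns

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns)
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
fig.savefig("correlation_plot.png")
plt.close(fig)
```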

The Final Output

The final Microsoft Word document is created using the 'docx' package.

The code to generate the MS Word document with the data profiling details is somewhat lengthy, as every detail that goes into the Word document has to be coded.

But the output is worth every line of the code!

[Image: sample pages from the final Data Profile MS Word document, showing a sample numeric column data profile]

The sample Data Profile MS Word document can be downloaded from my GitHub site (data_profile_df_MS_WORD.docx).

What's next after the data profiling?

As mentioned earlier in this blog, the purpose of data profiling is to find out whether existing data can be easily used for other purposes.

So what decisions can be made using the DPD?

Column names and data types: understand what the column is, whether the column data type is correct, whether the columns will be useful for feature engineering for modelling purposes or can be used for visual analytics, etc.

Memory increase from disk to RAM: decisions can be made about whether the column data types need to be optimised to fit in RAM.

Numerical column statistics: based on the max and min values of a column, decisions can be made about whether the column type can be downcast to optimise memory usage, e.g. downcast from float64 to float16.

Unique value count: based on the column type and the number of unique values, decisions can be made about whether a column is a candidate for conversion to a 'categorical' data type.
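The two memory optimisations above can be sketched like this (the sample data is illustrative; pandas' to_numeric downcasts integers based on their min/max, and astype('category') handles low-cardinality object columns):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": np.array([25, 32, 47, 51], dtype="int64"),
    "city": ["NY", "LA", "NY", "SF"],
})

# Downcast the integer column to the smallest type that fits its min/max.
df["age"] = pd.to_numeric(df["age"], downcast="integer")

# Convert the low-cardinality object column to a categorical type.
df["city"] = df["city"].astype("category")

print(df.dtypes)
print(df.memory_usage(deep=True))
```

memory_usage(deep=True) before and after the conversion shows the actual RAM saving, which is the figure the "memory increase from disk to RAM" row in the DPD is based on.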

Percentage or number of null values: decisions can be made as to whether the null values can be filled in a meaningful way, whether the column serves any useful purpose if it has a large number of null values, whether the column can be dropped, whether the observation/row can be dropped if the column value is null for that row, etc.

Column value distribution: decisions can be made regarding how to handle the outliers, should the outliers be dropped, what kind of visual will be effective to identify trends or patterns, etc.

Correlation plot: decisions can be made whether correlation could be used to make predictions, whether strongly correlated columns could be replaced with one instead of many columns without loss of information, etc.

Note: any decision made regarding the data requires domain knowledge, or access to someone with domain knowledge who understands the data.

Decisions taken without proper understanding of the data and the domain knowledge could result in poor analysis and analytics.

Conclusion

In this article we explored how to automate the data profiling stage of the Exploratory Data Analysis (EDA) process using Python.

As noted above, the key advantage of producing an MS Word document with the data profiling results is that it can be used to capture the discussions and decisions with domain experts regarding data quality, data transformations and feature engineering for further modelling and visualisations.

Though there are various existing software packages to assist in the data profiling process, this code delivers some very useful benefits:

- The Python code is generic, so it can be changed to load any data source the user desires into a Pandas dataframe.
- Once the data is loaded into a Pandas dataframe, the remaining data profiling code can be executed as is, without any changes, to produce the final MS Word output.
- The output of the data profile is a complete Microsoft Word document which can then be used to document further discussions and decisions about the data.
