Cleaning a messy dataset using Python

According to a survey conducted by Figure Eight in 2016, almost 60% of data scientists’ time is spent on cleaning and organizing data.

You can find the survey results here.

[Figure: Time spent by data scientists, from the 2016 survey]

Data in the real world comes from different sources and can be incomplete, noisy, and inconsistent.

Dirty data can be produced as a result of human errors, the integration of separate systems, or changing requirements.

Poor data quality can lead to inaccurate results in data analytics and drive misguided decision making.

In this post, I intend to go through a data cleaning process to prepare a messy dataset for the final analysis.

The following figure depicts a set of IMDB records that I downloaded from the IMDB data source here.

[Figure: A sample messy dataset]

As you can see, a dataset can contain different types of errors:

Outliers: numbers in a column that are out of range. An outlier lies outside of most of the other values in the dataset (red cells in the figure).

Duplicates: a dataset can contain repetitive rows or records (purple rows in the figure).

Missing values: we might lose data as a result of human error or missing information (orange cells in the figure).

Bad characters or NULL values: some values may contain bad characters like ₮ or NULL values. Sometimes null data is written in different ways, such as N/A, NA, or NaN (light or dark cells in the figure).

Data cleaning

Data cleaning is a systematic process to explore and analyze data, handle errors, standardize data, normalize data, and finally validate it against the original dataset.

[Figure: Data cleaning tasks]

Sample dataset

To perform data cleaning, I selected a subset of 100 records from the IMDB movie dataset.

It included around 20 attributes, which I reduced to 12 for our analysis.

The dataset is available here.

I used Python for the analysis, as it is a powerful, flexible, and open-source language.

Python has a set of libraries for data manipulation, analysis and visualization.

However, there are other data analytics tools you can choose for data cleaning, such as Tableau, R, QlikView, SAP, Excel, Apache Spark, etc.

Data cleaning process

Data exploration

The first step in data cleaning is understanding the data by exploring the dataset and its attributes.

The type of analysis might be different for each attribute type.

The following table shows the 12 attributes we consider for data cleaning, along with their data types.

I categorized the attributes into different groups:

[Table: dataset’s attributes]

Nominal: an attribute can be categorical, like color, which can contain a set of restricted values (e.g., blue, red, brown, ...).

Text: an attribute can be free text or a string.

Numeric: an attribute can be numeric (e.g., currency, scores).

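As a minimal sketch of this exploration step, the following lines load the sample CSV (the same file used in the cleaning code later in this post) into a pandas DataFrame and inspect its attributes:

import pandas as pd

dataset = pd.read_csv('movie_sample_dataset.csv', encoding='utf-8')

# Attribute names, inferred data types, and non-null counts
dataset.info()

# Summary statistics for the numeric attributes
print(dataset.describe())

# How many values are missing in each column?
print(dataset.isnull().sum())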
[Figure: Missing or repetitive values in columns]

Handling errors

Based on the type of error, we choose a specific strategy to handle it.

If we have missing or repetitive values for one attribute, we may decide to drop that column from our analysis.

Based on experience, if more than 60% of a column contains null values, we can drop that column from our analysis.

The same holds for repetitive values in a column.

If all the values of a column are the same, it is definitely useless.

In our example, one sentence mentioning that all the movies in our selected dataset are in English is more than enough for the analysis.

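As an illustration of this rule of thumb, a minimal sketch (assuming the data is already loaded into the dataset DataFrame, and using the 60% threshold mentioned above) could look like this:

# Drop columns that are mostly empty or that contain a single repeated value
null_threshold = 0.6
for column in dataset.columns:
    mostly_null = dataset[column].isnull().mean() > null_threshold
    constant = dataset[column].nunique(dropna=True) <= 1
    if mostly_null or constant:
        dataset = dataset.drop(columns=[column])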
Duplicate records also prevent us from performing a precise analysis.

Let’s assume that we have the following numbers: 4, 6, 8, 10.

The average of these numbers is 7.

However, if we wrongly add another 8 to this set (4, 6, 8, 8, 10), the average becomes 7.2, which is not equal to the previous average.

Removing the duplicate records in a dataset is an essential step in an analysis.

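A minimal sketch of this step with pandas, assuming the same dataset DataFrame as above:

# Remove rows that are exact duplicates of an earlier row, keeping the first occurrence
rows_before = len(dataset)
dataset = dataset.drop_duplicates(keep='first')
print('Duplicate rows removed:', rows_before - len(dataset))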
[Figure: Duplicate records]

Identifying invalid outliers in our dataset is also very important, as it impacts the analysis results.

There are different ways to identify the outliers in a dataset.

One is simply finding the minimum and maximum of the numbers.

Another approach is visualizing the numeric data in a box-plot to see if there is an outlier in the data or not.

Some outliers might be valid and should be treated carefully (for example, health data may include out-of-range numbers that reflect a genuine behavior of a medical test result).

Filtering out invalid outliers or filling them with default or otherwise appropriate values are two ways to address this issue in data analytics.

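Both approaches can be sketched in a few lines, assuming the dataset DataFrame from above and that matplotlib is installed (duration and imdb_score are numeric columns used later in the cleaning code):

import matplotlib.pyplot as plt

# Quick range check: minimum and maximum of a numeric column
print(dataset['duration'].min(), dataset['duration'].max())

# Visual check: values far beyond the box-plot whiskers are outlier candidates
dataset.boxplot(column=['duration', 'imdb_score'])
plt.show()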
[Figure: Outliers can be depicted with box plots]

Standardize values

Another data cleaning step is making the data standard.

We should make sure that data is in the right column and that values are of the correct data type.

Also, the values in columns should be unified.

For example, if there is an attribute in datetime format, all the values in that column should be unified into the same format (e.g., YYYY-MM-dd).

Text values should be unified, and bad characters should be identified and fixed.

For example, if we have a country column, all of canada, CANADA, and Canada should be converted to a single form (e.g., Canada).

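As a small, hypothetical sketch of such standardization with pandas (release_date is an illustrative column name, not one of the attributes in the sample dataset):

import pandas as pd

# Parse mixed date strings, then re-emit every value in the YYYY-MM-dd format;
# unparseable entries become NaT and can be handled as missing values
dataset['release_date'] = pd.to_datetime(dataset['release_date'], errors='coerce')
dataset['release_date'] = dataset['release_date'].dt.strftime('%Y-%m-%d')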
Normalize values

To minimize duplicate data, avoid data modification issues, and simplify queries, we should normalize the data in a dataset.

An attribute should contain atomic values and NOT a combination of several values.

In our example, three actors are stored in one column.

To normalize this column, we should split it up into three attributes (e.g., Actor1, Actor2, and Actor3).

Another step towards data normalization can be adding new metrics or categories to the current dataset.

For example, we might want to add a metric named gross over budget to the table to have a calculated metric for our final analysis.

Data validation

After cleaning the data, the final dataset should be compared to the original dataset to ensure data accuracy.

This step is essential, as we want to know whether or not we lost data as a result of data cleaning.

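A minimal sketch of such a sanity check, assuming the original and cleaned data are held in two DataFrames named original and cleaned:

# Compare basic shape and missing-value counts before and after cleaning
print('rows before/after:', len(original), len(cleaned))
print('columns before/after:', original.shape[1], cleaned.shape[1])
print('nulls remaining per column:')
print(cleaned.isnull().sum())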
Python code for data cleaning

Read CSV file in Python

In the following lines, we import pandas and NumPy and then read the IMDB sub-dataset using the read_csv command.

import pandas as pd
import numpy as np

dataset = pd.read_csv('movie_sample_dataset.csv', encoding='utf-8')

First of all, we drop the useless columns:

dataset.drop(['color', 'language'], axis=1, inplace=True)

In the next step, we fill the empty values of two numeric columns, gross and budget:

dataset['gross'].fillna(0, inplace=True)
dataset['budget'].fillna(0, inplace=True)

Then, we unify the values for other columns like country, director name, and movie title:

# Uppercase all the country values
dataset['country'] = dataset['country'].str.upper()
dataset['country'] = np.where(dataset['country'] == 'UNITED STATES', 'USA', dataset['country'])

# Bad data entry
dataset['director_name'] = np.where(dataset['director_name'] == 'N/A', '', dataset['director_name'])
dataset['director_name'] = np.where(dataset['director_name'] == 'Nan', '', dataset['director_name'])
dataset['director_name'] = np.where(dataset['director_name'] == 'Null', '', dataset['director_name'])
dataset['movie_title'] = dataset['movie_title'].str.replace('Â', '')

As I mentioned above, one of the ways to handle outliers is to fill the corresponding rows with appropriate values.

For example, movie duration in our dataset cannot be less than 10 or more than 300 minutes.

For those cases, we fill the cells with zero values.

dataset['duration'] = np.where(dataset['duration'] <= 10, 0, dataset['duration'])
dataset['duration'] = np.where(dataset['duration'] > 300, 0, dataset['duration'])
dataset['imdb_score'] = np.where(dataset['imdb_score'] <= 0, 0, dataset['imdb_score'])
dataset['title_year'] = np.where(dataset['title_year'] < 2010, 0, dataset['title_year'])

To normalize the data, we split up the actors into three different attributes and drop the original column as follows:

actor_list = dataset["actors"].str.split(",", n=2, expand=True)
dataset["actor1"] = actor_list[0]
dataset["actor2"] = actor_list[1]
dataset["actor3"] = actor_list[2]
dataset.drop(columns=['actors'], inplace=True)

And finally, we define a new metric called GOB that shows gross over budget for each movie:

# Add a new metric GOB (Gross over Budget)
dataset['GOB'] = dataset.apply(lambda row: row['gross'] / row['budget'] if row['budget'] != 0 else 0, axis=1)
top_GOB = dataset.sort_values('GOB', ascending=False).head(15)

We then save the result in another CSV file to validate:

dataset.to_csv('output_IMDB.csv')

Thank you for reading this article! I will probably write another article about analyzing and visualizing this clean dataset.
