DATA CLEANING WITH PYTHON

INTRODUCTION

Before working on any data, it is very necessary to clean it.

If you plan to seek a career in this field, you should be able to handle messy data which might include missing values, inconsistent values or outliers.

A basic understanding of the NumPy and Pandas libraries is required before you can proceed to clean a dataset.

Data Cleaning

As the name implies, data cleaning is the process of identifying and correcting inaccurate records in a dataset, recognising unreliable or irrelevant parts of the data, and then repairing or removing the messy records.

Data cleaning is one of the least spoken-about aspects of data science because it is not as exciting as training a neural network or identifying images, but those tasks cannot be performed well without it.

Without data cleaning, machine learning prediction models will not be as efficient and accurate as we want them to be.

We will discuss the following:

- Inconsistencies in data
- Missing data
- The need to process columns
- Removing characters from strings

Missing Data

When and why is data missing? When filling in surveys, people sometimes do not share all the required information, and some use incorrect datatypes. All of this results in missing data, which must be accounted for before using the data to make further predictions.
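As a quick illustration of that last point (a minimal sketch with a made-up column, not from the article's dataset), pandas can coerce wrongly-typed entries into NaN so they get handled together with the other missing values:

import pandas as pd

# Hypothetical responses where some people typed words instead of numbers
scores = pd.Series(['85', 'ninety', '72', 'n/a'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
scores = pd.to_numeric(scores, errors='coerce')
print(scores)  # 85.0, NaN, 72.0, NaN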

We will be looking at a dataset about some students who wrote an entrance examination for different levels; their scores, schools of preference and other details are contained therein.

As usual, we will first import Pandas and read the dataset:

import pandas as pd
data = pd.read_csv('Responses.csv')

Where necessary, you can convert non-standard representations of NaN (e.g. 'n/a', 'na', '--') to proper missing values. This is done by:

missing_values = ['n/a', 'na', '--']
data = pd.read_csv('Responses.csv', na_values=missing_values)
data.head()

Knowing the number of missing values in the 'District' column

Looking at the District column, we want to check whether it has any null values and also count how many there are.

data['District'].isnull().values.any()  # to know if there are any missing values; returns True
data['District'].isnull().sum()  # returns 16
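If you want the same check for every column at once (a small aside, not shown in the original walkthrough), call isnull().sum() on the whole DataFrame; it returns one count of missing values per column:

data.isnull().sum()  # one count per column, e.g. District    16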

Replacing all NaN values

We can choose to replace all NaN values with our preferred value, let's say 14.

data['District'].fillna(14, inplace=True)  # replacing missing values
data['District']

Replacing a specific NaN value

We can also replace a value at a specific location, say, row index 3.

data.loc[3, 'District'] = 32
data

Replacing missing values with the median

We can replace the NaN values with the median of the column; the median in this case is 3.5.

median = data['District'].median()
median
data['District'].fillna(median, inplace=True)
data['District']

Drop missing values

If you want to simply exclude the missing values, use the dropna function along with the axis argument.

By default, axis=0, which means dropna works along rows: if any value within a row is NaN, the whole row is excluded.

# Drop any rows which have any NaNs
data.dropna()

# Drop any columns which have any NaNs
data.dropna(axis=1)

Dropping columns with too many NaNs

This is one interesting parameter I recently learnt about.

The parameter thresh=N requires that a column has at least N non-NaNs to survive.

Here, we only want to keep columns that are at least 90% complete before considering them as candidate features for our model.

# Only keep columns which have at least 90% non-NaN values
data.dropna(thresh=int(data.shape[0] * .9), axis=1)
# Returns a DataFrame with 117 rows and 8 columns
# Recall that the original data had 117 rows and 12 columns

So this means that 4 columns were missing more than 10% of their data (fewer than 90% non-NaN values). They are therefore too incomplete to contribute reliably to the outcome of our result.
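The same thresh logic works row-wise too (a minimal sketch, assuming the same data DataFrame): with axis=0 you keep only the records that have at least 90% of the available features filled in.

# Only keep rows which have at least 90% of their features non-NaN
data.dropna(thresh=int(data.shape[1] * .9), axis=0)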

Another way of doing the above is to manually scan through the columns and drop whichever ones will have little impact on the outcome of our result.

to_drop = ['District', 'School Preferences', 'School Assigned', 'Will you enroll there?']
data.drop(columns=to_drop, inplace=True)
# We will have the same result as above

Deleting some characters in strings

Assuming we are working on a large dataset that has some strings we don't want to include in our model, we can use the code below to remove part of each string.

[Screenshot: stripping a few characters from a string]

The soupsubcategory column is the only column with a datatype of 'object', so we select it with select_dtypes(['object']) and remove '/images' from each value in that column using a lambda function.

new_dataset = dataset.select_dtypes(['object'])
# str.strip('/images') would remove any of the characters /, i, m, a, g, e, s
# from both ends of each string, so str.replace is used to delete the substring
dataset[new_dataset.columns] = new_dataset.apply(lambda x: x.str.replace('/images', '', regex=False))
print(dataset)

There are several other functions and methods that we could apply to our data which are not covered in this article.
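One example of such a method (a hedged aside, not covered in the original article) is drop_duplicates, which removes exact duplicate rows:

# Remove rows that are exact duplicates of an earlier row
data.drop_duplicates(inplace=True)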

You could learn more from this course.

Conclusion

There are tons of resources that could help you gain a deeper understanding of Python for data science. The above is a small fraction of what data science entails.

After cleaning your data, you will have to visualize it (data visualization) before processing it and making predictions from the results.

Check out the links below to find additional resources that will help you on your Python data science journey:

- The Pandas documentation
- The NumPy documentation
- Introduction to Python for Data Science: a great course for beginners who have no previous knowledge of data science.

Thanks for reading through…
