Data Preprocessing: A Practical Guide

Download the dataset from this link.

This dataset was published as part of a Kaggle competition.

It has three .csv files: train.csv, test.csv, and gender_submission.csv. We are going to work on the train.csv data in this tutorial.

Open a new Jupyter Notebook (or any other IDE of your choice) to run our Python scripts.

Import the dataset

First, import the packages needed to proceed further.

Read the dataset using pandas' read_csv() and store it in a variable named training_set. Display the first few rows with head(); by default, head() returns the first 5 rows of the dataset, but you can specify any number of rows, like head(10).
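A minimal sketch of these two steps, assuming train.csv sits in the working directory (pandas is the only package strictly required here):

import pandas as pd

# Load the training data and preview the first rows
training_set = pd.read_csv('train.csv')
training_set.head()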

Dataset: RMS Titanic Survival

Check the dataset info

Let's check the basic information about the dataset by running a few simple commands.

training_set.shape
It returns the number of rows and columns in the dataset.

(891, 12)

training_set.columns
It returns the column headings.

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

training_set.isnull().sum()
It returns the number of null values in each column.

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Preparing the dataset

From this overall understanding of the dataset, we can gather several insights that will guide us:

- 'Survived' is the target variable, which we will predict once our preprocessing is done, so we retain that column.
- Only the 'Age', 'Cabin' and 'Embarked' columns have missing values.
- 'PassengerId', 'Name' and 'Ticket' don't add much value in predicting the target variable.
- 'Parch' (Parents/Children) and 'SibSp' (Siblings/Spouse) are family-related details, so we can derive a new column named 'Size of the family' from them.
- 'Sex', 'Cabin' and 'Embarked' are categorical data that need to be encoded into numerical values.

These are all the insights I could gather. Now let's process the data in accordance with this information.

Dropping of columns

In this step, we are going to drop the columns with the least priority. Columns such as 'PassengerId' and 'Ticket' come under this category. Use drop() to drop the columns.
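A minimal sketch of this step; axis=1 tells drop() to remove columns rather than rows:

# Drop the low-priority columns identified above
training_set = training_set.drop(['PassengerId', 'Ticket'], axis=1)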

Now, let's run training_set.info() and look at the status of our dataset.

> training_set.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB

We can see that only the 'Cabin', 'Embarked' and 'Age' columns have missing values.

Let’s work on that now.

Creating new classes

'Cabin': Though the Cabin column has 687 missing values, if you look carefully, each value has a unique character at the beginning that denotes the deck number. Therefore, we are going to create a column named 'Deck' to extract this information, which may be used later in our prediction.

'Parch' and 'SibSp' are details related to family size, so let's derive a new column named 'Size of the Family'.

'Name': Instead of dropping it right away, we need to extract only the Title from the name of each passenger.

Now, let's drop the Cabin and Name columns; we have extracted the needed information from these two. A sketch of these steps is shown below.
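A minimal sketch of the three derived columns and the final drop; the exact extraction rules (the 'U' placeholder for missing cabins, counting the passenger in the family size, the regex used for the title) are reasonable assumptions rather than the only possible choices:

# Deck: the leading character of the Cabin value; 'U' marks an unknown deck
training_set['Deck'] = training_set['Cabin'].fillna('U').str[0]

# FamilySize: siblings/spouse + parents/children + the passenger themselves
training_set['FamilySize'] = training_set['SibSp'] + training_set['Parch'] + 1

# Title: the word before the '.' in the name, e.g. 'Mr', 'Mrs', 'Miss'
training_set['Title'] = training_set['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Cabin and Name are no longer needed once the information is extracted
training_set = training_set.drop(['Cabin', 'Name'], axis=1)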

This is what our dataset looks like now.

Handling missing values

'Embarked': Only two rows are missing values for the Embarked column. Embarked takes categorical values such as C = Cherbourg, Q = Queenstown, S = Southampton. Here we can simply impute the missing values with the most commonly occurring value, which is 'S' in this case.
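A minimal sketch of this imputation, filling the two gaps with the most frequent port:

# Impute the missing Embarked values with the most common value, 'S'
training_set['Embarked'] = training_set['Embarked'].fillna('S')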

'Age': We are going to impute the missing values in the 'Age' column by taking the mean value within each group. Taking the mean of the whole column can make the data inconsistent, because ages span several distinct ranges.
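A minimal sketch of group-wise imputation; grouping by 'Sex' and 'Pclass' is an assumption here, since the grouping columns are not named above:

# Replace missing ages with the mean age of each (Sex, Pclass) group
training_set['Age'] = training_set.groupby(['Sex', 'Pclass'])['Age'].transform(lambda s: s.fillna(s.mean()))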

Encoding categorical features

Many machine learning algorithms cannot handle categorical values unless they are converted to numerical values. Fortunately, the Python libraries pandas and scikit-learn provide several approaches to handle this situation. They are:

- Find and Replace
- Label encoding
- One-hot encoding
- Custom binary encoding
- Using LabelEncoder from scikit-learn

Every method has its own advantages as well as disadvantages.

Initially, we are just going to map the categorical values to numerical data using map(). Manually replacing the categorical values is not the right choice if there are many categories.
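A minimal sketch of the mapping, using the two-category 'Sex' column; the specific integer codes are an arbitrary choice:

# Map the two categories of 'Sex' to integers
training_set['Sex'] = training_set['Sex'].map({'male': 0, 'female': 1})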

Let's do one conversion using LabelEncoder(), provided by the sklearn.preprocessing library. This transforms the categorical data into numerical values.
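A minimal sketch using the 'Embarked' column as the example; the same pattern applies to 'Title' and 'Deck':

from sklearn.preprocessing import LabelEncoder

# Learn the categories and replace them with integer labels
le = LabelEncoder()
training_set['Embarked'] = le.fit_transform(training_set['Embarked'])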

Dataset ready…

Now our data is free from missing values, categorical data, and unwanted columns, and is ready to be used for further processing.

training_set.info()

Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null int64
Title          891 non-null int64
FamilySize     891 non-null int64
Deck           891 non-null int64
dtypes: float64(2), int64(9)
memory usage: 76.6 KB

Applaud yourself on completing this! I hope this article gives you an understanding of how to practically preprocess your data.

Transform data into insights!

Other tutorials: Explore your Data: Exploratory Data Analysis. Would you like to explore your data? Let's learn by analyzing the India Air Quality dataset (medium.com).

#100daysofMLcoding

End of Day #9. Happy Learning!
