Data Handling Using Pandas: Cleaning and Processing
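Throughout this post, the snippets assume the following setup. The imports are inferred from the code that follows, and the CSV filename is an assumption (the IMDb 5000 movies data-set is commonly shipped as movie_metadata.csv), so adjust the path to your copy.

# Assumed setup for all snippets below; the filename is an assumption.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Load the IMDb 5000 movies data-set into a data-frame.
movies_df = pd.read_csv('movie_metadata.csv')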

", print movies_df.

isna()The above commands return the following outputLooking For Missing Data in data-frameRather than printing out the data-frame with True/False as entry, we can extract the relevant information by adding a .

sum() along with the previous command.

With this we can find total number of missing values for each column.

print(movies_df.isna().sum())

>>>
color                         19
director_name                104
num_critic_for_reviews        50
duration                      15
director_facebook_likes      104
actor_3_facebook_likes        23
actor_2_name                  13
actor_1_facebook_likes         7
gross                        884
genres                         0
actor_1_name                   7
movie_title                    0
num_voted_users                0
cast_total_facebook_likes      0
actor_3_name                  23
facenumber_in_poster          13
plot_keywords                153
movie_imdb_link                0
num_user_for_reviews          21
language                      12
country                        5
content_rating               303
budget                       492
title_year                   108
actor_2_facebook_likes        13
imdb_score                     0
aspect_ratio                 329
movie_facebook_likes           0
dtype: int64

Adding another .sum() returns the total number of null values in the data-set.

print "total null values: ", movies_df.

isna().

sum().

sum()>> total null values: 2698One of the easiest ways to remove rows containing NA is to drop them, either when all column contain NA or any column contain NA.

Let’s start with dropping rows that contain NA values in any of the columns.

clean_movies_df = movies_df.dropna(how='any')
print("new dataframe shape: ", clean_movies_df.shape)
print("old dataframe shape: ", movies_df.shape)

>>> new dataframe shape:  (3756, 28)
old dataframe shape:  (5043, 28)

So dropping rows containing NA values in any of the columns removed almost 1300 rows.

This can be important for data-sets with a small number of rows, where dropping all rows with any missing value can cost us necessary information.

In that case we can use the pandas.DataFrame.fillna() method to fill NA/NaN values using a specified method or value. The easiest way is to fill all the NA/NaNs with some fixed value, for example 0. We can do that simply with

movies_df.fillna(value=0, inplace=True)

Instead of filling up all the missing values with zero, we can choose some specific columns and then use the DataFrame.fillna() method as below:

movies_df[['gross', 'budget']] = movies_df[['gross', 'budget']].fillna(value=0)

For columns with 'object' dtype, for example the 'language' column, we can use a placeholder like "no info" to fill up the missing entries.

movies_df['language'].fillna("no info", inplace=True)

Another method to fill a missing value is the ffill method, which propagates the last valid observation forward. Similarly, the bfill method uses the next valid observation to fill the gap.

movies_df['language'].fillna(method='ffill', inplace=True)

Another effective method is to use the mean of the column to fill the missing values, as below:

movies_df['budget'].fillna(movies_df['budget'].mean(), inplace=True)

For more details on how to use Pandas to deal with missing values, you can check the Pandas user guide on missing data.
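One refinement, not part of the original recipe, is to fill a numeric column with a group-wise mean rather than the global one; a sketch, assuming we group 'budget' by 'country':

# Fill missing budgets with the mean budget of movies from the same country;
# groups that are entirely NA then fall back to the global mean.
movies_df['budget'] = movies_df['budget'].fillna(
    movies_df.groupby('country')['budget'].transform('mean'))
movies_df['budget'].fillna(movies_df['budget'].mean(), inplace=True)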

Duplicate Data in a Data-Frame: DataFrame.duplicated()

Apart from missing data, there can also be duplicate rows in a data-frame. To find whether a data-set contains duplicate rows or not, we can use Pandas DataFrame.duplicated(), either for all columns or for some selected columns. pandas.DataFrame.duplicated() returns a Boolean series denoting duplicate rows. Let's first find how many duplicate rows are in this movies data-set.

duplicate_rows_df = movies_df[movies_df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)

>>> number of duplicate rows:  (45, 28)

So there are 45 rows whose entries are duplicated in every column.

We can check this for an individual column too:

duplicate_rows_df_imdb_link = movies_df[movies_df.duplicated(['movie_imdb_link'])]
print(duplicate_rows_df_imdb_link.shape)

>>> (124, 28)

So there are 124 cases where the imdb link is the same. Another way to check the same thing is to use the pandas.Series.unique() method.

Let's see:

print(len(movies_df.movie_imdb_link.unique()))

>>> 4919

So the total number of unique links is 4919 and, as you may have noticed, the number of duplicate links was 124; adding them gives the total number of rows (4919 + 124 = 5043).
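If you want to see which links actually repeat, the pandas.Series.value_counts() method gives the counts directly; a minimal sketch:

# Count occurrences of each link and keep only those appearing more than once.
link_counts = movies_df['movie_imdb_link'].value_counts()
print(link_counts[link_counts > 1].head())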

It is necessary to select the unique rows for better analysis, so at the very least we can drop the rows with the same values in all columns. We can do this simply using pandas.DataFrame.drop_duplicates(), as below:

print("shape of dataframe after dropping duplicates", movies_df.drop_duplicates().shape)

>>> shape of dataframe after dropping duplicates (4998, 28)
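drop_duplicates() accepts the same column subset as duplicated(), so we could instead keep exactly one row per imdb link; a small sketch (keep='first' retains the first occurrence):

# Keep only the first row for each distinct imdb link.
unique_links_df = movies_df.drop_duplicates(subset=['movie_imdb_link'], keep='first')
print("shape after dropping duplicate links:", unique_links_df.shape)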

Binning Data: pandas.cut()

Another very important data processing technique is data bucketing, or data binning. We will see an example here of binning IMDb scores using the pandas.cut() method. Based on the bin edges [0., 4., 7., 10.], I want to put movies into the buckets ['shyttte', 'moderate', 'good']. As you can understand, movies with a score between 0 and 4 will be put into the 'shyttte' bucket, and so on.

We can do this with the following lines of code:

op_labels = ['shyttte', 'moderate', 'good']
category = [0., 4., 7., 10.]
movies_df['imdb_labels'] = pd.cut(movies_df['imdb_score'], labels=op_labels, bins=category, include_lowest=False)

Here a new column 'imdb_labels' is created containing the labels. Let's take a look at it:

print(movies_df[['movie_title', 'imdb_score', 'imdb_labels']][209:220])

>>>
     movie_title        imdb_score  imdb_labels
209  Rio 2                     6.4     moderate
210  X-Men 2                   7.5         good
211  Fast Five                 7.3         good
212  Sherlock Holmes:.         7.5         good
213  Clash of the.             5.8     moderate
214  Total Recall              7.5         good
215  The 13th Warrior          6.6     moderate
216  The Bourne Legacy         6.7     moderate
217  Batman & Robin            3.7      shyttte
218  How the Grinch.           6.0     moderate
219  The Day After T.          6.4     moderate

To fully capitalize on the pandas.cut() method, you can check the docs.
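To sanity-check the binning, we can count how many movies landed in each bucket; a minimal sketch:

# Count how many movies fall into each IMDb-score bucket.
print(movies_df['imdb_labels'].value_counts())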

Detecting Outliers in a Data-Set:

Most of the time in Exploratory Data Analysis (EDA), outlier detection is an important step, as outliers in particular features may distort the true picture, so we need to disregard them. Specifically, outliers can play havoc when we want to apply a machine learning algorithm for prediction. At the same time, outliers can even help us with anomaly detection.

So let’s see how we can use Pandas to detect outliers in this particular data-frame.

Seaborn Box Plot:

A box plot is a standard way of visualizing the distribution of data based on the median, quartiles and outliers. Probably you already know exactly what these quantities are, but I have still given a short review in the figure below.

Figure 1: Schematic of a box plot. (Source: Author)

We can use the Python data visualization library Seaborn to plot such box plots. Let's plot the distribution of the number of actors featured in the movie poster using a box plot.

sns.boxplot(x=movies_df['facenumber_in_poster'], color='lime')
plt.xlabel('No. of Actors Featured in Poster', fontsize=14)
plt.show()

The code above results in the plot below.

Figure 2: Too many outliers in the number of faces featured in movie posters.

Let's check the movie with the maximum number of actors (faces) featured in its poster.

print(movies_df[['movie_title', 'facenumber_in_poster']].iloc[movies_df['facenumber_in_poster'].idxmax()])

>>>
movie_title             500 Days of Summer
facenumber_in_poster                    43

So the maximum number of faces (43) was featured in the movie '500 Days of Summer'.

Let's see the basic statistical details of the column 'facenumber_in_poster' with the pandas.DataFrame.describe() method.

print(movies_df['facenumber_in_poster'].describe())

>>>
count    5030.000000
mean        1.371173
std         2.013576
min         0.000000
25%         0.000000
50%         1.000000
75%         2.000000
max        43.000000

With this, the box plot probably makes a lot more sense to you now.
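The whiskers of a box plot conventionally flag points beyond 1.5 × IQR from the quartiles, and we can apply the same rule numerically; a minimal sketch (the 1.5 multiplier is the usual convention, not something specific to this data-set):

# Interquartile-range rule on 'facenumber_in_poster' (NA values dropped first).
col = movies_df['facenumber_in_poster'].dropna()
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = col[(col < lower) | (col > upper)]
print("IQR rule flags", len(outliers), "points as outliers")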

Another way to detect outliers is to use the Z score.

Let’s see how that works.

Z Score and Outliers:

Figure 3: 1σ and 3σ standard deviations on a normal distribution with μ = 0. (Source: Author)

The Z score is a dimensionless number that signifies how many standard deviations a data point is from the mean. The Z score can simply be defined as Z = (X − μ)/σ, where μ is the population mean, σ is the standard deviation and X is one element of the population.

To plot the figure above, I have used samples drawn from a normal distribution with numpy.random.normal(). In a normal distribution almost all the values, about 99.7%, fall within 3σ of the mean (for the plot, μ = 0). So the way we can use the Z score to reject outliers is to keep only the data points that lie within 3 units of Z score.

This can be done for all columns with non-object dtypes using scipy.stats, as below.

1. Check the data types of all columns in the data-frame:

print("data types: ", movies_df.dtypes)

>>>
color                        object
director_name                object
num_critic_for_reviews      float64
duration                    float64
director_facebook_likes     float64
actor_3_facebook_likes      float64
actor_2_name                 object
actor_1_facebook_likes      float64
gross                       float64
genres                       object
actor_1_name                 object
movie_title                  object
num_voted_users               int64
cast_total_facebook_likes     int64
actor_3_name                 object
facenumber_in_poster        float64
plot_keywords                object
movie_imdb_link              object
num_user_for_reviews        float64
language                     object
country                      object
content_rating               object
budget                      float64
title_year                  float64
actor_2_facebook_likes      float64
imdb_score                  float64
aspect_ratio                float64
movie_facebook_likes          int64

2. Create a new data-frame excluding all the 'object' type columns:

print("shape before :", movies_df.shape)
movies_df_num = movies_df.select_dtypes(exclude=['object'])
print("shape after excluding object columns: ", movies_df_num.shape)

>>>
shape before : (3756, 28)
shape after excluding object columns:  (3756, 16)

3. Select the elements from each column that lie within 3 units of Z score:

movies_df_Zscore = movies_df_num[(np.abs(stats.zscore(movies_df_num)) < 3).all(axis=1)]
print("shape after rejecting outliers: ", movies_df_Zscore.shape)

>>>
shape after rejecting outliers:  (3113, 16)

We can check the effect of the above steps by plotting the box plot for 'facenumber_in_poster' again.

Figure 4: Box plot of the number of faces featured in a movie poster, after applying the Z score method.

Here one can see the difference compared to Figure 2, where the box plot considered all elements of the 'facenumber_in_poster' column.
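The re-plot is the same boxplot call as before, only on the filtered data-frame; a minimal sketch:

# Box plot of the same column after Z-score based outlier rejection.
sns.boxplot(x=movies_df_Zscore['facenumber_in_poster'], color='lime')
plt.xlabel('No. of Actors Featured in Poster (outliers removed)', fontsize=14)
plt.show()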

These are some of the ways one can prepare a data-set for analysis and for applying machine learning algorithms for prediction. Effectively preparing the data-set helps a lot with comprehensive analysis, and I hope this post helps you prepare a data-set more methodically for further analysis. Depending upon the problem and the data-set, you may have to decide, choose and repeat these processes to interpret their effects. So, good luck exploring your data-set!

Stay Strong and Cheers!! The codes used for this post are available on my Github. Find me on LinkedIn.
