Data Pre-processing with Pandas on Trending YouTube Video Statistics

Alina Zhang · Feb 24

The purpose of this article is to provide a standardized data pre-processing solution that can be applied to any type of dataset.

You will learn how to convert data from initial raw form to another format, in order to prepare the data for exploratory analysis and machine learning models.

Overview of the data

This dataset is a daily record of the top trending YouTube videos in the United States.

The data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, comment count, and more.

The shape of the dataset is 16,580 rows × 16 columns.
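As a starting point, a minimal sketch of loading the data might look like the following. The file name is an assumption (the Kaggle "Trending YouTube Video Statistics" dataset ships one CSV per country), so to keep the example self-contained it first writes a tiny stand-in CSV with a few of the columns described above.

```python
import pandas as pd

# Stand-in for the real dataset: the actual US file has 16,580 rows and
# 16 columns; here we fabricate two rows so the snippet runs anywhere.
csv_text = (
    "video_id,title,channel_title,category_id,views,likes,dislikes,comment_count\n"
    "abc123,Sample video,Some channel,24,1000,100,5,20\n"
    "def456,Another video,Other channel,10,5000,450,12,80\n"
)
with open("USvideos_sample.csv", "w") as f:
    f.write(csv_text)

df = pd.read_csv("USvideos_sample.csv")
print(df.shape)   # (rows, columns); the real file gives (16580, 16)
print(df.dtypes)  # count-like columns should load as integers
```

Once the real file is in place, swap the path in `pd.read_csv` and the rest of the article's snippets apply unchanged.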

A roadmap for data pre-processing

Source code for data pre-processing

Basic insights of the dataset

```python
import numpy as np
import pandas as pd
import seaborn as sns

# A quick look at the dataset: return 5 random rows
df.sample(5)

# Return data types
df.dtypes

# Dimensions of the dataset
df.shape

# Statistical summary
df.describe()
df_summary = df.describe(include="all")
df_summary
```

In the summary, "top" is the most frequently occurring item, "freq" is the number of times the top item appears, and "NaN" means the statistic could not be calculated for that type of data.

```python
# Unique values: value counts of a specific column
df['category_id'].value_counts()
```

The output shows the unique values in column "category_id". The count of value "24" is 3911.

```python
# Look at a specific field
df.iloc[23, 5]

# Look at selected columns
columnWeCareAbout = ['title', 'views', 'likes', 'dislikes', 'comment_count']
df[columnWeCareAbout].sample(5)
```

Identify and handle missing data

```python
# Use a heatmap to check for missing data
sns.heatmap(df_summary.isnull(), yticklabels=False, cbar=False, cmap='viridis')

# See counts of missing values per column
for c in df_summary.columns:
    print(c, np.sum(df_summary[c].isnull()))

# Replace missing data
df_summary['views'].fillna(df_summary['views'].mean(), inplace=True)
```

The code above fills the missing data with the mean.

In real cases, you could also consider interpolation, the median, or other imputation methods.
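To make those options concrete, here is a small self-contained sketch, using a made-up Series rather than the YouTube data, comparing mean fill, median fill, and linear interpolation:

```python
import numpy as np
import pandas as pd

# A made-up column with two missing values
s = pd.Series([10.0, np.nan, 30.0, 40.0, np.nan, 60.0])

mean_filled = s.fillna(s.mean())      # every NaN gets the column mean (35.0)
median_filled = s.fillna(s.median())  # every NaN gets the median (35.0)
interpolated = s.interpolate()        # each NaN is filled from its neighbors

print(mean_filled.tolist())   # [10.0, 35.0, 30.0, 40.0, 35.0, 60.0]
print(interpolated.tolist())  # [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
```

Interpolation preserves local trends because each gap is filled from its neighbors, while mean or median fill is order-independent; which is appropriate depends on whether the rows have a meaningful ordering, such as the publish time here.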

```python
# Drop a column where most values are missing
df_summary.drop(['thumbnail_link'], axis=1, inplace=True)
sns.heatmap(df_summary.isnull(), yticklabels=False, cbar=False, cmap='viridis')
```

No missing values are detected after these steps.

Data formatting

```python
# Change the data type if needed
df['left_publish_time'] = pd.to_datetime(df['left_publish_time'],
                                         format='%Y-%m-%dT%H:%M:%S')

# Unit conversion, for example kilometers to miles
conv_fac = 0.621371  # conversion factor
miles = kilometers * conv_fac
```

Data normalization

Numbers in different ranges influence the result differently.

```python
df[['views', 'likes', 'dislikes', 'comment_count']].head()

# Simple feature scaling
df['views'] = df['views'] / df['views'].max()
df[['views', 'likes', 'dislikes', 'comment_count']].head()
```

This produces the new values in column "views".

```python
# Min-max normalization
df['likes'] = (df['likes'] - df['likes'].min()) / (df['likes'].max() - df['likes'].min())
df[['views', 'likes', 'dislikes', 'comment_count']].head()
```

This produces the normalized values in column "likes".

```python
# Z-score normalization
df['dislikes'] = (df['dislikes'] - df['dislikes'].mean()) / df['dislikes'].std()
df['comment_count'] = (df['comment_count'] - df['comment_count'].mean()) / df['comment_count'].std()
df[['views', 'likes', 'dislikes', 'comment_count']].head()
```

Binning

Binning groups values into bins, converting a numeric value into a categorical variable. "likes" is numeric, and we want to convert it into "Low", "Medium", and "High" to better represent a video's popularity.

```python
# Three equal-width bins need four edges; np.linspace handles float values
# and avoids the off-by-one that range(min, max, binwidth) would produce
bins = np.linspace(df['likes'].min(), df['likes'].max(), 4)
group_names = ['Low', 'Medium', 'High']
df['likes-binned'] = pd.cut(df['likes'], bins, labels=group_names, include_lowest=True)
df['likes-binned']
```

After binning, you can visualize the binned data, for example with a bar chart of the value counts.

One-hot encoding

One-hot encoding adds a dummy variable for each unique category and assigns 0 or 1 in each, converting a categorical variable to numeric.

```python
df['category_id'].sample(5)
category = pd.get_dummies(df['category_id'], drop_first=True)
category.head()

# Add the dummy variables to the data frame
df = pd.concat([df, category], axis=1)
df.sample(5)
```

Congrats! You finished a long article, and now you know a standardized data pre-processing solution that you can apply to any type of dataset.

One more piece for your data scientist puzzle! Quiz: why do we need data normalization? Next step: Exploratory Data Analysis (EDA) with Pandas on Trending YouTube Video Statistics.
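As a nudge toward the quiz, this self-contained sketch (with made-up numbers, not the YouTube data) puts the three normalization techniques from the article side by side: raw views dwarf raw likes, but after scaling both columns live in comparable ranges.

```python
import pandas as pd

# Made-up values: views are roughly 1000x larger than likes
df = pd.DataFrame({'views': [1000.0, 5000.0, 10000.0],
                   'likes': [1.0, 5.0, 10.0]})

# Simple feature scaling: divide by the max -> values in (0, 1]
scaled = df['views'] / df['views'].max()

# Min-max: subtract the min, divide by the range -> values in [0, 1]
minmax = (df['likes'] - df['likes'].min()) / (df['likes'].max() - df['likes'].min())

# Z-score: subtract the mean, divide by the std -> mean 0, std 1
z = (df['views'] - df['views'].mean()) / df['views'].std()

print(scaled.tolist())  # [0.1, 0.5, 1.0]
print(minmax.tolist())  # [0.0, 0.444..., 1.0]
```

Without scaling, any distance-based model would be dominated by views simply because its raw numbers are larger; after normalization, each column contributes on a comparable scale.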

