Data Cleaning, Detection and Imputation of Missing Values in R Markdown — Part 2

Data cleaning and transforming variables in R using Australian Open tennis data on the Men's tour from 2000 to 2018.

Today is Day 4 and marks the continuation of my #29dayproject bootcamp on things I have learnt in data science.

This is a tutorial on cleaning and pre-processing the Australian Open tennis data from 2000 to 2018 in R, with the aim of predicting who might win this year's tournament.

Yesterday we merged multiple CSV files and subsetted the data; you can follow along from the previous blog post.

What is the CRISP-DM methodology?

If you take a Data Analytics class at General Assembly, a university course on data science algorithms, an IT elective of interest, an online course, or work on a data mining project for your professor, you will come across the data mining principles of CRISP-DM.

CRISP-DM stands for the Cross-Industry Standard Process for Data Mining.

It is also the prescribed analytics workflow I learnt at General Assembly Sydney, and its principles are used in data science projects when you are consulting in government.

This is the workflow of the CRISP-DM methodology, used to guide you through data analytics and data science problems.

You may review each of the steps from Smart Vision Europe 2018.

Source: Smart Vision 2018

Business Understanding

This is not a business problem as such, but the scope of the problem statement was to predict the winner of this year's Australian Open tennis final based on data from 2000 to 2018.

Data Understanding

Previously we obtained a data dictionary and explored the data by analysing its overall structure.

Here is the data dictionary of the variables:

ATP = Tournament number (men)
WTA = Tournament number (women)
Location = Venue of tournament
Tournament = Name of tournament (including sponsor if relevant)
Date = Date of match (note: prior to 2003 the date shown for all matches played in a single tournament is the start date)
Series = Name of ATP tennis series (Grand Slam, Masters, International or International Gold)
Tier = Tier (tournament ranking) of WTA tennis series
Court = Type of court (outdoors or indoors)
Surface = Type of surface (clay, hard, carpet or grass)
Round = Round of match
Best of = Maximum number of sets playable in match
Winner = Match winner
Loser = Match loser
WRank = ATP Entry ranking of the match winner as of the start of the tournament
LRank = ATP Entry ranking of the match loser as of the start of the tournament
WPts = ATP Entry points of the match winner as of the start of the tournament
LPts = ATP Entry points of the match loser as of the start of the tournament
W1 = Number of games won in 1st set by match winner
L1 = Number of games won in 1st set by match loser
W2 = Number of games won in 2nd set by match winner
L2 = Number of games won in 2nd set by match loser
W3 = Number of games won in 3rd set by match winner
L3 = Number of games won in 3rd set by match loser
W4 = Number of games won in 4th set by match winner
L4 = Number of games won in 4th set by match loser
W5 = Number of games won in 5th set by match winner
L5 = Number of games won in 5th set by match loser
Wsets = Number of sets won by match winner
Lsets = Number of sets won by match loser
Comment = Comment on the match (Completed, won through retirement of loser, or via Walkover)

Data Preparation — data cleaning

The data has been cleaned by encoding the categorical variables, transforming variables and detecting missing values.

We have already merged the original data, and the "aus_open.csv" data frame is read into R with stringsAsFactors = FALSE to ensure that character variables are not automatically read as factors.

###################################
# Pre-Processing the Training Data
###################################
# The exported aus_open.csv file had a new column created in Excel to extract the year from the non-standardised date formatting
a <- read.csv("aus_open.csv", stringsAsFactors = FALSE, header = TRUE)

Exploratory Data Analysis

Using Hadley Wickham's dplyr package, I explored the structure of the data using the function glimpse().

#############################
# Exploratory Data Analysis
#############################
library(dplyr) # provides glimpse()
glimpse(a) # view the structure of the training data

Descriptive Statistics

The output of the descriptive statistics is provided below: a summary of all of the numeric variables and the missing values.

summary(a) # descriptive statistics

      ATP      Location          Tournament            Date
 Min.   :6   Length:2413        Length:2413        Length:2413
 1st Qu.:6   Class :character   Class :character   Class :character
 Median :6   Mode  :character   Mode  :character   Mode  :character
 Mean   :6
 3rd Qu.:6
 Max.   :6

      Year         Series             Court             Surface
 Min.   :2000   Length:2413        Length:2413        Length:2413
 1st Qu.:2004   Class :character   Class :character   Class :character
 Median :2009   Mode  :character   Mode  :character   Mode  :character
 Mean   :2009
 3rd Qu.:2014
 Max.   :2018

    Round             Best.of      Winner             Loser
 Length:2413        Min.   :5   Length:2413        Length:2413
 Class :character   1st Qu.:5   Class :character   Class :character
 Mode  :character   Median :5   Mode  :character   Mode  :character
                    Mean   :5
                    3rd Qu.:5
                    Max.   :5

     WRank            LRank                 W1              L1
 Min.   :  1.00   Length:2413        Min.   :    0   Length:2413
 1st Qu.: 10.00   Class :character   1st Qu.:    7   Class :character
 Median : 28.00   Mode  :character   Median :  860   Mode  :character
 Mean   : 46.97                      Mean   : 1863
 3rd Qu.: 65.00                      3rd Qu.: 2145
 Max.   :768.00                      Max.   :16790
                                     NA's   :128

       W2              L2              W3              L3
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000
 1st Qu.:6.000   1st Qu.:3.000   1st Qu.:6.000   1st Qu.:2.000
 Median :6.000   Median :4.000   Median :6.000   Median :4.000
 Mean   :5.687   Mean   :4.027   Mean   :5.743   Mean   :3.903
 3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:6.000
 Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.000
 NA's   :10      NA's   :10      NA's   :33      NA's   :33

       W4              L4              W5               L5
 Min.   :0.000   Min.   :0.000   Min.   : 0.000   Min.   : 0.000
 1st Qu.:6.000   1st Qu.:2.000   1st Qu.: 6.000   1st Qu.: 2.000
 Median :6.000   Median :4.000   Median : 6.000   Median : 4.000
 Mean   :5.753   Mean   :3.734   Mean   : 5.883   Mean   : 3.813
 3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.: 6.000   3rd Qu.: 6.000
 Max.   :7.000   Max.   :7.000   Max.   :21.000   Max.   :19.000
 NA's   :346     NA's   :346     NA's   :1440     NA's   :1440

     Wsets            Lsets        Comment
 Min.   : 0.000   Min.   : 0.0   Length:2413
 1st Qu.: 3.000   1st Qu.: 0.0   Class :character
 Median : 3.000   Median : 1.0   Mode  :character
 Mean   : 4.192   Mean   : 1.7
 3rd Qu.: 6.000   3rd Qu.: 2.0
 Max.   :22.000   Max.   :20.0
 NA's   :1444     NA's   :1444

Transforming and encoding categorical variables

From the glimpse of the structure above, some of the data attributes were not cast as their correct data type.

This is my favourite part: the chance to transform the variables! Here we go…

# Transform character variables into numeric variables
a$W1 <- as.numeric(a$W1)
a$L1 <- as.numeric(a$L1)
a$WRank <- as.numeric(a$WRank)
a$LRank <- as.numeric(a$LRank)

##########################################################
# Encoding categorical features
##########################################################
# Convert categorical variables into factors to represent their levels
a$Location <- factor(a$Location)
a$Tournament <- factor(a$Tournament)
a$Series <- factor(a$Series)
a$Court <- factor(a$Court)
a$Surface <- factor(a$Surface)
a$Best.of <- factor(a$Best.of)
a$Round <- factor(a$Round)
a$Winner <- factor(a$Winner)
a$Loser <- factor(a$Loser)
a$Comment <- factor(a$Comment)

glimpse(a) # check that the categorical variables have been converted to factors with levels

# Create dummy variables for the categorical variables with more than 2 levels
library(dummies)
Round <- dummy(a$Round)
Best.of <- dummy(a$Best.of)
Winner <- dummy(a$Winner)
Loser <- dummy(a$Loser)
Comment <- dummy(a$Comment)

head(a) # check that the values have been converted to dummy variables
str(a)

Summarise the descriptive statistics and check the structure of your transformed variables:

# Descriptive statistics
summary(a)

# View the structure of the transformed variables the 'dplyr' way
> glimpse(a)
Observations: 2,413
Variables: 27
$ ATP        <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
$ Location   <fct> Melbourne, Melbourne, Melbourne, Melbourne, Melbourne, ...
$ Tournament <fct> Australian Open, Australian Open, Australian Open, Aus...
$ Date       <chr> "1/17/00", "1/17/00", "1/17/00", "1/17/00", "1/17/00", ...
$ Year       <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, ...
$ Series     <fct> Grand Slam, Grand Slam, Grand Slam, Grand Slam, Grand ...
$ Court      <fct> Outdoor, Outdoor, Outdoor, Outdoor, Outdoor, Outdoor, ...
$ Surface    <fct> Hard, Hard, Hard, Hard, Hard, Hard, Hard, Hard, Hard, ...
$ Round      <fct> 1st Round, 1st Round, 1st Round, 1st Round, 1st Round, ...
$ Best.of    <fct> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ...
$ Winner     <fct> Agassi A., Alami K., Arazi H., Behrend T., Bjorkman J. ...
$ Loser      <fct> Puerta M., Manta L., Alonso J., Meligeni F., Stoltenbe...
$ WRank      <dbl> 1, 35, 41, 106, 76, 151, 39, 54, 30, 64, 98, 29, 34, 6...
$ LRank      <dbl> 112, 107, 111, 28, 81, 57, 22, 66, 51, 155, 119, 257, ...
$ W1         <dbl> 6, 6, 6, 6, 6, 7, 3, 7, 7, 7, 6, 6, 6, 6, 6, 7, 6, 6, ...
$ L1         <dbl> 2, 4, 3, 2, 7, 6, 6, 6, 6, 6, 4, 7, 7, 4, 3, 6, 4, 3, ...
$ W2         <int> 6, 7, 7, 4, 6, 6, 6, 6, 6, 5, 6, 7, 6, 6, 6, 6, 7, 6, ...
$ L2         <int> 2, 6, 6, 6, 4, 1, 1, 4, 4, 7, 4, 6, 3, 4, 3, 3, 6, 3, ...
$ W3         <int> 6, 7, 6, 6, 6, 6, 6, NA, 6, 6, 7, 1, 7, 7, 4, 7, 4, 6, ...
$ L3         <int> 3, 5, 2, 7, 4, 4, 4, NA, 4, 3, 6, 6, 5, 6, 6, 6, 6, 2, ...
$ W4         <int> NA, NA, NA, 6, 0, NA, 7, NA, NA, 7, NA, 6, 6, NA, 7, N...
$ L4         <int> NA, NA, NA, 3, 6, NA, 6, NA, NA, 5, NA, 3, 1, NA, 6, N...
$ W5         <int> NA, NA, NA, 6, 6, NA, NA, NA, NA, NA, NA, 6, NA, NA, N...
$ L5         <int> NA, NA, NA, 0, 4, NA, NA, NA, NA, NA, NA, 1, NA, NA, N...
$ Wsets      <int> 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
$ Lsets      <int> 0, 0, 0, 2, 2, 0, 1, 0, 0, 1, 0, 2, 1, 0, 1, 0, 2, 0, ...
$ Comment    <fct> Completed, Completed, Completed, Completed, Completed, ...

Detecting Missing Values

We will try a few methods to detect the missing values, such as summing the total number of missing values, taking the mean, and counting the number of missing values per column.

# Sum the number of missing values
> sum(is.na(a))
[1] 6810

# Overall proportion of missing values in the data frame
> mean(is.na(a))
[1] 0.1045264

# Count the number of missing values per column
> colSums(is.na(a))
       ATP   Location Tournament       Date       Year     Series      Court
         0          0          0          0          0          0          0
   Surface      Round    Best.of     Winner      Loser      WRank      LRank
         0          0          0          0          0          0          5
        W1         L1         W2         L2         W3         L3         W4
       128        131         10         10         33         33        346
        L4         W5         L5      Wsets      Lsets    Comment
       346       1440       1440       1444       1444          0

Obtain the percentage of the missing values

When detecting missing values, 5% is the acceptable threshold for each column.

The output below confirms that the columns L4, W4, L5, W5, Wsets and Lsets have more than 5% missing values and can be removed or imputed.
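As a quick cross-check, a short filter along these lines (my own sketch, not part of the original script, assuming the training data frame a from above) pulls out just the columns whose share of missing values exceeds the 5% threshold:

miss_frac <- colMeans(is.na(a))    # proportion of missing values in each column
names(miss_frac[miss_frac > 0.05]) # columns above the 5% threshold

The full per-column proportions reported by sapply() are shown below.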

sapply(a, function(df){ sum(is.na(df) == TRUE)/length(df); })

        ATP    Location  Tournament        Date        Year      Series
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
      Court     Surface       Round     Best.of      Winner       Loser
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
      WRank       LRank          W1          L1          W2          L2
0.000000000 0.002072109 0.053046001 0.054289266 0.004144219 0.004144219
         W3          L3          W4          L4          W5          L5
0.013675922 0.013675922 0.143389971 0.143389971 0.596767509 0.596767509
      Wsets       Lsets     Comment
0.598425197 0.598425197 0.000000000

We installed the Amelia package via install.packages("Amelia") in the console to assist with plotting a map for visualising the missing values.
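The map itself can be drawn with Amelia's missmap() function. A minimal sketch, assuming the training data frame a from above (the plot title is my own choice):

library(Amelia)
# Map of missing versus observed values for every column of the training data
missmap(a, main = "Missing values in the Australian Open training data")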

From the map we observe that missing values are detected in the following columns:

Lsets — Loser sets
Wsets — Winner sets
W5 — Winner in the fifth set
L5 — Loser in the fifth set
L4 — Loser in the fourth set
W4 — Winner in the fourth set

Plot of the percentage of missing values

Why did I not remove all of the missing values?

I tried this, but the last line of code below, which uses a vector of the rows containing NAs to drop them, also removed more than 50% of my training data, so it was not helpful!

View(a) # view the missing values
complete.cases(a) # logical vector of complete rows
which(complete.cases(a)) # which rows have full (non-missing) values
which(!complete.cases(a)) # which rows contain NA values
na_vec <- which(complete.cases(a))
na_vec <- which(!complete.cases(a)) # create a vector of the rows with NA values
a[-na_vec, ] # data frame with the NA rows removed
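To see how drastic that listwise deletion would be, a one-line check (my own addition, not in the original script) puts a number on it:

# Proportion of rows containing at least one NA, i.e. the share of the training data that would be dropped
length(na_vec) / nrow(a) # comes out well above 0.5, consistent with the ~60% missingness in W5, L5, Wsets and Lsets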

How to impute missing data? — numeric variables

The imputed data was plotted to understand how its distribution compares with that of the original data. The imputation techniques that I followed are from Michy Alice's blog. These are the steps:

library(mice)

# Impute missing values with "pmm" (predictive mean matching); m = 5 imputed data sets is the default
imputed_Data <- mice(a, m = 5, maxit = 50, method = 'pmm', seed = 500)
summary(imputed_Data)

# Inspect that the missing data has been imputed
imputed_Data$imp$Lsets

# Check the imputation method used
imputed_Data$meth

# Plot the imputed data and inspect the distribution
xyplot(imputed_Data, WRank ~ W1+L1+W2+L2+W3+L3+W4+L4+L5+W5+LRank, pch = 18, cex = 1)
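The script stops at inspecting the mids object; if you then want one of the m = 5 completed data sets to carry forward, mice's complete() function extracts it. A minimal sketch (the name a_completed is my own):

# Extract the first of the five completed (imputed) data sets
a_completed <- complete(imputed_Data, 1)
colSums(is.na(a_completed)) # check how many missing values remain in each column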

Exploratory Data Analysis

ggplot — Density plot of numeric variables

I used ggplot to examine the numeric attributes that had less than 5% missing values and visualise them via density plots. Lsets had more than 5% missing values.

library(ggplot2)
p1 <- ggplot(a, aes(x = a$Year)) + geom_histogram() + ggtitle("Histogram of Year")
p1
p2 <- ggplot(a, aes(x = a$WRank)) + geom_histogram() + ggtitle("Histogram of Winner's Ranking")
p2
p3 <- ggplot(a, aes(x = a$LRank)) + geom_histogram() + ggtitle("Histogram of Loser's Ranking")
p3
p4 <- ggplot(a, aes(x = a$W1)) + geom_histogram() + ggtitle("Histogram of Winner in the first set")
p4
p5 <- ggplot(a, aes(x = a$L1)) + geom_histogram() + ggtitle("Histogram of Loser in the first set")
p5
p6 <- ggplot(a, aes(x = a$W2)) + geom_histogram() + ggtitle("Histogram of Winner in the second set")
p6
p7 <- ggplot(a, aes(x = a$L2)) + geom_histogram() + ggtitle("Histogram of Loser in the second set")
p7
p8 <- ggplot(a, aes(x = a$W3)) + geom_histogram() + ggtitle("Histogram of Winner in the third set")
p8
p9 <- ggplot(a, aes(x = a$L3)) + geom_histogram() + ggtitle("Histogram of Loser in the third set")
p9

Visualise categorical variables that have been dummy coded or one-hot encoded

p16 <- plot(x = a$Comment, main = "Distribution of Comment", xlab = "Comment", ylab = "Count")
p16
p17 <- plot(x = a$Winner, main = "Distribution of Winner", xlab = "Winner", ylab = "Count")
p17
p18 <- plot(x = a$Loser, main = "Distribution of Loser", xlab = "Loser", ylab = "Count")
p18
p19 <- plot(x = a$Best.of, main = "Distribution of Best.of", xlab = "Best Of", ylab = "Count")
p19
p20 <- plot(x = a$Round, main = "Distribution of Tennis Round", xlab = "Round", ylab = "Count")
p20

This is a density plot of the imputed numeric data:

densityplot(imputed_Data)

Key: magenta (imputed data) and blue (observed data)
Density plot of the imputed data containing numeric variables

# View the data as individual points
stripplot(imputed_Data, pch = 20, cex = 1.2)

Stripplot of the individual imputed variables containing numeric variables

The R script for this pre-processing and EDA is shared here on GitHub.

The R Markdown script is also saved on GitHub and can be rendered to an HTML file via the Knit to HTML command; this is especially useful for viewing all of the plots of the data attributes in a web browser.

** Note to self:
1. When using R Markdown, insert the R code below the ```{r cars} chunk header, as per the screenshot below; to be exact it is line 23.
2. Secondly, I inserted the second chunk of R code, relating to the charts and plots, below the ```{r pressure, echo=FALSE} chunk header; the code is inserted at line 214.
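For context, this is roughly what the first chunk looks like in the .Rmd file; the cars and pressure labels come from RStudio's default R Markdown template, and the body shown here is just the glimpse() call from earlier rather than the full script:

```{r cars}
# R code pasted below the chunk header of the default template
glimpse(a)
```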

After we have visualised the data in R, we want to use other analytic tools such as Tableau to explore and visualise Australian Open data.

Happy coding as we continue in the next post with Tableau in Part 3!
