An Introduction to R— Merging and filtering data— Part 1

An Introduction to R— Merging and filtering data— Part 1Wendy WongBlockedUnblockFollowFollowingJan 9Data understanding by filtering and merging the 2019 Australian Tennis Open data for the Men’s tour.

Photo by Christopher Burns on UnsplashYou know it’s summer when the Australian Tennis Open visits Melbourne and everyone is excited that Roger and Serena are in town.

ProblemI am interested to predict who might win the 2019 Australian Tennis Open on the men’s tour.

I hope it is Roger Federer before he retires.

DatasetI have selected the historical tennis tournament results for men from the website http://www.

tennis-data.

co.

uk/data.

php for the period 2000 to January 2019 which are provided in CSV files.

Data DictionaryThe notes accompanying the tennis data explains the attributes of the data which is provided by Tennis Betting:Key to results data:ATP = Tournament number (men)WTA = Tournament number (women)Location = Venue of tournamentTournament = Name of tournament (including sponsor if relevant)Data = Date of match (note: prior to 2003 the date shown for all matches played in a single tournament is the start date)Series = Name of ATP tennis series (Grand Slam, Masters, International or International Gold)Tier = Tier (tournament ranking) of WTA tennis series.

Court = Type of court (outdoors or indoors)Surface = Type of surface (clay, hard, carpet or grass)Round = Round of matchBest of = Maximum number of sets playable in matchWinner = Match winnerLoser = Match loserWRank = ATP Entry ranking of the match winner as of the start of the tournamentLRank = ATP Entry ranking of the match loser as of the start of the tournamentWPts = ATP Entry points of the match winner as of the start of the tournamentLPts = ATP Entry points of the match loser as of the start of the tournamentW1 = Number of games won in 1st set by match winnerL1 = Number of games won in 1st set by match loserW2 = Number of games won in 2nd set by match winnerL2 = Number of games won in 2nd set by match loserW3 = Number of games won in 3rd set by match winnerL3 = Number of games won in 3rd set by match loserW4 = Number of games won in 4th set by match winnerL4 = Number of games won in 4th set by match loserW5 = Number of games won in 5th set by match winnerL5 = Number of games won in 5th set by match loserWsets = Number of sets won by match winnerLsets = Number of sets won by match loserComment = Comment on the match (Completed, won through retirement of loser, or via Walkover)Merging csv files into a single data file in Mac Book ProI came across a Youtube tutorial from Trent Jessee to help me merge multiple csv files from 2000 to 2019 into a single file on a Macbook Pro.

# Open a Terminal session and entercd Desktop# Enter cd and the name of the folder where you have saved your csv files cd CSV# Merge files with this commandcat *.

csv>merged.

csvLoad or import your data into R# Set the working directorysetwd("~/Desktop/ATP")# Read the dataframe into Rstudio as a csv file.

tennis_data <- read.

csv("merged.

csv",stringsAsFactors = FALSE, header = TRUE)# Review the first 5 observationshead(tennis_data)## ATP Location Tournament Date Series## 1 1 Adelaide Australian Hardcourt Championships 1/3/00 International## 2 1 Adelaide Australian Hardcourt Championships 1/3/00 International## 3 1 Adelaide Australian Hardcourt Championships 1/3/00 International## 4 1 Adelaide Australian Hardcourt Championships 1/3/00 International## 5 1 Adelaide Australian Hardcourt Championships 1/3/00 International## 6 1 Adelaide Australian Hardcourt Championships 1/3/00 International## Court Surface Round Best.

of Winner Loser WRank## 1 Outdoor Hard 1st Round 3 Dosedel S.

Ljubicic I.

63## 2 Outdoor Hard 1st Round 3 Enqvist T.

Clement A.

5## 3 Outdoor Hard 1st Round 3 Escude N.

Baccanello P.

40## 4 Outdoor Hard 1st Round 3 Federer R.

Knippschild J.

65## 5 Outdoor Hard 1st Round 3 Fromberg R.

Woodbridge T.

81## 6 Outdoor Hard 1st Round 3 Gambill J.

M.

Arthurs W.

58## LRank W1 L1 W2 L2 W3 L3 W4 L4 W5 L5 Wsets Lsets Comment X X.

1 X.

2 X.

3## 1 77 6 4 6 2 NA NA NA NA NA NA 2 0 Completed ## 2 56 6 3 6 3 NA NA NA NA NA NA 2 0 Completed ## 3 655 6 7 7 5 6 3 NA NA NA NA 2 1 Completed ## 4 87 6 1 6 4 NA NA NA NA NA NA 2 0 Completed ## 5 198 7 6 5 7 6 4 NA NA NA NA 2 1 Completed ## 6 105 3 6 7 6 6 4 NA NA NA NA 2 1 Completed## Predicting who will win the Men’s 2019 Australian Tennis Open based on data from 2000 to 2019# Load packageslibrary(dplyr)# Set the working directorysetwd(“~/Desktop/ATP”)# Read the dataframe into Rstudio as a csv file.

tennis_data <- read.

csv(“merged.

csv”,stringsAsFactors = FALSE, header = TRUE)# Review the first 5 observations.

It is best practice to use tail() to view the last 5 observations also.

head(tennis_data)View structure of the merged dataIn R we can explore the structure of the data to check the attributes of the data to observe the raw data before it is pre-processed.

str(tennis_data)There were 52383 rows and 83 columns in the merged data file.

## 'data.

frame': 52383 obs.

of 83 variables:## $ ATP : int 1 1 1 1 1 1 1 1 1 1 .

## $ Location : chr "Adelaide" "Adelaide" "Adelaide" "Adelaide" .

## $ Tournament: chr "Australian Hardcourt Championships" "Australian Hardcourt Championships" "Australian Hardcourt Championships" "Australian Hardcourt Championships" .

## $ Date : chr "1/3/00" "1/3/00" "1/3/00" "1/3/00" .

## $ Series : chr "International" "International" "International" "International" .

## $ Court : chr "Outdoor" "Outdoor" "Outdoor" "Outdoor" .

## $ Surface : chr "Hard" "Hard" "Hard" "Hard" .

## $ Round : chr "1st Round" "1st Round" "1st Round" "1st Round" .

## $ Best.

of : int 3 3 3 3 3 3 3 3 3 3 .

## $ Winner : chr "Dosedel S.

" "Enqvist T.

" "Escude N.

" "Federer R.

" .

## $ Loser : chr "Ljubicic I.

" "Clement A.

" "Baccanello P.

" "Knippschild J.

" .

## $ WRank : chr "63" "5" "40" "65" .

## $ LRank : chr "77" "56" "655" "87" .

## $ W1 : chr "6" "6" "6" "6" .

## $ L1 : chr "4" "3" "7" "1" .

## $ W2 : int 6 6 7 6 5 7 6 7 2 6 .

## $ L2 : int 2 3 5 4 7 6 1 6 6 7 .

## $ W3 : int NA NA 6 NA 6 6 NA NA 6 6 .

## $ L3 : int NA NA 3 NA 4 4 NA NA 1 4 .

## $ W4 : int NA NA NA NA NA NA NA NA NA NA .

## $ L4 : int NA NA NA NA NA NA NA NA NA NA .

## $ W5 : int NA NA NA NA NA NA NA NA NA NA .

## $ L5 : int NA NA NA NA NA NA NA NA NA NA .

## $ Wsets : int 2 2 2 2 2 2 2 2 2 2 .

## $ Lsets : int 0 0 1 0 1 1 0 0 1 1 .

## $ Comment : chr "Completed" "Completed" "Completed" "Completed" .

The dimensions of the data frameThe dimensions of the dataframe include — 52383 rows and 83 columns.

dim(tennis_data)Check for any missing valuesThere a quite a few missing values.

Imputation of missing values may be required after an assessment.

is.

na(tennis_data)Filtering and subsetting the data in RI needed to filter my data so it was more manageable to work with, I was interested in analysing Australian Tennis Open tournaments from 2000 to 2019.

# We want to view only the first 26 columns of results datanames(tennis_data)[1:26]# We want to filter data for the Australian Tennis Open tournaments so that we can work with a subset of data:aust_open <- tennis_data[tennis_data$Tournament==”Australian Open”, 1:26]# View structure of the subset of data that relates only to Australian Open tournaments:str(aust_open)## 'data.

frame': 2413 obs.

of 26 variables:## $ ATP : int 6 6 6 6 6 6 6 6 6 6 .

## $ Location : chr "Melbourne" "Melbourne" "Melbourne" "Melbourne" .

## $ Tournament: chr "Australian Open" "Australian Open" "Australian Open" "Australian Open" .

## $ Date : chr "1/17/00" "1/17/00" "1/17/00" "1/17/00" .

## $ Series : chr "Grand Slam" "Grand Slam" "Grand Slam" "Grand Slam" .

## $ Court : chr "Outdoor" "Outdoor" "Outdoor" "Outdoor" .

## $ Surface : chr "Hard" "Hard" "Hard" "Hard" .

## $ Round : chr "1st Round" "1st Round" "1st Round" "1st Round" .

## $ Best.

of : int 5 5 5 5 5 5 5 5 5 5 .

## $ Winner : chr "Agassi A.

" "Alami K.

" "Arazi H.

" "Behrend T.

" .

## $ Loser : chr "Puerta M.

" "Manta L.

" "Alonso J.

" "Meligeni F.

" .

## $ WRank : chr "1" "35" "41" "106" .

## $ LRank : chr "112" "107" "111" "28" .

## $ W1 : chr "6" "6" "6" "6" .

## $ L1 : chr "2" "4" "3" "2" .

## $ W2 : int 6 7 7 4 6 6 6 6 6 5 .

## $ L2 : int 2 6 6 6 4 1 1 4 4 7 .

## $ W3 : int 6 7 6 6 6 6 6 NA 6 6 .

## $ L3 : int 3 5 2 7 4 4 4 NA 4 3 .

## $ W4 : int NA NA NA 6 0 NA 7 NA NA 7 .

## $ L4 : int NA NA NA 3 6 NA 6 NA NA 5 .

## $ W5 : int NA NA NA 6 6 NA NA NA NA NA .

## $ L5 : int NA NA NA 0 4 NA NA NA NA NA .

## $ Wsets : int 3 3 3 3 3 3 3 2 3 3 .

## $ Lsets : int 0 0 0 2 2 0 1 0 0 1 .

## $ Comment : chr "Completed" "Completed" "Completed" "Completed" .

Write and export your dataframeNow that I am satisfied with my data that only contains Australian Open tournaments I will write the file and export it out as a csv file so that I can use it for pre-processing my data and also data visualisation in R and Tableau.

# Save the dataframe to a csv file to write the csv file into R working folder:write.

csv(aust_open,file = "aust_open.

csv", row.

names = FALSE)My next post will include:Pre-processing the data in RData visualisation in R to see what the data looks likeExploratory data analysis of the data in RTreatment of missing dataThanks for reading and stay tuned, happy coding!.. More details

Leave a Reply