Data Cleaning with R and the Tidyverse: Detecting Missing Values

Data Cleaning with R and the Tidyverse: Detecting Missing ValuesJohn SullivanBlockedUnblockFollowFollowingMar 21Data cleaning is one of the most important aspects of data science.

As a data scientist, you can expect to spend up to 80% of your time cleaning data.

In a previous post I walked through a number of data cleaning tasks using Python and the Pandas library.

That post got so much attention, I wanted to follow it up with an example in R.

In this post you’ll learn how to detect missing values using the tidyr and dplyr packages from the Tidyverse.

The Tidyverse is the best collection of R packages for data science, so you should become familiar with it.

Getting StartedA good way to start any data science project is to get a feel for the data.

This is just a quick look to see the variable names and expected variable types.

Looking at the dimensions of the data is also useful.

Exploratory data analysis (EDA) is extremely important, so it deserves its own blog post.

We won’t go over a full EDA in this article.

Before we get started, head on over to our github page to grab a copy of the data.

Make sure to put a copy in the same working directory where your R code will be.

Here’s a quick look at our data:This is a small customer churn dataset.

For purposes of learning, this dataset shows some great real-world examples of missing values.

To start, load the tidverse library and read in the csv file.

library(tidyverse)# set working directorypath_loc <- "C:/Users/Jonathan/Desktop/data cleaning with R post"setwd(path_loc)# reading in the datadf <- read_csv("telecom.

csv")Usually the data is read in to a dataframe, but the tidyverse actually uses tibbles.

These are similar to dataframes, but also slightly different.

To learn more about tibbles, check out this chapter from R for Data Science.

I like to use the glimpse function to look at the variable names and types.

# taking a quick lookglimpse(df)> glimpse(df)Observations: 10Variables: 5$ customerID chr "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7.

$ MonthlyCharges dbl 29.

85, 56.

95, NA, 42.

30, 70.

70, NaN, 89.

10, .

$ TotalCharges chr "109.

9", "na", "108.

15", "1840.

75", NA, "820.

$ PaymentMethod chr "Electronic check", "Mailed check", "–", "B.

$ Churn chr "yes", "yes", "yes", "no", "no", "yes", "no".

We can see that there’s 5 variables.

customerIDMonthlyChargesTotalChargesPaymentMethodChurnThere’s also a description of the type for each variable:customerID: chr which stands for character, another name for a stringMonthlyCharges: dbl which stands for double, which is a numeric typeTotalCharges: chr characterPaymentMethod: chrcharacterChurn: chrcharacterThere’s 10 observations, which means there’s 10 rows of data.

Now that we’ve taken a quick look to become familiar with the data, let’s go over some basic data manipulation.

A Grammar of Data Manipulation: dplyrBefore we get started with missing values, let’s go over the dplyr library.

This is just a quick introduction, so be sure to check out the official dplyr documentation as well as Chapter 5 Data Transformation from R for Data Science.

This library uses a “grammar of data manipulation” which basically means that there’s a set of functions with logical verb names for what you want to do.

For example, maybe you want to only look at customers that churned.

You can filter the data on Churn values equal to “yes”.

We can quickly do that using the filter function from dplyr.

# filter on customers that churneddf %>% filter(Churn=="yes")# A tibble: 5 x 5 customerID MonthlyCharges TotalCharges PaymentMethod Churn chr dbl chr chr chr1 7590-VHVEG 29.

8 109.

9 Electronic check yes2 5575-GNVDE 57.

0 na Mailed check yes3 3668-QPYBK NA 108.

15 — yes4 9305-CDSKC NaN 820.

5 — yes5 6713-OKOMC NA N/A NA yesTaking a look we can see that R returned an organized tibble that only includes customers that churned.

If you’re not familiar with the %>% operator, also known as the “pipe operator” check out this great blog post.

The pipe is a useful operator that comes from the magrittr package.

It allows us to organize our code by eliminating nested parentheses so that we can make our code more readable.

For example, let’s say we had the following calculation:# nested functionslog(sin(exp(2)))> log(sin(exp(2)))[1] -0.

1122118With all of the parentheses, this isn’t very readable.

Now let’s look at a piped example.

# piped functions2 %>% exp() %>% sin() %>% log()It’s easy to see that the piped example is much more readable.

Okay, back to dplyr.

We just used the filter function to quickly filter out rows with a Churn value equal to “yes”.

Maybe we also want to just select the customerID and TotalChargescolumns.

We can quickly do that as well using the select function.

# filter on customers that churned,# select customerID and TotalCharges columnsdf %>% filter(Churn=="yes") %>% select(customerID, TotalCharges)# A tibble: 5 x 2 customerID TotalCharges chr chr1 7590-VHVEG 109.

92 5575-GNVDE na3 3668-QPYBK 108.

154 9305-CDSKC 820.

55 6713-OKOMC N/AWe can see just how easy it is to manipulate our data using these dplyr functions.

Chaining functions together vertically makes our code extremely readable.

This way of coding might seem a little strange at first, but after a little practice it will become extremely useful.

Standard Missing ValuesNow that we’re a little bit more familiar with the pipe operator and dplyr, let’s dive right in to detecting missing values.

We’ll start by looking at standard missing values that R recognizes.

Go ahead and take a look at the MonthlyCharges column.

We can see that there’s three missing values.

There’s two empty cells, and one with “Nan”.

These are obviously missing values.

We can see how R recognizes these using the is.

na function.

First let’s print out that column and then apply is.

na.

# looking at MonthlyChargesdf$MonthlyChargesis.

na(df$MonthlyCharges)> df$MonthlyCharges [1] 29.

85 56.

95 NA 42.

30 70.

70 NaN 89.

10 NA 104.

80[10] 54.

10> is.

na(df$MonthlyCharges) [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSEWe can see that the two missing cells were recognized as “NA” and the other missing value with Nan was identified by R as “NaN”.

When we run the is.

na function, R recognizes both types of missing values.

We can see this because there’s three TRUE values that are returned when we run is.

na.

It’s important to note the difference between “NA” and “NaN”.

We can use the help function to take a closer look at both values.

# using the help function to learn about NAhelp(NA)Taking a look at the bottom right window we can see that “NA” or “Not Available” is used for missing values.

“NaN” or “Not a Number” is used for numeric calculations.

If a value is undefined, such as 0/0, “NaN” is the appropriate way to represent this.

There is also a is.

nan function.

Try running this with both “NA” and “NaN”.

You’ll see that it returns a value of TRUE for “NaN” but FALSE for “NA”.

The is.

na function on the other hand is more generic, so it will detect both types of missing values.

Let’s go ahead and use dplyr to summarize our data a little bit.

We can use the distinct function to look at the distinct values that show up in the MonthlyCharges column.

# looking at the distinct valuesdf %>% distinct(MonthlyCharges)# A tibble: 9 x 1 MonthlyCharges dbl1 29.

82 57.

03 NA4 42.

35 70.

76 NaN7 89.

18 105.

9 54.

1We can see there’s 9 distinct values.

There’s 10 rows of data, but “NA” shows up twice, so there’s 9 distinct values.

If we want to get a quick count of the distinct values we can use the summarisefunction.

# counting unique valuesdf %>% summarise(n = n_distinct(MonthlyCharges))# A tibble: 1 x 1 n int1 9This returns a simple tibble with a column that we named “n” for the count of distinct values in the MonthlyCharges column.

What we’re really after is the count of missing values.

We can use the summarise function along with is.

na to count the missing values.

# counting missing valuesdf %>% summarise(count = sum(is.

na(MonthlyCharges)))# A tibble: 1 x 1 count int1 3As we saw above, the number of missing values is 3.

Maybe we want to do multiple things at once.

Let’s say we want to get a count of unique values, as well as missing values, and also the median value of MonthlyCharges.

Here’s how we can do that using summarise:# counting unique, missing, and median valuesdf %>% summarise(n = n_distinct(MonthlyCharges), na = sum(is.

na(MonthlyCharges)), med = median(MonthlyCharges, na.

rm = TRUE))# A tibble: 1 x 3 n na med int int dbl1 9 3 57.

0This produces an organized little tibble of our summary data.

Now that we’ve identified the missing values, let’s replace them with the median value of MonthlyCharges.

To do that, we can use the mutate function from dplyr.

# mutate missing valuesdf %>% mutate(MonthlyCharges = replace(MonthlyCharges, is.

na(MonthlyCharges), median(MonthlyCharges, na.

rm = TRUE)))# A tibble: 10 x 5 customerID MonthlyCharges TotalCharges PaymentMethod Churn chr dbl chr chr chr 1 7590-VHVEG 29.

8 109.

9 Electronic check yes 2 5575-GNVDE 57.

0 na Mailed check yes 3 3668-QPYBK 57.

0 108.

15 — yes 4 7795-CFOCW 42.

3 1840.

75 Bank transfer no 5 9237-HQITU 70.

7 NA Electronic check no 6 9305-CDSKC 57.

0 820.

5 — yes 7 1452-KIOVK 89.

1 1949.

4 Credit card no 8 6713-OKOMC 57.

0 N/A NA yes 9 7892-POOKP 105.

3046.

05 Electronic check no10 8451-AJOMK 54.

1 354.

95 Electronic check noWe can see that the missing values were replaced with the median value 57 in three different spots.

Just to double check that this worked, lets print out the whole tibble again.

df# A tibble: 10 x 5 customerID MonthlyCharges TotalCharges PaymentMethod Churn chr dbl chr chr chr 1 7590-VHVEG 29.

8 109.

9 Electronic check yes 2 5575-GNVDE 57.

0 na Mailed check yes 3 3668-QPYBK NA 108.

15 — yes 4 7795-CFOCW 42.

3 1840.

75 Bank transfer no 5 9237-HQITU 70.

7 NA Electronic check no 6 9305-CDSKC NaN 820.

5 — yes 7 1452-KIOVK 89.

1 1949.

4 Credit card no 8 6713-OKOMC NA N/A NA yes 9 7892-POOKP 105.

3046.

05 Electronic check no10 8451-AJOMK 54.

1 354.

95 Electronic check noIt looks like all the missing values are back.

So what happened?This brings up an important point.

The dplyr package won’t modify the data in place.

Basically this means if we apply a mutate to some of the data with just a pipe operator, it will show us a modified view of the data, but it won’t be a permanent modification.

To permanently modify the data, we need to assign the mutate to the original data using the assignment operator <-.

Here’s how we would do that:# mutate missing values, and modify the dataframedf <- df %>% mutate(MonthlyCharges = replace(MonthlyCharges, is.

na(MonthlyCharges), median(MonthlyCharges, na.

rm = TRUE)))Now if we take another look at the data, it should be modified.

df# A tibble: 10 x 5 customerID MonthlyCharges TotalCharges PaymentMethod Churn chr dbl chr chr chr 1 7590-VHVEG 29.

8 109.

9 Electronic check yes 2 5575-GNVDE 57.

0 na Mailed check yes 3 3668-QPYBK 57.

0 108.

15 — yes 4 7795-CFOCW 42.

3 1840.

75 Bank transfer no 5 9237-HQITU 70.

7 NA Electronic check no 6 9305-CDSKC 57.

0 820.

5 — yes 7 1452-KIOVK 89.

1 1949.

4 Credit card no 8 6713-OKOMC 57.

0 N/A NA yes 9 7892-POOKP 105.

3046.

05 Electronic check no10 8451-AJOMK 54.

1 354.

95 Electronic check noThis time the MonthlyCharges column was modified permanently.

Keep in mind that when you want to permanently mutate your data with dplyr, you need to assign the mutate to the original data.

Non-Standard Missing ValuesA lot of times you won’t be lucky enough to have all standard missing value types that R will recognize right away.

Let’s take a quick look at the next column, TotalCharges, to see what I mean.

We can see there’s three different missing values, “na”, “NA”, and “N/A”.

In the previous example we saw that R recognized “NA” as a missing value, but what about “na” and “N/A”?Let’s take a look at this column and use is.

na to see if R recognizes all of these as missing values.

# looking at missing valuesdf$TotalChargesis.

na(df$TotalCharges)> is.

na(df$TotalCharges) [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSELooking at the results we can see that R only identified “NA” as a missing value.

Let’s use the summarise function to see how many missing values R found.

# counting missing valuesdf %>% summarise(count = sum(is.

na(TotalCharges)))# A tibble: 1 x 1 count int1 1The result confirms that R only found one missing value.

We’ll need to replace both “na” and “N/A” with “NA” to make sure that R recognizes all of these as missing values.

Let’s use the mutate function to replace these with the correct missing value types.

Keep in mind that we need to use the assignment operator to make sure the changes are permanent.

# replacing with standard missing value type, NAdf <- df %>% mutate(TotalCharges = replace(TotalCharges, TotalCharges == "na", NA)) %>% mutate(TotalCharges = replace(TotalCharges, TotalCharges == "N/A", NA))If we take a look at this column again, we can see that now all of the missing values have been correctly identified by R.

# taking another lookdf$TotalChargesis.

na(df$TotalCharges)> df$TotalCharges [1] "109.

9" NA "108.

15" "1840.

75" NA "820.

5" [7] "1949.

4" NA "3046.

05" "354.

95"> is.

na(df$TotalCharges) [1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSENow we can see that R picked up all three missing values.

Before we replace the missing values, there’s still another problem.

R thinks that the column values are characters.

We can confirm this with the glimpse function.

> glimpse(df$TotalCharges) chr [1:10] "109.

9" NA "108.

15" "1840.

75" NA "820.

5" "1949.

4" NA .

Let’s change these to numeric types.

# changing to numeric typedf$TotalCharges <- as.

numeric(df$TotalCharges)glimpse(df$TotalCharges)> df$TotalCharges <- as.

numeric(df$TotalCharges) > glimpse(df$TotalCharges) num [1:10] 110 NA 108 1841 NA .

Finally, let’s finish up by replacing the missing values with the median.

# replace missing values with mediandf <- df %>% mutate(TotalCharges = replace(TotalCharges, is.

na(TotalCharges), median(TotalCharges, na.

rm = T)))df$TotalCharges> df$TotalCharges [1] 109.

90 820.

50 108.

15 1840.

75 820.

50 820.

50 1949.

40 820.

50 [9] 3046.

05 354.

95An even simpler way to change all of the missing values is to change the column to numeric before doing anything else.

Let’s import the data again so that we have the missing values again.

# importing the data againdf <- read_csv("telecom.

csv")df$TotalCharges> df$TotalCharges [1] "109.

9" "na" "108.

15" "1840.

75" NA "820.

5" [7] "1949.

4" "N/A" "3046.

05" "354.

95"Now let’s try changing the column to numbers.

# change TotalCharges to numeric typedf$TotalCharges <- as.

numeric(df$TotalCharges)df$TotalCharges> df$TotalCharges <- as.

numeric(df$TotalCharges)Warning message:NAs introduced by coercion > df$TotalCharges [1] 109.

90 NA 108.

15 1840.

75 NA 820.

50 1949.

40 NA [9] 3046.

05 354.

95This time all of the different missing value types were changed automatically.

Although this is a little bit shorter, I don’t always prefer this solution.

This worked for our specific example, but if you’re trying to detect anomalies or other dirty data, this might not be a good solution.

Always make sure to read the R console for warnings like this.

It can provide valuable information.

More Missing ValuesSo far we’ve looked at standard missing values like “NA” and non-standard values like “n/a” and “N/A”.

There’s numerous other ways to represent missing data.

Maybe I was manually entering in data and chose to use “ — ” for missing values.

On the other hand, maybe you prefer to just leave the cell blank.

Let’s learn about detecting some of these more unusual types of missing values.

Take a look at the PaymentMethod column:We can see that there’s three missing values.

Two are represented with “ — ” and one is just an empty cell.

Let’s see what R thinks about these:# looking at PaymentMethoddf$PaymentMethodis.

na(df$PaymentMethod)> is.

na(df$PaymentMethod) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSER was only to identify one of the missing values, the empty cell.

Let’s go ahead and use mutate to change “ — “ to NA.

# replacing "–" with NAdf <- df %>% mutate(PaymentMethod = replace(PaymentMethod, PaymentMethod == "–", NA))is.

na(df$PaymentMethod)df$PaymentMethod> df$PaymentMethod [1] "Electronic check" "Mailed check" NA [4] "Bank transfer" "Electronic check" NA [7] "Credit card" NA "Electronic check"[10] "Electronic check"Now we can see that all three missing values now show up.

So far we’ve either left missing values alone, or replaced them with a median.

What about dealing with missing values in a column of character types?Since all of the entries in the PaymentMethod column are strings, there’s no median value.

Rather than just exclude the missing values, let’s convert the NAs to a new category, called “unavailable”.

# replace NA with "unavailable"df <- df %>% mutate(PaymentMethod = replace(PaymentMethod, is.

na(PaymentMethod), "unavailable"))df$PaymentMethod> df$PaymentMethod [1] "Electronic check" "Mailed check" "unavailable" [4] "Bank transfer" "Electronic check" "unavailable" [7] "Credit card" "unavailable" "Electronic check"[10] "Electronic check"Now we can see that our three missing values, NA, have been converted to a new category, “unavailable”.

Sometimes there’s a reason why values are missing, so it’s good to keep that information to see how it influences the results in our machine learning models.

We won’t get in to those details in this post, but keep in mind that throwing out missing values might not always be a good idea.

ConclusionIn this post we learned about data cleaning, one of the most important skills in data science.

Specifically, we looked at detecting different types of missing values.

We also learned about replacing both numeric and character type missing values.

You can expect to spend up to 80% of your time cleaning data, so this is a valuable skill to have.

For information on data cleaning and detecting missing values with Python, check out this post.

.

. More details

Leave a Reply