10 not so intuitive things about programming with R

Jyoti Prakash Maheswari, Aug 18, 2018

Why Use R for Data Science

R has traditionally been regarded as the preferred computing language of statisticians and certain academic researchers, owing to its large number of available statistical packages and its large community.

Arguably the biggest strength of R lies in the ease with which data manipulation and statistical analysis tasks are achieved.

Also, packages such as ggplot2 generate high-quality visualizations, making data analysis and report making a delight, without the steep learning curve of other scripting languages.

That being said, R can also be quite quirky in its behavior.

Having recently taken a course on Exploratory Data Analysis with R as part of the MSDS program at the University of San Francisco, I have learned some of these quirks the hard way.

I have generated a Top-10 list of quirks below, which may not be very intuitive.

Being aware of the quirks can save a lot of time and frustration while analyzing data and debugging code.

1) Use of require() vs library() when loading a package

As discussed above, a major advantage of R is the huge number of packages, which can be easily retrieved and loaded into an active R session.

Both require() and library() load installed packages into the active session, but they behave differently when the package in question cannot be loaded.

While library() throws an error when the package is not available, require() issues a warning and returns a logical value indicating whether the package was loaded.

> library(randomForest)
Error in library(randomForest) : there is no package called ‘randomForest’
> b <- require(randomForest)
Loading required package: randomForest
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
  there is no package called ‘randomForest’
> print(b)
[1] FALSE

Because it returns a logical value, require() can be used as a conditional check: test whether the package is present and install it if it is not.
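The conditional check described above can be sketched as follows. This is a minimal sketch, not a standard API: `ensure_package` is a hypothetical helper name, and the `install` flag is an assumption added here so that nothing is downloaded unless explicitly requested.

```r
# Hypothetical helper: TRUE if `pkg` could be loaded, FALSE otherwise.
ensure_package <- function(pkg, install = FALSE) {
  # require() returns FALSE (with a warning) instead of stopping when the
  # package is missing; character.only = TRUE lets us pass the name as a string.
  loaded <- suppressWarnings(
    require(pkg, character.only = TRUE, quietly = TRUE)
  )
  if (!loaded && install) {
    install.packages(pkg)
    loaded <- suppressWarnings(
      require(pkg, character.only = TRUE, quietly = TRUE)
    )
  }
  loaded
}

has_stats <- ensure_package("stats")                 # ships with R
has_fake  <- ensure_package("notARealPackage12345")  # assumed not installed
```

With library() in place of require(), the missing package would stop the script at the first call rather than returning FALSE.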

2) NA vs NULL in R

The difference between NA and NULL can be a point of confusion, as one may intuitively think that both of them represent missing or undefined values.

R, however, treats these two reserved words differently.

NA is a logical constant of length 1 which contains a missing value indicator while NULL represents the null object.

An intuitive explanation of the above statement can be shown through the following code:

> v <- c(1,2,3,NA)
> v
[1]  1  2  3 NA
> w <- c(1,2,3,NULL)
> w
[1] 1 2 3

The above code helps us understand a key difference between NA and NULL.

NULL is an object of its own type with length zero, so it contributes no element when it is combined into a vector.

Hence, when we try to include it in a vector, it is ignored.

NA, on the other hand, occupies a position and can be coerced to various types, with typed variants such as NA_integer_ and NA_character_.
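A few quick checks make the distinction concrete. This is a small illustrative sketch; the variable names `with_na` and `with_null` are chosen here and are not from the original post.

```r
with_na   <- c(1, 2, 3, NA)
with_null <- c(1, 2, 3, NULL)

length(with_na)    # 4: NA occupies a position
length(with_null)  # 3: NULL contributed nothing
length(NULL)       # 0

is.na(with_na)     # FALSE FALSE FALSE  TRUE
is.null(NULL)      # TRUE

# NA takes on the type of the vector it lives in
class(NA)           # "logical"
class(c(1L, NA))    # "integer"
class(c("a", NA))   # "character"
```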

However, if we create a list (as lists can store different types together), both of them can be included as elements.

> w <- list(1,2,3,NULL)
> w
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
NULL

3) Sub-setting using floating point values and recycling of logical vectors

To access a specific value in a vector, matrix, or array, we can use index values or logical vectors for sub-setting.

One interesting thing to note is that we can use floats and integers alike to index a specific value, as shown below:

> w <- c(1,2,3,4,5)
> w[3.5]
[1] 3
> w[-4.9]
[1] 1 2 3 5

We can clearly see that the fractional part of the index is ignored while sub-setting, irrespective of the sign: the index is truncated toward zero.

So 3.5 gives us the 3rd element of the vector, and -4.9 excludes the 4th element.

Another thing to be cautious of while using logical vectors for sub-setting is the vector length.

If the sub-setting vector is shorter than the vector being sub-set, R recycles it to the required length instead of throwing an error.

> w <- c(1,2,3,4,5)
> x <- c(T,F)
> x
[1]  TRUE FALSE
> w[x]
[1] 1 3 5

We can see that vector x has a length of 2 and vector w has a length of 5.

However, when we sub-set using x, it is recycled to c(T,F,T,F,T) and we get every alternate value in w.
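If silent recycling is not what you want, one option is to check the lengths yourself before sub-setting. The sketch below assumes a hypothetical helper, `strict_subset`, which is not part of base R; it also shows that fractional indices behave like as.integer(), which truncates toward zero.

```r
w <- c(1, 2, 3, 4, 5)

# Fractional indices are truncated toward zero, just as as.integer() does
as.integer(3.5)    # 3
as.integer(-4.9)   # -4
drop_fourth <- w[-4.9]

# Hypothetical helper: refuse logical indices of the wrong length
strict_subset <- function(values, keep) {
  stopifnot(is.logical(keep), length(keep) == length(values))
  values[keep]
}

full_mask  <- strict_subset(w, c(TRUE, FALSE, TRUE, FALSE, TRUE))
short_mask <- tryCatch(
  strict_subset(w, c(TRUE, FALSE)),
  error = function(e) "length mismatch"
)
```

Here `full_mask` holds the alternate values, while the short logical vector is rejected instead of being recycled.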

4) Type preservation vs simplification in Lists

Lists can store different data types together, which can be very useful.

Type preservation and type simplification are two interesting concepts that come into play when we extract elements of a list using [ ] or [[ ]].

> a <- list(1,2,3,c(4,5,6))
# a has 1, 2, 3 and c(4,5,6) as its elements
> a[1]
[[1]]
[1] 1

# Element 1 as a list: type preservation
> a[4]
[[1]]
[1] 4 5 6

# Element 4 as a list: type preservation
> a[[4]]
[1] 4 5 6
# Element 4 as a vector: type simplification
> a[[4]][1]
[1] 4
# First element of the 4th element of a: type simplification
> a[4][1]
[[1]]
[1] 4 5 6

# [ ] outputs the whole vector, as we did not simplify the list to a vector

From the above example, we can see that [ ] preserves the type: the output is also a list, like the original list.

[[ ]], on the other hand, simplifies the type and gives us the underlying element itself.

[[ ]] is also needed when we want to access an element of a vector that is contained in a list, as shown in the a[[4]][1] call above.
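The practical consequence shows up as soon as the extracted value is passed to a function that expects a vector. A short sketch, using the same list as above (the result variable names are illustrative only):

```r
a <- list(1, 2, 3, c(4, 5, 6))

kept       <- a[4]     # still a list of length 1
simplified <- a[[4]]   # the numeric vector itself

is.list(kept)          # TRUE
is.numeric(simplified) # TRUE

total <- sum(simplified)   # works on the vector

# sum() cannot add up a list, so the type-preserved result fails
list_sum_failed <- tryCatch(
  { sum(kept); FALSE },
  error = function(e) TRUE
)
```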

5) Accessing columns in a matrix and use of drop

Matrix operations play a large role in implementing many machine learning models, where they provide vectorization and speed.

We often need to subset a matrix to access a particular column and perform certain operations on it.

However, when you try to extract a single column from a matrix, a strange thing happens, as shown below:

> a <- matrix(c(1:9),ncol=3,nrow=3)
> a
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> a[,1]
[1] 1 2 3
> class(a[,1])
[1] "integer"
# When you extract a single column, it is converted to a vector
> a[,1,drop=FALSE]
     [,1]
[1,]    1
[2,]    2
[3,]    3
> class(a[,1,drop=FALSE])
[1] "matrix"
# drop=FALSE helps us retain the matrix form

When we access a single column, R drops it to a vector.

This is sometimes unwanted, as downstream code may be affected by the conversion (learned the hard way while implementing the k-means algorithm from scratch in R).

To avoid this, we must pass drop=FALSE as an argument while accessing a single column.
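The kind of downstream breakage mentioned above usually comes from code that asks for dimensions. The sketch below uses is.matrix() and dim() rather than class(), since from R 4.0.0 onward class() of a matrix reports both "matrix" and "array".

```r
a <- matrix(1:9, nrow = 3, ncol = 3)

col_vec <- a[, 1]                # dimensions dropped
col_mat <- a[, 1, drop = FALSE]  # still a 3 x 1 matrix

is.matrix(col_vec)   # FALSE
is.matrix(col_mat)   # TRUE

dim(col_vec)         # NULL
dim(col_mat)         # 3 1

# Code that relies on ncol() silently gets NULL for the dropped version
ncol(col_vec)        # NULL
ncol(col_mat)        # 1
```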

6) Accessing out of bound elements in vectors and lists

This is one of the most unusual things we can observe in R: what happens when we try to access an element that is out of bounds.

Look at the code below:

> a <- c(1:5)
> a
[1] 1 2 3 4 5
# a has 5 elements as shown above
> a[7]
[1] NA
# When we try to access the 7th element, we get NA instead of an error
> b <- list(c(1,2,3,4))
> b
[[1]]
[1] 1 2 3 4

# b has 1 element, which is a vector of 4 elements
> b[2]
[[1]]
NULL

# When we try to access the 2nd element, we get NULL instead of an error

When we access an out-of-bounds index with [ ] on a vector or list, instead of getting an error we get NA or NULL as output, which can be concerning and sometimes makes testing and debugging difficult.
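When an error is preferable to a silent NA or NULL, two options are worth knowing. Indexing with [[ ]] raises an error for an out-of-bounds position, and an explicit length check gives a clearer message. `checked_get` below is a hypothetical helper written for this sketch, not a base R function.

```r
a <- 1:5
b <- list(c(1, 2, 3, 4))

# [[ ]] with an out-of-bounds index raises an error instead of returning NA/NULL
vec_msg  <- tryCatch(a[[7]], error = function(e) conditionMessage(e))
list_msg <- tryCatch(b[[2]], error = function(e) conditionMessage(e))

# Hypothetical helper: explicit bounds check with a descriptive message
checked_get <- function(x, i) {
  if (i < 1 || i > length(x)) {
    stop("index ", i, " is out of bounds for an object of length ", length(x))
  }
  x[[i]]
}

third   <- checked_get(a, 3)
own_msg <- tryCatch(checked_get(a, 7), error = function(e) conditionMessage(e))
```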

7) if {} else{} vs ifelse()

When we have a vector and want to check a given condition, if {} else {} looks only at the first element of the vector and issues a warning (since R 4.2.0, a condition of length greater than one is an error instead).

However, if we want the condition to be evaluated for every element in the vector, we must use ifelse().

It tests every value in the vector and returns a vector.

Suppose we have a vector of 6 elements and we want to check whether each is even or odd:

> a <- c(5:10)
> a
[1]  5  6  7  8  9 10
> if(a%%2==0){x <- "Even"} else{x <- "Odd"}
Warning message:
In if (a%%2 == 0) { :
  the condition has length > 1 and only the first element will be used
> x
[1] "Odd"
> y <- ifelse(a%%2==0,"Even","Odd")
> y
[1] "Odd"  "Even" "Odd"  "Even" "Odd"  "Even"

We can see that if {} else {} tests only the first element of a, while ifelse() achieves the required result.
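When a single decision about the whole vector is actually what is wanted, the condition can be collapsed to one logical value with all() or any() before it reaches if(). A brief sketch alongside the element-wise version (variable names are illustrative):

```r
a <- 5:10

# Element-wise: one label per value
labels <- ifelse(a %% 2 == 0, "Even", "Odd")

# Whole-vector decision: collapse to a single TRUE/FALSE first
if (all(a %% 2 == 0)) {
  summary_label <- "all even"
} else if (any(a %% 2 == 0)) {
  summary_label <- "some even"
} else {
  summary_label <- "none even"
}
```

This form runs without warnings or errors on any R version, because if() only ever sees a condition of length one.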

8) Calling functions with an insufficient number of arguments

R allows a function to be called with an insufficient number of arguments, as long as the missing argument is never evaluated.

This is a very important point to keep in mind, and it differs from many other programming languages.

If we explicitly want all the arguments to be present in the function call, we can use force(), among various other options.

> f <- function(x,y){ print(x)}
> f(2)
[1] 2
# Function call works with one argument, as y is not used
> f(2,3)
[1] 2
# Calling with both arguments
> f()
Error in print(x) : argument "x" is missing, with no default
# Since x is used inside the function and is missing, we get an error
# Explicitly checking for both x and y using force()
> f <- function(x,y){force(x); force(y); print(x)}
> f(2,3)
[1] 2
> f(2)
Error in force(y) : argument "y" is missing, with no default

force() evaluates its argument, so calling it on x and on y throws an error when either is missing.
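This behavior follows from lazy evaluation: an argument is only evaluated when the function body first uses it. One of the other options alluded to above is missing(), which lets a function test for an argument and fail with its own message. The function names `lazy` and `strict` below are illustrative only.

```r
# y is never evaluated, so even an argument that would fail is harmless
lazy <- function(x, y) x
lazy_result <- lazy(2, stop("this is never evaluated"))

# missing() lets a function check for an argument and fail with a clear message
strict <- function(x, y) {
  if (missing(y)) stop("argument 'y' must be supplied")
  x
}

strict_ok  <- strict(2, 3)
strict_msg <- tryCatch(strict(2), error = function(e) conditionMessage(e))
```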

9) Function masking and use of ::

Different packages often have functions with the same name but different functionality.

If we want to use a function from a specific package, we may need to say so explicitly.

Otherwise, the function from the most recently loaded package masks all other functions of the same name.

For example, the chron and tseries packages both provide an is.weekend() function.

> library(chron)
> library(tseries)

    ‘tseries’ version: 0.10-45

    ‘tseries’ is a package for time series analysis and computational finance.

    See ‘library(help="tseries")’ for details.

Attaching package: ‘tseries’

The following object is masked from ‘package:chron’:

    is.weekend

> is.weekend(x)

When we call is.weekend(), the function from the tseries package is used, as it is the most recently loaded package.

If we specifically want to use the chron version, we need to do the following:

> chron::is.weekend(x)

:: lets us specify which package to use.

:: also lets us use a function without loading the package.

However, we then need to call the function with :: every time we want to use a function from that package.
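The same mechanics can be tried without installing chron or tseries. The sketch below is an assumption-laden stand-in: it masks stats::median() with a function defined in the global environment (rather than with a second package) and uses the tools package, which ships with R, to show :: working without library().

```r
x <- c(1, 5, 9)

# Defining our own median() masks the one from the stats package
median <- function(x) "my own median"
masked_result <- median(x)         # our function is found first
stats_result  <- stats::median(x)  # :: reaches the stats version directly

rm(median)                         # remove the mask again
restored_result <- median(x)

# :: also works for a package that has not been attached with library()
extension <- tools::file_ext("report.csv")
```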

10) Argument matching while calling a function

While calling a function, arguments are matched in the following order in R:

a) Exact name match
b) Partial name match
c) Positional match

> f <- function(a1,b1,a2){ print(paste("a1:",a1,"b1:",b1,"a2:",a2))}
> f(a1=2,b1=3,a2=5)
[1] "a1: 2 b1: 3 a2: 5"
# Exact match: each argument is supplied with its full name
> f(b1=3,5,2)
[1] "a1: 5 b1: 3 a2: 2"
# Exact match and positional match: since b1 is matched by name, 5 and 2 are assigned based on position
> f(3,b=5,2)
[1] "a1: 3 b1: 5 a2: 2"
# Partial name match: b matches b1, and the rest are matched based on position
> f(3,5,2)
[1] "a1: 3 b1: 5 a2: 2"
> f(3,5,a=2)
Error in f(3, 5, a = 2) : argument 3 matches multiple formal arguments
# Since a matches both a1 and a2, we get an error, as R does not know where the value should be assigned
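Partial matching is also at work in everyday base R calls. As a small sketch, the version of f below returns a named vector instead of printing (a change made here so the results can be inspected), and the seq() call relies on len partially matching the length.out argument.

```r
f <- function(a1, b1, a2) c(a1 = a1, b1 = b1, a2 = a2)

mixed   <- f(b1 = 3, 5, 2)   # exact name first, then positions
partial <- f(3, b = 5, 2)    # b partially matches b1

# a is ambiguous between a1 and a2, so the call fails
ambiguous <- tryCatch(
  f(3, 5, a = 2),
  error = function(e) conditionMessage(e)
)

# len partially matches seq()'s length.out argument
steps <- seq(1, 10, len = 4)
```

Spelling argument names out in full avoids surprises if a function later gains another argument with the same prefix.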

End Notes:

R opens up a world of immense possibility, and its thriving community and ecosystem make it a great programming language to learn.

I hope the above points become useful for anyone working with R and help them avoid a lot of common mistakes.

This blog post is my effort to highlight a few points from Exploratory Data Analysis in R course taught by Paul Intrevado at the University of San Francisco.

Reach out to me to discuss/edit any specific point.

About Me: Graduate student, Masters in Data Science, University of San Francisco

LinkedIn: https://www.linkedin.com/in/jyoti-prakash-maheswari-940ab766/

GitHub: https://github.com/jyotipmahes

References

Exploratory Data Analysis with R by Paul Intrevado.
