Advanced Programming and Non-Standard Evaluation with dplyr

For most data analyses, this surprising ambiguity isn’t a problem; however, it does cause problems when we are trying to create custom functions that look and behave like dplyr functions and when we want to generalize dplyr code.

In terms of creating custom functions, it is not clear how to pass data frame fields into the functions.

In the attempt below, dplyr looks for a field in the data frame named “group_col,” not whatever value is passed into the function.
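The failing attempt can be sketched as follows. This is a minimal illustration rather than the original code: count_by is a hypothetical helper, and mtcars stands in for any data frame.

```r
library(dplyr)

# Hypothetical helper: tries to group by whatever column the caller names
count_by <- function(df, group_col) {
  df %>%
    group_by(group_col) %>%  # dplyr looks for a field named group_col here
    summarize(n = n())
}

# mtcars has no field called group_col, so the call errors
# instead of grouping by cyl
result <- tryCatch(count_by(mtcars, cyl), error = function(e) e)
inherits(result, "error")
```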

Beyond just creating functions that can distinguish field names, we often want to take existing code and parameterize it.

Effective parameterization provides the ability to programmatically manipulate data many times in a consistent way without duplicating the commands, and it allows us to very quickly change what data we are transforming.

For example, we may want to compare a list of columns to a list of different columns in a data frame.

Dplyr’s ambiguities make this unobvious.

We can obtain this functionality in our higher-level dplyr code with metaprogramming, but we need to understand how dplyr works first.

Non-Standard Evaluation: It’s Complicated Under the Hood

Dplyr does not just use the evaluated value of a function’s argument; it evaluates the argument’s expression in a custom way.

This is called non-standard evaluation, which is often abbreviated to NSE.

(Surprise! Standard evaluation is using the argument’s value in the function.)

Dplyr’s non-standard evaluation is best explained through an example: a mutate call that pipes in mtcars and defines a new column as cyl + 1.

There are two arguments to the mutate function call: the mtcars data frame, which is passed implicitly via the magrittr pipe (%>%), and the named expression cyl + 1.

The R interpreter does not evaluate the expression before it is passed to the mutate function.

Since the expression is passed to the function in raw form, mutate can do whatever it wants with it and evaluate it however it likes. In our case, it looks first for a field named “cyl” in the mtcars data frame, evaluates the expression using that field’s values, and then creates a new column in the data frame with the argument’s name and the evaluated value.

Dplyr and the rest of the tidyverse lean heavily on non-standard evaluation. This is how mutate is able to make sense of the expression cyl + 1, even though it would return an error if entered directly into the R interpreter.
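The mutate call described above can be sketched like this. The column name cyl_plus_one is our own choice for illustration; any name would do.

```r
library(dplyr)

# mutate receives the unevaluated expression `cyl + 1` and evaluates it
# against the fields of mtcars
with_new_col <- mtcars %>% mutate(cyl_plus_one = cyl + 1)
head(with_new_col$cyl_plus_one)

# The same expression at the top level fails when no variable named cyl
# exists outside the data frame
top_level <- tryCatch(cyl + 1, error = function(e) conditionMessage(e))
top_level
```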

Understanding that dplyr uses non-standard evaluation is the first step in using it programmatically.

The next is to identify how we can programmatically generate expressions that do more complicated things.

Metaprogramming with Symbols and Quosures: Making Our Own Expressions

The idea of code generating other, different code is called metaprogramming.

We can use metaprogramming to build our own expressions and give these recipes to dplyr to do what we want.

To do this, we need to understand quosures and symbols, two of the key components of metaprogramming in the tidyverse, and how we can manipulate them.

A quosure is an unevaluated expression along with the expression’s default, associated environment.

Think of a quosure as a code snippet stored in a variable.

Prefix a quosure with !! when passing it for evaluation in a function argument, or call rlang’s eval_tidy(my_quosure) to evaluate it in place.

We say the quosure is a quoted expression and the !! operator unquotes it.

To create a quosure, simply wrap an expression in the quo() function.

Plug in quosures anywhere dplyr expects an expression as a function argument using the !! operator.

In dplyr, quosures can be used with all of the core functions.
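A minimal sketch of capturing and unquoting a quosure, assuming the rlang package is attached alongside dplyr. The names double_mpg and mpg_doubled are illustrative only.

```r
library(dplyr)
library(rlang)

# quo() stores the expression and its environment; mpg is not looked up yet
double_mpg <- quo(mpg * 2)

# !! hands the stored expression to mutate as if it had been typed in place
doubled <- mtcars %>% mutate(mpg_doubled = !!double_mpg)

# eval_tidy() evaluates the quosure directly, here using mtcars as the data
direct <- eval_tidy(double_mpg, data = mtcars)
```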

Symbols are the names of R variables, which is how dplyr calls reference data frame fields.

When metaprogramming in dplyr, you can often think of a symbol as a really simple expression that just includes one object or variable name.

You’ll need to create a symbol anytime you have variable names stored as text that you want to use with dplyr, which often happens when you dynamically name columns in a data frame.

To create a symbol, simply convert a string or a variable containing a string with the sym() function.

Plug in these symbols anywhere dplyr expects a symbol or an expression of only one field name as a function argument using the !! operator.
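As a small sketch, here is a field name stored as text, converted to a symbol, and used in a group_by call. The variable names are hypothetical.

```r
library(dplyr)
library(rlang)

group_name <- "cyl"           # field name stored as text
group_sym <- sym(group_name)  # now a symbol dplyr can treat as a field

by_cyl <- mtcars %>%
  group_by(!!group_sym) %>%
  summarize(mean_mpg = mean(mpg))
by_cyl
```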

You can create more complicated quosures from other quosures and symbols.

Simply nest and combine quosures and symbols within the quo() function, calling !! before any quosure or symbol within it.

Use the resulting quosure with dplyr just as you would use a simple one.
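One way this composition might look is sketched below. The parentheses around each unquoted piece are a defensive choice on our part so that operator precedence is never in question; the names ratio, scale_col, and score are illustrative.

```r
library(dplyr)
library(rlang)

ratio <- quo(mpg / wt)   # an existing quosure
scale_col <- sym("hp")   # an existing symbol

# Build a larger quosure out of the two smaller pieces
combined <- quo((!!ratio) * (!!scale_col))

# Use it exactly like a simple quosure
scored <- mtcars %>% mutate(score = !!combined)
```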

You can also use quosures to store expressions so that your code stays DRY: just change your expression in one place and it flows to all places the quosure is used!

Often we have list-like objects of expressions or of variable names stored as text, and we would like to capture or convert all of them simultaneously.

The tidyverse provides this functionality with the quos() and syms() functions and the !!! operator.

Using these effectively requires a little more knowledge of how dplyr’s non-standard evaluation operates when given list-like objects, which we will discuss momentarily.

However, the general practice is the same as for the singular-form functions.
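As a first taste of the plural forms, the sketch below captures two expressions at once with quos() and splices them into summarize with !!!. The statistic names are hypothetical.

```r
library(dplyr)
library(rlang)

# quos() captures several expressions in one call and keeps their names
stats <- quos(mean_mpg = mean(mpg), max_hp = max(hp))

# !!! splices the list into the call as separate arguments
summary_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(!!!stats)
summary_by_cyl
```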

Leveraging dplyr’s NSE: Follow the Leader

Now that we know how to create expressions programmatically, we simply need to understand dplyr’s NSE so we can build whatever functionality we want.

When in doubt of the correct approach, look at simple dplyr calls, see what dplyr expects, then identify how to create that from the pieces that you have.

An example: you want to arrange your data frame by several columns you recently created and you have the names stored in several variables as strings.

Generally arrange expects a comma-delimited set of expressions (which are most often just field names).

The solution is to convert your variables to symbols and pass them to the arrange call with the appropriate unquoting operator.
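The arrange example could be written as follows. This is a sketch with made-up variable names, showing both the one-at-a-time form and the spliced form.

```r
library(dplyr)
library(rlang)

first_col <- "cyl"
second_col <- "mpg"

# Convert each string to a symbol and unquote it with !!
sorted <- mtcars %>% arrange(!!sym(first_col), !!sym(second_col))

# Equivalent: convert them all with syms() and splice with !!!
sorted_spliced <- mtcars %>% arrange(!!!syms(c(first_col, second_col)))
```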

There are a few more complicated pieces of dplyr’s NSE that can be hard to decipher from simple examples.

Both the mutate and summarize functions expect named function arguments, but sometimes we have a name that is stored in a variable.

To solve this, we use the := operator in place of = and unquote the name with !! on its left-hand side.

Dplyr will then treat the left-hand side of a named function argument as a variable, quosure, or symbol and evaluate it instead of using it symbolically.
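A short sketch of := in both mutate and summarize follows. The names stored in new_name and stat_name are our own placeholders.

```r
library(dplyr)
library(rlang)

new_name <- "cyl_plus_one"

# The left-hand side is evaluated, so the column is called cyl_plus_one
renamed <- mtcars %>% mutate(!!new_name := cyl + 1)

# The same pattern works in summarize with a name built on the fly
stat_name <- paste0("mean_", "mpg")
summarized <- mtcars %>% summarize(!!stat_name := mean(mpg))
```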

Finally, all dplyr functions accept lists and named lists of arguments when they are spliced in with !!!, and we can use this to programmatically create dplyr arguments in batches.

Words alone don’t capture the power this unlocks, so let’s look at an example instead.

We create a list of named quosures and then create many new columns simultaneously with mutate.

We add additional commentary in case any of the functions we use en route are new.
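The original example is not reproduced here, so the following is a sketch in the same spirit: it builds one named quosure per column and creates all the new columns in a single mutate call. The choice of columns and the "_scaled" suffix are assumptions made for illustration.

```r
library(dplyr)
library(rlang)

cols <- c("mpg", "hp", "wt")

# syms() turns the strings into symbols; lapply() builds one quosure per
# symbol, dividing each column by its own mean
scaled <- lapply(syms(cols), function(col) quo((!!col) / mean(!!col)))

# The list names become the new column names
names(scaled) <- paste0(cols, "_scaled")

# !!! splices the named list into mutate as separate named arguments
rescaled <- mtcars %>% mutate(!!!scaled)
```

Adding a fourth column to the batch only requires extending cols; the mutate call itself does not change.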

It’s worth explicitly cataloging how the core dplyr functions utilize names on lists:

select, summarize, and mutate: dplyr uses the name of the quosure as the column name in the resulting data frame, overwriting any existing column with that name as needed.

group_by: dplyr effectively passes the expression through mutate, creating a new column with that name, and then groups the data frame by the new column.

arrange: dplyr ignores the list’s names.

filter: dplyr throws an error if a list has names.
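The catalog above can be checked with a few small calls. This is a sketch with hypothetical list names (cylinders, many_cyl, four_cyl), and the exact error message from filter may vary between dplyr versions.

```r
library(dplyr)
library(rlang)

named <- list(cylinders = sym("cyl"))

# select: the list name becomes the column name
selected <- select(mtcars, !!!named)
names(selected)

# group_by: a new column with the list name is created and grouped on
grouped <- group_by(mtcars, !!!list(many_cyl = quo(cyl > 4)))
group_vars(grouped)

# arrange: the name is ignored, so the columns are unchanged
arranged <- arrange(mtcars, !!!named)

# filter: a named argument is an error
filter_result <- tryCatch(
  filter(mtcars, !!!list(four_cyl = quo(cyl == 4))),
  error = function(e) e
)
inherits(filter_result, "error")
```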

dplyr-Style Custom Functions:

If you’ve made it this far, you might have one lingering question: the dplyr functions accept unquoted expressions in their raw usage, so can we create functions like this?

The answer is yes; we can do this by using the enquo() and enquos() functions.

These functions take the raw expressions supplied to a function and quote them.

This doesn’t change the philosophy of metaprogramming in dplyr; it just moves the quotation piece inside a function.
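A sketch of such a function is shown below. mean_by is a hypothetical helper; it quotes one argument with enquo() and any number of grouping arguments with enquos().

```r
library(dplyr)
library(rlang)

# Hypothetical helper that accepts bare field names like dplyr does
mean_by <- function(df, value_col, ...) {
  value_col <- enquo(value_col)  # quote the single expression
  group_cols <- enquos(...)      # quote every expression passed in ...
  df %>%
    group_by(!!!group_cols) %>%
    summarize(mean_value = mean(!!value_col))
}

# Called with unquoted field names, just like a dplyr function
mpg_by_cyl_gear <- mean_by(mtcars, mpg, cyl, gear)
```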

Common Gotchas:

There are three common gotchas that have burned Shipt’s Data Science team in the past:

1. group_by with text does not explicitly fail.

Sometimes you may be attempting to use NSE and accidentally pass strings (not symbols) into a group_by call.

group_by won’t actually squawk at you if you do pass it a string; it just creates a field filled with that string value and groups by it.
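The silent failure can be seen by comparing a string with a symbol. This is a minimal sketch using mtcars.

```r
library(dplyr)
library(rlang)

# A string becomes a constant column, so every row lands in one group
by_string <- mtcars %>% group_by("cyl") %>% summarize(n = n())

# A symbol refers to the actual field, giving one group per cyl value
by_symbol <- mtcars %>% group_by(!!sym("cyl")) %>% summarize(n = n())

nrow(by_string)
nrow(by_symbol)
```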

2. select is very laid back.

For the most part, anything you pass to select will work.

You can pass a vector of strings or you can turn them into symbols; select doesn’t differentiate.

The key idea here is to not look to select for guidance on how to do something if you’re struggling with another dplyr function.
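A quick sketch of how forgiving select is: strings and symbols give the same columns here.

```r
library(dplyr)
library(rlang)

cols <- c("mpg", "cyl")

# Plain strings work
by_string <- mtcars %>% select("mpg", "cyl")

# Symbols work too
by_symbol <- mtcars %>% select(!!!syms(cols))

identical(names(by_string), names(by_symbol))
```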

3. You can’t unquote (!!) symbols or quosures stored in a data frame with dplyr syntax.

When variables are unquoted in dplyr arguments (with !!), dplyr looks for their values in the current environment NOT the data frame.

Said another way, even though you can store symbols and quosures in a data frame, you can’t use the !! operator to unquote them within a dplyr call.

We encourage defining a function before entering a dplyr-centric pipeline, using a purrr or lapply-like function after the pipeline, or doing the quosure composition entirely outside dplyr’s functionality.

All three of these techniques effectively create a new environment which has direct visibility to the quosures and symbol objects.

Note that an anonymous function definition within a dplyr argument will still fail for the same NSE reasons.
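The sketch below illustrates the failure and the first workaround, defining a function before entering the pipeline. The spec data frame, its target_col field, and the avg_of helper are all hypothetical, and the failing call assumes no variable named target_col exists in the calling environment.

```r
library(dplyr)
library(rlang)

# Field names stored as text inside a data frame
spec <- tibble(target_col = c("mpg", "hp"))

# Fails: !! is resolved in the calling environment, where target_col
# does not exist, rather than in the data frame
failed <- tryCatch(
  spec %>% mutate(avg = mean(pull(mtcars, !!sym(target_col)))),
  error = function(e) e
)
inherits(failed, "error")

# Workaround: define the function first, so the unquoting happens inside
# its own environment where `name` is an ordinary string
avg_of <- function(name) mean(pull(mtcars, !!sym(name)))

worked <- spec %>%
  mutate(avg = vapply(target_col, avg_of, numeric(1), USE.NAMES = FALSE))
```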

It is hard to overstate the power of programmatically generating expressions for dplyr functions.

With minimal complication we gain the ability to effectively parametrize functions, to dynamically create column names and use them later, and the possibility of a huge reduction in the lines of code necessary to achieve some goal.

Shipt’s Data Science team regularly uses these metaprogramming techniques to transform our ad-hoc analysis to production-quality, maintainable code.

We can then ship it without relying on other teams for deployment.

Finally, we can iterate on our code easily, which is much simpler thanks to a well-parameterized design.

When you find yourself repeating code multiple times in the dplyr framework, think about how you can make your code simpler with metaprogramming.

If this discussion of elegant data wrangling excites you and you are passionate about effective predictive modeling and targeted problem solving, check out the open Data Science positions at Shipt!

The R side of our ambidextrous team are heavy dplyr users, and we are passionate about continually creating clean and clever data products all across Shipt’s exciting datascape.

TL;DR:

Dplyr’s framework makes data wrangling really straightforward, but code can’t be generalized without metaprogramming.

However, metaprogramming isn’t hard; it’s just different!

Metaprogramming is when our code generates other code; symbols and quosures are just objects containing unevaluated code.

Use sym() to convert data frame field names stored as text to symbols.

Use quo() to capture expressions and save them for later as quosures.

Use !! to use a symbol or quosure as a function argument or piece of a function argument in a dplyr call.

Programmatically create complex quosures by nesting and combining quosures and symbols within another quo() call.

List-like objects of quosures and symbols can be spliced into dplyr functions with !!! for completely programmatic use.

