Filtering Data with Aggregate and Window Calculations

To answer this question, you can simply add ‘group_by()’ function right before the ‘filter’ step like below.

flight %>% select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>% group_by(CARRIER) %>% filter(ARR_DELAY > mean(ARR_DELAY, na.

rm = TRUE))flights with arrival delay time greater than average per carrierYes, it’s that simple.

This ‘group_by()’ function makes the data frame to be ‘grouped’ based on a given column(s) so that any aggregate functions like ‘mean()’ in the following steps would do the aggregate calculations for each group.

Let’s find out the result is really reflecting our intention.

We can quickly calculate each airline carrier’s average arrival delay time with ‘summarize()’ function like below.

flight %>% group_by(CARRIER) %>% summarize(average = mean(ARR_DELAY, na.

rm = TRUE))Average arrival delay time per carrierAs you can see the average arrival delay time for carrier ‘EV’, for example, is about 19.

69 minutes.

Now when you look at the previous result of the filtering, you can see that the arrival delay times for ‘EV” are all greater than 19.

69.

Filtering with Window functionsFiltering with aggregate functions is cool, but there is more.

Let’s say you have a question like, “What are the worst 10 flights based on the arrival delay time ?” To answer this question, we can simply use one of the rank functions called ‘min_rank()’ from dplyr and call it directly inside the ‘filter()’ function like below.

flight %>% select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>% filter(min_rank(ARR_DELAY) <= 10)By the way, I’m using ‘min_rank()’ function here but there is another rank function called ‘dense_rank()’ from dplyr.

Both functions return the ranking number based on a given measure column (e.

g.

Arrival Delay time), and only the difference is when there are ties like below.

As you can see, ‘flight D’ is considered as #4 with ‘min_rank()’ function because ‘flight B’ and ‘flight C’ are ties at #2, and ‘min_rank()’ function reserves #3 before giving the next rank number.

On the other hand, ‘dense_rank()’ function doesn’t reserve the rank #3 and instead give that to whatever the one at the next in order.

Anyway, if I run the command above, we’ll get something like below.

But when you look closer at ‘ARR_DELAY’ column though, it looks that it is showing the best 10, not the worst 10, because the less the arrival time means the flights actually had arrived earlier in this context.

So basically we want the order to be flipped.

For this, you can use ‘desc()’ function just like you would do with SQL.

flight %>% select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>% filter(min_rank(desc(ARR_DELAY)) <= 10)Now we get the worst 10 flights in January 2014.

I see a lot of American Airlines there, interesting.

But just like what we did with the aggregate function, what if you want to see the worst 10 flights for each airline carrier, instead of the overall worst 10 ?.Yes, you guessed it right, you can simply add ‘group_by()’ function right before the filter step again.

flight %>% select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>% group_by(CARRIER) %>% filter(min_rank(desc(ARR_DELAY)) <= 10)To make it easier to see if the result really reflects our intention, let’s try to see only the worst 1 for each carrier by chaning the condition from ‘<=10’ (less than 10 or equal) to ‘<=1’ (less than 1 or equal).

flight %>% select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>% group_by(CARRIER) %>% filter(min_rank(desc(ARR_DELAY)) <= 1)Now we can see the worst flight based on the arrival delay time for each of the 14 carriers.

What we have done here is often called ‘Window operation’ with other tools like SQL and BI tools.

However, with dplyr you don’t have to think about it as ‘Window operation’ problem trying to adopt your question into a complex query structure.

Instead, you focus on building your analysis questions in a quick and iterative way.

The combination of the ‘pipe (%>%)’ operation, dplyr’s beautiful command interface design, and its well thought and rich functionality, you can perform this type of analysis quickly and in an intuitive way without even realizing how complicated it could have been with other tools.

Great win for anybody who want to do a quick, iterative, and solid data analysis.

R packages used in this posthadley/dplyrdplyr – Plyr specialised for data frames: faster & with remote datastoresgithub.

comSubscribe to my Medium postsEnter your email to recieve automatic updates.

powered.

by.

rabbut.

com.. More details

Leave a Reply