Data Manipulation and Exploration with Dplyr

Furthermore, the new dataframe has only as many rows as there are unique values in the grouping variable, in our case “country.” There were 49 unique countries in this column when we started out, so the new dataframe has 49 rows and 2 columns. From there, we use arrange() to sort the entries by count. Passing desc(count) as an argument sorts from the largest to the smallest value, since the default order is ascending. The next step, top_n(10), selects the top ten producers; because no weighting variable is given, it ranks by the last column in the dataframe, which is “count” here. Finally, select() retains only the “country” column, so our final object, “selected_countries,” becomes a one-column dataframe. We transform it into a character vector using as.character(), as this will come in handy later on.

    selected_countries = wine %>%
      group_by(country) %>%
      summarize(count = n()) %>%
      arrange(desc(count)) %>%
      top_n(10) %>%
      select(country)
    selected_countries = as.character(selected_countries$country)

So far we’ve learned one of the most powerful tools in dplyr, group-by aggregation, and a method to select columns. Now we’ll see how we can select rows.

    # create a country and points dataframe containing only the 10 selected countries
    select_points = wine %>%
      filter(country %in% selected_countries) %>%
      select(country, points) %>%
      arrange(country)

In the above code, filter(country %in% selected_countries) ensures we’re only selecting rows where the “country” variable has a value that’s in the “selected_countries” vector we created a moment ago. After subsetting these rows, we use select() to keep the two columns we want and arrange() to sort the values. Note that the argument passed to arrange() tells it to sort by the “country” variable; arrange() has no default sort column, so called without arguments it would leave the rows in their existing order.

At a high level, we want to know whether higher-priced wines are really better, at least as judged by Wine Enthusiast. To answer this, we create a scatterplot of “points” and “price” and add a smoothed line to see the general trend.
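To see what the counting and filtering steps described above actually return, it can help to run them on a tiny dataframe first. The sketch below is only an illustration: the ten-row “wine” dataframe is made up, the names “top_countries” and “top_points” are hypothetical, and it keeps the top 2 countries rather than the top 10 so the result is easy to check by eye. It also names the ranking column explicitly in top_n(), which should behave the same as relying on the last column.

```r
library(dplyr)

# Made-up stand-in for the wine reviews data (assumption, not the real dataset):
# US has 4 reviews, France 3, Italy 2, Chile 1.
wine <- tibble(
  country = c("US", "US", "US", "US", "France", "France", "France",
              "Italy", "Italy", "Chile"),
  points  = c(90, 88, 92, 85, 91, 89, 94, 87, 90, 86)
)

# Count reviews per country, sort from most to fewest, keep the top 2
# (the article keeps the top 10), and retain only the country column.
top_countries <- wine %>%
  group_by(country) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(2, count) %>%
  select(country)

# Turn the one-column dataframe into a plain character vector.
top_countries <- as.character(top_countries$country)

# Keep only the rows for those countries, then the two columns of interest,
# sorted alphabetically by country.
top_points <- wine %>%
  filter(country %in% top_countries) %>%
  select(country, points) %>%
  arrange(country)

print(top_countries)
print(top_points)
```

On this toy data the vector holds the two countries with the most reviews, and the filtered dataframe keeps only their rows, which is the same shape of result the article’s pipeline produces on the full dataset.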
