Filter with Text data

In that case, you can use another function called ‘str_to_lower()’ from ‘stringr’ package, to make all the text data to be lowercase before finding the matching text like below.flight %>% select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>% filter(str_detect(str_to_lower(ORIGIN_CITY_NAME), "new york"))The result would be the same as before.Filter with Regular ExpressionLet’s say you are interested in finding the cities whose names contain a text of ‘New’.flight %>% select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%filter(str_detect(ORIGIN_CITY_NAME, "New")) %>%count(ORIGIN_CITY_NAME)I’m calling ‘count()’ function at the end to make it easier to see if the result is reflecting the intention of the ‘filter()’ command.As you see there are 6 cities in the data whose names contain a text ‘New’.Now let’s say we want to keep only ‘New York, NY’ and ‘Newark, NJ’..This is when this ‘str_detect()’ function starts shining even more..You can actually use the regular expression as the matching text inside the function..If you are people like me, the regular expression can be intimitaing and I’m not going to pretend that I like it..But what I’m going to cover here is a very basic and simple one that is actually pretty cool..Let’s go step by step.To get only the city names that are either ‘New York, NY’ or ‘Newark, NJ’, we want to find some common patterns between those two text values..Realize that both values start with ‘New’ and have ‘rk’ in the middle before the ‘,’ (comma).We can use a wild card symbol — ‘.’ (dot) — to represent any letter including special characters..So if I type something like ‘New.rk’,flight %>% select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%filter(str_detect(ORIGIN_CITY_NAME, "New.rk")) %>%count(ORIGIN_CITY_NAME)I would get something like below..It has only ‘Newark, NJ’, and it actully doesn’t have ‘New York, NY’.This is because the ‘.’ (dot) represents any character but just one character..So instead if I type three dots in the middle like below,flight %>% select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%filter(str_detect(ORIGIN_CITY_NAME, "New…rk")) %>%count(ORIGIN_CITY_NAME)I would get only ‘New York, NY’ because this is the only city value that has three characters between ‘New’ and ‘rk’.But, we still want to get both ‘New York, NY’ and ‘Newark, NJ’, so we need a way to say “we want the city names that have any number of any character between ‘New’ and ‘rk’.” In that case, we can use ‘*’ (asterisk) symbol right after the ‘.’ (dot)..‘*’ (asterisk) is used to match any preceeding characters zero or more times..You can simply update the matching text to be something like below.flight %>% select(FL_DATE, CARRIER, ORIGIN, ORIGIN_CITY_NAME, ORIGIN_STATE_ABR, DEP_DELAY, DEP_TIME, ARR_DELAY, ARR_TIME) %>%filter(str_detect(ORIGIN_CITY_NAME, "New.*rk")) %>%count(ORIGIN_CITY_NAME)Now, happily both ‘New York, NY’ and ‘Newark, NJ’ are returned.Note that there is another symbol you could use, that is ‘+’ (plus) symbol.. More details

Leave a Reply