Scraping web data with R & the Tidyverse

The code below plots our ratings for each episode and generates a clean-looking chart.

library(hrbrthemes)

plot_episode_popularity <- function(episode_table) {
  ggplot(episode_table, aes(x = forcats::fct_reorder(title, episode), y = rating)) +
    geom_point(aes(color = season), shape = 15, size = 4) +
    geom_text(aes(label = title), angle = 270, size = 3, hjust = 0, nudge_y = -0.2) +
    labs(x = "Episodes", y = "IMDB Rating", title = "Twin Peaks episodes ranked by IMDB") +
    theme_ipsum_rc(grid = "Y") +
    theme(axis.text.x = element_blank(),
          panel.grid.major = element_line(linetype = 4),
          legend.title = element_blank()) +
    ylim(0, NA)
}

twin_peaks_episodes %>%
  mutate(episode = row_number()) %>%
  plot_episode_popularity()

Now that we’ve successfully reproduced the first chart from Katie’s tweet, let’s have a look at the second chart, which visualizes the centrality of different characters to the show’s plot, based on the number of times they are mentioned in the episode summaries on IMDB.

In the case of Twin Peaks, we actually have the credits for each episode easily available to us on the Twin Peaks Wiki.

We’ll simply need to scrape the cast data and then filter out any irrelevant information, such as credits for stunt doubles.

You can see where we’ll be scraping this data from in the GIF below.

Each cast member exists as a hyperlink inside an "li" element.

This information will be used to select the names of the cast members when we scrape the data.
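To make the selector idea concrete, here is a minimal sketch that runs against a tiny hand-written HTML fragment instead of the real wiki page. The markup and the names in toy_html are made up for illustration; only the "hyperlinks inside an li element" shape mirrors what is described above.

```r
library(rvest)

# Hypothetical markup: a stand-in for the wiki's cast list, not the real page.
toy_html <- read_html(
  '<ul>
     <li><a href="/wiki/A">Actor One</a> as <a href="/wiki/B">Character One</a></li>
     <li><a href="/wiki/C">Actor Two</a> as <a href="/wiki/D">Character Two</a></li>
   </ul>'
)

# Select every "li" element and pull out its text, link text included.
toy_cast <- html_text(html_nodes(toy_html, "ul > li"))

toy_cast
```

Each element of toy_cast is the full text of one list item, which is why the real scrape returns strings of the form "actor as character".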

Below you’ll see a function similar to the one we used to scrape the titles from the Twin Peaks Wiki, only modified to select each episode’s cast and character information.

We’ll also use the exact same episode_urls variable from above.

The number of nodes being scraped here is quite large, so I’m outputting the cast for episode 21 only to show the raw scraped data.

get_cast <- function(episode_url) {
  read_html(episode_url) %>%
    html_nodes(".WikiaArticle > div > ul > li") %>%
    html_text()
}

cast <- map(c("https://twinpeaks.fandom.com/wiki/Pilot", episode_urls), get_cast)

cast[21]

The cast variable contains all of the scraped credits data from all 30 episode pages (including the pilot).

But it contains a lot of irrelevant data too.

Have a look at the credits data for episode 21 above: it includes credits for voice actors, non-appearing cast members ("credit only"), and even some notes about the episode as well.

We can remove some of the whitespace and formatting from the credits using a few simple functions.

Notice that the list below is just slightly cleaner than the one above.

This is because we removed all of the bracketed citation markers from the strings.
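Here is a small sketch of what this kind of cleanup does to a single string. The credit string is made up, and the regular expression is my assumption about what a bracketed citation marker such as [1] looks like, so treat it as an illustration only.

```r
library(stringr)

# Made-up credit string with stray whitespace and a citation marker.
raw_credit <- "  Actor   One as  Character One[1] \n"

# str_squish() trims both ends and collapses runs of whitespace to one space.
squished <- str_squish(raw_credit)

# Remove anything of the form [ ... ], e.g. "[1]" (assumed citation format).
cleaned <- str_remove_all(squished, "\\[[^\\]]*\\]")

cleaned
```

Note that str_squish already trims leading and trailing whitespace, so a separate str_trim call after it does no harm but is not strictly needed.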

clean_cast <- function(episode_cast) {
  str_squish(episode_cast) %>%
    str_trim() %>%
    str_remove_all("\\[[^\\]]*\\]") %>% # remove citations
    str_replace_all("'", '"') # convert single-quotes to double
}

cast <- map(cast, clean_cast)

cast[21]

Removing the voice-only credits and other records that we aren’t interested in is also quite simple.

The function below will keep the records that we want, and remove any of the ones that we don’t.

Each episode’s cast list gets mapped to the filter_credits function, and the output is a much tidier list of actual cast members who appeared in the episode.
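If keep and discard are new to you, the sketch below shows the idea on a handful of made-up credit strings. The data and the combined "voice|credit only" pattern are my own simplification for illustration, not the filter used in this post.

```r
library(purrr)
library(stringr)

# Made-up credits, shaped like the cleaned strings from the scrape.
toy_credits <- c(
  "Actor One as Character One",
  "Actor Two as Character Two (voice)",
  "Actor Three as Character Three (credit only)",
  "Directed by Somebody"
)

# keep() retains the elements for which the predicate is TRUE.
with_roles <- keep(toy_credits, ~ str_detect(.x, " as "))

# discard() is the opposite: it drops the elements for which it is TRUE.
toy_filtered <- discard(with_roles, ~ str_detect(.x, "voice|credit only"))

toy_filtered
```

Only the first string survives: the last one has no " as " in it, and the middle two match the discard pattern.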

filter_credits <- function(episode_credits) {
  keep(episode_credits, str_detect(episode_credits, " as ")) %>%
    discard(str_length(.) > 50) %>% # remove any actual text that was scraped
    discard(str_detect(., "credit only")) %>% # remove non-appearing cast members
    discard(str_detect(., "performer")) %>% # remove unknown actors
    discard(str_detect(., "voice")) %>% # remove voice-only cast members
    discard(str_detect(., "deleted scene")) %>%
    discard(str_detect(., "archive footage")) %>%
    discard(str_detect(., "citation needed")) %>%
    discard(str_detect(., "Invitation To Love")) # remove soap opera credits
}

cast <- map(cast, clean_cast) %>% map(filter_credits)

cast[21]

Much nicer.

Now we have a tidy list of the cast members for each episode.

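But what we really need are just the characters that appear in each episode.

Thankfully, this is quick work too.

When mapped to a string, the str_extract function will return only the part of the string that matches the pattern we pass to it. Here, the pattern matches everything that follows " as ", which is the character’s name.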
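As a quick illustration of how a lookbehind pattern behaves with str_extract, here is a sketch on two made-up credit strings.

```r
library(stringr)

# Made-up credits in the "actor as character" shape.
toy_credits <- c("Actor One as Character One", "Actor Two as Character Two")

# (?<= as ) is a lookbehind: the match starts right after " as ",
# and the " as " itself is not part of the result.
toy_characters <- str_extract(toy_credits, "(?<= as ).*")

toy_characters
```

The result keeps only the character names, one per input string.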

characters <- map(cast, str_extract, pattern = "(?<= as ).*")

characters[21]

twin_peaks_episodes <- twin_peaks_episodes %>% mutate(characters = characters)

twin_peaks_episodes

How simple was that! The new column contains rows that look like <chr [33]>.

This means that the row contains a list of 33 strings, or “characters” as they’re called in R.

No pun intended.

In order to find the total number of appearances for each character throughout the show, we need to first “un-nest” the column of characters in each episode.

The output below will show you what I mean by “un-nesting”.

The unnest function is going to expand our tibble from just 30 rows to a whopping 955 rows. It does this by creating a row for every single character appearance.

In a way, this changes the nature of our tibble from a tibble of Twin Peaks episodes, to a tibble of all character appearances throughout seasons 1 and 2 of Twin Peaks.
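To see the mechanics on a small scale, here is a sketch with a made-up two-episode tibble whose characters column is a list column, just like ours.

```r
library(tibble)
library(tidyr)

# Made-up data: two episodes, each holding a vector of character names.
toy_episodes <- tibble(
  title = c("Episode A", "Episode B"),
  characters = list(c("Character One", "Character Two"), "Character One")
)

# unnest() produces one row per element of the list column,
# repeating the other columns (here, title) as needed.
toy_appearances <- unnest(toy_episodes, characters)

toy_appearances
```

Two rows become three, because the first episode contributes two appearances and the second contributes one.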

all_character_appearances <- twin_peaks_episodes %>% unnest(characters)

all_character_appearances

The tibble above contains every appearance of every character across seasons 1 and 2 of the show.

By applying the filter function to filter for a particular character, we can quickly find all of the episodes that a character appears in.

all_character_appearances %>% filter(characters == "Betty Briggs")

It’s time now to summarize the character appearances.

We want a list containing each character, as well as the total number of episodes that they appeared in.

Using the group_by function, we can group together all of the episodes that each character appeared in.

For Betty Briggs, this would mean that we’re grouping together the 7 appearances that you can see in the tibble above.

group_by is different from filter though, because group_by will always return the entire data set that was passed to it.

But it enables us to use another very useful function: summarize.

We’ll use summarize to count the number of episodes that each character appears in, and to return a tibble of all 189 characters and the number of appearances for each.
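Here is the same group-then-count idea sketched on a tiny made-up tibble of appearances, so that the result is easy to check by eye.

```r
library(dplyr)

# Made-up appearances: one row per character per episode.
toy_appearances <- tibble(
  title = c("Episode A", "Episode A", "Episode B"),
  characters = c("Character One", "Character Two", "Character One")
)

# Group the rows by character, then count the rows in each group.
toy_counts <- toy_appearances %>%
  group_by(characters) %>%
  summarise(total_appearances = n()) %>%
  arrange(desc(total_appearances))

toy_counts
```

For a plain count like this one, dplyr also offers count(characters, sort = TRUE) as a shorthand, but spelling out group_by and summarise makes the two steps easier to see.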

appearances_per_character <- all_character_appearances %>%
  group_by(characters) %>%
  summarise(total_appearances = n()) %>%
  arrange(desc(total_appearances))

appearances_per_character

Let’s quickly try applying the same “Betty Briggs” filter to the appearances_per_character tibble.

This time, it will return a single row, but the value of the total_appearances column will be 7: the number of rows that were returned the last time that we filtered for Betty Briggs.

It’s almost time to plot our chart.

But before we do that, let’s filter the list of characters down to a smaller, more relevant list.

It isn’t very interesting to know about minor characters who only appeared in one or two episodes.

In fact, let’s filter out every character who appears in a third of the episodes or fewer, keeping only those who appear in more than 10 episodes.

Using the filter function, we can refine the number of characters to plot from 189 to just 29.

This should make for a far more readable chart.

frequently_appearing_characters <- appearances_per_character %>%
  filter(total_appearances > 10)

frequently_appearing_characters

Finally, using many of the same functions that we used to produce the IMDB rankings chart, we can plot a horizontal column chart of the characters according to how many episodes they appear in.

frequently_appearing_characters %>%
  ggplot(aes(x = fct_reorder(characters, total_appearances), y = total_appearances)) +
  geom_col(mapping = aes(fill = characters)) +
  coord_flip() +
  labs(y = "Total appearances", title = "Twin Peaks characters by their appearances") +
  theme_ipsum_rc(grid = "X") +
  theme(legend.position = "none",
        axis.title.y = element_blank())

In this post, we scraped HTML data from several web pages to compile data about the cult classic TV series Twin Peaks, and then we created two charts from our scraped data.

But before you get the wrong impression and think that “50 Things You Need to Know About Data” is a course about web scraping, I need to be clear that this is just one topic among many that you’ll study if you follow the course.

Other concepts covered in the course.

As a newcomer to data science and data visualization, I found this course inspiring.

And as a web developer, it was refreshingly fun.

I’d recommend it to anyone who is interested in working with data, but who hasn’t yet wrangled data programmatically.

Special thanks to Katie Segretti, whose charts were the inspiration for this post, and who graciously agreed to let me use her visualizations here as well.
