Get any US music chart listing from history in your R console

Learn about R’s scraping capabilities and write a simple function to grab a US music chart from any date in the past

Keith McNulty · May 11

We are lucky enough to live in an age where we can get pretty much any factoid we want.

If we want to find out the top Billboard 200 albums from 1980, we just need to go to the official Billboard 200 website, enter the date, and up comes the list in a nice display with album art and all that nice stuff.

But often we don’t care about the nice stuff, and we don’t want to visit a website and go through several clicks to get the info we need.

Wouldn’t it be great if we could just get it in our console with a simple function or command?

Well, if the website is well structured and is serving data from a structured dataset, then in all likelihood you can scrape it, meaning that you can extract the precise information you want into a vector or table, ready for whatever analysis you have in mind.

In this article we are going to review elementary web scraping in R using the packages rvest and xml2.

These packages are remarkably easy to use.

By the end of the article we will have created a function called get_chart() which will take a date, a chart type and a vector of ranked positions as its arguments and instantly return the chart entries in those positions on that date.

I hope this will encourage you to try it on countless other sources of web data.

For this tutorial you will need to have installed the dplyr, xml2 and rvest packages.
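If you don’t have them yet, all three can be installed from CRAN in one line:

# install the packages used in this tutorial
install.packages(c("dplyr", "rvest", "xml2"))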

You will also need to be using the Google Chrome browser.

Getting started: Scraping this week’s Billboard Hot 100

We will start by working out how to scrape this week’s Billboard Hot 100 to get a ranked list of artists and titles.

If you take a look at the Hot 100 page at https://www.billboard.com/charts/hot-100 you can observe its overall structure.

It has various banners and advertising and there’s quite a lot going on, but you can immediately see that there is a Hot 100 list on the page, so we know that the information we want is on this page and we will need to navigate the underlying code to find it.

The packages rvest and xml2 in R are designed to make it easy to extract and analyse deeply nested HTML and XML code that sits behind most websites today.

HTML and XML are different — I won’t go into the details of that here — but you’ll usually need rvest to dig down and find the specific HTML nodes that you need and xml2 to pull out the XML attributes that contain the specific data you want.

After we load our packages, the first thing we will want to do is read the html from the web page, so we have a starting point for digging to find the nodes and attributes we want.

# required libraries
library(rvest)
library(xml2)
library(dplyr)

# get url from input
input <- "https://www.billboard.com/charts/hot-100"

# read html code from url
chart_page <- xml2::read_html(input)

Now we have an object chart_page which contains two elements, one for the head of the webpage and the other for the body of the webpage.

We now need to inspect the website using Chrome.

Right-click anywhere on the page and choose ‘Inspect’.

This will bring up a panel showing you all the nested HTML and XML code.

As you roll your mouse over this code you will see that the part of the page that it refers to is highlighted.

For example, you can see that the section we are interested in highlights when you mouse over the <div class="container chart-container ..."> element, which makes sense.

So we need to find this in the body of the page.

We use the rvest function html_nodes() to get the nodes of the body of the page, and we use the xml2 function xml_children() to find the parts of the nodes that we want to dig into.

# browse nodes in body of article
chart_nest_1 <- chart_page %>%
  rvest::html_nodes('body') %>%
  xml2::xml_children()

View(chart_nest_1)

This gives us a nested numbered list which we can click and browse through. List element 3 (child 3) contains the main page content according to its XML attributes, so we continue diving to find its children:

chart_nest_2 <- chart_nest_1 %>%
  xml2::xml_child(3) %>%
  xml2::xml_children()

View(chart_nest_2)

If we again browse the resulting list, we can see under child 3 the precise XML attribute we are looking for. Proceeding in this fashion, we can get to the precise segments of the code that contain the contents of the Hot 100 list, which turn out to be a few children down from the original body node:

# drill down XML children
chart <- chart_page %>%
  rvest::html_nodes('body') %>%
  xml2::xml_child(3) %>%
  xml2::xml_child(3) %>%
  xml2::xml_child(1)

We can now extract the attributes of the children to see what we are interested in:

# get contents and attributes of children
attrs <- chart %>%
  xml2::xml_children() %>%
  xml2::xml_contents() %>%
  xml2::xml_attrs()

View(attrs)

Browsing attrs shows that the chart entries are inside this list (along with other things like advertising banners and videos), so we can now extract the data we want.

To get rank, artist and title we just grab the data-rank, data-artist and data-title attributes as separate vectors and combine them into a dataframe.

Some of the items we pull down will not be chart entries but other XML classes.

These will appear as NA in our dataframe and can be easily removed.

# get ranks, artists, and titles as vectors
rank <- chart %>%
  xml2::xml_children() %>%
  xml2::xml_contents() %>%
  xml2::xml_attr('data-rank')

artist <- chart %>%
  xml2::xml_children() %>%
  xml2::xml_contents() %>%
  xml2::xml_attr('data-artist')

title <- chart %>%
  xml2::xml_children() %>%
  xml2::xml_contents() %>%
  xml2::xml_attr('data-title')

# combine into a dataframe and remove NAs
chart_df <- data.frame(rank, artist, title)
chart_df <- chart_df %>%
  dplyr::filter(!is.na(rank))

View(chart_df)

And there we have it: a nice list of exactly what we want, with 100 rows as expected.

Generalizing to pull any chart from any date

So that was a lot of investigative work, and digging into HTML and XML can be annoying.

There are Chrome plugins like SelectorGadget that can help with this, but I find them unpredictable and prefer to just investigate the underlying code as I did above.

Now that we know where the data sits, however, we can make this a lot more powerful.

If you play with the billboard.com website, you’ll notice that you can get to a specific chart on any historic date by simply editing the URL. For example, if you want to see the Billboard 200 as of 22nd March 1983, you just go to https://www.billboard.com/charts/billboard-200/1983-03-22.
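Before generalizing, you can sanity-check a dated URL from R (a quick sketch; chart_1983 is just an illustrative name):

# read the Billboard 200 chart page for 22nd March 1983
chart_1983 <- xml2::read_html("https://www.billboard.com/charts/billboard-200/1983-03-22")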

This allows us to take the code above and easily generalize it by creating a function that accepts the date, chart type and positions we are interested in.

Let’s write that function with some default values for date (today), chart type (the Hot 100), and positions (the top 10).

get_chart <- function(date = Sys.Date(), positions = c(1:10), type = "hot-100") {

  # get url from input and pull html
  input <- paste0("https://www.billboard.com/charts/", type, "/", date)
  chart_page <- xml2::read_html(input)

  # scrape data
  chart <- chart_page %>%
    rvest::html_nodes('body') %>%
    xml2::xml_child(3) %>%
    xml2::xml_child(3) %>%
    xml2::xml_child(1)

  rank <- chart %>%
    xml2::xml_children() %>%
    xml2::xml_contents() %>%
    xml2::xml_attr('data-rank')

  artist <- chart %>%
    xml2::xml_children() %>%
    xml2::xml_contents() %>%
    xml2::xml_attr('data-artist')

  title <- chart %>%
    xml2::xml_children() %>%
    xml2::xml_contents() %>%
    xml2::xml_attr('data-title')

  # generate a display dataframe
  chart_df <- data.frame(rank, artist, title)
  chart_df <- chart_df %>%
    dplyr::filter(!is.na(rank), rank %in% positions)

  chart_df
}

OK, let’s test our function.

What were the Top 20 singles on 22nd March 1983? What were the Top 10 albums on 1st April 1970? The calls below answer both questions.
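Assuming the site keeps the same page structure for historic dates, the calls look like this (recall that the album chart uses the "billboard-200" slug in the URL):

# Top 20 singles on 22nd March 1983
get_chart(date = "1983-03-22", positions = c(1:20), type = "hot-100")

# Top 10 albums on 1st April 1970
get_chart(date = "1970-04-01", positions = c(1:10), type = "billboard-200")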

What I love about rvest and xml2 is how simple and powerful they are. Look how lean the content of the function is: it didn’t take much to create something quite powerful.

Give it a try with some other sources of web data and feel free to add to the Github repo here if you create any other cool scraping functions.

Originally I was a Pure Mathematician, then I became a Psychometrician and a Data Scientist.

I am passionate about applying the rigor of all those disciplines to complex people questions.

I’m also a coding geek and a massive fan of Japanese RPGs.

Find me on LinkedIn or on Twitter.

You can find out more about rvest here and xml2 here.
