Scraping and Exploring The SP500 with R Part 1

Luckily, with R, going about this will be pretty simple.

Wikipedia has a table of all 505 SP500 tickers (some companies, like Google, have multiple asset classes), located at this URL: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies. To get all the SP500 tickers, we are going to scrape this table using the rvest package.

The rvest package is a simple scraping package in R that is very similar to Python's Beautiful Soup.

In the context of programming, scraping means programmatically collecting human-readable content from the internet and webpages.
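To make the idea concrete, here is a minimal rvest sketch (not part of the original post; the "Web scraping" Wikipedia page is just an arbitrary example URL) showing the basic read-select-extract pattern we will use below:

library(rvest)

# read an arbitrary page, select a node with a CSS selector, and extract its text
page <- read_html("https://en.wikipedia.org/wiki/Web_scraping")
page %>%
  html_nodes("h1") %>%  # the page's top-level heading
  html_text()           # returns the human-readable text of that node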

In the code below I scrape the Wikipedia table and create a vector of all SP500 tickers:

1. I first assign the Wikipedia URL to a variable.
2. Read in the HTML from the URL.
3. Select the correct HTML nodes and extract the HTML table.
4. Make a small change to a couple of ticker names, because Yahoo Finance uses a '-' instead of a '.' for certain symbols.

The hardest part of scraping is figuring out the xpath or CSS selector that indicates which HTML nodes to select. I really don't know much about HTML or CSS, but using Google Chrome I was able to find the correct xpath (more on this below).

# get the URL for the wikipedia page with all SP500 symbols
url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

# use that URL to scrape the SP500 table using rvest
tickers <- url %>%
  # read the HTML from the webpage
  read_html() %>%
  # one way to get the table
  # html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[1]') %>%
  # easier way to get the table
  html_nodes(xpath = '//*[@id="constituents"]') %>%
  html_table()

# create a dataframe of tickers from the scraped table
sp500tickers <- tickers[[1]]
sp500tickers = sp500tickers %>%
  mutate(Symbol = case_when(Symbol == "BRK.B" ~ "BRK-B",
                            Symbol == "BF.B" ~ "BF-B",
                            TRUE ~ as.character(Symbol)))

How to find the xpath I passed to the html_nodes() function using Google Chrome:

1. Go to https://en.wikipedia.org/wiki/List_of_S%26P_500_companies.
2. Right-click on the webpage and select the Inspect option.
3. Most webpage content is in the body of an HTML document, so expand that section.
4. Looking at the webpage, I can see that the table I want is right below the first h2 header. After navigating around the page structure, I found that h2 header and the table below it.
5. Click on the table, right-click, and copy the xpath needed to scrape it: //*[@id="constituents"]. This is what was passed to html_nodes():

html_nodes(xpath = '//*[@id="constituents"]')
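As an aside (not something the original post covers), rvest also accepts CSS selectors, so the id you find in Chrome can be used without xpath at all. A rough equivalent of the call above:

# same selection using a CSS id selector instead of the xpath
tickers <- url %>%
  read_html() %>%
  html_nodes(css = "#constituents") %>%  # "#constituents" matches the element with id="constituents"
  html_table()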

Scaling up, part 2: purrr, iteration, and functional programming

Iterating refers to programmatically repeating a step or set of steps, either a set number of times or until a condition is met. Typically, when we iterate in any programming language, we use a loop, most often a for loop.

I will need to iterate over each element in the vector of SP500 tickers and pass it to the function tq_get().

I could do this with a for loop, but using the purrr package is a better idea.

For loops in R have a reputation for being slow, and they are often harder to read than the functional alternative.

The purrr package provides a suite of functions for iteration and functional programming that integrate well with the rest of the tidyverse.

The core function in purrr is map().

Most programming languages (including my favorite one, Python) have a map function that serves to apply a function to all elements of an object.

Functional Programming is a programming paradigm in which functions, as opposed to classes, build the structure and logic of programs.

For loops are typically avoided in functional programming.

Instead, functions are mapped or applied to lists or other objects.

As Hadley Wickham stated perfectly in his Advanced R book:

"It's hard to describe exactly what a functional style is, but generally I think it means decomposing a big problem into smaller pieces, then solving each piece with a function or combination of functions. When using a functional style, you strive to decompose components of the problem into isolated functions that operate independently. Each function taken by itself is simple and straightforward to understand; complexity is handled by composing functions in various ways."

Let's look at the difference between iteration with purrr and a for loop with an example.

Both of these operations are roughly equivalent:

# get a sequence of the numbers 1 to 5
numbers = seq(1:5)

print('for loop')
# for loop
for (i in numbers){
  print(i)
}

print('purrr :)')
# purrr functional programming approach
list = map(numbers, print)

A few things to note:

- Both the for loop and the map() function perform an operation for each element in the vector.
- map() returns a nested list, where each entry is the result of calling the function on one element of the object being iterated over.
- The map() version is a single line and less code.
- The map() version is easier to read.

Alright, now let's iterate to get all 505 SP500 tickers! First, I need to write a function to apply to each element of the vector.

I cannot simply use tq_get() with map because it does not return the ticker name as a column of the dataframe.

In order to get the ticker with the dataframe, I will use the mutate() function from dplyr to create a new column with the ticker name.

get_symbols = function(ticker = "AAPL"){
  df = tq_get(ticker, from = date) %>%
    mutate(symbol = rep(ticker, length(date)))
}

I then use this function in conjunction with map() to iterate over the vector of all symbols.

This returns a nested list containing a dataframe for each ticker.

I use the dplyr bind_rows() function to bind the dataframes together row-wise, to create one dataframe with all the SP500 tickers.

# create the dataframe of SP500 data by iterating over our vector of symbols,
# calling our get_symbols function each time; the map function accomplishes this
tickers_df = map(symbols, get_symbols) %>%
  bind_rows()
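As an aside, purrr also provides map_dfr(), which folds the mapping and row-binding into a single call. A minimal sketch, equivalent to the two-step version above and using the same symbols vector and get_symbols() function:

# map over the symbols and row-bind the resulting dataframes in one step
tickers_df = map_dfr(symbols, get_symbols)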

I also want this dataframe to contain the information from the Wikipedia table, most importantly the name of the company. This can be achieved by joining the two dataframes by symbol.

tickers_df = tickers_df %>%
  # left join with the wikipedia data
  left_join(sp500tickers, by = c('symbol' = 'Symbol')) %>%
  # make names R compatible (clean_names() comes from the janitor package)
  clean_names() %>%
  # keep only the columns we need
  select(date:security, gics_sector, gics_sub_industry)

After joining the data, we should do a quick sanity check to make sure that we have all 505 SP500 tickers:

tickers_df %>%
  # select just the symbol column
  select(symbol) %>%
  # get the distinct values
  distinct() %>%
  # count the distinct values
  count() %>%
  # we can use select to rename columns
  select("Total Number of Tickers" = n)

Finally, we can inspect the first few rows of the dataframe to confirm we have the data we want:

tickers_df %>%
  head()

Perfect! Exactly what we expected.

Wrapping up:

In this first blog post we have:

- Learned about the basics of the tidyverse ecosystem and tidyquant
- Learned about the basics of tidy iteration and functional programming
- Learned how to leverage rvest and Chrome to scrape data from Wikipedia
- Learned how to move from reading in a single asset to the whole SP500

In the next post, I will answer my original question: which SP500 assets had the highest average return over the last 3 months?

A few things to note:

- You can obtain a list of all SP500 stocks much more easily with tq_index("SP500"). That function does require the XLConnect library, which I am currently having issues getting to import and run on my local machine.
- tq_get() actually accepts a vector of tickers, so using map() isn't necessary. I intentionally wrote the code this way to demonstrate the concepts described above.
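For illustration, here is a rough sketch of that simpler route (the start date is just a placeholder, and I have not run this as part of the post):

# the constituents can also come straight from tidyquant (requires XLConnect):
# sp500 <- tq_index("SP500")

# tq_get() accepts a vector of symbols and returns one tibble with a symbol column
tickers_df <- tq_get(sp500tickers$Symbol, get = "stock.prices", from = "2019-01-01")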

Disclaimer: In college, I did a project with SP500 data and Python. It is against class policy to share that project.

While what I am doing here is somewhat similar, I am using a completely different programming language and answering a completely different question.

Given my previous statement, I assert that this is not a violation of that course's policy.

I intentionally used the R programming language, as opposed to Python, to completely avoid any issues and to respect my former professor's wishes.

The code can be found here.

