Scraping and Exploring Vanguard ETFs with R

Scraping and Exploring Vanguard ETFs with RAlex PavlakisBlockedUnblockFollowFollowingMay 27, 2018BackgroundThis post is an exploratory analysis of Vanguard’s index funds in R.

It is not meant to constitute investment advice, just to document a process for acquiring and exploring financial data in R to inform decisions for personal savings and investment.

Like most people who look for safe and low-cost saving and investment opportunities, I have been frustrated by conflicting advice and unstructured information.

Vanguard’s ETFs seemed to be a good fit for me but I wanted to develop a rigorous and reproducible workflow for epxloring my options.

Acquiring Fund DataAt time of writing Vanguard has 56 available ETFs which can be seen in the table on their website https://investor.

vanguard.

com/etf/list#/etf/asset-class/month-end-returns.

The table has a lot of information, and it’s hard to get a sense which ETFs are performing best and which are duds.

Vanguard offers some summary statistics and simple fund comparison tools, but I wanted the raw data to work with myself.

The table in this website is JavaScript, which means it’s a little tricky to scrape.

I relied use guidance from Brooke Watson’s excellent post on Scraping Javascript websites in R https://blog.

brooke.

science/posts/scraping-javascript-websites-in-r/.

The following code lets us scrape the javascript from Vangaurd’s website and render it into html on our machine.

# Javascript for creating a new file, scrape.

jswriteLines("var url = NULL; var page = new WebPage(); var fs = require('fs'); page.

open(url, function (status) { just_wait(); }); function just_wait() { setTimeout(function() { fs.

write('1.

html', page.

content, 'w'); phantom.

exit(); }, 2500); } ", con = "scrape.

js")# Function for scraping from a URLjs_scrape <- function(url, js_path = "scrape.

js", phantompath = "/usr/local/Cellar/phantomjs/2.

1.

1/bin/phantomjs"){ # Replace url in scrape.

js lines <- readLines(js_path) lines[1] <- paste0("var url ='", url ,"';") writeLines(lines, js_path) # Run from command line command <- paste(phantompath, js_path, sep = " ") system(command) }# Scrape itjs_scrape(url = "https://investor.

vanguard.

com/etf/list#/etf/asset-class/month-end-returns")Now we have the raw html saved in a file 1.

html.

Next we’ll identify where the information we want from the page lives and strip it from the html using selector gadget (http://selectorgadget.

com), rvest, sttringr.

We extract fund names, tickers, asset classes, expense ratios, prices, sec yeilds, year-to-date returns, one year returns, five year returns, ten year returns, lifetime returns, and date of inception, clean them up, and combine them into one data frame.

The raw data extracted from the html is pretty messy, so we’ll have to do a bit of work to clean and format it for exploration.

# Load packageslibrary(dplyr)library(rvest)library(stringr)library(tidyr)# Read htmlhtml <- read_html("1.

html")# Extract fund namesfund_names <- html %>% html_nodes(".

productName a") %>% html_text() # Extract tickersticker <- html %>% html_nodes(".

fixedCol+ .

ng-binding") %>% html_text()# Extract asset classesasset_class <- html %>% html_nodes(".

assetClass") %>% html_text() %>% sapply(.

, function(x) x[!str_detect(x, " ")]) %>% unlist()# Extract expense ratiosexpense_ratio <- html %>% html_nodes(".

expenseRatio") %>% html_text() %>% sub("%", "", .

) %>% as.

numeric() %>% sapply(.

, '/', 100) %>% sapply(.

, function(x) x[!is.

na(x)]) %>% unlist()# Extract priceprice <- html %>% html_nodes(".

expenseRatio+ .

ng-binding") %>% html_text() %>% sub("[$]", "", .

) %>% as.

numeric()# Extract sec yieldsec_yield <- html %>% html_nodes(".

secYield") %>% html_text()sec_yield_clean <- sec_yield[!str_detect(sec_yield, "SEC")] %>% str_replace_all(.

, ".", "") %>% str_replace_all(.

, " ", "") %>% str_replace_all(.

, "âu0080u0094", NA_character_)# Extract ytd returnsytd <- html %>% html_nodes(".

secYield+ .

ng-binding") %>% html_text() %>% sub("%", "", .

) %>% as.

numeric() %>% sapply(.

, '/', 100)# Extract one yr returnsone_yr <- html %>% html_nodes(".

ng-binding:nth-child(11)") %>% html_text() %>% str_replace_all(.

, "âu0080u0094", NA_character_) %>% sub("%", "", .

) %>% as.

numeric() %>% sapply(.

, '/', 100)# Extract five yr returnsfive_yr <- html %>% html_nodes(".

ng-binding:nth-child(12)") %>% html_text() %>% str_replace_all(.

, "âu0080u0094", NA_character_) %>% sub("%", "", .

) %>% as.

numeric() %>% sapply(.

, '/', 100)# Extract ten yr yieldsten_yr <- html %>% html_nodes(".

ng-binding:nth-child(13)") %>% html_text() %>% str_replace_all(.

, "âu0080u0094", NA_character_) %>% sub("%", "", .

) %>% as.

numeric() %>% sapply(.

, '/', 100)# Extract yield since inceptionsince <- html %>% html_nodes(".

right:nth-child(14)") %>% html_text() %>% str_replace_all(.

, ".", "") %>% str_replace_all(.

, " ", "") %>% str_split(.

, "[(]") %>% lapply(.

, head, 1) %>% unlist() %>% sub("%", "", .

) %>% as.

numeric() %>% sapply(.

, '/', 100)# Extract date of inceptioninception <- html %>% html_nodes(".

right:nth-child(14)") %>% html_text() %>% str_replace_all(.

, ".", "") %>% str_replace_all(.

, " ", "") %>% str_split(.

, "[(]") %>% lapply(.

, tail, 1) %>% str_replace_all(.

, "[)]", "") %>% unlist() %>% as.

Date(.

, "%m/%d/%Y")# Combine into one data framefund_data <- data.

frame(fund_names, ticker, asset_class, expense_ratio, price, sec_yield_clean, ytd, one_yr, five_yr, ten_yr, since, inception, stringsAsFactors = FALSE)# Drop duplicate rowsfund_data <- fund_data %>% distinct()write.

csv(fund_data, "fund_data.

csv", row.

names = FALSE)Finally we have the data in a format we can use for some simple analyses.

We’ll start by looking at each fund’s year-to-date-performance so for this year.

library(readr)library(ggplot2)source("https://raw.

githubusercontent.

com/alexpavlakis/themes/master/theme_acp.

R")# Load datafund_data <- read_csv("fund_data.

csv")# Look at first few rowsfund_data %>% select(fund_names, ticker, inception, asset_class, expense_ratio, ytd, since) %>% head()fund_data %>% ggplot() + aes(x = reorder(fund_names, ytd), y = ytd) + geom_bar(stat = "identity") + coord_flip() + xlab("") + ylab("") + ggtitle("Year-to-date returns") + theme_acp() + theme(axis.

text = element_text(size = 8))We can clearly see that some funds have lost money this year, others have made money, and others have broken even.

Next we’ll look at each fund’s total returns, plotted against their date of inception.

fund_data %>% ggplot() + aes(x = inception, y = since) + geom_hline(yintercept = 0, col = acp_orange, lwd = 1.

5, lty = 2) + geom_text(aes(label = ticker), col = acp_red, size = 5) + ylab("Average Annual Returns") + scale_x_date("Year of Inception", date_breaks = "1 year", date_labels = "%Y") + ggtitle("Average Annual Returns for Vangaurd ETFs") + theme_acp() + theme(axis.

text.

x = element_text(size = 12))There is greater variance in the average annual returns of newer funds than newer funds.

While some tickers may look like they have extremely high or low returns, that may be because they are newer.

Some funds seem to pop.

VOO has been around since late 2010 and still has average annual returns of almost 15%.

Next we look at each fund’s expense ratios.

Funds that have high returns but high expense ratios can leave you with less money in the long run than funds with lower returns and lower expense ratios.

The plot below displays the expense ratio for all funds, colored by broad asset classes.

There is a wide range of expense ratios, with international funds tending to have higher expense ratios, and bond funds tending to have lower expense ratios.

fund_data %>% select(fund_names, expense_ratio, asset_class) %>% mutate(`Asset Class` = sapply(str_split(asset_class, " "), head, 1)) %>% ggplot() + aes(x = reorder(fund_names, expense_ratio), y = expense_ratio, fill = `Asset Class`) + geom_bar(stat = "identity") + coord_flip() + xlab("") + ylab("") + ggtitle("International funds have higher expense ratios") + theme_acp() + theme(axis.

text.

y = element_text(size = 8)) + scale_fill_hue(c = 56, l = 56)Acquiring Time-Series DataAggregate data doesn’t provide enough information to choose a fund or mix of funds.

To understand long-term performance, we’ll need time-series data for each fund.

That way we’ll be able to get a better sense of the variance and correlations among them.

For that we turn to the quantmod package, which makes it easy to download and model the performance of stocks or funds.

library(quantmod)library(lubridate)# Get historical data for all fundsall_funds <- list()for(i in seq_along(fund_data$ticker)) { # Download data indto df all_funds[[i]] <- getSymbols(fund_data$ticker[i], src = "yahoo", auto.

assign = FALSE) %>% as.

data.

frame() # Add ticker column all_funds[[i]]$ticker <- fund_data$ticker[i] # Add date column all_funds[[i]]$date <- date(row.

names(all_funds[[i]])) # Standardize names names(all_funds[[i]]) <- c("open", "high", "low", "close", "volume", "adjusted", "ticker", "date")}# Combine into one dffund_history <- bind_rows(all_funds)write.

csv(fund_history, "fund_history.

csv", row.

names = FALSE)Now we have one data frame with the historical returns of all Vanguard ETFs.

We can look at the historical performance of all funds to see if any stand out.

# Time series of scaled returnsfund_history %>% group_by(ticker) %>% mutate(scaled_close = close/close[date == min(date)] – 1) %>% ungroup() %>% ggplot() + aes(x = date, y = scaled_close, col = ticker) + geom_line(lwd = 0.

5) + scale_y_continuous("Percent of initial", limits = c(-1, 3), breaks = seq(-1, 3, 1)) + scale_x_date("Date", date_breaks = "1 year", date_labels = "%Y") + ggtitle("Historical performance of Vangaurd ETFs") + theme_acp() + theme(axis.

text.

x = element_text(size = 10), panel.

grid = element_blank()) + scale_color_discrete(guide = FALSE)The VGT fund (Vanguard Information Technology ETF) seems to have broken away from the pack recently, and has been around since 2007.

However it’s very volatile.

If we are going to invest in this fund, we’ll also want to invest in some that are uncorrelated with it.

Bonds usually move in the opposite direction of stocks.

We’ll start by looking at the overall correlation between VGT and all the other funds since 2007.

vgt_data <- fund_history %>% filter(ticker == "VGT")vgt_cor <- fund_history %>% group_by(ticker) %>% summarise(correlation = cor(close, vgt_data$close[vgt_data$date >= min(date) & vgt_data$date <= max(date)], use = "complete.

obs"))vgt_cor %>% arrange(correlation) %>% head(10)ticker correlation <chr> <dbl> 1 VTC -0.

639 2 VTIP -0.

319 3 VGSH -0.

161 4 VTEB -0.

0382 5 BSV 0.

106 6 VGK 0.

141 7 VWO 0.

153 8 VDE 0.

202 9 VGIT 0.

216 10 BIV 0.

363A plot of the funds whose correlation with VGT is lowest reveals that they are mostly low-performing funds.

We may want to dig a little deeper for funds to pair with VGT.

fund_history %>% filter(ticker %in% c("VGT", "VTC", "VTIP", "VGSH", "VTEB", "BIV")) %>% group_by(ticker) %>% mutate(close = close/close[date == min(date)] – 1) %>% ungroup() %>% ggplot() + aes(x = date, y = close, col = ticker) + geom_line() + scale_x_date(date_breaks = "1 year", date_labels = "%Y") + theme_acp() + theme(axis.

text.

x = element_text(size = 8))ConclusionPersonal finance and investing can seem daunting.

With so many products available and your life’s savings on the line, it can be tempting to turn to a high-priced advisor or keep your money in a checking account.

Fortunately R can help us acquire, structure, and analyze data to make confident and informed personal financial decisions.

.

. More details

Leave a Reply