Correlation between score and comments on the front page of Reddit

Jared Whalen · Mar 17

I’m not a huge Reddit user, but when I’m there I usually hang out in just a few subreddits (r/dataisbeautiful, r/geography, r/philadelphia, r/emo and a couple of others).

Other than the popular data viz subreddit, none of these generate the scores you see on the front page of Reddit.

I only say all that to point out that I am not tapped into all the nuances of Reddit culture.

I did, however, have a question after perusing the front page of the internet one day: is there a correlation between a popular post’s score and the amount of conversation that gathers around it?

[Chart: It looks like r/worldnews really struck up some conversations!]

First things first: getting the data.

I took a look at the Reddit homepage API but didn’t see a way to pull the top posts for a given date and time.

So I turned to the Wayback Machine’s API, which takes a specific date and time as parameters and returns the URL of the closest web capture.
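To give a sense of what that API returns, here is a minimal sketch of a single availability query, assuming the httr and jsonlite packages that the scraping code below also relies on:

library(httr)
library(jsonlite)

# ask for the capture of reddit.com closest to noon on January 1, 2018
resp <- GET("https://archive.org/wayback/available?url=reddit.com&timestamp=20180101120000")
snapshot <- fromJSON(rawToChar(resp$content))

# the response is JSON; the closest capture's URL and timestamp
# live under archived_snapshots$closest
snapshot$archived_snapshots$closest$url
snapshot$archived_snapshots$closest$timestamp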

It appears the front page is pretty well archived.

Feeling pretty confident that I could scrape a good amount of data for 2018, I jumped over to R and generated a list of complete URLs to call the API.

library(lubridate)

dateRange <- gsub("-", "", seq(ymd('20180101'), ymd('20181231'), by = '1 day', truncated = 2))

base_url <- "https://archive.org/wayback/available?url=reddit.com"

# create list of api urls
url_list <- c()
for (date in dateRange) {
  full_url <- paste(base_url, "&timestamp=", date, "120000", sep = "")
  url_list <- c(url_list, full_url)
}

Now we can call the Wayback Machine to get a list of web captures.

library(httr)
library(jsonlite)

# create list of archive links
archive_list <- c()
archive_times <- c()
for (url in url_list) {
  # get raw result from api call
  raw.result <- GET(url = url)
  raw.result$status_code
  # get raw content from results
  this.raw.content <- rawToChar(raw.result$content)
  # put content into list
  this.content <- fromJSON(this.raw.content)
  # extract archive url from list and add to archive_list
  archive_times <- c(archive_times, this.content$archived_snapshots$closest$timestamp)
  archive_list <- c(archive_list, this.content$archived_snapshots$closest$url)
}

This gives us a list of 365 URLs to captures from around noon of each day.

Now onto the actual web scraping.

There are probably quicker ways to do this, but I went with an old-fashioned for loop and used the rvest package to scrape the score, number of comments, and r/subreddit of each of the page’s 25 posts.

I included some simple error handling by checking that the length of the r/subreddit value is greater than 0 (i.e., any posts were actually pulled) before adding it to the datalist variable.

After the loop is complete, I use rbind to fill the data frame and filter out any problematic data.

library(rvest)
library(dplyr)

# create empty list
datalist = list()

# loop through archive urls
for (i in 1:length(archive_list)) {
  # get all the html from the webpage
  webpage <- read_html(archive_list[i])
  # filter all the .things
  things <- webpage %>%
    html_node("#siteTable") %>%
    html_nodes(".thing")
  # get votes
  score <- things %>%
    html_node(".score") %>%
    html_text()
  # get number of comments
  comments <- things %>%
    html_node(".comments") %>%
    html_text()
  # remove " comments" and convert to number
  comments <- as.numeric(gsub(" comments", "", comments))
  # get post subreddit
  subreddit <- things %>%
    html_node(".subreddit") %>%
    html_text()
  # get date of page
  date <- gsub("http://web.archive.org/web/|/https://www.reddit.com/", "", archive_list[i])

  if (length(subreddit) > 0) {
    print(paste(unique(date), length(subreddit), sep = " "))
    # create temp df
    temp <- data.frame(date = date, score = score, comments = comments, subreddit = subreddit)
    # add it to the list
    datalist[[i]] <- temp
  }
}

# make a df from the datalist
main_data = do.call(rbind, datalist)

# remove incomplete posts
reddit_posts <- main_data %>%
  filter(score != "•",
         !is.na(score),
         !is.na(comments)) %>%
  mutate(score = as.numeric(sub("k", "e3", score, fixed = TRUE)),
         subreddit = gsub(".*r/", "r/", subreddit))

How did the scrape do? Not too bad.

The scrape successfully pulled daily posts for 75% of the year.

I didn’t investigate this too thoroughly since I had enough data to work with, but I think the Wayback Machine had some issues with the Reddit site redesign.
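As a rough sanity check on that figure, you can count how many distinct capture dates made it into the reddit_posts data frame built above (a minimal sketch; dplyr is already loaded for the scraping code):

# number of days (out of 365) that produced at least one usable post
days_scraped <- reddit_posts %>%
  distinct(date) %>%
  nrow()

days_scraped / 365  # roughly 0.75, in line with the 75% figure above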

Now we have a freshly minted dataset, but in order to produce the visualization I want, it needs some wrangling:

1. Identify the eight subreddits that sent the most posts to the front page.
2. Change the subreddit value of posts that came from non-top subs to “other”.
3. Reclassify the subreddit factor levels so that they are in descending order, with “other” at the end.

# get top 8 subreddits
top_subs <- reddit_posts %>%
  group_by(subreddit) %>%
  summarise(count = n()) %>%
  top_n(8, count) %>%
  ungroup()

# create vector of top_subs
top_subs <- as.character(top_subs$subreddit)

# make notin operator
'%!in%' <- function(x, y) !('%in%'(x, y))

reddit_posts_reduc <- reddit_posts %>%
  mutate(subreddit = case_when(
    subreddit %!in% top_subs ~ 'other',
    TRUE ~ as.character(.$subreddit)
  ))

# get list of factors in descending order
factor_order <- reddit_posts_reduc %>%
  group_by(subreddit) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  select(subreddit)

# overwrite with list
factor_order <- as.vector(factor_order$subreddit)

# remove "other" from first position
factor_order <- factor_order[-1]

# create new factor level list
factor_order2 <- factor_order

# update new factor list with ordering info
for (i in 1:length(factor_order)) {
  factor_order2[[i]] <- paste("#", i, " ", factor_order[[i]], sep = "")
}

# append other to both factor lists
factor_order <- append(factor_order, "other")
factor_order2 <- append(factor_order2, "other")

# update dataframe levels with updated factor levels
# mapvalues() comes from plyr; called with :: to avoid masking dplyr verbs
reddit_posts_reduc$subreddit_f <- plyr::mapvalues(reddit_posts_reduc$subreddit,
                                                  from = factor_order,
                                                  to = factor_order2)
levels(reddit_posts_reduc$subreddit_f)

Now, time to plot.

I plotted the score on the x-axis and the number of comments on the y-axis.

I used axis limits to account for outliers. The end result was a plot of small multiples grouped by subreddit, each facet labeled with its correlation coefficient.
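The coefficient shown on each panel is a plain Pearson correlation (added by ggpubr’s stat_cor() in the plotting code below); if you want the numbers directly, a minimal sketch computed on the wrangled reddit_posts_reduc data frame, without the outlier clipping the axis limits apply, looks like this:

# Pearson correlation between score and comments for each facet
reddit_posts_reduc %>%
  group_by(subreddit_f) %>%
  summarise(r = cor(score, comments, method = "pearson", use = "complete.obs"))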

library(ggplot2)
library(ggthemes)  # theme_fivethirtyeight()
library(ggpubr)    # stat_cor()
library(scales)    # unit_format()

# plot data
reddit_posts_reduc %>%
  ggplot(aes(x = score, y = comments, color = subreddit_f)) +
  geom_point(size = 3, alpha = 0.4) +
  facet_wrap(~subreddit_f, ncol = 3) +
  geom_smooth(se = F) +
  theme_fivethirtyeight() +
  theme(axis.title = element_text()) +
  # labs(title = "Correlation between score and comments on front page",
  #      subtitle = "Posts from the front page of Reddit in 2018 plotted to show correlation between score and the number of comments. Posts are grouped by the eight subreddits that sent the most posts to the front page with all other posts grouped in other.",
  #      caption = "Data from Reddit via Archive.org. Chart by @jared_whalen"
  # ) +
  theme(legend.position = "none") +
  stat_cor(method = "pearson", label.x = 110000, label.y = 9000) +
  scale_y_continuous(label = unit_format(unit = "K", scale = 1e-3, sep = ""), limits = c(0, 10000)) +
  scale_x_continuous(label = unit_format(unit = "K", scale = 1e-3, sep = ""), limits = c(0, 150000)) +
  xlab("Score") +
  ylab("Number of comments")

Things I gained from this project:

- How to use the Wayback Machine’s API to scrape archived pages
- A better understanding of reassigning factor levels in order to customize ordering when plotting

Here is a gist of the entire source code.
