The Mueller Report: An investigation in R

Aditya Mangal · May 5

With the recent release of the Mueller Report, I thought it would be an interesting idea to do an investigation of the investigation in R, i.e. an exploratory analysis of the Mueller Report.

So here goes.

Environment Setup

Let's first load all the libraries that we'll need for this exploration.

rm(list = ls())

library(tidyverse)
library(pdftools)
library(tidylog)
library(hunspell)
library(tidytext)
library(ggplot2)
library(gridExtra)
library(scales)

Obtaining the Data, i.e. the Mueller Report

The report is freely available on the Justice Department's website here, and we can get access to it as an R object like so:

download.file("https://www.justice.gov/storage/report.pdf", "~/Downloads/mueller-report.pdf")
report <- pdf_text("~/Downloads/mueller-report.pdf")

Instead, I will use the preconverted CSV format of the report available here.

report <- read_csv("https://raw.githubusercontent.com/gadenbuie/mueller-report/master/mueller_report.csv")

Cleaning the data

Since the actual report starts a few pages in, and the PDF-to-text parsing results in some failures (as does the redacted portion of the report), there are some null lines in the data. Let's filter these out.

report %>%
  filter(page >= 9) -> content

content %>%
  filter(!is.na(text)) -> content

Also, due to the parsing errors, we see a lot of misspelled words in the data. Let's find and drop the lines in which the majority of words are misspelled, using hunspell.
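As a quick illustration (a toy string of my own, not a line from the report), hunspell() returns the words it flags as misspelled in each input string, so a line where most words are flagged is likely parsing noise:

# Toy example: hunspell() should flag "Thiss" and "contaains" as misspelled.
hunspell("Thiss sentence contaains two badly parsed words")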

content %>%
  rowwise() %>%
  mutate(num_misspelled_words = length(hunspell(text)[[1]]),
         num_words = length(str_split(text, " ")[[1]]),
         perc_misspelled = num_misspelled_words / num_words) %>%
  ungroup() %>%
  # keep only lines where fewer than half of the words are flagged as misspelled
  filter(perc_misspelled < 0.5) %>%
  select(-num_misspelled_words, -num_words) -> content

Normalizing the lines using tidytext

content %>%
  unnest_tokens(text, text, token = "lines") -> content

Most Popular Words

Let's see which are the most popular words in the Mueller report.

tidy_content <- content %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

tidy_content %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  filter(!is.na(word)) %>%
  count(word, sort = TRUE) %>%
  filter(str_length(word) > 1, n > 400) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_segment(aes(x = word, xend = word, y = 0, yend = n), color = "skyblue", size = 1) +
  geom_point(color = "blue", size = 4, alpha = 0.6) +
  coord_flip() +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position = "none") +
  labs(x = "", y = "Number of Occurrences",
       title = "Most popular words from the Mueller Report",
       subtitle = "Words occurring more than 400 times",
       caption = "Based on data from the Mueller Report")

As we can note from the plot above, the most popular words in the report are "president" and "trump".

Other notable words include "cohen", "flynn", "comey" and "mcgahn".
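If you want the exact counts rather than reading them off the plot, a short pipeline (my own addition, mirroring the cleaning steps above) prints the top of the frequency table directly:

# Top 10 most frequent non-stop-words and their counts
tidy_content %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  filter(!is.na(word), str_length(word) > 1) %>%
  count(word, sort = TRUE) %>%
  head(10)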

Most Common Correlated Words

Let's build a network of the words which are highly correlated in the report. (This step also needs the widyr, igraph and ggraph packages.)

library(widyr)
library(igraph)
library(ggraph)

word_cors <- tidy_content %>%
  add_count(word) %>%
  filter(n > stats::quantile(n, 0.7)) %>%
  pairwise_cor(word, page, sort = TRUE)

set.seed(123)

word_cors %>%
  filter(correlation > 0.25,
         !str_detect(item1, "\\d"),
         !str_detect(item2, "\\d")) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() +
  labs(x = "", y = "",
       title = "Commonly Occurring Correlated Words",
       subtitle = "Per-page correlation higher than 0.25",
       caption = "Based on data from the Mueller Report")

As we might expect, the most commonly occurring correlated words include "meeting" and "kushner", and "comey" and "investigation", among others.

Summary

As we have seen, we can quickly do an exploratory analysis of the Mueller Report using R.

Looking for a more in-depth analysis? Check out my detailed blog post about the Mueller Report at www.adityamangal.com. There I discuss how we can use Python's NLTK library in conjunction with R to analyze sentiment in the report, and also fact-check the late-night show hosts using our own custom-built search engine on the report.
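The blog post uses NLTK for the sentiment piece; as a rough, R-only sketch of the same idea (not the post's actual method), tidytext's sentiment lexicons can give a per-page positive/negative balance:

# Assumed sketch: per-page net sentiment using the Bing lexicon
# (get_sentiments("bing") may prompt to download the lexicon via textdata).
tidy_content %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(page, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative) %>%
  arrange(net_sentiment)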

Let me know your thoughts here or on the blog post.

Cheers!
