Random forest text classification in R: Trump v. Obama

Can I successfully determine the differences in speech content between the 2 most recent American Presidents?

Chris Marshall · Mar 19

There may never have been 2 consecutive presidents who differ so much in their character as Presidents Donald J. Trump and Barack Obama.
I thought it would therefore be pretty interesting to do some analysis on the text content of a selection of their speeches, to see whether the perception that they are vastly different characters is borne out by the data. I'm going to do some exploratory analysis on web-scraped data before applying a random forest classification model to try and predict who is doing the talking.
Obtaining the data

A quick search brought up a fantastic resource for this project: The American Presidency Project.
This website holds transcripts of a huge variety of Presidential documentation.
Addresses, speeches, interviews, debates: it's all there! I decided to stick to just one type of transcript for consistency: the Presidential Weekly Address.
Not only are they fairly regular, but they seem to discuss a good cross-section of Presidential policy.
The libraries I'm going to need for this project are as follows:

library(rvest)
library(xml2)
library(stringr)
library(dplyr)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)
library(caTools)
library(randomForest)

Using the lapply() function, I was able to scrape the 3 relevant CSS classes from the web pages into 3 lists.
Combined with the root URL, a quick browse revealed that weekly addresses 151 through 441 were those delivered by either Trump or Obama:

# scrape the speech text
speeches <- lapply(paste0('https://www.presidency.ucsb.edu/documents/the-presidents-weekly-address-', 151:441), function(url){
  url %>%
    read_html() %>%
    html_nodes(".field-docs-content") %>%
    html_text()
})

# scrape the President's name
name <- lapply(paste0('https://www.presidency.ucsb.edu/documents/the-presidents-weekly-address-', 151:441), function(url){
  url %>%
    read_html() %>%
    html_nodes(".diet-title") %>%
    html_text()
})

# scrape the date of each address
date <- lapply(paste0('https://www.presidency.ucsb.edu/documents/the-presidents-weekly-address-', 151:441), function(url){
  url %>%
    read_html() %>%
    html_nodes(".date-display-single") %>%
    html_text()
})

Data wrangling

Now that the data has been imported, we need to get it into a format that will be useful going forward.
I took the 3 list elements and combined them into a data frame, before subsetting this data frame based on the President's name:

# combine lists into a data frame
speech_data <- do.call(rbind, Map(data.frame, date = date, name = name, speech = speeches))

# split data into Trump and Obama
obama_speeches <- subset(speech_data, speech_data$name == 'Barack Obama')
trump_speeches <- subset(speech_data, speech_data$name == 'Donald J. Trump')

Corpus creation

obama_corpus <- VCorpus(VectorSource(obama_speeches$speech))
trump_corpus <- VCorpus(VectorSource(trump_speeches$speech))

Now that the corpora are loaded up, we need to clean everything up.
Removing stop words (and, the, etc.) and applying stemDocument (getting rid of all the "-ing" and "-ed") to each corpus are both important steps. In addition to this, we need to stick everything in lower case, remove all the numbers and punctuation, and make sure there are no tabs/unnecessary white space.
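To get a feel for what those transformations actually do, here's a minimal sketch that runs the same tm functions on a single made-up sentence (the sentence and the demo_corpus name are purely illustrative):

# illustrative only: the cleaning pipeline applied to one invented sentence
demo_corpus <- VCorpus(VectorSource("In 2019, the Presidents were discussing jobs, taxes, and healthcare!"))
demo_corpus <- tm_map(demo_corpus, content_transformer(tolower))
demo_corpus <- tm_map(demo_corpus, removeNumbers)
demo_corpus <- tm_map(demo_corpus, removePunctuation)
demo_corpus <- tm_map(demo_corpus, removeWords, stopwords())
demo_corpus <- tm_map(demo_corpus, stemDocument)
demo_corpus <- tm_map(demo_corpus, stripWhitespace)
as.character(demo_corpus[[1]])   # roughly: "presid discuss job tax healthcar" (plus the odd stray space)

The same sequence of steps is then applied to each full corpus: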
obama_corpus <- tm_map(obama_corpus, content_transformer(tolower))
obama_corpus <- tm_map(obama_corpus, removeNumbers)
obama_corpus <- tm_map(obama_corpus, removePunctuation)
obama_corpus <- tm_map(obama_corpus, removeWords, stopwords())
obama_corpus <- tm_map(obama_corpus, stemDocument)
obama_corpus <- tm_map(obama_corpus, stripWhitespace)

trump_corpus <- tm_map(trump_corpus, content_transformer(tolower))
trump_corpus <- tm_map(trump_corpus, removeNumbers)
trump_corpus <- tm_map(trump_corpus, removePunctuation)
trump_corpus <- tm_map(trump_corpus, removeWords, stopwords())
trump_corpus <- tm_map(trump_corpus, stemDocument)
trump_corpus <- tm_map(trump_corpus, stripWhitespace)

Data visualisation

With both of the corpora in pretty good shape, we now have a chance to do a bit of data visualisation before applying the machine learning.
The first step here is to convert our corpora into a format where they are ready for analysis.
For the data visualisation part, this involves the TermDocumentMatrix() function, as shown below:

otdm <- TermDocumentMatrix(obama_corpus)
ttdm <- TermDocumentMatrix(trump_corpus)

All this has done is convert our data into a matrix where each unique word has its own row and each document (speech) has its own column.
The data within the matrix represents how many times each word was said in each speech.
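If you want to sanity-check that structure, tm's inspect() prints a small slice of the matrix; a quick sketch (the row and column ranges here are arbitrary, and the terms you see will depend on your scrape):

# peek at the first few terms and documents of Obama's term-document matrix
inspect(otdm[1:5, 1:3])
dim(otdm)   # number of unique terms by number of speeches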
With data such as these, the wordcloud() function is a fantastic way to concisely view each President’s most used words.
Once we have converted the matrices into data frames, we can generate some visualisations.
# building Obama's data frame
om <- as.matrix(otdm)
ov <- sort(rowSums(om), decreasing = TRUE)
od <- data.frame(word = names(ov), freq = ov)

# generating Obama's word cloud
set.seed(1234)
wordcloud(words = od$word, freq = od$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.3, colors = "blue")

# building Trump's data frame
tm <- as.matrix(ttdm)
tv <- sort(rowSums(tm), decreasing = TRUE)
td <- data.frame(word = names(tv), freq = tv)

# generating Trump's word cloud
set.seed(1234)
wordcloud(words = td$word, freq = td$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per = 0.3, colors = "red")

Combining each of the word clouds generated from the above code gives us something that looks like this:

[Image: The Presidential word clouds]

At first glance, there's not a whole lot we can gather from this image.
The one obvious point that does stand out is Trump’s fairly strong reliance on the word “American”.
The word makes up 2.99% of all the words Trump says (minus the stop words). This is nearly 150% higher than Obama's use of the same word (1.21% of his total words).
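Those percentages can be reproduced, at least approximately, from the term-frequency matrices built for the word clouds above (a sketch; it assumes the stemmed form of the word is "american", the same form used in the findAssocs() calls below):

# share of the stem "american" among each President's retained tokens
# om and tm are the term-frequency matrices built for the word clouds
sum(tm["american", ]) / sum(tm) * 100   # Trump
sum(om["american", ]) / sum(om) * 100   # Obama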
To take this a little further, let's look at which words each President likes to use alongside "American".
Because of Trump's more frequent use of the word, there are significantly more associated words for him, even at a higher correlation limit, as shown below:

# Trump's word associations
findAssocs(ttdm, "american", corlimit = 0.5)

  worker  million     town    crush  terribl encourag     hire     plan   provid
    0.58     0.57     0.57     0.55     0.54     0.53     0.53     0.53     0.50

# Obama's word associations
findAssocs(otdm, "american", corlimit = 0.25)

       tax     health aheadchang      begun        bid       grip    honesti
      0.29       0.28       0.26       0.26       0.26       0.26       0.26
      huge     sought       auto   campaign      trade
      0.26       0.26       0.25       0.25       0.25

We can see that quite a few of Trump's top associations focus on jobs: "worker", "hire" and "provide".
This does reflect Trump’s campaign promises on improving the job market.
Although there are lower correlations among the Obama results, "tax" and "health" top the pile, which is again understandable considering that much of the Obama administration's economic policy was based around moderate tax increases on high-income Americans being used to fund healthcare reform.
Building the random forest classification model

Now that we've had a brief look at the data we're working with, we can put together the classification model.
First, we need to load up another corpus that contains data from both Presidents and clean it up, just like we did previously.
corpus <- VCorpus(VectorSource(speech_data$speech))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords())
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

Next, we need to build a document-term matrix (simply a transposition of the term-document matrix we saw above: it has each speech as a row, which is important when we attach our dependent variable below).
With the document-term matrix, we can also clean it a little further by removing some of the noise (the rarest words). With the sparsity argument of removeSparseTerms() set to 0.9, any term that is absent from more than 90% of the speeches is dropped, so only reasonably common words are kept.
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.9)
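It's worth checking how aggressive that pruning is; comparing dimensions before and after is a quick sanity check (a sketch: dtm_full is a throwaway name, and the exact counts will depend on the scrape):

# compare vocabulary size before and after dropping sparse terms
dtm_full <- DocumentTermMatrix(corpus)
dim(dtm_full)                            # speeches x all terms
dim(removeSparseTerms(dtm_full, 0.9))    # speeches x only the terms kept above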
In order to conduct the classification, we need to convert our matrix into a data frame.
After this, the data frame will still be missing the dependent variable (i.e. the President's name), which we will need in order to train the model, so that has been added in below.
data <- as.data.frame(as.matrix(dtm))
data$name <- speech_data$name

By setting a split ratio of 0.75, we're letting our model learn from a random selection of 75% of our speeches and then letting it try and determine the correct President on the remaining 25%.
set.seed(1234)
split <- sample.split(data$name, SplitRatio = 0.75)
training_set <- subset(data, split == TRUE)
test_set <- subset(data, split == FALSE)

Once these sets are ready, we can fit the random forest classifier to the training set, ensuring our data frame of predictors does not include the dependent variable (hence the [-470], which drops the 470th column, our name variable).
classifier <- randomForest(x = training_set[-470], y = training_set$name, ntree = 10)
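As an aside, randomForest records how much each predictor (here, each word stem) contributed across the trees; a quick sketch if you want to peek at which stems drive the split between the two Presidents:

# top ten word stems by importance (mean decrease in Gini impurity)
imp <- importance(classifier)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)
# varImpPlot(classifier) shows the same information as a plot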
We can then use our classifier to make a prediction on the test set that we created above.

y_pred <- predict(classifier, newdata = test_set[-470])

The final step involves creating the confusion matrix, which will compare the predicted values generated above with the true values stored in the test set.
We can see that 69 out of 73 speeches were classified correctly ([63+6]/[63+0+4+6]), an accuracy of 94.52%. We have just 4 erroneous (false negative) predictions, where a speech by President Trump was classified as one conducted by President Obama.
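That arithmetic is easy to verify in code once the confusion matrix (cm, built just below) exists; a one-line sketch:

# overall accuracy: correct predictions (the diagonal) divided by all test speeches
sum(diag(cm)) / sum(cm)   # 69 / 73, roughly 0.9452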
cm <- table(test_set[, 470], y_pred)
cm

                   Barack Obama  Donald J. Trump
  Barack Obama               63                0
  Donald J. Trump             4                6

The result seems satisfactory; however, perhaps I've chosen 2 Presidents who are so wildly different in their approach to their role that the accuracy of the model was destined to always be high.
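As a quick way of putting that 94.52% in context, it can be compared against the naive baseline of always guessing the more common President in the test set (a small sketch using the counts from the confusion matrix above):

# naive baseline: always predict the majority class in the test set
max(table(test_set$name)) / nrow(test_set)   # 63 / 73, roughly 0.863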
Thanks for reading! I'm no expert, so I welcome all feedback and constructive criticism in the comments.