Spooky Author Identification – Exploratory Data Analysis Using ggplot2 and dplyr

The size of each word is mapped to its maximum deviation ( max_i(p_{i,j} - p_j) ), and its angular position is determined by the document where that maximum occurs.’_

See below the comparison cloud between all authors:

    # comparison.cloud() comes from the wordcloud package, stop_words from tidytext
    comparison_data <- spooky_trainining_tidy_1n %>%
      dplyr::select(author, word) %>%
      dplyr::anti_join(stop_words, by = "word") %>%
      dplyr::count(author, word, sort = TRUE)

    comparison_data %>%
      reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
      comparison.cloud(colors = c("red", "violetred4", "rosybrown1"), random.order = FALSE,
                       scale = c(7, .5), rot.per = .15, max.words = 200)

See below the comparison clouds between the authors, two authors at a time:

    # three panels side by side, no margins
    par(mfrow = c(1, 3), mar = c(0, 0, 0, 0))

    comparison_EAP_MWS <- comparison_data %>%
      dplyr::filter(author %in% c("EAP", "MWS"))
    comparison_EAP_MWS %>%
      reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
      comparison.cloud(colors = c("red", "rosybrown1"), random.order = FALSE,
                       scale = c(3, .2), rot.per = .15, max.words = 100)

    comparison_HPL_MWS <- comparison_data %>%
      dplyr::filter(author %in% c("HPL", "MWS"))
    comparison_HPL_MWS %>%
      reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
      comparison.cloud(colors = c("violetred4", "rosybrown1"), random.order = FALSE,
                       scale = c(3, .2), rot.per = .15, max.words = 100)

    comparison_EAP_HPL <- comparison_data %>%
      dplyr::filter(author %in% c("EAP", "HPL"))
    comparison_EAP_HPL %>%
      reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
      comparison.cloud(colors = c("red", "violetred4"), random.order = FALSE,
                       scale = c(3, .2), rot.per = .15, max.words = 100)

Question: How many unique words are needed in the author dictionary to cover 90% of the used word instances?

    words_cov_author_1 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "EAP")
    words_cov_author_2 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "HPL")
    words_cov_author_3 <- plot_word_cov_by_author(x = spooky_trainining_tidy_1n, author = "MWS")
    gridExtra::grid.arrange(words_cov_author_1, words_cov_author_2, words_cov_author_3, nrow = 1)

From the plot above we can see that the EAP and HPL corpora each need circa 7,500 unique words to cover 90% of word instances, while the MWS corpus needs circa 5,000.

Question: Is there any commonality between the dictionaries used by the authors? Are the authors using the same words?

A commonality cloud can be used to answer this specific question. It emphasizes the similarities between authors by plotting only those words that are used by all authors, sized by their combined frequency across authors.

See below the commonality cloud between all authors:

    # commonality.cloud() comes from the wordcloud package, brewer.pal() from RColorBrewer
    comparison_data <- spooky_trainining_tidy_1n %>%
      dplyr::select(author, word) %>%
      dplyr::anti_join(stop_words, by = "word") %>%
      dplyr::count(author, word, sort = TRUE)

    mypal <- brewer.pal(8, "Spectral")

    comparison_data %>%
      reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
      commonality.cloud(colors = mypal, random.order = FALSE,
                        scale = c(7, .5), rot.per = .15, max.words = 200)

See below the commonality clouds between the authors, two authors at a time:

    # three panels side by side, no margins
    par(mfrow = c(1, 3), mar = c(0, 0, 0, 0))
    mypal <- brewer.pal(8, "Spectral")

    comparison_EAP_MWS <- comparison_data %>%
      dplyr::filter(author %in% c("EAP", "MWS"))
    comparison_EAP_MWS %>%
      reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
      commonality.cloud(colors = mypal, random.order = FALSE,
                        scale = c(7, .5), rot.per = .15, max.words = 200)

    comparison_HPL_MWS <- comparison_data %>%
      dplyr::filter(author %in% c("HPL", "MWS"))
    comparison_HPL_MWS %>%
      reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
      commonality.cloud(colors = mypal, random.order = FALSE,
                        scale = c(7, .5), rot.per = .15, max.words = 200)

    comparison_EAP_HPL <- comparison_data %>%
      dplyr::filter(author %in% c("EAP", "HPL"))
    comparison_EAP_HPL %>%
      reshape2::acast(word ~ author, value.var = "n", fill = 0) %>%
      commonality.cloud(colors = mypal, random.order = FALSE,
                        scale = c(7, .5), rot.per = .15, max.words = 200)

Question: Can word frequencies be used to compare different authors?

First of all we need to prepare the data by calculating the word frequencies for each author:

    word_freqs <- spooky_trainining_tidy_1n %>%
      dplyr::anti_join(stop_words, by = "word") %>%
      dplyr::count(author, word) %>%
      dplyr::group_by(author) %>%
      dplyr::mutate(word_freq = n / sum(n)) %>%  # share of each word within its author
      dplyr::ungroup() %>%
      dplyr::select(-n)

Then we need to spread the author (key) and the word frequency (value) across multiple columns. Note how NAs are introduced for words that an author does not use:

    word_freqs <- word_freqs %>%
      tidyr::spread(author, word_freq)

Let's start by plotting the word frequencies (log scale), comparing two authors at a time, to see how the words distribute on the plane. Words that are close to the line (y = x) have similar frequencies in both sets of texts, while words that are far from the line are found more in one set of texts than in the other.

As we can see in the plots below, some words lie close to the line, but most are scattered away from it, showing a difference between the frequencies.
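
To make the frequency preparation above concrete, here is a minimal sketch on a tiny made-up token table. The data frame `toy_tokens` and the `log_ratio` column are illustrative assumptions, not part of the original analysis. The log ratio is just one possible way to put a number on how far a word sits from the y = x line.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per word instance (not the Kaggle data)
toy_tokens <- data.frame(
  author = c("EAP", "EAP", "EAP", "EAP", "HPL", "HPL", "HPL", "HPL"),
  word   = c("night", "night", "heart", "raven",
             "night", "ancient", "ancient", "heart"),
  stringsAsFactors = FALSE
)

toy_freqs <- toy_tokens %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(word_freq = n / sum(n)) %>%  # relative frequency within each author
  ungroup() %>%
  select(-n) %>%
  spread(author, word_freq)           # one column per author, NA if a word is unused

# Assumed distance measure: log10 of the frequency ratio.
# 0 means the word lies on the y = x line; NA means one author never uses it.
toy_freqs <- toy_freqs %>%
  mutate(log_ratio = log10(EAP / HPL))

print(toy_freqs)
```

In this toy table, "heart" has the same relative frequency for both authors and so gets a log ratio of 0, "night" is twice as frequent for EAP as for HPL, and "ancient" and "raven" get NA because only one author uses them.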

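Returning to the 90% coverage question: `plot_word_cov_by_author` is not defined in this part of the post, so the following is only a sketch of the kind of computation such a helper might be based on, not its actual implementation. The function name `words_needed_for_coverage` and the toy vector are assumptions made for illustration. The idea is to sort words by count, accumulate their share of all word instances, and report how many distinct words are needed to reach the target share.

```r
# Hypothetical helper (base R only): number of distinct words needed
# to cover a given share of all word instances
words_needed_for_coverage <- function(words, coverage = 0.9) {
  counts <- sort(table(words), decreasing = TRUE)          # most frequent first
  cum_share <- cumsum(as.numeric(counts)) / sum(counts)    # cumulative coverage
  which(cum_share >= coverage)[1]                          # first rank reaching the target
}

# Toy example: 11 word instances over 4 distinct words
toy_words <- c(rep("the", 6), rep("night", 3), "raven", "heart")
n_needed <- words_needed_for_coverage(toy_words, coverage = 0.9)
print(n_needed)
```

Here the two most frequent words cover 9 of 11 instances (about 82%), so a third word is needed to pass 90%.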