Analysis of Twitter Data Using R — Part 2 : Word Cloud

Rohit Nair
May 15, 2016

In the last article we learnt how to authenticate with Twitter and how to extract tweets using R code. In this article we will create a word cloud from those tweets.

What is a Word Cloud?

A word cloud is a text mining method that highlights the most frequently used keywords in a body of text. It is a visual representation in which the most frequent words stand out. A word cloud can be a handy tool when you need a quick visualization of the most commonly cited words in a text. Being an R enthusiast, I always wanted to produce this kind of image within R.

In this post we will use R to visualize tweets as a word cloud to find out what people are tweeting about Instagram's new logo (#instagramlogo).

Requirements :

We will require three R packages for this :

    install.packages(c("tm", "SnowballC", "wordcloud"))
    library(wordcloud)
    library(SnowballC)
    library(tm)

Step 1 : Extract tweets from Twitter.

    # searchTwitter() comes from the twitteR package, loaded and authenticated as in Part 1
    insta <- searchTwitter("#instagramlogo", n = 3000, lang = "en")

Step 2 : Identify & create the text to turn into a cloud.

The first step is to pull out the text on which you want to build the word cloud. Here that is the text of each tweet.

    insta_text <- sapply(insta, function(x) x$getText())

Step 3 : Create a corpus from the collection of texts.

A corpus is just a way to store a collection of documents in a format that R can read. The package "tm" and other text mining packages operate on this format.

    insta_text_corpus <- Corpus(VectorSource(insta_text))

Step 4 : Data cleaning on the text

Data cleaning is the most important step in the entire process. The goal is to keep only the keywords that carry the meaning of each sentence. Typical clean-up converts the text to lower case and removes punctuation, usernames and links, or replaces symbols like "/" or "@" with a blank space.

4.1 Remove punctuation.

    insta_text_corpus <- tm_map(insta_text_corpus, removePunctuation)

4.2 Transform text to lower case.

    insta_text_corpus <- tm_map(insta_text_corpus, content_transformer(tolower))

4.3 Remove stop words.

Stop words are common words that we are usually not interested in. Looking at the result of stopwords("english") shows exactly what gets removed. Because these words are so common in a language, their information value is near zero, so it is useful to remove them before further analysis.

    insta_text_corpus <- tm_map(insta_text_corpus, removeWords, stopwords("english"))

4.4 Remove your own stop words.

    # specify your stop words as a character vector
    # (the text is already lower case and removeWords is case sensitive, so use "rt", not "RT")
    insta_text_corpus <- tm_map(insta_text_corpus, removeWords, c("rt", "are", "that"))

Depending on what you are trying to achieve with your analysis, you may want to do the data cleaning step differently. You may want to know what punctuation is being used in your text, or the stop words might be an important part of your analysis. So use your head and have a look at the getTransformations() function to see what your data cleaning options are.

4.5 Strip extra whitespace.

    insta_text_corpus <- tm_map(insta_text_corpus, stripWhitespace)

4.6 Remove URLs from the text.

Because punctuation has already been stripped, a link such as https://t.co/abc has collapsed into a single alphanumeric token by this point, which is what the pattern below matches.

    removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
    insta_text_corpus <- tm_map(insta_text_corpus, content_transformer(removeURL))

So, what have we just done? We've transformed every word to lower case, so that 'Apple' and 'apple' now count as the same word. We've removed all punctuation, so 'apple' and 'apple!' are now the same. We stripped out any extra whitespace, and we removed stop words and URLs.

Step 5 : Build a term-document matrix

A term-document matrix is a table containing the frequency of the words. Row names are words (terms) and column names are documents, which is why the row sums below give the total frequency of each word. The function TermDocumentMatrix() from the tm package can be used as follows :

    insta_2 <- TermDocumentMatrix(insta_text_corpus)
    insta_2 <- as.matrix(insta_2)
    insta_2 <- sort(rowSums(insta_2), decreasing = TRUE)
    # Converting words to a data frame
    insta_2 <- data.frame(word = names(insta_2), freq = insta_2)
    # The frequency table of words
    head(insta_2, 10)

                               word freq
    instagramlogo     instagramlogo 2952
    instagram             instagram 1772
    new                         new 1417
    logo                       logo  753
    now                         now  325
    instagramupdate instagramupdate  323
    get                         get  281
    like                       like  256
    design                   design  255
    this                       this  242

Step 6 : Plot word frequencies

The frequencies of the 10 most frequent words are plotted :

    barplot(insta_2[1:10, ]$freq, las = 2, names.arg = insta_2[1:10, ]$word,
            col = "yellow", main = "Most frequent words",
            ylab = "Word frequencies")

Step 7 : Generate the word cloud

    set.seed(1234)
    wordcloud(insta_text_corpus, min.freq = 1, max.words = 80, scale = c(2.2, 1),
              colors = brewer.pal(8, "Dark2"), random.color = TRUE, random.order = FALSE)

Because random.order is FALSE, the words are plotted in decreasing order of frequency rather than in random order, and the size of each word reflects how often it occurs in the text. With random.color set to TRUE, each word's color is picked at random from the palette. The diagram helps us identify at a glance the most frequently used words in the text.

As the saying goes, "One picture is worth a thousand words." Seeing the word cloud, I second that thought.

Summary

A word cloud gives us a lot of information about the data. We still need to dig deeper to fully understand how all the words are related to #instagramlogo. In the next article we will try to understand the sentiments of the users in the extracted tweets by performing sentiment analysis.

Happy learning :)
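
If you want to see what Steps 4 and 5 compute without Twitter access or the tm package, the following is a minimal base R sketch of the same idea: clean some text, then count the words. It is an illustration, not the pipeline used above. The sample tweets, the short stop word list and the helper name clean_tweet are all made up for this example, and the sketch removes links before punctuation (while they are still intact) instead of following the article's order.

```r
# Hypothetical sample tweets, invented for illustration (not from the real data set)
tweets <- c(
  "RT @someone: The new Instagram logo is bold! https://t.co/abc123",
  "Not sure I like the new logo... #instagramlogo",
  "New logo, new Instagram. I like it! #instagramlogo"
)

# Hand-picked stop words for this sketch only;
# tm's stopwords("english") list is much longer
my_stopwords <- c("rt", "the", "is", "i", "it", "not", "sure")

# Hypothetical helper: a base R stand-in for the tm_map() cleaning calls
clean_tweet <- function(x) {
  x <- gsub("http[^[:space:]]*", " ", x)  # drop links while they are still intact
  x <- gsub("@[[:alnum:]_]+", " ", x)     # drop usernames
  x <- gsub("[[:punct:]]", "", x)         # drop remaining punctuation
  tolower(x)                              # fold case so 'New' and 'new' match
}

# Split the cleaned text into words, then drop empty strings and stop words
words <- unlist(strsplit(clean_tweet(tweets), "[[:space:]]+"))
words <- words[nzchar(words) & !(words %in% my_stopwords)]

# Count each word and sort by decreasing frequency,
# mirroring sort(rowSums(...), decreasing = TRUE) in Step 5
freq <- sort(table(words), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = as.integer(freq),
                        stringsAsFactors = FALSE)
head(word_freq, 5)
```

On these three sample tweets, "new" comes out on top with a count of 4, followed by "logo" with 3. A data frame shaped like word_freq has the same word and freq columns as insta_2, so it could be plotted with the barplot() call from Step 6.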
