London Design Festival 2018 (Part 2): Natural Language Processing

Part 2: Natural language processing of 11,000 tweets

Introduction

In part 1 of this series, I presented an exploratory data analysis of 11,000 tweets about the London Design Festival 2018, a nine-day design festival that ran from Saturday 15 to Sunday 23 September 2018.

London Design Festival 2018 (LDF18) had a very active events programme spanning 11 different ‘Design Districts’, five ‘Design Destinations’, and three ‘Design Routes’ across London. It’s another fantastic example of London’s flexibility as a built environment, acting as a canvas for creative ideas.

The aim of this article is to present the findings of my natural language processing (NLP) analysis of those 11,000 tweets, to understand the sentiment and what people thought about LDF18.

Please scroll down to view my analysis via interactive data visualizations!

Image by Ben Terrett on Flickr

Data and Methods

The official hashtag for the event was #LDF18. After collecting 11,000 tweets containing this hashtag via the Twitter API at the time of the event, I first preprocessed and cleaned the textual data in a Python notebook.

I then used the langdetect library (a Python port of Google’s language-detection library) to filter out non-English tweets, and dropped all retweets from the NLP analysis so that there was no doubling-up. After these steps, I was left with 5,700 unique tweets. Next, I used the Google Cloud Natural Language API to get a sentiment score for each tweet.

Finally, I used the gensim library’s Word2Vec model to obtain word-embedding vectors for every word in the corpus of tweets, so that each word could be compared with the word “LDF18”.
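The cleaning steps described above can be sketched as follows. This is a minimal, hypothetical version of the notebook pipeline: the sample tweets are made up, and the language filter is applied only when langdetect is available.

```python
# Sketch of the tweet-cleaning step: drop retweets, deduplicate, and
# (optionally) keep only English tweets. The texts and helper names here
# are hypothetical; the real pipeline ran in a notebook against the full
# Twitter API export.
try:
    from langdetect import detect, DetectorFactory
    DetectorFactory.seed = 0          # langdetect is non-deterministic by default
except ImportError:                   # fall back gracefully if not installed
    detect = None

def clean_tweets(tweets):
    """Return unique, non-retweet (and, if possible, English-only) tweets."""
    seen, cleaned = set(), []
    for text in tweets:
        if text.startswith("RT @"):   # drop retweets so there is no doubling-up
            continue
        if text in seen:              # drop exact duplicates
            continue
        if detect is not None and detect(text) != "en":
            continue                  # drop non-English tweets
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Running `clean_tweets` over the raw collection is what reduces the 11,000 tweets to the 5,700 unique English tweets used in the rest of the analysis.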
Word2Vec is used to compute the similarity between words in a large corpus of text — Kavita Ganesan’s article is a great explanation. Once I had a vector for each word, I used the scikit-learn library to perform principal component analysis (PCA) for dimensionality reduction and to plot the words most similar (the nearest neighbours) to “LDF18”.

You can check my Kaggle kernel here for all the analysis behind this post.

Analyzing the tweets

In this section, I present the findings of my natural language processing (NLP) analysis. Below, I report on the following three metrics:

1. Sentiment analysis of the tweets per day;
2. Word frequency and hashtag frequency analysis;
3. The output of the Word2Vec model: principal component analysis (PCA) and nearest-neighbour analysis.

Alphabet by Kellenberger White. Photography by @leemawdsley — Taken from Flickr

Sentiment analysis

The sentiment of each tweet was calculated using Google’s Cloud NLP API. The chart below shows the average sentiment of tweets per day, where -1 is very negative sentiment and +1 is very positive sentiment.

We see that LDF18 started out with a relatively low average sentiment that gradually increased as the week went on. There was a peak of 0.53 on Tuesday 18 September, which was the day after a dinner for medal winners — people were clearly happy about this!

Figure 1: Line chart showing the average sentiment of the tweets per day

The table below shows the top five (green) and bottom five (red) tweets by sentiment.
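The per-day averages behind Figure 1 boil down to a simple group-by over (day, score) pairs. A minimal sketch, using made-up scores in place of the values returned by the Cloud NLP API (each in the range -1 to +1):

```python
# Average the per-tweet sentiment scores by day. The dates and scores
# below are illustrative stand-ins; the real analysis grouped ~5,700
# API-scored tweets by the day they were posted.
from collections import defaultdict

def average_sentiment_by_day(scored_tweets):
    """scored_tweets: iterable of (date_string, sentiment_score) pairs."""
    by_day = defaultdict(list)
    for day, score in scored_tweets:
        by_day[day].append(score)
    return {day: sum(scores) / len(scores) for day, scores in by_day.items()}

scored = [
    ("2018-09-17", 0.10), ("2018-09-17", 0.30),
    ("2018-09-18", 0.46), ("2018-09-18", 0.60),
]
daily = average_sentiment_by_day(scored)
```

Plotting these daily means over the festival week gives the shape seen in Figure 1.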
You can clearly see that the Cloud NLP API detected positive language in the tweets on the left and, similarly, negative language in the tweets on the right.

Many of the positive tweets were about the medal winners of the design prizes, and also about the dinner parties that took place that week… the canapés were excellent, I’ll have you know!

A few of the installations presented at LDF18 were about plastic waste and climate change, and unfortunately the NLP API classified tweets mentioning those installations as “negative”; this highlights some of the limitations of sentiment analysis APIs.

Table 1: Table showing the top five (left) and bottom five (right) tweets by sentiment score

Text frequency analysis

The bar graphs below show the number of times each word (left) and each hashtag (right) appears across all the tweets. Predictably, “ldf18” appears the most.

However, these results are not very useful for telling us what people thought about the event, because the hashtags also appear in the word counts and swamp them. In future analysis, I will remove the hashtag words from the text frequency analysis.

Figure 2: Bar graphs showing the count of words and hashtags appearing in all the tweets

Nearest neighbours

Word2Vec is a neural-network language model. It takes a large corpus of text — in this case, the text of the 11,000 tweets — as input and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus assigned a corresponding vector in that space: a word embedding. Principal component analysis is then used to reduce the dimensionality of the Word2Vec space down to x and y coordinates.

Importantly, Word2Vec captures the similarities and relationships between the words in the 11,000 tweets. Specifically, words that are closer together in the space are more similar.
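Both ideas — ranking neighbours by cosine similarity and projecting the vectors to 2-d — can be sketched in a self-contained way. The post used gensim’s trained model and scikit-learn’s PCA; here, for illustration only, the “embeddings” are tiny made-up 4-d vectors and the PCA is done directly with NumPy’s SVD:

```python
# Toy nearest-neighbour + PCA sketch. The 4-d vectors are invented for
# illustration; the real model produced ~100-d vectors learned from the
# tweets, ranked with gensim and projected with scikit-learn's PCA.
import numpy as np

embeddings = {
    "ldf18":        np.array([0.9, 0.8, 0.1, 0.0]),
    "architecture": np.array([0.8, 0.9, 0.2, 0.1]),
    "beautiful":    np.array([0.7, 0.7, 0.3, 0.0]),
    "raining":      np.array([0.0, 0.1, 0.9, 0.8]),
}

def nearest_neighbours(query, embeddings, k=2):
    """Rank words by cosine similarity to the query word."""
    q = embeddings[query]
    sims = {}
    for word, v in embeddings.items():
        if word == query:
            continue
        sims[word] = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return sorted(sims, key=sims.get, reverse=True)[:k]

def pca_2d(vectors):
    """Project row-vectors onto their first two principal components."""
    X = vectors - vectors.mean(axis=0)        # centre the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                       # (n_words, 2) plot coordinates

print(nearest_neighbours("ldf18", embeddings))  # ['architecture', 'beautiful']
coords = pca_2d(np.stack(list(embeddings.values())))
```

The 2-d `coords` are what gets scattered in a plot like Figure 3, with the cosine ranking deciding which labels count as nearest neighbours.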
“Nearest neighbours” are the handful of words from the Word2Vec model that are most similar to “LDF18”, based on cosine similarity scores.

Figure 3: PCA output of the nearest neighbours of “LDF18” from the Word2Vec model

The scatter plot shows the nearest neighbours of “LDF18”. We see that nouns such as “architecture”, “craft”, “textiles” and “designs” are closely related.

But importantly, the words “beautiful”, “innovative” and “inspiration” are also closely related. A very positive outcome! It suggests that these words best represent how people felt when they tweeted about LDF18.

Waugh Thistleton Architects: MultiPly — Taken from Flickr — All rights reserved by the London Design Festival

Conclusion

So there you have it! I’ve presented the findings of my NLP analysis of 11,000 tweets about the London Design Festival 2018. The output of the Word2Vec model shows that people thought positively of the event.

If you have any thoughts or suggestions, please leave a comment below or on my Kaggle kernel — an upvote on Kaggle would be really appreciated :)

There are so many NLP libraries that I will likely revisit this analysis in the future using GloVe, TensorFlow or BERT.

Next time…

In my next article (part 3), I’m going to present the findings of my computer vision analysis. Expect to see which artworks appeared the most. Stay tuned.

Thanks for reading!

Vishal

Before you leave…

If you found this article helpful or interesting, please share it on Twitter, Facebook or LinkedIn so that everyone can benefit from it too.

Vishal is a Research Student at The Bartlett at UCL in London. He is interested in the economic and social impact of culture in cities. You can get in touch with him on Twitter or LinkedIn. See more of Vishal’s work on Instagram or on his website.