What happened in the world last month: world news analysis

Breakdown by country and language:

```r
war <- test %>%
  filter(str_detect(text, " war |warships|wars|войн| guerra | la guerre |战争")) %>%
  filter(name != "Austria") %>%
  filter(name != "Germany") %>%
  group_by(name) %>%
  count(name)
View(war)

# 'war' means something different in German, so we process those countries separately
Austria <- test %>% filter(name == "Austria")
Austria <- Austria %>% filter(str_detect(text, " Krieg | Krieges")) %>% count(name)
Germany <- test %>% filter(name == "Germany")
Germany <- Germany %>% filter(str_detect(text, " Krieg | Krieges")) %>% count(name)

war_full <- rbind.data.frame(war, Austria, Germany)
war_full <- as.data.frame(war_full)

# breakdown by country
war_full %>%
  mutate(name = fct_reorder(name, n)) %>%
  ggplot(aes(name, n, fill = name)) +
  geom_col() +
  theme(legend.position = "none") +
  labs(title = "News frequency with the word 'war'", subtitle = "Breakdown by Country", x = "", y = "")
```

Which country talks about the "trade war" the most?

```r
test %>%
  filter(str_detect(text, " trade war | trade wars | торговая война | guerre commerciale | guerra comercial | Handelskrieg |贸易战")) %>%
  count(name) %>%
  arrange(desc(n)) %>%
  mutate(name = fct_reorder(name, n)) %>%
  ggplot(aes(name, n, fill = name)) +
  geom_col() +
  coord_flip() +
  expand_limits(y = 0) +
  theme(legend.position = "none") +
  labs(title = "News frequency with the phrase 'trade war'", subtitle = "Breakdown by Country", x = "", y = "")
```

position = "none") + labs( title = "News frequency with a 'trade war' word", subtitle = "Breakdown by Country", x = "", y = "" )But do we have news with a “peace” word?.Is it real to find a peace in our world?test %>% filter(str_detect(text, "peace | frieden | pace | paz |和平")) %>% count(name) %>% mutate(name = fct_reorder(name, n)) %>% ggplot(aes(name, n, fill = name)) + geom_col() + theme(legend.

position = "none") + labs( title = "News frequency with a 'peace' word", subtitle = "Breakdown by Country", x = "", y = "" )And what about “fake” and “true” news?.Does news agencies use that words or is it an anachronism?fake <- test %>% filter(str_detect(text, " fake | falso | faux | falschung|假")) %>% count(name)fake$id <- "fake"true <- test %>% filter(str_detect(text, " true | wahr | vrai | vero | cierto|真正")) %>% count(name)true$id <- "true"faketrue <- rbind(fake, true)faketrue %>% mutate(name = fct_reorder(name, n)) %>% ggplot() + geom_bar(aes(name, n, fill = id), stat = "identity", position = "dodge") + coord_flip() + theme( legend.

position = "bottom", legend.

title = element_blank(), axis.

title = element_blank() ) + labs( title = "Frequency of 'fake' and 'true' words in news", subtitle = "Breakdown by Country", x = "", y = "" )“Fake” news are popular in Ireland and China, while New Zealand has the most “true” word frequency in their news along with Canada and UK.

Let's see which countries have the most "shocking" news:

```r
test %>%
  filter(str_detect(text, " shock | choque | choc | schock|休克|震动|浓密的")) %>%
  count(name) %>%
  mutate(name = fct_reorder(name, n)) %>%
  ggplot(aes(name, n, fill = name)) +
  geom_col() +
  theme(legend.position = "none", axis.title = element_blank()) +
  labs(title = "Which countries have the most 'shocking' news", subtitle = "Breakdown by Country", x = "", y = "")
```

Take a closer look: this is the news that shocked our world, from presidential elections to Doctor Who fans.

Which countries have the most "bad" and "good" words in their news?

```r
bad <- test %>% filter(str_detect(text, " bad |坏")) %>% count(name)
bad$id <- "bad"
good <- test %>%
  filter(str_detect(text, " good |好")) %>%
  filter(name != "Austria") %>%
  filter(name != "Germany") %>%
  count(name)
good$id <- "good"
badgood <- rbind(bad, good)

badgood %>%
  mutate(name = fct_reorder(name, n)) %>%
  ggplot() +
  geom_bar(aes(name, n, fill = id), stat = "identity", position = "dodge") +
  coord_flip() +
  theme(legend.position = "bottom", legend.title = element_blank(), axis.title = element_blank()) +
  labs(title = "Which countries have the most 'bad' and 'good' news?", subtitle = "Breakdown by Country", x = "", y = "")
```

Well, as we can clearly see, "good" news prevails.

But is it really true? We will check this later with a sentiment analysis of the whole dataset.

Coming up next, we'll plot the frequency of the word "death" across countries.

test %>% filter(str_detect(text, " death | Tod | muerte | mort| смерть|死亡")) %>% count(name) %>% arrange(desc(n)) %>% mutate(name = fct_reorder(name, n)) %>% ggplot(aes(name, n, fill = name)) + geom_col() + coord_flip() + expand_limits(y = 0) + theme(legend.

position = "none") + labs( title = "News frequency with a 'death' word", subtitle = "Breakdown by Country", x = "", y = "" )Italy has the leading position with a biggest numbers.

We should take a closer look on it applying topic modelling technique(LDA algorithm).

Checking our topics with Google Translator we’d get some understanding about what’s going on.
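The LDA step itself isn't shown in the post; a minimal sketch with tidytext and topicmodels, assuming the same test data frame and an arbitrary choice of k = 4 topics, might look like this:

```r
library(tidytext)
library(topicmodels)

# Hypothetical sketch: LDA on the Italian articles (stop-word removal omitted for brevity)
italy_dtm <- test %>%
  filter(name == "Italy") %>%
  mutate(doc = row_number()) %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dtm(doc, word, n)

italy_lda <- LDA(italy_dtm, k = 4, control = list(seed = 1234))

# Top words per topic, ready to be run through Google Translate
tidy(italy_lda, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  arrange(topic, desc(beta))
```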

I hope somebody is still reading this :) Let's now take a look at domestic news, which always covers government figures at some point.

I think it will be interesting to compare the popularity of a few government leaders across the whole world.

For example, we may choose:

```r
Trump <- test %>% filter(str_detect(text, "Trump|特朗普| Трамп ")) %>% count(name)
Trump$id <- "Trump"
Putin <- test %>% filter(str_detect(text, "Putin |普京| Путин ")) %>% count(name)
Putin$id <- "Putin"
Merkel <- test %>% filter(str_detect(text, "Merkel |默克爾| Меркель ")) %>% count(name)
Merkel$id <- "Merkel"
Jinping <- test %>% filter(str_detect(text, " Xi Jinping |习近平| Си Цзиньпин")) %>% count(name)
Jinping$id <- "Xi Jinping"

popularity <- rbind(Trump, Putin, Merkel, Jinping)
pop <- popularity %>% count(id) %>% arrange(desc(nn))
View(pop)

popularity %>%
  mutate(name = fct_reorder(name, n)) %>%
  ggplot() +
  geom_bar(aes(name, n, fill = id), stat = "identity", position = "dodge") +
  coord_flip() +
  theme(legend.position = "bottom", legend.title = element_blank(), axis.title = element_blank()) +
  labs(title = "Citations of selected national leaders in the news", subtitle = "Breakdown by Country", x = "", y = "")
```

President Trump holds the leading position, being covered in 24 countries in total, but that is obvious to anybody who checks news sites often.

Chancellor of Germany Angela Merkel takes second place, followed by President of the People's Republic of China Xi Jinping and President Putin.

Surprisingly, there are some countries that cover only the US president and nobody else from our selection.

China and the United States have the biggest numbers covering their own national leaders, while Germany and Russia give more coverage to foreign presidents.

Maybe citations of countries show some differences?

```r
US <- test %>% filter(str_detect(text, " United States | US | USA | Stati Uniti | Etats-Unis| США | 美国")) %>% count(name)
US$id <- "United States"
Germany <- test %>% filter(str_detect(text, " Germany | Deutschland | Alemania | Germania | Allemagne | Германия |德国")) %>% count(name)
Germany$id <- "Germany"
China <- test %>% filter(str_detect(text, " China | Chine | Cina |Китай| 中国")) %>% count(name)
China$id <- "China"
Russia <- test %>% filter(str_detect(text, " Russia | Russland| Rusia | Russie |Россия|俄罗斯")) %>% count(name)
Russia$id <- "Russia"

popularity <- rbind(US, Germany, China, Russia)
pop <- popularity %>% count(id) %>% arrange(desc(nn))
View(pop)

popularity %>%
  mutate(name = fct_reorder(name, n)) %>%
  ggplot() +
  geom_bar(aes(name, n, fill = id), stat = "identity", position = "dodge") +
  coord_flip() +
  theme(legend.position = "bottom", legend.title = element_blank(), axis.title = element_blank()) +
  labs(title = "Citations of selected countries in the news", subtitle = "Breakdown by Country", x = "", y = "")
```

So this is what the balance of power looks like (in global news coverage, of course).

The United States, China, and Russia have roughly similar coverage levels around the globe! China gets the most coverage in Pakistan and Singapore (geography explains that well).

According to our findings, Italy shows the lowest interest in the United States, while Russia has the highest number of citations of the United States.

What if we apply the same approach to measure brand awareness? We'll pick Apple and Samsung as examples of core players in the market.

```r
Apple <- test %>% filter(str_detect(text, " Apple ")) %>% count(name)
Apple$id <- "Apple"
Samsung <- test %>% filter(str_detect(text, " Samsung ")) %>% count(name)
Samsung$id <- "Samsung"
popularity <- rbind(Samsung, Apple)

popularity %>%
  mutate(name = fct_reorder(name, n)) %>%
  ggplot() +
  geom_bar(aes(name, n, fill = id), stat = "identity", position = "dodge") +
  coord_flip() +
  theme(legend.position = "bottom", legend.title = element_blank(), axis.title = element_blank()) +
  labs(title = "Citations of selected brands in the news", subtitle = "Breakdown by Country", x = "", y = "")

pop <- popularity %>% count(id) %>% arrange(desc(nn))
View(pop)
```

Both brands have the same coverage by country in our dataset.

But what about the chart? Pakistan and Turkey covered Samsung more than Apple, while the other countries prefer to look at Apple's products.

What if we would like to see the top 5 most common words that appear in the news of every country? I think this plot will show some value.
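The post doesn't include the code that builds top10full; a rough sketch with tidytext, assuming English stop words only and the same name and text columns, could be:

```r
library(tidytext)

# Hypothetical sketch: top 5 most frequent words per country
top10full <- test %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%   # English stop words; other languages need their own lists
  count(name, word, sort = TRUE) %>%
  rename(freq = n) %>%
  group_by(name) %>%
  top_n(5, freq) %>%
  ungroup()
```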

```r
# The dataset was preprocessed beforehand, and languages not supported by the tm package were excluded
top10full %>%
  group_by(name) %>%
  filter(name != "Russia") %>%
  filter(name != "Norway") %>%
  filter(name != "China") %>%
  filter(name != "Turkey") %>%
  ggplot(aes(word, freq, fill = name)) +
  geom_col() +
  facet_wrap(~name, scales = "free_x") +
  theme(legend.position = "none") +
  labs(title = "Top 5 words used in every news article", subtitle = "Breakdown by country", x = "", y = "")
```

From that plot, we can say that there are two types of countries.

One type mostly covers its own government's activity, owing to a strong presence in the international or domestic arena.

The second type is more interested in other subjects of life, or its government simply doesn't provide such information to the public.

I think it's time to apply sentiment analysis to our dataset.

First, we'll look at a time series built with several different methods:

The default "Syuzhet" lexicon was developed in the Nebraska Literary Lab under the direction of Matthew L. Jockers.

The "afinn" lexicon was developed by Finn Arup Nielsen as the AFINN WORD DATABASE; see http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010. The AFINN database of words is copyright protected and distributed under the Open Database License (ODbL) v1.0, http://www.opendatacommons.org/licenses/odbl/1.0/, or a similar copyleft license.

The "bing" lexicon was developed by Minqing Hu and Bing Liu as the OPINION LEXICON; see http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html.

The "nrc" lexicon was developed by Saif M. Mohammad and Peter D. Turney as the NRC EMOTION LEXICON; see http://saifmohammad.com/WebPages/lexicons.html.
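The melted data frame plotted below isn't constructed in the post; one plausible way to build it, assuming the dataset has text and timestamp columns and using the syuzhet and reshape2 packages, is sketched here:

```r
library(syuzhet)
library(reshape2)

# Hypothetical sketch: average daily sentiment under each lexicon, reshaped to long format
daily <- test %>%
  mutate(syuzhet = get_sentiment(text, method = "syuzhet"),
         afinn   = get_sentiment(text, method = "afinn"),
         bing    = get_sentiment(text, method = "bing"),
         nrc     = get_sentiment(text, method = "nrc")) %>%
  group_by(timestamp = as.Date(timestamp)) %>%
  summarise(across(c(syuzhet, afinn, bing, nrc), mean))

melted <- melt(daily, id.vars = "timestamp")
```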

```r
melted %>%
  group_by(variable) %>%
  ggplot(aes(timestamp, value, color = variable, label = variable, size = value)) +
  geom_point() +
  labs(title = "Scores by different sentiment lexicons", x = "", y = "") +
  geom_smooth(aes(group = 1))
```

The result varies from day to day, but the median is below zero, so we may assume that our global news is not that positive.

But let's take a look at the sentiment breakdown by emotion.
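Again, emo_sum isn't built in the post; a plausible sketch using syuzhet's NRC emotion scoring (dropping the positive and negative columns is an assumption) is:

```r
library(syuzhet)

# Hypothetical sketch: total counts per NRC emotion across all articles
nrc_scores <- get_nrc_sentiment(test$text)

emo_sum <- data.frame(emotion = names(nrc_scores),
                      count = colSums(nrc_scores)) %>%
  filter(!emotion %in% c("positive", "negative")) %>%  # keep only the eight emotions
  arrange(desc(count))
```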

```r
emo_sum %>%
  ggplot(aes(emotion, count, fill = emotion)) +
  geom_col() +
  theme(legend.position = "none", axis.title = element_blank()) +
  labs(title = "Recent sentiments in the global news dataset", x = "", y = "")
```

The next thing I wanted to do is another round of topic modelling.

For that purpose we'll follow an amazing tutorial by Julia Silge, in which she describes every step.
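The model-fitting step isn't reproduced in the post. Condensed from that tutorial, the k_result object used below could be obtained roughly as follows; the news_sparse document-term matrix and the grid of K values are assumptions:

```r
library(stm)
library(furrr)
plan(multisession)  # fit the models in parallel

# Hypothetical sketch, condensed from Julia Silge's tutorial on evaluating stm models
many_models <- tibble(K = c(20, 40, 60, 80, 100)) %>%
  mutate(topic_model = future_map(K, ~ stm(news_sparse, K = ., verbose = FALSE)))

heldout <- make.heldout(news_sparse)

k_result <- many_models %>%
  mutate(semantic_coherence = map(topic_model, semanticCoherence, news_sparse),
         eval_heldout = map(topic_model, eval.heldout, heldout$missing),
         residual = map(topic_model, checkResiduals, news_sparse),
         bound = map_dbl(topic_model, function(x) max(x$convergence$bound)),
         lfact = map_dbl(topic_model, function(x) lfactorial(x$settings$dim$K)),
         lbound = bound + lfact)
```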

We'll plot the model diagnostics and the result:

```r
# Plot the results after preprocessing the data and fitting the models (assuming at most 100 topics)
k_result %>%
  transmute(K,
            `Lower bound` = lbound,
            Residuals = map_dbl(residual, "dispersion"),
            `Semantic coherence` = map_dbl(semantic_coherence, mean),
            `Held-out likelihood` = map_dbl(eval_heldout, "expected.heldout")) %>%
  gather(Metric, Value, -K) %>%
  ggplot(aes(K, Value, color = Metric)) +
  geom_line(size = 1.5, alpha = 0.7, show.legend = FALSE) +
  facet_wrap(~Metric, scales = "free_y") +
  labs(x = "K (number of topics)", y = NULL,
       title = "Model diagnostics by number of topics",
       subtitle = "These diagnostics indicate that a good number of topics would be around 80")
```

As we can see above, the held-out likelihood reaches its maximum, and the residuals their minimum, at around 80 topics.
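How gamma_terms is assembled isn't shown either; following the same tutorial, it could be derived from the chosen model roughly like this (the K = 80 selection and the document_names argument are assumptions):

```r
library(tidytext)

# Hypothetical sketch: summarise topic prevalence together with each topic's top words
topic_model <- k_result %>% filter(K == 80) %>% pull(topic_model) %>% .[[1]]

td_beta  <- tidy(topic_model)
td_gamma <- tidy(topic_model, matrix = "gamma", document_names = rownames(news_sparse))

top_terms <- td_beta %>%
  group_by(topic) %>%
  top_n(7, beta) %>%
  arrange(topic, -beta) %>%
  summarise(terms = paste(term, collapse = ", "))

gamma_terms <- td_gamma %>%
  group_by(topic) %>%
  summarise(gamma = mean(gamma)) %>%
  arrange(desc(gamma)) %>%
  left_join(top_terms, by = "topic") %>%
  mutate(topic = paste0("Topic ", topic),
         topic = reorder(topic, gamma))
```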

And the result is:

```r
gamma_terms %>%
  top_n(20, gamma) %>%
  ggplot(aes(topic, gamma, label = terms, fill = topic)) +
  geom_col(show.legend = FALSE) +
  geom_text(hjust = 0, nudge_y = 0.0005, size = 3, family = "IBMPlexSans") +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 0.09), labels = percent_format()) +
  theme_tufte(base_family = "IBMPlexSans", ticks = FALSE) +
  theme(plot.title = element_text(size = 16, family = "IBMPlexSans-Bold"),
        plot.subtitle = element_text(size = 13)) +
  labs(x = NULL, y = expression(gamma),
       title = "Top 20 topics by prevalence in our dataset",
       subtitle = "With the top words that contribute to each topic")
```

The last thing I wanted to show is a word cloud.

It's like a really big dot at the end :) Simplicity and complexity in a single picture.

So what is this news all about?

```r
wordcloud(data,
          colors = viridis::viridis_pal(end = 0.8)(10),
          random.order = FALSE,
          random.color = TRUE,
          min.freq = 10,
          max.words = Inf,
          rot.per = 0.3)
```

This type of dataset has great potential for discovering anything you want to know.

What we did here is just a small part of the possible analysis, and I think it is a really helpful tool for getting insights.

With continuous updating, we can keep a finger on the pulse of various topics and their presence.

We can identify brand awareness, human actions, foreign affairs, and many other things that happen every day in our world.

