Exploration and Visualization on each Presidential Candidate Supporter’s Tweets in Indonesia

Remember that the results of your analysis are not only for you; other people will see them too.

For the sake of UX (User Experience), we need to present the data visually so that readers are comfortable reading the results.

I do not think many people would read this article if I presented the results as cropped screenshots of output printed in a CLI (Command Line Interface).

In this article, we will do EDA (Exploratory Data Analysis) in Jupyter Notebook, since it has a beautiful UX and can also be used to present our data.

Technology

We will use:

- Python 3.6
- NumPy: a Python package for scientific computing
- Pandas: a Python package for processing tabular data
- Twint: a Python package for scraping posts (tweets) from Twitter
- Plotly: a Python package for visualizing data interactively. It has a beautiful UX for visualization, and we can even manipulate the resulting plot (e.g., hide several labels in the plot)
- Jupyter Notebook: an awesome Python IDE that can also be used to make presentations or educational code

Step One: Research Question

Okay, before we formulate the questions, let me tell you that in 2019 there will be a presidential election in Indonesia.

There are two candidate pairs: Joko Widodo (Jokowi) and Maaruf Amin (No. 01), and Prabowo Subianto and Sandiaga Uno (Sandi) (No. 02).

What we will focus on in this article is how each candidate's supporters behave on social media, specifically on Twitter.

By researching and googling, I've found some hashtags used by Jokowi's and Prabowo's supporters to declare their support.

For simplicity, I've narrowed the hashtags down to:

- Jokowi and Maaruf: #Jokowi2Periode OR #JokowiLagi OR #01IndonesiaMaju OR #2019JokowiLagi OR #2019TetapJokowi OR #TetapJokowi
- Prabowo and Sandi: #2019GantiPresiden OR #2019PrabowoSandi OR #2019PrabowoPresiden

We will only use tweets posted in 2018.
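The hashtag groups above map naturally onto OR-joined search strings. The following is a minimal sketch of how they could be built and handed to Twint, not the code actually used for this article. The helper names build_query and scrape_2018 and the output file name are hypothetical, and the Twint Config attributes (Search, Since, Until, Store_csv, Output) are the ones I recall from Twint's documentation, so treat them as assumptions to verify.

```python
JOKOWI_HASHTAGS = [
    "#Jokowi2Periode", "#JokowiLagi", "#01IndonesiaMaju",
    "#2019JokowiLagi", "#2019TetapJokowi", "#TetapJokowi",
]
PRABOWO_HASHTAGS = [
    "#2019GantiPresiden", "#2019PrabowoSandi", "#2019PrabowoPresiden",
]


def build_query(hashtags):
    """Join hashtags with OR so that a tweet matching any of them is returned."""
    return " OR ".join(hashtags)


def scrape_2018(query, output_path):
    """Hypothetical helper: scrape tweets from 2018 that match `query` with Twint."""
    import twint  # imported lazily so the query builder works without Twint installed

    config = twint.Config()
    config.Search = query
    config.Since = "2018-01-01"
    # Assumption: the end date is exclusive, so 2019-01-01 keeps 31 December 2018.
    config.Until = "2019-01-01"
    config.Store_csv = True
    config.Output = output_path
    twint.run.Search(config)


jokowi_query = build_query(JOKOWI_HASHTAGS)
prabowo_query = build_query(PRABOWO_HASHTAGS)
print(jokowi_query)
print(prabowo_query)

# Example call (needs Twint and network access):
# scrape_2018(jokowi_query, "tweet_jokowi.csv")
```

Keeping the query builder separate from the scraping call makes it easy to check the search strings before starting a long scraping run.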

That's it. Let's create the research questions.

Here are my research questions:

1. What is the frequency of each supporter group's tweets in the dataset?
2. What is the frequency of each supporter group's tweets per month?
3. Looking at the month with the most tweets, what are the word frequencies in that month across both supporter groups' tweets?
4. Looking at the month with the most tweets, what are the word frequencies in that month in Jokowi supporters' tweets?
5. What are the top-30 word frequencies in that month in Prabowo supporters' tweets?
6. What are the top-30 tokens that accompany the word 'prabowo' in Jokowi supporters' tweets in that month?
7. What are the top-30 tokens that accompany the word 'jokowi' in Prabowo supporters' tweets in that month?
8. What are the top-30 tokens that accompany the word 'prabowo' in Prabowo supporters' tweets in that month?
9. What are the top-30 tokens that accompany the word 'jokowi' in Jokowi supporters' tweets in that month?
10. What are the top-30 hashtags in Prabowo and Jokowi supporters' tweets?
11. What is the mean length in characters of Jokowi and Prabowo supporters' tweets?
12. What is the mean length in words of Jokowi and Prabowo supporters' tweets?

We're done! Actually, there are many more questions that we could ask.

To keep this article from becoming a one-hour read, we will limit ourselves to these twelve questions. (In Step Four, the hashtag question is answered separately for each supporter group, so the numbering there runs to 13.)

Step Two: Collect the Data

For this step, we will use Twint as our library.

For this article, we will scope our data as follows:

- Only tweets that contain the hashtags above are scraped
- We only take tweets that were posted in 2018

I scraped 706,208 tweets in this step.

Step Three: Preprocess Data

Okay, since this article focuses on EDA, we will keep this step as short as possible.

First, we will read the CSV with pandas:

    import pandas as pd

    tweet = pd.read_csv('tweet.csv', encoding='utf-8')

To keep it simple, we will only formalize the words and remove the stopwords.

We will formalize in two ways: first with regex rules, and then by substituting known slang words with their formal equivalents.

The latter needs a dictionary that maps each slang word to its formal word.

This is how I formalized the words:

There is probably a better way to do the cleaning in the formalize_rule function, such as using the nltk TweetTokenizer.

Well, I wanted to try regex, and that's it.

I also implemented the stopword removal in formalize_word.
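To make the two stages concrete, here is a minimal sketch of what a regex stage and a dictionary stage could look like. It is an illustration under my own assumptions, not the original implementation: the regex rules, the tiny slang dictionary and the stopword list are all made up, and only the function names formalize_rule and formalize_word come from the text.

```python
import re

# Hypothetical slang dictionary and stopword list; real ones would be much larger.
SLANG_TO_FORMAL = {"gak": "tidak", "yg": "yang", "dgn": "dengan"}
STOPWORDS = {"yang", "dan", "di", "dengan"}


def formalize_rule(text):
    """Regex stage: lowercase, drop links and mentions, keep only letters and digits."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation, '#', emoji
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


def formalize_word(text):
    """Dictionary stage: map slang to formal words, then drop stopwords."""
    words = [SLANG_TO_FORMAL.get(word, word) for word in text.split()]
    return " ".join(word for word in words if word not in STOPWORDS)


raw = "Gak ada yg lebih baik dgn #2019GantiPresiden https://t.co/abc @someone!!"
cleaned = formalize_word(formalize_rule(raw))
print(cleaned)  # tidak ada lebih baik 2019gantipresiden
```

Running the regex stage first means the dictionary lookup only ever sees lowercase tokens without punctuation, which keeps the slang dictionary small.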

We apply it to our DataFrame:

Note that, in practice, we will come back to this step after doing some EDA.

Sometimes we will find 'pests' in the data during the EDA step.

We are done, so let's move on to the last step!

Step Four: Exploratory Data Analysis

Since all of the research questions that we want to answer can be answered here, we will end the Data Science process at this step.

Without further ado, let's answer all of the questions!

Wait, before we do that, we should define some functions that will be used multiple times.

We have defined the DataFrame filters that we will use often later on.

We also create some functions that compute statistics from the DataFrame.

These statistics will be plotted with Plotly.
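The helper definitions themselves are not reproduced in this text, so here is a rough sketch of how they might look, inferred only from how they are called later on. The names filter_jokowi, filter_prabowo, array_filter, get_tweet_common_word and get_stat_with_func come from the article. Their bodies, the tiny dataset and the format of the hashtags column are my own assumptions.

```python
from collections import Counter

import pandas as pd

# Tiny made-up dataset; the column names follow the ones used in the article.
tweet = pd.DataFrame({
    "tweet3": ["jokowi kerja nyata", "ganti presiden", "jokowi lagi", "prabowo sandi"],
    "hashtags": [["#jokowilagi"], ["#2019gantipresiden"],
                 ["#jokowilagi"], ["#2019prabowosandi"]],
    "month": ["Sep", "Sep", "Aug", "Aug"],
})

JOKOWI_TAGS = {"#jokowilagi"}  # shortened lists, for illustration only
PRABOWO_TAGS = {"#2019gantipresiden", "#2019prabowosandi"}

# Boolean masks that select each supporter group's tweets.
filter_jokowi = tweet["hashtags"].apply(lambda tags: bool(JOKOWI_TAGS & set(tags)))
filter_prabowo = tweet["hashtags"].apply(lambda tags: bool(PRABOWO_TAGS & set(tags)))
array_filter = [filter_jokowi, filter_prabowo]


def get_tweet_common_word(df, text_column="tweet3", most_common=30):
    """Return the most frequent (word, count) pairs in a text column."""
    counter = Counter()
    for text in df[text_column]:
        counter.update(text.split())
    return counter.most_common(most_common)


def get_stat_with_func(df, func, label="month"):
    """Apply func to each group of `label` and to the whole frame (row 'all')."""
    results = {key: func(group) for key, group in df.groupby(label, sort=False)}
    results["all"] = func(df)
    rows = {
        key: pd.Series(value if isinstance(value, list) else [value])
        for key, value in results.items()
    }
    return pd.DataFrame(rows).T  # one row per group, one column per result


stat_word = get_stat_with_func(
    tweet, lambda x: get_tweet_common_word(x, most_common=3), "month")
print(stat_word.loc["all"].tolist())
```

With this shape, stat_word.loc['Sep'] is a Series of (word, count) tuples, which matches how the results are unpacked with x[0] and x[1] in the plotting code.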

0. How many instances are in the data?

    tweet.shape[0]  # 705013

1. What is the frequency of each supporter group's tweets in the dataset?

How to do it?

We take the shape attribute of each supporter group's tweets:

    freq_hashtag = []
    for filter_tweet in array_filter:
        freq_hashtag.append(tweet[filter_tweet].shape[0])

We need to do this to show the plot in Jupyter:

    from plotly.offline import init_notebook_mode, iplot
    import plotly.graph_objs as go

    init_notebook_mode(connected=True)

Let's set up the plot:

    label = ['Jokowi Supporter', 'Prabowo Supporter']
    data = [
        go.Pie(labels=label, values=freq_hashtag, textinfo='value')
    ]
    layout = go.Layout(title='Tweet Frequency of Each Supporters')
    fig = go.Figure(data=data, layout=layout)

There are many ways to visualize this data.

Since the two counts are directly comparable, we can use a pie chart to visualize the data.

There are two minimal components needed to plot with Plotly.

The first is the 'data'. The 'data' is a list of traces, each with the chart type that we want to visualize. We can combine several chart types here; for example, you can put a bar chart and a pie chart in the same visualization.

The second is the layout. The layout is the container of the visualization. This is where we can customize the title, legend, axes, and more.

Then we combine the container and the charts by putting them into go.Figure (a figure). The figure is then ready to be plotted.

We then display the pie chart figure:

    iplot(fig)

Plot it!

Analysis

Prabowo supporters' tweet frequency is higher than that of Jokowi supporters.

2. What is the frequency of each supporter group's tweets per month?

How to do it?

Since the data that we want to plot is sequential, we can plot it as a line chart.

First, we filter and loop over each month:

    freq_hashtag = []
    for filter_tweet in array_filter_month:
        for filter_tweet_prez in array_filter:
            # combine both boolean masks in one step
            freq_hashtag.append(tweet[filter_tweet & filter_tweet_prez].shape[0])

Reverse the lists (from Dec to Jan into Jan to Dec):

    j_freq = freq_hashtag[::2][::-1]
    p_freq = freq_hashtag[1::2][::-1]

Then plot it:

    label = ['Jokowi Supporter', 'Prabowo Supporter']
    data = [
        go.Scatter(x=month_unique, y=j_freq, name=label[0]),
        go.Scatter(x=month_unique, y=p_freq, name=label[1])
    ]
    layout = go.Layout(title='Tweet Frequency of Each Supporters / Month')
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

Plot it!

Analysis

Prabowo supporters usually tweet more often than Jokowi supporters.

Tweet counts peak in September.
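The slicing used for j_freq and p_freq is easy to misread, so here is a tiny worked example with made-up counts. It assumes the loop visits months from December backwards and appends the Jokowi count before the Prabowo count for each month.

```python
# Made-up counts in the order the nested loop would produce them:
# [Jokowi Dec, Prabowo Dec, Jokowi Nov, Prabowo Nov, Jokowi Oct, Prabowo Oct]
freq_hashtag = [10, 20, 11, 21, 12, 22]

# [::2] takes every second item starting at index 0 (the Jokowi counts),
# [1::2] takes every second item starting at index 1 (the Prabowo counts),
# and [::-1] reverses the result so the months run forwards in time.
j_freq = freq_hashtag[::2][::-1]
p_freq = freq_hashtag[1::2][::-1]

print(j_freq)  # [12, 11, 10]
print(p_freq)  # [22, 21, 20]
```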

3. Looking at the month with the most tweets, what are the top-30 word frequencies in that month across both supporter groups' tweets?

How to do it?

First, we set the month with the most tweets, which is September:

    i = 'Sep'

Then, we find the most frequent words by using the functions above:

    stat_word = get_stat_with_func(
        tweet,
        lambda x: get_tweet_common_word(x, text_column="tweet3", most_common=30),
        'month')

We will take only the tweets posted in September and plot them, limited to the 30 most frequent words.

Since the data is not sequential, a bar chart is the right choice here.

A word cloud is also a good way to visualize word frequency if we do not need to know the exact frequency of each word.

    plotting = stat_word.loc[i]
    freq = plotting.apply(lambda x: x[1]).values
    word = plotting.apply(lambda x: x[0]).values
    data = [
        go.Bar(x=word, y=freq),
    ]
    layout = go.Layout(title='Word Freq in {}'.format(i))
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

Plot it!

Analysis

'jokowi' has the highest frequency in September, at around 35k to 40k occurrences.

It is followed by Indonesia, Orang (Person), Presiden (President), Rakyat (Citizen), Dukung (Support), Allah (God), Prabowo, and others.

4. Looking at the month with the most tweets, what are the top-30 word frequencies in that month in Jokowi supporters' tweets?

How to do it?

It is similar to what we did for RQ (Research Question) 3.

The difference is that we need to filter for Jokowi supporters' tweets.

    tweet_2019_jokowi = tweet[filter_jokowi]
    stat_word = get_stat_with_func(
        tweet_2019_jokowi,
        lambda x: get_tweet_common_word(x, text_column="tweet3", most_common=30),
        'month')
    i = 'Sep'
    plotting = stat_word.loc[i]
    # print(plotting)
    freq = plotting.apply(lambda x: x[1]).values
    word = plotting.apply(lambda x: x[0]).values
    data = [
        go.Bar(x=word, y=freq),
    ]
    layout = go.Layout(title='Word Freq in {} for Jokowi Hashtag'.format(i))
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

Plot it!

Analysis

The word 'jokowi' is also the most frequent here, and its lead over the other words is large.

There are positive words such as 'berbagi' (sharing), 'tulus' (sincere), and 'bergerak' (move).

The word 'Allah' also appears.

5. What are the top-30 word frequencies in that month in Prabowo supporters' tweets?

How to do it?

Again, it is similar to RQ 4; this time we filter for Prabowo supporters' tweets.

    tweet_2019_prabowo = tweet[filter_prabowo]
    stat_word = get_stat_with_func(
        tweet_2019_prabowo,
        lambda x: get_tweet_common_word(x, text_column="tweet3", most_common=30),
        'month')
    i = 'Sep'
    plotting = stat_word.loc[i]
    # print(plotting)
    freq = plotting.apply(lambda x: x[1]).values
    word = plotting.apply(lambda x: x[0]).values
    data = [
        go.Bar(x=word, y=freq),
    ]
    layout = go.Layout(title='Word Freq in {} for Prabowo Hashtag'.format(i))
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)

Plot it!

Analysis

It is unexpected that 'jokowi' is more frequent than 'prabowo'.

The most frequent word is 'indonesia'.

The differences between the word frequencies are not very large.

The words that we should notice are 'ulama' (Muslim scholar or cleric), 'rezim' (regime), 'cebong' (tadpole, a derogatory nickname that Prabowo's supporters use for Jokowi's supporters), 'emak' (group of mothers), and 'bangsa' (nation).

The word 'Allah' also appears.

6. What are the top-30 tokens that accompany the word 'prabowo' in Jokowi supporters' tweets in that month?

Before we do it, since we keep writing the same plotting code, we should create a reusable function.

    def plot_freq_word(x, y, title='Title'):
        data = [
            go.Bar(x=x, y=y),
        ]
        layout = go.Layout(
            title=title,
            xaxis=dict(
                title='Kata',  # 'kata' is Indonesian for 'word'
                titlefont=dict(
                    family='Latto',
                    size=8,
                    color='#7f7f7f'
                ),
                tickangle=45,
                tickfont=dict(
                    size=8,
                    color='black'
                ),
            )
        )
        fig = go.Figure(data=data, layout=layout)
        iplot(fig)

After that, we will filter the DataFrame according to what we need.

    tweet_prabowo_in_jokowi = tweet[
        (filter_jokowi) & (tweet.tweet3.str.contains('prabowo', case=False))]
    stat_word = get_stat_with_func(
        tweet_prabowo_in_jokowi,
        lambda x: get_tweet_common_word(x, text_column="tweet3", most_common=30),
        'month')
    i = 'Sep'
    # [1:] skips the top entry, presumably the search word itself
    plotting = stat_word.loc[i][1:].dropna()
    freq = plotting.apply(lambda x: x[1]).values
    word = plotting.apply(lambda x: x[0]).values
    plot_freq_word(word, freq, "Frequency of Token accompany 'prabowo' Token in Jokowi Supporter's Tweet on {}".format(i))

Plot it!

Analysis

'jokowi' has the highest frequency here.

Some interesting words are 'uang' (money), 'thecebongers' (the tadpoles), 'prestasinya' (the achievement), 'survei' (survey), and 'asing' (foreign countries).

7. What are the top-30 tokens that accompany the word 'jokowi' in Prabowo supporters' tweets in that month?

We will filter the DataFrame according to what we need.

    tweet_jokowi_in_prabowo = tweet[
        (filter_prabowo) & (tweet.tweet3.str.contains('jokowi', case=False))]
    stat_word = get_stat_with_func(
        tweet_jokowi_in_prabowo,
        lambda x: get_tweet_common_word(x, text_column="tweet3", most_common=30),
        'month')
    i = 'Sep'
    # [1:] skips the top entry, presumably the search word itself
    plotting = stat_word.loc[i][1:].dropna()
    freq = plotting.apply(lambda x: x[1]).values
    word = plotting.apply(lambda x: x[0]).values
    plot_freq_word(word, freq, "Frequency of Token accompany 'Jokowi' Token in Prabowo Supporter's Tweet on {}".format(i))

Plot it!

Analysis

'prabowo' has the highest frequency here.

Some interesting words are 'gerakan' (movement), 'ulama', 'mahasiswa' (college student), 'rupiah' (Indonesia's currency), and 'rezim' (regime).

8. What are the top-30 tokens that accompany the word 'prabowo' in Prabowo supporters' tweets in that month?

We will filter the DataFrame according to what we need.

Analysis

'sandi' has the highest frequency here, with a big gap to the other words.

The words that caught my attention are 'ulama', 'allah', 'emak', 'gerakan', and 'ijtima' (a decision made by ulama, i.e. Muslim scholars).

'jokowi' has the second-highest frequency here.

9. What are the top-30 tokens that accompany the word 'jokowi' in Jokowi supporters' tweets in that month?

How do we do it?

We will filter the DataFrame according to what we need:

    tweet_jokowi_in_jokowi = tweet[
        (filter_jokowi) & (tweet.tweet3.str.contains('jokowi', case=False))]
    stat_word = get_stat_with_func(
        tweet_jokowi_in_jokowi,
        lambda x: get_tweet_common_word(x, text_column="tweet3", most_common=30),
        'month')
    i = 'Sep'
    plotting = stat_word.loc[i]
    plotting = plotting.dropna()
    freq = plotting.apply(lambda x: x[1]).values[1:]
    word = plotting.apply(lambda x: x[0]).values[1:]
    plot_freq_word(word, freq, "Frequency of Token accompany 'jokowi' Token in Jokowi Supporter's Tweet on {}".format(i))

Plot it!

Analysis

'prabowo' is not among the 20 most frequent words here, unlike in the results above.

Anyway, the words that caught my attention are 'blokir' (blocked), 'pembangunan' (construction), 'kepemimpinan' (leadership), 'allah', 'hebat' (great), and 'bergerak' (move).

10. What are the top-30 hashtags in Prabowo supporters' tweets?

How to do it?

Since the hashtags column is stored as strings, we need to cast each value into a list by using eval.

After that, we join the contents of each list with ' ' and call our previous function.

We will look at the statistics for the whole dataset, not only September.

    # .copy() avoids pandas' SettingWithCopyWarning when adding columns below
    hashtag_di_prabowo = tweet[(filter_prabowo)].copy()
    hashtag_di_prabowo['hashtags'] = hashtag_di_prabowo['hashtags'].apply(eval)
    hashtag_joined = hashtag_di_prabowo['hashtags'].apply(lambda x: ' '.join(x))
    hashtag_di_prabowo['hashtag_joined'] = hashtag_joined
    stat_word = get_stat_with_func(
        hashtag_di_prabowo,
        lambda x: get_tweet_common_word(x, text_column="hashtag_joined", most_common=20),
        'month')
    i = 'all'
    plotting = stat_word.loc[i]
    plotting = plotting.dropna()
    freq = plotting.apply(lambda x: x[1]).values
    word = plotting.apply(lambda x: x[0]).values
    plot_freq_word(word, freq, "Frequency of Hashtags in Prabowo Supporter's Tweet")

Plot it!

Analysis

The hashtags that caught my eye are '2019tetapantipki' (2019 will stay anti-communism), 'mahasiswabergerak' (college students move), 'rupiahlongsor jokowilengser' (the rupiah falls, Jokowi steps down), and 'jokowi2periode' (Jokowi for two periods).

The last one is supposed to be a hashtag of Jokowi's supporters.

The hashtags mostly talk about changing the president and negative things about Jokowi.
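One caveat on the eval step used for this question: eval executes any Python expression it is given, which is risky on scraped data. A stricter alternative is ast.literal_eval from the standard library, which only parses literals such as lists and strings. The sketch below shows the idea on made-up rows. The exact string format of the hashtags column is an assumption.

```python
import ast
from collections import Counter

# Made-up cells as they might look after reading the CSV:
# each one is the string form of a Python list.
raw_hashtags = [
    "['#2019gantipresiden', '#2019prabowosandi']",
    "['#2019gantipresiden']",
    "[]",
]

# literal_eval turns each string back into a list without running arbitrary code.
parsed = [ast.literal_eval(cell) for cell in raw_hashtags]

# Count every hashtag across all tweets.
counts = Counter(tag for tags in parsed for tag in tags)
print(counts.most_common(2))
```

In the DataFrame code, this would mean passing ast.literal_eval to apply in place of eval.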

11. What are the top-30 hashtags in Jokowi supporters' tweets?

How to do it?

It is really similar to RQ 10.

    # .copy() avoids pandas' SettingWithCopyWarning when adding columns below
    hashtag_di_jokowi = tweet[(filter_jokowi)].copy()
    hashtag_di_jokowi['hashtags'] = hashtag_di_jokowi['hashtags'].apply(eval)
    hashtag_joined = hashtag_di_jokowi['hashtags'].apply(lambda x: ' '.join(x))
    hashtag_di_jokowi['hashtag_joined'] = hashtag_joined
    stat_word = get_stat_with_func(
        hashtag_di_jokowi,
        lambda x: get_tweet_common_word(x, text_column="hashtag_joined", most_common=20),
        'month')
    i = 'all'
    plotting = stat_word.loc[i]
    plotting = plotting.dropna()
    freq = plotting.apply(lambda x: x[1]).values[1:]
    word = plotting.apply(lambda x: x[0]).values[1:]
    plot_freq_word(word, freq, "Frequency of Hashtags in Jokowi Supporter's Tweet")

Plot it!

Analysis

The hashtags that caught my eye are 'indonesiamaju' (Advanced Indonesia), 'jokowimembangunindonesia' (Jokowi builds Indonesia), 'kerjanyata' (visible work), and 'diasibukkerja' (he's busy working).

Mostly, the hashtags are about keeping Jokowi as the president and positive things about Jokowi.

And again, there is '2019gantipresiden', which is supposed to be a hashtag of Prabowo's supporters.

12. What is the mean length in characters of Jokowi and Prabowo supporters' tweets?

How do we do it?

We will use a line chart.

Since these data are comparable, we will visualize them in one figure.

We will make our default line chart (place_default_color and default_define_layout are helpers that are not shown in this article):

    def show_plotly_linechart(x_data, y_data, color_function=place_default_color,
                              title="This is title", dash=[], mode=[],
                              x_axis_dict_lyt=None, name=None, y_axis_dict_lyt=None,
                              custom_layout=None, x_title=None, y_title=None):
        assert len(x_data) == len(y_data)
        line_chart = []
        for idx, (x, y) in enumerate(zip(x_data, y_data)):
            # color and current_dash are computed but not passed to go.Scatter here
            color = color_function(x, idx)
            current_dash = 'dash'
            current_mode = 'lines+markers'
            if len(dash) > 0:
                current_dash = dash[idx]
            if len(mode) > 0:
                current_mode = mode[idx]
            if name is None:
                name = ["Trace"] * len(x_data)
            line_chart.append(go.Scatter(x=x, y=y, mode=current_mode, name=name[idx]))
        layout = custom_layout
        if layout is None:
            layout = default_define_layout(x_axis_dict_lyt, y_axis_dict_lyt, title, x_title, y_title)
        fig = go.Figure(data=line_chart, layout=layout)
        iplot(fig)

We also need some new functions:

    def get_length_char(df, text_column="tweet3"):
        # mean number of characters per tweet
        if df.shape[0] > 0:
            return df[text_column].apply(len).mean()

    def get_word_length(df, text_column="tweet3"):
        # mean number of whitespace-separated words per tweet
        if df.shape[0] > 0:
            return df[text_column].apply(lambda x: len(x.split())).mean()

We are done, let's plot them. Here jokowi_char_length and prabowo_char_length are presumably computed with get_stat_with_func and get_length_char, in the same way as the word-length statistics in RQ 13.

    title_label = ["Length Char Jokowi", "Length Char Prabowo"]
    counter = 0
    x_l = []
    y_l = []
    for char_len in [jokowi_char_length, prabowo_char_length]:
        x = char_len[0].index[:12]
        y = char_len[0][:12].values
        x_l.append(x[::-1])
        y_l.append(y[::-1])
    show_plotly_linechart(x_l, y_l, title="Length Char", name=title_label)

We build a list that contains the two line charts and show them in one figure.

The [::-1] reverses the months, because the default order runs from December to January.

Plot it!

Analysis

The mean character length of Jokowi supporters' tweets tends to rise and peaks in November.

The mean character length of Prabowo supporters' tweets, on the other hand, tends to rise until August and tends to fall after that month.

13. What is the mean length in words of Jokowi and Prabowo supporters' tweets?

How to do it?

This is our last RQ.

It is the same as above, but we use the other new function:

    jokowi_word_length = get_stat_with_func(tweet[filter_jokowi], get_word_length, label='month')
    prabowo_word_length = get_stat_with_func(tweet[filter_prabowo], get_word_length, label='month')

That's all, let's plot it:

    title_label = ["Jokowi", "Prabowo"]
    counter = 0
    x_l = []
    y_l = []
    for word_len in [jokowi_word_length, prabowo_word_length]:
        x = word_len[0].index[:12]
        y = word_len[0][:12].values
        x_l.append(x[::-1])
        y_l.append(y[::-1])
    show_plotly_linechart(x_l, y_l, title="Word Length", name=title_label)

Plot it!

Analysis

As expected, the result is almost the same as the answer to RQ 12.

Conclusion

We have answered all the research questions that we defined.

There are many interesting points in the answers, such as the kinds of words among the top-30 most frequent words in each candidate's supporters' tweets, and how each group of supporters talks about their own candidate or their candidate's opponent.

I won't dive deeper into the statistics here, as it would make this article longer.

After doing EDA, we should notice that there are some things that should be cleaned to make the data better.

For example, there are some tweets that contain both a Jokowi supporters' hashtag and a Prabowo supporters' hashtag.

These tweets should be removed from the dataset.

We should go back to the cleaning step and do the EDA again.
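As a rough illustration of that cleaning step, the sketch below flags tweets whose hashtags belong to both camps. The helper name supports_both and the sample rows are hypothetical, and I assume each tweet's hashtags are available as a list of strings.

```python
# Hashtag sets from Step One, lowercased and without the leading '#'.
JOKOWI_TAGS = {"jokowi2periode", "jokowilagi", "01indonesiamaju",
               "2019jokowilagi", "2019tetapjokowi", "tetapjokowi"}
PRABOWO_TAGS = {"2019gantipresiden", "2019prabowosandi", "2019prabowopresiden"}


def supports_both(hashtags):
    """Return True if a tweet carries hashtags from both supporter groups."""
    tags = {tag.lower().lstrip("#") for tag in hashtags}
    return bool(tags & JOKOWI_TAGS) and bool(tags & PRABOWO_TAGS)


# Made-up hashtag lists for three tweets.
sample = [
    ["#JokowiLagi", "#kerjanyata"],
    ["#2019GantiPresiden", "#Jokowi2Periode"],  # mixed: should be removed
    ["#2019PrabowoSandi"],
]

kept = [tags for tags in sample if not supports_both(tags)]
print(len(kept))  # 2
```

On the DataFrame, the same function could be applied to the hashtags column to build a boolean mask for dropping the mixed tweets.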

Afterwords

https://pixabay.com/en/cat-tired-yawn-stretch-814952/

That's it, folks, for my article, which is mostly about EDA.

Actually, I have answered more RQs than these, but to keep this article short I selected only a few of them.

You may be wondering about some of our findings. For that, you need to dive deeper into exploring the data.

I will share the dataset if enough readers want it.

There are many tasks that can be done with this dataset, such as topic modelling, sentiment analysis, anomaly detection (such as detecting buzzers), and many other interesting tasks.

If anyone wants me to write about them, I will consider it.

I welcome any feedback that can improve me and this article. I am still learning to write, and I really need feedback to get better. Just make sure to give feedback in a proper manner.

For my next several articles, I'll go back to NLP or (maybe) Computer Vision topics.

https://pixabay.com/en/calligraphy-pen-thanks-thank-you-2658504/

Repository

TBD

Source

The Data Science Process, www.kdnuggets.com
