Visualisation of Information from Raw Twitter Data — Part 2

Let's check it out! For this we need to download and import the Botometer Python library, and get a key to be able to use their API.

The information on how to do this can be found at the following link: Botometer API Documentation (OSoMe) | RapidAPI (rapidapi.com). The page describes the service as follows: "Botometer (formerly Truthy BotOrNot) checks the activity of a Twitter account and gives it a score based on how likely…"

Also, we will need to retrieve our Twitter API keys, as Botometer needs them to access the information from the accounts whose activity we want to study.

First we will import both libraries.

Take into account that both Botometer and Tweepy have to be installed beforehand, using a package manager of your choice.

    # Now we will see the probabilities of each of the users being a bot
    # using the BOTOMETER API:
    import botometer
    import tweepy

After this, we will input the API keys that are needed:

    # Key from BOTOMETER API
    mashape_key = "ENTER BOTOMETER API KEY"

    # Dictionary with the credentials for the Twitter APIs
    twitter_app_auth = {
        'access_token': "ENTER ACCESS TOKEN",
        'access_token_secret': "ENTER ACCESS TOKEN SECRET",
        'consumer_key': "ENTER CONSUMER KEY",
        'consumer_secret': "ENTER CONSUMER SECRET",
    }

Like in the previous posts, replace the ‘ENTER…’ with the corresponding key and you’re good to go.

Run the following block of code to access the Botometer API, and let's see which accounts have the highest chance of being bots out of the top 25 tweeting users!

    # Connecting to the Botometer API
    bom = botometer.Botometer(wait_on_ratelimit=True,
                              mashape_key=mashape_key,
                              **twitter_app_auth)

    # Builds a dictionary with the most active users and the percentage
    # likelihood of each of them being a bot, according to Botometer
    bot_dict = {}
    # dict_keys is not defined in this post; it is presumably the list of
    # the most active usernames built in the earlier analysis
    top_users_list = dict_keys
    for user in top_users_list:
        user = '@' + user
        try:
            result = bom.check_account(user)
            # The score comes back between 0 and 1; store it as a percentage
            bot_dict[user] = int((result['scores']['english']) * 100)
        except tweepy.TweepError:
            # Tweepy 4.0 and later renamed this exception to TweepyException
            bot_dict[user] = 'None'
            continue

The output of this block is a dictionary (bot_dict) where the keys are the names of the accounts we are checking, and the values are percentages between 0 and 100 (Botometer's score between 0 and 1, multiplied by 100) that depict the probability of each user being a bot. The score takes into account factors like the ratio of followers to followees, the description of the account, the frequency of publications, the type of publications, and more.

For some users, the Botometer API gets a rejected request error, so these will have a ‘None’ as their value.

For me, I get the following results when checking bot_dict:

    {'@CryptoKaku': 25, '@ChrisWill1337': 'None', '@Doozy_45': 44,
     '@TornadoNewsLink': 59, '@johnnystarling': 15, '@brexit_politics': 42,
     '@lauramarsh70': 32, '@MikeMol1982': 22, '@EUVoteLeave23rd': 66,
     '@TheStephenRalph': 11, '@DavidLance3': 40, '@curiocat13': 6,
     '@IsThisAB0t': 68, '@Whocare31045220': 'None', '@EUwatchers': 34,
     '@c_plumpton': 15, '@DuPouvoirDachat': 40, '@botcotu': 5,
     '@Simon_FBFE': 42, '@CAGeurope': 82, '@botanic_my': 50,
     '@SandraDunn1955': 36, '@HackettTom': 44, '@shirleymcbrinn': 13,
     '@JKLDNMAD': 20}

Out of these, the account with the highest chance of being a bot is @CAGeurope, with a probability of 82%.
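To pull out the highest-scoring accounts without scanning the dictionary by eye, the entries can be sorted by score. The following is only a minimal sketch: the helper name rank_bot_scores is made up for illustration, and the sample dictionary is a small subset of the results above. It leaves out the accounts whose value is the string 'None'.

```python
def rank_bot_scores(bot_dict):
    """Return (account, score) pairs sorted from highest to lowest score.

    Accounts whose request was rejected (stored as the string 'None')
    are left out. rank_bot_scores is a hypothetical helper, not part of
    the Botometer library.
    """
    scored = [(user, score) for user, score in bot_dict.items()
              if isinstance(score, int)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# A subset of the results shown above
sample = {
    '@CAGeurope': 82,
    '@ChrisWill1337': 'None',
    '@IsThisAB0t': 68,
    '@curiocat13': 6,
    '@EUVoteLeave23rd': 66,
}

ranking = rank_bot_scores(sample)
print(ranking[0])  # ('@CAGeurope', 82)
```

Slicing the result, for example ranking[:5], would give the five accounts most worth inspecting by hand.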

Let's check out this account to see why Botometer assigns it such a high probability of being a bot.

[Figure: Twitter account of @CAGeurope]

It looks like a legit account; however, there are various reasons that explain why Botometer gave it such a high probability of being a bot.

First, the account follows almost 3 times as many accounts as the number of accounts that follow it.

Secondly, if we look at how often they tweet, we can see that they consistently publish several tweets every hour, sometimes at 5-minute intervals, which is a LOT of tweets.

Lastly, the content of their tweets is always very similar: a short text, a URL and some hashtags.

In case you don’t want to code anything or get an API key, Botometer also provides a web-based solution, where you can also check the probability of an account being a bot:

[Figure: Web-based solution offered by Botometer]

Looks like I’m going to have to stop spamming the retweet button and mass following people in order to make my Twitter account more human-like :P

Cool! We can see a lot more information about the users through the ‘user’ object in the tweet’s JSON; however, this will be left for a different post.

Now, let's make a time series of the tweet publications, so we can see on which days more tweets about the chosen topic were produced, and try to find out which events caused these peaks.

We will plot the number of tweets being published on each day of a specific month.

To show a plot similar to this one, but for a longer period of time, some additional code would have to be added.

First we need to modify the ‘Timestamp’ field of our dataframe, converting it to a datetime object with the pandas function to_datetime.

    # infer_datetime_format takes a boolean, not a format string; to force
    # a specific layout, pass format="%d/%m/%Y" instead. (The flag is
    # deprecated from pandas 2.0, where inference is the default.)
    tweets['Timestamp'] = pd.to_datetime(tweets['Timestamp'],
                                         infer_datetime_format=True,
                                         utc=False)

Then, we create a function that returns the day of the datetime object, and apply it to our ‘Timestamp’ field to create a new column in our dataframe that stores the day when the tweet was published.

Also, we will group the days together, count the number of tweets (using the ‘text’ field) produced on each day, and create a dictionary (timedict) with the results, where the keys are the numbers corresponding to the days of the month and the values are the number of tweets published on each day.

    def giveday(timestamp):
        # .day is the day of the month, as an integer
        day_string = timestamp.day
        return day_string

    tweets['day'] = tweets['Timestamp'].apply(giveday)
    days = tweets.groupby('day')
    daycount = days['text'].count()
    timedict = daycount.to_dict()

After doing this, we are ready to plot our results!

    fig = plt.figure(figsize=(15, 15))
    plt.plot(list(timedict.keys()), list(timedict.values()))
    plt.xlabel('Day of the month', fontsize=12)
    plt.ylabel('Nº of Tweets', fontsize=12)
    plt.xticks(list(timedict.keys()), fontsize=15, rotation=90)
    plt.title('Number of tweets on each day of the month', fontsize=20)
    plt.show()

[Figure: Time series of a 2-day tweet collection for #Brexit (left) and of a whole month for the #Oscars (right)]

If, like me, you only collected tweets for a couple of days, you will get a very short time series, like the image on the left.

The one on the right, however, shows a full-month time series made from a dataset of tweets about the #Oscars, which was built by querying the Streaming API for tweets for more than one month.

In this second time series, we can see how there are very few tweets being produced at the beginning of the month, and as the day of the ceremony comes closer the tweet production starts going up, to reach its peak on the night of the event.
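For a collection that spans more than one month, grouping by day of the month would merge, say, 3 January and 3 February into the same bucket. One possible way around this is to count tweets per full calendar date instead. The sketch below uses only the standard library and made-up timestamps; tweets_per_date is a hypothetical helper, not code from the original notebook.

```python
from collections import Counter
from datetime import datetime


def tweets_per_date(timestamps):
    """Count tweets per full calendar date (year, month and day)."""
    counts = Counter(ts.date() for ts in timestamps)
    # Sort by date so the result can be plotted from left to right
    return dict(sorted(counts.items()))


# Made-up timestamps spanning two months, for illustration only
example = [
    datetime(2019, 1, 3, 10, 15),
    datetime(2019, 1, 3, 18, 40),
    datetime(2019, 2, 3, 9, 5),
    datetime(2019, 2, 4, 23, 59),
]

per_date = tweets_per_date(example)
for date, count in per_date.items():
    print(date.isoformat(), count)
```

With the dataframe used in this post, the same idea could presumably be expressed by grouping on tweets['Timestamp'].dt.date rather than on the day of the month; the resulting keys and values can then be passed to the same plotting calls.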

Awesome! Now, we will make a plot of the devices the tweets are published from.

As the code is pretty much the same as the code used for the previous bar plots, I will just post it here with no further explanation:

    # Now let's explore the different devices where the tweets are
    # produced from and plot these results
    devices = tweets.groupby('device')
    devicecount = devices['text'].count()

    # Same procedure as for the mentions, hashtags, etc.
    device_dict = devicecount.to_dict()
    device_ordered_list = sorted(device_dict.items(), key=lambda x: x[1])
    device_ordered_list = device_ordered_list[::-1]
    device_dict_values = []
    device_dict_keys = []
    for item in device_ordered_list:
        device_dict_keys.append(item[0])
        device_dict_values.append(item[1])

Now we plot and see the results:

    fig = plt.figure(figsize=(12, 12))
    index = np.arange(len(device_dict_keys))
    plt.bar(index, device_dict_values, edgecolor='black', linewidth=1)
    plt.xlabel('Devices', fontsize=15)
    plt.ylabel('Nº tweets from device', fontsize=15)
    plt.xticks(index, list(device_dict_keys), fontsize=12, rotation=90)
    plt.title('Number of tweets from different devices', fontsize=20)
    plt.show()

[Figure: Plot of tweet production from different devices]

By looking at this chart we can see that most tweets are published from smartphones, and that within this category Android devices beat iPhones by a small margin.

The web-produced tweets could also come from a mobile device, but they are published from a browser and not from the Twitter app.

Aside from these web-produced tweets (for which we cannot tell whether they were published from a PC, Mac or mobile web browser), there are very few tweets coming from recognised Mac or Windows devices.

These results fit very well with the relaxed and easy going nature of the Social Network.
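The grouping into broader categories used in this discussion (smartphone apps, web, desktop) can also be done in code rather than by reading the bar labels. Here is a rough sketch, assuming the ‘device’ field holds source labels such as 'Twitter for Android' or 'Twitter Web Client'; the exact labels in your data may differ, and both the keyword rules and the device_category helper are illustrative assumptions.

```python
from collections import Counter


def device_category(source):
    """Map a raw source label to a broad category.

    The keyword rules below are illustrative assumptions, not an
    official Twitter classification.
    """
    label = source.lower()
    if 'android' in label or 'iphone' in label:
        return 'smartphone app'
    if 'web' in label:
        return 'web browser'
    if 'mac' in label or 'windows' in label:
        return 'desktop app'
    return 'other'


# Made-up source labels, for illustration only
sources = [
    'Twitter for Android',
    'Twitter for iPhone',
    'Twitter Web Client',
    'Twitter for Android',
    'Twitter for Mac',
    'SomeSchedulingTool',
]

category_counts = Counter(device_category(s) for s in sources)
print(category_counts.most_common())
```

On the dataframe, something like tweets['device'].apply(device_category) could create a new column to group by, in the same way the ‘day’ column was built earlier.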

Lastly, let's look at some additional information that can be easily obtained from the gathered data:

    # Let's see other useful information that can be gathered:

    # MEAN LENGTH OF THE TWEETS
    print("The mean length of the tweets is:", np.mean(tweets['length']))

    # TWEETS WITH A URL
    url_tweets = tweets[tweets['text'].str.contains("http")]
    print(f"The percentage of tweets with Urls is {round(len(url_tweets)/len(tweets)*100)}% of all the tweets")

    # MEAN TWEETS PER USER
    print("Number of tweets per user:", len(tweets)/tweets['Username'].nunique())

For me, the mean length of the tweets is 145, 23% of the tweets contain a URL, and the mean tweet production per user is 2.23 tweets.

That's it! You can find the Jupyter Notebook used for this post and the previous one here, along with the scripts and notebooks for my other posts regarding Twitter data collection.

Also, feel free to follow me on Twitter @jaimezorno, on this platform, or contact me on LinkedIn.

Thanks a lot for reading, please clap, keep tweeting and see you soon!
