Downloading Data From Twitter Using the REST API

This is the second article in a series about acquiring data from Twitter and using it to gain insights, such as identifying the most influential users on a certain trend, topic modelling and much more.

If you have not read the first article, you can take a look at it here: Downloading Data From Twitter Using the Streaming API. That post covers how to use the Streaming API to get Tweets that contain certain words or hashtags.

While the previous article discussed how to gather data from Twitter as it is produced in real time, this article covers how to collect historical data, such as the previous tweets of a certain user, their followers, or their friends.

Let’s get started!

Using the REST API to collect historical data

While the Streaming API let us collect data as it was produced in real time, the REST API serves the opposite purpose: gathering data that was produced before the time of collection, i.e. historical data.

Using this API we can collect old tweets containing certain keywords, similarly to how it was done before, but we can also gather other information that is relevant to the platform, like the friends and followers of different user accounts, retweets from a certain account, or retweeters of a certain tweet.

Users inside the Twitter APIs are identified by two different variables:

· The user screen_name, which is the Twitter name with the @ that we are all used to, for example “@jaimezorno”.

· The user_id, which is a unique numerical identifier for each Twitter user. It is a very long numerical string, like 747807250819981312.

In the data collection process, we can specify the user we want to collect data from using either their screen_name or their user_id. So before diving into the more complex functions provided by the REST API, we will look at how to obtain the Twitter Id of a user whose username we know, and vice versa.

Obtaining the Twitter Id from the screen name and vice versa

Going from the Twitter Id to the screen name is needed because some of the functions that we will describe later return Twitter identifiers instead of screen names, so we need this functionality if we want to see which users are behind those Ids.

As always, the first step is to connect to the Twitter API.

    import tweepy
    import time

    access_token = "ENTER YOUR ACCESS TOKEN"
    access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
    consumer_key = "ENTER YOUR CONSUMER KEY"
    consumer_secret = "ENTER YOUR CONSUMER SECRET"

    # Authenticate and create the connection to the REST API
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

In the code, replace “ENTER YOUR …” with your credentials, and then run the last three lines to create a connection to the Twitter REST API.

Notice how this time we do not create a stream object, like we did to use the streaming API, but an api object.

Once we have done this, going from screen name to Id and vice versa is very simple. It is done by running the lines in the following block of code:

    user = api.get_user(screen_name='theresa_may')
    print(user.id)

This block, which queries the REST API for the user_id of Theresa May’s official Twitter account, returns 747807250819981312, which is the Id associated with that account.

It is important here to see that the screen_name does not include the @.

To do this in the opposite direction and get the screen name of an account whose Id we know, it is as easy as:

    user = api.get_user(user_id=747807250819981312)
    print(user.screen_name)

which would print: theresa_may.

As we can see, both the Id and the screen name are attributes of the user object returned by the API, which contains a lot of valuable information such as the user’s follower count, number of publications, account creation date and much more.

These attributes will be explored in a different post.

It is as easy as that to go from username to id and the other way around.
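As a small illustration of the user object mentioned above, the sketch below copies a few of its attributes into a plain dictionary. The helper name summarize_user is hypothetical, and the stand-in object with made-up values only mimics the attribute names (id, screen_name, followers_count, statuses_count, created_at) so that the snippet runs without credentials. With a real connection you would pass the result of api.get_user instead.

```python
from types import SimpleNamespace


def summarize_user(user):
    """Copy a few attributes of a user object into a plain dict (hypothetical helper)."""
    return {
        "id": user.id,
        "screen_name": user.screen_name,
        "followers_count": user.followers_count,
        "statuses_count": user.statuses_count,
        "created_at": str(user.created_at),
    }


# Stand-in for the object returned by api.get_user; all values are made up.
fake_user = SimpleNamespace(
    id=123,
    screen_name="example_account",
    followers_count=10,
    statuses_count=250,
    created_at="2016-06-28 00:00:00",
)

summary = summarize_user(fake_user)
print(summary)
```

A dictionary like this is easy to store or to convert to JSON later on.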

Now let’s explore more complex and useful functionalities of the REST API.

Gathering the tweets from the timeline of a certain user

The timeline of a user is the set of past tweets that he or she has published or retweeted.

It is useful to collect this information to get an idea of the previous activity of a certain account within the social network.

We have to keep in mind, however, that the method we will use can only return the 3,200 most recent tweets of a specific user, so if we are gathering posts from a very active account and want tweets from a long time ago, we will not be able to obtain them.

This is a known limitation of the Twitter API, with no fix on the horizon, presumably because this way Twitter does not have to keep every tweet ever produced by every single account readily available.
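To get a feel for what this limit means in practice, the arithmetic below estimates how many requests a full timeline download takes. It assumes a maximum of 200 tweets per user_timeline request (the largest value of the count parameter). The constant and function names are illustrative only.

```python
import math

MAX_TIMELINE_TWEETS = 3200     # most recent tweets reachable through user_timeline
MAX_TWEETS_PER_REQUEST = 200   # assumed maximum value of the count parameter


def timeline_requests_needed(total_tweets, per_request=MAX_TWEETS_PER_REQUEST):
    """Return how many requests are needed to page through the reachable tweets."""
    reachable = min(total_tweets, MAX_TIMELINE_TWEETS)
    return math.ceil(reachable / per_request)


print(timeline_requests_needed(500))     # account with 500 tweets
print(timeline_requests_needed(50000))   # very active account, capped at 3200 tweets
```

Under these assumptions, even the most active account needs only a handful of requests, because everything older than the cap is simply not returned.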

After creating a connection to the Twitter REST API as described above, we can collect the timeline of a user with a code structure similar to the following block:

    try:
        for tweet in tweepy.Cursor(api.user_timeline, screen_name="theresa_may", exclude_replies=True).items():
            print(tweet)
    except tweepy.TweepError:
        # Wait for a minute if the API returns an error
        time.sleep(60)

As we can see, this code introduces a new concept inherent to the Twitter API: the Cursor object.

As intimidating as it might seem, it is nothing more than the way that the API has to handle pagination and be able to deliver content in an efficient and ordered manner.

In this case we would be collecting the historical tweets from the user @theresa_may excluding the replies to tweets from other users.

A similar parameter, include_rts, can be set to False to remove the retweets from this user’s timeline.

Also, a try-except block was added to handle any errors we might run into, like exceeding the request rate limit or querying protected users.

Such errors are very frequent when operating with APIs of this sort.
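The blocks in this post wait after an error but do not try the request again. If a retry is wanted, one possible pattern is a small wrapper like the sketch below. The name call_with_retry and its parameters are hypothetical, and the demo uses a stand-in function that fails twice so it runs without credentials. With Tweepy one would pass the API call as func and exceptions=(tweepy.TweepError,).

```python
import time


def call_with_retry(func, retries=3, wait_seconds=60, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception, wait and try again up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # give up after the last allowed attempt
            sleep(wait_seconds)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit error")
    return "ok"


waits = []  # record the waits instead of really sleeping
result = call_with_retry(flaky, retries=3, wait_seconds=60,
                         exceptions=(RuntimeError,), sleep=waits.append)
print(result, calls["count"], waits)
```

The sleep function is passed in as a parameter only so that the demo does not actually pause. In real use the default time.sleep would be kept.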

The output of this code is a rather ugly-looking object called a Status object for every tweet, which looks like this:

    Status(_api=<tweepy.api.API object at 0x000001C52728A710>, _json={'created_at': 'Sun May 12 11:55:41 +0000 2019', 'id': 1127542860520329216, 'id_str': '1127542860520329216', 'text': 'Congratulations to @SPendarovski on your inauguration as President of North Macedonia. I witnessed the strong relat…………

As with the user object, another post will explain in detail the nature of these objects and their attributes. For now we will only describe how to collect some of the most interesting fields from them.

Let’s see how we can do this.

We will keep the same code structure as in the previous block, adding some extra lines to grab the parts of the Status object that we find most relevant.

    try:
        for tweet in tweepy.Cursor(api.user_timeline, screen_name="theresa_may", exclude_replies=True, count=10).items():
            tweet_text = tweet.text
            # Named timestamp so that it does not shadow the time module used below
            timestamp = tweet.created_at
            tweeter = tweet.user.screen_name
            print("Text:" + tweet_text + ", Timestamp:" + str(timestamp) + ", user:" + tweeter)
    except tweepy.TweepError:
        time.sleep(60)

This time, executing this block of code should print something like:

    Text:We’re driving the biggest transformation in mental health services for more than a generation. https://t.co/qOss2jOh4c, Timestamp:2019-06-17 07:19:59, user:theresa_may
    Text:RT @10DowningStreet: PM @Theresa_May hosted a reception at Downing Street to celebrate:✅ 22 new free schools approved to open ✅ 19,000 ad…, Timestamp:2019-06-15 13:53:34, user:theresa_may
    Text:Two years on from the devastating fire at Grenfell Tower, my thoughts remain with the bereaved, the survivors and t… https://t.co/Pij3z3ZUJB, Timestamp:2019-06-14 10:31:59, user:theresa_may

Keep in mind that the tweets you get depend on what the user has published before the code is executed, so you will most likely not get the same tweets as me if you run these blocks with theresa_may as the target user.

Although the output of the previous block looks nicer, we might want the data in a format that makes it easy to store and process later, like JSON.

We will make one last modification to our code in order to print out each tweet, along with the fields from that tweet that we want, as a JSON object.

For this we need to import the json library and make some further changes to our code, as shown below:

    import json

    try:
        for tweet in tweepy.Cursor(api.user_timeline, screen_name="theresa_may", exclude_replies=True, count=10).items():
            tweet_text = tweet.text
            # Named timestamp so that it does not shadow the time module used below
            timestamp = tweet.created_at
            tweeter = tweet.user.screen_name
            tweet_dict = {"tweet_text": tweet_text.strip(), "timestamp": str(timestamp), "user": tweeter}
            tweet_json = json.dumps(tweet_dict)
            print(tweet_json)
    except tweepy.TweepError:
        time.sleep(60)

This time we output the same fields as before, but in JSON format, which makes them easy for other people and programs to process.

The output for the same tweets in this case would be:

    {"tweet_text": "We\u2019re driving the biggest transformation in mental health services for more than a generation. https://t.co/qOss2jOh4c", "timestamp": "2019-06-17 07:19:59", "user": "theresa_may"}
    {"tweet_text": "RT @10DowningStreet: PM @Theresa_May hosted a reception at Downing Street to celebrate:\n\u2705 22 new free schools approved to open\n\u2705 19,000 ad\u2026", "timestamp": "2019-06-15 13:53:34", "user": "theresa_may"}
    {"tweet_text": "Two years on from the devastating fire at Grenfell Tower, my thoughts remain with the bereaved, the survivors and t\u2026 https://t.co/Pij3z3ZUJB", "timestamp": "2019-06-14 10:31:59", "user": "theresa_may"}

Having seen how to efficiently collect and process the timeline of a certain user, we will now look at how we can collect their friends and followers.
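For the JSON objects from the timeline example, one common way to store them is a file with one JSON object per line. The sketch below uses made-up records and a temporary file, so it runs without a Twitter connection. The file name tweets.jsonl is an arbitrary choice. By default json.dumps escapes non-ASCII characters as sequences such as \u2019, and passing ensure_ascii=False keeps them readable instead.

```python
import json
import os
import tempfile

# Made-up records with the same fields as the timeline example.
tweets = [
    {"tweet_text": "We\u2019re testing", "timestamp": "2019-06-17 07:19:59", "user": "example_account"},
    {"tweet_text": "Second tweet", "timestamp": "2019-06-15 13:53:34", "user": "example_account"},
]

path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")

# Write one JSON object per line; ensure_ascii=False keeps characters such as ’ readable.
with open(path, "w", encoding="utf-8") as handle:
    for tweet in tweets:
        handle.write(json.dumps(tweet, ensure_ascii=False) + "\n")

# Read the file back, one tweet per line.
with open(path, encoding="utf-8") as handle:
    loaded = [json.loads(line) for line in handle]

print(loaded[0]["tweet_text"])
```

Because line breaks inside a tweet are escaped by json.dumps, each record stays on a single line of the file.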

Gathering the followers of a certain user

Fetching the followers of a group of users is one of the most sought-after actions in Twitter research, as creating follower/followee networks can provide some very interesting insights into a group of users who tweet about a certain topic or hashtag.

To get the followers of a certain user, we just connect to the API with our credentials as before and then run the following code:

    try:
        followers = api.followers_ids(screen_name="theresa_may")
    except tweepy.TweepError:
        time.sleep(20)

If we set the parameter wait_on_rate_limit to True when creating the connection, as in api = tweepy.API(auth, wait_on_rate_limit=True), Tweepy waits automatically whenever the rate limit is reached instead of raising an error. Although we have not used it in the previous parts of this post, I suggest using it whenever you are going to download large amounts of data from the Twitter REST API.

Here, followers is a list with the Ids of the followers of the account @theresa_may. These Ids can then be translated to usernames using the api.get_user method described earlier.
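Calling api.get_user once per Id can be slow for long follower lists. The Twitter users/lookup endpoint accepts up to 100 Ids per request, so one option is to split the Ids into batches first. The sketch below shows only the batching step. The helper chunked is hypothetical, and the closing comment about api.lookup_users is an assumption about how a Tweepy 3.x connection would be used, not tested code.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


follower_ids = list(range(1, 251))  # made-up Ids standing in for a followers list

batches = list(chunked(follower_ids))
print([len(batch) for batch in batches])

# Assumption: with a Tweepy 3.x connection, each batch could then be resolved with
# something like api.lookup_users(user_ids=batch), reading user.screen_name from each result.
```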

If we want to collect the followers for a certain group of users, we only need to add a couple of lines of code to the previous block, like so:

    user_list = ["AaltoUniversity", "helsinkiuni", "HAAGAHELIAamk", "AaltoENG"]
    follower_list = []
    for user in user_list:
        followers = []
        try:
            followers = api.followers_ids(screen_name=user)
        except tweepy.TweepError:
            # On error, wait and keep an empty list so both lists stay aligned
            time.sleep(20)
        follower_list.append(followers)

In this case we would be collecting the followers of user accounts related to universities in Finland.

The output of this code is a list (follower_list) that holds, at each index, the list of followers of the account at the same index in user_list.

Relating these two lists (the user and the follower lists) is very easy using the enumerate function:

    for index, user in enumerate(user_list):
        print("User: " + user + " Number of followers: " + str(len(follower_list[index])))

The output of this block would be:

    User: AaltoUniversity Number of followers: 5000
    User: helsinkiuni Number of followers: 5000
    User: HAAGAHELIAamk Number of followers: 4927
    User: AaltoENG Number of followers: 144

which might leave you wondering: do the accounts @AaltoUniversity and @helsinkiuni have exactly the same number of followers, and is that number exactly 5000?

The most obvious answer here is no.

If you check the Twitter accounts of both universities, they both have followers in the tens of thousands.

So why do we only get 5000?

Well, this is because of pagination: the Twitter API breaks up its responses into pages, which we can think of as “chunks” of the requested information with a certain maximum size (5000 Ids in this case). To go from one page to the next we need to use a special kind of object called a Cursor object, which was mentioned above.
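To make the idea of pages and cursors more concrete, here is a toy sketch of cursor-based pagination that runs without any API access. Everything in it (fetch_page, iterate_pages, the tiny page size and the made-up Ids) is hypothetical. It only mimics the convention of starting with a cursor of -1 and stopping when the returned next cursor is 0, which Tweepy’s Cursor object handles for us.

```python
PAGE_SIZE = 3                      # tiny page size for the demo; the followers endpoint uses 5000
ALL_IDS = list(range(101, 111))    # ten made-up follower Ids


def fetch_page(cursor):
    """Stand-in for one API request: return (ids, next_cursor); 0 means no more pages."""
    start = 0 if cursor == -1 else cursor
    ids = ALL_IDS[start:start + PAGE_SIZE]
    next_start = start + PAGE_SIZE
    next_cursor = next_start if next_start < len(ALL_IDS) else 0
    return ids, next_cursor


def iterate_pages(fetch):
    """Follow next_cursor values until the stand-in API signals the last page."""
    cursor = -1  # conventional starting value for the first page
    while cursor != 0:
        ids, cursor = fetch(cursor)
        yield ids


pages = list(iterate_pages(fetch_page))
all_followers = [user_id for page in pages for user_id in page]
print(len(pages), len(all_followers))
```

A single call without a cursor corresponds to reading only the first page, which is why the earlier output stopped at 5000.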

The following code uses the same function, api.followers_ids, but this time with a Cursor object to be able to grab all the followers of each user:

    user_list = ["AaltoUniversity", "helsinkiuni", "HAAGAHELIAamk", "AaltoENG"]
    follower_list = []
    for user in user_list:
        followers = []
        try:
            for page in tweepy.Cursor(api.followers_ids, screen_name=user).pages():
                followers.extend(page)
        except tweepy.TweepError:
            # On error, wait and keep what was collected so both lists stay aligned
            time.sleep(20)
        follower_list.append(followers)

This time, if we use the enumerate loop to print each user and their number of followers, the output would be:

    User: AaltoUniversity Number of followers: 35695
    User: helsinkiuni Number of followers: 31966
    User: HAAGAHELIAamk Number of followers: 4927
    User: AaltoENG Number of followers: 144

which is the real number of followers of each of the accounts.
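Since each page holds at most 5000 Ids, the follower counts above translate directly into a number of requests. The sketch below does that arithmetic and also estimates the waiting time, assuming a rate limit of 15 requests per 15-minute window for this endpoint. The rate limit figures and all names are assumptions for illustration, and each account is treated on its own even though in practice the window is shared by all requests.

```python
import math

IDS_PER_PAGE = 5000        # maximum Ids returned by one followers request
REQUESTS_PER_WINDOW = 15   # assumed rate limit for this endpoint
WINDOW_MINUTES = 15        # assumed length of one rate limit window


def followers_download_estimate(follower_count):
    """Return (requests, waiting windows) needed to page through all follower Ids."""
    requests = math.ceil(follower_count / IDS_PER_PAGE)
    # The first window is available immediately; each extra window means one wait.
    waits = max(0, math.ceil(requests / REQUESTS_PER_WINDOW) - 1)
    return requests, waits


for name, count in [("AaltoUniversity", 35695), ("helsinkiuni", 31966), ("HAAGAHELIAamk", 4927)]:
    requests, waits = followers_download_estimate(count)
    print(name, requests, "requests,", waits * WINDOW_MINUTES, "minutes of waiting")
```

For accounts with hundreds of thousands of followers the number of waiting windows grows quickly, which is where wait_on_rate_limit becomes useful.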

Collecting the friends of a certain user

In a similar way to how we collect the followers of a user, we can also collect their “friends”, which is the group of accounts that user follows.

For this, as always, we start by connecting to the API with our credentials, and then run the following code:

    friends = []
    try:
        for page in tweepy.Cursor(api.friends_ids, screen_name="theresa_may").pages():
            friends.extend(page)
    except tweepy.TweepError:
        time.sleep(20)

The variable friends from this block of code is a list with the Ids of all the friends of the user whose screen_name we select (theresa_may in this case).

Seeing the number of followers/friends of a certain user

If we are not interested in who the followers/friends of a certain account are, but only in how many they are, the Twitter API allows us to collect this information without having to collect all of the followers/friends of the desired account.

Collecting every follower can take a while if the user has many, given the download rate limit. Instead, we can use the api.get_user method that we used before for going from user.screen_name to user.id and vice versa.

The following block of code shows how:

    user = api.get_user(screen_name='theresa_may')
    print(user.followers_count)
    print(user.friends_count)

which would output:

    839391
    29

We can also do this using the Twitter user.id, if we know it, as seen before:

    user = api.get_user(user_id=747807250819981312)
    print(user.followers_count)
    print(user.friends_count)

which would again output:

    839391
    29

As we can see from Theresa May’s official account, these are the correct numbers of followers and friends.

Conclusion

We have described the main functionalities of the Twitter REST API, and tackled some of the possible issues we might find when collecting data from it.

This data can then be used for many purposes: from trend or fake news detection using complex Machine Learning algorithms, to Sentiment Analysis for inferring how positively a certain brand is perceived, graph building, information diffusion models and much more.

For further research, or clarification of the information found here, refer to the links left throughout this guide or to:

· Twitter Developers page: https://developer.twitter.com/en/docs

· Tweepy’s GitHub page: https://github.com/tweepy/tweepy

· Tweepy’s official page: https://www.tweepy.org/

· Twitter’s advanced search: https://twitter.com/search-advanced

Stay tuned for more posts on Social Network Analysis!
