Acquiring Free Historical Geo-located data from Twitter

In this post, I will describe how to retrieve historical geo-located data from the Twitter API for free.

This is not very complicated if you know how to do it, but I spent a lot of time trying to get this to work, and I know other people who found the process very difficult.

So, I hope you will find this useful.

Python is the language I used, so that is what will be covered here.

The back story is that I was working on a client project which was designed to identify areas hit by natural disasters using social media.

To train a machine learning model for this, I wanted historical data from an actual natural disaster.

Twitter was chosen because it would be easier to get the data compared to other major social media sites.

This turned out to be more complicated than originally expected.

The first complication is that Twitter has three tiers of Application Programming Interface (API) access:

- Standard — free, but no historical data
- Premium — free and paid access to historical data
- Enterprise — paid only

Premium is the one we want, but it in turn has two groupings:

- Sandbox — free
- Premium — paid

In this case we want Premium Sandbox, which provides free access to historical data.

Another complication is that, although there are many Python libraries available to access the Twitter API (handling the back-and-forth authentication under the covers), not all of them work with the Premium API, which is relatively new.

Furthermore, search operators and syntax that work with one Twitter API won’t necessarily work with another.

So, finding the right form of the search query, and troubleshooting problems along the way, was very time-consuming.

The Sandbox version also limits the number of Tweets returned per web request to 100 (vs 500 for the paid Premium version).
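The premium search endpoints also accept a maxResults parameter (10 to 100 in the Sandbox) if you want even smaller responses while debugging a query. A minimal sketch, assuming an api object authenticated as shown later in this post, and that your Dev environment is labeled 'Development':

# Ask for only 10 tweets per request while testing a query.
# 'Development' is whatever label you gave your Dev environment.
r = api.request('tweets/search/fullarchive/:Development',
                {'query': 'point_radius:[-85.6602 30.1588 25mi]',
                 'maxResults': 10})
print(r.status_code)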

Before calling the API, it is necessary to:

1. Create a Twitter Developer account.
2. Create an application by submitting a request to Twitter.
3. Set up a Development environment.

This should result in your getting the appropriate credentials from Twitter.

I won't go into all the details here because they are covered elsewhere (particularly by Twitter itself) and are probably subject to change.
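One detail worth showing is credential handling. The code later in this post needs the four credentials from your developer app (a consumer key and secret, plus an access token and secret), and keeping them in environment variables avoids pasting them into a notebook. A minimal sketch, with environment variable names of my own choosing:

import os

# Hypothetical variable names; set these in your shell or notebook environment.
CONSUMER_API_KEY = os.environ['TWITTER_CONSUMER_API_KEY']
CONSUMER_API_SECRET_KEY = os.environ['TWITTER_CONSUMER_API_SECRET_KEY']
ACCESS_TOKEN = os.environ['TWITTER_ACCESS_TOKEN']
ACCESS_TOKEN_SECRET = os.environ['TWITTER_ACCESS_TOKEN_SECRET']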

In the end I decided to use the "TwitterAPI" Python library.
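It is available on PyPI, so from a notebook cell it can be installed with:

!pip install TwitterAPI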

The natural disaster that I chose was Hurricane Michael, which made landfall near Panama City, Florida in October 2018.

Note that Twitter only allows these point-radius searches out to a maximum radius of 25 miles.

Only tweets with (public) geo-location data will be returned, as otherwise the API could not determine that they met the search criteria.

The search query below returns all tweets from October 1 to October 18, 2018 that are within 25 miles of Panama City.
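The point_radius operator takes a longitude, a latitude, and a radius, in that order; the longitude-first ordering is easy to get backwards:

# point_radius:[<longitude> <latitude> <radius>]  (radius capped at 25mi)
query = 'point_radius:[-85.6602 30.1588 25mi]'  # 25 miles around Panama City, FL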

The code below comes from a Python notebook, where the setup code and the main while loop were in separate cells; it is consolidated here into a single runnable block.

from TwitterAPI import TwitterAPI
import time

import pandas as pd

# CONSUMER_API_KEY, CONSUMER_API_SECRET_KEY, ACCESS_TOKEN and ACCESS_TOKEN_SECRET
# are the credentials from your Twitter developer app (see above).
api = TwitterAPI(consumer_key=CONSUMER_API_KEY,
                 consumer_secret=CONSUMER_API_SECRET_KEY,
                 access_token_key=ACCESS_TOKEN,
                 access_token_secret=ACCESS_TOKEN_SECRET)

PRODUCT = 'fullarchive'
LABEL = 'Development'  # Whatever label you set for your Dev environment; case sensitive.

# Tweets within 25 miles of Panama City, Florida, between October 1 and
# October 18, 2018. Dates are in the form YYYYMMDDhhmm, in UTC.
params = {'query': 'point_radius:[-85.6602 30.1588 25mi]',
          'fromDate': '201810010000',
          'toDate': '201810180000'}

list_tweets = []
next_token = None
web_request_count = 0

while True:
    if next_token is not None:
        params['next'] = next_token  # Present on every request after the first.
    r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), params)
    print('r.status_code: ', r.status_code)
    print('web_request_count: ', web_request_count)
    web_request_count += 1
    response = r.json()
    for tweet in response['results']:
        coordinates = tweet['coordinates']['coordinates']
        tweet_date = tweet['created_at']
        tweet_text = tweet['text']
        if 'extended_tweet' in tweet:  # Tweets over 140 characters keep their full text here.
            tweet_text = tweet['extended_tweet']['full_text']
        list_tweets.append({'long_lat': coordinates,
                            'date_utc': tweet_date,
                            'full_text': tweet_text})
    # Save after every request so nothing is lost if a later request fails.
    df = pd.DataFrame(list_tweets)
    df.to_json('tweets_df_panama_city_25mi_oct.json', orient='records')
    next_token = response.get('next')
    if next_token is None:  # No 'next' key means this was the last page of results.
        break
    time.sleep(2.1)  # Only 30 requests per minute are allowed.

Hopefully you will find this helpful.

Note that there is a limit in the Premium Sandbox API on the number of web requests per month, so don’t exceed that or you will be locked out for the rest of the month.
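One simple safeguard is to make the loop stop itself before it reaches that cap. A minimal sketch, placed at the top of the while loop above; MAX_MONTHLY_REQUESTS is a name of my own, and the value of the cap for your account should be checked on the developer dashboard:

# Hypothetical guard against the monthly request cap.
MAX_MONTHLY_REQUESTS = 50  # Assumption: check your actual Sandbox cap on the dashboard.

if web_request_count >= MAX_MONTHLY_REQUESTS:
    raise RuntimeError('Monthly request budget reached; stopping to avoid a lockout.')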

If you have any questions, comments or suggestions, please let me know.

Thanks,

Patrick
