Extracting Massive Datasets in Python

Abusing REST APIs for all they’re worth.

Todd Birchard · Jul 4, 2018

Taxation without representation.

Colonialism.

Not letting people eat cake.

Human beings rightfully meet atrocities with action in an effort to change the world for the better.

Cruelty by mankind justifies revolution, and it is this writer’s opinion that API limitations are one such cruelty.

The data we need and crave is stashed in readily available APIs all around us.

It’s as though we have the keys to the world, but that power often comes with a few caveats:

Your “key” only lasts a couple of hours, and if you want another one, you’ll have to use some other keys to get another key.

You can have the ten thousand records you’re looking for, but you can only pull 50 at a time.

You won’t know the exact structure of the data you’re getting, but it’ll probably be a JSON hierarchy designed by an 8-year-old.

All men may be created equal, but APIs are not.

In the spirit of this 4th of July, let us declare independence from repetitive tasks: One Script, under Python, for Liberty and Justice for all.

Project Setup

We’ll split our project up by separation of concerns into just a few files:

myProject
├── main.py
├── config.py
└── token.py

main.py will unsurprisingly hold the core logic of our script.

config.py contains variables such as client secrets and endpoints which we can easily swap out when applying this script to different APIs.

For now, we’ll just keep the client_id and client_secret variables in there.
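For reference, config.py can be as bare-bones as the sketch below; the values are placeholders for whatever credentials your API provider actually issues:

# config.py
# Placeholder credentials: swap these for the values your API provider gives you.
client_id = 'YOUR_CLIENT_ID'
client_secret = 'YOUR_CLIENT_SECRET'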

token.py serves the purpose of token generation.

Let’s start there.

That’s the Token

Since we’re assuming worst-case scenarios, let’s focus on atrocity number one: APIs which require expiring tokens.

There are some tyrants in this world who believe that in order to use their API, it is necessary to first use a client ID and client secret to generate a token, which quickly becomes useless hours later.

In other words, you need to use an API every time you want to use the actual API.

Fuck that.

import requests

from config import client_id, client_secret

token_url = 'https://api.fakeapi.com/auth/oauth2/v2/token'


def generateToken():
    # Trade our client credentials for a short-lived bearer token.
    r = requests.post(token_url,
                      auth=(client_id, client_secret),
                      json={"grant_type": "client_credentials"})
    bearer_token = r.json()['access_token']
    print('new token = ', bearer_token)
    return bearer_token


token = generateToken()

We import client_id and client_secret from our config file right off the bat: most services will grant these simply for signing up for their API.

Many APIs have an endpoint which specifically serves the purpose of accepting these variables and spitting out a generated token.

token_url is the variable we use to store this endpoint.

Our token variable holds the result of invoking our generateToken() function.

With this out of the way, we can now call this function every time we use the API, so we never have to worry about expiring tokens.
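If your pulls run long enough for a token to expire mid-run, one option is to rebuild the header with a fresh token on every request. The helper below is hypothetical (it isn't part of the script above), but it shows the idea:

def authHeaders():
    # Hypothetical convenience wrapper: generate a fresh token for each
    # request so an expired one never sneaks into our headers.
    return {"Authorization": "Bearer " + generateToken()}


r = requests.get('https://api.fakeapi.com/api/1/users', headers=authHeaders())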

Pandas to the Rescue

We’ve established that we’re looking to pull a large set of data, probably somewhere in the range of thousands of records.

While JSON is all fine and dandy, it probably isn’t very useful for human beings to consume a JSON file with thousands of records.

Again, we have no idea what the nature of the data coming through will look like.

I don’t really care to manually map values to fields, and I’m guessing you don’t either.

Pandas can help us out here: by passing the first page of records to Pandas, we can turn the keys of the response into the columns of a DataFrame.

It’s almost like having a database-type schema created for you simply by looking at the data coming through:

import requests
import pandas as pd
import numpy as np
import json

from token import token


def setKeys():
    # Mirror the keys of the first record as the columns of an empty DataFrame.
    headers = {"Authorization": "Bearer " + token}
    r = requests.get(base_url + 'users', headers=headers)
    dataframe = pd.DataFrame(columns=r.json()['data'][0].keys())
    return dataframe


records_df = setKeys()

We can now store all incoming data in records_df moving forward, allowing us to build a table of results.

No Nation for Pagination

And here we are at one of the most obnoxious parts of programming: paginated results.

We want thousands of records, but we’re only allowed 50 at a time.

Joy.

We’ve already set records_df earlier as a global variable, so we're going to append every page of results we get to that DataFrame, starting at page #1.

The function getRecords is going to pull that first page for us.

base_url = 'https://api.fakeapi.com/api/1/'


def getRecords():
    # Pull the first page of records, then kick off pagination.
    global records_df
    headers = {"Authorization": "Bearer " + token}
    r = requests.get(base_url + 'users', headers=headers)
    nextpage = r.json()['pagination']['next_link']
    # Append page #1 before chasing the next page.
    for user in r.json()['data']:
        records_df.loc[len(records_df)] = pd.Series(user, index=user.keys())
    if nextpage:
        getNextPage(nextpage)


getRecords()

Luckily, if there are additional pages of results for a request, most APIs will provide a URL to said page, usually stored in the response as a value.

In our case, you can see we find this value after making the request: nextpage = r.json()['pagination']['next_link'].

If this value exists, we make a call to get the next page of results.
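For reference, the lookups above assume a response body shaped roughly like this; the field names (pagination, next_link, data) and the sample records belong to our fictional fakeapi, and real APIs spell their pagination fields differently:

# A hypothetical page of results from our fictional API.
sample_page = {
    "pagination": {
        "next_link": "https://api.fakeapi.com/api/1/users?page=2"
    },
    "data": [
        {"id": 1, "name": "Sam Adams", "email": "sam@example.com"},
        {"id": 2, "name": "Betsy Ross", "email": "betsy@example.com"}
    ]
}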

page = 1


def getNextPage(nextpage):
    # Follow the next_link chain, appending each page of records to our DataFrame.
    global page, records_df
    page = page + 1
    print('PAGE ', page)
    headers = {"Authorization": "Bearer " + token}
    r = requests.get(nextpage, headers=headers)
    nextpage = r.json()['pagination']['next_link']
    records = r.json()['data']
    for user in records:
        s = pd.Series(user, index=user.keys())
        records_df.loc[len(records_df)] = s
    # Save progress after every page, so a failed run still leaves us with data.
    records_df.to_csv('records.csv')
    if nextpage:
        getNextPage(nextpage)

Our function getNextPage hits that next page of results and appends the records to the pandas DataFrame we created earlier.

If another page exists after that, the function runs again, and our page increments by 1.

As long as more pages exist, this function will fire again and again until all innocent records are driven out of their comfortable native resting place and forced into our contained dataset.

There's not much more American than that.

There’s More We Can Do

This script is fine, but it could be made even more modular to truly be one-size-fits-all.

For instance, some APIs don’t tell you the number of pages you should expect, but rather the number of records.

In those cases, we’d have to divide the total number of records by records per page to know how many pages to expect.
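The arithmetic is just a ceiling division. A quick sketch, assuming the response exposes a total count under a hypothetical total_records field and hands out 50 records per page:

import math

total_records = r.json()['total_records']  # hypothetical field name; varies by API
records_per_page = 50
total_pages = math.ceil(total_records / records_per_page)  # e.g. 10,000 records / 50 = 200 pages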

As much as I want to go into detail about writing loops on the 4th of July, I don’t.

At all.

There are plenty more examples, but this should be enough to get us thinking about how we can replace tedious work with machines.

That sounds like a flavor that pairs perfectly with Bud Light and hotdogs if you ask me.

Originally published at hackersandslackers.com on July 4, 2018.
