Lots of JSON

Methods for making dealing with JSON files less painful

Tara Boyle
Feb 21

I recently read a blog post on important tools for data scientists in 2019.

One of the tools was JSON… lots of JSON.

Early in my learning journey, I came across the Prescription-Based Prediction dataset on Kaggle.

It was perfect.

Coming from a pharmacy background, I wanted to explore drug data.

There was only one problem…

The data is in JSON…

JSON, or JavaScript Object Notation, is a “lightweight data-interchange format… It is easy for machines to parse and generate.” And they say it “is easy for humans to read and write.”

But the first time I loaded a JSON file into a dataframe, I would have argued otherwise.

Here we will explore some methods useful in the data cleaning and preprocessing steps of a data science workflow, and hopefully convince you that JSON really is easy for humans to read and write.

Getting Started

We can use %%bash magic to print a sample of our file:

%%bash
head ./input/roam_prescription_based_prediction.jsonl

{
  "cms_prescription_counts": {
    "CEPHALEXIN": 23,
    "AMOXICILLIN": 52,
    "HYDROCODONE-ACETAMINOPHEN": 28
  },
  "provider_variables": {
    "settlement_type": "non-urban",
    "generic_rx_count": 103,
    "specialty": "General Practice",
    "years_practicing": 7,
    "gender": "M",
    "region": "South",
    "brand_name_rx_count": 0
  },
  "npi": "1992715205"
}

From this we can see the JSON data looks like a Python dictionary.

That’s not so scary!

It’s easy enough to read in our .jsonl file using the pandas read_json method:

import pandas as pd

raw_data = pd.read_json("./input/roam_prescription_based_prediction.jsonl",
                        lines=True, orient='columns')
raw_data.head()

The only problem now is that we have column values that are nested… and not entirely usable at this point.
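To make the nesting concrete, a quick check (using the raw_data dataframe loaded above) shows that each cell of the provider_variables column holds an entire dictionary rather than a scalar value:

# Each cell of the nested columns is a whole dictionary, not a scalar
print(raw_data.provider_variables.iloc[0])
# e.g. {'settlement_type': 'non-urban', 'generic_rx_count': 103, ...}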

Let’s explore some methods for unpacking these values.

List Comprehension

A list comprehension is an easy way to unpack the data in our provider_variables column:

provider = pd.DataFrame([md for md in raw_data.provider_variables])

Perfect! We now have a dataframe of the provider variables! But I would say that without some sort of key, this data doesn’t do us too much good.

Let’s add the provider’s NPI numbers to this dataframe and set the npi column as the index:

provider['npi'] = raw_data.npi
provider.set_index('npi', inplace=True)
provider.head()
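With npi as the index, a single provider’s variables can now be pulled out directly by NPI. For example, using the NPI from the sample record above (depending on how read_json inferred the column’s dtype, the key may need to be an integer rather than a string):

# Look up one provider by NPI now that npi is the index
provider.loc["1992715205"]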

JSON Normalize

Pandas’ json_normalize method is another option for flattening our data:

from pandas.io.json import json_normalize

provider = json_normalize(data=raw_data.provider_variables)
provider.head()
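One property of json_normalize worth noting (not needed here, since the provider dictionaries are flat): nested dictionaries are flattened into dotted column names. A small illustration with made-up data:

from pandas.io.json import json_normalize

# Made-up example: the inner "counts" dictionary becomes counts.2014, counts.2015
nested = [{"drug": "CEPHALEXIN", "counts": {"2014": 10, "2015": 13}}]
json_normalize(nested)
#          drug  counts.2014  counts.2015
# 0  CEPHALEXIN           10           13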

Method Efficiency

So we have two options that yield the same results. Which is the better method to use?

We can use the %timeit magic command to find the execution time of each statement:

List Comprehension: 729 ms ± 8.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

JSON Normalize: 4.72 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here we can see the list comprehension method executed faster.
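The benchmark statements themselves aren’t shown above; the timings can be reproduced with %timeit calls along these lines (assuming the raw_data dataframe from earlier in the post):

# Time each flattening approach on the provider_variables column
%timeit pd.DataFrame([md for md in raw_data.provider_variables])
%timeit json_normalize(data=raw_data.provider_variables)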

Conclusion

In this article, we learned how to manipulate JSON data with Python.

We learned how to flatten nested data and convert it to a dataframe.

And we hopefully learned that JSON files really aren’t so bad after all!
