Joining Data Sources

By Ritvik Kharkar · May 3

Photo by Christopher Burns on Unsplash

Most "data science" in the real world involves creating a data set, a visualization, or an application that requires pulling and joining data from very different sources to tell a cohesive story.

Moving past toy data sets, let's take a look at one start-to-finish application of joining disparate data sources into one comprehensive map.

Specifically, we'll be answering the question: Which zip codes in Los Angeles County have the most Starbucks stores per capita?

If, for example, it were your job to find the best place in LA County for a new Starbucks location, you might be very invested in the answer to this question.

Define the End Goal

Our end goal is to create a choropleth map of LA County.

Our choropleth map should meet the following requirements:

- All zip code boundaries should be drawn
- Each zip code should be colored with variable intensity based on the number of Starbucks stores per capita

What Types of Data Do We Need?

Ok, so we definitely need some kind of geographical data to draw the zip code shapes.

We also need information on the location of each Starbucks in LA County.

And of course, we need some information on the population of LA County by zip code.

To summarize, we need:

1. Geographical data of zip code boundaries in Los Angeles County (preferably a GeoJSON file, which is a JSON format for describing complex shapes)
2. The location of each Starbucks store in Los Angeles County (preferably a spreadsheet with the zip code of each store)
3. The population of Los Angeles County by zip code (preferably a spreadsheet)

Where Can We Find the Data?

Awesome! We have identified what kind of data we ideally want.

Now, let’s think about where we would get each of these data sets.

The geographical data seemed easy enough, and some Google searching for "la county zipcode geojson" eventually led me to the appropriate GeoJSON, stored at my GitHub here.

The Starbucks store location data proved a lot trickier.

The data sets I was finding were either out of date or state-specific rather than county-specific.

In order to get this data, I had to do some very careful API querying of the official Starbucks Store Locator.

In fact, this process was interesting enough to warrant a separate Medium post: Quick and Dirty Data Gathering with Python (How many Starbucks in Los Angeles County?) on towardsdatascience.com.

For the population data set, I found some pre-made spreadsheets online with what I wanted, but I honestly was not sure of their validity.

I thought it was much safer to get population data straight from the United States Census Bureau.

Getting this data required a very specific call to the Census Bureau API, and I've got a whole article dedicated to that process: Getting Census Data in 5 Easy Steps (a gentle guide to collecting population data for your projects) on towardsdatascience.com.

You can find all three resulting data sets at my GitHub:

- the LA County GeoJSON is here
- the LA County Starbucks locations data set is here
- the LA County population by zip code data set is here

How Will We Join the Data Sets?

Ok, so we've got the data we want.

Now, let's come up with a strategy for joining it all together to make the map we want.

In order to make the map, we need the number of Starbucks stores per capita for each zip code in LA County.

Currently, we have the two pieces required to create that metric, the population data and the store data, but we still need to join them together.

So, it makes sense to use a two-phase approach.

The first phase will join the population data with the store location data, and the second phase will join the resulting stores-per-capita data set with the geographical data.

Something like this:

Phase 1: Joining Population and Store Location Data

Let's take a look at snapshots of the two spreadsheets to determine how we might join them.

[Snapshots: the population data and the store location data]

It seems we should be able to join the two data sets on their zip code columns, but there is one issue.

The store location data contains one row for each Starbucks store in LA County, but we would really like one row per zip code, with the total count of stores as another column.

Let’s use python to do some quick and dirty data transformations!Once we’ve transformed the store locations data set, it contains the zip code as its index and the number of Starbucks stores in that zip code as a column.

Now, let’s join the two data sets together:Subsequently, joinedDf looks like this:How come some entries are NaN?.Well, if a certain zip code exists in the population data set but not the stores dataset, then its numStores value will be missing and vice versa.

We’ll deal with that issue in just a second.

Now, let's go ahead and create the stores-per-capita metric by dividing the number of stores by the total population of each zip code.

We’ll also remove all zip codes where the resulting stores per capita is null or infinity (if the population of the zip code were zero).

The resulting joinedDf looks like:

Nice! Phase 1 is complete; we have a dataframe with stores per capita for each zip code.

Phase 2: Mapping Stores per Capita Using Geography Data

Now that we have a data set containing stores per capita, we can plot these values on our map of Los Angeles County, whose shape details are stored in our geographical data.

Let’s create our choropleth map!Your resulting laChoropleth.

html map will be interactive and you’ll be able to click, drag, and zoom when you open it up.

Let’s take a look at how it turned out.

We definitely see some different shades of green in the map, but there doesn’t seem to be that much variation.

This usually happens in a choropleth map when there are big outlier values for the variable in question.

For our case, that means that there are a few zip codes with really big values for storesPerCapita and then most zip codes have relatively smaller values.

We will make a trade-off here and remove the top 10% of zip codes by storesPerCapita.

There will be more missing zip codes in the resulting map, but the zip codes that remain will show a better color spectrum.

We will also color the missing zip codes in grey so we can actually tell they are missing.
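The trimming step can be sketched with pandas' quantile method, assuming the joined dataframe from Phase 1 (sample values invented, with one big outlier):

```python
import pandas as pd

# Hypothetical storesPerCapita values with one large outlier (90704).
joinedDf = pd.DataFrame(
    {"storesPerCapita": [0.00002, 0.00003, 0.00004, 0.00005, 0.0009]},
    index=["90001", "90210", "90004", "90011", "90704"],
)

# Keep only zip codes below the 90th percentile, dropping the top 10%
# so the remaining values spread across a fuller color range.
cutoff = joinedDf["storesPerCapita"].quantile(0.9)
trimmedDf = joinedDf[joinedDf["storesPerCapita"] < cutoff]

print(trimmedDf)
```

If you're drawing the map with folium, the dropped and otherwise missing zip codes can be shown in grey by passing nan_fill_color="grey" to Choropleth.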

Here is our final choropleth.

As we can see, there are quite a few missing zip codes, for various reasons: a missing value for population or number of stores, a population of zero, or removal in the trimming step above.

Either way, the zip codes that do remain show a much better color gradient than in the original choropleth.

The exact way you draw the map is totally up to you!

Hopefully this guide demonstrating how to collect, join, and ultimately visualize disparate data sources helps you with your own future data endeavors.

Thanks for reading!
