Complex SQL on Excel Spreadsheets

The answer lies somewhere in the Department of Education’s spreadsheet data.

We can use SQL to extract it.

A vast trove of the world’s data is similarly locked away in spreadsheets.

Tools like Excel are widely used and fantastic for viewing the data.

However, as modern data gets bigger and ever-more interconnected, performing complex analysis can get difficult — illegibly long formulas, wrangling with data frames, even cleaning up CSVs in Sublime Text (I‘ve been there…).

There must be a better way.

In this post we will show how to tackle the challenge using a tool that is even older than Excel: SQL.

The QuestionsThe affordability (or unaffordability) of college in America is a hot topic in today’s news.

One of the largest sources of financial aid for students is the Federal Pell Grant Program, so we looked to find data that quantified the impact of the program.

Suppose we want to answer the following questions:Which institution in New York had its students receive the most Pell money in 2016–2017?For each state, what was the percent change of number of students who received Pell awards from 2015–2016 to 2016–2017?What is the relationship between the average grant size at an institution and whether it is all-male or all-female?A quick search reveals that data listing Pell grant disbursements by institution is publicly available on the Department of Education’s website as .

XLSX files (Excel spreadsheets).

This will be the starting point of our investigation.

Loading the DataLet’s move the disbursement data into Rockset, a service that can parse and run SQL queries on semi-structured data, including Excel files for our case.

In the Rockset console, we create a new empty collection and call it pell_disbursements.

Next, we download the Excel file from DoE and upload it directly into the Rockset console.

Rockset will index the file’s contents so that it is ready to be queried.

Finding a Data PointLet’s start by understanding the shape of the data.

We use the DESCRIBE command to list the available fields in our collection:We will run all our queries in the browser in the Rockset console.

We see that Rockset has parsed the spreadsheet’s columns in fields B through H, as well as useful metadata including:rownum – the row within the original spreadsheet_meta.

file_upload.

file – name of the original fileEach document in Rockset corresponds to one row of the spreadsheet, so let’s query the first several rows to discern the meaning of each column:We see that the fifth row has headers for each column, and the raw data starts in the sixth row.

Using a WITH clause, let's construct a subquery called pd that returns just the data we care about:This is exciting — we can use the full breadth of SQL capabilities to comb through this data!.Let’s write a query to find the answer to our first question, the institution with the most awards in New York:We see that the top institution, with about $76M in Pell awards and significantly more than the next institution, is the Borough of Manhattan Community College.

Computing Aggregate StatisticsLet’s bring in data from 2015–2016.

We download the data for that year and upload the .

XLSX file into the same pell_disbursements collection.

  We use the same steps as above, this time filtering by file name, to construct another subquery for 2015-2016 data.

Note that the second file has a slightly different format (row offset, column headers, and data type in OPE ID column), so we have to adjust our query appropriately.

Next we can match up data for each institution across the two years to compute the year-over-year change in number of Pell recipients.

We aggregate by state and order from least to greatest:We can see that in all but three states the number of Pell grant recipients actually decreased (albeit two of them, Washington, D.

C.

and the Marshall Islands, are technically not states).

This is part of a larger trend as noted by CollegeBoard in this figure.

The outlier by far is Indiana, where the recipient count decreased by 27.

7%.

  Let’s see this data on a map.

We can use this readily available US heat map demo from amCharts.

To match the data format amCharts expects, we rename the columns and then export the results as a JSON:Next we paste the JSON right into the demo code and immediately generate an interactive heat map!Run this pen to see the heat map.

We visually observe pockets of states near Utah and the Deep South that retained their Pell student counts, and again see Indiana is indeed the outlier in losing Pell students.

 What happened in Indiana that year?.Why such a dramatic change?.Let’s dig in.

We list changes in the number of students reported receiving Pell grants by institution in just Indiana:And therein lies the rub!.ITT Technical Institute contributed a drop of nearly 40,000 students, and a quick search reveals the backstory.

In August 2016, the Department of Education banned ITT Tech from enrolling new students with federal aid after a series of charges of fraud and misleading students.

ITT Tech declared bankruptcy the following month.

Enhancing with Additional DataLet’s broaden this investigation to include data about other aspects of an institution.

Specifically, we might wonder whether there is any correlation between grant size and whether an institution is categorized as all-male/all-female.

 A terrific dataset that can help us is the College Scorecard, managed by the Department of Education.

We download the data for 2016–2017 as a CSV file and upload into a new Rockset collection:In order to line this up with the Pell grant data, we use institutions’ OPE IDs (Office of Postsecondary Education Identification, assigned by the U.

S.

Department of Education).

We note that the College Scorecard has two OPE ID columns, one for 6-digit and one for 8-digit IDs:Meanwhile, the Pell grant data is slightly mangled.

The original Excel file has IDs stored as numeric (so its leading 0’s are stripped) and appends two additional digits to each ID:We can fix all this in our Rockset query.

We use SQL functions to massage the ID, then use a JOIN to compare institution names and check if they line up.

Looks good!.With a JOIN, we don’t need any prior data preparation to combine datasets — we are free to explore without predefined boundaries on which data we can use.

 The College Scorecard documents a huge variety of attributes regarding each institution, check it out here.

In our investigation, we compute the average grant size for institutions that are all-male/all-female compared to the rest:We’ll stop here, but you can imagine the endless angles to explore:Are Pell disbursements biased towards any particular fields of study?Is it harder to get accepted into college, or get a Pell grant once you’re in?How does faculty pay relate to the institution’s revenue from Pell grants?If you would like to run your own queries with this data, go wild!.Just reach out to hello@rockset.

com to get access.

(Full disclosure: I work at Rockset.

)SummarySQL is commonly used to query data in traditional databases, but here we’ve seen how it can also be useful with data outside that context.

The flexibility and complexity of SQL allows for data to be massaged on the fly, aggregated in arbitrary ways, and combined efficiently with data elsewhere.

As we find new ways to query semi-structured, I think SQL has potential to put even more and more of the world’s data to positive use.

Data used:Distribution of Federal Pell Grant Program Funds by Institution for 2016–2017 (Excel spreadsheet)Distribution of Federal Pell Grant Program Funds by Institution for 2015–2016 (Excel spreadsheet)College Scorecard Data for 2016–2017 (CSV file)Services used:Rockset — to ingest and query .

XLSX filesamCharts — to visualize query results on a US heat mapGitHub Gist and CodePen — to share the code in this post.

. More details

Leave a Reply