How to Run SQL on PDF Files

How to Run SQL on PDF FilesKshitij WadhwaBlockedUnblockFollowFollowingApr 11PDFs are the de facto standard for distributing and sharing fixed-layout documents today.

A quick survey of my laptop folders reveals account statements, receipts, technical papers, book chapters, and presentation slides — all PDFs.

Lots of valuable information finds its way into all manner of PDF files.

Which is a great reason for Rockset to support SQL queries on PDF files, in our mission to make data more usable to everyone.

Fast SQL on PDFs in RocksetRockset makes it easy for developers and data practitioners to ingest and run fast SQL on semi-structured data in a variety of data formats, such as JSON, CSV, and XLSX, without any upfront data prep.

Now add PDFs to the mix, and users can combine PDF data with data of other formats, from various sources, into their SQL analyses.

Or analyzing multiple PDFs together might be valuable too, if you have a series of electricity bills like I do, as we’ll see in our short example below.

Uploading PDFsFrom an existing collection, click the Upload File button at the top right of the console and specify PDF format to ingest into Rockset.

Querying Data in PDFsI uploaded 9 months of electricity bills.

We can use the DESCRIBE command to view the fields that were extracted from the PDFs.

> describe "elec-bills";+——————————————–+—————+———+———–+| field | occurrences | total | type ||——————————————–+—————+———+———–|| ['Author'] | 9 | 9 | string || ['CreationDate'] | 9 | 9 | string || ['Creator'] | 9 | 9 | string || ['ModDate'] | 9 | 9 | string || ['Producer'] | 9 | 9 | string || ['Subject'] | 9 | 9 | string || ['Title'] | 9 | 9 | string || ['_event_time'] | 9 | 9 | timestamp || ['_id'] | 9 | 9 | string || ['_meta'] | 9 | 9 | object || ['_meta', 'file_upload'] | 9 | 9 | object || ['_meta', 'file_upload', 'file'] | 9 | 9 | string || ['_meta', 'file_upload', 'file_upload_id'] | 9 | 9 | string || ['_meta', 'file_upload', 'upload_time'] | 9 | 9 | string || ['author'] | 9 | 9 | string || ['creation_date'] | 9 | 9 | int || ['creator'] | 9 | 9 | string || ['modification_date'] | 9 | 9 | int || ['producer'] | 9 | 9 | string || ['subject'] | 9 | 9 | string || ['text'] | 9 | 9 | string || ['title'] | 9 | 9 | string |+——————————————–+—————+———+———–+Rockset parses out all the metadata like author, creation_date, etc.

from the document along with the text.

The text field is typically where most of the information in a PDF resides, so let's examine what's in a sample text field.

+————————————————————–+| text ||————————————————————–|| .

|| .

|| Statement Date: 10/11/2018 || Your Account Summary || .

|| Total Amount Due: || $157.

57 || Amount Enclosed: || .

|+————————————————————–+Combining Data from Multiple PDFsWith my 9 months of electricity bills ingested and indexed in Rockset, I can do some simple analysis of my usage over this timespan.

We can run a SQL query to select the month/year and billing amount out of text.

> with details as ( select tokenize(REGEXP_EXTRACT(text, 'Statement Date: .

*'))[3] as month, tokenize(REGEXP_EXTRACT(text, 'Statement Date: .

*'))[5] as year, cast(tokenize(REGEXP_EXTRACT(text, 'Total Amount Due:?..

*.Amount Enclosed'))[4] as float) as amount from "elec-bills") select concat(month, '/', year) as billing_period, amountfrom detailsorder by year asc, month;+———-+——————+| amount | billing_period ||———-+——————|| 47.

55 | 04/2018 || 76.

5 | 05/2018 || 52.

28 | 06/2018 || 50.

58 | 07/2018 || 47.

62 | 08/2018 || 39.

7 | 09/2018 || <null> | 10/2018 || 72.

93 | 11/2018 || 157.

57 | 12/2018 |+———-+——————+And plot the results in Superset.

My October bill was surprisingly zero.

Was the billing amount not extracted correctly.I went back and checked, and it turns out I received a California Climate Credit in October which zeroed out my bill, so ingesting and querying PDFs is working as it should!Originally published at rockset.

com.

.

. More details

Leave a Reply