Athena vs Q: Battle of the Data Parsers

Either tool requires some interface for our data team to upload new files.

They don’t care where the files end up; they just need to ensure they are available for use.

The difference between having a web form dump those files into S3 or a disk elsewhere is marginal.

q is a command line tool, so it would not be terribly difficult to issue a system call and get the results back.

The results returned though are just the lines directly from the files.

We would have to do some additional parsing after the fact to break things down into named bits.

Hrm… That means carrying the logic of the row result structure around and needing to alter it if the data format ever changes.

Combine that with the concern on having to find and concatenate file paths and the length of said file paths on the command line execution… I think q may be on the ropes.

Amazon provides a SDK for most of the frequently used programming languages, and the ones I would need have support for using Athena.

Query results are just an API call away via the SDK, and they come back in an array of data.

The format of that array is interesting.

AwsResult Object( [data:AwsResult:private] => Array ( [UpdateCount] => 0 [ResultSet] => Array ( [Rows] => Array ( [0] => Array ( [Data] => Array ( [0] => Array ( [VarCharValue] => first_name ) [1] => Array ( [VarCharValue] => last_name ) [2] => Array ( [VarCharValue] => location ) ) ) [1] => Array ( [Data] => Array ( [0] => Array ( [VarCharValue] => Thor ) [1] => Array ( [VarCharValue] => Odinson ) [3] => Array ( [VarCharValue] => Asgard ) ) ).

The first row in the data set is essentially the table definition with subsequent rows being the actual data.

While we don’t get back the data in a typical SQL-esque fashion — each entry in the array is an associative array with the key being column name, we could easily build a parser / formatter to take that first row of data and use it to return a similar structure.

That is one bit of coding that can handle any data changes down the line.

Winner: Athena lands a solid blow to q’s gut by not needing to worry about having to update code any time the data format changes.

Current Score: Athena 4 — qHere is your winner…Athena wins with a decisive knock out blow in the final round.

While you do have some operating expenses in its use of S3 and scanning the data, Athena provides everything we need to process CSV data and serve it up to our customers.

Our developers can query the data in SQL like they would with any other data source through the SDK API calls.

Given it can parse other file formats, this is definitely a tool I can see reaching for more frequently.

Originally published at Traackr Engineering.

.. More details

Leave a Reply