BigQuery for Data Science

To be exact, due to quirks in the API, you first have to export the query result to a table and then export that table to csv in a Cloud Storage Bucket.

This way, it is possible to only query the data once and then read from your storage, giving you repeatable results.

This method is a bit clunky and there is some overhead to manage the different csv outputs, buckets and tables.

Using this method, writing to csv in a Storage Bucket and reading from csv each take only about half a minute.
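A rough sketch of the export to GCS and the read from GCS with the google-cloud-bigquery client is shown below; all project, dataset, table and bucket names are hypothetical, the query is illustrative, and reading gs:// paths with pandas assumes gcsfs is installed.

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

# 1. Run the query once and land the result in a destination table.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.taxi_sample"  # hypothetical table
)
client.query(
    """
    SELECT *
    FROM `my-project.my_dataset.taxi_trips`  -- hypothetical source table
    WHERE trip_distance > 0
    """,
    job_config=job_config,
).result()

# 2. Export the destination table to csv in a Cloud Storage bucket
#    (tables over ~1 GB need a wildcard URI such as taxi_sample-*.csv).
client.extract_table(
    "my-project.my_dataset.taxi_sample",
    "gs://my-bucket/taxi_sample.csv",  # hypothetical bucket
).result()

# 3. Read the csv back into pandas; re-running this step only reads from
#    storage, so the analysis stays repeatable without re-querying.
df = pd.read_csv("gs://my-bucket/taxi_sample.csv")
```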

3. BigQuery Storage

As I was writing this, Google released the beta version of BigQuery Storage, allowing fast access to BigQuery data and hence faster downloads into pandas.

This seems to be an ideal solution if you want to import the WHOLE table into pandas or run simple filters.

However, the tradeoff is that SQL-style queries no longer work.

Columns can be selected while rows can be filtered, but only by comparing a column to a constant value.

If you need to sample the data then you have to rely on filtering by a value in a column.

In the example below, we try out the BigQuery Storage API.

Since SQL-like statements don’t work, we filter the pickup_location_id to ‘48’ to get about 2 million rows in return.

Getting the data into pandas took about 100 seconds, faster than loading directly from BigQuery but slower than dumping into csv.

However, with BigQuery Storage there is the possibility of reading the data with multiple streams, increasing the transfer speed.
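A sketch of that read with the v1beta1 client that was current at the time is shown below; this is a beta API whose surface may have changed, and the billing project, selected columns and table are illustrative (they mirror the pickup_location_id filter described above).

```python
from google.cloud import bigquery_storage_v1beta1

client = bigquery_storage_v1beta1.BigQueryStorageClient()

# Table to read from (illustrative NYC taxi table).
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "bigquery-public-data"
table_ref.dataset_id = "new_york_taxi_trips"
table_ref.table_id = "tlc_yellow_trips_2018"

# Only column selection and constant-value row filters are supported.
read_options = bigquery_storage_v1beta1.types.TableReadOptions()
read_options.selected_fields.append("pickup_datetime")
read_options.selected_fields.append("total_amount")
read_options.row_restriction = 'pickup_location_id = "48"'

session = client.create_read_session(
    table_ref,
    "projects/my-project",   # billing project (hypothetical)
    read_options=read_options,
    requested_streams=1,     # raise this to read with multiple streams
)

# Read a single stream into a pandas DataFrame.
stream = session.streams[0]
position = bigquery_storage_v1beta1.types.StreamPosition(stream=stream)
df = client.read_rows(position).to_dataframe(session)
```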

4. BigQuery ML

BigQuery ML is an effort by Google to allow BigQuery users to build machine learning models using SQL queries.

Instead of bringing data out of BigQuery and then building a model, the model is brought to the data instead.
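As a minimal sketch, a linear regression model can be trained and scored entirely with SQL, here issued through the Python client; the dataset, table and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a linear regression model inside BigQuery.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.tip_model`
    OPTIONS (model_type = 'linear_reg', input_label_cols = ['tip_amount']) AS
    SELECT trip_distance, passenger_count, tip_amount
    FROM `my-project.my_dataset.taxi_sample`
""").result()

# Predictions are also retrieved with SQL, via ML.PREDICT.
predictions = client.query("""
    SELECT *
    FROM ML.PREDICT(
        MODEL `my_dataset.tip_model`,
        (SELECT trip_distance, passenger_count
         FROM `my-project.my_dataset.taxi_sample`))
""").to_dataframe()
```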

While the goal is for everything to be done in BigQuery, I find the solution very limited as complex data transformations cannot be performed with SQL and only linear and logistic regression models are available.

Production

In production, a significant chunk of the data will have to be processed, and robust solutions are needed.

While BigQuery Storage could be a solution to run on pandas in production, it would probably be better to avoid pandas altogether due to performance issues and build more robust processing pipelines instead.

1. TensorFlow

BigQuery stores structured data, so why TensorFlow? Actually, the TensorFlow framework can handle structured data as well, with models such as linear regression, logistic regression, boosted trees, etc.

The BigQueryReader class allows TensorFlow to access BigQuery.
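A rough sketch of how this was wired up in the TensorFlow 1.x contrib API is shown below; the exact arguments varied between releases and the class is not in TensorFlow 2.x, and every project, table, column and timestamp here is hypothetical.

```python
import tensorflow as tf  # TensorFlow 1.x; BigQueryReader lived in tf.contrib

# Columns to parse out of each BigQuery row.
features = {
    "trip_distance": tf.FixedLenFeature([1], tf.float32),
    "passenger_count": tf.FixedLenFeature([1], tf.int64),
}

reader = tf.contrib.cloud.BigQueryReader(
    project_id="my-project",
    dataset_id="my_dataset",
    table_id="taxi_sample",
    timestamp_millis=1548000000000,  # table snapshot time (hypothetical)
    num_partitions=4,                # split the table into 4 read partitions
    features=features,
)

# Queue the table partitions, then read serialized Examples from them.
queue = tf.FIFOQueue(capacity=100, dtypes=[tf.string], shapes=())
enqueue_partitions = queue.enqueue_many(reader.partitions())
row_id, example_serialized = reader.read(queue)
example = tf.parse_single_example(example_serialized, features=features)

# In a tf.Session, run enqueue_partitions once, then fetch `example`
# repeatedly to stream rows out of BigQuery.
```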

2. Spark

Like TensorFlow, BigQuery also has connectors to Spark, allowing the use of libraries like H2O.ai's Sparkling Water.

Unfortunately, if you’re using pyspark, the data will have to be dumped to Cloud Storage first before Spark can read it.
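A sketch of that pattern with the Hadoop BigQuery connector that ships on Cloud Dataproc is shown below; the connector stages the table as JSON files in a GCS bucket before Spark reads it, and all project, bucket and table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-to-spark").getOrCreate()
sc = spark.sparkContext

# Connector configuration: where to stage the export and which table to read.
conf = {
    "mapred.bq.project.id": "my-project",                  # hypothetical
    "mapred.bq.gcs.bucket": "my-bucket",                   # staging bucket
    "mapred.bq.temp.gcs.path": "gs://my-bucket/tmp/bq_staging",
    "mapred.bq.input.project.id": "my-project",
    "mapred.bq.input.dataset.id": "my_dataset",
    "mapred.bq.input.table.id": "taxi_sample",
}

# Each record comes back as a (row index, JSON string) pair.
table_rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf,
)

# Keep only the JSON payload and parse it into a Spark DataFrame.
json_rdd = table_rdd.map(lambda record: record[1])
df = spark.read.json(json_rdd)
```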

Conclusion

While BigQuery enables rapid queries with SQL syntax, extracting the data for data science can be tedious.

If limiting the query size does not work for me, I personally prefer to dump to csv and read from there, as it is fast and allows repeatable analysis.

Further improvements to the google-cloud-bigquery library could include an interface to Spark/Dask for lazy reading of data, as well as the ability to write dplyr-like transformations instead of SQL queries to access data in BigQuery.

Links:
[1] Google Cloud BigQuery
[2] Google Driven google-cloud-bigquery python library
[3] Community Driven pandas-gbq python library
[4] Google BigQuery Storage (Beta) reference
[5] BigQuery ML reference
[6] TensorFlow’s BigQueryReader
[7] Google Cloud Dataproc BigQuery connector documentation
[8] Tidyverse’s bigrquery 1.0.0 announcement
