Uber datasets in BigQuery: Driving times around SF (and your city too)

Interactive Data Studio dashboard:Play with the interactive Data Studio dashboard.

Step 1: Get the data from UberWe can find data for several cities in Uber Movements’s public data site.

Here I’ll download some of the San Francisco travel times datasets:Step 2: Load into BigQueryOnce we have the files on our side, loading the data CSVs is straightforward:bq load –autodetect fh-bigquery:deleting.

uber_sf_censustracts_201803_all_hourly san_francisco-censustracts-2018-3-All-HourlyAggregate.

csvbq load –autodetect fh-bigquery:deleting.

uber_sf_censustracts_201803_weekdays_hourly san_francisco-censustracts-2018-3-OnlyWeekdays-HourlyAggregate.

csvbq load –autodetect fh-bigquery:deleting.

uber_sf_censustracts_201803_weekends_hourly san_francisco-censustracts-2018-3-OnlyWeekends-HourlyAggregate.

csvHowever the geo boundaries files will pose some challenges.

They are standard GeoJSON files, but we’ll have to massage them before loading into BigQuery:Transform the GeoJSON file to a new line delimited JSON file with jq.

Load the new .

json files as CSV into BigQuery.

Parse the JSON rows in BigQuery to generate native GIS geometries.

jq -c .

features[] san_francisco_censustracts.

json > sf_censustracts_201905.

jsonbq load –source_format=CSV –quote='' –field_delimiter='|' fh-bigquery:deleting.

sf_censustracts_201905 sf_censustracts_201905.

json rowFor step 3, we can parse the loaded rows inside BigQuery:CREATE OR REPLACE TABLE `fh-bigquery.

uber_201905.

sf_censustracts`ASSELECT FORMAT('%f,%f', ST_Y(centroid), ST_X(centroid)) lat_lon, *FROM ( SELECT *, ST_CENTROID(geometry) centroid FROM ( SELECT CAST(JSON_EXTRACT_SCALAR(row, '$.

properties.

MOVEMENT_ID') AS INT64) movement_id , JSON_EXTRACT_SCALAR(row, '$.

properties.

DISPLAY_NAME') display_name , ST_GeogFromGeoJson(JSON_EXTRACT(row, '$.

geometry')) geometry FROM `fh-bigquery.

deleting.

sf_censustracts_201905` ))Find some alternatives (JavaScript, ogr2ogr) from Lak Lakshmanan and Michael Entin to load GeoJSON data in this Stack Overflow reply.

Step 3: Massage the data for performance and efficiencyNow that we have both tables inside BigQuery, let’s massage the data to partition the main table, create native BQ GIS geometry columns, and join everything together:Let’s create our main table.

This table will contain both weekdays, weekends, and overall stats — and we’ll add some census tract data to make it easier to visualize and understand.

For efficiency we’ll partition it by quarter and cluster by the type of stat and travel starting place.

I’ll also calculate some additional stats: Average distance between areas and speed given this distance.

Note that having these stats will make the query run way slower than if we skipped them:CREATE OR REPLACE TABLE `fh-bigquery.

uber_201905.

sf_hourly`PARTITION BY quarterCLUSTER BY table_source, source, destinationASSELECT *, distance * 0.

000621371192 / geometric_mean_travel_time * 3600 speed_mphFROM ( SELECT a.

*, SPLIT(_TABLE_SUFFIX, '_')[OFFSET(0)] table_source , b.

display_name source , c.

display_name destination , b.

lat_lon sourceid_lat_lon , CAST(SPLIT(b.

lat_lon, ',')[OFFSET(0)] AS FLOAT64) sourceid_lat , CAST(SPLIT(b.

lat_lon, ',')[OFFSET(1)] AS FLOAT64) sourceid_lon , c.

lat_lon dstid_lat_lon , CAST(SPLIT(c.

lat_lon, ',')[OFFSET(0)] AS FLOAT64) dstid_lat , CAST(SPLIT(c.

lat_lon, ',')[OFFSET(1)] AS FLOAT64) dstid_lon , DATE('2018-07-01') quarter , COUNT(*) OVER(PARTITION BY sourceid, dstid) source_dst_popularity , COUNT(*) OVER(PARTITION BY dstid) dst_popularity , (ST_DISTANCE(b.

geometry, c.

geometry)+ST_MAXDISTANCE(b.

geometry, c.

geometry))/2 distance #slowest FROM `fh-bigquery.

deleting.

uber_sf_censustracts_201803_*` a JOIN `fh-bigquery.

uber_201905.

sf_censustracts` b ON a.

sourceid=b.

movement_id JOIN `fh-bigquery.

uber_201905.

sf_censustracts` c ON a.

dstid=c.

movement_id)Step 4: Query and visualizeI created an interactive dashboard with Data Studio.

Note that it runs super fast thanks to our new BigQuery BI Engine.

But you can also run your own queries!For example, the worst destinations in San Francisco and time of the day when arriving to the SFO airport:SELECT ROUND(geometric_mean_travel_time/60) minutes , hod hour, ROUND(distance*0.

000621371192,2) miles , destination FROM `fh-bigquery.

uber_201905.

sf_hourly`WHERE table_source='weekdays'AND quarter='2018-07-01'AND source LIKE '100 Domestic Terminals Departures Level%' AND destination LIKE '%San Francisco%'AND (distance*0.

000621371192)>10ORDER BY 1 DESCLIMIT 20And the best places and times to travel from SFO to SF:SELECT ROUND(geometric_mean_travel_time/60) minutes , hod hour, ROUND(distance*0.

000621371192,2) miles , destination FROM `fh-bigquery.

uber_201905.

sf_hourly`WHERE table_source='weekdays'AND quarter='2018-07-01'AND source LIKE '100 Domestic Terminals Departures Level%' AND destination LIKE '%San Francisco%'AND (distance*0.

000621371192)>10ORDER BY 1 LIMIT 20The worst variations in average time (by stddev), when going from SFO to Oakland:SELECT destination , STDDEV(geometric_mean_travel_time) stddev , COUNT(*) cFROM `fh-bigquery.

uber_201905.

sf_hourly`WHERE table_source='weekdays'AND quarter='2018-07-01'AND source LIKE '100 Domestic Terminals Departures Level%' AND destination LIKE '%Oakland%'AND (distance*0.

000621371192)>10GROUP BY destinationHAVING c=24ORDER BY stddev DESCLIMIT 20Your turn to playFind the shared dataset in BigQuery.

Load more cities — Uber keeps adding more data to their shared collection!Some of the cities Uber is sharing data for.

Next stepsWant more stories?.Check my Medium, follow me on twitter, and subscribe to reddit.

com/r/bigquery.

And try BigQuery — every month you get a full terabyte of analysis for free.

Uber’s data license: Data is made available under the Creative Commons, Attribution Non Commercial license.

Data AttributionsBackground map: http://extras.

sfgate.

com/img/pages/travel/maps/sfbay_std.

gifWhen will Stack Overflow reply: How to predict with BigQueryWhen you finally post a question on Stack Overflow, a long wait starts.

Will someone answer your question?.Will the…towardsdatascience.

com.. More details

Leave a Reply