1.1 Billion Taxi Rides with SQLite, Parquet & HDFS

CREATE VIEW query_3_view AS
    SELECT passenger_count,
           STRFTIME('%Y', DATETIME(pickup_datetime / 1000, 'unixepoch')) AS pickup_year,
           COUNT(*) AS num_records
    FROM trips_0
    GROUP BY passenger_count, pickup_year
    UNION ALL
    SELECT passenger_count,
           STRFTIME('%Y', DATETIME(pickup_datetime / 1000, 'unixepoch')) AS pickup_year,
           COUNT(*) AS num_records
    FROM trips_1
    GROUP BY passenger_count, pickup_year
    UNION ALL
    …
    SELECT passenger_count,
           STRFTIME('%Y', DATETIME(pickup_datetime / 1000, 'unixepoch')) AS pickup_year,
           COUNT(*) AS num_records
    FROM trips_54
    GROUP BY passenger_count, pickup_year
    UNION ALL
    SELECT passenger_count,
           STRFTIME('%Y', DATETIME(pickup_datetime / 1000, 'unixepoch')) AS pickup_year,
           COUNT(*) AS num_records
    FROM trips_55
    GROUP BY passenger_count, pickup_year;

The following completed in 30 minutes and 11 seconds.

SELECT passenger_count,
       pickup_year,
       SUM(num_records)
FROM query_3_view
GROUP BY passenger_count, pickup_year;

The following is the view for query #4. I've truncated the view for readability purposes.

CREATE VIEW query_4_view AS
    SELECT passenger_count,
           STRFTIME('%Y', DATETIME(pickup_datetime / 1000, 'unixepoch')) AS pickup_year,
           ROUND(trip_distance) AS distance,
           COUNT(*) AS num_records
    FROM trips_0
    GROUP BY passenger_count, pickup_year, distance
    UNION ALL
    SELECT passenger_count,
           STRFTIME('%Y', DATETIME(pickup_datetime / 1000, 'unixepoch')) AS pickup_year,
           ROUND(trip_distance) AS distance,
           COUNT(*) AS num_records
    FROM trips_1
    GROUP BY passenger_count, pickup_year, distance
    UNION ALL
    …
    SELECT passenger_count,
           STRFTIME('%Y', DATETIME(pickup_datetime / 1000, 'unixepoch')) AS pickup_year,
           ROUND(trip_distance) AS distance,
           COUNT(*) AS num_records
    FROM trips_54
    GROUP BY passenger_count, pickup_year, distance
    UNION ALL
    SELECT passenger_count,
           STRFTIME('%Y', DATETIME(pickup_datetime / 1000, 'unixepoch')) AS pickup_year,
           ROUND(trip_distance) AS distance,
           COUNT(*) AS num_records
    FROM trips_55
    GROUP BY passenger_count, pickup_year, distance;

The following completed in
54 minutes and 46 seconds.

SELECT passenger_count,
       pickup_year,
       distance,
       SUM(num_records) AS the_count
FROM query_4_view
GROUP BY passenger_count, pickup_year, distance
ORDER BY pickup_year, the_count DESC;

Comparing to SQLite's Internal Format

I'll install a multi-core gzip drop-in replacement and create a regular table in SQLite.

$ sudo apt install pigz
$ sqlite3 taxi.db

CREATE TABLE trips (
    trip_id               INTEGER,
    vendor_id             TEXT,
    pickup_datetime       TEXT,
    dropoff_datetime      TEXT,
    store_and_fwd_flag    TEXT,
    rate_code_id          INTEGER,
    pickup_longitude      REAL,
    pickup_latitude       REAL,
    dropoff_longitude     REAL,
    dropoff_latitude      REAL,
    passenger_count       INTEGER,
    trip_distance         REAL,
    fare_amount           REAL,
    extra                 REAL,
    mta_tax               REAL,
    tip_amount            REAL,
    tolls_amount          REAL,
    ehail_fee             REAL,
    improvement_surcharge REAL,
    total_amount          REAL,
    payment_type          TEXT,
    trip_type             INTEGER,
    pickup                TEXT,
    dropoff               TEXT,
    cab_type              TEXT,
    precipitation         INTEGER,
    snow_depth            INTEGER,
    snowfall              INTEGER,
    max_temperature       INTEGER,
    min_temperature       INTEGER,
    average_wind_speed    INTEGER,
    pickup_nyct2010_gid   INTEGER,
    pickup_ctlabel        TEXT,
    pickup_borocode       INTEGER,
    pickup_boroname       TEXT,
    pickup_ct2010         TEXT,
    pickup_boroct2010     TEXT,
    pickup_cdeligibil     TEXT,
    pickup_ntacode        TEXT,
    pickup_ntaname        TEXT,
    pickup_puma           TEXT,
    dropoff_nyct2010_gid  INTEGER,
    dropoff_ctlabel       TEXT,
    dropoff_borocode      INTEGER,
    dropoff_boroname      TEXT,
    dropoff_ct2010        TEXT,
    dropoff_boroct2010    TEXT,
    dropoff_cdeligibil    TEXT,
    dropoff_ntacode       TEXT,
    dropoff_ntaname       TEXT,
    dropoff_puma          TEXT
);

I'll import the same dataset, albeit from 56 gzip-compressed CSV files. I wasn't able to find a reliable way to pipe the decompressed data in via stdin, so I'm using a temporary file to read decompressed CSV data into SQLite.

$ vi import.sql

pragma journal_mode=memory;
.mode csv
.separator ','
.import taxi.temp trips

$ for FILENAME in trips_x*.csv.gz; do
      echo $FILENAME
      pigz -d -c $FILENAME > taxi.temp
      sqlite3 taxi.db < import.sql
      rm taxi.temp
  done

The above completed in 5.5 hours and produced a SQLite file that was
529 GB decompressed. If I used a compressed file system for that SQLite file, it would sit at 137 GB in size.

The following completed in 8 hours, 39 minutes and 20 seconds.
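As a sketch of an alternative to the temporary-file step above: Python's built-in gzip module can stream each compressed CSV straight into SQLite via the sqlite3 module's executemany, with no taxi.temp intermediate file. This is my own workaround, not the article's method, and the three-column schema below is a cut-down stand-in for the real 51-column trips table.

```python
import csv
import gzip
import sqlite3


def import_gzip_csv(db_path, filenames):
    """Stream gzip-compressed CSV files directly into a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=memory")
    # Illustrative three-column schema; the real table has 51 columns.
    conn.execute("""CREATE TABLE IF NOT EXISTS trips (
                        trip_id         INTEGER,
                        passenger_count INTEGER,
                        trip_distance   REAL)""")
    for filename in filenames:
        # 'rt' decompresses to text on the fly, so no temp file is written.
        with gzip.open(filename, "rt", newline="") as f:
            conn.executemany("INSERT INTO trips VALUES (?, ?, ?)",
                             csv.reader(f))
        conn.commit()
    conn.close()
```

Whether this beats pigz plus `.import` is an open question: `.import` runs entirely in C, while this trades the temporary file's disk I/O for single-threaded decompression in Python.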

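The per-shard UNION ALL views behind queries #3 and #4 above follow one mechanical pattern per table, so rather than writing 56 near-identical SELECTs by hand they can be generated. Below is a minimal sketch using Python's built-in sqlite3 module against three tiny stand-in shards; the two-column schema and toy rows are my own illustration, not the article's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three miniature shards standing in for trips_0 .. trips_55.
# pickup_datetime is epoch milliseconds, as in the Parquet-backed tables.
for i in range(3):
    conn.execute(f"CREATE TABLE trips_{i} "
                 "(passenger_count INTEGER, pickup_datetime INTEGER)")
    conn.execute(f"INSERT INTO trips_{i} VALUES "
                 "(1, 1451606400000), (2, 1451606400000)")

# Generate the UNION ALL body of query #3's view instead of hand-writing it.
shard_sql = """SELECT passenger_count,
       STRFTIME('%Y', DATETIME(pickup_datetime / 1000, 'unixepoch')) AS pickup_year,
       COUNT(*) AS num_records
FROM trips_{i}
GROUP BY passenger_count, pickup_year"""

view_body = "\nUNION ALL\n".join(shard_sql.format(i=i) for i in range(3))
conn.execute(f"CREATE VIEW query_3_view AS {view_body}")

# The outer query merges the per-shard partial counts, as in the article.
rows = conn.execute("""SELECT passenger_count, pickup_year, SUM(num_records)
                       FROM query_3_view
                       GROUP BY passenger_count, pickup_year
                       ORDER BY passenger_count""").fetchall()
print(rows)  # [(1, '2016', 3), (2, '2016', 3)]
```

The same generator, with ROUND(trip_distance) added to the SELECT and GROUP BY, would produce query #4's view.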