A Billion Taxi Rides in PostgreSQL
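
The foreign table below is declared against a foreign server named cstore_server, which needs to exist before the CREATE FOREIGN TABLE statement will run. As a minimal sketch, assuming the cstore_fdw extension has already been built and installed for this PostgreSQL version, the one-off setup looks like this:

-- One-off setup, run as a superuser in the trips database.
CREATE EXTENSION cstore_fdw;

CREATE SERVER cstore_server
    FOREIGN DATA WRAPPER cstore_fdw;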

$ psql trips

CREATE FOREIGN TABLE trips (
    trip_id                 BIGINT,
    vendor_id               VARCHAR(3),
    pickup_datetime         TIMESTAMP NOT NULL,
    dropoff_datetime        TIMESTAMP NOT NULL,
    store_and_fwd_flag      VARCHAR(1),
    rate_code_id            SMALLINT NOT NULL,
    pickup_longitude        DECIMAL(18,14),
    pickup_latitude         DECIMAL(18,14),
    dropoff_longitude       DECIMAL(18,14),
    dropoff_latitude        DECIMAL(18,14),
    passenger_count         SMALLINT NOT NULL DEFAULT '0',
    trip_distance           DECIMAL(6,3) DEFAULT '0.0',
    fare_amount             DECIMAL(6,2) DEFAULT '0.0',
    extra                   DECIMAL(6,2) DEFAULT '0.0',
    mta_tax                 DECIMAL(6,2) DEFAULT '0.0',
    tip_amount              DECIMAL(6,2) DEFAULT '0.0',
    tolls_amount            DECIMAL(6,2) DEFAULT '0.0',
    ehail_fee               DECIMAL(6,2) DEFAULT '0.0',
    improvement_surcharge   DECIMAL(6,2) DEFAULT '0.0',
    total_amount            DECIMAL(6,2) DEFAULT '0.0',
    payment_type            VARCHAR(6),
    trip_type               SMALLINT,
    pickup                  VARCHAR(50),
    dropoff                 VARCHAR(50),
    cab_type                VARCHAR(6) NOT NULL,
    precipitation           SMALLINT DEFAULT '0',
    snow_depth              SMALLINT DEFAULT '0',
    snowfall                SMALLINT DEFAULT '0',
    max_temperature         SMALLINT DEFAULT '0',
    min_temperature         SMALLINT DEFAULT '0',
    average_wind_speed      SMALLINT DEFAULT '0',
    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          VARCHAR(10),
    pickup_borocode         SMALLINT,
    pickup_boroname         VARCHAR(13),
    pickup_ct2010           VARCHAR(6),
    pickup_boroct2010       VARCHAR(7),
    pickup_cdeligibil       VARCHAR(1),
    pickup_ntacode          VARCHAR(4),
    pickup_ntaname          VARCHAR(56),
    pickup_puma             VARCHAR(4),
    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         VARCHAR(10),
    dropoff_borocode        SMALLINT,
    dropoff_boroname        VARCHAR(13),
    dropoff_ct2010          VARCHAR(6),
    dropoff_boroct2010      VARCHAR(7),
    dropoff_cdeligibil      VARCHAR(1),
    dropoff_ntacode         VARCHAR(4),
    dropoff_ntaname         VARCHAR(56),
    dropoff_puma            VARCHAR(4)
) SERVER cstore_server
  OPTIONS (compression 'pglz',
           block_row_count '40000',
           stripe_row_count '600000');

Importing 1.1 Billion Records into ORC Format

I've created a ~/taxi-trips folder with the 56 gzip files of taxi trip data and set their ownership to the postgres user.

$ cd ~/taxi-trips
$ sudo chown postgres trips_x*.csv.gz

I then imported the data as usual into PostgreSQL.

$ screen
$ for filename in *.csv.gz
  do
      echo $filename
      echo "COPY trips
            FROM PROGRAM 'gunzip -c `pwd`/$filename'
            DELIMITER ',' CSV;" | psql trips
  done

When the import finished, the data was around 3.5x smaller than it was when stored using PostgreSQL's standard storage format.

$ sudo su - postgres -c "du -hs ~/9.5/main/* | grep [0-9]G"

81G /var/lib/postgresql/9.5/main/cstore_fdw

Benchmarking cstore_fdw

The following shows that PostgreSQL will run a foreign scan on the cstore_fdw data when we execute a query on the trips table.

EXPLAIN (COSTS OFF)
SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;

                                QUERY PLAN
--------------------------------------------------------------------------
 HashAggregate
   Group Key: cab_type
   ->  Foreign Scan on trips
         CStore File: /var/lib/postgresql/9.5/main/cstore_fdw/16385/16394

The following completed in 2 minutes and 32 seconds.

SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;

The following completed in 2 minutes and 55 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips
GROUP BY passenger_count;

The following completed in 3 minutes and 55 seconds.

SELECT passenger_count,
       EXTRACT(year FROM pickup_datetime) AS year,
       count(*)
FROM trips
GROUP BY passenger_count,
         year;

The following completed in 6 minutes and 8 seconds.

SELECT passenger_count,
       EXTRACT(year FROM pickup_datetime) AS year,
       round(trip_distance) distance,
       count(*) trips
FROM trips
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         trips DESC;
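
The timings above came from running each query interactively; psql's built-in timer is enough to reproduce them. The sketch below is not part of the benchmark run itself: it assumes, based on cstore_fdw's documentation, that foreign tables backed by cstore_fdw support ANALYZE and that the extension provides a cstore_table_size() helper for reporting the on-disk footprint.

$ psql trips

\timing on

-- Collect statistics so the planner has row estimates for the foreign table.
ANALYZE trips;

-- Re-run the first benchmark query; psql prints the elapsed time after the result.
SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;

-- Report the size of the cstore file backing the table
-- (helper assumed from cstore_fdw's SQL API).
SELECT pg_size_pretty(cstore_table_size('trips'));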
