1.1 Billion Taxi Rides on a Large Redshift Cluster

CREATE TABLE trips ( trip_id INTEGER NOT NULL DISTKEY ENCODE LZO, vendor_id VARCHAR(3) ENCODE LZO, — Sort keys shouldn't be encoded (compressed) pickup_datetime TIMESTAMP NOT NULL, dropoff_datetime TIMESTAMP NOT NULL ENCODE LZO, store_and_fwd_flag VARCHAR(1) ENCODE RUNLENGTH, rate_code_id SMALLINT ENCODE LZO, pickup_longitude DECIMAL(18,14) ENCODE MOSTLY8, pickup_latitude DECIMAL(18,14) ENCODE MOSTLY8, dropoff_longitude DECIMAL(18,14) ENCODE MOSTLY8, dropoff_latitude DECIMAL(18,14) ENCODE MOSTLY8, passenger_count SMALLINT NOT NULL DEFAULT '0' ENCODE LZO, trip_distance DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, fare_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, extra DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, mta_tax DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, tip_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, tolls_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, ehail_fee DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, improvement_surcharge DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, total_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, payment_type VARCHAR(3) ENCODE RUNLENGTH, trip_type SMALLINT ENCODE LZO, pickup VARCHAR(50) ENCODE LZO, dropoff VARCHAR(50) ENCODE LZO, cab_type VARCHAR(6) NOT NULL ENCODE LZO, precipitation SMALLINT DEFAULT '0' ENCODE LZO, snow_depth SMALLINT DEFAULT '0' ENCODE LZO, snowfall SMALLINT DEFAULT '0' ENCODE LZO, max_temperature SMALLINT DEFAULT '0' ENCODE LZO, min_temperature SMALLINT DEFAULT '0' ENCODE LZO, average_wind_speed SMALLINT DEFAULT '0' ENCODE LZO, pickup_nyct2010_gid SMALLINT ENCODE LZO, pickup_ctlabel VARCHAR(10) ENCODE LZO, pickup_borocode SMALLINT ENCODE LZO, pickup_boroname VARCHAR(13) ENCODE LZO, pickup_ct2010 VARCHAR(6) ENCODE LZO, pickup_boroct2010 VARCHAR(7) ENCODE LZO, pickup_cdeligibil VARCHAR(1) ENCODE RUNLENGTH, pickup_ntacode VARCHAR(4) ENCODE LZO, pickup_ntaname VARCHAR(56) ENCODE LZO, pickup_puma VARCHAR(4) ENCODE LZO, dropoff_nyct2010_gid SMALLINT ENCODE LZO, dropoff_ctlabel VARCHAR(10) ENCODE LZO, dropoff_borocode SMALLINT ENCODE LZO, dropoff_boroname VARCHAR(13) ENCODE LZO, dropoff_ct2010 VARCHAR(6) ENCODE LZO, dropoff_boroct2010 VARCHAR(7) ENCODE LZO, dropoff_cdeligibil VARCHAR(1) ENCODE RUNLENGTH, dropoff_ntacode VARCHAR(4) ENCODE LZO, dropoff_ntaname VARCHAR(56) ENCODE LZO, dropoff_puma VARCHAR(4) ENCODE LZO, primary key(trip_id) ) sortkey(pickup_datetime); The following will allow my process to use 100% of the clusters memory..set wlm_query_slot_count to 5; The following will tell Redshift to load all the data from the files on S3 into the trips table..COPY trips FROM 's3://<s3_bucket>/csv/trips.manifest' CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…' DELIMITER ',' EMPTYASNULL ESCAPE GZIP MANIFEST MAXERROR 100000 REMOVEQUOTES TRIMBLANKS TRUNCATECOLUMNS; The above took 11 minutes and 13 seconds to complete..Benchmarking Redshift I ran each of the queries below multiple times in a row..The times reported were the fastest achieved..iming on set wlm_query_slot_count to 5; The following completed in 1.56 seconds..SELECT cab_type, count(*) FROM trips GROUP BY 1; The following completed in 1.25 seconds..SELECT passenger_count, avg(total_amount) FROM trips GROUP BY 1; The following completed in 2.25 seconds..SELECT passenger_count, extract(year from pickup_datetime), count(*) FROM trips GROUP BY 1, 2; The following completed in 2.97 seconds..SELECT passenger_count, extract(year from pickup_datetime), round(trip_distance), count(*) FROM trips GROUP BY 1, 2, 3 ORDER BY 2, 4 desc; Costs The cost of storing 104 GB of data on S3 is $3.12 if I were to leave the CSV data there for a month..In this case I had the data on there for a few hours so the costs were substantially less..The 6-node ds2.8xlarge Redshift cluster in eu-west-1 runs at $45.60 / hour as of this writing.. More details

Leave a Reply