All 1.1 Billion Taxi Rides on Redshift

CREATE TABLE trips ( trip_id INTEGER NOT NULL DISTKEY ENCODE LZO, vendor_id VARCHAR(3) ENCODE LZO, — Sort keys shouldn't be encoded (compressed) pickup_datetime TIMESTAMP NOT NULL, dropoff_datetime TIMESTAMP NOT NULL ENCODE LZO, store_and_fwd_flag VARCHAR(1) ENCODE RUNLENGTH, rate_code_id SMALLINT ENCODE LZO, pickup_longitude DECIMAL(18,14) ENCODE MOSTLY8, pickup_latitude DECIMAL(18,14) ENCODE MOSTLY8, dropoff_longitude DECIMAL(18,14) ENCODE MOSTLY8, dropoff_latitude DECIMAL(18,14) ENCODE MOSTLY8, passenger_count SMALLINT NOT NULL DEFAULT '0' ENCODE LZO, trip_distance DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, fare_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, extra DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, mta_tax DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, tip_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, tolls_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, ehail_fee DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, improvement_surcharge DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, total_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8, payment_type VARCHAR(3) ENCODE RUNLENGTH, trip_type SMALLINT ENCODE LZO, pickup VARCHAR(50) ENCODE LZO, dropoff VARCHAR(50) ENCODE LZO, cab_type VARCHAR(6) NOT NULL ENCODE LZO, precipitation SMALLINT DEFAULT '0' ENCODE LZO, snow_depth SMALLINT DEFAULT '0' ENCODE LZO, snowfall SMALLINT DEFAULT '0' ENCODE LZO, max_temperature SMALLINT DEFAULT '0' ENCODE LZO, min_temperature SMALLINT DEFAULT '0' ENCODE LZO, average_wind_speed SMALLINT DEFAULT '0' ENCODE LZO, pickup_nyct2010_gid SMALLINT ENCODE LZO, pickup_ctlabel VARCHAR(10) ENCODE LZO, pickup_borocode SMALLINT ENCODE LZO, pickup_boroname VARCHAR(13) ENCODE LZO, pickup_ct2010 VARCHAR(6) ENCODE LZO, pickup_boroct2010 VARCHAR(7) ENCODE LZO, pickup_cdeligibil VARCHAR(1) ENCODE RUNLENGTH, pickup_ntacode VARCHAR(4) ENCODE LZO, pickup_ntaname VARCHAR(56) ENCODE LZO, pickup_puma VARCHAR(4) ENCODE LZO, dropoff_nyct2010_gid SMALLINT ENCODE LZO, dropoff_ctlabel VARCHAR(10) ENCODE LZO, dropoff_borocode SMALLINT ENCODE LZO, dropoff_boroname VARCHAR(13) ENCODE LZO, dropoff_ct2010 VARCHAR(6) ENCODE LZO, dropoff_boroct2010 VARCHAR(7) ENCODE LZO, dropoff_cdeligibil VARCHAR(1) ENCODE RUNLENGTH, dropoff_ntacode VARCHAR(4) ENCODE LZO, dropoff_ntaname VARCHAR(56) ENCODE LZO, dropoff_puma VARCHAR(4) ENCODE LZO, primary key(trip_id) ) sortkey(pickup_datetime); The following will start the importing process..COPY trips FROM 's3://<s3_bucket>/csv/trips.manifest' CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…' DELIMITER ',' EMPTYASNULL ESCAPE GZIP MANIFEST MAXERROR 100000 REMOVEQUOTES TRIMBLANKS TRUNCATECOLUMNS; The above took 4 hours and 33 minutes to import..I ran a VACUUM on the data after the import..The following completed in 1 minute and 42 seconds..VACUUM; Erroneous Values The original PostgreSQL schema this data came from defined DECIMAL columns as NUMERIC meaning theyd accept any sort of numeric value..This meant the schema wasnt enforcing any sort of value range control..When I first started this import many of the fields such as trip_distance were defined as DECIMAL(6,3) but errors from values such as 1236007.30 would cause Redshift to abort importing after 45 – 50 minutes..The dataset combines data from different types of taxis operating in New York City, some of which report less amounts of data than others..I had to remove some NOT NULL constraints in order for these records to import as well..Again, I didnt know about these problems till 45 – 50 minutes into the import..I first started importing the data at 10:30am UTC+3 and it wasnt until 16:05pm that I got a schema together that would allow for these data quality issues..To top it off I was paying for the Redshift cluster during each one of the imports that failed.. More details

Leave a Reply