A Billion Taxi Rides on Amazon EMR running Presto

$ screen $ hive CREATE EXTERNAL TABLE trips_csv ( trip_id INT, vendor_id VARCHAR(3), pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag VARCHAR(1), rate_code_id SMALLINT, pickup_longitude DECIMAL(18,14), pickup_latitude DECIMAL(18,14), dropoff_longitude DECIMAL(18,14), dropoff_latitude DECIMAL(18,14), passenger_count SMALLINT, trip_distance DECIMAL(6,3), fare_amount DECIMAL(6,2), extra DECIMAL(6,2), mta_tax DECIMAL(6,2), tip_amount DECIMAL(6,2), tolls_amount DECIMAL(6,2), ehail_fee DECIMAL(6,2), improvement_surcharge DECIMAL(6,2), total_amount DECIMAL(6,2), payment_type VARCHAR(3), trip_type SMALLINT, pickup VARCHAR(50), dropoff VARCHAR(50), cab_type VARCHAR(6), precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel VARCHAR(10), pickup_borocode SMALLINT, pickup_boroname VARCHAR(13), pickup_ct2010 VARCHAR(6), pickup_boroct2010 VARCHAR(7), pickup_cdeligibil VARCHAR(1), pickup_ntacode VARCHAR(4), pickup_ntaname VARCHAR(56), pickup_puma VARCHAR(4), dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel VARCHAR(10), dropoff_borocode SMALLINT, dropoff_boroname VARCHAR(13), dropoff_ct2010 VARCHAR(6), dropoff_boroct2010 VARCHAR(7), dropoff_cdeligibil VARCHAR(1), dropoff_ntacode VARCHAR(4), dropoff_ntaname VARCHAR(56), dropoff_puma VARCHAR(4) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://<s3_bucket>/csv/'; CREATE EXTERNAL TABLE trips_orc ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION 's3://<s3_bucket>/orc/'; Loading 1.1 Billion Trips into ORC Format The data is in CSV format but in order to analyse the data quickly itll need to be converted into ORC format..When in ORC format the data is stored in a column-oriented form and each column will be compressed..The 104 GB of compressed CSV data (~500 GB uncompressed) will only be ~49 GB in size when its in ORC format..echo "LOAD DATA INPATH 's3://<s3_bucket>/raw' INTO TABLE trips_csv; INSERT INTO TABLE trips_orc SELECT * FROM trips_csv;" | hive The above command took 3 hours and 45 minutes to complete..This is a one-off exercise..After the above finishes you shouldnt need to convert the CSV data into ORC any more and the ORC files will live on in S3 so you can use them in another EMR cluster later on..Below is a snippet of the listing of the 192 ORC files in the S3 bucket I used..$ aws s3 ls <s3_bucket>/orc/ 2016-03-14 13:54:41 398631347 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000000 2016-03-14 13:54:40 393489828 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000001 2016-03-14 13:54:40 339863956 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000002 2016-03-14 13:54:29 319146976 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000003 …..2016-03-14 14:00:18 261801034 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000188 2016-03-14 14:00:26 266048453 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000189 2016-03-14 14:00:23 265076102 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000190 2016-03-14 14:00:12 115110402 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000191 Benchmarking Queries in Presto The following queries ran on Presto 0.130, the version of Presto that EMR ships with..$ presto-cli –catalog hive –schema default The following completed in 1 minute and 15 seconds..SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type; Query 20160314_140142_00002_gj4np, FINISHED, 4 nodes Splits: 749 total, 749 done (100.00%) 1:15 [1.11B rows, 48.8GB] [14.8M rows/s, 667MB/s] The following completed in 1 minute and 4 seconds..SELECT passenger_count, avg(total_amount) FROM trips_orc GROUP BY passenger_count; Query 20160314_140314_00003_gj4np, FINISHED, 4 nodes Splits: 749 total, 749 done (100.00%) 1:04 [1.11B rows, 48.8GB] [17.3M rows/s, 777MB/s] The following completed in 1 minute and 23 seconds.. More details

Leave a Reply