50-node Presto Cluster on Amazon EMR

$ hive CREATE EXTERNAL TABLE trips_orc ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION 's3://<s3_bucket>/orc/'; Benchmarking Queries in Presto The following queries ran on Presto 0.136, the version of Presto that EMR 4.4.0 ships with..The two key metrics Im looking at in the benchmark here is how much faster these queries perform against the original cluster and the second is the cost efficiency..Im hoping to see the queries are at least 5.45x faster to be in line with the additional cost..presto-cli –catalog hive –schema default The following completed in 43 seconds (1.74x faster, 3x less cost efficient)..SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type; Query 20160408_102901_00003_hbwkj, FINISHED, 49 nodes Splits: 753 total, 753 done (100.00%) 0:43 [1.11B rows, 48.8GB] [26.1M rows/s, 1.14GB/s] The following completed in 45 seconds (1.4x faster, 3.9x less cost efficient)..SELECT passenger_count, avg(total_amount) FROM trips_orc GROUP BY passenger_count; Query 20160408_103140_00005_hbwkj, FINISHED, 49 nodes Splits: 753 total, 753 done (100.00%) 0:45 [1.11B rows, 48.8GB] [24.6M rows/s, 1.08GB/s] The following completed in 27 seconds (3x faster, 1.8x less cost efficient)..SELECT passenger_count, year(pickup_datetime), count(*) FROM trips_orc GROUP BY passenger_count, year(pickup_datetime); Query 20160408_103248_00006_hbwkj, RUNNING, 49 nodes Splits: 753 total, 750 done (99.60%) 0:27 [1.11B rows, 48.8GB] [41.4M rows/s, 1.81GB/s] The following completed in 44 seconds (2.4x faster, 2.2x less cost efficient)..SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips FROM trips_orc GROUP BY passenger_count, year(pickup_datetime), round(trip_distance) ORDER BY trip_year, trips desc; Query 20160408_103335_00007_hbwkj, FINISHED, 49 nodes Splits: 753 total, 753 done (100.00%) 0:44 [1.11B rows, 48.8GB] [25.1M rows/s, 1.1GB/s] As you can see there is a performance increase but none of the queries ran more than 3x faster despite all the extra nodes..Im investigating what can be done to bring performance in line with the cost increase..If anyone has any insights Id be grateful if you could drop me a line on Twitter on via email (my email address is in the top left corner of my CV).. More details

Leave a Reply