Performance Impact of File Sizes on Presto Query Times

For reference these are the table names: trips_64mb trips_256mb trips_1024mb CREATE EXTERNAL TABLE trips_64mb ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION '/trips_64mb/' TBLPROPERTIES ("orc.compress"="SNAPPY", "orc.stripe.size"="67108864", "orc.row.index.stride"="50000"); CREATE EXTERNAL TABLE trips_256mb ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION '/trips_256mb/' TBLPROPERTIES ("orc.compress"="SNAPPY", "orc.stripe.size"="67108864", "orc.row.index.stride"="50000"); CREATE EXTERNAL TABLE trips_1024mb ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION '/trips_1024mb/' TBLPROPERTIES ("orc.compress"="SNAPPY", "orc.stripe.size"="67108864", "orc.row.index.stride"="50000"); Importing 3.3 Billion Records I then ran the following to import the data from CSV into ORC format..$ screen $ echo "INSERT INTO TABLE trips_64mb SELECT * FROM trips_csv_64mb; INSERT INTO TABLE trips_256mb SELECT * FROM trips_csv_256mb; INSERT INTO TABLE trips_1024mb SELECT * FROM trips_csv_1024mb;" | hive Underlying Files Breaking up the source files into 64, 256 and 1024 MB chunks results in ORC file counts that match their CSV counterparts..Below you can see the 64 MB dataset has 1,701 files, the 256 MB dataset has 421 files and the 1024 MB dataset has 106 files..$ hadoop dfs -ls /trips_64mb | grep -c trips_ 1701 $ hadoop dfs -ls /trips_256mb | grep -c trips_ 421 $ hadoop dfs -ls /trips_1024mb | grep -c trips_ 106 Benchmarking File Size Differences I ran the following queries using Presto 0.144.1..$ presto –catalog hive –schema default I ran this query on each table repeatedly..Listed below are the lowest, and often most common query times..SELECT cab_type, count(*) FROM <table> GROUP BY cab_type; trips_64mb: 13 seconds trips_256mb: 13 seconds trips_1024mb: 12 seconds As you can see there is no noticeable difference in query performance despite the difference in the number of underlying files used to store each tables data..Here are the times for the following query..SELECT passenger_count, avg(total_amount) FROM <table> GROUP BY passenger_count; trips_64mb: 10 seconds trips_256mb: 10 seconds trips_1024mb: 11 seconds Again, no real difference in query times.. More details

Leave a Reply