1.1 Billion Taxi Rides on Amazon Athena

CREATE EXTERNAL TABLE trips_parquet (
    trip_id INT,
    vendor_id STRING,
    pickup_datetime TIMESTAMP,
    dropoff_datetime TIMESTAMP,
    store_and_fwd_flag STRING,
    rate_code_id SMALLINT,
    pickup_longitude DOUBLE,
    pickup_latitude DOUBLE,
    dropoff_longitude DOUBLE,
    dropoff_latitude DOUBLE,
    passenger_count SMALLINT,
    trip_distance DOUBLE,
    fare_amount DOUBLE,
    extra DOUBLE,
    mta_tax DOUBLE,
    tip_amount DOUBLE,
    tolls_amount DOUBLE,
    ehail_fee DOUBLE,
    improvement_surcharge DOUBLE,
    total_amount DOUBLE,
    payment_type STRING,
    trip_type SMALLINT,
    pickup STRING,
    dropoff STRING,
    cab_type STRING,
    precipitation SMALLINT,
    snow_depth SMALLINT,
    snowfall SMALLINT,
    max_temperature SMALLINT,
    min_temperature SMALLINT,
    average_wind_speed SMALLINT,
    pickup_nyct2010_gid SMALLINT,
    pickup_ctlabel STRING,
    pickup_borocode SMALLINT,
    pickup_boroname STRING,
    pickup_ct2010 STRING,
    pickup_boroct2010 STRING,
    pickup_cdeligibil STRING,
    pickup_ntacode STRING,
    pickup_ntaname STRING,
    pickup_puma STRING,
    dropoff_nyct2010_gid SMALLINT,
    dropoff_ctlabel STRING,
    dropoff_borocode SMALLINT,
    dropoff_boroname STRING,
    dropoff_ct2010 STRING,
    dropoff_boroct2010 STRING,
    dropoff_cdeligibil STRING,
    dropoff_ntacode STRING,
    dropoff_ntaname STRING,
    dropoff_puma STRING
) STORED AS parquet
  LOCATION 's3://taxis-ath/parquet/';

I've also created both the trips_orc and trips_parquet tables in Athena's query editor, so once those tables' S3 buckets are populated by Hive, their data will be available in Athena's interface as well.

Populating the Tables

The following Hive query will populate the ORC-formatted table. It completed in 2 hours and 47 minutes.

INSERT INTO TABLE trips_orc
SELECT * FROM trips_csv;

Query ID = hadoop_20161214085649_60eb11e1-cf66-4e44-8d3b-2fff1eab86f1
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1481705329458_0001)

----------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     56         56        0        0       0       0
----------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 9509.19 s
----------------------------------------------------------------------------------------
Loading data to table default.trips_orc
OK
Time taken: 10029.06 seconds

The following Hive query will populate the Parquet-formatted table. It completed in 3 hours and 13 minutes (~15% slower than the ORC table).

INSERT INTO TABLE trips_parquet
SELECT * FROM trips_csv;

Query ID = hadoop_20161214211841_5e63d67d-7f29-4e54-9790-bbb8a864d97b
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1481749215350_0001)

----------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED     56         56        0        0       0       0
----------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 10315.67 s
----------------------------------------------------------------------------------------
Loading data to table default.trips_parquet
OK
Time taken: 11623.85 seconds

These tables were populated using stock Hive settings for their formats. There is a lot of room for tuning each format's parameters, though I've found EMR's Hive settings have already had a lot of attention from AWS engineers and are pretty good out of the box. The above queries did run at very different times of the day, so I wouldn't consider this a good ORC vs Parquet shoot-out. Also, different tools in the Hadoop ecosystem have different levels of performance optimisation for each of these formats, so it's important to know where your data will be consumed and what the query patterns will look like before deciding on which format to go with for your dataset.
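
As an illustration of the kind of tuning available, compression codecs and ORC stripe sizes can be overridden per table via TBLPROPERTIES. The sketch below is hypothetical: the table names, S3 locations, trimmed column lists and property values are assumptions rather than settings used in this benchmark, though orc.compress, orc.stripe.size and parquet.compression are standard Hive table properties.

-- Hypothetical ORC table with an explicit codec and stripe size
-- (column list trimmed to two fields for brevity).
CREATE EXTERNAL TABLE trips_orc_tuned (
    trip_id INT,
    cab_type STRING
)
STORED AS orc
LOCATION 's3://taxis-ath/orc-tuned/'
TBLPROPERTIES ('orc.compress'='SNAPPY',
               'orc.stripe.size'='67108864');

-- Hypothetical Parquet equivalent using GZIP compression.
CREATE EXTERNAL TABLE trips_parquet_tuned (
    trip_id INT,
    cab_type STRING
)
STORED AS parquet
LOCATION 's3://taxis-ath/parquet-tuned/'
TBLPROPERTIES ('parquet.compression'='GZIP');

Codec choice trades CPU time during the load against file size on S3, which is worth keeping an eye on with Athena since queries are billed by the amount of data scanned.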

The ORC Benchmark

With the ORC data loaded, I ran the following queries in Athena's query editor.
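
To give a sense of the shape of these queries, the sketch below shows a simple aggregation against the ORC table, counting rides by cab type. It is a representative example rather than a verbatim copy of the benchmark SQL.

SELECT cab_type,
       count(*)
FROM trips_orc
GROUP BY cab_type;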
