1.1 Billion Taxi Rides: EC2 versus EMR

SELECT passenger_count, year(pickup_datetime), count(*) FROM trips_orc GROUP BY passenger_count, year(pickup_datetime); The following completed in 65 seconds..SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips FROM trips_orc GROUP BY passenger_count, year(pickup_datetime), round(trip_distance) ORDER BY trip_year, trips desc; While running these queries I could see the CPU was becoming something of a bottleneck..top – 08:54:14 up 1:50, 4 users, load average: 0.20, 1.64, 6.43 Tasks: 360 total, 2 running, 358 sleeping, 0 stopped, 0 zombie %Cpu0 : 94.7 us, 1.7 sy, 0.0 ni, 3.3 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu1 : 95.0 us, 1.7 sy, 0.0 ni, 3.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 98.3 us, 0.3 sy, 0.0 ni, 1.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 87.3 us, 4.3 sy, 0.0 ni, 8.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 : 95.0 us, 1.3 sy, 0.0 ni, 3.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 98.3 us, 0.0 sy, 0.0 ni, 1.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 : 96.7 us, 1.3 sy, 0.0 ni, 2.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 : 92.7 us, 1.0 sy, 0.0 ni, 5.6 id, 0.3 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu8 : 93.7 us, 1.3 sy, 0.0 ni, 5.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu9 : 92.3 us, 0.7 sy, 0.0 ni, 7.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu10 : 97.3 us, 0.7 sy, 0.0 ni, 2.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu11 : 97.3 us, 0.7 sy, 0.0 ni, 2.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu12 : 92.0 us, 3.0 sy, 0.0 ni, 5.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu13 : 94.9 us, 1.0 sy, 0.0 ni, 4.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu14 : 88.3 us, 3.0 sy, 0.0 ni, 8.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu15 : 92.6 us, 2.3 sy, 0.0 ni, 4.7 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu16 : 94.7 us, 2.3 sy, 0.0 ni, 2.6 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu17 : 93.0 us, 0.7 sy, 0.0 ni, 6.0 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu18 : 93.0 us, 3.7 sy, 0.0 ni, 3.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu19 : 91.2 us, 0.7 sy, 0.0 ni, 8.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st At first I was wondering if the NVMe driver was causing too much overhead but disk activity was nil when I captured the above output from top..Without putting together a flame graph I suspect this is just the CPU pulling data from memory and aggregating it..The memory usage on the system supports this..top – 08:56:57 up 1:53, 4 users, load average: 21.15, 9.72, 8.75 Tasks: 360 total, 1 running, 359 sleeping, 0 stopped, 0 zombie %Cpu(s): 92.9 us, 1.5 sy, 0.0 ni, 5.5 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st KiB Mem : 25175468+total, 40119300 free, 68494352 used, 14314105+buff/cache KiB Swap: 0 total, 0 free, 0 used..18175958+avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 98118 root 20 0 0.138t 0.047t 33032 S 2976 20.2 176:49.70 /usr/bin/java -+ 19797 root 20 0 32.913g 390216 25668 S 34.9 0.2 9:42.22 /usr/bin/java -+ Process 98118 is the Spark Master and 19797 is the HDFS Datanode..When this benchmark was finished I shut down Sparks services in order to give Presto as many resources as possible..$ sudo /opt/spark/sbin/stop-master.sh $ sudo /opt/spark/sbin/stop-slaves.sh Presto Benchmark Results Below are the fastest times I saw running the following queries using Presto..$ presto –server localhost:8080 –catalog hive –schema default The following completed in 11 seconds..SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type; The following completed in 14 seconds..SELECT passenger_count, avg(total_amount) FROM trips_orc GROUP BY passenger_count; The following completed in 16 seconds.. More details

Leave a Reply