1.1 Billion Taxi Rides with MapD & AWS EC2

$ vi create_trips_table.sql CREATE TABLE trips ( trip_id INTEGER, vendor_id VARCHAR(3) ENCODING DICT, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag VARCHAR(1) ENCODING DICT, rate_code_id SMALLINT, pickup_longitude DECIMAL(14,2), pickup_latitude DECIMAL(14,2), dropoff_longitude DECIMAL(14,2), dropoff_latitude DECIMAL(14,2), passenger_count SMALLINT, trip_distance DECIMAL(14,2), fare_amount DECIMAL(14,2), extra DECIMAL(14,2), mta_tax DECIMAL(14,2), tip_amount DECIMAL(14,2), tolls_amount DECIMAL(14,2), ehail_fee DECIMAL(14,2), improvement_surcharge DECIMAL(14,2), total_amount DECIMAL(14,2), payment_type VARCHAR(3) ENCODING DICT, trip_type SMALLINT, pickup VARCHAR(50) ENCODING DICT, dropoff VARCHAR(50) ENCODING DICT, cab_type VARCHAR(6) ENCODING DICT, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel VARCHAR(10) ENCODING DICT, pickup_borocode SMALLINT, pickup_boroname VARCHAR(13) ENCODING DICT, pickup_ct2010 VARCHAR(6) ENCODING DICT, pickup_boroct2010 VARCHAR(7) ENCODING DICT, pickup_cdeligibil VARCHAR(1) ENCODING DICT, pickup_ntacode VARCHAR(4) ENCODING DICT, pickup_ntaname VARCHAR(56) ENCODING DICT, pickup_puma VARCHAR(4) ENCODING DICT, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel VARCHAR(10) ENCODING DICT, dropoff_borocode SMALLINT, dropoff_boroname VARCHAR(13) ENCODING DICT, dropoff_ct2010 VARCHAR(6) ENCODING DICT, dropoff_boroct2010 VARCHAR(7) ENCODING DICT, dropoff_cdeligibil VARCHAR(1) ENCODING DICT, dropoff_ntacode VARCHAR(4) ENCODING DICT, dropoff_ntaname VARCHAR(56) ENCODING DICT, dropoff_puma VARCHAR(4) ENCODING DICT ) WITH (FRAGMENT_SIZE=75000000); Ill create two environment variables with my credentials for MapD..$ read MAPD_USERNAME $ read MAPD_PASSWORD $ export MAPD_USERNAME $ export MAPD_PASSWORD The following will create the table schema using the mapdql CLI tool..$ mapd/bin/mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD < create_trips_table.sql With the table and files in place Ill load the 500 GB of CSV data into MapD..The following completed in 60 minutes and 42 seconds..$ for filename in /data/benchdata/*.csv; do echo "COPY trips FROM '$filename' WITH (header='false');" | mapd/bin/mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD done Heres a snapshot of top during the import..top – 07:06:57 up 2:13, 2 users, load average: 10.75, 10.51, 10.25 Tasks: 479 total, 1 running, 478 sleeping, 0 stopped, 0 zombie %Cpu0 : 74.3 us, 5.3 sy, 0.0 ni, 20.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 : 75.1 us, 1.7 sy, 0.0 ni, 23.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 74.0 us, 2.3 sy, 0.0 ni, 23.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 72.2 us, 3.6 sy, 0.0 ni, 24.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 : 75.3 us, 2.0 sy, 0.0 ni, 22.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 74.8 us, 2.3 sy, 0.0 ni, 22.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 : 72.1 us, 3.3 sy, 0.0 ni, 24.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 : 73.1 us, 9.3 sy, 0.0 ni, 17.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 : 74.2 us, 2.0 sy, 0.0 ni, 23.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu9 : 73.7 us, 2.7 sy, 0.0 ni, 23.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu10 : 74.9 us, 2.3 sy, 0.0 ni, 22.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu11 : 74.1 us, 3.0 sy, 0.0 ni, 22.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu12 : 76.5 us, 1.3 sy, 0.0 ni, 22.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu13 : 75.2 us, 2.3 sy, 0.0 ni, 22.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu14 : 72.8 us, 3.3 sy, 0.0 ni, 23.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu15 : 74.1 us, 4.3 sy, 0.0 ni, 21.3 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 12590446+total, 12531040+used, 594064 free, 897984 buffers KiB Swap: 0 total, 0 used, 0 free..11976936+cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 19915 ubuntu 20 0 4611148 1.278g 20028 S 1231 1.1 538:04.36 mapd/bin/mapd_server –data /data/prod/mapd-storage/data 19910 ubuntu 20 0 63008 4520 3340 S 0.0 0.0 0:00.00 bfboost client mapd/bin/mapd_server –data /data/prod/mapd-storage/+ ….Once the import is complete 894 GB of capacity is now being used on the RAID array..$ df -H Filesystem Size Used Avail Use% Mounted on udev 65G 13k 65G 1% /dev tmpfs 13G 844k 13G 1% /run /dev/xvda1 529G 6.4G 501G 2% / none 4.1k 0 4.1k 0% /sys/fs/cgroup none 5.3M 0 5.3M 0% /run/lock none 65G 13k 65G 1% /run/shm none 105M 0 105M 0% /run/user /dev/md0 1.6T 894G 602G 60% /data Benchmarking MapD The times quoted below are the lowest query times seen during a series of runs..$ mapd/bin/mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD iming on The following completed in 0.028 seconds..SELECT cab_type, count(*) FROM trips GROUP BY cab_type; The following completed in 0.2 seconds..SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count; The following completed in 0.237 seconds.. More details

Leave a Reply