1.1 Billion Taxi Rides with MapD & 8 Nvidia Pascal Titan Xs

$ vi create_trips_table.sql CREATE TABLE trips ( trip_id INTEGER, vendor_id VARCHAR(3) ENCODING DICT, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag VARCHAR(1) ENCODING DICT, rate_code_id SMALLINT, pickup_longitude DECIMAL(14,2), pickup_latitude DECIMAL(14,2), dropoff_longitude DECIMAL(14,2), dropoff_latitude DECIMAL(14,2), passenger_count SMALLINT, trip_distance DECIMAL(14,2), fare_amount DECIMAL(14,2), extra DECIMAL(14,2), mta_tax DECIMAL(14,2), tip_amount DECIMAL(14,2), tolls_amount DECIMAL(14,2), ehail_fee DECIMAL(14,2), improvement_surcharge DECIMAL(14,2), total_amount DECIMAL(14,2), payment_type VARCHAR(3) ENCODING DICT, trip_type SMALLINT, pickup VARCHAR(50) ENCODING DICT, dropoff VARCHAR(50) ENCODING DICT, cab_type VARCHAR(6) ENCODING DICT, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel VARCHAR(10) ENCODING DICT, pickup_borocode SMALLINT, pickup_boroname VARCHAR(13) ENCODING DICT, pickup_ct2010 VARCHAR(6) ENCODING DICT, pickup_boroct2010 VARCHAR(7) ENCODING DICT, pickup_cdeligibil VARCHAR(1) ENCODING DICT, pickup_ntacode VARCHAR(4) ENCODING DICT, pickup_ntaname VARCHAR(56) ENCODING DICT, pickup_puma VARCHAR(4) ENCODING DICT, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel VARCHAR(10) ENCODING DICT, dropoff_borocode SMALLINT, dropoff_boroname VARCHAR(13) ENCODING DICT, dropoff_ct2010 VARCHAR(6) ENCODING DICT, dropoff_boroct2010 VARCHAR(7) ENCODING DICT, dropoff_cdeligibil VARCHAR(1) ENCODING DICT, dropoff_ntacode VARCHAR(4) ENCODING DICT, dropoff_ntaname VARCHAR(56) ENCODING DICT, dropoff_puma VARCHAR(4) ENCODING DICT ) WITH (FRAGMENT_SIZE=85000000); Ill create two environment variables with my credentials for MapD..$ read MAPD_USERNAME $ read MAPD_PASSWORD $ export MAPD_USERNAME $ export MAPD_PASSWORD The following will create the table schema using the mapdql cli tool..$ mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD < create_trips_table.sql Ill then check that the table has been created: $ echo " " | mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD User mapd connected to database mapd trips User mapd disconnected from database mapd MapD doesnt support loading CSV data from GZIP files at this time so Ill decompress the CSV files before loading them..$ find *.gz | xargs -n 1 -P 16 gunzip With the table and files in place Ill load the 500 GB of CSV data into MapD..$ for filename in *.csv; do echo "COPY trips FROM '/theHoard/trips/$filename' WITH (header='false');" | mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD done The above completed in 43 minutes..Benchmarking MapD The times quoted below are the lowest query times seen during a series of runs..As with all my benchmarks, I use lowest query time as a way of indicating "top speed"..$ mapdql mapd -u $MAPD_USERNAME -p $MAPD_PASSWORD iming on The following completed in 0.021 seconds..SELECT cab_type, count(*) FROM trips GROUP BY cab_type; The following completed in 0.053 seconds..SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count; The following completed in 0.165 seconds..SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, count(*) FROM trips GROUP BY passenger_count, pickup_year; The following completed in 0.51 seconds..SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, cast(trip_distance as int) AS distance, count(*) AS the_count FROM trips GROUP BY passenger_count, pickup_year, distance ORDER BY pickup_year, the_count desc; Its incredible to see how much of an improvement Nvidias Pascal-based Titan X offers.. More details

Leave a Reply