1.1 Billion Taxi Rides with BrytlytDB 2.1 & a 5-node IBM Minsky Cluster

$ sudo docker exec -ti cluster /usr/local/brytlyt/bin/psql brytlyt brytlyt

CREATE NODE d04 WITH (HOST='x.x.x.213', TYPE='datanode', PORT=15432);
CREATE NODE d05 WITH (HOST='x.x.x.213', TYPE='datanode', PORT=15433);
CREATE NODE d06 WITH (HOST='x.x.x.213', TYPE='datanode', PORT=15434);
CREATE NODE d07 WITH (HOST='x.x.x.213', TYPE='datanode', PORT=15435);
CREATE NODE d08 WITH (HOST='x.x.x.103', TYPE='datanode', PORT=15432);
CREATE NODE d09 WITH (HOST='x.x.x.103', TYPE='datanode', PORT=15433);
CREATE NODE d10 WITH (HOST='x.x.x.103', TYPE='datanode', PORT=15434);
CREATE NODE d11 WITH (HOST='x.x.x.103', TYPE='datanode', PORT=15435);
CREATE NODE d12 WITH (HOST='x.x.x.104', TYPE='datanode', PORT=15432);
CREATE NODE d13 WITH (HOST='x.x.x.104', TYPE='datanode', PORT=15433);
CREATE NODE d14 WITH (HOST='x.x.x.104', TYPE='datanode', PORT=15434);
CREATE NODE d15 WITH (HOST='x.x.x.104', TYPE='datanode', PORT=15435);
CREATE NODE d16 WITH (HOST='x.x.x.105', TYPE='datanode', PORT=15432);
CREATE NODE d17 WITH (HOST='x.x.x.105', TYPE='datanode', PORT=15433);
CREATE NODE d18 WITH (HOST='x.x.x.105', TYPE='datanode', PORT=15434);
CREATE NODE d19 WITH (HOST='x.x.x.105', TYPE='datanode', PORT=15435);

SELECT pgxc_pool_reload();

With those in place there is now a data node for each of the 20 GPUs in the cluster and they've all been federated on the coordinator.

Loading 1.1 Billion Trips into BrytlytDB

I'll first define the table schema for the 1.1 billion taxi trip records.

$ sudo docker exec -ti cluster /usr/local/brytlyt/bin/psql brytlyt brytlyt

CREATE FOREIGN TABLE trips (
    trip_id                 INTEGER,
    vendor_id               VARCHAR(3),
    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,
    store_and_fwd_flag      VARCHAR(1),
    rate_code_id            SMALLINT,
    pickup_longitude        DOUBLE PRECISION,
    pickup_latitude         DOUBLE PRECISION,
    dropoff_longitude       DOUBLE PRECISION,
    dropoff_latitude        DOUBLE PRECISION,
    passenger_count         SMALLINT,
    trip_distance           DOUBLE PRECISION,
    fare_amount             DOUBLE PRECISION,
    extra                   DOUBLE PRECISION,
    mta_tax                 DOUBLE PRECISION,
    tip_amount              DOUBLE PRECISION,
    tolls_amount            DOUBLE PRECISION,
    ehail_fee               DOUBLE PRECISION,
    improvement_surcharge   DOUBLE PRECISION,
    total_amount            DOUBLE PRECISION,
    payment_type            VARCHAR(10),
    trip_type               SMALLINT,
    pickup                  VARCHAR(50),
    dropoff                 VARCHAR(50),
    cab_type                VARCHAR(10),
    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,
    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          VARCHAR(10),
    pickup_borocode         SMALLINT,
    pickup_boroname         VARCHAR(13),
    pickup_ct2010           VARCHAR(6),
    pickup_boroct2010       VARCHAR(7),
    pickup_cdeligibil       VARCHAR(1),
    pickup_ntacode          VARCHAR(4),
    pickup_ntaname          VARCHAR(56),
    pickup_puma             VARCHAR(4),
    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         VARCHAR(10),
    dropoff_borocode        SMALLINT,
    dropoff_boroname        VARCHAR(13),
    dropoff_ct2010          VARCHAR(6),
    dropoff_boroct2010      VARCHAR(7),
    dropoff_cdeligibil      VARCHAR(1),
    dropoff_ntacode         VARCHAR(4),
    dropoff_ntaname         VARCHAR(56),
    dropoff_puma            VARCHAR(4)
) SERVER gm_fdw_server
OPTIONS (max_size '55682651', index '10:24')
DISTRIBUTE BY ROUNDROBIN;

In the above table declaration there are indices on both the passenger_count and cab_type columns; the index '10:24' option identifies them by their zero-based column positions, 10 and 24. The gm_fdw_server identifier tells PostgreSQL to connect to BrytlytDB's GPU foreign data wrapper. The max_size value of 55,682,651 is the maximum number of rows I intend to import onto each data node; with 20 data nodes that gives capacity for 20 × 55,682,651 = 1,113,653,020 rows, enough to cover the roughly 1.1 billion records in the dataset. I'm using the round-robin data distribution policy here because this table won't be joined with any other data; if it were, HASH and/or REPLICATE would be better options, as sketched below.
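To make those alternatives concrete, here is a sketch of what hash-distributed and replicated declarations could look like. BrytlytDB's DISTRIBUTE BY clause appears to follow the Postgres-XL convention, so I'm assuming HASH takes a column name and the replicated form is spelled REPLICATION; the trip_summary and cab_types tables below are hypothetical and not part of this benchmark.

-- Hypothetical: hash distribution co-locates rows sharing a key on
-- the same data node, so joins on cab_type avoid cross-node shuffles.
CREATE FOREIGN TABLE trip_summary (
    cab_type   VARCHAR(10),
    trip_total DOUBLE PRECISION
) SERVER gm_fdw_server
OPTIONS (max_size '55682651')
DISTRIBUTE BY HASH (cab_type);

-- Hypothetical: replication keeps a full copy of a small lookup
-- table on every data node, so any node can join against it locally.
CREATE FOREIGN TABLE cab_types (
    cab_type VARCHAR(10),
    label    VARCHAR(20)
) SERVER gm_fdw_server
OPTIONS (max_size '100')
DISTRIBUTE BY REPLICATION;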
Note that since my last BrytlytDB benchmark, version 2.1 of the software has been released and it adds support for both TIMESTAMP and SMALLINT fields.

For this benchmark I've downloaded and decompressed the 500 GB of CSV data I created in my Billion Taxi Rides in Redshift blog post. I'll dedicate 1/5th of the dataset to each of the five servers. On each server that share is broken up again into four parts so that each CSV file targets one of the server's four data nodes, as in the sketch below.
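As a minimal sketch of that splitting step, assuming GNU coreutils' split is available on each server; the trips_server1.csv input name and trips_part_ output prefix are hypothetical, not the actual file names used here.

# Break one server's 1/5th share into four pieces along line
# boundaries so no CSV record is cut in half; each piece then
# feeds one of the server's four data nodes.
$ split --number=l/4 \
    --additional-suffix=.csv \
    trips_server1.csv \
    trips_part_

# Produces trips_part_aa.csv, trips_part_ab.csv,
# trips_part_ac.csv and trips_part_ad.csv.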
