1.1 Billion Taxi Rides on Vertica & an Intel Core i5

  \! [COMMAND]         execute command in shell or start interactive shell
  \password [USER]     change user's password

Query Buffer
  \e [FILE]            edit the query buffer (or file) with external editor
  \g                   send query buffer to server
  \g FILE              send query buffer to server and results to file
  \g | COMMAND         send query buffer to server and pipe results to command
  \p                   show the contents of the query buffer
  \r                   reset (clear) the query buffer
  \s [FILE]            display history or save it to file
  \w FILE              write query buffer to file

Input/Output
  \echo [STRING]       write string to standard output
  \i FILE              execute commands from file
  \o FILE              send all query results to file
  \o | COMMAND         pipe all query results to command
  \o                   close query-results file or pipe
  \qecho [STRING]      write string to query output stream (see \o)

Informational
  \d [PATTERN]         describe tables (list tables if no argument is supplied)
                       PATTERN may include system schema name, e.g. v_catalog.*
  \df [PATTERN]        list functions
  \dj [PATTERN]        list projections
  \dn [PATTERN]        list schemas
  \dp [PATTERN]        list table access privileges
  \ds [PATTERN]        list sequences
  \dS [PATTERN]        list system tables
                       PATTERN may include system schema name such as
                       v_catalog, v_monitor, or v_internal
                       Example: v_catalog.a*
  \dt [PATTERN]        list tables
  \dtv [PATTERN]       list tables and views
  \dT [PATTERN]        list data types
  \du [PATTERN]        list users
  \dv [PATTERN]        list views
  \l                   list all databases
  \z [PATTERN]         list table access privileges (same as \dp)

Formatting
  \a                   toggle between unaligned and aligned output mode
  \b                   toggle beep on command completion
  \C [STRING]          set table title, or unset if none
  \f [STRING]          show or set field separator for unaligned query output
  \H                   toggle HTML output mode (currently off)
  \pset NAME [VALUE]   set table output option
                       (NAME := {format|border|expanded|fieldsep|footer|null|
                       recordsep|trailingrecordsep|tuples_only|title|tableattr|pager})
  \t                   show only rows (currently off)
  \T [STRING]          set HTML <table> tag attributes, or unset if none
  \x                   toggle expanded output (currently off)

I'll create a table that will store the 1.1 billion taxi trips dataset.

    CREATE TABLE trips (
        trip_id                 INTEGER,
        vendor_id               VARCHAR(3),
        pickup_datetime         DATETIME,
        dropoff_datetime        DATETIME,
        store_and_fwd_flag      VARCHAR(1),
        rate_code_id            SMALLINT,
        pickup_longitude        DECIMAL(18,14),
        pickup_latitude         DECIMAL(18,14),
        dropoff_longitude       DECIMAL(18,14),
        dropoff_latitude        DECIMAL(18,14),
        passenger_count         SMALLINT,
        trip_distance           DECIMAL(6,3),
        fare_amount             DECIMAL(6,2),
        extra                   DECIMAL(6,2),
        mta_tax                 DECIMAL(6,2),
        tip_amount              DECIMAL(6,2),
        tolls_amount            DECIMAL(6,2),
        ehail_fee               DECIMAL(6,2),
        improvement_surcharge   DECIMAL(6,2),
        total_amount            DECIMAL(6,2),
        payment_type            VARCHAR(3),
        trip_type               SMALLINT,
        pickup                  VARCHAR(50),
        dropoff                 VARCHAR(50),

        cab_type                VARCHAR(6),

        precipitation           SMALLINT,
        snow_depth              SMALLINT,
        snowfall                SMALLINT,
        max_temperature         SMALLINT,
        min_temperature         SMALLINT,
        average_wind_speed      SMALLINT,

        pickup_nyct2010_gid     SMALLINT,
        pickup_ctlabel          VARCHAR(10),
        pickup_borocode         SMALLINT,
        pickup_boroname         VARCHAR(13),
        pickup_ct2010           VARCHAR(6),
        pickup_boroct2010       VARCHAR(7),
        pickup_cdeligibil       VARCHAR(1),
        pickup_ntacode          VARCHAR(4),
        pickup_ntaname          VARCHAR(56),
        pickup_puma             VARCHAR(4),

        dropoff_nyct2010_gid    SMALLINT,
        dropoff_ctlabel         VARCHAR(10),
        dropoff_borocode        SMALLINT,
        dropoff_boroname        VARCHAR(13),
        dropoff_ct2010          VARCHAR(6),
        dropoff_boroct2010      VARCHAR(7),
        dropoff_cdeligibil      VARCHAR(1),
        dropoff_ntacode         VARCHAR(4),
        dropoff_ntaname         VARCHAR(56),
        dropoff_puma            VARCHAR(4)
    ) ORDER BY pickup_datetime, dropoff_datetime;

I'll then exit to the command line and execute the following to load the dataset in. On my system, the /home/mark/trips/ folder and its contents are owned by dbadmin. There are 56 gzip-compressed CSV files that make up the 1.1-billion-record dataset.

    $ time (echo "COPY trips FROM '/home/mark/trips/trips_x*.csv.gz' GZIP DELIMITER ',' DIRECT;" |
            /opt/vertica/bin/vsql -U dbadmin -w $VERTICA_PASS)

The above took 3 hours 56 minutes and 43 seconds to complete.

The dataset uses 153 GB of disk capacity when stored using Vertica's internal storage format.

    $ du -hs /home/dbadmin/trips/v_trips_node0001_data/

    153G    /home/dbadmin/trips/v_trips_node0001_data/

Benchmarking Vertica

I'll execute each query using the vsql command line tool.

    $ /opt/vertica/bin/vsql -U dbadmin -w $VERTICA_PASS

To time the queries I'll switch on Vertica's timing mechanism using the \timing command.

    \timing

The times quoted below are the lowest query times seen during a series of runs.
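Taking the lowest time from a series of runs can also be scripted from the shell. The following is a minimal sketch and not the method used for the figures in this post: best_of is a hypothetical helper name, it assumes GNU date (for the %N nanosecond format), and it measures wall-clock time for the whole command, so it includes client start-up and connection time.

```shell
# best_of RUNS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND the given number of times and prints the
# lowest wall-clock time seen, in whole milliseconds.
# Assumes GNU date, whose %N format yields nanoseconds.
# The body is wrapped in ( ) so its variables stay local to a subshell.
best_of() (
    runs=$1
    shift
    best=""
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s%N)
        "$@" > /dev/null          # discard the command's normal output
        end=$(date +%s%N)
        elapsed=$(( (end - start) / 1000000 ))
        # Keep the smallest elapsed time seen so far.
        if [ -z "$best" ] || [ "$elapsed" -lt "$best" ]; then
            best=$elapsed
        fi
        i=$((i + 1))
    done
    echo "$best"
)

# Illustrative usage only (the query here is an assumption, not one of the
# benchmark queries):
#   best_of 5 /opt/vertica/bin/vsql -U dbadmin -w "$VERTICA_PASS" \
#       -c "SELECT COUNT(*) FROM trips;"
```

Because the measurement wraps the whole vsql invocation, the result will generally read somewhat higher than the figure \timing reports for the same query.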
