1.1 Billion Taxi Rides using OmniSciDB and a MacBook Pro

Many believe that for near-instant analytics on billions of records youd need dedicated Linux clusters, several GPUs or proprietary Cloud offerings.

Some of my fastest benchmarks were run on such environments.

But in 2020, an off-the-shelf MacBook Pro using OmniSciDB (formerly MapD) can happily do the job.

In large enterprises, getting sign off for new clusters and bringing in expertise to keep them operational is rarely a quick process.

If the value of a new dataset hasnt yet been proven then the bureaucratic hurdles can end up putting businesses off from using some of the fastest analytical offerings on the market.

To add to this, many industries need to either keep their data isolated from the Cloud, have policies to not spend budget with certain Cloud vendors, see the transferring of data too lengthy for their time-sensitive needs or impractical to shift their datasets reliably with their existing infrastructure and expertise.

Only software that can run on mainstream hardware on-premise or within an enterprises existing data centre is likely to see adoption.

Its rare to find anyone in an office environment that doesnt at least have a laptop but most dont come with top-notch Nvidia GPUs, let alone run CUDA-friendly Linux environments.

This is why I was excited to hear that OmniSciDB would not only target Intel CPUs as a first-class platform but their offering would install and run in a user-friendly manor on macOS, just as any other mac software would.

OmniSciDBs GPU offering still outperforms their CPU offering by some 3.

2x according to benchmarks published on their homepage but good should never be the enemy of perfect.

Ive always aimed to offer my clients the most practical solutions and sometimes that is a laptop proving there is value in their data.

In this post, Im going to see how fast OmniSciDB 5.

3.

1 can run the 1.

1 billion taxi rides benchmark using a 16" MacBook Pro running macOS.

This dataset is made up of 1.

1 billion taxi trips conducted in New York City between 2009 and 2015.

This is the same dataset Ive used to benchmark Amazon Athena, BigQuery, BrytlytDB, ClickHouse, Elasticsearch, EMR, kdb+/q, MapD, PostgreSQL, Redshift and Vertica.

I have a single-page summary of all these benchmarks for comparison.

The 16" MacBook Pro The laptop Ill be using is a 2019, Space Grey, 16" MacBook Pro running macOS 10.

15.

5.

The CPU is an 8-core Intel Core i9 running at 2.

4 GHz.

There is 64 GB of 2666 MHz DDR4 RAM and an Apple AP2048N SSD with 2 TB of capacity connected via PCIe.

AmorphousDiskMark 2.

5.

4 was able to read from this disk at 3,454.

22 MB/s and write at 3,283.

94 MB/s when working with 128 KB blocks sequentially with a queue depth of 32.

The GPU wont be used by OmniSciDB in this benchmark but for the record its an Intel UHD Graphics 630 with 1,536 MB of GPU RAM.

This GPU was a $200 upgrade over the stock GPU Apple ships with this notebook.

Nonetheless, it wont have a material impact on this benchmark.

This machine currently retails for about $4,000 in the US before any sales taxes.

Importing 1.

1 Billion Trips Into OmniSciDB The dataset Ill be using is a data dump Ive produced of 1.

1 billion taxi trips conducted in New York City over a six year period.

The raw dataset lives as 56 GZIP-compressed CSV files that are 104 GB when compressed and need 500 GB of space when decompressed.

The Billion Taxi Rides in Redshift blog post goes into detail regarding how I put this dataset together and describes the columns it contains in further detail.

OmniSciDB has a Java dependency so Ive opted to install OpenJDK 8 LTS using the HotSpot JVM from AdoptOpenJDK.

The default number of file descriptors a process can have open on macOS is 256.

OmniSciDB will likely use more depending on the size of data, number of tables and number of concurrent connections.

This limit can be increased on a per-session basis but instead Ill make the upgrade permanent with the following commands.

$ curl -O https://raw.

githubusercontent.

com/wilsonmar/mac-setup/master/configs/limit.

maxproc.

plist $ curl -O https://raw.

githubusercontent.

com/wilsonmar/mac-setup/master/configs/limit.

maxfiles.

plist $ sudo cp limit.

max{proc,files}.

plist /Library/LaunchDaemons/ $ sudo chmod 644 /Library/LaunchDaemons/limit.

max{proc,files}.

plist $ sudo launchctl load -w /Library/LaunchDaemons/limit.

maxproc.

plist $ sudo launchctl load -w /Library/LaunchDaemons/limit.

maxfiles.

plist The following proves that the file descriptors limit was increased.

$ ulimit -n 524288 I was provided with a build of OmniSciDB 5.

3.

1 for macOS that Ill install via the Terminal in my home directory.

Ive decompressed the distributable and created a symlink so that I can address the parent folder as ~/omnisci.

$ cd ~ $ tar xvf omnisci-ee-master-Darwin-x86_64-cpu.

tar.

gz $ ln -sfn $(ls -dt omnisci-ee* | head -n1) ~/omnisci Ill then create a data folder for OmniSciDB.

$ mkdir -p ~/omnisci-data Ill then initialise the data folder and launch OmniSciDBs Server.

$ cd ~/omnisci $ bin/initdb ~/omnisci-data $ bin/omnisci_server –data ~/omnisci-data The above server process was kept running in a Terminal.

You could also run it as a daemon, in a screen session or via tmux if you wish.

Ill create an alias to the OmniSciDB client and include the default credentials.

This will let me launch it by typing omnisql in the Terminal.

$ alias omnisql="~/omnisci/bin/omnisql -p HyperInteractive" The following will create the table for the taxi trips dataset.

$ omnisql DROP TABLE IF EXISTS trips; CREATE TABLE trips ( trip_id INTEGER, vendor_id TEXT ENCODING DICT(8), pickup_datetime TIMESTAMP ENCODING FIXED(32), dropoff_datetime TIMESTAMP ENCODING FIXED(32), store_and_fwd_flag TEXT ENCODING DICT(8), rate_code_id SMALLINT, pickup_longitude DECIMAL(14,2), pickup_latitude DECIMAL(14,2), dropoff_longitude DECIMAL(14,2), dropoff_latitude DECIMAL(14,2), passenger_count SMALLINT, trip_distance DECIMAL(14,2), fare_amount DECIMAL(14,2), extra DECIMAL(14,2), mta_tax DECIMAL(14,2), tip_amount DECIMAL(14,2), tolls_amount DECIMAL(14,2), ehail_fee DECIMAL(14,2), improvement_surcharge DECIMAL(14,2), total_amount DECIMAL(14,2), payment_type TEXT ENCODING DICT(8), trip_type SMALLINT, pickup TEXT ENCODING DICT(16), dropoff TEXT ENCODING DICT(16), cab_type TEXT ENCODING DICT(8), precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel TEXT ENCODING DICT(16), pickup_borocode SMALLINT, pickup_boroname TEXT ENCODING DICT(8), pickup_ct2010 TEXT ENCODING DICT(16), pickup_boroct2010 TEXT ENCODING DICT(16), pickup_cdeligibil TEXT ENCODING DICT(8), pickup_ntacode TEXT ENCODING DICT(8), pickup_ntaname TEXT ENCODING DICT(8), pickup_puma TEXT ENCODING DICT(8), dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel TEXT ENCODING DICT(16), dropoff_borocode SMALLINT, dropoff_boroname TEXT ENCODING DICT(8), dropoff_ct2010 TEXT ENCODING DICT(16), dropoff_boroct2010 TEXT ENCODING DICT(16), dropoff_cdeligibil TEXT ENCODING DICT(8), dropoff_ntacode TEXT ENCODING DICT(8), dropoff_ntaname TEXT ENCODING DICT(8), dropoff_puma TEXT ENCODING DICT(8) ) WITH (FRAGMENT_SIZE=75000000); OmniSciDB can import GZIP-compressed CSV files without needing them decompressed ahead of time.

It also allows wild stars / globs so the 56 CSV files dont have to be named individually.

This is both a big time and disk space saver.

Below Ill run a SQL command that will import the entire dataset into OmniSciDB.

COPY trips FROM '/Users/mark/taxi_csv/*.

gz' WITH (HEADER='false'); The above managed to complete in 31 minutes and 40 seconds.

The resulting import produced 294 GB of data in OmniSciDBs internal format.

Benchmarking OmniSciDB The times quoted below are the lowest query times seen during a series of runs.

As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".

$ omnisql iming The following completed in 0.

134 seconds.

SELECT cab_type, count(*) FROM trips GROUP BY cab_type; The following completed in 0.

349 seconds.

SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count; The following completed in 0.

542 seconds.

SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, count(*) FROM trips GROUP BY passenger_count, pickup_year; The following completed in 3.

312 seconds.

SELECT passenger_count, extract(year from pickup_datetime) AS pickup_year, cast(trip_distance as int) AS distance, count(*) AS the_count FROM trips GROUP BY passenger_count, pickup_year, distance ORDER BY pickup_year, the_count desc; Final Thoughts The Q1 time is the fastest for any workstation benchmark Ive done.

To get this level of performance on a regular piece of office equipment is a big game changer.

The laptop might seem expensive but its a one-off purchase that can be depreciated over a few years.

There is something magical when a client points their Tableau installation at OmniSciDB and everything they throw at it appears to run instantaneously.

The less friction between questions and answers means more time discovering the value of data.

The more questions you ask of your data the greater the chance of discovering the unexpected.

And for the record, for those without a Tableau license to hand, OmniSciDB has a visualisation package called Immerse which also does an amazing job at near-instant visualisation on billions of rows.

To have OmniSciDB running on a regular MacBook Pro and optimised for the Intel CPUs Apple ships with is going to do amazing things for the world of analytics.

.

Leave a Reply