A Billion Rows A Second

For most computers, it’s an impossibility.

So we want to convert our files into a format that Vaex loves (HDF5) rather than having Vaex convert via Pandas.

The typical case for large CSV files is to have them broken up into part-files, so I wanted to write a bash script that could handle both cases.
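The bash script itself is not reproduced here. As a rough illustration of the "both cases" idea only, here is a minimal Python sketch that treats a single CSV and a set of part-files the same way by merging whatever matches a glob pattern into one file. The function name `merge_csv_parts` is hypothetical, and the sketch assumes every part-file starts with the same one-line header.

```python
import glob
import os


def merge_csv_parts(pattern, out_path):
    """Merge every CSV matching `pattern` into `out_path`.

    Handles one file or many part-files. Assumes each file begins with
    the same one-line header; only the first header is kept.
    Returns the number of input files merged.
    """
    out_abs = os.path.abspath(out_path)
    # Sort for a stable order and never read the output file as an input.
    paths = [p for p in sorted(glob.glob(pattern))
             if os.path.abspath(p) != out_abs]
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")

    with open(out_path, "w", newline="") as out:
        for i, path in enumerate(paths):
            with open(path, "r", newline="") as part:
                for line_no, line in enumerate(part):
                    if line_no == 0 and i > 0:
                        continue  # skip the repeated header
                    # Guard against a part that lacks a trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(paths)
```

Reading line by line keeps memory use flat regardless of file size, which matters more than raw speed for a one-off preparation step like this.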

For our example file, we’re going to use a Kaggle dataset on air quality (~2.5GB unzipped), and we’ll remove most of the string columns.

That leaves us with a ~1GB file.

Small as far as data goes, but good for this demo.

To start, you need to download a handy tool for file conversion called Topcat.

If you don’t have experience using .jar files, you can read about them here.

Copy the jar into the directory of the CSV file (you could also just have a central directory of jar files but I’ll leave that to the reader).

Now we use a custom bash script to convert the CSV efficiently to HDF5.

And we are ready to test out Vaex.

Since Vaex doesn’t actually load the dataset into memory, opening the file is almost instantaneous.
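The reason opening can be so cheap is memory mapping: the file is mapped into the process's address space and bytes are only read from disk when they are touched. The toy sketch below shows the idea using only Python's standard `mmap` module. It is not Vaex's implementation, and the flat binary layout and the names `write_column` and `read_row` are made up for the illustration.

```python
import mmap
import os
import struct
import tempfile

ROW_FORMAT = "<d"                       # one little-endian float64 per row
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # 8 bytes


def write_column(path, values):
    """Write floats to disk as a flat binary column."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack(ROW_FORMAT, v))


def read_row(path, index):
    """Fetch one value by mapping the file rather than reading it all."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the page holding this offset needs to come off disk.
            return struct.unpack_from(ROW_FORMAT, mm, index * ROW_SIZE)[0]


# Illustrative data: 100,000 rows holding 0.0, 0.5, 1.0, ...
path = os.path.join(tempfile.mkdtemp(), "column.bin")
write_column(path, (i * 0.5 for i in range(100_000)))
value = read_row(path, 99_999)
```

Mapping the file costs about the same whether it holds a thousand rows or a billion, which is why the "open" step barely registers.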

Let’s run a relatively simple calculation and compare.

Our dataset contains longitudes and latitudes for each data point (there are a total of ~9 million rows).

Let’s say we want to find all points contained within New York City.

A rough box might give one the following:

nyc_lat = [40.348424, 40.913145]
nyc_long = [-74.432885, -73.696834]

And here is the comparison of Vaex vs Pandas.

For this simple operation, Vaex is almost 25 times faster.
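The operation being timed is just a bounding-box test on each row. Here is a minimal plain-Python sketch of that test using the NYC bounds given above. The sample points and the function name `in_nyc_box` are made up for the illustration, and the column names in the closing comment are hypothetical.

```python
# Bounds of the rough NYC box: [min, max] for each axis.
nyc_lat = [40.348424, 40.913145]
nyc_long = [-74.432885, -73.696834]


def in_nyc_box(lat, lon):
    """Return True when the point falls inside the rough NYC box."""
    return (nyc_lat[0] <= lat <= nyc_lat[1]
            and nyc_long[0] <= lon <= nyc_long[1])


# Illustrative points only, not rows from the air quality dataset.
points = [
    ("Times Square", 40.7580, -73.9855),
    ("Los Angeles", 34.0522, -118.2437),
    ("Philadelphia", 39.9526, -75.1652),
]
inside = [name for name, lat, lon in points if in_nyc_box(lat, lon)]

# On a Vaex or Pandas DataFrame the same test becomes a boolean mask over
# whole columns, e.g. with hypothetical column names `lat` and `lon`:
#   df[(df.lat >= nyc_lat[0]) & (df.lat <= nyc_lat[1])
#      & (df.lon >= nyc_long[0]) & (df.lon <= nyc_long[1])]
```

The per-row logic is trivial, so the timing difference comes from how each library gets the columns through the CPU, not from the filter itself.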

Keep in mind that the larger your dataset and the more mathematical calculations you need to do, the more benefit you are going to see from Vaex.

Additionally, complex mathematical operations can be significantly boosted using a beautiful Python library called Numba (more on that in a later post).

Make Use of Data Science Tools

I always find that big-data creates its own unique set of challenges.

Datasets that number in the Terabytes, mathematical ops numbering in the trillions, and the complexities that both bring make life significantly more complicated.

Libraries like Vaex allow data scientists without the deep computing expertise of a big-data engineer to still power through incredibly large datasets in a very efficient way.

If you’re interested in learning more about handling big-data in a small way (i.e. by yourself), then follow me on Medium, or get in touch at https://jessemoore.ca to discuss a potential project.

Till next time…

Jesse

Originally published at jessemoore.ca.
