Here’s how you can accelerate your Data Science on GPU

George Seif · Jul 3

Data Scientists need computing power.

Whether you’re processing a big dataset with Pandas or running some computation on a massive matrix with Numpy, you’ll need a powerful machine to get the job done in a reasonable amount of time.

Over the past several years, Python libraries commonly used by Data Scientists have gotten pretty good at leveraging CPU power.

Pandas, with its underlying base code written in C, does a fine job of handling datasets that exceed even 100GB in size.

And if you don’t have enough RAM to fit such a dataset, you can always use the convenient chunking functions that can process the data one piece at a time.
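As a concrete, hedged illustration, a chunked pass over a big CSV might look like the sketch below; the file name and column are placeholders of mine, not from the original post:

import pandas as pd

# Stream the file in one-million-row chunks instead of loading it all into RAM,
# accumulating only the statistics we need.
# 'huge_dataset.csv' and the 'value' column are placeholders for illustration.
total = 0.0
row_count = 0
for chunk in pd.read_csv('huge_dataset.csv', chunksize=1_000_000):
    total += chunk['value'].sum()
    row_count += len(chunk)

print('mean value:', total / row_count)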

GPUs vs CPUs: Parallel Processing

With massive data, a CPU just isn’t going to cut it.

A dataset over 100GB in size will have many, many data points, somewhere in the range of millions or even billions.

With that many points to process, it doesn’t matter how fast your CPU is, it simply doesn’t have enough cores to do efficient parallel processing.

If your CPU has 20 cores (which would be a fairly expensive CPU), you can only process 20 data points at a time!

CPUs are better for tasks where clock speed is more important, or where the process you need simply has no GPU implementation.

If there is a GPU implementation for the process you are trying to perform, then a GPU will be far more effective if that task can benefit from parallel processing.

How a multi-core system can process data faster. For a single-core system (left), all 10 tasks go to a single node. For the dual-core system (right), each node takes on 5 tasks, thereby doubling the processing speed.

Deep Learning has already seen its fair share of leveraging GPUs.

Many of the convolution operations done in Deep Learning are repetitive and as such can be greatly accelerated on GPUs, even up to 100s of times.

Data Science today is no different as many repetitive operations are performed on large datasets with libraries like Pandas, Numpy, and Scikit-Learn.

These operations aren’t too complex to implement on the GPU either.

Finally, there’s a solution.

GPU Acceleration with Rapids

Rapids is a suite of software libraries designed for accelerating Data Science by leveraging GPUs.

It uses low-level CUDA code for fast, GPU-optimized implementations of algorithms while still having an easy to use Python layer on top.

The beauty of Rapids is that it’s integrated smoothly with Data Science libraries — things like Pandas dataframes are easily passed through to Rapids for GPU acceleration.

The diagram below illustrates how Rapids achieves low-level acceleration while maintaining an easy-to-use top layer.

Rapids leverages several Python libraries:

cuDF — Python GPU DataFrames. It can do almost everything Pandas can in terms of data handling and manipulation.

cuML — Python GPU Machine Learning. It contains many of the ML algorithms that Scikit-Learn has, all in a very similar format.

cuGraph — Python GPU graph processing. It contains many common graph analytics algorithms, including PageRank and various similarity metrics (a short sketch follows below).
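Since cuGraph doesn’t appear again later in the tutorial, here is a minimal, hedged sketch of running PageRank on a tiny, made-up edge list; the exact call signatures vary between Rapids versions:

import cudf
import cugraph

# A tiny toy edge list held in a cuDF dataframe (placeholder data).
edges = cudf.DataFrame({'src': [0, 1, 2, 2], 'dst': [1, 2, 0, 3]})

# Build a graph on the GPU and run PageRank on it.
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source='src', destination='dst')
scores = cugraph.pagerank(G)

# Bring the scores back to the host for inspection.
print(scores.to_pandas())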

A Tutorial for how to use Rapids

Installation

Now you’ll see how to use Rapids! To install it, head on over to the Rapids website, which walks you through the installation options.

You can install it directly on your machine through Conda or simply pull the Docker container.

When installing, you can set your system specs such as CUDA version and which libraries you would like to install.

For example, I have CUDA 10.0 and wanted to install all the libraries, so my install command was:

conda install -c nvidia -c rapidsai -c numba -c conda-forge -c pytorch -c defaults cudf=0.8 cuml=0.8 cugraph=0.8 python=3.6 cudatoolkit=10.0

Once that command finishes running, you’re ready to start doing GPU-accelerated Data Science.
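As a quick sanity check of my own (not part of the original walkthrough), you can confirm the stack imports cleanly; this assumes the libraries expose the usual __version__ attribute:

import cudf
import cuml
import cugraph

# If these imports succeed, the GPU-accelerated stack is ready to use.
print(cudf.__version__, cuml.__version__, cugraph.__version__)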

Setting up our data

For this tutorial, we’re going to go through a modified version of the DBSCAN demo.

I’ll be running the tests on the Nvidia Data Science Workstation, which came with 2 GPUs.

DBSCAN is a density-based clustering algorithm that can automatically classify groups of data, without the user having to specify how many groups there are.

There’s an implementation of it in Scikit-Learn.

We’ll start by getting all of our imports set up: libraries for loading data, visualising data, and applying ML models.

import os
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_circles

The make_circles function will automatically create a complex distribution of data resembling two circles that we’ll apply DBSCAN on.

Let’s start by creating our dataset of 100,000 points and visualising it in a plot:

X, y = make_circles(n_samples=int(1e5), factor=.35, noise=.05)
X[:, 0] = 3*X[:, 0]
X[:, 1] = 3*X[:, 1]
plt.scatter(X[:, 0], X[:, 1])
plt.show()

DBSCAN on CPU

Running DBSCAN on CPU is easy with Scikit-Learn.

We’ll import our algorithm and set up some parameters.

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.6, min_samples=2)

We can now apply DBSCAN on our circle data with a single function call from Scikit-Learn.

Putting a %%time before our function tells Jupyter Notebook to measure its run time.

%%time
y_db = db.fit_predict(X)

For those 100,000 points, the run time was 8.31 seconds. The resulting plot is shown below.
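The plotting code for the clustered result isn’t shown in the original demo; a minimal sketch that colours each point by its predicted label (this is where the ListedColormap import comes in, though the specific colours, and the assumption of two clusters plus noise, are mine) could look like:

# Colour each point by its DBSCAN cluster label; noise points get label -1.
cmap = ListedColormap(['gray', 'tab:blue', 'tab:orange'])
plt.scatter(X[:, 0], X[:, 1], c=y_db, cmap=cmap, s=2)
plt.show()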

Result of running DBSCAN on the CPU using Scikit-Learn

DBSCAN with Rapids on GPU

Now let’s make things faster with Rapids! First, we’ll convert our data to a pandas.DataFrame and use that to create a cudf.DataFrame. Pandas dataframes are converted seamlessly to cuDF dataframes without any change in the data format.

import pandas as pd
import cudf

X_df = pd.DataFrame({'fea%d'%i: X[:, i] for i in range(X.shape[1])})
X_gpu = cudf.DataFrame.from_pandas(X_df)

We’ll then import and initialise a special version of DBSCAN from cuML, one that is GPU accelerated. The function format of the cuML version of DBSCAN is the exact same as that of Scikit-Learn — same parameters, same style, same functions.

from cuml import DBSCAN as cumlDBSCAN
db_gpu = cumlDBSCAN(eps=0.6, min_samples=2)

Finally, we can run our prediction function for the GPU DBSCAN while measuring the run time.

%%time
y_db_gpu = db_gpu.fit_predict(X_gpu)

The GPU version has a run time of 4.22 seconds — almost a 2X speedup. The resulting plot is the exact same as the CPU version too, since we are using the same algorithm.
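One practical note of my own, not spelled out in the demo: y_db_gpu typically comes back as a cuDF Series living in GPU memory rather than a NumPy array (the exact return type can differ across Rapids versions), so to reuse the Matplotlib plotting code you would first copy the labels back to the host, for example:

# Copy the GPU-resident labels back to host memory so Matplotlib can use them.
y_db_host = y_db_gpu.to_pandas().values
plt.scatter(X[:, 0], X[:, 1], c=y_db_host, s=2)
plt.show()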

Result of running DBSCAN on the GPU using cuML

Getting super speed with Rapids GPU

The amount of speedup we get from Rapids depends on how much data we are processing.

A good rule of thumb is that larger datasets will benefit from GPU acceleration.

There is some overhead time associated with transferring data between the CPU and GPU — that overhead time becomes more “worth it” with larger datasets.

We can illustrate this with a simple example.

We’re going to create a Numpy array of random numbers and apply DBSCAN on it.

We’ll compare the speed of our regular CPU DBSCAN and the GPU version from cuML, while increasing and decreasing the number of data points to see how it affects our run time.

The code below illustrates this test:

import numpy as np

n_rows, n_cols = 10000, 100
X = np.random.rand(n_rows, n_cols)
print(X.shape)

X_df = pd.DataFrame({'fea%d'%i: X[:, i] for i in range(X.shape[1])})
X_gpu = cudf.DataFrame.from_pandas(X_df)

db = DBSCAN(eps=3, min_samples=2)
db_gpu = cumlDBSCAN(eps=3, min_samples=2)

%%time
y_db = db.fit_predict(X)

%%time
y_db_gpu = db_gpu.fit_predict(X_gpu)

Check out the plot of the results from Matplotlib down below. The amount of speedup rises quite drastically when using the GPU instead of the CPU.
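To produce a sweep like the one plotted, one could repeat that measurement at several dataset sizes; here is a hedged sketch of such a loop in a plain script (using time.perf_counter in place of the %%time notebook magic, with illustrative sizes of my choosing):

import time

for n_rows in [10_000, 100_000, 1_000_000]:
    X = np.random.rand(n_rows, n_cols)
    X_df = pd.DataFrame({'fea%d'%i: X[:, i] for i in range(n_cols)})
    X_gpu = cudf.DataFrame.from_pandas(X_df)

    # Time the CPU (Scikit-Learn) version.
    start = time.perf_counter()
    DBSCAN(eps=3, min_samples=2).fit_predict(X)
    cpu_time = time.perf_counter() - start

    # Time the GPU (cuML) version.
    start = time.perf_counter()
    cumlDBSCAN(eps=3, min_samples=2).fit_predict(X_gpu)
    gpu_time = time.perf_counter() - start

    print('%d points -> speedup %.2fx' % (n_rows, cpu_time / gpu_time))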

Even at 10,000 points (far left) we still get a speedup of 4.54X. On the higher end of things, with 10,000,000 points we get a speedup of 88.04X when switching to GPU!

Like to learn?

Follow me on Twitter, where I post all about the latest and greatest AI, Technology, and Science! Connect with me on LinkedIn too!

Recommended Reading

Want to learn more about Data Science? The Python Data Science Handbook is the best resource out there for learning how to do real Data Science with Python!

And just a heads up, I support this blog with Amazon affiliate links to great books, because sharing great books helps everyone! As an Amazon Associate I earn from qualifying purchases.
