First Impressions of GPUs and PyData

Currently my favorite approach is to use Numpy functions as a lingua franca, and to allow the frameworks to hijack those functions and interpret them as they will. This was proposed and accepted within Numpy itself in NEP-0018 and has been pushed forward by people like Stephan Hoyer, Hameer Abbasi, Marten van Kerkwijk, and Eric Wieser. This is also useful for other array libraries, like pydata/sparse and Dask array, and would go a long way towards unifying operations with libraries like XArray.

cuML needs features, Scikit-Learn needs data structure agnosticism

While deep learning on the GPU is commonplace today, more traditional algorithms like GLMs, random forests, preprocessing, and so on haven't received the same thorough treatment. Fortunately the ecosystem is well prepared to accept work in this space, largely because Scikit-Learn established a simple pluggable API early on. Building new estimators in external libraries that connect to the ecosystem well is straightforward. We should be able to build isolated estimators that can be dropped into existing workflows piece by piece, leveraging the existing infrastructure within other Scikit-Learn-compatible projects.

```python
# This code is aspirational
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfTransformer

# from sklearn.feature_extraction.text import HashingVectorizer
from cuml.feature_extraction.text import HashingVectorizer  # swap out for GPU versions

# from sklearn.linear_model import LogisticRegression, RandomForest
from cuml.linear_model import LogisticRegression, RandomForest

pipeline = make_pipeline(
    HashingVectorizer(),  # use Scikit-Learn infrastructure
    TfidfTransformer(),
    LogisticRegression(),
)

# param_distributions, data, and labels are assumed to be defined elsewhere
RandomizedSearchCV(pipeline, param_distributions).fit(data, labels)
```

Note, the example above is aspirational (that cuml code doesn't exist yet) and probably naive (I don't know ML well). However, aside from the straightforward task of building these GPU-enabled estimators (which seems to be routine for the CUDA developers at NVIDIA), there are still challenges around cleanly passing non-Numpy arrays around, coercing only when necessary, and so on that we'll need to work out within Scikit-Learn. Fortunately this work has already started because of Dask Array, which has the same problem. The Dask and Scikit-Learn communities have been collaborating to better enable pluggability over the last year. Hopefully this additional use case proceeds along these existing efforts, but now with more support.

Deep learning frameworks are overly specialized

The SciPy/PyData stack thrived because it was modular and adaptable to new situations. There are many small issues around integrating components of the deep learning frameworks into the more general ecosystem. We went through a similar experience with Dask early on, when the Python ecosystem wasn't ready for parallel computing. As Dask expanded we ran into many small issues around parallel computing that hadn't been addressed before because, for the most part, few people used Python for parallelism at the time:

- Various libraries didn't release the GIL (thanks for the work, pandas, scikit-image, and others!)
- Various libraries weren't threadsafe in some cases (like h5py, and even Scikit-Learn in one case)
- Function serialization still needed work (thanks cloudpickle developers!); see the sketch after this list
- Compression libraries were unmaintained (like LZ4)
- Networking libraries weren't used to high-bandwidth workloads (thanks Tornado devs!)
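To make the function-serialization bullet concrete, here is a minimal sketch (mine, not from the original post, and assuming cloudpickle is installed): the standard-library pickle only references functions by their module-level name, so it cannot serialize closures or lambdas, which is exactly what a distributed scheduler needs to ship to workers.

```python
# Minimal sketch of the function-serialization problem that cloudpickle solves.
import pickle
import cloudpickle

def make_adder(n):
    def add(x):  # a closure over n, defined at runtime
        return x + n
    return add

add_three = make_adder(3)

try:
    pickle.dumps(add_three)
except Exception as exc:
    print("pickle failed:", type(exc).__name__)  # can't pickle a local object

payload = cloudpickle.dumps(add_three)  # cloudpickle serializes the function body
restored = pickle.loads(payload)        # plain pickle can load the result
print(restored(4))                      # 7
```

Dask relies on exactly this capability to move user-defined functions between processes and machines.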
These issues were fixed by a combination of Dask developers and the broader community (it's amazing what people will do if you provide a well-scoped and well-described problem on GitHub). These libraries were designed to be used with other libraries, and so they were well incentivized to improve their usability by the broader ecosystem.

Today deep learning frameworks have these same problems. They rarely serialize well, aren't threadsafe when called by external threads, and so on. This is to be expected; most people using a tool like TensorFlow or PyTorch operate almost entirely within those frameworks. These projects aren't being stressed against the rest of the ecosystem (no one puts PyTorch arrays as columns in pandas, or pickles them to send across a wire). Taking tools that were designed for narrow workflows and encouraging them towards general-purpose collaboration takes time and effort, both technically and socially.

The non-deep-learning OSS community has not yet made a strong effort to engage the deep-learning developer communities. This should be an interesting social experiment between two different dev cultures.
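As an illustration of the kind of ecosystem stress-testing described above, here is a small smoke test (my own sketch, not from the post, assuming NumPy, pandas, and PyTorch are installed) that exercises a framework tensor against a few everyday PyData operations. Which checks pass varies by framework, device, and version.

```python
# Illustrative smoke test: try a few ordinary ecosystem operations on a
# framework tensor and report which ones succeed.
import pickle

import numpy as np
import pandas as pd
import torch

t = torch.arange(6, dtype=torch.float32).reshape(2, 3)

checks = {
    "convert with np.asarray": lambda: np.asarray(t),
    "round-trip through pickle": lambda: pickle.loads(pickle.dumps(t)),
    "store rows in a pandas column": lambda: pd.Series(list(t)),
}

for name, fn in checks.items():
    try:
        fn()
        print(f"ok    - {name}")
    except Exception as exc:  # report failures rather than crash
        print(f"fails - {name} ({type(exc).__name__})")
```

On a CUDA tensor, for example, the np.asarray check typically fails with a request to call .cpu() first, which is the sort of friction described above.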
