Beginner’s Guide to Create End-to-End Machine Learning Pipeline in PySpark

Useful Resources, Concepts, and Lessons for Data Scientists Building Their First End-to-End Machine Learning Pipeline in Spark

Sherry Wang · Jun 23

When I realized my training set included more than 1 million rows daily, the first thing that came to my mind was sub-sampling.

However, as I started subsampling, I found it hard to avoid introducing bias in the process.

That’s when I started to wonder if I could build a model without subsampling and thought of Spark.

I’ve used Spark to create features, but I’ve never built an end-to-end training pipeline with it.

I thought there weren’t many modeling choices in Spark, and that Spark’s machine learning package isn’t as powerful or user-friendly as sklearn.

To my surprise, I pretty much found everything I needed in Spark easily.

I found the model I wanted to use, lightgbm, in mmlspark, an open source package for Spark developed by Microsoft; and I found pretty well-documented pipeline functions in Spark’s MLlib package.

So I decided to give it a shot: to build an end-to-end modeling pipeline in spark.

It wasn’t an easy journey.

Spark isn’t as widely used for machine learning as Python just yet; community support is sometimes limited, useful information is very scattered, and there isn’t a good beginner’s guide to help clarify common confusions.

Thus in this post, I’ll list some resources I found useful, go over some foundational concepts, and share some lessons learned during my journey.

Useful Resources

Searching for Functions

Google, of course, is the first choice when it comes to searching for something you need, but I found looking through the Spark documentation for functions also very helpful.

It’s important to refer to the right Spark version though (the above link is version 2.4.3).

Machine Learning Code Examples

The Spark MLlib documentation already has a lot of code examples, but I found Databricks’ notebook documentation for machine learning even better.

This notebook walks through a classification training pipeline, and this notebook demonstrates parameter tuning and MLflow for tracking.

These notebooks were created to explain how to use various Spark MLlib features in Databricks, but much of the functionality showcased in them is not exclusive to Databricks.

Spark Overall Understanding

The Apache Spark Wikipedia page summarizes the important Spark modules very nicely.

I didn’t get much out of it when I first read it, but after learning some Spark I grew to appreciate this page, as it provides a very good overall introduction to Apache Spark.

I also liked the “Apache Spark Ecosystem” section on Databricks’ Spark introduction page a lot.

It is very similar to the information on the Wikipedia page.

Reading both provided me with a thorough high-level understanding of the Spark ecosystem.

[Figure: Apache Spark Ecosystem]

Foundational Concepts

PySpark vs Python

Python Code and Functions: Python code that works with Python objects (lists, dictionaries, pandas data types, numpy data types, etc.) is executable in PySpark, but it won’t benefit from Spark at all (i.e., no distributed computing). Python code can’t be applied to Spark objects (RDDs, Spark Datasets, Spark DataFrames, etc.) directly, though. If needed, such code can be turned into UDFs (User Defined Functions) and applied to each row of a Spark dataframe (just like map in pandas).

This blog explains UDFs very well, and there’s also a code example in this Databricks notebook.
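As a minimal sketch of what that looks like (the column and function names here are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# A plain Python function -- by itself it knows nothing about Spark.
def bucket_age(age):
    return "young" if age is not None and age < 30 else "old"

# Wrap it as a UDF so it can be applied row by row to a Spark dataframe column.
bucket_age_udf = udf(bucket_age, StringType())

df = spark.createDataFrame([(25,), (41,)], ["age"])
df.withColumn("age_bucket", bucket_age_udf("age")).show()
```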

PySpark Code and Functions: PySpark code can only be applied to Spark objects.

It won’t work when applied to Python objects.

Python and PySpark Object Conversion: It is possible to convert some (but not all) Python objects (e.g., a pandas dataframe) to Spark objects (e.g., a Spark dataframe) and vice versa, as long as the data is small enough to fit in the driver’s memory.
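A small sketch of both directions (assuming the data comfortably fits in driver memory):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("conversion-example").getOrCreate()

# pandas -> Spark: fine here because the pandas dataframe is tiny.
pdf = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "label": [0, 1, 0]})
sdf = spark.createDataFrame(pdf)

# Spark -> pandas: collects ALL rows to the driver, so size matters again.
pdf_back = sdf.toPandas()
```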

spark.mllib vs spark.ml

It was very confusing to me at first that, when checking the documentation, you’ll see MLlib used as the name of the machine learning library, yet all the code examples import from pyspark.ml. In fact both spark.mllib and spark.ml are Spark’s machine learning libraries: spark.mllib is the old library that works with RDDs, while spark.ml is the newer API built around Spark dataframes. According to Spark’s announcement, the RDD-based API has entered maintenance mode as of Spark 2.0.0. This means no new features will be added to pyspark.mllib, and after reaching feature parity the RDD-based API will be deprecated; pyspark.mllib is expected to be removed in Spark 3.0. In short, use pyspark.ml and avoid pyspark.mllib whenever you can.
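In practice that just means importing estimators from the dataframe-based package; a quick sketch (the column names are assumptions):

```python
# Preferred: the dataframe-based API under pyspark.ml
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
# model = lr.fit(train_df)  # train_df needs "features" and "label" columns

# Avoid: the RDD-based API under pyspark.mllib (maintenance mode since 2.0.0)
# from pyspark.mllib.classification import LogisticRegressionWithLBFGS
```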

Lessons Learned

Algorithm Choices

Spark’s machine learning library includes a lot of widely used industry algorithms such as generalized linear models, random forests, gradient boosted trees, etc.

The full list of supported algorithms can be found here.

There is also the open source library mmlspark.

It provides seamless integration of Spark Machine Learning pipelines with Microsoft Cognitive Toolkit (CNTK), LightGBM and OpenCV.
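As a rough sketch of what using it looks like (this assumes mmlspark is installed on the cluster; the exact import path and parameter names have changed across mmlspark versions, so treat this as illustrative rather than definitive):

```python
# Assumes mmlspark is available; in some versions the class lives under
# mmlspark.lightgbm rather than the top-level package.
from mmlspark import LightGBMClassifier

lgbm = LightGBMClassifier(
    featuresCol="features",   # assembled feature vector column
    labelCol="label",
    numIterations=100,        # illustrative values, not recommendations
    learningRate=0.1,
)
# model = lgbm.fit(train_df)
```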

Unfortunately, another very popular training framework, xgboost, is not supported in PySpark.

Even though there’s XGBoost4J-Spark, which integrates the xgboost framework with Spark, no Python API has been developed for it yet.

As mentioned before, it’s technically possible to import the Python xgboost or lightgbm module and apply its training functions to a pandas dataframe in PySpark, if the training data can fit in driver memory.

However, this approach wouldn’t benefit from Spark at all (i.e., training would happen on a single machine rather than being distributed across machines, just as it would without Spark).

No Super Deep Trees

One surprise to me is that the ensemble models random forest and gradient boosted trees can’t take values greater than 30 for the maxDepth parameter.

Training time also increases exponentially as maxDepth increases.

With my training set, training a depth-20 random forest took 1 hour and a depth-30 one took 8 hours.

In my opinion, shallow trees are a problem for random forests because when the training data is big, deep individual trees are able to find diverse “rules”, and such diversity should help performance.
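For reference, the cap shows up when setting the parameter on the estimator (the values below are placeholders):

```python
from pyspark.ml.classification import RandomForestClassifier

# maxDepth is capped at 30 in Spark MLlib; training time also grows quickly
# as you approach that cap.
rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="label",
    numTrees=100,
    maxDepth=20,
)
# model = rf.fit(train_df)
```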

Tune Resource Allocation

It’s important to allocate enough memory to executors during training, and it’s worth spending time tuning the number of executors, the cores per executor, and the executor memory.

One of my training runs finished in 10 minutes with the right resource allocation, compared to 2 hours when I first started tuning.
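These knobs map to spark.executor.instances, spark.executor.cores, and spark.executor.memory (or the equivalent spark-submit flags); a sketch with placeholder values, not recommendations:

```python
from pyspark.sql import SparkSession

# Placeholder numbers -- the right values depend on your cluster and data size.
spark = (
    SparkSession.builder
    .appName("training")
    .config("spark.executor.instances", "10")  # number of executors
    .config("spark.executor.cores", "4")       # cores per executor
    .config("spark.executor.memory", "8g")     # memory per executor
    .getOrCreate()
)
```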

Ensemble Features

Most, if not all, Spark models take a dataframe with two columns, features and label, as input.

The features column is a single vector of all the feature values concatenated together.

VectorAssembler is the function that does this, and it should always be used as the last step of feature engineering.

There’s an example of using it in a modeling pipeline here.
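A minimal sketch (the input column names are made up):

```python
from pyspark.ml.feature import VectorAssembler

# Concatenate individual feature columns into a single "features" vector column,
# which is what Spark ML estimators expect alongside the "label" column.
assembler = VectorAssembler(
    inputCols=["age", "income", "num_purchases"],
    outputCol="features",
)
# train_df = assembler.transform(raw_df).select("features", "label")
```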

Conclusion

Compared to Python, there’s less community support for PySpark, especially when it comes to machine learning tasks.

I also didn’t find much open source development for PySpark other than mmlspark.

However, Spark is a very powerful tool when it comes to big data: I was able to train a lightgbm model in Spark on ~20M rows and ~100 features in 10 minutes.

Of course, runtime depends a lot on the model parameters, but it showcases the power of Spark.
