Want to Build Machine Learning Pipelines? A Quick Introduction using PySpark

Here’s the caveat: Spark’s OneHotEncoder does not directly encode the categorical variable.

First, we need to use StringIndexer to convert the variable into numerical form, and then use OneHotEncoderEstimator to encode multiple columns of the dataset.

It creates a Sparse Vector for each row.
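Below is a minimal sketch of that two-step encoding, assuming Spark 2.3+, where the class is called OneHotEncoderEstimator (it was renamed back to OneHotEncoder in Spark 3.0); the dataframe and column names here are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

spark = SparkSession.builder.appName("encoding_demo").getOrCreate()

# hypothetical data with a single categorical column
df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"],
)

# step 1: StringIndexer maps each category to a numeric index
indexer = StringIndexer(inputCol="category", outputCol="category_index")
indexed_df = indexer.fit(df).transform(df)

# step 2: OneHotEncoderEstimator turns each index into a sparse vector
encoder = OneHotEncoderEstimator(
    inputCols=["category_index"], outputCols=["category_vec"]
)
encoded_df = encoder.fit(indexed_df).transform(indexed_df)
encoded_df.show(truncate=False)
```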

Vector Assembler

A vector assembler combines a given list of columns into a single vector column.

This is typically used at the end of the data exploration and pre-processing steps.

At this stage, we usually work with a few raw or transformed features that can be used to train our model.

The Vector Assembler converts them into a single feature column in order to train the machine learning model (such as Logistic Regression).

It accepts numeric, boolean and vector type columns.
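Here is a minimal sketch with made-up column names; VectorAssembler collects the inputs into a single features vector:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("assembler_demo").getOrCreate()

# hypothetical dataframe with numeric columns
df = spark.createDataFrame(
    [(25, 1.0, 50000.0), (32, 0.0, 64000.0), (41, 1.0, 72000.0)],
    ["age", "clicked", "income"],
)

# combine the input columns into a single vector column
assembler = VectorAssembler(
    inputCols=["age", "clicked", "income"],
    outputCol="features",
)
assembler.transform(df).select("features").show(truncate=False)
```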

Building Machine Learning Pipelines using PySpark

A machine learning project typically involves steps like data preprocessing, feature extraction, model fitting, and result evaluation.

We need to perform a lot of transformations on the data in sequence.

As you can imagine, keeping track of them can potentially become a tedious task.

This is where machine learning pipelines come in.

A pipeline allows us to maintain the data flow of all the relevant transformations that are required to reach the end result.

We need to define the stages of the pipeline, which act as a chain of command for Spark to run.

Here, each stage is either a Transformer or an Estimator.

Transformers and Estimators

As the name suggests, Transformers convert one dataframe into another, either by updating the current values of a particular column (like converting categorical columns to numeric) or by mapping them to other values using defined logic.

An Estimator implements the fit() method on a dataframe and produces a model.

For example, LogisticRegression is an Estimator that trains a classification model when we call the fit() method.

Let’s understand this with the help of some examples.

Examples of Pipelines

Let’s create a sample dataframe with three columns as shown below.

Here, we will define some of the stages in which we want to transform the data and see how to set up the pipeline.
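First, the dataframe creation might look like the sketch below; the values are made up, and the two categorical columns match the stages defined next:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline_demo").getOrCreate()

# hypothetical data: one numeric column, two categorical columns
sample_df = spark.createDataFrame(
    [
        (1, "L101", "R"),
        (2, "L201", "C"),
        (3, "D111", "R"),
        (4, "F210", "R"),
        (5, "D110", "C"),
    ],
    ["feature_1", "category_1", "category_2"],
)
sample_df.show()
```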

We have created the dataframe.

Suppose we have to transform the data in the below order:

- stage_1: Label encode or string index the column category_1
- stage_2: Label encode or string index the column category_2
- stage_3: One-hot encode the indexed column category_2

At each stage, we will pass the input and output column names and set up the pipeline by passing the defined stages in the list of the Pipeline object.

The pipeline model then runs these stages one by one, in sequence, and gives us the end result.

Let’s see how to implement the pipeline.
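A sketch of those three stages chained together, continuing with sample_df from above and again assuming the Spark 2.3+ OneHotEncoderEstimator API:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

# each stage names its input and output columns
stage_1 = StringIndexer(inputCol="category_1", outputCol="category_1_index")
stage_2 = StringIndexer(inputCol="category_2", outputCol="category_2_index")
stage_3 = OneHotEncoderEstimator(
    inputCols=["category_2_index"], outputCols=["category_2_OHE"]
)

# the Pipeline runs the stages in the order they are listed
pipeline = Pipeline(stages=[stage_1, stage_2, stage_3])
pipeline_model = pipeline.fit(sample_df)
pipeline_model.transform(sample_df).show()
```

Notice that fit() and transform() are called once on the pipeline as a whole rather than on each stage individually.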

Now, let’s take a more complex example of setting up a pipeline.

Here, we will do transformations on the data and build a logistic regression model.

For this, we will create a sample dataframe which will be our training dataset with four features and the target label.
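One possible training dataframe, reusing the spark session from above; the values are made up, and feature_2 and feature_3 are the categorical columns referenced in the stages below:

```python
# hypothetical training data: four features plus a binary label
sample_data_train = spark.createDataFrame(
    [
        (2.0, "A", "S10", 40, 1.0),
        (1.0, "X", "E10", 25, 1.0),
        (4.0, "X", "S20", 10, 0.0),
        (3.0, "Z", "S10", 20, 0.0),
        (4.0, "A", "E10", 30, 1.0),
        (2.0, "Z", "S10", 40, 0.0),
        (5.0, "X", "D10", 10, 1.0),
    ],
    ["feature_1", "feature_2", "feature_3", "feature_4", "label"],
)
sample_data_train.show()
```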

Now, suppose this is the order of our pipeline:

- stage_1: Label encode or string index the column feature_2
- stage_2: Label encode or string index the column feature_3
- stage_3: One-hot encode the indexed columns of feature_2 and feature_3
- stage_4: Create a vector of all the features required to train a logistic regression model
- stage_5: Build a logistic regression model

We have to define the stages by providing the input and output column names.

The final stage would be to build a logistic regression model.

When we run the pipeline on the training dataset, it will execute the steps in sequence and add new columns to the dataframe (like rawPrediction, probability, and prediction).

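A sketch of the five stages wired together, continuing with sample_data_train from above:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler
from pyspark.ml.classification import LogisticRegression

# stages 1-2: index the categorical columns
stage_1 = StringIndexer(inputCol="feature_2", outputCol="feature_2_index")
stage_2 = StringIndexer(inputCol="feature_3", outputCol="feature_3_index")

# stage 3: one-hot encode both indexed columns
stage_3 = OneHotEncoderEstimator(
    inputCols=["feature_2_index", "feature_3_index"],
    outputCols=["feature_2_encoded", "feature_3_encoded"],
)

# stage 4: assemble all features into one vector column
stage_4 = VectorAssembler(
    inputCols=["feature_1", "feature_2_encoded", "feature_3_encoded", "feature_4"],
    outputCol="features",
)

# stage 5: the estimator that trains the model
stage_5 = LogisticRegression(featuresCol="features", labelCol="label")

regression_pipeline = Pipeline(stages=[stage_1, stage_2, stage_3, stage_4, stage_5])
model = regression_pipeline.fit(sample_data_train)

# transform adds the rawPrediction, probability, and prediction columns
model.transform(sample_data_train).select(
    "features", "rawPrediction", "probability", "prediction"
).show(truncate=False)
```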

Congrats! We have successfully set up the pipeline.

Let’s create a sample test dataset without the labels. This time, we do not need to define all the steps again.

We will just pass the data through the pipeline and we are done!
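A sketch with a made-up test set, continuing with the fitted model from above (every category value here also appears in the training data, so the indexers can handle it):

```python
# hypothetical test data: the same four feature columns, no label
sample_data_test = spark.createDataFrame(
    [
        (3.0, "Z", "S10", 40),
        (1.0, "X", "E10", 20),
        (4.0, "A", "S20", 10),
    ],
    ["feature_1", "feature_2", "feature_3", "feature_4"],
)

# the fitted pipeline model indexes, encodes, assembles, and predicts
model.transform(sample_data_test).select(
    "features", "probability", "prediction"
).show(truncate=False)
```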

Perfect!

End Notes

This was a short but intuitive article on how to build machine learning pipelines using PySpark.

I’ll reiterate it because it’s that important: you need to know how these pipelines work.

This is a big part of your role as a data scientist.

Have you worked on an end-to-end machine learning project before? Or been a part of a team that built these pipelines in an industry setting? Let’s connect in the comments section below and discuss.

I’ll see you in the next article in this PySpark for beginners series.

Happy learning!