Finding Donors: Classification Project With PySpark

Learn how to use Apache PySpark to empower your classification predictions

Victor Roman, Jun 19

Introduction

The aim of this article is to give a gentle introduction to classification problems in machine learning and to walk through a comprehensive guide to successfully developing a class prediction model using PySpark.

So without further ado, let’s jump into it!

Classification

If you want a deeper explanation of classification problems, their main algorithms and how to deal with them using machine learning techniques, I strongly suggest you check out the following article, where I explain these concepts thoroughly:

Supervised Learning: Basics of Classification and Main Algorithms. Learn how machines classify (towardsdatascience.com)

What is Classification?

Classification is a subcategory of supervised learning where the goal is to predict the categorical class labels (discrete, unordered values, group membership) of new instances based on past observations.

There are two main types of classification problems:

Binary classification: The typical example is e-mail spam detection, in which each e-mail either is spam (class 1) or isn’t (class 0).

Multi-class classification: For example, handwritten digit recognition (where classes go from 0 to 9).

The following example illustrates binary classification well: there are 2 classes, circles and crosses, and 2 features, X1 and X2.

The model is able to find the relationship between the features of each data point and its class, and to set a boundary line between them, so when provided with new data, it can estimate the class where it belongs, given its features.

In this case, the new data point falls into the circle subspace and, therefore, the model will predict its class to be a circle.
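The idea of a boundary line can be sketched in a few lines of plain Python. This is only an illustration: the weights, the bias and the sample points are hypothetical values chosen by hand, not parameters learned from data.

```python
def predict_class(x1, x2, w1=1.0, w2=1.0, bias=-5.0):
    """Classify a point by the side of the line w1*x1 + w2*x2 + bias = 0 it falls on."""
    score = w1 * x1 + w2 * x2 + bias
    # Points with a positive score lie on the "cross" side of the boundary
    return "cross" if score > 0 else "circle"


# Hypothetical new data point described by its two features (X1, X2)
new_point = (1.5, 2.0)
predicted = predict_class(*new_point)
print(predicted)
```

A trained model does the same thing conceptually, except that the boundary is estimated from the labeled training points instead of being fixed by hand.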

Classification Main Algorithms

In order to predict the class of certain samples, there are several classification algorithms that can be used.

In fact, when developing our machine learning models, we will train and evaluate a certain number of them, and we will keep those with better predicting performance.

A non-exhaustive list of some of the most used algorithms is:

Logistic Regression
Decision Trees
Random Forests
Support Vector Machines
K-Nearest Neighbors (KNN)

Classification Evaluation Metrics

When making predictions on events we can get four types of results:

True Positives: TP
True Negatives: TN
False Positives: FP
False Negatives: FN

All of these are usually arranged in a classification matrix (also known as a confusion matrix).

Accuracy measures how often the classifier makes the correct prediction.

It’s the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

Precision tells us what proportion of the events we classified as a certain class actually were that class.

It is the ratio of true positives to all predicted positives: TP / (TP + FP).

Recall (sensitivity) tells us what proportion of the events that actually were of a certain class were classified by us as that class.

It is the ratio of true positives to all actual positives: TP / (TP + FN).

Specificity is the proportion of samples that were correctly identified as negative out of the total number of actual negatives: TN / (TN + FP).

For classification problems that are skewed in their class distributions, accuracy by itself is not an appropriate metric.

Instead, precision and recall are much more representative.

These two metrics can be combined to get the F1 score, which is the harmonic mean of the precision and recall scores.

This score can range from 0 to 1, with 1 being the best possible F1 score (we take the harmonic mean as we are dealing with ratios).
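To make the metrics above concrete, here is a minimal sketch that computes them from the four confusion-matrix counts. The function name and the counts are hypothetical; the example is deliberately skewed (50 actual positives out of 1000 samples) to show how accuracy can look good while precision and recall are modest.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical skewed dataset: 1000 samples, only 50 actual positives
metrics = classification_metrics(tp=30, tn=930, fp=20, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.96, yet the model only finds 60% of the actual positives, which is the reason accuracy alone is misleading on skewed data.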

ROC

Finally, the metric that we will use in our project is based on the Receiver Operating Characteristic, or ROC.

The ROC curve tells us how well the model can distinguish between two classes.

The area under this curve (AUC) takes values from 0 to 1 (AUC ∈ [0, 1]).

The better the model is, the closer to 1 this value will be.

As can be seen in the image above, our classification model will draw a separation boundary (a threshold) between the classes and:

Every sample that falls to the left of the threshold will be classified as the negative class.

Every sample that falls to the right of the threshold will be classified as the positive class.

And the distribution of predictions will be the following:

Trade-off Between Sensitivity & Specificity

When we decrease the threshold, we end up predicting more positive values and increasing sensitivity.

Therefore, specificity decreases.

When we increase the threshold, we end up predicting more negative values and increasing specificity.

Therefore, sensitivity decreases.

As sensitivity ⬇️, specificity ⬆️
As specificity ⬇️, sensitivity ⬆️

In order to represent the classification performance, we consider (1 - specificity), the false positive rate, instead of specificity.

So, when sensitivity increases, (1 - specificity) will also increase.

And that is how we build the ROC curve: by plotting sensitivity against (1 - specificity) for every possible threshold.
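The threshold sweep described above can be sketched in plain Python. This is an illustrative implementation, not the one PySpark uses internally: the function names, labels and scores are hypothetical, and the sketch assumes both classes are present in the labels.

```python
def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity) points, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Threshold above every score: nothing is predicted positive
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step predicts more positives
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Integrate the curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical true labels and predicted scores for six samples
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points)
print(auc)
```

Each time the threshold is lowered, both coordinates can only stay the same or grow, which is the trade-off between sensitivity and specificity seen from the curve's point of view.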

Examples of Performance

As stated before, the closer the evaluator gets to 1, the better the predictive performance of the model and the smaller the overlapping area between the classes will be.

Finding Donors Project

A complete walkthrough of the project can be found in the following article:

Machine Learning Classification Project: Finding Donors. Find and predict who will donate to a charity with this classification model! (towardsdatascience.com)

In the present article we will focus on the PySpark implementation of the project.

As a summary, throughout the project, we will use a number of different supervised algorithms to precisely predict individuals’ income using data collected from the 1994 U.S. Census.

We will then choose the best candidate algorithm from preliminary results and further optimize this algorithm to best model the data.

Our goal with this implementation is to build a model that accurately predicts whether an individual makes more than $50,000.

From our previous research, we found that the individuals who are most likely to donate money to a charity are the ones that make more than $50,000.

Therefore, we are facing a binary classification problem, where we want to determine whether an individual makes more than $50K a year (class 1) or does not (class 0).

The dataset for this project originates from the UCI Machine Learning Repository.

Data

The census dataset consists of approximately 45,222 data points, with each data point having 13 features.

Features

age: Age
workclass: Working Class (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
education_level: Level of Education (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool)
education-num: Number of educational years completed
marital-status: Marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
occupation: Work Occupation (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
relationship: Relationship Status (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
race: Race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
sex: Sex (Female, Male)
capital-gain: Monetary Capital Gains
capital-loss: Monetary Capital Losses
hours-per-week: Average Hours Per Week Worked
native-country: Native Country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)

Target Variable

income: Income Class (<=50K, >50K)

Import Data & Exploratory Data Analysis (EDA)

We will start by importing the dataset and displaying the first rows of the data as a first step of the exploratory data analysis.

    # File location and type
    file_location = "/FileStore/tables/census.csv"
    file_type = "csv"

    # CSV options
    infer_schema = "true"
    first_row_is_header = "true"
    delimiter = ","

    # The applied options are for CSV files.
    # For other file types, these will be ignored.
    df = (spark.read.format(file_type)
          .option("inferSchema", infer_schema)
          .option("header", first_row_is_header)
          .option("sep", delimiter)
          .load(file_location))

    display(df)

We will now display a summary of the dataset’s information by using the .describe() method.

    # Display Dataset's Summary
    display(df.describe())

Let’s also find out the dataset’s schema.

    # Display Dataset's Schema
    df.printSchema()

Prepare the Data

As we want to predict whether or not the individual is earning more than $50K per year, we will replace the label ‘income’ with ‘>50K’.

To do so, we will create a new column whose values will be 1 or 0 depending on whether or not the individual makes more than $50K per year.

We will then drop the original income column.

    # Import pyspark functions
    from pyspark.sql import functions as F

    # Add the new binary label column to the dataset
    df = df.withColumn('>50K', F.when(df.income == '<=50K', 0).otherwise(1))

    # Drop the income label
    df = df.drop('income')

    # Show dataset's columns
    df.columns

Vectorizing Numerical Features and One-Hot Encoding Categorical Features

In order to be used for training the models, features in Apache Spark must be transformed into vectors.

This process will be done using certain classes that we will explore now.

First, we will import the relevant libraries and methods.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml.classification import (DecisionTreeClassifier, GBTClassifier,
                                           RandomForestClassifier, LogisticRegression)
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

Now, we will select the categorical features.

    # Selecting categorical features
    # ('hours-per-week' is a numeric column that is treated as categorical here)
    categorical_columns = [
        'workclass',
        'education_level',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'hours-per-week',
        'native-country',
    ]

In order to one-hot encode these categorical features, we will first pass them through an indexer and then through an encoder.

    # Index the string values of each categorical column
    indexers = [
        StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
        for c in categorical_columns
    ]

    # One-hot encode the indexed values of each column
    encoders = [
        OneHotEncoder(dropLast=False,
                      inputCol=indexer.getOutputCol(),
                      outputCol="{0}_encoded".format(indexer.getOutputCol()))
        for indexer in indexers
    ]

Now, we will join the categorical encoded features with the numerical ones and make a vector with both of them.
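As a side illustration, the sketch below mimics in plain Python what indexing, one-hot encoding and joining with a numerical column produce for a few rows. It is not Spark code: the helper names and the toy rows are hypothetical, and the alphabetical tie-breaking between equally frequent categories is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index, most frequent category first."""
    counts = Counter(values)
    # Assumption of this sketch: ties are broken alphabetically
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {category: index for index, category in enumerate(ordered)}


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


# Hypothetical toy rows: (sex, age)
rows = [("Male", 39.0), ("Female", 28.0), ("Male", 50.0)]

sex_index = fit_string_index([sex for sex, _ in rows])

# Keep one slot per category (as with dropLast=False) and append the numeric value
features = [one_hot(sex_index[sex], len(sex_index)) + [age] for sex, age in rows]

print(sex_index)
for row in features:
    print(row)
```

Each row ends up as one flat numeric vector, which is the shape the classifiers expect in their features column.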

    # Vectorizing encoded values
    categorical_encoded = [encoder.getOutputCol() for encoder in encoders]
    numerical_columns = ['age', 'education-num', 'capital-gain', 'capital-loss']
    inputcols = categorical_encoded + numerical_columns
    assembler = VectorAssembler(inputCols=inputcols, outputCol="features")

Now, we will set up a pipeline to automate these stages.

    pipeline = Pipeline(stages=indexers + encoders + [assembler])
    model = pipeline.fit(df)

    # Transform data
    transformed = model.transform(df)
    display(transformed)

Finally, we will select a dataset with only the relevant columns.

    # Keep only the features vector and the label
    final_data = transformed.select('features', '>50K')

Initializing the Models

For this project, we will study the predictive performance of three different classification algorithms:

Decision Trees
Random Forests
Gradient Boosted Trees

    # Initialize the classification models
    dtc = DecisionTreeClassifier(labelCol='>50K', featuresCol='features')
    rfc = RandomForestClassifier(numTrees=150, labelCol='>50K', featuresCol='features')
    gbt = GBTClassifier(labelCol='>50K', featuresCol='features', maxIter=10)

Splitting Data

We will perform a classic 80/20 split between training and testing data.

    train_data, test_data = final_data.randomSplit([0.8, 0.2])

Training the Models

    dtc_model = dtc.fit(train_data)
    rfc_model = rfc.fit(train_data)
    gbt_model = gbt.fit(train_data)

Obtaining Predictions

    dtc_preds = dtc_model.transform(test_data)
    rfc_preds = rfc_model.transform(test_data)
    gbt_preds = gbt_model.transform(test_data)

Evaluating Model’s Performance

As stated before, our evaluation metric will be the area under the ROC curve, which is the default metric of BinaryClassificationEvaluator.

We will initialize its class and pass it the predictions in order to obtain the value.

    my_eval = BinaryClassificationEvaluator(labelCol='>50K')

    # Display Decision Tree evaluation metric
    print('DTC')
    print(my_eval.evaluate(dtc_preds))

    # Display Random Forest evaluation metric
    print('RFC')
    print(my_eval.evaluate(rfc_preds))

    # Display Gradient Boosting Tree evaluation metric
    print('GBT')
    print(my_eval.evaluate(gbt_preds))

The best predictor is the Gradient Boosting Tree.

Actually, 0.911 is a pretty good value, and when we display its predictions we will see the following:

Improving Model’s Performance

We will try to do this by performing the grid search cross validation technique.

With it, we will evaluate the performance of the model with different combinations of previously defined sets of hyperparameter values.

The hyperparameters that we will tune are:

Max Depth
Max Bins
Max Iterations

    # Import libraries
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

    # Set the parameters grid
    paramGrid = (ParamGridBuilder()
                 .addGrid(gbt.maxDepth, [2, 4, 6])
                 .addGrid(gbt.maxBins, [20, 60])
                 .addGrid(gbt.maxIter, [10, 20])
                 .build())

    # Initializing the cross validator class
    cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid,
                        evaluator=my_eval, numFolds=5)

    # Run cross validations.
    # This can take about 6 minutes since it is training over 20 trees
    cvModel = cv.fit(train_data)
    gbt_predictions_2 = cvModel.transform(test_data)
    my_eval.evaluate(gbt_predictions_2)

We have obtained a tiny improvement in the predictive performance.

And the computation time went up to almost 20 minutes.

So, in these cases we should analyze if the improvement is worth the effort.
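One way to anticipate that cost before launching the search is to count how many models it will train. The sketch below is plain Python, not Spark code. It reuses the candidate values and the number of folds from the grid above, while the variable names are hypothetical.

```python
from itertools import product

# Same candidate values as in the parameter grid above
grid = {
    "maxDepth": [2, 4, 6],
    "maxBins": [20, 60],
    "maxIter": [10, 20],
}
num_folds = 5

# Every combination of one value per hyperparameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each combination is trained once per fold during the search
fits_during_search = len(combinations) * num_folds

# CrossValidator then refits the best combination on the whole training set
total_fits = fits_during_search + 1

print(len(combinations), fits_during_search, total_fits)
```

Since each of these fits is itself a boosted ensemble of up to 20 iterations, adding a single extra value to any hyperparameter multiplies the work noticeably, which is worth weighing against a small gain in the metric.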

Conclusion

Throughout this article we built a machine learning classification project from end to end.

We also learned and obtained several insights about classification models and the keys to developing one with good performance, using PySpark, its methods and implementations.

We have also learned how to tune our algorithms once a well-performing model has been identified.

In the next articles we will learn how to develop regression models in PySpark.

So, if you are interested in the topic, I strongly suggest you stay tuned!
