Using AutoML Toolkit to Automate Loan Default Predictions

In a traditional ML pipeline, there are many hand-written components to perform the tasks of featurization and model building and tuning.

  The diagram below provides a graphical representation of these stages.

These stages consist of:

Feature Engineering: We first define potential features, vectorize them (different steps are required for numeric and categorical data), and then choose the features we will use.

Model Building and Tuning: These are the highly repetitive stages of building and training our model, executing the model and reviewing the metrics, tuning the model, making changes to the model, and repeating this process until we arrive at our final model.

In the next few sections, we will:

- Describe, with code and visualizations, these steps as extracted from the Evaluating Risk for Loan Approvals using XGBoost (0.90) notebook
- Show how much simpler this is using the AutoML Toolkit, as noted in the Using AutoML Toolkit to Simplify Loan Risk Analysis XGBoost Model Optimization notebook

Our Feature Presentation

After obtaining reliable and clean data, one of the first steps for a data scientist is to identify which columns (i.e., features) will be used for their model.

Identify Important Features: Traditional ML Pipelines

There are typically a number of steps when choosing which features you want to use for your model. In our example, we are creating a binary classifier (is this a bad loan or not?), so we need to define the potential features, vectorize the numeric and categorical features, and finally choose the features that will be used to create the model.

    // Load loan risk analysis dataset
    val sourceData = spark.…")

    // View data
    display(sourceData)

As you can see from the above table, the loan risk analysis dataset contains both numeric and categorical columns.

This is an important distinction as there will be a different set of steps for numeric and categorical columns to ultimately assemble a vector that will be used as the input to your ML model.
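To make the distinction concrete, here is a minimal, self-contained Python sketch of the two paths on toy data. It is not the notebook's code: the row values and the helper names (`impute_mean`, `string_index`, `one_hot`) are hypothetical, and only the column names `loan_amnt` and `term` come from the dataset. It assumes mean imputation and frequency-ordered indexing, and it keeps every category in the one-hot vector for readability (Spark's encoder drops the last category by default).

```python
# Toy rows with hypothetical values; column names follow the loan dataset.
rows = [
    {"loan_amnt": 10000.0, "term": "36 months"},
    {"loan_amnt": None, "term": "60 months"},
    {"loan_amnt": 20000.0, "term": "36 months"},
]


def impute_mean(values):
    """Replace missing numeric values with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def string_index(values):
    """Map each category to an integer index, most frequent category first."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: i for i, category in enumerate(ordered)}


def one_hot(index, size):
    """Encode an integer index as a binary vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Numeric path: fill in the missing value.
amounts = impute_mean([r["loan_amnt"] for r in rows])

# Categorical path: string -> index -> binary vector.
term_index = string_index([r["term"] for r in rows])

# Assemble one feature vector per row: categorical part followed by numeric part.
features = [
    one_hot(term_index[r["term"]], len(term_index)) + [amount]
    for r, amount in zip(rows, amounts)
]

for vector in features:
    print(vector)
```

The numeric column needs only imputation before it can join the vector, while the categorical column needs two extra steps, which is why the pipeline later in this post has separate stages for each.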

To better understand if there is a correlation between independent variables, we can quickly examine sourceData using the display command to view this data as a scatterplot.

You can further analyze this data by calculating the correlation coefficients; a popular method is to use pandas .corr(). While our Databricks notebook is written in Scala, we can quickly and easily use Python pandas code as noted below.

    %python
    # Calculate using Pandas `corr`
    pdf_corr = spark.sql("select loan_amnt, emp_length, annual_inc, dti, delinq_2yrs, revol_util, total_acc, credit_length_in_years, int_rate, net, issue_year, label from sourceData").toPandas().corr()

    # View correlation coefficients by loan_amnt
    display(pdf_corr.loc[:, ["loan_amnt"]])

As noted in the preceding scatterplots, there are no obvious highly correlated numeric variables.

Based on this assessment, we will keep all of the columns when creating the model.
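For readers who want to see what a correlation coefficient measures, the following is a small pure-Python sketch of the Pearson coefficient, which is the default method of pandas `corr`. The function name `pearson` and the toy values are hypothetical and are not taken from the loan dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical toy values: income rises in lockstep with the loan amount,
# while the debt-to-income ratio falls as the loan amount rises.
loan_amnt = [5000.0, 10000.0, 15000.0, 20000.0]
annual_inc = [40000.0, 60000.0, 80000.0, 100000.0]
dti = [30.0, 25.0, 12.0, 10.0]

print(pearson(loan_amnt, annual_inc))  # close to +1: strongly correlated
print(pearson(loan_amnt, dti))         # negative: inversely related
```

Values near +1 or -1 would flag a pair of columns as largely redundant; values near 0, as in the loan data, give no reason to drop a column.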

Identifying Important Features: AutoML Toolkit

It is important to note that identifying important features can be a highly iterative and time-consuming process. There are so many different techniques that can be applied that this process is a book in itself (e.g., Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists).

The AutoML Toolkit includes the class FeatureImportances that automatically identifies the most important features; this is all done by the following code snippet.

    // Calculate Feature Importance (fi)
    val fiConfig = ConfigurationGenerator.generateConfigFromMap("XGBoost", "classifier", genericMapOverrides)

    // Since we're using XGBoost here, we cannot have parallelism > 2x number of nodes
    fiConfig.tunerConfig.tunerParallelism = nodeCount * 2
    val fiMainConfig = ConfigurationGenerator.generateFeatureImportanceConfig(fiConfig)

    // Generate Feature Importance
    val importances = new FeatureImportances(sourceData, fiMainConfig, "count", 20.0)
      .generateFeatureImportances()

    // Display Feature Importance
    display(importances.importances)

In this specific example, thirty (30) different Spark jobs were automatically generated and executed to find the most important features that need to be included.

Note, the number of Spark jobs kicked off will vary depending on several factors.

Instead of spending days or weeks manually exploring the data, we identified these features in minutes with four lines of code.

Let’s Build it!

Now that we have identified our most important features, let’s build, train, validate, and tune our ML pipeline for our loan risk dataset.

Traditional Model Building and Tuning

The following steps are an abridged version of the Evaluating Risk for Loan Approvals using XGBoost (0.90) notebook code.

First, we will define our categorical and numeric columns.

    // Define our categorical and numeric columns
    val categoricals = Array("term", "home_ownership", "purpose", "addr_state", "verification_status", "application_type")
    val numerics = Array("loan_amnt", "emp_length", "annual_inc", "dti", "delinq_2yrs", "revol_util", "total_acc", "credit_length_in_years")

Then we will build our ML pipeline as noted by the code snippet below.

As noted by the comments in the code, our pipeline has the following steps:

- VectorAssembler: Assemble a features vector based on our feature columns that have been processed by the following:
  - Imputer estimator for completing missing values for our numeric data
  - StringIndexer to encode a string value to a numeric value
  - OneHotEncoding to map a categorical feature (represented by the StringIndexer numeric value) to a binary vector
- LabelIndexer: Specify what our label is (i.e., the true value) vs. our predicted label (i.e., the predicted value of a bad or good loan)
- StandardScaler: Normalizes our features vector to minimize the impact of feature values of different scale.

Note, this example is one of our simpler binary classification examples; there are many more methods that can be used to extract, transform, and select features.

    import org.apache.spark.ml.feature.{Imputer, OneHotEncoderEstimator, StandardScaler}
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.feature.StringIndexer

    // Imputation estimator for completing missing values
    val numerics_out = numerics.map(_ + "_out")
    val imputers = new Imputer()
      .setInputCols(numerics)
      .setOutputCols(numerics_out)

    // Apply StringIndexer for our categorical data
    val categoricals_idx = categoricals.map(_ + "_idx")
    val indexers = categoricals.map(
      x => new StringIndexer().setInputCol(x).setOutputCol(x + "_idx").setHandleInvalid("keep")
    )

    // Apply OHE for our StringIndexed categorical data
    val categoricals_class = categoricals.map(_ + "_class")
    val oneHotEncoders = new OneHotEncoderEstimator()
      .setInputCols(categoricals_idx)
      .setOutputCols(categoricals_class)

    // Set feature columns
    val featureCols = categoricals_class ++ numerics_out

    // Create assembler for our feature columns
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")

    // Establish label
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("predictedLabel")

    // Apply StandardScaler
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
      .setWithMean(true)
      .setWithStd(true)

    // Build pipeline array
    val pipelineAry = indexers ++ Array(oneHotEncoders, imputers, assembler, labelIndexer, scaler)

With our pipeline and our decision to use the XGBoost model (as noted in Loan Risk Analysis with XGBoost and Databricks Runtime for Machine Learning), let’s build, train, and validate our model.

    import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
    import org.apache.spark.ml.Pipeline

    // Create XGBoostClassifier
    val xgBoostClassifier = new XGBoostClassifier(
      Map[String, Any](
        "num_round" -> 5,
        "objective" -> "binary:logistic",
        "nworkers" -> 16,
        "nthreads" -> 4
      )
    )
      .setFeaturesCol("scaledFeatures")
      .setLabelCol("predictedLabel")

    // Create XGBoost Pipeline
    val xgBoostPipeline = new Pipeline().setStages(pipelineAry :+ xgBoostClassifier)

    // Create XGBoost Model based on the training dataset
    val xgBoostModel = xgBoostPipeline.fit(dataset_train)

    // Test our model against the validation dataset
    val predictions = xgBoostModel.transform(dataset_valid)

By using the BinaryClassificationEvaluator included in Spark MLlib, we can evaluate the performance of the model.

    // Include BinaryClassificationEvaluator
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    // Evaluate
    val evaluator = new BinaryClassificationEvaluator()
      .setRawPredictionCol("probability")

    // Calculate Validation AUC
    val auc = evaluator.evaluate(predictions)

    // AUC Value
    // 0.6507

With an AUC value of 0.6507, let’s see if we can tune this model further by setting a paramGrid and using a CrossValidator().
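As a reminder of what this number means, the sketch below computes AUC (area under the ROC curve) in pure Python as the probability that a randomly chosen bad loan receives a higher score than a randomly chosen good loan. The function name `auc_score` and the toy labels and scores are hypothetical; this pairwise form is only an illustration and is not how Spark computes the metric internally.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = bad loan, 0 = good loan.
labels = [1, 1, 0, 0]

print(auc_score(labels, [0.9, 0.4, 0.6, 0.1]))  # one of four pairs is mis-ranked
print(auc_score(labels, [0.9, 0.8, 0.2, 0.1]))  # every bad loan outranks every good loan
print(auc_score(labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
```

Read this way, 0.5 is what an uninformative model achieves and 1.0 is a perfect ranking, so 0.6507 is better than random but leaves a lot of room for tuning.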

It is important to note that you will need to understand the model options (e.g., XGBoost Classifier maxDepth) to properly choose the parameters to try.

    import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}

    // Build parameter grid
    val paramGrid = new ParamGridBuilder()
      .addGrid(xgBoostClassifier.maxDepth, Array(4, 7))
      .addGrid(xgBoostClassifier.eta, Array(0.1, 0.6))
      .addGrid(xgBoostClassifier.numRound, Array(5, 10))
      .build()

    // Set evaluator as a BinaryClassificationEvaluator
    val evaluator = new BinaryClassificationEvaluator()
      .setRawPredictionCol("probability")

    // Establish CrossValidator()
    val cv = new CrossValidator()
      .setEstimator(xgBoostPipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(4)

    // Run cross-validation, and choose the best set of parameters.
    val cvModel = cv.fit(dataset_train)

    // Test our model against the cvModel and validation dataset
    val predictions_cv = cvModel.transform(dataset_valid)

    // Calculate cvModel Validation AUC
    val cvAUC = evaluator.evaluate(predictions_cv)

    // AUC Value
    // 0.6732

After many iterations of choosing different parameters and testing a laundry list of different values for those parameters, using traditional model building and tuning we were able to improve the model so it has an AUC = 0.6732 (up from 0.6507).
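To give a feel for why this manual search is expensive, here is a short Python sketch that enumerates the same grid used in the cross-validation step (two values each for maxDepth, eta, and numRound, with four folds). The variable names are illustrative only; the grid values and fold count are the ones shown in the Scala snippet.

```python
from itertools import product

# Grid values and fold count as used in the cross-validation snippet.
param_grid = {
    "maxDepth": [4, 7],
    "eta": [0.1, 0.6],
    "numRound": [5, 10],
}
num_folds = 4

# Every combination of one value per parameter.
names = list(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

# Each combination is trained once per fold.
total_fits = len(combinations) * num_folds

print(len(combinations))  # 8 parameter combinations
print(total_fits)         # 32 pipeline fits
```

Even this small grid means 32 pipeline fits, plus a final refit of the best combination on the full training set, and each additional parameter or value multiplies that count.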


AutoML Model Building and Tuning

With all of the traditional model building and tuning steps taking days (or weeks), we were able to manually build a model with a better-than-random AUC value. But with AutoML Toolkit, the AutomationRunner allows us to perform all of the above steps with a few lines of code.

With the following five lines of code, AutoML Toolkit’s AutomationRunner performs all of the previously noted steps (build, train, validate, tune, repeat) automatically.

    val modelingType = "XGBoost"
    val conf = ConfigurationGenerator.generateConfigFromMap(modelingType, "classifier", genericMapOverrides)

    // Adjust model tuner configuration
    conf.tunerConfig.tunerParallelism = nodeCount

    // Generate configuration
    val XGBConfig = ConfigurationGenerator.generateMainConfig(conf)

    // Select on the important features
    val runner = new AutomationRunner(sourceData)
      .setMainConfig(XGBConfig)
      .runWithConfusionReport()

In a few hours (or minutes), AutoML Toolkit finds the best model and stores the model and the inference data as noted in the output of the previous code snippet.

    Model will be saved to path dbfs:/ml/dennylee/automl/models/dl_AutoML_Demo/BestRun/classifier_XGBoost_421b6cbe1e954ebba119eb3bfc2997bf/bestModel
    Inference DF will be saved to dbfs:/ml/dennylee/automl/inference/dl_AutoML_Demo/3959076_best/421b6cbe1e954ebba119eb3bfc2997bf_best
    modelingType: String = XGBoost

Because the AutoML Toolkit makes use of the Databricks MLflow integration, all of the model metrics are automatically logged.

As noted in the MLflow details (preceding screenshot), the AUC (areaUnderROC) has improved to a value of 0.995!

How did the AutoML Toolkit do this?

The details behind how the AutoML Toolkit was able to do this will be discussed in a future blog.

  From a high level, the AutoML toolkit was able to find much better hyperparameters because it tested and tuned all modifiable hyperparameters in a distributed fashion using a collection of optimization algorithms.

Incorporated within the AutoML Toolkit is the understanding of how to use the parameters extracted from the algorithm source code (e.g., XGBoost in this case).

Clearing up the Confusion

With the remarkable improvement in the AUC value, how much better does the AutoML XGBoost model perform in comparison to the hand-created one? Because this is a binary classification problem, we can clear up the confusion using confusion matrices.

The confusion matrices from both the hand-made model and AutoML Toolkit notebooks are included below.

  To match the analysis of what we did in the past (ala Loan Risk Analysis with XGBoost and Databricks Runtime for Machine Learning), we’re evaluating for loans that were issued after 2015.

In the preceding graphic, the confusion matrix on the left is from the hand-made XGBoost model while the one on the right is from AutoML Toolkit. While both models do a great job correctly identifying good loans (True: Good, Predicted: Good), the AutoML model performs significantly better on identifying bad loans (True: Bad, Predicted: Bad – 21963 vs. 1370) as well as preventing false positives (True: Good, Predicted: Bad – 0 vs. …).
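For reference, the four cells of a confusion matrix can be tallied with a few lines of Python. This is a minimal sketch on hypothetical toy data (eight loans, not the counts from the graphic), with a hypothetical helper name `confusion_counts`; it uses the same convention as this post, where 1 means a bad loan.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count (true label, predicted label) pairs; 1 = bad loan, 0 = good loan."""
    return Counter(zip(labels, predictions))


# Hypothetical toy data for eight loans.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(labels, predictions)
names = {1: "Bad", 0: "Good"}
for label in (1, 0):
    for prediction in (1, 0):
        cell = counts[(label, prediction)]
        print(f"True: {names[label]}, Predicted: {names[prediction]} -> {cell}")
```

In this toy example only one of the three bad loans is caught and one good loan is wrongly flagged, which mirrors the kind of weakness the hand-made model shows on the bad-loan row.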


Understanding the Business Value

Let’s translate this confusion matrix into business value; the definitions are:

Prediction  Label (Is Bad Loan)  Short Description    Long Description
1           1                    Loss Avoided         Correctly found bad loans
1           0                    Profit Forfeited     Incorrectly labeled bad loans
0           1                    Loss Still Incurred  Incorrectly labeled good loans
0           0                    Profit Retained      Correctly found good loans

To review the dollar value associated with our confusion matrix for our hand-made model, we will use the following code snippet.

    // Value gained from implementing model = -(loss avoided-profit forfeited)
    display(predictions_cv.groupBy("prediction", "label").…alias("sum_net_mill")))

To review the dollar value associated with our confusion matrix for the AutoML Toolkit model, we will use the following code snippet.

    // Value gained from implementing model = -(loss avoided-profit forfeited) for 2015 data
    display(runner.…where($"issue_year" > 2015).groupBy("prediction", "label").…alias("sum_net_mill")))

Business value is calculated as value = -(loss avoided – profit forfeited), with all amounts in millions of dollars.

Model           Loss Avoided  Profit Forfeited  Value
Hand Made       -20.16        3.06              $23.22M
AutoML Toolkit  -267.24       0                 $267.24M

As you can observe, the potential profit saved by using the AutoML Toolkit is more than 10x that of our hand-made model, with savings of $267.24M.
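The arithmetic behind the table can be checked with a few lines of Python. The function name `business_value` is illustrative; the inputs are the loss-avoided and profit-forfeited figures (in millions of dollars) from the table.

```python
def business_value(loss_avoided, profit_forfeited):
    """Value = -(loss avoided - profit forfeited), in millions of dollars."""
    return -(loss_avoided - profit_forfeited)


# Figures from the table above (millions of dollars).
hand_made = business_value(-20.16, 3.06)
automl = business_value(-267.24, 0.0)

print(f"Hand Made:      ${hand_made:.2f}M")
print(f"AutoML Toolkit: ${automl:.2f}M")
print(f"Ratio:          {automl / hand_made:.1f}x")
```

The ratio works out to roughly 11.5x, which is where the "more than 10x" figure comes from.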


AutoML Toolkit: Less Code and Faster

With the AutoML Toolkit, you can write less code to deliver better results faster. For this loan risk analysis with XGBoost example, we saw a >10x improvement in performance (savings of $267.24M vs. $23.22M with the original technique).

  The AutoML toolkit was able to find much better hyperparameters because it automatically generated, tested, and tuned all of the algorithm’s modifiable hyperparameters in a distributed fashion.

Try out the AutoML Toolkit with the Using AutoML Toolkit to Simplify Loan Risk Analysis XGBoost Model Optimization notebook on Databricks today!
