Predicting Wine Quality with Gradient Boosting Machines

Tutorial for training and deploying a GBM model

By Claire Longo

Introduction

This article gives a tutorial for training a Gradient Boosting Machine (GBM) model that predicts the quality of a wine using only the price and a written description of the wine.

In many real data science projects, an important next phase is deploying a live model that others can use to get real-time predictions.

In this article, I demonstrate the model training and deployment process using AWS SageMaker to create a live endpoint.

I wanted to purchase a bottle of “summer water” rosé from Winc, an online wine subscription service.

Since I’m unable to try it before I buy, I wanted an estimate of its quality before deciding to purchase.

At the end of this post, we’ll see the quality prediction for this wine using the trained GBM model.

Let’s jump into the data and modeling!

Data

This model uses gradient boosted regression trees and real data from Kaggle to predict the quality points of a bottle of wine (y).

Wine “points” are on a scale of 0–100 and are categorized by Wine Spectator as follows:

95–100 Classic: a great wine
90–94 Outstanding: a wine of superior character and style
85–89 Very good: a wine with special qualities
80–84 Good: a solid, well-made wine
75–79 Mediocre: a drinkable wine that may have minor flaws
50–74 Not recommended

Feature Engineering

The features (X) used in this model are the price of the bottle of wine and latent features obtained by performing Latent Semantic Analysis (LSA) on the unstructured text of the descriptions.

The latent text features are engineered using the following steps:

1. The text data is cleaned and processed using typical methods for cleaning document strings, such as removing punctuation and converting text to lower case.

2. Feature engineering is performed on the wine descriptions using LSA (TF-IDF and SVD are used to vectorize and then compress the body of the text into 25 latent features).

Figure 1: Example of wine description, price, and points data.

The Model

The gradient boosting regression tree model is fit using xgboost in Python and evaluated using Mean Absolute Error (MAE).

So what are Gradient Boosting Machines?

In general, Gradient Boosting Machines are a collection of weak learners ensembled into one model to create a strong learner.

Gradient Boosting Machines can be used for both regression and classification tasks.

They are typically applied to tree-based models but could, in theory, be applied to any type of weak learner.

Figure 2: This visualization of one of the weak learner trees in the XGBoost model illustrates how the tree splits on the price and latent description of the wine.

We can see that the price is very influential for predicting the wine quality points! This weak learner also found something meaningful in one of the latent description topics from the LSA.

The training data is used to fit each weak learner.

Boosting and bagging can both be used to ensemble these weak learners into one model.

Bagging builds all the weak learners in parallel.

Boosting takes a more systematic approach: it builds the weak learners sequentially, with each one attempting to explain the patterns missed by the previous learner by up-weighting the observations that the previous learner predicted incorrectly.

In stochastic gradient boosting, a sample of the training data is used to fit each weak learner.
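For example, in xgboost this row sampling is controlled by the subsample hyperparameter; the other values below are just illustrative placeholders:

```python
# In xgboost, stochastic gradient boosting corresponds to row subsampling:
# each boosting round fits its tree on a random fraction of the training rows
stochastic_params = {
    "objective": "reg:squarederror",
    "eta": 0.1,          # learning rate
    "max_depth": 5,
    "subsample": 0.8,    # use 80% of the rows for each boosting round
}
```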

AdaBoost

AdaBoost is the simplest effective boosting algorithm for binary classification.

It sequentially fits decision trees with one split.

These little weak learner trees are called “decision stumps”.

Each observation in the training data receives a weight based on the classification error, and the next decision stump is trained using the updated weights on the training data.

Each stump is also assigned a weight based on the classifier’s total misclassification rate.

The model then ensembles the predictions using the weights on each stump.

Stumps with a high number of misclassifications receive lower weights, causing their predictions to contribute less to the ensembled prediction.
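As a minimal sketch of this setup (a toy dataset, not the wine model), scikit-learn’s AdaBoostClassifier can be combined with depth-1 trees to get exactly these decision stumps:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy binary-classification data as a stand-in for any labeled dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each weak learner is a "decision stump": a tree with a single split.
# AdaBoost re-weights misclassified observations before fitting each new stump.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```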

Gradient Boosting

Gradient boosting sequentially fits weak learners to the gradient (derivative) of a loss function in an attempt to explain the patterns missed by the previous weak learner.

An additive model is used to ensemble the weak learners as each one is fit.

The output of the new weak learner is added to the output of the previous weak learner to adjust the predictions.

This results in a recursive equation where each weak learner attempts to explain a pattern not picked up by the previous ones.

The first weak learner is initialized as a constant, such as the mean.

Then, a function h(x) is fit to the residuals.

For squared-error loss, the residuals are exactly the (negative) gradient of the loss function with respect to the current prediction.
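In standard notation (a generic sketch of the usual formulation, not the exact equations from the original post), the recursion looks like this:

```latex
F_0(x) = \bar{y}, \qquad
r_{im} = -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}, \qquad
F_m(x) = F_{m-1}(x) + \gamma_m \, h_m(x)
```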

Here, h(x) is the weak learner fit to the gradient of the loss function, and gamma represents the learning rate, or step size.

The resulting model has many additive terms, each nudging the prediction in a different direction.

Because the weak learners are fit to predict the gradient of the loss function, any differentiable loss function can be used, allowing this method to be applied to both classification and regression problems.
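To make the recursion concrete, here is a minimal from-scratch sketch of gradient boosting for squared-error loss (illustrative only; the actual wine model uses xgboost):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Toy gradient boosting for squared-error loss, where residuals = negative gradient."""
    y = np.asarray(y, dtype=float)
    f0 = y.mean()                                  # F_0: initialize with a constant (the mean)
    pred = np.full(len(y), f0)
    learners = []
    for _ in range(n_rounds):
        residuals = y - pred                       # negative gradient of 0.5 * (y - F)^2 w.r.t. F
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                     # h_m: weak learner fit to the residuals
        pred += learning_rate * tree.predict(X)    # F_m = F_{m-1} + gamma * h_m
        learners.append(tree)
    return f0, learners

def gradient_boost_predict(X, f0, learners, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in learners:
        pred += learning_rate * tree.predict(X)
    return pred
```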

Training the Model in Python

First, start by importing the required Python libraries.
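The original import cell isn’t reproduced here; a plausible set, given the steps that follow, looks like this (the later snippets assume these imports):

```python
import re
import string

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
```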

Now, load in the wine data and view a sample of it.
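Assuming the Kaggle wine reviews file is saved locally as winemag-data-130k-v2.csv (the filename and column selection are assumptions):

```python
# Load the Kaggle wine reviews data and keep the columns the model uses
wine = pd.read_csv("winemag-data-130k-v2.csv")
wine = wine[["description", "price", "points"]].dropna()
print(wine.sample(5))
```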

Preprocess the text descriptions by removing punctuation and digits and converting all characters to lower case.
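A minimal cleaning function along these lines (an illustrative sketch, not the exact code from the post):

```python
def clean_text(doc: str) -> str:
    """Lower-case the text and strip punctuation and digits."""
    doc = doc.lower()
    doc = re.sub(f"[{re.escape(string.punctuation)}]", " ", doc)  # remove punctuation
    doc = re.sub(r"\d+", " ", doc)                                # remove digits
    return re.sub(r"\s+", " ", doc).strip()                       # collapse whitespace

wine["description_clean"] = wine["description"].apply(clean_text)
```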

Now that the descriptions are cleaned, TF-IDF is used to vectorize the words, and SVD is used to compress that matrix into 5 latent vectors.

This method of feature compression from text data is called Latent Semantic Analysis.

5 latent features are chosen for simplicity, but in practice, an elbow plot could be used to select the right number of latent features.
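A sketch of the TF-IDF + SVD step with 5 components (the vectorizer settings are placeholders, not the author’s exact values):

```python
# Vectorize the cleaned descriptions, then compress them into 5 latent LSA features
tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf_matrix = tfidf.fit_transform(wine["description_clean"])

svd = TruncatedSVD(n_components=5, random_state=42)
lsa_features = svd.fit_transform(tfidf_matrix)

# Combine price with the latent text features to form the design matrix X
X = np.hstack([wine[["price"]].to_numpy(), lsa_features])
y = wine["points"].to_numpy()
```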

Perform a test/train split, format the data for xgboost, and train the model!

Make predictions on the test data and evaluate the model using Mean Absolute Error.
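A sketch of the split, training, and evaluation steps (the hyperparameters are placeholders, not necessarily the settings used in the original notebook):

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# xgboost's native interface uses DMatrix containers
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "reg:squarederror", "max_depth": 5, "eta": 0.1, "eval_metric": "mae"}
model = xgb.train(params, dtrain, num_boost_round=200)

preds = model.predict(dtest)
print("MAE:", mean_absolute_error(y_test, preds))
```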

On average, the predicted quality points are off by 1.84 points. Not bad!

Use the feature importance plot from xgboost to see the features that influence the model the most.
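For example:

```python
import matplotlib.pyplot as plt

# Plot how often each feature is used to split across all the boosted trees
xgb.plot_importance(model)
plt.show()
```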

It looks like price is the most important feature when predicting the quality of a wine.

Figure 3: Feature importance.

Deployment and Inference with AWS SageMaker

In this notebook, I used SageMaker’s estimator to train the model and host it as a live endpoint.

The estimator spins up a training instance and uses the code in the train_wine_gbt.py script to train the model, save the model to S3, and define the endpoint’s input and output.

It is possible to use SageMaker’s many built-in models for training and deploying, but I wanted to specify my own feature transformations and output for the live predictions, which can be done using a Python script like train_wine_gbt.py.
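A hedged sketch of what this can look like with the SageMaker Python SDK (v2 argument names); the estimator class, instance types, framework version, S3 path, and request payload below are assumptions, not necessarily what the original notebook used:

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role attached to the SageMaker notebook instance

# Train with a custom script: train_wine_gbt.py defines the model fitting,
# saving to S3, and the input/output handling for the endpoint
estimator = SKLearn(
    entry_point="train_wine_gbt.py",
    role=role,
    instance_type="ml.m5.large",
    framework_version="0.23-1",
    sagemaker_session=session,
)
estimator.fit({"train": "s3://your-bucket/wine/train.csv"})  # placeholder S3 path

# Deploy the trained model behind a real-time HTTPS endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.t2.medium")

# Score a new wine: the payload format depends on the input handling defined
# in train_wine_gbt.py, so this request body is only illustrative
payload = {"description": "bright, crisp rose with notes of strawberry and citrus", "price": 15.0}
print(predictor.predict(payload))
```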

Now that the model is trained and deployed, I can use it to predict the quality of any bottle of wine!

Using the text description and price of “summer water” from the Winc website, the model predicts this wine as an 87, which is categorized as “very good”!

If you’d like to reuse any of this code, check out the GitHub repo for the project here, where I have Jupyter notebooks for both training and deployment.

Cheers!