Wine is OSEMN

Before examining Random Forest Classification of wine datasets, I’d have ventured to guess that the more a taster is putting pen to paper, or technically, finger to keyboard, the higher they are going to rate the wine being sampled.

The results from the RFC modeling were fairly decisive, and my presuppositions did not hold up.

While the RFC on the simplified points metric (not pictured) did slightly better than average at sorting wines into the five defined categories by description length, the effect was not as strong as I hypothesized when first formulating this project.

This graph from my EDA may demonstrate why:

A linear regression model would have been more appropriate here than classification.

This turned out to be especially true in the case of 20 distinct points categories, but still true when binned into smaller groupings every 4 points.
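To make the binning step concrete, here is a minimal Python sketch of collapsing raw point scores into coarser groups. The 80 to 100 range, the 4-point width, the choice to let the top bin absorb a score of 100, and the `bin_points` helper are all assumptions for illustration, not necessarily the scheme used in the notebook.

```python
def bin_points(points, low=80, high=100, width=4):
    """Map a raw score to a coarse bin label such as '80-83'.

    Assumed scheme: bins start at `low` and span `width` points each;
    the final bin absorbs the remainder up to and including `high`.
    """
    if not low <= points <= high:
        raise ValueError(f"points must be between {low} and {high}")
    n_bins = (high - low) // width                    # 5 bins for 80-100 at width 4
    index = min((points - low) // width, n_bins - 1)  # clamp the top score into the last bin
    start = low + index * width
    end = high if index == n_bins - 1 else start + width - 1
    return f"{start}-{end}"


scores = [80, 83, 84, 91, 95, 96, 100]
labels = [bin_points(s) for s in scores]
print(labels)
```

Fewer, wider bins give the classifier more examples per class, which is one plausible reason the grouped version behaved somewhat better than 20 separate categories.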

A much better classification model involved using the shorter descriptions to classify less expensive wines (under $100).

The inclusion of upward outliers helped this model along in a big way.

In fact, only including wines priced up to $500 was detrimental to the model’s accuracy, something else I would not have foreseen without testing it and verifying.

I had some domain knowledge in choosing $100 as a threshold of quality, beyond which lies cult status.

As the damage gets into three digits a bottle, prices diverge tremendously without a commensurate increase in quality.

What was reflected in the RFC model was a high level of accuracy for classifying wines under $100 but a not-better-than-chance classification model for wines above that threshold.

I put forth that the vast majority of wine drinkers are never looking to spend more than $100 on a bottle, or to do so only very infrequently.

This model shows that, for this subset of wine reviews from 2017, the Random Forest algorithm can use description length to classify wines as falling below that critical price point with high precision and accuracy.

Above it, review length is only about as good as chance.
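The setup described above can be sketched with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions: the data is synthetic stand-in data generated on the spot (not the Wine Enthusiast reviews), the variable names are hypothetical, and the hyperparameters are defaults rather than the ones from the notebook, so the printed numbers say nothing about real wines.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the real reviews): description lengths in
# characters and prices in dollars, loosely tied together for illustration.
n = 2000
desc_len = rng.integers(50, 600, size=n)
price = np.exp(rng.normal(3.0, 0.8, size=n)) + 0.05 * desc_len

X = desc_len.reshape(-1, 1)      # single feature: description length
y = (price < 100).astype(int)    # 1 = under $100, 0 = $100 and up

# Stratify so both price classes appear in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision and recall show how the model fares on each side
# of the $100 threshold, rather than only overall.
report = classification_report(
    y_test, y_pred, target_names=["$100 and up", "under $100"], zero_division=0
)
print(report)
```

Reading the report per class is the useful part: when most bottles fall under the threshold, a strong overall score can sit alongside much weaker performance on the pricier class.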

Taken together, I contend that these two results should be interpreted to mean that buying wine with longer description length that is still under $100 for the bottle is the best approach for scouting out your next pick in Wine Enthusiast.

Since it was the weaker model, point rating should only be used as a secondary way to classify which are truly the best wines.

Although I do not touch on it here, there is likely some interplay of description length and word frequency inside the review description which merits further study.

This would be entering the province of Natural Language Processing, and involves turning word appearances in a given text into vectors to be able to gauge frequency.
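As a rough idea of what turning word appearances into vectors looks like, here is a small bag-of-words sketch in plain Python. The `tokenize` and `bag_of_words` helpers and the two review snippets are invented for illustration; a real analysis would more likely lean on an NLP library and the actual review text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one word-count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


# Hypothetical review snippets, invented for illustration.
reviews = [
    "Ripe cherry and soft tannins",
    "Firm tannins, dark cherry, and oak",
]
vocabulary, vectors = bag_of_words(reviews)
print(vocabulary)
for vec in vectors:
    # The sum of a count vector is the description length in words.
    print(vec, "length in words:", sum(vec))
```

Because each vector sums to the word count of its review, description length and word frequency come out of the same representation, which is what makes the two natural to study together.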

Description length might reasonably harmonize with how often certain words are appearing, or it might not.

This would be a worthwhile extension of what I have delved into here.

Our foray into phenolics gave us a decisive classification.


Phenols are organic compounds, built on a central benzene ring with a varying number of hydroxyl groups as substituents, that plants and animals release as part of their defense mechanisms.

So, the more stressed the vines are, the more phenols they will produce.

These compounds not only affect the color in red wines but also act as preservatives and impact the taste.

Bacchus favors a stressed-out grape.

My notebook details that, whether you are dealing with what I define as “not two-buck Chuck” (good) or “two-buck Chuck” (bad) wine, Random Forest is quite a precise and accurate Machine Learning model for classifying wines into these two styles.

It is particularly good at distinguishing good quality wines.

Feature scaling is always something worth applying when analyzing datasets that contain variables with widely dispersed distributions.

However, in this specific instance, all of the features were very uniform even before scaling.
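For reference, one common form of feature scaling is standardization, sketched below in plain Python. The `standardize` helper, the column names, and the values are hypothetical and chosen only to show two features on very different scales; the notebook may have used a different scaler.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mu) / sigma for x in column]


# Hypothetical feature columns on very different scales.
features = {
    "alcohol": [9.4, 10.5, 12.8, 11.0],
    "total_sulfur_dioxide": [34.0, 67.0, 145.0, 98.0],
}
scaled = {name: standardize(values) for name, values in features.items()}
for name, values in scaled.items():
    print(name, [round(v, 2) for v in values])
```

After standardization every column has mean 0 and standard deviation 1, so no single feature dominates purely because of its units. Tree-based models such as Random Forest split on thresholds, so this kind of rescaling generally changes their predictions very little, which is consistent with scaling making little difference here.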

This, along with playing around with some feature selection, provided some additional insight and depth into my foray into the world of Machine Learning and wine.

Although by no means perfect, this project made the Data Science lifecycle in all its OSEMNess much more concrete for me.

I hope it can go some way toward doing the same for you.


com/harrisonhardin/Mod5Project

Wine Reviews: 130k wine reviews with variety, location, winery, price, and description. www.



P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547–553, 2009.










