Linear regression and a quality bottle of wine

Can linear regression predict the quality of a bottle of wine? Read on to find out.

There is something about statistics that governs our lives more than we think.

Consider the fact that you are likely to perform better after one bad performance. Francis Galton coined the term ‘regression’ for this phenomenon.

He compared the heights of ancestors and their descendants and found that the descendants’ heights tended to be closer to the mean.

The same inference can be made about a person’s marks on a Math test. If you get a really bad score on one Math test then, as per regression to the mean, your next score is likely to be closer to your average.

[Image: Bell curve]

A bell curve is a representation of the normal distribution of a sample of size N, where μ is the mean.
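For reference, this is the density the bell curve depicts; σ, the standard deviation, is not named in the original and is added here for completeness:

```latex
% Normal (Gaussian) density with mean \mu and standard deviation \sigma
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
```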

Regression to the mean states that extreme observations tend to be followed by observations closer to the mean.

Today, regression algorithms are used to solve numerous problems, with applications ranging from statistics (its parent field) to cutting-edge AI techniques in stock markets.

Perhaps the best use of regression is in the field of data analytics.

Here, we will use a regression model to predict the quality of wine.

Generalised linear regression follows the equation

Y = β0 + β1X1 + β2X2 + … + βnXn + ε

where β0 is the intercept, β1…βn are the regression coefficients and ε is the error term.

Now, remember that because we use multiple variables here, we are interpreting the data on a multi-dimensional hyperplane.

Generalised linear regression assumes that the dependent variable (Y in this case) has a linear relationship with the independent variables (X1…Xn).

So if we assume a two-dimensional dataset (X1 and Y), it would look like the image below.

[Image: Linear regression for one dependent variable and one independent variable]

Remember that the red line is the fitted line and the points are the actual data. When the model is fitted, the relationship is assumed to be linear, which means the data are assumed to lie near that red line.

First, we need to collect the dataset from the UCI repository.

The prerequisite is that Python, Spyder and the packages used below (pandas, seaborn, scikit-learn) are installed. For extraction of the data we use pandas and assign our features (independent variables) to X and the dependent variable to Y.
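The original shows this step as a screenshot captioned “Extracting variables using pandas”. A minimal sketch of what it might look like, assuming the semicolon-separated red-wine file winequality-red.csv from the UCI repository:

```python
import pandas as pd

# Red-wine quality data from the UCI repository (semicolon-separated).
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
data = pd.read_csv(url, sep=";")

# Features (independent variables) go into X, the dependent variable into Y.
X = data.drop("quality", axis=1)
Y = data["quality"]
```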

Correlation is a statistical technique that shows how strongly pairs of variables are related.

We use it to plot a heat map that indicates the predictive relationships: how each feature is related to every other and to the target.

It must be noted that correlation between features does not imply a causal relationship.

For example, the correlation coefficient (r) between a feature and itself is equal to 1. In all other cases the magnitude of the correlation is less than 1. Two features can also be inversely related; a perfect inverse relationship gives a correlation coefficient of −1.

The closer r is to +1 or -1, the more closely the two variables are related.
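A sketch of how such a heat map might be produced, assuming the data frame from the extraction step above (matplotlib is used alongside seaborn for display):

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = data.corr()  # pairwise correlation coefficients r in [-1, 1]
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between dependent and independent variables")
plt.show()
```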

[Image: Heat map of both dependent and independent variables]

Next, we use seaborn to plot each available feature in X against quality.
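A sketch of those feature-vs-quality plots, assuming X and Y from the extraction step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One scatter plot (with a fitted line) of each feature against quality.
for feature in X.columns:
    sns.regplot(x=X[feature], y=Y)
    plt.xlabel(feature)
    plt.ylabel("quality")
    plt.show()
```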

We have assumed that the model is linear.

Now the dataset needs to be split into training and test data; for that we use scikit-learn’s model_selection module.
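A sketch of the split; the 80/20 ratio and the random seed are assumptions, as the original does not state them:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples as a test set (assumed ratio and seed).
x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)
```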

All we are left to do after that is train our linear regression model.

It can be trained using the LinearRegression class on x_train and tested on x_test.

We call the fit method to train the model on the training data; since the data is already scaled, we do not apply a separate transform step here.
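A minimal sketch of this step, using the variables from the split above:

```python
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)      # fit estimates the coefficients β
y_pred = regressor.predict(x_test)   # predictions for the test set
```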

Estimation of the linear model is done with the help of the root mean square error (RMSE), also called the root mean square deviation (RMSD):

RMSE = √( Σᵢ (ŷᵢ − yᵢ)² / n )

where the sum runs over the n test samples, ŷᵢ is the predicted value and yᵢ the actual value. The accuracy and RMSE of the linear regression model are used to decide how well the algorithm has fared.
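A sketch of how these numbers might be computed; rounding the continuous predictions to the nearest integer grade before scoring accuracy is an assumption on my part:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# Round predictions to the nearest quality grade before scoring (assumed).
accuracy = accuracy_score(y_test, y_pred.round().astype(int))
print(f"RMSE: {rmse:.2f}, accuracy: {accuracy:.1%}")
```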

So the RMSE is 0.4, which, to be honest, is pretty bad; the model also has an accuracy of just 34.4%.

This model highlights some of the major bottlenecks that researchers faced in the early days of AI, when scientists were unable to classify features based on quality.

Some features have a strong correlation with the dependent variable Y, but that does not mean the feature causes an improvement in Y.

Perhaps this is what we missed in our approach: we took these features and assumed that because the heat map implies a relationship, the features must be strong predictors. But correlation does not imply causality.

The quality of wine is a qualitative variable, and that is another reason why the algorithm did not do well.

It is important to note that a linear regression model fares well with a quantitative approach as opposed to a qualitative one.

Variables can be classified as qualitative when we are unable to decide whether one value is greater than another.

For example: is blue greater than green? No.

This means that algorithms such as Decision Trees and Random Forests would fare better with the classification of wine quality.

Let’s save them for some other time.

Cheers!

Note: The author does not own the dataset provided by the UCI repository.

The author thanks the owners of the dataset for keeping the dataset open source.
