A Gentle Introduction to Degrees of Freedom in Machine Learning

Degrees of freedom is an important concept from statistics and engineering.

It is often employed to summarize the number of values used in the calculation of a statistic, such as a sample statistic or in a statistical hypothesis test.

In machine learning, the degrees of freedom may refer to the number of parameters in the model, such as the number of coefficients in a linear regression model or the number of weights in a deep learning neural network.

The concern is that the more degrees of freedom (model parameters) a machine learning model has, the more it is expected to overfit the training dataset.

This is the common understanding from statistics.

This expectation can be overcome through the use of regularization techniques, such as regularized linear regression and the suite of regularization methods available for deep learning neural network models.

In this post, you will discover degrees of freedom in statistics and machine learning.

Let’s get started.


This tutorial is divided into three parts.

Degrees of freedom represent the number of points of control of a system, model, or calculation.

Each independent parameter that can change is a separate dimension in a d-dimensional space that defines the scope of values that may influence the system, where the specific observed or specified values are a single point in that space.

Mathematically, the degrees of freedom is often represented using the Greek letter nu, which looks like a lower-case “v”.

It may also be abbreviated as “d.o.f,” “dof,” “d.f.,” or simply “df.”

Degrees of freedom is a term from statistics and engineering and may be used in machine learning.

In statistics, the degrees of freedom is the number of values used in the calculation of a statistic that can change.

Degrees of freedom: Roughly, the minimum amount of data needed to calculate a statistic.

More practically, it is a number, or numbers, used to approximate the number of observations in the data set for the purpose of determining statistical significance.

— Page 60, Statistics in Plain English, 3rd Edition, 2010.

It is calculated as the number of independent values used in the calculation of the statistic minus the number of statistics calculated.

For example, we may have 50 independent samples and we wish to calculate a statistic of the sample, like the mean.

All 50 samples are used in the calculation and there is one statistic, so the number of degrees of freedom for the mean, in this case, is calculated as 50 – 1 = 49.

Degrees of freedom is often an important consideration in data distributions and statistical hypothesis tests.

For example, it used to be common to have tables of statistical test critical values calculated for different common degrees of freedom (before calculating the statistic directly was easy and common).
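
The calculation for the mean of 50 samples described above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the helper name `degrees_of_freedom` and the randomly generated sample values are assumptions made for the example. It also shows one familiar place where the same count appears, the sample variance, which divides the sum of squared deviations by n – 1 because the mean was estimated from the same data.

```python
import random
import statistics


def degrees_of_freedom(n_values, n_statistics):
    """Independent values minus the number of statistics estimated from them."""
    return n_values - n_statistics


# Hypothetical data: 50 independent draws from a Gaussian distribution.
random.seed(1)
samples = [random.gauss(50, 5) for _ in range(50)]

dof = degrees_of_freedom(len(samples), 1)
print("Degrees of freedom:", dof)

# The sample variance divides by the degrees of freedom (n - 1), not by n.
mean = statistics.mean(samples)
manual_variance = sum((x - mean) ** 2 for x in samples) / dof
print("Manual variance:", manual_variance)
print("statistics.variance:", statistics.variance(samples))
```

The two variance values should agree up to floating-point rounding, since `statistics.variance` computes the sample variance.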

So far, so good, but what about a model fit from data, such as in machine learning?

In predictive modeling, the degrees of freedom often refers to the number of parameters in the model that are estimated from data.

This can also include both the coefficients of the model and the data used in the calculation of the error of the model.

The best case for understanding this is with a linear regression model.

Consider a linear regression model for a dataset that has two input variables.

We will require one coefficient in the model for each of the input variables; that is, the model will have two parameters.

This model looks as follows, where x1 and x2 are the input variables, beta1 and beta2 are the model parameters, and yhat is the predicted value:

yhat = x1 * beta1 + x2 * beta2

This linear regression model has two degrees of freedom because there are two parameters in the model that must be estimated from a training dataset.

Adding one more column to the data (one more input variable) would add one more degree of freedom for the model.
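
The two-parameter model described above can be sketched as a weighted sum of the inputs. This is only an illustration: the function names `predict` and `model_degrees_of_freedom` and the coefficient values are assumptions made for the example, and, like the model in the text, the sketch has no intercept term.

```python
def predict(inputs, coefficients):
    """Weighted sum of the inputs: x1 * beta1 + x2 * beta2 + ..."""
    return sum(x * beta for x, beta in zip(inputs, coefficients))


def model_degrees_of_freedom(coefficients):
    """One degree of freedom per coefficient estimated from data."""
    return len(coefficients)


# Hypothetical coefficient values for a model with two input variables.
two_input_model = [0.5, 1.5]
yhat = predict([2.0, 3.0], two_input_model)
print("Prediction:", yhat)
print("Model degrees of freedom:", model_degrees_of_freedom(two_input_model))

# One more input column means one more coefficient to estimate.
three_input_model = two_input_model + [-0.25]
print("Model degrees of freedom:", model_degrees_of_freedom(three_input_model))
```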

It is common to describe the complexity of a model fit from data based on the number of parameters that were fit.

For example, the complexity of a linear regression model with two parameters is equal to the degrees of freedom, which in this case is 2.

We often prefer lower complexity models over higher complexity models.

Simpler models often generalize better.

The degrees of freedom are an accounting of how many parameters are estimated by the model and, by extension, a measure of complexity for linear regression models.

— Page 71, Applied Predictive Modeling, 2013.

It’s not over yet.

The number of training examples matters and impacts the overall degrees of freedom for the regression model.

Consider that the coefficients of the linear regression model are fit using a training dataset with 100 rows or examples.

The model is fit by minimizing the error between the model predictions and the expected output values.

The total error of the model has one degree of freedom for each example in the training dataset minus the number of parameters estimated from the data.

In this case, the model error has 100 minus 2 parameters from the model, or 98 degrees of freedom.

It is often good practice to report the error of a linear model, like linear regression, including the degrees of freedom of the error.

At the very least, the number of observations in the training data can be included so that the model error degrees of freedom can be determined.
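
One common way to report a linear model's error together with its degrees of freedom is the residual standard error, which divides the sum of squared errors by the error degrees of freedom before taking the square root. The sketch below assumes that convention. The function name `residual_standard_error` and the small set of actual and predicted values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math


def residual_standard_error(actual, predicted, n_params):
    """Return (error, error degrees of freedom) for a fitted linear model."""
    error_dof = len(actual) - n_params
    if error_dof <= 0:
        raise ValueError("Need more observations than estimated parameters.")
    # Sum of squared differences between expected and predicted values.
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sse / error_dof), error_dof


# Hypothetical expected values and predictions from a two-parameter model.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.5, 5.5, 7.0, 8.0, 11.5]

rse, error_dof = residual_standard_error(actual, predicted, n_params=2)
print(f"Residual standard error: {rse:.3f} on {error_dof} degrees of freedom")
```

Here there are 5 observations and 2 parameters, so the error has 3 degrees of freedom; with the 100-row dataset from the text it would have 98.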

The total degrees of freedom for the linear regression model is taken as the sum of the model degrees of freedom and the model error degrees of freedom. For the two-parameter model fit on 100 rows, that is 2 + 98 = 100.

Generally, the total degrees of freedom is equal to the number of rows of training data used to fit the model.

Consider a dataset with 100 rows of data as before, but now we have 70 input variables.

This means that the model has 70 coefficients or parameters fit from the data.

The model error would therefore have 100 – 70, or 30, degrees of freedom.

The total degrees of freedom for the model is still equal to the number of rows, or 70 + 30.
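
The accounting for both examples can be collected into one small helper. This is a sketch under the definitions used in this post; the function name `degrees_of_freedom_summary` and the dictionary keys are assumptions made for the example.

```python
def degrees_of_freedom_summary(n_rows, n_params):
    """Model, error, and total degrees of freedom for a linear regression."""
    model_dof = n_params            # one per coefficient fit from data
    error_dof = n_rows - n_params   # observations left over for the error
    return {
        "model": model_dof,
        "error": error_dof,
        "total": model_dof + error_dof,
    }


# 100 rows with 2 input variables, then 100 rows with 70 input variables.
two_inputs = degrees_of_freedom_summary(n_rows=100, n_params=2)
seventy_inputs = degrees_of_freedom_summary(n_rows=100, n_params=70)

print(two_inputs)
print(seventy_inputs)
```

In both cases the total equals the number of rows, because the model and error terms always sum back to the number of observations.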

What happens when we have more columns than rows of data?

For example, we may have 100 rows of data and 10,000 variables, such as gene markers for 100 patients.

A linear regression model would therefore have 10,000 parameters, meaning the model would have 10,000 degrees of freedom.

We can calculate the model error degrees of freedom as 100 – 10,000 = -9,900.

Uh oh.

And we can calculate the total degrees of freedom as 10,000 + (-9,900) = 100.

The model has 100 total degrees of freedom, but the model error has negative degrees of freedom.

A negative degrees of freedom is a valid result of this calculation.

It suggests that we have more statistics than we have values that can change.

In this case, we have more parameters in the model than we have rows of data or observations to train the model.

This is the so-called p >> n problem: having many more predictors (p) than samples (n).

The problem is that when we have more parameters than observations, there is a risk of overfitting the training dataset.

This is intuitive if we think of each coefficient in the model as a point of control.

If we have more points of control in the model than we have observations, we can, in theory, configure the model to predict the training dataset correctly and exactly.
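
A small sketch can make this concrete. The data below is hypothetical: 3 training rows, 5 input variables, and arbitrary target values. The approach is just one way to show that exact coefficients exist, not a recommended fitting procedure: it solves for the first three coefficients with Gauss-Jordan elimination and sets the remaining two to zero, which assumes the first three columns form an invertible matrix.

```python
def solve_linear_system(matrix, targets):
    """Solve a square system with Gauss-Jordan elimination and partial pivoting."""
    n = len(matrix)
    # Augment each row with its target so row operations update both together.
    aug = [[float(v) for v in row] + [float(t)] for row, t in zip(matrix, targets)]
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


# Hypothetical training data: 3 rows, 5 input variables, arbitrary targets.
X = [[1, 2, 0, 1, 3],
     [0, 1, 4, 2, 1],
     [2, 0, 1, 5, 2]]
y = [7.0, -3.0, 12.5]

# Use only the first 3 columns to solve exactly; leave the other coefficients at 0.
square_part = [row[:3] for row in X]
coefficients = solve_linear_system(square_part, y) + [0.0, 0.0]

predictions = [sum(x * b for x, b in zip(row, coefficients)) for row in X]
max_error = max(abs(p - t) for p, t in zip(predictions, y))
error_dof = len(X) - len(coefficients)

print("Largest training error:", max_error)
print("Model error degrees of freedom:", error_dof)
```

The training error is zero up to floating-point rounding regardless of what the targets are, which is why a perfect training fit says nothing about performance on new data in this setting.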

Learning the details of the training dataset at the expense of performing well on new data is the definition of overfitting.

This is the general concern that statisticians have about deep learning neural network models.

That is, deep learning models often have many more parameters (model weights) than samples (e.g. billions of weights) and, using our understanding of linear models, are expected to overfit.

Nevertheless, through careful selection of model architectures and regularization techniques, they can be prevented from overfitting and maintain low generalization error.
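
Ridge regression gives a simple view of how regularization can reduce the degrees of freedom a model actually uses without changing its parameter count. The sketch below assumes the standard ridge formula, in which the effective degrees of freedom is the sum of d² / (d² + penalty) over the singular values d of the input matrix. The singular values and penalty values are hypothetical, and this is not the measure used for deep networks in the paper cited in this post.

```python
def ridge_effective_degrees_of_freedom(singular_values, penalty):
    """Effective degrees of freedom of ridge regression for a given penalty."""
    return sum(d ** 2 / (d ** 2 + penalty) for d in singular_values)


# Hypothetical singular values for an input matrix with four columns.
singular_values = [4.0, 2.0, 1.0, 0.5]
penalties = [0.0, 1.0, 10.0]

dof_values = [
    ridge_effective_degrees_of_freedom(singular_values, penalty)
    for penalty in penalties
]

for penalty, dof in zip(penalties, dof_values):
    print(f"penalty={penalty}: effective degrees of freedom = {dof:.3f}")
```

With no penalty, the effective degrees of freedom equals the number of parameters (four here); as the penalty grows, it shrinks toward zero even though the model still has four coefficients.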

Further, in deep models, the effective degrees of freedom may be decoupled from the number of parameters in the model.

We showed that for simple classification models, degrees of freedom is equal to the number of parameters in the model.

In deep networks, the degrees of freedom is generally much less than the number of parameters in the model, and deeper networks tend to have less degrees of freedom.

— Degrees of Freedom in Deep Neural Networks, 2016.

As such, there is a growing trend among statisticians and machine learning practitioners to move away from degrees of freedom as both a proxy for model complexity and an expectation for overfitting.

To most applied statisticians, a fitting procedure’s degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data.

[…] We argue that, on the contrary, model complexity and degrees of freedom may correspond very poorly.

— Effective Degrees Of Freedom: A Flawed Metaphor, 2013.


In this post, you discovered degrees of freedom in statistics and machine learning.

