Well…making sure that the algorithm does not learn…too much! If an algorithm maps every data point of a given training distribution very precisely, it will certainly produce a very small error on that specific training data (the training error).
However, while learning from the valid data points, it will also pick up the random variations produced by noise, at the much higher cost of a large error on the test data (the test error) and, consequently, poor performance due to a high generalization error.
In other words, more precision translates into collecting more noise, which can ‘confuse’ the algorithm with apparent relationships in the training data that do not hold in general, because they are based on a fallacious combination of good data and noisy data.
The consequence for an algorithm that models the training data too rigorously, absorbing too much of its noise, is that it will produce drastically different models depending on the training set and show large variations in its performance across different test data.
Variance measures exactly that: an algorithm’s sensitivity to the specific set of training data.
High variance indicates that the algorithm fits the data too closely and is probably over-complicated for the data distribution it is trying to model; it is therefore said to overfit the data.
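To make this concrete, here is a minimal sketch, assuming nothing more than NumPy and a purely invented sine-plus-noise data set, of an over-flexible model producing a tiny training error and a much larger test error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" relationship plus noise (invented purely for illustration).
def true_fn(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 15)
y_train = true_fn(x_train) + rng.normal(0, 0.2, 15)
x_test = rng.uniform(0, 1, 200)
y_test = true_fn(x_test) + rng.normal(0, 0.2, 200)

# A degree-14 polynomial has enough freedom to pass through all 15 training
# points, noise included.
p = np.polynomial.Polynomial.fit(x_train, y_train, deg=14)
train_mse = np.mean((p(x_train) - y_train) ** 2)
test_mse = np.mean((p(x_test) - y_test) ** 2)

print(f"training MSE: {train_mse:.4f}")  # close to zero
print(f"test MSE:     {test_mse:.4f}")   # typically much larger
```

The model has memorized the noise of this particular sample, which is exactly the overfitting described above.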
That is why we need to look for the “simplest approximating model”! On the other hand, we cannot select a model that is too simplistic either, one that is not expressive enough to capture and learn the complexity of an event’s data distribution.
Imagine using linear regression to map a training data set that has a non-linear pattern: no matter how hard you try and how many observations you manage to collect, a linear regression is just a line, simply too rigid to model the curves of a non-linear data set.
Bias, at the opposite corner from variance, essentially measures this inability of a machine learning algorithm to fit, or represent well enough, the distribution of the data set used to train the system.
Bias, in other words, quantifies the simplifying assumptions a model makes to render the target function easier to learn, at the cost of not finding the best possible function.
When this is the case, the algorithm will have a high bias value and is said to underfit the data.
While overfitting is characterized by variance, underfitting is characterized by bias.
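Underfitting can be sketched in the same toy setting (again with invented sine-plus-noise data): a straight line stays wrong no matter how much data we collect.

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)   # the same invented non-linear target

x_train = rng.uniform(0, 1, 500)            # even a generous amount of data...
y_train = true_fn(x_train) + rng.normal(0, 0.2, 500)
x_test = rng.uniform(0, 1, 200)
y_test = true_fn(x_test) + rng.normal(0, 0.2, 200)

# ...cannot help a degree-1 polynomial (a straight line) follow a sine-shaped curve.
line = np.polynomial.Polynomial.fit(x_train, y_train, deg=1)
print("linear training MSE:", np.mean((line(x_train) - y_train) ** 2))
print("linear test MSE:    ", np.mean((line(x_test) - y_test) ** 2))
# Both errors remain high: the model underfits because the hypothesis is too
# simple, not because the data is scarce.
```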
Well, if I have managed to convey at least part of the main concepts, you will probably have figured out by now that we are facing an interesting dilemma: on the one hand, if the algorithm fits the training data tightly, that is, overfits it, it will show low bias but high variance.
Which indicates not-so-good generalization performance.
On the other hand, when an algorithm is too simplistic to fit the training data well, that is, underfits the data, it will show low variance but higher bias.
Probably stable generalization (the test error stays close to the training error), but maybe not the best algorithm we could choose.
In statistics, this is a very well-known dilemma: it is called the bias-variance trade-off.
The ability to find an algorithm that balances bias and variance well leads us exactly to what we are looking for: the simplest approximating model with the best generalization performance (or, equivalently, the smallest generalization error).
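As a rough illustration of that balancing act, here is a hedged sketch (same invented sine-plus-noise data, with polynomial degree standing in for model complexity) that sweeps complexity and watches the held-out error trace the trade-off:

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 30)
y_train = true_fn(x_train) + rng.normal(0, 0.2, 30)
x_val = rng.uniform(0, 1, 200)                     # held-out data for validation
y_val = true_fn(x_val) + rng.normal(0, 0.2, 200)

for deg in (1, 3, 5, 9, 15):
    p = np.polynomial.Polynomial.fit(x_train, y_train, deg)
    train_mse = np.mean((p(x_train) - y_train) ** 2)
    val_mse = np.mean((p(x_val) - y_val) ** 2)
    print(f"degree {deg:2d}  train MSE {train_mse:.3f}  validation MSE {val_mse:.3f}")

# Typically the training error keeps falling as the degree grows (bias shrinks),
# while the validation error falls and then rises again (variance takes over);
# the sweet spot in between is the balance we are after.
```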
In machine learning, the bias-variance trade-off is also captured by the estimation-approximation trade-off.
Bias measures a model’s poor approximation of the target function.
To improve performance, we will probably need to select a different algorithm, or family of algorithms, offering a larger hypothesis space, one which, covering a bigger area, is more likely to get closer to, or better approximate, the target function we are aiming at.
But let’s remember that the target function we are trying to get closer to is derived only from a finite set of sample data.
Not from the real, future, full distribution.
Sample data is all we have for learning, and a limited set of data can only provide an estimate of the real function that describes the entire phenomenon.
So, just as with the bias-variance trade-off, if we approximate the function that describes the sample distribution too well, producing a low approximation error and a low bias, the risk is that when we later use the newly built function to predict or classify unseen data coming from the real distribution, it will keep predicting too closely on the basis of the patterns learned from the sample, and will turn out too rigid and specific to capture the general shape of the real distribution.
In these cases, every different training set will generate a very specific model, and these models will vary greatly from one another.
That is why, in statistics, the measure that expresses this aspect is called variance, while in a more machine-learning-oriented ‘dialect’ the same aspect is referred to as estimation error.
High variance, large estimation error.
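A minimal sketch of that instability, under the same invented data-generating process used above: refit the same flexible model on several independently drawn training sets and watch its prediction at a single point swing around.

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)
x0 = 0.25   # one query point at which the fitted models are compared

predictions = []
for _ in range(10):
    # Each iteration draws a fresh training set from the same process...
    x = rng.uniform(0, 1, 15)
    y = true_fn(x) + rng.normal(0, 0.2, 15)
    # ...and fits the same very flexible model (a degree-12 polynomial).
    p = np.polynomial.Polynomial.fit(x, y, 12)
    predictions.append(p(x0))

print("true value at x0:            ", true_fn(x0))
print("std. dev. of the predictions:", np.std(predictions))
# A large spread is variance (estimation error) in action: the learned model
# depends heavily on which particular sample it happened to see.
```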
Since, as we have just seen, the complexity of our models affects their performance, we need a way to define complexity quantitatively. Among others, the Vapnik-Chervonenkis (VC) dimension is a widely used approach to finding the right balance between bias and variance, as it captures this notion of model complexity very well.
Without getting into too much detail, the VC dimension relates to (but does not necessarily equal) the number of parameters a model has, which in turn is connected to the number of data points the model can handle.
The main idea is that the more data points the model tries to approximate exactly, the more parameters it needs to map them, which adds complexity and makes the model very specific to that data set.
And being that specific does not go hand in hand with being able to generalize well.
The VC dimension helps us capture the best balance between these two incompatible yet inseparable elements of the equation.
[For those of you who want to dig deeper into the VC dimension, consider that it measures the number of effective parameters, or binary degrees of freedom, of an algorithm, which in turn expresses the number of data points that the particular algorithm can shatter.
A set of points is said to be shattered by a class of functions if, no matter how we assign a binary label to each point, a member of the class can perfectly separate them. For instance, linear separators in the plane can shatter any three points in general position but no set of four points, so their VC dimension is three.
] While measuring an algorithm’s complexity, the VC dimension can also help us estimate the prediction error, providing a probabilistic assessment of whether an algorithm can learn and generalize from the given sample data set: a VC dimension that is low compared to the number of available training examples suggests that the test error will not be far from the training error.
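For the mathematically curious, one standard way to make that statement precise is the VC generalization bound; in one common form (given here as a sketch, not a derivation), with probability at least 1 − δ over a training sample of size N:

```latex
% VC generalization bound (one common form), holding with probability >= 1 - \delta:
E_{\text{out}}(h) \;\le\; E_{\text{in}}(h)
  + \sqrt{\frac{8}{N}\,\ln\!\frac{4\bigl((2N)^{d_{\mathrm{VC}}} + 1\bigr)}{\delta}}
```

The detail that matters here is the penalty term: it shrinks as N grows relative to the VC dimension d_VC, which is exactly the sense in which a comparatively low VC dimension keeps the test error close to the training error.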
This is an amazing result: we can actually make a strong statement about what our test performance will be on data we haven’t seen, using only a property of the model! But the surprises from learning theory do not end here.
It offers even more fundamental concepts and essential guidance for the fascinating journey of making learning possible.
And while guiding data scientists in their quest for the right alchemy to transform undisciplined data into valuable predictions, it also offers plenty of good common sense on how we, humans, should build more reliable hypotheses about the facts of life happening around us: with a little more realistic data and a little less noisy opinion.
Thank you.
Originally published on LinkedIn.