The Fundamental Problem of Machine Learning, Without Math

Apparently there’s an entire field dedicated to algorithms that build models which extract patterns from data and apply them to new data, and it’s called machine learning.

I wanted to talk about one of the most important concepts in machine learning: overfitting.

Overfitting is just a fancy way of saying that a model found patterns that are too complicated, which causes problems when predicting future points.

Source: wikipedia.org

Let’s say your task is to draw a line to separate the red from the blue points, but there may be some randomness in how the points are colored; the data is somewhat noisy.

The black line seems like a reasonable solution.

It doesn’t classify all of the points correctly, but it seems to take into account that some of the points on the border could go either way.

The green solution separates all of the points successfully.

But here’s the difference: I’ve highlighted some of the regions where, if you go by the green line, a new point would be classified as red, whereas if you go by the black line, a new point would be classified as blue.

This definitely wasn’t done via Microsoft Word.

I definitely used a professional program, like Photoshop.

Now, as a reader who isn’t trying to troll the writer, I’m sure you’ll agree with me that the highlighted yellow regions are more likely to include blue than red points.

That’s overfitting — the green line performs better on the existing points (training points) but worse on new points (test points).
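To make this concrete, here’s a minimal sketch (my own illustration, using scikit-learn, which the post itself doesn’t assume): a 1-nearest-neighbor classifier plays the part of the green line by memorizing every training point, while logistic regression plays the part of the black line.

```python
# A minimal sketch of overfitting, assuming scikit-learn and numpy.
# 1-NN memorizes the training points (the "green line" behavior);
# logistic regression draws a smoother boundary (the "black line").
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # true rule: red if x + y > 0
flip = rng.random(400) < 0.15            # ...but 15% of labels are noise
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("green-ish (1-NN)", KNeighborsClassifier(n_neighbors=1)),
                    ("black-ish (logistic)", LogisticRegression())]:
    model.fit(X_train, y_train)
    print(f"{name:22s} train={model.score(X_train, y_train):.2f} "
          f"test={model.score(X_test, y_test):.2f}")
# Expect: 1-NN hits ~1.00 on train but noticeably less on test, while the
# smoother model scores about the same on both. That gap is overfitting.
```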

We can rephrase by saying that the model which produced the green line found too many patterns in the training points.

It was too good at finding patterns, so when it was time to apply them, it failed to see that the patterns it found were probably not applicable to new points.

The fundamental problem, then, is to discover which patterns are real and which are just noise from the data itself.

I’m definitely not the first person to notice this problem.

I actually learned about it in class.

Many smart people have tried to solve the problem by coming up with really clever ways, involving long equations and a lot of Greek letters, to prevent models from finding weird lines like the green one; this is called regularization.

Whether the regularization is added during training or baked into the model itself, this is how improvements in machine learning are made.
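For a taste of what the fancy math does in practice, here’s a hedged sketch: in scikit-learn’s LogisticRegression, the parameter C is the inverse of the L2 penalty strength, so a small C pulls the weights toward zero and the boundary toward something simple like the black line. (The data below is made up purely for illustration.)

```python
# Sketch: L2 regularization with scikit-learn's LogisticRegression.
# C is the INVERSE regularization strength: small C = heavy penalty.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

for C in (1000.0, 0.01):  # nearly unpenalized vs. heavily penalized
    model = LogisticRegression(C=C).fit(X, y)
    print("C =", C, "-> weights:", model.coef_.round(2))
# With C = 0.01 the weights shrink toward zero: the model is discouraged
# from committing too strongly to any pattern, nudging it toward the
# black line rather than the green one.
```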

My view is slightly different — in my opinion, we skipped the most important step.

Before resorting to fancy math, we forgot to ask ourselves why.

Well, there’s always the question of why the sky is blue, or why we exist in the universe, but those aren’t the questions I was referencing (these are not the questions you are looking for).

No, I was more concerned about why intelligence works in general.

We’re trying to figure out which patterns work, but we haven’t established why any pattern would work at all.

Why should the next point be blue or red; why can’t it be purple?

This question, again, wasn’t asked first by me.

In fact, I have no idea who asked it first, but fortunately, someone along the way decided to come up with an answer: the reason some patterns work is that the test data is assumed to be drawn independently from the same distribution as the training data.

This is just a mathematical way of stating that when we test our model, we shouldn’t subject it to different circumstances from when we trained it.

The likelihood that the next point is purple is small because we’ve already seen a lot of points, and none of them were purple.

After all, while we can’t expect half heads and half tails when we roll a six-sided die, we can expect that if we flip the same coin in the same manner as we have been, we should get similar results.
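Here’s that coin-versus-die intuition as a tiny simulation (my own toy example, not something from the original post):

```python
# The coin-versus-die intuition as a simulation with numpy.
import numpy as np

rng = np.random.default_rng(0)
coin_train = rng.integers(0, 2, size=1000)  # "training" flips of a fair coin
coin_test = rng.integers(0, 2, size=1000)   # more flips of the SAME coin
die_test = rng.integers(1, 7, size=1000)    # a six-sided die: new distribution

print("coin, train:", coin_train.mean())  # ~0.5
print("coin, test: ", coin_test.mean())   # ~0.5: same distribution, same story
print("die,  test: ", die_test.mean())    # ~3.5: different distribution,
                                          # all bets are off
```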

To borrow a motivational quote: “If you always do what you’ve always done, you’ll always get what you’ve always gotten.” I never thought machine learning could be tied to motivational speeches until I wrote that sentence.

Ok, fine. It was done via Microsoft Word :/

Ok, great, but let’s get back on task.

What does this have to do with differentiating between actual patterns and overfitting?

If you look at the figure above, you’ll notice that the primary cause of overfitting is the green-line model trying to fit a few outliers.

Let’s say we could somehow group the points into different regions, some of which contain outliers and do not fit the patterns found (yellow regions), and some of which contain points which do fit the patterns found (green regions).

Intuitively, “green” regions ought to be larger, contain more points, and be more accurate in predicting future points.

In other words, if more points appeared in the “green” regions, we would expect them to be red.

On the other hand, if more points appeared in the “yellow” regions, while some would be red, we would think most would be blue.

Of course, to get more points, we could always collect more data.

But sometimes, you find yourself a poor undergraduate student without the money, time, or resources that other people have.

Sigh.

Anyways, sometimes, it’s not feasible to collect more data.

So instead of doing this, we can leave out part of the training data, and “add” points by using the points we’ve left out!

This also isn’t a new idea — this is called using a validation set (where we validate the patterns the model found on the training set).
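As a quick sketch of how that plays out (the setup and numbers here are mine): hold out a quarter of the data, fit an overfit-prone model on the rest, and compare accuracies.

```python
# Sketch: carving a validation set out of the training data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.random(400) < 0.15
y[flip] = 1 - y[flip]  # noisy labels, as in the figure

# Leave out 25% of the training data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # ~1.0
print("val accuracy:  ", model.score(X_val, y_val))      # noticeably lower
# The held-out points act as the "new" points we couldn't afford to
# collect, and the train/validation gap exposes the overfitting.
```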

Using the validation set to determine specific regions of overfitting is new.

There’s a reason people haven’t done this, however.

“Green” regions sound really nice, but first, they’re hard to find, and second, the regions were constructed with a dependence on the data.

Simply put, some regions are obviously “green,” and others obviously “yellow,” but some regions will be hard to color because they were molded around the training data.

The solution to this problem relies on the same fundamental principles as America.

It would be strange if two models trained on different data came up with the same “yellow” regions.

Therefore, if we repeat this coloring process for models trained on somewhat different data, then for the regions we are unsure about in the first model, we can just look at their color in the second model as the decider! If that model found the same region, then we can be pretty sure it wasn’t by chance, so it should be “green.” Conversely, if the second model did not find the region, then it was probably noise, and it should be colored “yellow.”
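Here’s my toy version of that “second opinion” trick (not the exact construction from the paper linked at the end): train two nearest-neighbor models on different halves of the data and check where they agree.

```python
# Sketch of the "second opinion" trick: two overfit-prone models, each
# trained on a different half of the data, compared on fresh points.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.random(400) < 0.15
y[flip] = 1 - y[flip]

model_a = KNeighborsClassifier(n_neighbors=1).fit(X[:200], y[:200])
model_b = KNeighborsClassifier(n_neighbors=1).fit(X[200:], y[200:])

# Probe a grid covering the plane and see where the two models agree.
xx, yy = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
grid = np.column_stack([xx.ravel(), yy.ravel()])
agree = model_a.predict(grid) == model_b.predict(grid)
print("fraction of the plane where both agree:", agree.mean())
# Agreement regions are "green" (the pattern survived a change of data);
# disagreement regions are "yellow" (probably noise).
```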

There’s just one more thing.

How do we find these regions? We use clustering! What is clustering, you ask?

Source: wikipedia.org

Clustering is finding clusters in data (I’m not intentionally being sarcastic).

But look: clustering is just finding the three clusters in the data, as shown.

Points in the same cluster look similar to the model.

If there are multiple ways the points are transformed and then plotted in the model (say, in each of the layers of a neural network), then points which are in the same clusters in all the plots are indistinguishable to the model, by definition.

This is clear — if the model could differentiate between them, then they’d be in different clusters at some point.
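A minimal sketch of how such regions could be built (the two “views” below are stand-ins for a network’s layers, and everything here is my invention for illustration): cluster the points in each view, and define a region as a set of points that share a cluster in every view.

```python
# Sketch: define regions as groups of points that share a cluster in
# every "view." The two views here are stand-ins for two layers of a
# network (my simplification, not the paper's construction).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

view1 = X                                      # the raw points
view2 = np.tanh(X @ rng.normal(size=(2, 2)))   # some transformed view

labels1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(view1)
labels2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(view2)

# A region is a unique (cluster in view1, cluster in view2) pair; points
# in the same region are never separated, so the model (seen through
# these views) can't tell them apart.
regions = sorted(set(zip(labels1, labels2)))
print("number of regions:", len(regions))
```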

Anyways, if we define the places where the points will always be in the same cluster, well, now we have our regions!

Machine learning is just finding algorithms which can separate points (oh, and regression).

Algorithms tend to find the green line, since that minimizes classification error.

The challenge of machine learning is finding an algorithm which can find the black line, since this would probably work better on new points.

Regularization, aka fancy math, helps us get something closer to the black line.

And that’s all there is to it!

Do you want a formal version of this, applied to deep learning, that achieves state-of-the-art results? Are you interested in how this justifies Occam’s Razor and ties into traditional regularization techniques? Do you want to be a cool kid?

-> https://arxiv.org/abs/1904.05488 <-

Thanks for reading!
