# Building Blocks of Supervised Learning

In that scenario, there is no reason to suggest that the optimal boundary should be linear.

While our visual inspection shows us obvious misclassifications either side of the boundary it gives us no indication of how badly we have misunderstood the underlying relationship.

This exemplifies a key pitfall in linear regression — heavy reliance on a linear structural assumption or in statistical terminology, high bias.

This leads us to another classification technique which is much more suited to the scenario we have just introduced.

k-Nearest-Neighbours (kNN)The base principle of k-Nearest-Neighbours (or kNN) is to assign a class based on the most common output from a collections of observations in the training set that are near in input space, X.

How do we fit the model?The kNN output is defined as the arithmetic mean of the response variables from k nearest inputs.

Again, nearest implies a metric so we shall use Euclidean distance (https://en.

wikipedia.

org/wiki/Euclidean_distance).

Rather than pool over the entire training set as in linear regression only a k-sized subset of the dataset is required for a given point X=x.

In essence, this method treats classification as a majority poll and assumes behaviour is locally constant i.

e.

that behaviour does not change within each neighbourhood.

Figure 3–12 Nearest Neighbour Classification BoundaryAs in the linear model case, we assign the integer value 1 to the Salmon class and 0 to the Cornflower blue class.

Then for any given (X1,X2), the kNN output f is defined to be the arithmetic mean of a set of 12 binary numbers.

It is quickly noticed that the output may not be an integer and therefore a boundary condition is again required in order to deliver us our desired output {Salmon, Cornflower blue}.

We choose to classify into the Salmon class if f is greater than 0.

5 else to the Cornflower blue class.

Figure 3 shows the k=12 implementation.

It is immediately obvious that freedom from the linear structural assumption enables a more flexible boundary that even accommodates areas of Salmon classification amongst largely Cornflower blue areas.

This flexibility is a major benefit for the kNN method but what happens if we change k?Figure 4–1 Nearest Neighbour Classification BoundaryFigure 4 shows a special case of the algorithm where the model assigns an output based solely on the nearest observation in the dataset i.

e.

the k=1 case.

Notice, that there are no misclassifications in the training set which may look great on first glance however leads to a high error rate for classification/prediction outside the training set, especially if the dataset on which the model was trained was small.

Note that the boundary differs for different values of k which introduces noise and the additional task of choosing k (which we won’t discuss here).

While the model bias is low, kNN suffers from high variance.

Alternatively put, the model can be very sensitive to the training data and may over fit the model as opposed to describing the true input/output relationship.

Since this technique puts fewer assumptions on the data, there are many more variations of the algorithm.

A deeper dive would explore the effect of different distance metrics, weighted averages and parameter selection techniques for k on the decision boundary.

Naive Bayes ClassifierThe base principle of the Naive Bayes Classifier is to classify to the most probable class using our prior knowledge and is based on a mathematical theorem for how much we should trust the evidence observed (Bayes’ Theorem — https://en.

wikipedia.

org/wiki/Bayes'_theorem).

How do we fit the model?The procedure for this model requires calculation and comparison of the posterior probability which represent the extent to which we trust the model accurately describes Y given all inputs X.

This is a closed form calculation that requires knowledge of the distribution of Y (both the type of distribution and it’s parameters).

In reality, this limits the use of this method — it is rare that the generating distribution is known but we also require it to be somewhat tractable (easy to evaluate).

So in order to fit a Bayes model for our example, we need to know where the data came from.

Since I simulated the data myself I can tell you that the data were generated from a mixture of 10 low-variance Gaussian distributions each with individual, independently distributed Gaussian means (as in the scenario discussed earlier!).

Figure 5 shows the decision boundary for this method.

Figure 5 — Naive Bayes Classification BoundaryWe observe that the boundary appears similar to k=12 kNN method despite what appears to be a very different technique.

Probability theory states that if several variables do not depend on each other in any way then the probability of observing them together is simply the product of seeing each variable independently.

In reality, it is rare that this assumption holds true as variables in the systems we are interested in modelling often hold some relation to each other.

For this reason, this method is naive.

A Final NoteThe best prediction of Y at any point X=x is the expectation given all information available.

We have seen that each of the three techniques produce a solution to our initial classification problem yet they differ greatly in both procedure and decision boundary.

In fact, all three are underpinned by the above statement where best is measured by the squared error loss function described earlier.

The statement has been proven in statistical decision theory and is known as the regression function.

The differences of each technique can be summarised quite neatly by their interpretation of the regression function.

Linear models: The best prediction of Y at any point X=x is the expectation given a linear combination of all information available.

kNN: The best prediction of Y at any point X=x is the expectation given a k-sized neighbourhood around X=x.

Naive Bayes: The best prediction of Y at any point X=x is the expectation given all information available (assuming the generating density is known).

These building blocks are a great base for statistical inference.

Not only are they easy to implement but they highlight the intricacies of statistical modelling and issues that arise in modelling real systems.

Natural extensions arise along the following line of questioning:What happens to the performance of these models as the number of input variables increases (curse of dimensionality)?How can we compare relative performance of models?How can we test if the model assumptions hold?.Why do these assumptions matter?What if we repeat our analysis with a different loss function?Feedback and comments most welcome!.