We know by looking at the dataset plot that this is exactly the region where most of our Orange class sits.

Observations in the Orange class are almost always misclassified into Blue or Green classes.

Oops! It turns out that for more than 2 classes (K>2), linear regression struggles to see all classes.

This is known as masking and results in severe misclassification.
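Masking is easy to reproduce. The sketch below is not the dataset used in this article: it assumes three synthetic Gaussian blobs lined up along a diagonal, and it assumes the linear regression classifier is built in the usual way, by regressing a one-hot indicator matrix on the inputs and picking the class with the largest fitted value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the article's dataset):
# three well-separated Gaussian blobs lined up along a diagonal.
centres = np.array([[-4.0, -4.0], [0.0, 0.0], [4.0, 4.0]])
n_per_class = 500
X = np.vstack([c + rng.standard_normal((n_per_class, 2)) for c in centres])
y = np.repeat(np.arange(3), n_per_class)

# Linear regression on a one-hot indicator matrix, one column per class.
Y = np.eye(3)[y]
X1 = np.column_stack([np.ones(len(X)), X])   # add an intercept column
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # least-squares coefficients

# Classify each point by the largest fitted value.
pred = np.argmax(X1 @ B, axis=1)

# Fraction of each class that is classified correctly.
recall = np.array([np.mean(pred[y == k] == k) for k in range(3)])
print(recall)
```

The middle class should come out far worse than the two outer ones: its fitted value is almost flat, so one of the outer classes nearly always wins the comparison.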

We clearly need a better model.

Attempt #2: Linear Discriminant Analysis (LDA)

Figure 4: Real dataset (left), LDA fitted dataset (right)

Linear Discriminant Analysis (LDA) is an immediate improvement on our first attempt.

Figure 4 shows the output from the LDA model on our training set.

We no longer see masking, and the number of misclassifications has been greatly reduced.

Great… but what actually is LDA?

Discriminant, n. A distinguishing feature or characteristic.

Discriminant analysis in general is a multi-class classification technique that assumes the data from each class come from a family exhibiting a very specific spatial behaviour (called a distribution).

The statistical properties of that distribution (such as its mean and variance) are its distinguishing characteristics, and they are used to evaluate which class is conditionally most probable for any given observation.

LDA is a special case of this technique that assumes that the observations from each class come from their own Gaussian distribution, with a covariance matrix that is common to all classes.

One way to interpret this problem is to consider 3 objects of different colours within a closed box.

While you cannot peer into the box, there are a number of tiny holes in the box that reveal one of 3 colours.

Your job is to identify the boundaries of each colour within the box and therefore to understand the size and shape of each object.

Translated into this metaphor, LDA assumes that each object in the box is an ellipsoid (a Gaussian distribution) and that the objects all share the same size and shape (a common covariance matrix).

These statistical properties are the parameters of the assumed Gaussian distribution, which are then plugged into the linear discriminant functions below.

Equation 1: Linear Discriminant Functions

This is the only mathematical function required for this technique.

Its derivation comes from comparing the posterior probabilities of two classes and then classifying based on the highest evaluation (an idea also seen in Naive Bayes classifiers).
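For reference, here is a sketch of that comparison in standard textbook notation. The symbols are assumptions, since the text does not define them: x is an observation, G its class, f_k the Gaussian density of class k, mu_k the class mean, Sigma the common covariance matrix and pi_k the class prior.

```latex
\begin{align*}
\log\frac{P(G=k \mid X=x)}{P(G=l \mid X=x)}
  &= \log\frac{f_k(x)\,\pi_k}{f_l(x)\,\pi_l} \\
  &= \log\frac{\pi_k}{\pi_l}
     - \tfrac{1}{2}(\mu_k+\mu_l)^{T}\Sigma^{-1}(\mu_k-\mu_l)
     + x^{T}\Sigma^{-1}(\mu_k-\mu_l) \\
  &= \delta_k(x) - \delta_l(x), \\[4pt]
\delta_k(x) &= x^{T}\Sigma^{-1}\mu_k
     - \tfrac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log\pi_k .
\end{align*}
```

The log-ratio is positive exactly when delta_k(x) is larger than delta_l(x), so choosing the largest delta_k chooses the most probable class. The term that is quadratic in x cancels because Sigma is shared, which is why the function is linear in x.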

The procedure for fitting the model involves estimating the parameters using a given dataset.

The following estimates are required in order to evaluate the linear discriminant functions for each class:

1. Class sample mean: the average (X1, X2) for each class. Intuitively, this indicates the centre position of each class (called a centroid).

2. Class prior probability: the number of observations in a given class k divided by the total number of observations, i.e. a simple proportion of each class in the dataset. This is a naive guess at how likely we are to get an observation from a class with no knowledge of the data.

3. Sample covariance matrix: an estimate of the common spread of the data around the class centroids. This is effectively an average of the spread of each class.

All that remains to produce LDA predictions is to plug the estimates above into the linear discriminant functions and select the class that maximises the functions for a given set of inputs.
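The whole fitting procedure can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `fit_lda` and `predict_lda` are made up for this example, the discriminant function is assumed to be the standard textbook one, and the data are synthetic Gaussian blobs rather than the dataset in the figures.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class means, class priors and the pooled covariance matrix."""
    classes = np.unique(y)
    n, p = X.shape
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    priors = np.array([np.mean(y == k) for k in classes])
    # Pooled covariance: within-class scatter summed over classes,
    # divided by (n - K) for an unbiased estimate.
    pooled = np.zeros((p, p))
    for k, mu in zip(classes, means):
        centred = X[y == k] - mu
        pooled += centred.T @ centred
    pooled /= n - len(classes)
    return classes, means, priors, pooled

def predict_lda(X, classes, means, priors, pooled):
    """Evaluate the linear discriminant functions and pick the largest."""
    inv = np.linalg.inv(pooled)
    # delta_k(x) = x^T S^-1 mu_k - 0.5 * mu_k^T S^-1 mu_k + log(pi_k)
    linear = X @ inv @ means.T                                 # shape (n, K)
    const = -0.5 * np.sum((means @ inv) * means, axis=1) + np.log(priors)
    return classes[np.argmax(linear + const, axis=1)]

# Synthetic example data (an assumption): three unit-variance blobs.
rng = np.random.default_rng(0)
centres = np.array([[-4.0, -4.0], [0.0, 0.0], [4.0, 4.0]])
X = np.vstack([c + rng.standard_normal((200, 2)) for c in centres])
y = np.repeat(np.arange(3), 200)

params = fit_lda(X, y)
accuracy = np.mean(predict_lda(X, *params) == y)
print(accuracy)
```

On blobs like these, which really are Gaussian with a shared covariance, nearly every training point should be classified correctly, including the middle class.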

Extension: Quadratic Discriminant Analysis (QDA)

In our object metaphor, what if we drop our assumption that each object is an equal-sized ellipsoid? That is to ask, what if we relax our assumptions of a common covariance and of Gaussian data?

It is clear that our discriminant functions would be different, but how?

Retaining the ellipsoid object assumption but allowing each object its own size and shape, we are led to an extension of LDA called Quadratic Discriminant Analysis (QDA).

The resulting discriminant functions for QDA are quadratic in X.

Equation 2: Quadratic Discriminant Functions

The procedure remains the same as for LDA. We need the following estimates:

1. Class sample mean: the average (X1, X2) for each class. Intuitively, this indicates the centre position of each class (called a centroid).

2. Class prior probability: the number of observations in a given class k divided by the total number of observations, i.e. a simple proportion of each class in the dataset. This is a naive guess at how likely we are to get an observation from a class with no knowledge of the data.

3. Class sample covariance matrix: an estimate of how spread out each class is.

Notice that only item 3 has changed from the LDA scenario.

In fact, LDA can be thought of as a special case of QDA where the covariance matrix is the same for each class.
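To see the special case explicitly, here is the quadratic discriminant function in its standard textbook form. The notation is an assumption (mu_k, Sigma_k and pi_k are the mean, covariance matrix and prior of class k).

```latex
\begin{align*}
\delta_k(x) &= -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
   - \tfrac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k) + \log\pi_k \\
\intertext{Setting $\Sigma_k = \Sigma$ for every class and expanding the quadratic form:}
\delta_k(x) &= \underbrace{-\tfrac{1}{2}\log\lvert\Sigma\rvert
   - \tfrac{1}{2}x^{T}\Sigma^{-1}x}_{\text{same for every class}}
   + x^{T}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log\pi_k
\end{align*}
```

The bracketed terms do not depend on k, so they cannot change which class has the largest value. Dropping them leaves a function that is linear in x, which is the LDA case.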

Figure 5 shows the performance of LDA and QDA on a more complex dataset.

The difference in boundary between the two models is akin to the difference in drawing a straight boundary line (LDA) vs a curved boundary line (QDA).

Clearly the spread of points in training set 2 varies more than we have previously seen.

QDA is flexible enough to capture this spread much more effectively than LDA, which can be seen in its performance on the Orange class.
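A rough way to try this comparison yourself is sketched below. It does not use training set 2: it assumes a synthetic dataset in which a widely spread Orange class sits between two tight classes. The helper names `fit_gaussian_da` and `predict_gaussian_da` are hypothetical, and both models are scored with the same quadratic function, so the only difference is whether the covariance matrix is shared.

```python
import numpy as np

def fit_gaussian_da(X, y, shared_covariance):
    """Fit LDA (shared_covariance=True) or QDA (shared_covariance=False)."""
    classes = np.unique(y)
    means = [X[y == k].mean(axis=0) for k in classes]
    priors = [np.mean(y == k) for k in classes]
    covs = [np.cov(X[y == k], rowvar=False) for k in classes]
    if shared_covariance:
        # Pool the class covariances, weighting each by its degrees of freedom.
        dof = np.array([np.sum(y == k) - 1 for k in classes])
        pooled = sum(d * c for d, c in zip(dof, covs)) / dof.sum()
        covs = [pooled] * len(classes)
    return classes, means, priors, covs

def predict_gaussian_da(X, classes, means, priors, covs):
    """Score every class with the quadratic discriminant, take the largest."""
    scores = []
    for mu, pi, cov in zip(means, priors, covs):
        diff = X - mu
        maha = np.sum((diff @ np.linalg.inv(cov)) * diff, axis=1)
        _, logdet = np.linalg.slogdet(cov)
        scores.append(-0.5 * logdet - 0.5 * maha + np.log(pi))
    return classes[np.argmax(np.column_stack(scores), axis=1)]

# Synthetic data (an assumption): two tight classes either side of a wide one.
rng = np.random.default_rng(1)
n = 500
BLUE, ORANGE, GREEN = 0, 1, 2
blue = rng.standard_normal((n, 2)) + [-3.0, 0.0]
orange = 3.0 * rng.standard_normal((n, 2))
green = rng.standard_normal((n, 2)) + [3.0, 0.0]
X = np.vstack([blue, orange, green])
y = np.repeat([BLUE, ORANGE, GREEN], n)

results = {}
for name, shared in [("LDA", True), ("QDA", False)]:
    pred = predict_gaussian_da(X, *fit_gaussian_da(X, y, shared))
    overall = np.mean(pred == y)
    orange_recall = np.mean(pred[y == ORANGE] == ORANGE)
    results[name] = (overall, orange_recall)
    print(name, overall, orange_recall)
```

With a shared covariance the Orange class only wins a straight band between its neighbours. With per-class covariances it can claim everything outside two closed regions around the tight classes, so its recall should rise noticeably.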

So we should always use QDA, right? Well, maybe.

While QDA is preferred for its flexibility, it needs a separate covariance matrix for every class, so the number of parameters we need to estimate grows quickly as the number of classes increases.

Is the marginal model improvement really worth the increase in computational complexity? For a ‘small data’ problem like the one we have seen here, yes, but for a much larger set of classes it is not clear.
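To put rough numbers on that growth, the sketch below counts the free parameters in each model. The helper `n_parameters` is hypothetical and the counting convention is an assumption: one mean vector per class, K - 1 free priors, and p(p + 1)/2 entries per symmetric covariance matrix.

```python
def n_parameters(n_classes, n_features, shared_covariance):
    """Count the free parameters of a Gaussian discriminant model."""
    means = n_classes * n_features
    priors = n_classes - 1                          # priors sum to one
    per_cov = n_features * (n_features + 1) // 2    # symmetric matrix
    covs = per_cov if shared_covariance else n_classes * per_cov
    return means + priors + covs

for n_classes in (3, 10, 100):
    lda = n_parameters(n_classes, 2, shared_covariance=True)
    qda = n_parameters(n_classes, 2, shared_covariance=False)
    print(n_classes, lda, qda)
```

For our 3-class, 2-feature problem that is 11 parameters for LDA against 17 for QDA, which is a small price. The gap widens with more classes, and much faster with more features, because every extra class brings its own full covariance matrix.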

To summarise, in part 1 we have:

- Identified a problem of being able to group/slice datasets.

- Developed a linear regression classifier for a 3-class example, which was subject to masking.

- Found that LDA is a very powerful tool for well-behaved Gaussian datasets.

- Extended into QDA for a slightly more flexible but more expensive method for less well-behaved datasets.

Comments & feedback appreciated!