Then, Bayes’ theorem states:Bayes’ theorem for classificationThe equation above can simply be abbreviated to:Abbreviated Bayes’ theorem for classificationHopefully, this makes some sense!The challenge here is to estimate the density function..Theoretically, Bayes’ classification has the lowest error rate..Therefore, our classifier needs to estimate the density function such as to approach the Bayes’ classifier.LDA for one predictorSuppose we only have one predictor and that the density function normal..Then, you can express the density function as:Normal distribution functionNow, we want to assign an observation X = x for which the P_k(X) is the largest..If you plug in the density function in P_k(X) and take the log, you find that you wish to maximize:Discriminant equationThe equation above is called the discriminant..As you can see, it is a linear equation..Hence the name: linear discriminant analysis!Now, assuming only two classes with equal distributions, you find:Boundary equationThis is the boundary equation..A graphical representation is shown hereunder.Boundary line to separate 2 classes using LDAOf course, this represents an ideal solution..In reality, we cannot exactly calculate the boundary line.Therefore, LDA makes use of the following approximation:For the average of all training observationsAverage of all training observationsFor the weighted average of sample variances for each classWeighted average of sample variances for each classWhere n is the number of observations.It is important to know that LDA assumes a normal distribution for each class, a class-specific mean, and a common variance.LDA for more than one predictorExtending now for multiple predictors, we must assume that X is drawn from a multivariate Gaussian distribution, with a class-specific mean vector, and a common covariance matrix.An example of a correlated and uncorrelated Gaussian distribution is shown below.Left: Uncorrelated normal distribution..Right: correlated normal distributionNow, expressing the discriminant equation using vector notation, we get:Discriminant equation with matrix notationAs you can see, the equation remains the same..Only this time, we are using vector notation to accommodate many predictors.How to assess the performance of the modelWith classification, it is sometimes irrelevant to use accuracy to assess the performance of a model.Consider analyzing a highly imbalanced data set..For example, you are trying to determine if a transaction is fraudulent or not, but only 0.5% of your data set contains a fraudulent transaction.. More details