Introduction to Discriminant Analysis (Part 1)

Pranov Mishra, Feb 17

It may look like the man is discriminating against the blue fish.

Could it be that the man is protecting the smaller fishes from being eaten up by the bigger fish?

The clearer we become in our thinking, and the more discriminant in our focus, the more EMPOWERED we become!

Take a moment to analyze these two sentences:

Sentence 1: I think one performs better when he/she is fit, overall. It is important to do whatever is required to stay fit from all angles.

Sentence 2: One performs better when he/she is fit, both physically and mentally. It is important to work on both aspects by channeling different/distinct forms of energy to each of the aspects.

Can you find any difference between the two?

In the second sentence, there was a clear differentiation between the two aspects of fitness and the focus required.

Discrimination is bad when the differentiation achieved is used in a negative way.

Otherwise, amazing things can be done with the ability to discriminate, differentiate and distribute appropriate focus to achieve divergent goals.

Introduction to Discriminant Analysis

Discriminant analysis, a loose derivation from the word discrimination, is a concept widely used to classify levels of an outcome. In other words, it is useful in determining whether a set of variables is effective in predicting category membership.

For example, I may want to predict whether a student will “Pass” or “Fail” in an exam based on the marks he has been scoring in the various class tests in the run up to the final exam.

Similarly, I may want to predict whether a customer will make his monthly mortgage payment or not based on the salary he has been drawing, his monthly expenditure and other bank liabilities etc.

In both the above cases my efforts are directed towards predicting a response which is categorical in nature.

The factors that influence the response or have a substantial role in deciding what the response will be, are called independent variables.

As I was reading through various books on a multitude of classification techniques, I came across Discriminant analysis as a very powerful classification tool.

Another such technique is Logistic Regression, which is used much more widely.

I wanted to bring out the subtleties of Discriminant analysis, which sometimes outperforms Logistic regression, especially when the response variable has more than 2 levels.

The topic broadly covers the following areas:

I. What is Discriminant Analysis?
II. What is the Relationship of Discriminant Analysis with Manova?
III. Illustration with a simple example
IV. Case Study – HR Analytics

I. What is Discriminant Analysis?

Source: https://www.flickr.com/photos/15609463@N03/14898932531

Discriminant, as the name suggests, is a method of analyzing business problems, with the goal of differentiating or discriminating the response variable into its distinct classes.

Typically, Discriminant analysis is put to use when we already have predefined classes/categories of the response and we want to build a model that predicts the class of any new observation.

However, if we have a dataset for which the classes of the response are not defined yet, clustering precedes Discriminant analysis to create the categories of output that best describe the behavior of the population.

After the clusters are built, many statisticians/analysts use either a Discriminant or a logistic model as the predictive technique to classify any new observation.

Some relevant real-life examples of where a Discriminant model can be used are:

Predict whether an applicant for a bank loan is likely to default or not.

Predict the likelihood of a heart attack based on various health indicators.

Predict the stability level (“Good”, “Requires Inspection” or “Requires Repair/Replacement”) of an engine/machine based on various performance indicators.

In terms of an equation, the expected relationship between the response variable and the independent variables can be explained by the equation below:

d = v1*X1 + v2*X2 + … + vn*Xn + a

where d is the discriminant function, v1 … vn are the discriminant coefficients, X1 … Xn are the respondent’s scores on the corresponding variables, and a is a constant.

The number of discriminant equations is at most g-1, where g is the number of groups/memberships the dependent variable has (it is also capped by the number of predictors). For the Iris data set we get two equations, as we have three classes of the dependent variable, i.e. the species.
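As a hedged illustration of the equation and of the count of discriminant functions, here is a minimal sketch on R's built-in iris data. It assumes the `lda()` function from the MASS package; the post does not show its own fitting code, and the names `fit`, `n_functions` and `scores` are only illustrative.

```r
# Minimal sketch (assumption: MASS::lda; the original post does not show this code)
library(MASS)

# Species has three classes, so at most 3 - 1 = 2 discriminant functions are expected
fit <- lda(Species ~ ., data = iris)

# Each column holds the coefficients v1..vn of one discriminant function
print(fit$scaling)

# Number of discriminant functions actually produced
n_functions <- ncol(fit$scaling)
print(n_functions)

# Discriminant scores of every flower on each function.
# predict() centres the predictors first, so no separate constant appears in fit$scaling.
scores <- predict(fit, iris)$x
print(head(scores))
```

The columns of `fit$scaling` play the role of the coefficients v in the equation above, and each column of `scores` is one discriminant function evaluated on the data.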

LDA (Linear Discriminant Analysis) determines group means and computes, for each individual, the probability of belonging to the different groups.

The individual is then assigned to the group with the highest probability score.

See example on the left.
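The assignment rule described here can be sketched in a few lines. This is only an illustration on the iris data, assuming `MASS::lda` and its `predict()` method; the names `pred` and `best` are hypothetical.

```r
# Minimal sketch (assumption: MASS::lda on iris; not the author's own code)
library(MASS)

fit <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)

# Posterior probability of each species for the first few flowers
print(round(head(pred$posterior), 3))

# Pick, for every row, the class with the highest posterior probability
best <- colnames(pred$posterior)[apply(pred$posterior, 1, which.max)]

# This should agree with the class that predict() reports
print(all(best == as.character(pred$class)))
```

Each row of `pred$posterior` sums to one, and the predicted class is simply the column with the largest value.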

Compared to logistic regression, LDA is more suitable for predicting the category of an observation in the situation where the outcome variable contains more than two classes.

Additionally, it tends to be more stable than logistic regression for multi-class classification problems.

LDA assumes that predictors are normally distributed (Gaussian distribution) and that the different classes have class-specific means and equal variance/covariance.

If these assumptions are violated, logistic regression will often outperform LDA.
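The assumptions above can be written out in the standard textbook form. The symbols are not from the original post and are introduced here only for illustration: \mu_k is the mean of class k, \Sigma is the covariance matrix shared by all classes, and \pi_k is the prior probability of class k.

```latex
X \mid Y = k \;\sim\; \mathcal{N}(\mu_k, \Sigma),
\qquad
\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k
            - \tfrac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k
            + \log \pi_k
```

An observation x is assigned to the class with the largest \delta_k(x). Because \Sigma is shared, \delta_k is linear in x, which is where the "linear" in LDA comes from.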

Quadratic Discriminant Analysis (QDA), an extension of LDA, is a little more flexible than the former, in the sense that it does not assume equality of variance/covariance. In other words, for QDA the covariance matrix can be different for each class.

LDA tends to be better than QDA when you have a small training set.

In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major issue, or if the assumption of a common covariance matrix for the K classes is clearly untenable.
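One hedged way to compare the two on a given data set is leave-one-out cross-validation. The sketch below assumes `lda()` and `qda()` from MASS with `CV = TRUE`, and uses iris purely as a stand-in; the names `acc_lda` and `acc_qda` are illustrative.

```r
# Minimal sketch (assumption: MASS::lda and MASS::qda; iris is only a stand-in data set)
library(MASS)

# CV = TRUE returns leave-one-out predictions instead of a fitted model
lda_cv <- lda(Species ~ ., data = iris, CV = TRUE)
qda_cv <- qda(Species ~ ., data = iris, CV = TRUE)

# Share of flowers classified correctly when each one is held out in turn
acc_lda <- mean(lda_cv$class == iris$Species)
acc_qda <- mean(qda_cv$class == iris$Species)

print(c(LDA = acc_lda, QDA = acc_qda))
```

On a small data set the two accuracies are usually close, so the simpler LDA is often the safer default, in line with the guidance above.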

II. Relationship between Discriminant and MANOVA

Discriminant is typically used when we have a categorical response variable and a set of independent variables which are continuous in nature.

The test before using a Discriminant analysis is to employ Manova on the same set of variables, but after reversing the equation, i.e. the response (dependent) and independent variables for Discriminant become the independent and response variables, respectively, for Manova.

If the Manova output shows that the means of the continuous variables differ significantly across the levels of the categorical variable, thereby rejecting the null hypothesis that there is no difference in means between the groups, only then will Discriminant analysis do a good job of differentiating and classifying the response variable (in the Discriminant model).

If Manova does not reject the null hypothesis, Discriminant analysis would be a futile exercise.

So in a lot of ways, Discriminant is dependent on Manova and is sometimes referred to as the reverse of Manova.

We will see this in more detail in the following sections where we will go through a few examples.
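The "reversal" described in this section can be sketched with base R's `manova()`. This is only an illustration on the iris data: the continuous measurements go on the left of the formula and the categorical Species on the right. The names `M`, `res` and `p_value` are assumptions, not the author's code.

```r
# Minimal sketch (assumption: iris as a stand-in; stats::manova ships with base R)
# For LDA the model would be Species ~ measurements; for Manova the roles are swapped.
M <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
            data = iris)

# Pillai's trace is the default test statistic of summary() for manova fits
res <- summary(M)
print(res)

# p-value for the null hypothesis that the group mean vectors are all equal
p_value <- res$stats["Species", "Pr(>F)"]
print(p_value < 0.05)
```

A small p-value here suggests the group means differ, which is the situation in which a discriminant model has something to work with.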

III. Illustration using an example

A few variables are correlated, as can be seen below. Flavnoids and NonFlavnoids are correlated with OD280.OD315. Proline and Alcohol also have a decent degree of correlation.

Some of the univariate plots for the codes above are shown below.

    # Bivariate analysis with facets (one panel per wine class)
    library(ggplot2)
    a <- ggplot(wine, aes(x = Alcohol, y = MalicAcid))
    # Colour and alpha are fixed values, so they are set outside aes() and need no legend
    a + geom_point(colour = "red", alpha = 0.5) + facet_grid(Class ~ .)

There is no real correlation, as can be seen below, with a correlation coefficient of 0.09.

    cor(wine$Alcohol, wine$MalicAcid)

No real correlation is seen in the scatter plot on the left. Clear separation is seen in the mean values of Alcohol, as can be seen in the bar graph on the right.

Manova Test: Manova is used to test the hypothesis that the means of the levels of the outcome are different, which essentially means that differentiation is possible and the independent variables contribute to it.

    summary(M)

               Df Pillai approx F num Df den Df    Pr(>F)
    Y           1 0.8651   64.661     12    121 < 2.2e-16 ***
    Residuals 132
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value and F statistic suggest that the null hypothesis is rejected, i.e. the combination of all independent variables (i.e. all variables except Class) has different means for the 3 wine types.

The variables that are significant individually can be found by summary.aov(M). The significant variables can also be found using the function discPower found in the library DiscriMiner.
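The per-variable follow-up with `summary.aov()` can be sketched as below. Since the code that loads the wine data is not shown in this post, the sketch uses the built-in iris data as a stand-in; the names `M` and `per_variable` are illustrative only.

```r
# Minimal sketch (assumption: iris stands in for the wine data, whose loading code is not shown)
M <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
            data = iris)

# One univariate ANOVA table per response variable
per_variable <- summary.aov(M)
print(per_variable)

# There is one table for each of the four measurements
print(length(per_variable))
```

Variables whose individual table shows a small p-value are the ones that differ across groups on their own, which is a rough guide to which predictors help the discriminant model.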

100% accuracy is too good to believe. There could be overfitting. However, this is a small data-set for warm-up. The real deal is coming next!

Conclusion

In the 1930s, three different people (Fisher in the UK, Hotelling in the US and Mahalanobis in India) were trying to solve the same problem through three different approaches.

Later their methods were combined together to form what we call today Discriminant Analysis.

In this blog we learnt what Linear Discriminant Analysis is and its application to a simple data-set.

In the next blog, we will take a more complicated, real-life problem and see the application of LDA.

Additionally, the follow up blog will cover applications of variants of linear discriminant analysis, comparison of all versions of discriminant models and choosing the best, and applications of complex non-linear models including ensemble modelling.
