Introduction to Classification for Beginners

Aditya Chandupatla · Jan 20

Supervised Machine Learning can be broadly classified into two categories:

- Regression
- Classification

While regression allows you to predict a continuous variable, classification allows you to categorise data into different classes.

For example, you can use a regression-based model to predict what a company's sales will be in the next year.

However, if you want a self-driving car to determine whether an object on the road is a street sign, or a pedestrian, then you have to model your problem as a classification problem, rather than a regression problem.

This article is the second in my series titled "Machine Learning 2019." I will be drawing a lot of references from my first article, where I talked about Linear Regression, so if you are not familiar with it, I advise you to go through it once before proceeding any further.

So, let us understand how classification is done by solving the problem of "University admission prediction." The problem is simple. Our hypothetical University requires a student to take two examinations before they are considered for admission. You will be given a student's scores on the two examinations, and your task is to determine whether the student gets admitted into the University or not.

Furthermore, to build the model, you will be given historical data of former students who applied to the University, along with their admission decisions (this data is taken from the Machine Learning course on Coursera, taught by Andrew Ng). The raw table looks like a pretty dull image, so let us visualise it (using matplotlib) to understand it better. The data is normalised (to have zero mean and unit variance) by replacing each feature with its z-score; exam1_score is plotted on the x-axis and exam2_score on the y-axis.

Beautiful.
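The plot above can be produced with a short snippet along these lines; the file name and column layout are my assumptions, and the notebook linked at the end of the article has the author's version:

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed file layout: two exam scores and a 0/1 admission decision per row.
data = np.loadtxt("exam_scores.csv", delimiter=",")
X, y = data[:, :2], data[:, 2]

# z-score normalisation: zero mean and unit variance for each feature.
X = (X - X.mean(axis=0)) / X.std(axis=0)

admitted = y == 1
plt.scatter(X[admitted, 0], X[admitted, 1], marker="+", label="Admitted")
plt.scatter(X[~admitted, 0], X[~admitted, 1], marker="o", label="Not admitted")
plt.xlabel("exam1_score (z-score)")
plt.ylabel("exam2_score (z-score)")
plt.legend()
plt.show()
```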

Now, let us understand the idea behind classification.

First and foremost, to solve any supervised learning problem, we require the following components:

- A hypothesis to fit our data
- A cost function which determines how well our hypothesis is performing
- An optimisation algorithm which adjusts the weights of the hypothesis to minimise the cost determined by the cost function

Let us discuss each of the three components.

1. Hypothesis

If you are familiar with Linear Regression, then you already understand all three points very well.

In classification, however, we need to tweak the definition of our hypothesis a bit.

Consider the equation below:

h(x) = w0 + w1x1 + w2x2

For regression-based problems this was excellent, since we were predicting a continuous-valued output variable, h(x). In classification, however, we need to categorise the data into buckets, i.e. a yes or a no. More concretely, in our problem we need to output a "Yes", signalling that the student will get an admit, or a "No", signalling that the student will not. What I mean to say is, we cannot say 45% "Yes"! Mathematically, we can encode this information into our equation for h(x) by applying a sigmoid function.

y = 1 / (1 + exp(-x))

The above function is called the logistic function, which belongs to the family of functions called sigmoid functions. The logistic function's range, as we can see, is between 0 and 1. Hence, if we pass the output of h(x) through the logistic function, we limit the range to [0, 1]:

g(h(x)) = 1 / (1 + exp(-h(x)))
g(h(x)) = 1 / (1 + exp(-(w0 + w1x1 + w2x2)))

Half the problem is solved.
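As a minimal sketch in Python (the helper names sigmoid and hypothesis are my own; the later snippets reuse them):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, w):
    """g(h(x)) for every example in X.

    X is an (N, 2) matrix of features (exam1_score, exam2_score) and
    w = [w0, w1, w2], where w0 is the intercept.
    """
    return sigmoid(w[0] + X @ w[1:])  # h(x) = w0 + w1*x1 + w2*x2, then squash
```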

It is not sufficient to limit the range of h(x) to [0, 1] alone.

We need to output a yes or a no.

To accomplish this, we must first look at how the output of the logistic function behaves.

From the diagram above, it is clear that the curve is symmetric around y = 0.5. Therefore, we can safely say that if the output of the logistic function is greater than 0.5, we can treat it as a "Yes", and if it is less than or equal to 0.5, we can treat it as a "No." Formally speaking, output "yes" if:

g(h(x)) > 0.5

But g(z), for some z, is greater than 0.5 only when z > 0 (see the above graph for reference). Therefore:

g(h(x)) > 0.5 only if h(x) > 0

which means:

w0 + w1x1 + w2x2 > 0

Observe carefully what the above inequality is trying to tell us.

It is an inequality in terms of x1 and x2, which are our features (exam1_score and exam2_score). The parameters for this inequality are w0, w1, and w2. So, basically, it describes one side of a straight line (assuming w2 > 0; if w2 is negative, the inequality direction flips):

w0 + w1x1 + w2x2 > 0
w2x2 > -w0 - w1x1
x2 > (-w0/w2) + (-w1/w2) x1
y > c + mx, where c = (-w0/w2), m = (-w1/w2), x = x1 and y = x2

Therefore, the inequality represents the portion of the graph which is above the line y = mx + c.

Now, recall where this all started.

We were trying to encode the information of "yes" or "no" into g(h(x)), which led us to a point where we can say that the region above the line w0 + w1x1 + w2x2 = 0 is where we output "yes", and the region below it is where we output "no." This line is referred to as the decision boundary, and it separates the two classes: in our case, the students who get an admit and the students who do not.
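Continuing the sketch above, the thresholding rule and the decision boundary line can be written as follows (again, these helper names are my own):

```python
def predict(X, w, threshold=0.5):
    """Output 1 ("Yes") when g(h(x)) > threshold, and 0 ("No") otherwise."""
    return (hypothesis(X, w) > threshold).astype(int)

def decision_boundary(x1, w):
    """x2 coordinate of the line w0 + w1*x1 + w2*x2 = 0 for a given x1."""
    return -(w[0] + w[1] * x1) / w[2]
```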

2. Cost Function

We cannot use the cost function which we used for Linear Regression, the mean squared error:

J(θ) = (1/2N) Σ (h(x(i)) - y(i))²

This is because, in the above equation, if h(x) is the logistic function, then J(θ) will be non-convex (due to the presence of the exponential term in the logistic function). Therefore, we need to come up with a new cost function for classification-based problems.

Moreover, we need to know a bit about probability in order to derive the cost function which we will now use for classification.

We will use y = 1 as a representation of "yes" and y = 0 as a representation of "no." Recall what we have said about our hypothesis g(h(x)); for brevity, let us write it simply as h(x) from here on. Using conditional probability, we can write:

h(x) = P(y = 1 | x)

Ok, well, what if y = 0?

1 - h(x) = P(y = 0 | x)   (by the complement rule of probability)

Combining the above two equations:

P(y | x) = (h(x)^y) * ((1 - h(x))^(1 - y))

You can verify that the above equation is a correct combination by plugging in the values y = 0 and y = 1.

The above equation gives us the probability of y, given input x (irrespective of the value of ‘y’).

Since our task, when performing classification, is to maximise the probability of outputting the correct value, irrespective of the value of y, we can safely say that the above equation is a good representation of how our model is performing on a given example.

To determine our model's performance on all examples, we simply take the product of the above equation over all examples, as follows:

P(y | x) = [(h(x(1))^y(1)) * ((1 - h(x(1)))^(1 - y(1)))] * [(h(x(2))^y(2)) * ((1 - h(x(2)))^(1 - y(2)))] * ... * [(h(x(N))^y(N)) * ((1 - h(x(N)))^(1 - y(N)))]

where (x(1), y(1)), (x(2), y(2)), ..., (x(N), y(N)) are the N examples in our input dataset.

One small thing left to do is to get rid of the product and the exponentials by applying a log transformation, which simplifies the equation:

log(P(y | x)) = Σ log((h(x(i))^y(i)) * ((1 - h(x(i)))^(1 - y(i))))
log(P(y | x)) = Σ (log(h(x(i))^y(i)) + log((1 - h(x(i)))^(1 - y(i))))
log(P(y | x)) = Σ (y(i) log(h(x(i))) + (1 - y(i)) log(1 - h(x(i))))

This quantity is the log-likelihood of our data; since we want a cost to minimise rather than a quantity to maximise, we take its negative (averaged over the N examples) as our cost function. It is also referred to as the cross-entropy cost function:

J(w) = -(1/N) Σ (y(i) log(h(x(i))) + (1 - y(i)) log(1 - h(x(i))))

One important thing to remember is that this cost function is convex! Hence there will be only a single minimum.
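In code, the averaged cross-entropy cost looks roughly like this, reusing the hypothesis helper from the earlier sketch:

```python
def cross_entropy_cost(X, y, w):
    """Average cross-entropy (negative log-likelihood) over the N examples."""
    p = hypothesis(X, w)        # predicted P(y = 1 | x) for every example
    eps = 1e-12                 # small constant to guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```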

3. Optimisation Algorithm

Similar to Linear Regression, we can use Gradient Descent to optimise our cross-entropy cost function, without worrying that our model will get stuck in a local optimum.

Note: If you do not know what gradient descent is, or you need some intuition behind its internals, you can refer to my explanation in the Linear Regression article.
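A minimal batch gradient descent loop for this cost might look as follows; the learning rate and iteration count are arbitrary choices of mine, not values from the notebook:

```python
def gradient_descent(X, y, lr=0.1, n_iters=5000):
    """Learn w = [w0, w1, w2] by batch gradient descent on the cross-entropy cost."""
    N = len(y)
    w = np.zeros(X.shape[1] + 1)         # start from all-zero weights
    for _ in range(n_iters):
        error = hypothesis(X, w) - y     # the cost's gradient passes through (p - y)
        w[0] -= lr * error.mean()        # gradient step for the intercept w0
        w[1:] -= lr * (X.T @ error) / N  # gradient steps for w1 and w2
    return w
```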

Equipped with all three components, the hypothesis, the cost function, and the optimisation algorithm, we can now start building our model: a binary classifier.

Our binary classifier is also sometimes called a logistic regression model. The reason it is named regression and not classification is because we are still predicting a continuous-valued variable: the probability P(y | x). If you observe the derivation of our hypothesis and the cross-entropy cost function, you will notice that we have built a probability-predicting regression model and made it a classifier by treating a probability greater than 0.5 as True, and 0.5 or less as False. And the reason it is called "logistic" regression is because we have used the logistic function to transform our hypothesis h(x).

The entire code to solve the problem of "University admission prediction" is in an IPython notebook in my GitHub repository.
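Here is the crux of the code, stitched together from the sketches above; the 80/20 split is my own choice, and the notebook in the repository may differ in the details:

```python
# Split into training and test sets (assumes the rows are already shuffled).
split = int(0.8 * len(y))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

w = gradient_descent(X_train, y_train)

print("Train accuracy:", (predict(X_train, w) == y_train).mean() * 100, "%")
print("Test accuracy:", (predict(X_test, w) == y_test).mean() * 100, "%")

# Draw the learned decision boundary over the range of exam1_score values.
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.plot(x1, decision_boundary(x1, w), label="Decision boundary")
plt.show()
```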

Remember the decision boundary we talked about earlier, w0 + w1x1 + w2x2 = 0? After learning the parameters w0, w1, and w2, the plot shows this line separating the two classes, and the accuracy of our model is:

Train accuracy: 90.0 %
Test accuracy: 85.0 %

With this, we come to the end of the article.

I hope there was something to take away.

Next week, we will dissect Neural Networks and dive deep into deep learning!
