Notice how fast it converges to 0 or 1.
How can we find the parameters for our model? Let’s examine some approaches to finding good ones. But what does good mean in this context?

Loss function

We have a model that we can use to make decisions, but we still have to find the parameters W. To do that, we need an objective measurement of how good a given set of parameters is.
For that purpose, we will use a loss (cost) function, also known as the Log loss or Cross-entropy loss function (source: https://ml-cheatsheet.readthedocs.io):

cost(h(x), y) = -log(h(x)) if y = 1
cost(h(x), y) = -log(1 - h(x)) if y = 0

We can compress the above function into one:

J(W) = -(1/m) Σ [yᵢ log(h(xᵢ)) + (1 - yᵢ) log(1 - h(xᵢ))]

where h(x) = σ(x · W) is the prediction of our model. Let’s implement it in Python:

Approach #1 — try out a number

Let’s think of 3 numbers that represent the coefficients w0, w1, w2.
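A minimal sketch of what that might look like in NumPy (the helper names, the example data, and the guessed coefficients are my assumptions, not necessarily the post's exact code):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def loss(h, y):
    # cross-entropy (log) loss, averaged over all examples
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

# one made-up customer and a guessed set of coefficients (w0, w1, w2)
X = np.array([[1.0, 3.0, 2.0]])   # bias term plus two features
y = np.array([0.0])
W = np.array([5.0, 6.0, 1.0])     # just numbers we thought of

h = sigmoid(X @ W)                # model prediction for our guess
# the loss comes out around 25 for this particular guess
print("loss:", loss(h, y), "predicted:", h[0], "actual:", y[0])
```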
loss: 25.0 predicted: 0.999999999986112 actual: 0.0

Unfortunately, I am pretty lazy, and this approach seems like a bit too much work for me.
Let’s go to the next one:

Approach #2 — try out a lot of numbers

Alright, these days computers are pretty fast; 6+ core laptops are everywhere. Smartphones can be pretty performant, too! Let’s use that power for good™ and try to find those pesky parameters by just trying out a lot of numbers:

0.0
0.0
0.0
6.661338147750941e-16
9.359180097590508e-14
1.3887890837434982e-11
2.0611535832696244e-09
3.059022736706331e-07
4.539889921682063e-05
0.006715348489118056
0.6931471805599397
5.006715348489103
10.000045398900186
15.000000305680194
19.999999966169824
24.99999582410784
30.001020555434774
34.945041100449046
inf
inf

Amazing, the first parameter value we tried got us a loss of 0.
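For reference, a brute-force search that produces a trace like this might be sketched as follows (the toy data, the grid of candidates, and the helper names are my assumptions, so the printed values won't match the post's exact numbers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(h, y):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

# toy data: one feature, label 1 = send a discount code
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1.0, 1.0, 0.0, 0.0])

best_w, best_loss = None, np.inf
for w in np.arange(-5.0, 5.0, 0.5):   # try a grid of candidate weights
    current = loss(sigmoid(X * w), y)  # loss for this candidate
    print(current)                     # the kind of trace shown above
    if current < best_loss:
        best_w, best_loss = w, current
```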
Is it your lucky day, or will this always be the case, though? The answer is left as an exercise for the reader :)

Approach #3 — Gradient descent

Gradient descent algorithms (yes, there are a lot of them) provide us with a way to find a minimum of some function f.
They work by iteratively moving in the direction of steepest descent, as defined by the negative gradient.
In Machine Learning, we use gradient descent algorithms to find “good” parameters for our models (Logistic Regression, Linear Regression, Neural Networks, etc…).
(source: PyTorchZeroToAll)

How does it work? Starting somewhere, we take our first step downhill in the direction specified by the negative gradient.
Next, we recalculate the negative gradient and take another step in the direction it specifies.
This process continues until we get to a point where we can no longer move downhill — a local minimum.
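As a tiny standalone illustration (my own toy example, not from the post), here is gradient descent minimizing f(x) = x², whose gradient is 2x; it walks straight to the minimum at x = 0:

```python
def gradient(x):
    # derivative of f(x) = x ** 2
    return 2 * x

x = 5.0              # start somewhere
learning_rate = 0.1
for _ in range(100):
    # step in the direction of the negative gradient
    x -= learning_rate * gradient(x)
# x is now vanishingly close to the minimum at 0
```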
Ok, but how can we find that gradient thing? Since our example is rather simple, we just have to find the derivative of our cost function.

The first derivative of the sigmoid function

The first derivative of the sigmoid function is given by the following equation:

σ'(x) = σ(x)(1 - σ(x))

The complete derivation can be found here.

The first derivative of the cost function

Recall that the cost function was given by the following equation:

J(W) = -(1/m) Σ [yᵢ log(h(xᵢ)) + (1 - yᵢ) log(1 - h(xᵢ))]

Given

h(x) = σ(x · W)

we obtain the first derivative of the cost function:

∂J(W)/∂W = (1/m) Xᵀ (σ(X · W) - y)

Updating our parameters W

Now that we have the derivative, we can go back to our update rule and use it there:

W := W - a · ∂J(W)/∂W

The parameter a is known as the learning rate.
A high learning rate can converge quickly but risks overshooting the lowest point. A low learning rate allows for confident moves in the direction of the negative gradient; however, each move is tiny, so converging can take a long time.
Too big vs too small learning rate (source: https://towardsdatascience.com/)

The Gradient descent algorithm

The algorithm we’re going to use works as follows:

Repeat until convergence {
  1. Calculate gradient average
  2. Multiply by learning rate
  3. Subtract from weights
}

Let’s do this in Python. About that until convergence part: you might notice that we kinda brute-force our way around it. That is, we run the algorithm for a preset number of iterations.
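A minimal sketch of that training loop (the function name fit, the default learning rate, and the iteration count are my assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, n_iter=100000, lr=0.01):
    W = np.zeros(X.shape[1])               # weights start at zero
    for _ in range(n_iter):
        h = sigmoid(X @ W)
        gradient = X.T @ (h - y) / y.size  # 1. calculate gradient average
        W -= lr * gradient                 # 2. multiply by learning rate
                                           # 3. subtract from weights
    return W
```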
Another interesting point is the initialization of our weights W — initially set at zero.
Let’s put our implementation to the test, literally. But first, we need a function that helps us predict y given some data X (i.e., predict whether or not we should send a discount to a customer based on their spending). Now for our simple test: note that we use reshape to add a dummy dimension to X.
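That predict helper might be sketched like this (assuming the sigmoid function from earlier; the weight value below is purely hypothetical, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, W):
    # probability that each customer should get a discount code
    return sigmoid(X @ W)

# reshape adds the dummy (feature) dimension mentioned in the text
X_test = np.array([10.0, 250.0]).reshape(-1, 1)
W = np.array([0.01])              # hypothetical trained weight
probs = predict(X_test, W)        # probabilities in the [0, 1] range
answers = np.round(probs)         # rounded to hard 0/1 decisions
```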
Further, after our call to predict, we round the results.
Recall that the sigmoid function spits out (kinda like a dragon with an upset stomach) numbers in the [0; 1] range.
We’re just going to round the result in order to obtain our 0 or 1 (yes or no) answers.
run_tests()

Here is the result of running our test case:

F

Well, that’s not good. After all that hustling, we’re nowhere near achieving our goal of finding good parameters for our model. But what went wrong?

Welcome to your first model debugging session! Let’s start by finding out whether our algorithm improves over time.
We can use our loss metric for that:

run_tests()

We pretty much copied & pasted our training code, except that we’re now printing the training loss every 10,000 iterations.
Let’s have a look:

loss: 0.6931471805599453
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056

F

Suspiciously enough, we found a possible cause for our problem on the first try! Our loss doesn’t get low enough; in other words, our algorithm gets stuck at some point that is not a good enough minimum for us.
How can we fix this? Perhaps try out a different learning rate, or initialize our parameters with different values?

First, a smaller learning rate a:

run_tests()

With a=0.001 we obtain this:

loss: 0.42351356323845546
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056
loss: 0.41899283818630056

F
Not so successful, are we? How about adding one more parameter for our model to find/learn?

run_tests()

And for the results:

........
----------------------------------------------------------------------
Ran 8 tests in 0.686s

OK

What did we do here? We added a new element to our parameter vector W and set its initial value to 1.
Seems like this turned things in our favor!

Bonus — building your own LogisticRegressor

Knowing all the details of the inner workings of Gradient descent is good, but when solving problems in the wild, we might be hard-pressed for time.
In those situations, a simple & easy to use interface for fitting a Logistic Regression model might save us a lot of time.
So, let’s build one! But first, let’s write some tests:

run_tests()

We just packed all the previously written functions into a tiny class.
One huge advantage of this approach is the fact that we hide the complexity of the Gradient descent algorithm and the use of the parameters W.
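Packed into a class, it might look like this (a sketch: the fit/predict method names follow the familiar scikit-learn convention, and the internals are my reconstruction from the earlier steps):

```python
import numpy as np

class LogisticRegressor:
    def __init__(self, lr=0.01, n_iter=100000):
        self.lr = lr
        self.n_iter = n_iter

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def _add_intercept(self, X):
        # prepend a column of ones so W[0] acts as the bias term
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        X = self._add_intercept(X)
        self.W = np.zeros(X.shape[1])
        for _ in range(self.n_iter):
            h = self._sigmoid(X @ self.W)
            # gradient descent step, hidden away from the caller
            self.W -= self.lr * X.T @ (h - y) / y.size
        return self

    def predict(self, X):
        X = self._add_intercept(X)
        return np.round(self._sigmoid(X @ self.W))
```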
Using our Regressor to decide who should receive discount codes

Now that you’re done with the “hard” part, let’s use the model to predict whether or not we should send discount codes.
Let’s recall our initial data. Now let’s try our model on data obtained from 2 new customers:

Customer 1 – $10
Customer 2 – $250

y_test

Recall that 1 means send code and 0 means do not send:

array([1., 0.])

Looks reasonable enough.
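An end-to-end sketch of that decision (the training data, the feature scaling, the learning rate, and the iteration count are all my assumptions; the post's actual data differs):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical history: low spenders received discount codes (label 1)
amounts = np.array([5.0, 15.0, 30.0, 200.0, 300.0])
y_train = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

# scale dollars down and prepend an intercept column; the scaling keeps
# plain gradient descent numerically well-behaved (my own choice)
X = np.column_stack([np.ones(amounts.size), amounts / 100.0])

W = np.zeros(2)
for _ in range(50000):
    h = sigmoid(X @ W)
    W -= 0.1 * X.T @ (h - y_train) / y_train.size

# two new customers: $10 and $250
X_new = np.column_stack([np.ones(2), np.array([10.0, 250.0]) / 100.0])
y_new = np.round(sigmoid(X_new @ W))  # 1 = send code, 0 = do not send
```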
Care to try out more cases?

You can find the complete source code and run it in your browser here: LogisticRegression (colab.research.google.com)

Conclusion

Well done! You have a complete (albeit simple) LogisticRegressor implementation that you can play with.
Go on, have some fun with it! Coming up next, you will implement a Linear regression model from scratch :)

Like what you read? Do you want to learn even more about Machine Learning? Come join me on Patreon: Venelin Valkov is creating Machine Learning blog posts, notebooks, videos & books (www.patreon.com)