# A Detailed Guide to 7 Loss Functions for Machine Learning Algorithms with Python Code

We can consider this as a disadvantage of MAE.

The code for the update_weight function with MAE cost is embedded as a Gist in the original article.
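As a rough sketch of what such an update step can look like, here is one gradient-descent step on the MAE cost. It assumes a simple linear model y_hat = m*x + b; the function name, the argument order and the choice of a zero gradient when the residual is exactly zero are illustrative assumptions, not necessarily what the original Gist does.

```python
def update_weights_mae(m, b, X, Y, learning_rate):
    """One gradient-descent step on MAE for the assumed model y_hat = m*x + b."""
    m_deriv = 0.0
    b_deriv = 0.0
    n = len(X)
    for x, y in zip(X, Y):
        residual = y - (m * x + b)
        # d|r|/dr is sign(r); at r == 0 we pick 0 (an assumed subgradient choice)
        sign = (residual > 0) - (residual < 0)
        m_deriv += -x * sign
        b_deriv += -sign
    # Average the gradient over the samples and step against it
    m -= learning_rate * m_deriv / n
    b -= learning_rate * b_deriv / n
    return m, b


m, b = update_weights_mae(0.0, 0.0, [1, 2, 3], [2, 4, 6], learning_rate=0.1)
```

Note that the size of the step does not depend on how large the residual is, only on its sign.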

We get the below plot after running the code for 500 iterations with different learning rates:

### 3. Huber Loss

The Huber loss combines the best properties of MSE and MAE.

It is quadratic for smaller errors and is linear otherwise (and similarly for its gradient).

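It is identified by its delta parameter:

$$
L_\delta(y, f(x)) =
\begin{cases}
\frac{1}{2}\,(y - f(x))^2 & \text{if } |y - f(x)| \le \delta \\
\delta\,|y - f(x)| - \frac{1}{2}\,\delta^2 & \text{otherwise}
\end{cases}
$$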
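To make the two regimes concrete, here is a minimal sketch of the Huber loss and one gradient-descent step with it. It assumes a linear model y_hat = m*x + b; the function names and signatures are illustrative assumptions and may differ from the original Gist.

```python
def huber_loss(y, y_hat, delta):
    """Huber loss for a single prediction."""
    error = y - y_hat
    if abs(error) <= delta:
        return 0.5 * error ** 2                    # quadratic for small errors
    return delta * abs(error) - 0.5 * delta ** 2   # linear for large errors


def update_weights_huber(m, b, X, Y, delta, learning_rate):
    """One gradient-descent step on Huber loss for the assumed model y_hat = m*x + b."""
    m_deriv = 0.0
    b_deriv = 0.0
    n = len(X)
    for x, y in zip(X, Y):
        error = y - (m * x + b)
        if abs(error) <= delta:
            grad = -error                          # gradient of 0.5 * error**2 w.r.t. y_hat
        else:
            grad = -delta if error > 0 else delta  # gradient of delta * |error| w.r.t. y_hat
        m_deriv += grad * x
        b_deriv += grad
    m -= learning_rate * m_deriv / n
    b -= learning_rate * b_deriv / n
    return m, b
```

An outlier with a large error contributes a gradient of magnitude delta at most, which is where the robustness comes from.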

We obtain the below plot for 500 iterations of weight update at a learning rate of 0.0001 for different values of the delta parameter:

Huber loss is more robust to outliers than MSE.

It is used in Robust Regression, M-estimation and Additive Modelling.

A variant of Huber Loss is also used in classification.

## Binary Classification Loss Functions

The name is pretty self-explanatory.

Binary Classification refers to assigning an object to one of two classes.

This classification is based on a rule applied to the input feature vector.

For example, classifying an email as spam or not spam based on, say, its subject line is binary classification.

I will illustrate these binary classification loss functions on the Breast Cancer dataset.

We want to classify a tumor as ‘Malignant’ or ‘Benign’ based on features like average radius, area, perimeter, etc.

For simplification, we will use only two input features (X_1 and X_2) namely ‘worst area’ and ‘mean symmetry’ for classification.

The target value Y can be 0 (Malignant) or 1 (Benign).

Here is a scatter plot for our data:

### 1. Binary Cross Entropy Loss

Let us start by understanding the term ‘entropy’.

Generally, we use entropy to indicate disorder or uncertainty.

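It is measured for a random variable X with probability distribution p(X). For a discrete random variable:

$$
H(X) = -\sum_{x} p(x) \log p(x)
$$

The negative sign is used to make the overall quantity positive, since the logarithm of a probability is never positive.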

A greater value of entropy for a probability distribution indicates a greater uncertainty in the distribution.

Likewise, a smaller value indicates a more certain distribution.
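A quick way to see this is to compute the entropy of two small discrete distributions. The sketch below assumes the discrete form of entropy with the natural logarithm; the function name is illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution given as a list of probabilities."""
    # Terms with p == 0 contribute nothing, so they are skipped to avoid log(0)
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = entropy([0.5, 0.5])  # a fair coin: the most uncertain two-outcome distribution
skewed = entropy([0.9, 0.1])   # a biased coin: more certain, so lower entropy
```

Here `uniform` equals log(2) and `skewed` is smaller, matching the intuition that a more certain distribution has lower entropy.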

This makes binary cross-entropy suitable as a loss function – you want to minimize its value.

We use binary cross-entropy loss for classification models which output a probability p.

Probability that the element belongs to class 1 (or positive class) = p

Then, the probability that the element belongs to class 0 (or negative class) = 1 – p

The cross-entropy loss for output label y (which can take values 0 and 1) and predicted probability p is then defined as:

$$
L = -y \log(p) - (1 - y) \log(1 - p)
$$

This is also called Log-Loss.

To calculate the probability p, we can use the sigmoid function.

Here, z is a function of our input features:

$$
S(z) = \frac{1}{1 + e^{-z}}
$$

The output of the sigmoid function always lies between 0 and 1, which makes it suitable for calculating probability.

Try to find the gradient yourself before looking at the code for the update_weight function, which is embedded as a Gist in the original article.
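For reference, here is a minimal sketch of such an update step. It assumes z = m1*x1 + m2*x2 + b for the two input features; the function names, the argument order and the clipping constant `eps` are illustrative assumptions rather than the contents of the original Gist. For sigmoid combined with log-loss, the derivative of the loss with respect to z simplifies to (p - y).

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def binary_cross_entropy(y, p, eps=1e-12):
    """Log-loss for one label y in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def update_weights_bce(m1, m2, b, X1, X2, Y, learning_rate):
    """One gradient-descent step, assuming z = m1*x1 + m2*x2 + b."""
    m1_deriv = m2_deriv = b_deriv = 0.0
    n = len(Y)
    for x1, x2, y in zip(X1, X2, Y):
        p = sigmoid(m1 * x1 + m2 * x2 + b)
        # dL/dz = p - y; the chain rule then multiplies by each input
        m1_deriv += (p - y) * x1
        m2_deriv += (p - y) * x2
        b_deriv += (p - y)
    m1 -= learning_rate * m1_deriv / n
    m2 -= learning_rate * m2_deriv / n
    b -= learning_rate * b_deriv / n
    return m1, m2, b
```

With features on very different scales (such as ‘worst area’ and ‘mean symmetry’), scaling the inputs first usually makes the choice of learning rate much easier.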

I got the below plot on using the weight update rule for 1000 iterations with different values of alpha:

### 2. Hinge Loss

Hinge loss is primarily used with Support Vector Machine (SVM) Classifiers with class labels -1 and 1.

So make sure you change the label of the ‘Malignant’ class in the dataset from 0 to -1.

Hinge Loss not only penalizes the wrong predictions but also the right predictions that are not confident.

Hinge loss for an input-output pair (x, y) is given as:

$$
L = \max(0,\ 1 - y \cdot f(x))
$$
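A minimal sketch of the hinge loss and one subgradient step follows. It assumes a linear score f(x) = m1*x1 + m2*x2 + b and labels in {-1, 1}; the function names and signatures are illustrative assumptions and may differ from the original Gist.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, 1} and a raw model score f(x)."""
    return max(0.0, 1.0 - y * score)


def update_weights_hinge(m1, m2, b, X1, X2, Y, learning_rate):
    """One subgradient step, assuming f(x) = m1*x1 + m2*x2 + b."""
    m1_deriv = m2_deriv = b_deriv = 0.0
    n = len(Y)
    for x1, x2, y in zip(X1, X2, Y):
        score = m1 * x1 + m2 * x2 + b
        # Only points that are wrong or inside the margin (y * f(x) < 1) contribute
        if y * score < 1:
            m1_deriv += -y * x1
            m2_deriv += -y * x2
            b_deriv += -y
    m1 -= learning_rate * m1_deriv / n
    m2 -= learning_rate * m2_deriv / n
    b -= learning_rate * b_deriv / n
    return m1, m2, b
```

A correct prediction with a score of 0.5 still incurs a loss of 0.5, which is how the hinge loss penalizes right predictions that are not confident.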

After running the update function for 2000 iterations with three different values of alpha, we obtain this plot:

Hinge Loss simplifies the mathematics for SVM while maximizing the margin (as compared to Log-Loss).

It is used when we want to make real-time decisions without a laser-sharp focus on accuracy.

## Multi-Class Classification Loss Functions

Emails are not just classified as spam or not spam (this isn’t the 90s anymore!).

They are classified into various other categories – Work, Home, Social, Promotions, etc.

This is a Multi-Class Classification use case.

We’ll use the Iris Dataset for understanding the remaining two loss functions.

We will use two features, X_1 (sepal length) and X_2 (petal width), to predict the class (Y) of the Iris flower – Setosa, Versicolor or Virginica.

Our task is to implement the classifier using a neural network model and the in-built Adam optimizer in Keras.

This is because as the number of parameters increases, the math, as well as the code, will become difficult to comprehend.

Here is the scatter plot for our data:

### 1. Multi-Class Cross Entropy Loss

The multi-class cross-entropy loss is a generalization of the Binary Cross Entropy loss.

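The loss for input vector X_i and the corresponding one-hot encoded target vector Y_i is:

$$
L(X_i, Y_i) = -\sum_{j=1}^{c} y_{ij} \log(p_{ij})
$$

where c is the number of classes, y_ij is 1 if the i-th element belongs to class j and 0 otherwise, and p_ij is the predicted probability that the i-th element belongs to class j.

We use the softmax function to find the probabilities p_ij:

$$
p_{ij} = \frac{e^{z_{ij}}}{\sum_{k=1}^{c} e^{z_{ik}}}
$$

where z_ij is the score the network produces for class j on input X_i (Source: Wikipedia).

“Softmax is implemented through a neural network layer just before the output layer. The Softmax layer must have the same number of nodes as the output layer.” (Google Developer’s Blog)

Finally, our output is the class with the maximum probability for the given input.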
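To see softmax and the multi-class cross-entropy working together on one example, here is a small sketch. The scores are made-up numbers for a three-class problem, and the function names and the clipping constant `eps` are illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max does not change the result but avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_one_hot, probs, eps=1e-12):
    """Multi-class cross-entropy for one one-hot target and one probability vector."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_one_hot, probs))


probs = softmax([2.0, 1.0, 0.1])                    # hypothetical scores for 3 classes
loss = categorical_cross_entropy([1, 0, 0], probs)  # true class is class 0
predicted_class = probs.index(max(probs))           # class with the maximum probability
```

Because the target is one-hot, only the term for the true class survives in the sum, so the loss is simply the negative log of the probability assigned to the correct class.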

We build a model using an input layer and an output layer and compile it with different learning rates.

Specify the loss parameter as ‘categorical_crossentropy’ in the model.compile() statement.

Here are the plots for cost and accuracy respectively after training for 200 epochs:

### 2. KL-Divergence

The Kullback-Leibler Divergence is a measure of how a probability distribution differs from another distribution.

A KL-divergence of zero indicates that the distributions are identical.

Notice that the divergence function is not symmetric.

This is why KL-Divergence cannot be used as a distance metric.

I will describe the basic approach of using KL-Divergence as a loss function without getting into its math.

We want to approximate the true probability distribution P of our target variables with respect to the input features, given some approximate distribution Q.

Since KL-Divergence is not symmetric, we can do this in two ways: by minimizing D_KL(P || Q) or by minimizing D_KL(Q || P).

The first approach is used in Supervised learning, the second in Reinforcement Learning.

KL-Divergence is functionally similar to multi-class cross-entropy and is also called relative entropy of P with respect to Q:

$$
D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
$$

We specify ‘kullback_leibler_divergence’ as the value of the loss parameter in the compile() function, as we did before with the multi-class cross-entropy loss.
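The asymmetry is easy to check numerically. The sketch below computes the divergence in both directions for two made-up discrete distributions; the function name and the example values are illustrative assumptions, and the function assumes Q is non-zero wherever P is non-zero.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as equal-length lists."""
    # Terms with p_i == 0 contribute nothing; q_i is assumed non-zero wherever p_i > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.7, 0.2, 0.1]  # hypothetical "true" distribution
Q = [0.4, 0.4, 0.2]  # hypothetical approximation

forward = kl_divergence(P, Q)  # D_KL(P || Q)
reverse = kl_divergence(Q, P)  # D_KL(Q || P)
```

The two values differ, while `kl_divergence(P, P)` is zero, in line with the two properties described above.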

KL-Divergence is used more commonly to approximate complex functions than in multi-class classification.

We come across KL-Divergence frequently while working with deep generative models like Variational Autoencoders (VAEs).

## End Notes

Woah! We have covered a lot of ground here.

Give yourself a pat on the back for making it all the way to the end.

This was quite a comprehensive list of loss functions we typically use in machine learning.

I would suggest going through this article a couple of times more as you proceed with your machine learning journey.

This isn’t a one-time effort.

It will take a few readings and experience to understand how and where these loss functions work.

Make sure to experiment with these loss functions and let me know your observations down in the comments.

Also, let me know other topics that you would like to read about.

I will do my best to cover them in future articles.

Meanwhile, make sure you check out our comprehensive beginner-level machine learning course: Applied Machine Learning – Beginner to Professional.