A Gentle Introduction to Generative Adversarial Network Loss Functions


The result of using the non-saturating loss is better gradient information when updating the weights of the generator and a more stable training process.

This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.

— Generative Adversarial Networks, 2014.

In practice, this is also implemented as a binary classification problem, like the discriminator.

Instead of maximizing the loss, we can flip the labels for real and fake images and minimize the cross-entropy.

… one approach is to continue to use cross-entropy minimization for the generator.

Instead of flipping the sign on the discriminator’s cost to obtain a cost for the generator, we flip the target used to construct the cross-entropy cost.

— NIPS 2016 Tutorial: Generative Adversarial Networks, 2016.
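For example, a minimal sketch of this label flip in Keras might look like the snippet below. The define_gan() and train_generator() helper names, the 'adam' optimizer, and the latent_dim and n_batch arguments are assumptions for illustration, not details from the papers quoted above: the composite model is compiled with ordinary binary cross-entropy, and the generated images are simply labeled as real (class 1) when the generator is updated.

# Minimal sketch of the non-saturating generator update in Keras.
# Assumes `generator` and `discriminator` are already-defined Keras models.
import numpy as np
from keras.models import Sequential

def define_gan(generator, discriminator):
    # freeze the discriminator so only the generator is updated via this model
    discriminator.trainable = False
    model = Sequential()
    model.add(generator)
    model.add(discriminator)
    # ordinary binary cross-entropy; the label flip happens in the targets below
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return model

def train_generator(gan_model, latent_dim, n_batch):
    # sample random points in the latent space as input for the generator
    x_gan = np.random.randn(n_batch, latent_dim)
    # label the generated images as real (class 1), so that minimizing the
    # cross-entropy maximizes log(D(G(z))), the non-saturating loss
    y_gan = np.ones((n_batch, 1))
    return gan_model.train_on_batch(x_gan, y_gan)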

The choice of loss function is a hot research topic and many alternate loss functions have been proposed and evaluated.

Two popular alternate loss functions used in many GAN implementations are the least squares loss and the Wasserstein loss.

The least squares loss was proposed by Xudong Mao, et al. in their 2016 paper titled “Least Squares Generative Adversarial Networks.”

Their approach was based on the observation of the limitations of using binary cross-entropy loss when generated images are very different from real images, which can lead to very small or vanishing gradients, and in turn, little or no update to the model.

… this loss function, however, will lead to the problem of vanishing gradients when updating the generator using the fake samples that are on the correct side of the decision boundary, but are still far from the real data.

— Least Squares Generative Adversarial Networks, 2016.

The discriminator seeks to minimize the sum squared difference between predicted and expected values for real and fake images.

The generator seeks to minimize the sum squared difference between predicted and expected values as though the generated images were real.

In practice, this involves maintaining the class labels of 0 and 1 for fake and real images respectively, minimizing the least squares, also called mean squared error or L2 loss.

The benefit of the least squares loss is that it gives more penalty to larger errors, in turn resulting in a large correction rather than a vanishing gradient and no model update.

… the least squares loss function is able to move the fake samples toward the decision boundary, because the least squares loss function penalizes samples that lie in a long way on the correct side of the decision boundary.

— Least Squares Generative Adversarial Networks, 2016.
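To make this concrete, the two least squares objectives can be written down directly in NumPy. The snippet below is an illustrative sketch only (the function names and the d_real and d_fake variable names are hypothetical): the discriminator is penalized by the squared distance of its predictions from targets of 1.0 for real and 0.0 for fake images, and the generator is penalized by the squared distance of the predictions for fake images from the real target of 1.0. In a Keras model, the same effect can be achieved by compiling the discriminator and the composite generator model with loss='mse'.

# Minimal NumPy sketch of the least squares (LSGAN) losses.
# d_real and d_fake are the discriminator's raw outputs for batches of
# real and generated images (hypothetical variable names).
import numpy as np

def lsgan_discriminator_loss(d_real, d_fake):
    # penalize real predictions far from 1.0 and fake predictions far from 0.0
    return 0.5 * (np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def lsgan_generator_loss(d_fake):
    # penalize generated images whose predictions are far from the real
    # target of 1.0, even when they are on the correct side of the boundary
    return 0.5 * np.mean((d_fake - 1.0) ** 2)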

The Wasserstein loss was proposed by Martin Arjovsky, et al. in their 2017 paper titled “Wasserstein GAN.”

The Wasserstein loss is informed by the observation that the traditional GAN is motivated to minimize the distance between the distribution of real images and the distribution of generated images, using measures such as the Kullback-Leibler divergence or the Jensen-Shannon divergence.

Instead, they propose modeling the problem on the Earth-Mover’s distance, also referred to as the Wasserstein-1 distance.

The Earth-Mover’s distance calculates the distance between two probability distributions in terms of the cost of turning one distribution (pile of earth) into another.
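As a concrete illustration of the idea, the Earth-Mover's distance between two simple one-dimensional samples can be computed with SciPy's wasserstein_distance() function. This standalone example is for intuition only and is not part of a GAN training procedure.

# Illustration of the Earth-Mover's (Wasserstein-1) distance between two
# simple one-dimensional distributions using SciPy.
from scipy.stats import wasserstein_distance

# two piles of earth located at different positions on the number line
real_samples = [0.0, 1.0, 2.0]
fake_samples = [5.0, 6.0, 7.0]

# the minimum cost of turning one pile into the other is 5.0, because
# every unit of mass must be moved a distance of 5
print(wasserstein_distance(real_samples, fake_samples))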

The GAN using Wasserstein loss involves changing the notion of the discriminator into a critic that is updated more often (e.g. five times more often) than the generator model.

The critic scores images with a real value instead of predicting a probability.

It also requires that model weights be kept small, e.g. clipped to a hypercube of [-0.01, 0.01].
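One way to enforce this in Keras is with a custom weight constraint applied to each layer of the critic. The sketch below is illustrative only; the ClipConstraint name and the default clip value follow the description above rather than any specific library API.

# Minimal sketch of weight clipping for the critic as a Keras constraint.
from keras import backend
from keras.constraints import Constraint

class ClipConstraint(Constraint):
    # clip model weights to a small hypercube, e.g. [-0.01, 0.01]
    def __init__(self, clip_value=0.01):
        self.clip_value = clip_value

    def __call__(self, weights):
        return backend.clip(weights, -self.clip_value, self.clip_value)

    def get_config(self):
        return {'clip_value': self.clip_value}

Each Dense or Conv2D layer in the critic can then be created with kernel_constraint=ClipConstraint(0.01).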

The score is calculated such that the scores for real and fake images are maximally separated.

The loss function can be implemented by calculating the average predicted score for a batch of real or fake images and multiplying it by a class label of 1 or -1 respectively.

This has the desired effect of driving the scores for real and fake images apart.
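In Keras, this can be expressed as a custom loss function, sketched below (the wasserstein_loss name is an assumption). The class label of 1 or -1 is passed in as y_true, so minimizing the mean of the label multiplied by the predicted score drives the average scores for the two classes of images apart.

# Minimal sketch of the Wasserstein loss as a custom Keras loss function.
from keras import backend

def wasserstein_loss(y_true, y_pred):
    # y_true is 1 for one class of images and -1 for the other, so the
    # critic learns to push the average scores for the two classes apart
    return backend.mean(y_true * y_pred)

The critic and the composite generator model would then be compiled with loss=wasserstein_loss and trained with class labels of 1 and -1 rather than 1 and 0.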

The benefit of Wasserstein loss is that it provides a useful gradient almost everywhere, allowing for the continued training of the models.

It also means that a lower Wasserstein loss correlates with better generator image quality, meaning that we are explicitly seeking a minimization of generator loss.

To our knowledge, this is the first time in GAN literature that such a property is shown, where the loss of the GAN shows properties of convergence.

— Wasserstein GAN, 2017.

Many loss functions have been developed and evaluated in an effort to improve the stability of training GAN models.

The non-saturating loss is the most commonly used in general, with the least squares and Wasserstein losses seen in larger and more recent GAN models.

As such, there is much interest in whether one loss function is truly better than another for a given model implementation.

This question motivated a large study of GAN loss functions by Mario Lucic, et al. in their 2018 paper titled “Are GANs Created Equal? A Large-Scale Study.”

Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others.

We conduct a neutral, multi-faceted large-scale empirical study on state-of-the-art models and evaluation measures.

— Are GANs Created Equal? A Large-Scale Study, 2018.

They fix the computational budget and hyperparameter configuration for models and look at a suite of seven loss functions.

This includes the Minimax loss (MM GAN), Non-Saturating loss (NS GAN), Wasserstein loss (WGAN), and Least-Squares loss (LS GAN) described above.

The study also includes an extension of Wasserstein loss to remove the weight clipping called Wasserstein Gradient Penalty loss (WGAN GP) and two others, DRAGAN and BEGAN.

The table below, taken from the paper, provides a useful summary of the different loss functions for both the discriminator and generator.

Summary of Different GAN Loss Functions. Taken from: Are GANs Created Equal? A Large-Scale Study.

The models were evaluated systematically using a range of GAN evaluation metrics, including the popular Frechet Inception Distance, or FID.

Surprisingly, they discovered that all of the evaluated loss functions performed approximately the same when all other elements were held constant.

We provide a fair and comprehensive comparison of the state-of-the-art GANs, and empirically demonstrate that nearly all of them can reach similar values of FID, given a high enough computational budget.

— Are GANs Created Equal? A Large-Scale Study, 2018.

This does not mean that the choice of loss does not matter for specific problems and model configurations.

Instead, the result suggests that the difference in the choice of loss function disappears when the other concerns of the model are held constant, such as computational budget and model configuration.


In this post, you discovered an introduction to loss functions for generative adversarial networks.

Specifically, you learned how the standard GAN loss functions are implemented in practice, how alternate loss functions such as the least squares and Wasserstein losses address vanishing gradients, and that a large-scale study found that most GAN loss functions can perform similarly when the computational budget and model configuration are held constant.
