How GANs really work

Imagine we are at equilibrium but the generator is not sampling from the underlying distribution of X (i.e. the distribution of A, the real distribution).

For instance, suppose that for a certain point x the real probability is 0.3 while the generator samples it with probability only 0.2.

Then the discriminator will be “fed” more ones than zeros on this x, and therefore the score for x will go up, above 0.5.

But the generator wants to maximize its score, so since every other score is 0.5, the only possibility is to increase its sampling rate at x in order to “grab” this extra score.
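This argument can be made concrete with the optimal discriminator from the original GAN paper, D*(x) = p_data(x) / (p_data(x) + p_g(x)). Plugging in the probabilities from the example above:

```python
# Optimal discriminator from the GAN paper's analysis:
# D*(x) = p_data(x) / (p_data(x) + p_g(x))
def optimal_discriminator(p_data, p_gen):
    return p_data / (p_data + p_gen)

# The point x above: real probability 0.3, generator probability 0.2.
score = optimal_discriminator(0.3, 0.2)
print(score)  # 0.6: above 0.5, so the generator gains by sampling x more often
```

As soon as the generator under-samples x, the optimal discriminator scores x above 0.5, which is exactly the gradient signal pushing the generator back toward the real distribution.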

We have the guarantee that with a perfect discriminator, the generator will converge to the perfect solution.

Therefore having a good discriminator is very important.

However, if the discriminator is perfect, it won’t give gradients to the generator, since the generator’s samples would always have a score of 0.

A good mixture is necessary.

GAN algorithm

Network approximation

Another point is crucial in GANs: neural networks are approximators.
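For reference, the alternating optimization of the GAN algorithm can be sketched like this (a minimal skeleton; `d_step` and `g_step` stand for one gradient update of the discriminator and the generator, and the names are mine, not from the original code):

```python
def train_gan(d_step, g_step, n_epochs, k=10):
    """Alternating GAN training: each epoch runs k discriminator
    updates, then k generator updates (k=10, matching the set-up
    used later in this post)."""
    for _ in range(n_epochs):
        for _ in range(k):
            d_step()  # ascend  log D(x) + log(1 - D(G(z)))
        for _ in range(k):
            g_step()  # ascend  log D(G(z))  (non-saturating loss)

# Quick check with counting stubs instead of real gradient steps:
counts = {"d": 0, "g": 0}
train_gan(lambda: counts.update(d=counts["d"] + 1),
          lambda: counts.update(g=counts["g"] + 1),
          n_epochs=3)
print(counts)  # {'d': 30, 'g': 30}
```

The balance between the two inner loops is exactly the “good mixture” mentioned above: too many discriminator steps and the generator stops receiving useful gradients.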

Actually the adversarial approach does not need approximation, we can do it with “perfect” models.

But it would be much less powerful, paradoxically.

With perfect models, the optimum is reached when the generator distribution is the distribution of X, our finite set.

That is to say, the generator would only output samples from X, and the discriminator would assign 0.5 to every element of X, and random values below 0.5 elsewhere.

But we don’t want that.

We want to sample from A, the “true” set, not our data.

We want to generate new samples, new situations which describe the same high-level features as X.

And that’s exactly what networks do: we turn their weakness to our advantage.

If well trained, networks do not overfit (even though overfitting would, surprisingly, minimize the training loss) but instead approximate, in order to get a good validation score.

Therefore the generator will sample from A and not X, because the network is not perfect and will try to generalize.

In practice

Let’s look at what this looks like in practice.

We’ll take E = R², and try to sample specific distributions.

Set up

The generator and discriminator are dense networks.

The first has 4 hidden layers of 50 neurons each, with a centered Gaussian noise input of dimension 15, and an output in R².

The discriminator has an input of dimension 2, 3 hidden layers of size [100,50,50] and a single sigmoidal output neuron.

All the activations are leaky ReLUs except for the final sigmoid one.

The learning rate is 1e-5.

The optimizers are Adam optimizers with beta1 = 0.9 and beta2 = 0.999.

I use 10 learning steps per epoch for each network.

The batches are of size 500.

The code is largely inspired by this one.
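Since the original code is only linked, here is a hedged PyTorch reconstruction of this set-up (the module structure and variable names are my own transcription of the description above, not the post’s actual code):

```python
import torch
import torch.nn as nn

NOISE_DIM = 15  # dimension of the centered Gaussian noise input

# Generator: 4 hidden layers of 50 neurons, leaky-ReLU activations, output in R^2.
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 50), nn.LeakyReLU(),
    nn.Linear(50, 50), nn.LeakyReLU(),
    nn.Linear(50, 50), nn.LeakyReLU(),
    nn.Linear(50, 50), nn.LeakyReLU(),
    nn.Linear(50, 2),
)

# Discriminator: input in R^2, hidden layers [100, 50, 50], sigmoid output.
discriminator = nn.Sequential(
    nn.Linear(2, 100), nn.LeakyReLU(),
    nn.Linear(100, 50), nn.LeakyReLU(),
    nn.Linear(50, 50), nn.LeakyReLU(),
    nn.Linear(50, 1), nn.Sigmoid(),
)

# Adam with lr = 1e-5, beta1 = 0.9, beta2 = 0.999, as in the set-up above.
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-5, betas=(0.9, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-5, betas=(0.9, 0.999))

z = torch.randn(500, NOISE_DIM)  # one batch of size 500
fake = generator(z)              # generated samples in R^2
scores = discriminator(fake)     # discriminator scores in (0, 1)
```

One forward pass through both networks is enough to check that the shapes line up before starting the alternating training loop.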

Discrete distribution

Here A is finite, and we take X = A.

We test the GAN model on A = {(-1,0), (1,0)}, and π(a) = 1/2 for all a in A.

Here are the results:

As you can see, the points quickly concentrate on the desired points (A), and the probabilities are 0.5 each, as expected.

Circle distribution

Here A is a circle centered at 0, with radius 1.

π is again the uniform distribution on A.

The points are quickly uniformly distributed on the circle.
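For reference, here is how one can sample the two target distributions used in these experiments (a small numpy sketch of mine, not the post’s code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_discrete(n):
    """Uniform samples from A = {(-1, 0), (1, 0)}, pi(a) = 1/2 each."""
    points = np.array([[-1.0, 0.0], [1.0, 0.0]])
    return points[rng.integers(0, 2, size=n)]

def sample_circle(n):
    """Uniform samples from the unit circle centered at 0."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)

real = sample_circle(500)  # one batch of "real" data for the discriminator
```

These samplers play the role of the unknown distribution π: during training, the discriminator sees batches drawn from them labeled as real.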

And with a VAE?

Let’s see the result with a VAE, for comparison (with the same parameters, layer sizes, etc.).

Maybe you understand better what I was saying earlier.

VAEs are doing the job, but the result is blurry.

Non-convergence and mode collapse

I have a terrible announcement.

GANs are not guaranteed to converge.

As I mentioned before, networks are imperfect.

Therefore, they can “forget” old data points to remember new ones.

Imagine A is composed of 4 points, (-1,0), (1,0), (0,1) and (0,-1), and π is again the uniform distribution on A.

A first possibility is that the GAN converges to π, as I showed before with only two points.

Another possibility is that the generator and discriminator will always chase each other and never converge.

The generator could generate only (-1,0); then the discriminator will lower the score of this point more and more, and the generator will move to the next point, say (1,0).

And so on, indefinitely.

It is a possible steady state, but not a Nash equilibrium.

Let’s see this in practice:

I changed the parameters of the network a little to force non-convergence.

The model no longer converges, moving regularly between 3 different places.

Another issue, closely related, is mode collapse.

“Modes” are the main features of our distribution.

Here in R² it’s not really relevant, since our modes are our 4 points, but for instance MNIST images have 10 different modes, the 10 possible digits.

The generator sometimes produces only a restricted number of modes and forgets the others.

If the variance is too small, it can’t get out of this trap because the gradients are too small.

Mode collapse is a real problem in GANs and occurs very often.

The main solution is hyperparameter tuning, but there have been many interesting attempts to address it over the past few years.

I recommend looking at this article if you want to know more about ways to improve GAN performance.

Conclusion

GANs are a tremendous tool for recovering an unknown probability distribution from data.

Many problems are linked to this “density estimation” problem, which makes GANs very powerful.

You can see fancy applications on this post, on different types of data.

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets (2014), NIPS 2014.
