Checklist for debugging neural networks

Erik Rippel has a great, colorful post on ‘Visualizing parts of Convolutional Neural Networks using Keras and Cats’.

4. Diagnose parameters

Neural networks have large numbers of parameters that interact with each other, making optimization hard.

Please note, this is an area of active research so the suggestions below are simply starting points.

Batch size (technically called mini-batch size) — You want the batch size to be large enough to give accurate estimates of the error gradient, but small enough that mini-batch stochastic gradient descent (SGD) can still regularize your network.

Small batch sizes will result in a learning process that converges quickly, at the cost of noise during training, and that noise can itself cause optimization difficulties.

The paper ‘On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima’ describes how: [It] has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize.

We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions — and as is well known, sharp minima lead to poorer generalization.

In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation.
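To make the batch-size knob concrete, here is a minimal PyTorch sketch (PyTorch, the toy tensors, and the specific sizes are our own illustration, not from the paper quoted above) showing that batch size is simply an argument to your data loader that you can sweep:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for your real training data.
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# Smaller batches: noisier gradient estimates, often better regularization.
small_batch_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Larger batches: smoother gradients, but watch for the sharp-minima effect quoted above.
large_batch_loader = DataLoader(dataset, batch_size=512, shuffle=True)
```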

Learning rate — A learning rate that is too low will lead to slow convergence or the risk of getting stuck in a local minimum, while a learning rate that is too large will cause the optimization to diverge, because you risk jumping across a deeper but narrower part of the loss function.

Consider incorporating learning rate scheduling to decrease the learning rate as training progresses.

The CS231n course has a great section on different techniques for annealing the learning rate.

Machine learning frameworks such as Keras, Tensorflow, PyTorch, and MXNet now all have documentation or examples around using learning rate schedulers/decay:

Keras — https://keras.io/callbacks/#learningratescheduler
Tensorflow — https://www.tensorflow.org/api_docs/python/tf/train/exponential_decay
PyTorch — https://pytorch.org/docs/stable/_modules/torch/optim/lr_scheduler.html
MXNet — https://mxnet.incubator.apache.org/versions/master/tutorials/gluon/learning_rate_schedules.html
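As one concrete illustration (a PyTorch sketch of our own, with a stand-in model, rather than an excerpt from the docs linked above), a simple step-decay schedule looks like this:

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for your network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Multiply the learning rate by 0.1 every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... run your training batches and optimizer.step() here ...
    scheduler.step()  # anneal the learning rate once per epoch
```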

Gradient clipping — This will clip parameters’ gradients during backpropagation by a maximum value or maximum norm.

Useful for addressing any exploding gradients that you might encounter in Step #3 above.
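A minimal PyTorch sketch of norm-based clipping (the toy model, data, and max_norm value are our assumptions, not recommendations from the post):

```python
import torch

# Toy setup standing in for your real model and batch.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(32, 10), torch.randn(32, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()

# Clip the global gradient norm before the update
# (clip_grad_value_ clips element-wise by a maximum value instead).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```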

Batch normalization — Batch normalization is used to normalize the inputs of each layer, in order to fight the internal covariate shift problem. Make sure to read the point below on Dropout if you’re using Dropout and Batch Norm together.

This article from Dishank Bansal ‘Pitfalls of Batch Norm in TensorFlow and Sanity Checks for Training Networks’ is a great resource for common errors with batch normalization.
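For reference, here is a small sketch (our own toy layer sizes, not from the article above) of where a batch norm layer typically sits and why train/eval mode matters for the statistics it uses:

```python
import torch.nn as nn

# Toy fully connected block; BatchNorm normalizes each layer's inputs.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.BatchNorm1d(64),  # normalizes activations using batch statistics
    nn.ReLU(),
    nn.Linear(64, 2),
)

model.train()  # training: uses per-batch mean/variance and updates running averages
model.eval()   # inference: uses the accumulated running averages instead
```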

Stochastic Gradient Descent (SGD) — There are several flavors of SGD that use momentum, adaptive learning rates, and Nesterov updates, with no clear winner for both training performance and generalization (see Sebastian Ruder’s excellent ‘An overview of gradient descent optimization algorithms’ and this interesting experiment, ‘SGD > Adam?’). A recommended starting point is Adam, or plain SGD with Nesterov momentum.
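For example, the two starting points mentioned above look like this in PyTorch (the learning rates shown are common defaults of our choosing, not tuned recommendations):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for your network

# Plain SGD with Nesterov momentum.
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# Adam with its usual default learning rate.
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```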

Regularization —Regularization is crucial for building a generalizable model since it adds a penalty for model complexity or extreme parameter values.

It significantly reduces the variance of the model without a substantial increase in its bias.

As described in the CS231n course: It is often the case that a loss function is a sum of the data loss and the regularization loss (e.g. L2 penalty on weights).

One danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term (which usually has a much simpler gradient expression).

This can mask an incorrect implementation of the data loss gradient.

To audit this, you should turn off regularization and check your data loss gradient independently.
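In PyTorch, for instance, an L2 penalty is commonly applied through the optimizer’s weight_decay argument, which also makes the audit above a one-line change (a sketch with arbitrary values of our own):

```python
import torch

model = torch.nn.Linear(10, 2)

# L2 regularization via weight decay (the value here is arbitrary).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Audit: turn regularization off and check the data loss (and its gradients) on their own.
audit_optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.0)
```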

Dropout — Dropout is another technique to regularize your network to prevent overfitting.

While training, dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise.

As a result, the network has to use a different subset of parameters for each training batch, which reduces the chances of specific parameters becoming dominant over others.
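A minimal PyTorch sketch (toy layer sizes and a p of our own choosing) showing dropout in a network, along with the train/eval distinction that matters for the discussion below:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # each unit is zeroed with probability 0.5 during training
    nn.Linear(64, 2),
)

model.train()  # dropout active: a different random subset is dropped on each pass
model.eval()   # dropout disabled: all units are used at inference time
```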

The important note here is: if you’re using both dropout and batch normalization (batch norm) together, be cautious of the order of these operations or even of using them together.

This is still an active area of research, but you can see the latest discussions:

From Stack Overflow user MiloMinderBinder: “Dropout is meant to block information from certain neurons completely to make sure the neurons do not co-adapt. So, the batch normalization has to be after dropout otherwise you are passing information through normalization statistics.”

From the arXiv paper ‘Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift’ (Xiang Li, Shuo Chen, Xiaolin Hu, Jian Yang): “Theoretically, we find that Dropout would shift the variance of a specific neural unit when we transfer the state of that network from train to test.

However, BN would maintain its statistical variance, which is accumulated from the entire learning procedure, in the test phase.

The inconsistency of that variance (we name this scheme as “variance shift”) causes the unstable numerical behavior in inference that leads to more erroneous predictions finally, when applying Dropout before BN.”
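Purely as an illustration of the two orderings being debated above (not a recommendation either way, and with toy layer sizes of our own), the difference is just where the Dropout layer sits relative to BatchNorm:

```python
import torch.nn as nn

# Dropout before BatchNorm: the pattern the variance-shift paper cautions against.
dropout_before_bn = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.BatchNorm1d(64), nn.Linear(64, 2),
)

# Dropout after BatchNorm (e.g. only after the last BN layer).
dropout_after_bn = nn.Sequential(
    nn.Linear(10, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 2),
)
```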

5. Track your work

It’s easy to overlook the importance of documenting your experiments until you forget which learning rate or class weights you used.

With better tracking, you can easily review and reproduce previous experiments and reduce duplicated work (aka running into the same errors).

However, manually documenting this information is difficult to do and to scale across multiple experiments.

Tools like Comet.ml can help automatically track datasets, code changes, experimentation history and production models — including key pieces of information about your model like hyperparameters, model performance metrics, and environment details.

Your neural network can be very sensitive to slight changes in data, parameters, and even package versions — leading to drops in model performance that can build up.

Tracking your work is the first step you can take to begin standardizing your environment and modeling workflow.

Check out model performance metrics and retrieve the code used to train the model from within Comet.ml.

There’s an example of Comet’s automatic experiment tracking here.
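As a hedged sketch (assuming the comet_ml Python package is installed; exact method names and arguments may vary by version), logging hyperparameters and metrics looks roughly like this:

```python
from comet_ml import Experiment

# Placeholder credentials and project name; substitute your own.
experiment = Experiment(api_key="YOUR_API_KEY", project_name="debugging-neural-networks")

experiment.log_parameters({"learning_rate": 0.01, "batch_size": 32, "dropout": 0.5})

for epoch in range(10):
    # ... training loop ...
    val_accuracy = 0.0  # replace with your real validation metric
    experiment.log_metric("val_accuracy", val_accuracy, step=epoch)

experiment.end()
```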

Quick Recap

We hope this post serves as a solid starting point for debugging your neural network. To summarize the highlights, you should:

Start simple — build a simpler model first and test by training on a few data points
Confirm your loss — check that you’re using the correct loss and review your initial loss
Check intermediate outputs and connections — use gradient checking and visualization to check that your layers are properly connected and that your gradients are updating as expected
Diagnose parameters — from SGD to learning rates, identify the right combination (or figure out the wrong ones)
Track your work — as a baseline, track your experimentation process and key modeling artifacts

Found this post useful? Think it’s missing something? Comment below with your feedback and questions! Follow the discussion on HackerNews!
