Improving Deep Neural Networks

Rochak Agrawal, Jun 19

Deep Neural Networks are the solution to complex tasks like Natural Language Processing, Computer Vision, Speech Synthesis etc.

Improving their performance is as important as understanding how they work.

To understand how they work, you can refer to my previous posts.

In this post, I will be explaining various terminologies and methods related to improving the neural networks.

Bias and Variance

Bias and Variance are two essential terms that describe how well the network performs on the Training set and the Test set.

Let us understand Bias and Variance easily and intuitively using a 2 class problem.

The blue line indicates the decision boundary computed by the neural network.

The leftmost figure shows that the neural network has the problem of High Bias.

In this case, the network has learned too simple a hypothesis and is therefore not able to fit the training data properly.

As a result, it is not able to differentiate between the examples of different classes and will perform poorly on both the training set and the test set.

We can also say that the network is Underfitting.

The rightmost figure shows that the neural network has the problem of High Variance.

In this case, the network has learned a very complex hypothesis and therefore is not able to generalise.

As a result, it will perform well on the training data but poorly on the test data.

We can also say that the network is Overfitting.

The centre figure shows a “Just Right” neural network.

It has learned the ideal hypothesis, which helps the network to filter out the anomalies and also generalise on the data.

Our goal should be to obtain this type of network.
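The three cases above can be told apart from the training and dev (held-out) error rates. The following is a minimal sketch of that check; the function name `diagnose` and the thresholds `acceptable_error` and `max_gap` are hypothetical and would need to be chosen for the task at hand.

```python
def diagnose(train_error, dev_error, acceptable_error=0.05, max_gap=0.02):
    """Label a model from its training and dev (held-out) error rates.

    acceptable_error and max_gap are hypothetical thresholds, not values
    taken from the article.
    """
    # High bias: the network cannot even fit the training data well.
    high_bias = train_error > acceptable_error
    # High variance: the dev error is much worse than the training error.
    high_variance = (dev_error - train_error) > max_gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "just right"


print(diagnose(train_error=0.15, dev_error=0.16))
print(diagnose(train_error=0.01, dev_error=0.11))
print(diagnose(train_error=0.01, dev_error=0.02))
```

With these illustrative numbers, the first call reports high bias, the second high variance, and the third a "just right" network.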

The Training Recipe

Now that we know what kind of neural network is desirable, let us see how we can achieve our goal.

The steps first tackle the bias problem and then the variance problem.

The first question that we should ask is “Is there a High Bias?” If the answer is YES, then we should try the following steps:

1. Train a bigger network. It includes increasing the number of hidden layers and the number of neurons in the hidden layers.

2. Train the network for an extended period of time. It may be the case that the full training has not been completed yet and will take more iterations.

3. Try a different optimisation algorithm. These algorithms include Adam, Momentum, AdaDelta etc.

Perform the above steps iteratively until the bias problem is solved and then move on to the second question.

If the answer is NO, it means that we have overcome the bias problem, and it is time to focus on the variance problem.

The second question that we should ask now is “Is there a High Variance?” If the answer is YES, then we should try the following steps:

1. Gather more training data. As we gather more data, we get more variation in the data, and the overly complex hypothesis learned from the less varied data no longer holds.

2. Try Regularization. I will speak about it in the next section.

Perform the above steps iteratively until the variance problem is solved.

If the answer is NO, it means that we have overcome the variance problem, and now our Neural Network is “Just Right”.
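The recipe can be sketched as a small function that asks the two questions in order, bias first and variance second. This is only an illustration: `next_steps` and its thresholds are hypothetical names and values, and in practice the function would be called again after each retraining run.

```python
def next_steps(train_error, dev_error, acceptable_error=0.05, max_gap=0.02):
    """Return the recipe's suggestions, handling bias before variance.

    The two thresholds are hypothetical and task dependent.
    """
    # Question 1: is there high bias?
    if train_error > acceptable_error:
        return [
            "train a bigger network",
            "train for longer",
            "try a different optimisation algorithm",
        ]
    # Question 2: is there high variance?
    if dev_error - train_error > max_gap:
        return ["gather more training data", "try regularization"]
    # Neither problem remains: the network is "just right".
    return []


print(next_steps(train_error=0.15, dev_error=0.30))
print(next_steps(train_error=0.01, dev_error=0.11))
print(next_steps(train_error=0.01, dev_error=0.02))
```

Note that the first call only returns the bias steps even though the gap is also large, because the recipe deals with bias before it looks at variance.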

Regularization

Regularization is a technique that helps to reduce overfitting in a neural network.

When we add regularization to our network, we add a new regularization term, and the loss function is modified.

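The modified cost function J is mathematically formulated as (written here in the standard L2 form, with m the number of training examples and L the per-example loss):

$$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l} \lVert W^{[l]} \rVert_F^2$$

The second term with lambda is known as the regularization term.

The term ||W||_F is the Frobenius norm of a weight matrix; its square, which appears in the regularization term, is the sum of squares of the elements of the matrix.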

With the inclusion of regularization, lambda becomes a new hyperparameter that can be modified to improve the performance of the neural network.

The above regularization is also known as L2 regularization.

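Earlier, we used the following update rule to update the weights:

$$W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$$

Since the modified cost function J includes a regularization term, we now update the weights in the following manner, where dW is still the gradient of the unregularized loss and (lambda/m) W is the gradient of the standard L2 penalty (lambda/2m) ||W||_F^2:

$$W^{[l]} := W^{[l]} - \alpha \left( dW^{[l]} + \frac{\lambda}{m} W^{[l]} \right) = \left( 1 - \frac{\alpha \lambda}{m} \right) W^{[l]} - \alpha \, dW^{[l]}$$

Here we can see that, before the usual gradient step, each weight is multiplied by a factor (1 - alpha * lambda / m) that is slightly less than 1.

Therefore, this type of regularization is also called Weight Decay.

The decay factor depends on the learning rate alpha and the regularization parameter lambda.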
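To make the weight decay effect concrete, here is a minimal NumPy sketch. It assumes the standard L2 penalty (lambda / (2m)) times the sum of squared weights, where m is the number of training examples; the function names `l2_cost` and `update_with_weight_decay` are hypothetical, and `dW` stands for the gradient of the unregularized loss.

```python
import numpy as np


def l2_cost(data_loss, weights, lam, m):
    """Add the L2 penalty (lam / (2 * m)) * sum of squared weights to the loss."""
    penalty = sum(np.sum(np.square(W)) for W in weights)
    return data_loss + (lam / (2 * m)) * penalty


def update_with_weight_decay(W, dW, alpha, lam, m):
    """One gradient step; dW is the gradient of the unregularized loss."""
    return (1 - alpha * lam / m) * W - alpha * dW


W = np.array([[1.0, -2.0], [0.5, 3.0]])
dW = np.zeros_like(W)  # a zero data gradient isolates the decay effect
W_new = update_with_weight_decay(W, dW, alpha=0.1, lam=0.7, m=10)
print(W_new / W)  # every weight is scaled by 1 - 0.1 * 0.7 / 10
```

Even with a zero data gradient, every weight shrinks a little on each step, which is where the name weight decay comes from.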

Why does Regularization work?

The end goal of training a neural network is to minimize the cost function J, and with it the regularization term.

Now that we know what regularization is, let us try to understand why it works.

The first intuition is that if we increase the value of lambda, the Frobenius Norm becomes small, and the weight values become close to 0.

This effectively wipes out certain neurons, making the network behave like a shallower one.

It may be thought of as converting the deep network, which learns a complex hypothesis, into a shallow network, which learns a simple hypothesis.

Since a simpler hypothesis relies on fewer complex features, the overfitting will be reduced, and we will obtain a “Just Right” Neural Network.

Another intuition can be gained from the way the activation of a neuron works when regularization is applied.

For this, let us consider tanh(x) activation.

If we increase the value of lambda, then the Frobenius Norm becomes small, i.e. the Weights W become small.

Due to this, the input to the activation of that layer will become small and will lie in the blue region of the activation function, close to zero.

Since the activation is almost linear in the blue region, the network will behave similarly to a shallow network, i.e. it will not learn a complex hypothesis (sharp curves will be avoided), so the overfitting will be reduced, and we will obtain a “Just Right” Neural Network.

Conversely, too small a value of lambda will result in Overfitting, because the Frobenius Norm will be large, neurons will not be wiped out, and the output of the layer will not stay in the linear region.

Similarly, an excessively large value of lambda will result in underfitting.

Therefore, finding the perfect value of lambda is a crucial task in improving the performance of the neural network.
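The tanh intuition above can be checked numerically. The sketch below compares tanh(z) with the straight line y = z; `linear_gap` is a hypothetical helper name, and the sample values of z are arbitrary.

```python
import numpy as np


def linear_gap(z):
    """Distance between tanh(z) and the straight line y = z."""
    return abs(np.tanh(z) - z)


for z in [0.05, 0.1, 0.5, 2.0, 4.0]:
    print(f"z = {z:4.2f}  tanh(z) = {np.tanh(z):.4f}  gap = {linear_gap(z):.4f}")
```

For small inputs the gap is tiny, so the neuron behaves almost linearly; for large inputs tanh saturates near 1 and the gap grows. Small weights keep the inputs to the activation in the first regime.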

Dropout Regularization

Dropout regularization is another regularization technique in which we drop certain neurons, along with their connections, from the neural network.

Each neuron is kept with probability keep_prob and dropped with probability 1 - keep_prob.

After the neurons are removed, the network is trained on the remaining neurons.

It is important to note that at test (inference) time, all the neurons are taken into consideration for determining the output.

Let us try to understand the concept with the help of an example:

    import numpy as np

    # Activations of layer 2. Random placeholder values are used here so
    # that the example runs on its own.
    a2 = np.random.randn(4, 3)

    # Define the probability that a neuron stays.
    keep_prob = 0.5

    # Create a random mask for a layer, e.g. layer 2. The mask must have
    # the same dimensions as the activation matrix a2 so that individual
    # neuron outputs can be zeroed out.
    d2 = np.random.rand(a2.shape[0], a2.shape[1]) < keep_prob

    # Obtain the new output matrix: dropped neurons are set to 0.
    a2 = np.multiply(a2, d2)

    # Since some neurons are removed, scale up the remaining activations
    # so that the expected output matches what the next layer sees at
    # test time.
    a2 = a2 / keep_prob

Since we first drop neurons with probability 1 - keep_prob and then scale up the remaining activations by dividing by keep_prob, this type of Dropout is known as Inverted Dropout.
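The difference between training time and test time can be sketched as a single function. This is an illustrative version only: `dropout_forward` and its `training` flag are hypothetical names, and the activations are placeholder values of 1 so that the effect on the mean is easy to see.

```python
import numpy as np


def dropout_forward(a, keep_prob, training, rng):
    """Apply inverted dropout to activations `a` during training only."""
    if not training:
        return a  # inference: use every neuron, no mask and no scaling
    mask = rng.random(a.shape) < keep_prob  # True where the neuron is kept
    return (a * mask) / keep_prob  # rescale so the expected value is unchanged


rng = np.random.default_rng(0)
a2 = np.ones((1000, 100))  # placeholder activations
a2_train = dropout_forward(a2, keep_prob=0.8, training=True, rng=rng)
a2_test = dropout_forward(a2, keep_prob=0.8, training=False, rng=rng)
print(a2_train.mean())  # close to the original mean of 1.0
print(a2_test.mean())   # unchanged, since nothing is dropped at inference
```

Because the kept activations are divided by keep_prob, the average activation during training stays close to its value at inference, so no extra scaling is needed at test time.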

The intuition behind dropout is that it prevents the neurons from relying only on certain features, and therefore, the weights are spread out.

It may be the case that the neuron becomes dependent on certain input features to determine the output.

With dropout regularization, a particular neuron receives only a random subset of its input features for each training example during training.

Eventually, the weights are spread out amongst all the inputs, and the network uses all the input features to determine the output and does not rely on any single one, thus making the network more robust.

Dropout can also be viewed as an adaptive form of L2 Regularization.

We can also set keep_prob individually for each layer.

Since a lower keep_prob means that more neurons are dropped, the general guideline for choosing keep_prob is that layers with dense connections should have a relatively lower keep_prob so that more neurons are dropped, and vice versa.

Another intuition is that with Dropout Regularization, the deep network mimics the working of a shallow network during the training phase.

This, in turn, leads to reducing overfitting, and we obtain a “Just Right” Neural Network.

Early Stopping

Early Stopping is a training methodology in which we stop training the neural network at an earlier stage to prevent it from overfitting.

We keep track of train_loss and dev_loss to determine when to stop the training.

As soon as the dev_loss starts to increase, we stop the training process.

This methodology is known as Early Stopping.
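The stopping rule can be sketched on a list of recorded dev losses. This is a simplified illustration rather than a full training loop: `early_stopping_epoch` is a hypothetical helper, `patience` (how many non-improving epochs to tolerate) is an added assumption, and the loss values are made up.

```python
def early_stopping_epoch(dev_losses, patience=1):
    """Return the 0-based epoch with the best dev loss seen before stopping.

    Training is considered stopped once the dev loss has failed to improve
    for `patience` consecutive epochs. `patience` is a hypothetical knob.
    """
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # dev loss has started to rise: stop training
    return best_epoch


dev_loss = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]  # made-up values
print(early_stopping_epoch(dev_loss))  # 3
```

In a real training loop, the weights from the best epoch would typically be saved and restored when training stops.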

However, early stopping is not a recommended method for training a network because of the following two reasons:

1. The loss is not at its minimum when we stop the training process.

2. We are trying to reduce overfitting on an improperly trained network.

Early stopping makes things complicated, and we are not able to obtain the “Just Right” Neural Network.

References

Wikipedia: Activation Functions

Coursera: Deep Learning Course 2

I want to thank the readers for reading the story.

If you have any questions or doubts, feel free to ask them in the comments section below.

I’ll be more than happy to answer them and help you out.

If you like the story, please follow me to get regular updates when I publish a new story.

I welcome any suggestions that will improve my stories.
