# Understanding and coding Neural Networks From Scratch in Python and R

Note: This article was originally published on May 29, 2017, and updated on July 24, 2020.

## Overview

- Neural networks are one of the most popular machine learning algorithms
- Gradient descent forms the basis of neural networks
- Neural networks can be implemented in both R and Python using certain libraries and packages

## Introduction

You can learn and practice a concept in two ways:

Option 1: You can learn the entire theory on a particular subject and then look for ways to apply those concepts.

So, you read up on how an entire algorithm works, the maths behind it, its assumptions, and its limitations, and then you apply it.

This is a robust but time-consuming approach.

Option 2: Start with simple basics and develop an intuition on the subject.

Then, pick a problem and start solving it.

Learn the concepts while you are solving the problem.

Then, keep tweaking and improving your understanding.

So, you read up on how to apply an algorithm, then go out and apply it.

Once you know how to apply it, try it out with different parameters, values, and limits, and develop an understanding of the algorithm.

I prefer Option 2 and take that approach to learn any new topic.

I might not be able to tell you the entire math behind an algorithm, but I can tell you the intuition.

I can tell you the best scenarios to apply an algorithm based on my experiments and understanding.

In my interactions with people, I find that people don’t take time to develop this intuition and hence they struggle to apply things in the right manner.

In this article, I will discuss the building block of neural networks from scratch and focus more on developing this intuition to apply Neural networks.

We will code in both “Python” and “R”.

By the end of this article, you will understand how neural networks work, how we initialize weights, and how we update them using back-propagation.

Let’s start.

In case you want to learn this in a course format, check out our course Fundamentals of Deep Learning.

## Table of Contents

- Simple intuition behind Neural networks
- Multi-Layer Perceptron and its basics
- Steps involved in Neural Network methodology
- Visualizing steps for Neural Network working methodology
- Implementing NN using Numpy (Python)
- Implementing NN using R
- Understanding the implementation of Neural Networks from scratch in detail
- [Optional] Mathematical Perspective of Back Propagation Algorithm

## Simple intuition behind neural networks

In case you have been a developer or seen one work, you know how it is to search for bugs in code.

You would fire various test cases by varying the inputs or circumstances and look for the output.

Further, the change in output provides you a hint on where to look for the bug – which module to check, which lines to read.

Once you find it, you make the changes and the exercise continues until you have the right code/application.

Neural networks work in a very similar manner.

A network takes several inputs, processes them through multiple neurons from multiple hidden layers, and returns the result using an output layer.

This result estimation process is technically known as “Forward Propagation”.

Next, we compare the result with the actual output.

The task is to make the output of the neural network as close as possible to the actual (desired) output.

Each of these neurons is contributing some error to the final output.

How do you reduce the error?

We try to reduce the weights of the neurons that contribute more to the error. This happens while traveling back through the neurons of the network and finding where the error lies.

This process is known as “Backward Propagation”.

To reduce the number of iterations needed to minimize the error, neural networks use a common algorithm known as “Gradient Descent”, which helps to optimize the task quickly and efficiently.

That’s it – this is how Neural networks work! I know this is a very simple representation, but it would help you understand things in a simple manner.

## Multi-Layer Perceptron and its basics

Just like atoms form the basis of any material on earth, the basic forming unit of a neural network is a perceptron.

So, what is a perceptron?

A perceptron can be understood as anything that takes multiple inputs and produces one output.

For example, look at the image below.

[Image: Perceptron]

The above structure takes three inputs and produces one output.

The next logical question is what is the relationship between input and output? Let us start with basic ways and build on to find more complex ways.

Below, I have discussed three ways of creating input-output relationships:

By directly combining the inputs and computing the output based on a threshold value. For example, take x1=0, x2=1, x3=1 and set a threshold of 0. So, if x1+x2+x3>0, the output is 1, otherwise 0. You can see that in this case, the perceptron calculates the output as 1.

Next, let us add weights to the inputs.

Weights give importance to an input.

For example, you assign w1=2, w2=3, and w3=4 to x1, x2, and x3 respectively.

To compute the output, we will multiply input with respective weights and compare with threshold value as w1*x1 + w2*x2 + w3*x3 > threshold.

These weights assign more importance to x3 in comparison to x1 and x2.

Next, let us add bias: Each perceptron also has a bias, which can be thought of as how flexible the perceptron is.

It is somewhat similar to the constant b of a linear function y = ax + b.

It allows us to move the line up and down to fit the prediction to the data better.

Without b the line will always go through the origin (0, 0) and you may get a poorer fit.

For example, a perceptron with two inputs requires three weights: one for each input and one for the bias.

For our three-input example, the linear representation of the input now looks like w1*x1 + w2*x2 + w3*x3 + 1*b.
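The three ways of relating input to output described above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the original walkthrough: the helper name `perceptron` is hypothetical, and the bias value in the last call is an assumed number chosen to show that a bias can flip the result. The inputs, weights, and threshold are the ones used in the text.

```python
import numpy as np

def perceptron(x, w=None, b=0.0, threshold=0.0):
    """Return 1 if the (weighted) sum of inputs plus bias exceeds the threshold, else 0.

    `perceptron` is a hypothetical helper used for illustration only.
    """
    x = np.asarray(x, dtype=float)
    # With no weights given, every input counts equally (weight 1)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    weighted_sum = np.dot(w, x) + b
    return int(weighted_sum > threshold)

x = [0, 1, 1]  # x1, x2, x3 from the text

# 1) Plain sum against a threshold of 0: 0 + 1 + 1 = 2 > 0
out_plain = perceptron(x)

# 2) Weighted sum with w1=2, w2=3, w3=4: 0*2 + 1*3 + 1*4 = 7 > 0
out_weighted = perceptron(x, w=[2, 3, 4])

# 3) Weighted sum plus a bias; b = -10 is an assumed value:
#    7 + (-10) = -3, which is not greater than 0
out_biased = perceptron(x, w=[2, 3, 4], b=-10)

print(out_plain, out_weighted, out_biased)  # 1 1 0
```

Notice that the bias shifts the decision without touching the weights, which is the same role the constant b plays in y = ax + b.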

But, all of this is still linear which is what perceptrons used to be.

But that was not as much fun.

So, people thought of evolving a perceptron into what is now called an artificial neuron.

A neuron applies a non-linear transformation (an activation function) to the weighted inputs and bias.

What is an activation function?

An activation function takes the sum of weighted inputs (w1*x1 + w2*x2 + w3*x3 + 1*b) as an argument and returns the output of the neuron.

In the above equation, we have represented 1 as x0 and b as w0.

Moreover, the activation function is mostly used to make a non-linear transformation that allows us to fit nonlinear hypotheses or to estimate complex functions.

There are multiple activation functions, like “Sigmoid”, “Tanh”, “ReLU” and many others.
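To make the three activation functions named above concrete, here is a small NumPy sketch. The function names and the bias value b = -6 are my own assumptions for the example; the inputs and weights are the ones used earlier in this section.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real number into the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # Passes positive values through and clips negative values to 0
    return np.maximum(0.0, z)

# Weighted sum for x = (0, 1, 1) and w = (2, 3, 4); b = -6 is an assumed bias
x = np.array([0.0, 1.0, 1.0])
w = np.array([2.0, 3.0, 4.0])
b = -6.0
z = np.dot(w, x) + 1 * b  # w1*x1 + w2*x2 + w3*x3 + 1*b = 1.0

for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(f"{name}({z}) = {fn(z):.4f}")
```

All three take the same weighted sum as input; they differ only in how they bend it, which is what lets a network fit non-linear relationships.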

### Forward Propagation, Back Propagation, and Epochs

Till now, we have computed the output, and this process is known as “Forward Propagation”.

But what if the estimated output is far away from the actual output (high error)?

In a neural network, we update the biases and weights based on the error.

This weight and bias updating process is known as “Back Propagation”.

Back-propagation (BP) algorithms work by determining the loss (or error) at the output and then propagating it back into the network.

The weights are updated to minimize the error resulting from each neuron.

The first step in minimizing the error is to determine the gradient (derivative) of the final output error w.r.t. each node.

To get a mathematical perspective of backward propagation, refer to the optional section at the end of this article.

One such round of forward and backward propagation is known as one training iteration, aka “Epoch”.

### Multi-layer perceptron

Now, let’s move on to the next part of Multi-Layer Perceptron.

So far, we have seen just a single layer consisting of 3 input nodes, i.e. x1, x2, and x3, and an output layer consisting of a single neuron.

But, for practical purposes, the single-layer network can do only so much.

An MLP consists of multiple layers called Hidden Layers stacked in between the Input Layer and the Output Layer as shown below.

The image above shows just a single hidden layer in green, but in practice an MLP can contain multiple hidden layers.

In addition, another point to remember in the case of an MLP is that all the layers are fully connected, i.e. every node in a layer (except the input and the output layer) is connected to every node in the previous layer and the following layer.

Let’s move on to the next topic, which is a training algorithm for neural networks (to minimize the error).

Here, we will look at the most common training algorithm, known as Gradient Descent.

### Full Batch Gradient Descent and Stochastic Gradient Descent

Both variants of Gradient Descent perform the same work of updating the weights of the MLP by using the same updating algorithm, but the difference lies in the number of training samples used to update the weights and biases.

The Full Batch Gradient Descent algorithm, as the name implies, uses all the training data points to update each of the weights once, whereas Stochastic Gradient Descent uses one or more samples, but never the entire training data, to update the weights once.

Let us understand this with a simple example of a dataset of 10 data points with two weights w1 and w2.

Full Batch: You use 10 data points (the entire training data), calculate the change in w1 (Δw1) and the change in w2 (Δw2), and update w1 and w2.

SGD: You use the 1st data point, calculate the change in w1 (Δw1) and the change in w2 (Δw2), and update w1 and w2. Next, when you use the 2nd data point, you will work on the updated weights.

For a more in-depth explanation of both the methods, you can have a look at this article.
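The 10-data-point example can be turned into a short sketch. Everything specific here is an assumption made for illustration: the toy data, the linear model with two weights, the squared-error gradient, and the helper name `gradient`. The point is only the number of updates per pass over the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: 10 points, 2 features, targets from a known linear rule
X = rng.normal(size=(10, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

def gradient(w, X_batch, y_batch):
    # Gradient of 0.5 * mean((X w - y)^2) with respect to w = (w1, w2)
    error = X_batch @ w - y_batch
    return X_batch.T @ error / len(y_batch)

lr = 0.1

# Full batch: one update of (w1, w2) per pass, computed from all 10 points
w_full = np.zeros(2)
full_updates = 0
w_full -= lr * gradient(w_full, X, y)
full_updates += 1

# SGD: one update per data point, each one starting from the updated weights
w_sgd = np.zeros(2)
sgd_updates = 0
for i in range(len(X)):
    w_sgd -= lr * gradient(w_sgd, X[i:i + 1], y[i:i + 1])
    sgd_updates += 1

print("full batch:", full_updates, "update  ->", w_full)
print("SGD:       ", sgd_updates, "updates ->", w_sgd)
```

After one pass over the same data, the full batch variant has changed the weights once, while SGD has changed them ten times.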

## Steps involved in Neural Network methodology

Let’s look at the step-by-step building methodology of a Neural Network (an MLP with one hidden layer, similar to the architecture shown above).

At the output layer, we have only one neuron as we are solving a binary classification problem (predict 0 or 1).

We could also have two neurons, one for each of the two classes.

First, look at the broad steps:

0.) We take input and output

- X as an input matrix
- y as an output matrix

1.) Then we initialize weights and biases with random values (this is a one-time initialization; in the next iteration, we will use the updated weights and biases). Let us define:

- wh as a weight matrix to the hidden layer
- bh as bias matrix to the hidden layer
- wout as a weight matrix to the output layer
- bout as bias matrix to the output layer

2.) Then we take the matrix dot product of the input and the weights assigned to the edges between the input and hidden layer, then add the biases of the hidden layer neurons to the respective inputs. This is known as a linear transformation:

    hidden_layer_input = matrix_dot_product(X, wh) + bh

3.) Perform a non-linear transformation using an activation function (Sigmoid). Sigmoid will return the output as 1/(1 + exp(-x)).

    hiddenlayer_activations = sigmoid(hidden_layer_input)

4.) Then perform a linear transformation on the hidden layer activations (take the matrix dot product with the weights and add the bias of the output layer neuron), then apply an activation function (again sigmoid, but you can use any other activation function depending upon your task) to predict the output:

    output_layer_input = matrix_dot_product(hiddenlayer_activations, wout) + bout
    output = sigmoid(output_layer_input)

All the above steps are known as “Forward Propagation”.

5.) Compare the prediction with the actual output and calculate the gradient of the error (Actual - Predicted). The error is the squared loss ((Y-t)^2)/2, where Y is the prediction and t is the target; its negative gradient with respect to the prediction is:

    E = y - output

6.) Compute the slope/gradient of the hidden and output layer neurons (to compute the slope, we calculate the derivative of the non-linear activation at each layer for each neuron). For a sigmoid output x, the gradient can be returned as x * (1 - x).

    slope_output_layer = derivatives_sigmoid(output)
    slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)

7.) Then compute the change factor (delta) at the output layer, which depends on the gradient of the error multiplied by the slope of the output layer activation:

    d_output = E * slope_output_layer

8.) At this step, the error will propagate back into the network, which gives us the error at the hidden layer. For this, we will take the dot product of the output layer delta with the weight parameters of the edges between the hidden and output layer (wout.T):

    Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose)

9.) Compute the change factor (delta) at the hidden layer by multiplying the error at the hidden layer with the slope of the hidden layer activation:

    d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer

10.) Then update the weights at the output and hidden layer. The weights in the network can be updated from the errors calculated for the training example(s):

    wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose, d_output) * learning_rate
    wh = wh + matrix_dot_product(X.Transpose, d_hiddenlayer) * learning_rate

learning_rate: The amount by which the weights are updated is controlled by a configuration parameter called the learning rate.

11.) Finally, update the biases at the output and hidden layer. The biases in the network can be updated from the aggregated errors at that neuron:

- bias at output_layer = bias at output_layer + sum of delta of output_layer at row-wise * learning_rate
- bias at hidden_layer = bias at hidden_layer + sum of delta of hidden_layer at row-wise * learning_rate

    bh = bh + sum(d_hiddenlayer, axis=0) * learning_rate
    bout = bout + sum(d_output, axis=0) * learning_rate

Steps 5 to 11 are known as “Backward Propagation”.

One forward and backward propagation iteration is considered as one training cycle.

As I mentioned earlier, when we train a second time, the updated weights and biases are used for forward propagation.

Above, we have updated the weights and biases for the hidden and output layer, and we have used a full batch gradient descent algorithm.
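The steps above translate almost line for line into NumPy. The sketch below is my reconstruction under stated assumptions, not a verbatim listing: it reuses the toy input, epoch count, learning rate, and layer sizes from the R example in this article, while the random seed, the uniform initialization, and the `losses` list are assumptions added so that the run is reproducible and the progress can be inspected.

```python
import numpy as np

np.random.seed(42)  # assumed seed, only for reproducibility

# Step 0: input and output (same toy data as the R example in this article)
X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
y = np.array([[1], [1], [0]], dtype=float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def derivatives_sigmoid(x):
    # x is already a sigmoid output, so the slope is x * (1 - x)
    return x * (1 - x)

epoch = 5000
lr = 0.1
inputlayer_neurons = X.shape[1]
hiddenlayer_neurons = 3
output_neurons = 1

# Step 1: random initialization of weights and biases (uniform is an assumption)
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

losses = []  # assumed helper list to track the squared loss per iteration
for i in range(epoch):
    # Steps 2-4: forward propagation
    hidden_layer_input = np.dot(X, wh) + bh
    hiddenlayer_activations = sigmoid(hidden_layer_input)
    output_layer_input = np.dot(hiddenlayer_activations, wout) + bout
    output = sigmoid(output_layer_input)

    # Steps 5-9: backward propagation
    E = y - output
    losses.append(float(np.mean(E ** 2) / 2))
    slope_output_layer = derivatives_sigmoid(output)
    slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)
    d_output = E * slope_output_layer
    Error_at_hidden_layer = np.dot(d_output, wout.T)
    d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer

    # Steps 10-11: update weights and biases (full batch: all rows at once)
    wout += np.dot(hiddenlayer_activations.T, d_output) * lr
    wh += np.dot(X.T, d_hiddenlayer) * lr
    bout += np.sum(d_output, axis=0, keepdims=True) * lr
    bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * lr

print(output)
```

Because every update uses all three rows of X at once, this is the full batch variant of gradient descent. The biases are kept as single rows and rely on NumPy broadcasting to be added to every sample.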

## Visualization of steps for Neural Network methodology

We will repeat the above steps and visualize the input, weights, biases, output, and error matrix to understand the working methodology of a Neural Network (MLP).

Note:

- For good visualization images, I have rounded the decimals to 2 or 3 places.
- Yellow filled cells represent the current active cell
- Orange cells represent the input used to populate the values of the current cell

Step 0: Read input and output

Step 1: Initialize weights and biases with random values (there are methods to initialize weights and biases, but for now initialize with random values)

Step 2: Calculate hidden layer input:

    hidden_layer_input = matrix_dot_product(X, wh) + bh

Step 3: Perform non-linear transformation on hidden linear input

    hiddenlayer_activations = sigmoid(hidden_layer_input)

Step 4: Perform linear and non-linear transformation of hidden layer activation at output layer

    output_layer_input = matrix_dot_product(hiddenlayer_activations, wout) + bout
    output = sigmoid(output_layer_input)

Step 5: Calculate gradient of Error (E) at output layer

    E = y - output

Step 6: Compute slope at output and hidden layer

    slope_output_layer = derivatives_sigmoid(output)
    slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)

Step 7: Compute delta at output layer

    d_output = E * slope_output_layer

Step 8: Calculate Error at the hidden layer

    Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose)

Step 9: Compute delta at hidden layer

    d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer

Step 10: Update weights at both output and hidden layer

    wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose, d_output) * learning_rate
    wh = wh + matrix_dot_product(X.Transpose, d_hiddenlayer) * learning_rate

Step 11: Update biases at both output and hidden layer

    bh = bh + sum(d_hiddenlayer, axis=0) * learning_rate
    bout = bout + sum(d_output, axis=0) * learning_rate

Above, you can see that there is still a sizeable error and the output is not close to the actual target value, because we have completed only one training iteration.

If we train the model multiple times, the output will get very close to the actual outcome.

I have completed thousands of iterations and my result is close to the actual target values ([[0.98032096] [0.96845624] [0.04532167]]).

## Implementing NN using Numpy (Python)

## Implementing NN in R

    # input matrix
    X=matrix(c(1,0,1,0,1,0,1,1,0,1,0,1),nrow = 3, ncol=4,byrow = TRUE)

    # output matrix
    Y=matrix(c(1,1,0),byrow=FALSE)

    # sigmoid function
    sigmoid<-function(x){
      1/(1+exp(-x))
    }

    # derivative of sigmoid function
    derivatives_sigmoid<-function(x){
      x*(1-x)
    }

    # variable initialization
    epoch=5000
    lr=0.1
    inputlayer_neurons=ncol(X)
    hiddenlayer_neurons=3
    output_neurons=1

    # weight and bias initialization
    wh=matrix( rnorm(inputlayer_neurons*hiddenlayer_neurons,mean=0,sd=1), inputlayer_neurons, hiddenlayer_neurons)
    bias_in=runif(hiddenlayer_neurons)
    bias_in_temp=rep(bias_in, nrow(X))
    # byrow = TRUE so that every row (sample) holds the same per-neuron biases
    bh=matrix(bias_in_temp, nrow = nrow(X), byrow = TRUE)
    wout=matrix( rnorm(hiddenlayer_neurons*output_neurons,mean=0,sd=1), hiddenlayer_neurons, output_neurons)
    bias_out=runif(output_neurons)
    bias_out_temp=rep(bias_out,nrow(X))
    bout=matrix(bias_out_temp,nrow = nrow(X),byrow = TRUE)

    for(i in 1:epoch){
      # forward propagation
      hidden_layer_input1= X%*%wh
      hidden_layer_input=hidden_layer_input1+bh
      hidden_layer_activations=sigmoid(hidden_layer_input)
      output_layer_input1=hidden_layer_activations%*%wout
      output_layer_input=output_layer_input1+bout
      output= sigmoid(output_layer_input)

      # back propagation
      E=Y-output
      slope_output_layer=derivatives_sigmoid(output)
      slope_hidden_layer=derivatives_sigmoid(hidden_layer_activations)
      d_output=E*slope_output_layer
      Error_at_hidden_layer=d_output%*%t(wout)
      d_hiddenlayer=Error_at_hidden_layer*slope_hidden_layer
      wout= wout + (t(hidden_layer_activations)%*%d_output)*lr
      # sum the deltas over the samples (column-wise) and repeat the result for every row
      bout= bout + matrix(colSums(d_output)*lr, nrow = nrow(X), ncol = output_neurons, byrow = TRUE)
      wh = wh +(t(X)%*%d_hiddenlayer)*lr
      bh = bh + matrix(colSums(d_hiddenlayer)*lr, nrow = nrow(X), ncol = hiddenlayer_neurons, byrow = TRUE)
    }
    output

## Understanding the implementation of Neural Networks from scratch in detail

Now that you have gone through a basic implementation of a neural network from scratch in both Python and R, we will dive deep into understanding each code block and try to apply the same code on a different dataset.

We will also visualize how our model is working, by “debugging” it step by step using the interactive environment of a jupyter notebook and using basic data science tools such as numpy and matplotlib.

So let’s get started!

The first thing we will do is to import the libraries mentioned before, namely numpy and matplotlib.

Also, as we will be working with the jupyter notebook IDE, we will set inline plotting of graphs using the magic function %matplotlib inline.

Let’s check the versions of the libraries we are using.

Version of numpy: 1.18.1

and the same for matplotlib

Version of matplotlib: 3.1.3

Also, let’s set the random seed parameter to a specific number (let’s say 42, as we already know that is the answer to everything!) so that the code we run gives us the same output every time we run it (hopefully!).

Now the next step is to create our input.

Firstly, let’s take a dummy dataset, where only the first column is a useful column, whereas the rest may or may not be useful and can be potential noise.

This is the output we get from running the code:

    Input:
    [[1 0 0 0]
     [1 0 1 1]
     [0 1 0 1]]
    Shape of Input: (3, 4)

Now as you might remember, we have to take the transpose of the input so that we can train our network.

Let’s do that quickly.

    Input in matrix form:
    [[1 1 0]
     [0 0 1]
     [0 1 0]
     [0 1 1]]
    Shape of Input Matrix: (4, 3)

Now let’s create our output array and transpose that too.

    Actual Output:
    [[1]
     [1]
     [0]]
    Output in matrix form:
    [[1 1 0]]
    Shape of Output: (1, 3)

Now that our input and output data is ready, let’s define our neural network.

We will define a very simple architecture, having one hidden layer with just three neurons.

Then, we will initialize the weights for each neuron in the network.

The weights we create have values ranging from 0 to 1, which we initialize randomly at the start.

For simplicity, we will not include bias in the calculations, but you can check the simple implementation we did before to see how it works for the bias term.

Let’s print the shapes of these numpy arrays for clarity.
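The setup described so far (seed, data, architecture, and weight initialization) can be sketched as follows. The array values, the seed, the hidden layer size, and the 0 to 1 weight range come from the text; the variable names such as `w_ih` and `w_ho` and the use of `np.random.uniform` are my assumptions, so the exact random numbers may differ from the ones in the original run.

```python
import numpy as np

np.random.seed(42)  # seed value taken from the text

# Dummy dataset: 3 samples with 4 features each
X = np.array([[1, 0, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1]])
print("Input:\n", X)
print("Shape of Input:", X.shape)

# Transpose so that each column is one sample
X = X.T
print("Input in matrix form:\n", X)
print("Shape of Input Matrix:", X.shape)

# Output array, transposed in the same way
y = np.array([[1], [1], [0]]).T
print("Output in matrix form:\n", y)
print("Shape of Output:", y.shape)

# Architecture: one hidden layer with three neurons
input_neurons = X.shape[0]  # 4 features
hidden_neurons = 3
output_neurons = 1

# Random weights between 0 and 1 (no bias terms, as in the text)
w_ih = np.random.uniform(size=(input_neurons, hidden_neurons))   # input -> hidden
w_ho = np.random.uniform(size=(hidden_neurons, output_neurons))  # hidden -> output
print(w_ih.shape, w_ho.shape)
```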

After this, we will define our activation function as sigmoid, which we will use in both the hidden layer and output layer of the network.

And then, we will implement our forward pass, first to get the hidden layer activations and then for the output layer.

Let’s see what our untrained model gives as an output.

We get an output for each sample of the input data.

In this case, let’s calculate the error for each sample using the squared error loss.

We get an output like this:

    array([[0.05013458, 0.03727248, 0.25388062]])

We have completed our forward propagation step and got the error.
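A sketch of this forward pass and error calculation is shown below. The variable names and the exact random initialization are assumptions, so the printed numbers will not necessarily match the array shown in the text; what should match is the shape, one error value per sample.

```python
import numpy as np

np.random.seed(42)

X = np.array([[1, 0, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1]]).T  # shape (4, 3)
y = np.array([[1], [1], [0]]).T                              # shape (1, 3)

w_ih = np.random.uniform(size=(4, 3))  # input -> hidden (assumed name)
w_ho = np.random.uniform(size=(3, 1))  # hidden -> output (assumed name)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass (no bias terms, as in the text)
hidden_layer_input = np.dot(w_ih.T, X)                         # (3, 3): hidden neurons x samples
hidden_layer_activations = sigmoid(hidden_layer_input)
output_layer_input = np.dot(w_ho.T, hidden_layer_activations)  # (1, 3)
output = sigmoid(output_layer_input)

# Squared error for each sample
error = np.square(y - output) / 2

print(output)
print(error)
```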

Now let’s do a backward propagation to calculate the gradient of the error with respect to each weight of the network and then update these weights using simple gradient descent.

Firstly we will calculate the gradient of the error with respect to the weights between the hidden and output layers.

To calculate this, the following would be our intermediate steps using the chain rule (Z2 being the input to the output layer’s activation):

- Rate of change of error w.r.t. output
- Rate of change of output w.r.t. Z2
- Rate of change of Z2 w.r.t. weights between hidden and output layer

Let’s perform the operations.

Now, let’s check the shapes of the intermediate operations.

What we want is an output whose shape matches the weights between the hidden and output layers.

Now as we saw before, we can define this operation formally by combining these three terms with the chain rule. Let’s perform the steps.

We get the output as expected.
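The three chain-rule terms for the hidden-to-output weights can be written out as below. The names (`error_wrt_output`, `output_wrt_Z2`, `Z2_wrt_who`, `error_wrt_who`) are assumed for this sketch, and the way the terms are combined is one arrangement that produces a result with the same shape as the weights.

```python
import numpy as np

np.random.seed(42)

X = np.array([[1, 0, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1]]).T  # (4, 3)
y = np.array([[1], [1], [0]]).T                              # (1, 3)
w_ih = np.random.uniform(size=(4, 3))
w_ho = np.random.uniform(size=(3, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass
hidden_layer_activations = sigmoid(np.dot(w_ih.T, X))        # (3, 3)
output = sigmoid(np.dot(w_ho.T, hidden_layer_activations))   # (1, 3)

# Rate of change of error w.r.t. output, for error = (y - output)^2 / 2
error_wrt_output = -(y - output)                             # (1, 3)
# Rate of change of output w.r.t. Z2 (derivative of sigmoid)
output_wrt_Z2 = output * (1 - output)                        # (1, 3)
# Rate of change of Z2 w.r.t. the hidden -> output weights
Z2_wrt_who = hidden_layer_activations                        # (3, 3)

print(error_wrt_output.shape, output_wrt_Z2.shape, Z2_wrt_who.shape)

# Combine the three terms; the sum over samples happens inside the dot product
error_wrt_who = np.dot(Z2_wrt_who, (error_wrt_output * output_wrt_Z2).T)
print(error_wrt_who.shape, w_ho.shape)
```

The transpose is only there to line up the sample dimension, so that the dot product adds up the contribution of every sample to each weight.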

Further, let’s perform the same steps for calculating the gradient of the error with respect to the weights between the input and hidden layers.

So by the chain rule, we will calculate the following intermediate steps (Z1 being the input to the hidden layer’s activation):

- Rate of change of error w.r.t. output
- Rate of change of output w.r.t. Z2
- Rate of change of Z2 w.r.t. hidden layer activations
- Rate of change of hidden layer activations w.r.t. Z1
- Rate of change of Z1 w.r.t. weights between input and hidden layer

Let’s print the shapes of these intermediate arrays.

    (1, 3) (1, 3) (3, 1) (3, 3) (4, 3)

But what we want is an array with the shape of the weights between the input and hidden layers:

    (4, 3)

So we will combine them using the chain rule.

So that is the output we want.

Let’s quickly check the shape of the resultant array.
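Here is a sketch of these five terms and one way to combine them into a (4, 3) array. As before, the variable names are assumptions; the printed shapes of the intermediate arrays follow the ones listed in the text.

```python
import numpy as np

np.random.seed(42)

X = np.array([[1, 0, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1]]).T  # (4, 3)
y = np.array([[1], [1], [0]]).T                              # (1, 3)
w_ih = np.random.uniform(size=(4, 3))
w_ho = np.random.uniform(size=(3, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass
hidden_layer_activations = sigmoid(np.dot(w_ih.T, X))        # (3, 3)
output = sigmoid(np.dot(w_ho.T, hidden_layer_activations))   # (1, 3)

# The five chain-rule terms
error_wrt_output = -(y - output)                                     # (1, 3)
output_wrt_Z2 = output * (1 - output)                                # (1, 3)
Z2_wrt_h1 = w_ho                                                     # (3, 1)
h1_wrt_Z1 = hidden_layer_activations * (1 - hidden_layer_activations)  # (3, 3)
Z1_wrt_wih = X                                                       # (4, 3)

for term in (error_wrt_output, output_wrt_Z2, Z2_wrt_h1, h1_wrt_Z1, Z1_wrt_wih):
    print(term.shape, end=" ")
print()

# Push the output delta back through w_ho, scale by the hidden slope,
# then sum over samples with the inputs
delta_output = output_wrt_Z2 * error_wrt_output                      # (1, 3)
delta_hidden = h1_wrt_Z1 * np.dot(Z2_wrt_h1, delta_output)           # (3, 3)
error_wrt_wih = np.dot(Z1_wrt_wih, delta_hidden.T)                   # (4, 3)
print(error_wrt_wih.shape)
```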

Now the next step is to update the parameters.

For this, we will use the vanilla gradient descent update, which moves each weight a small step against its gradient.

Firstly, define our alpha parameter, i.e. the learning rate, as 0.01.

We also print the initial weights before the update, and then update the weights.

Then, we check the weights again to see if they have been updated.

Now, this is just one iteration (or epoch) of the forward and backward pass.

We have to do it multiple times to make our model perform better.

Let’s perform the steps above again for 1000 epochs.

We get an output like this, which is a debugging step we did to check the error at every hundredth epoch:

    Error at epoch 0 is 0.11553
    Error at epoch 100 is 0.11082
    Error at epoch 200 is 0.10606
    Error at epoch 300 is 0.09845
    Error at epoch 400 is 0.08483
    Error at epoch 500 is 0.06396
    Error at epoch 600 is 0.04206
    Error at epoch 700 is 0.02641
    Error at epoch 800 is 0.01719
    Error at epoch 900 is 0.01190

Our model seems to be performing better and better as the training continues.
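Putting the forward pass, the two gradients, and the gradient descent update into a loop gives a training routine along these lines. This is a sketch under assumptions: the variable names, the uniform initialization, and the use of the average squared error as the reported number are mine, so the printed values depend on the initialization and will not necessarily reproduce the log above.

```python
import numpy as np

np.random.seed(42)

X = np.array([[1, 0, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1]]).T  # (4, 3)
y = np.array([[1], [1], [0]]).T                              # (1, 3)
w_ih = np.random.uniform(size=(4, 3))
w_ho = np.random.uniform(size=(3, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.01      # learning rate (alpha) from the text
epochs = 1000
error_history = []

for epoch in range(epochs):
    # Forward pass
    h = sigmoid(np.dot(w_ih.T, X))         # (3, 3)
    output = sigmoid(np.dot(w_ho.T, h))    # (1, 3)
    error = np.square(y - output) / 2      # squared error per sample

    # Backward pass
    delta_output = -(y - output) * output * (1 - output)     # (1, 3)
    error_wrt_who = np.dot(h, delta_output.T)                # (3, 1)
    delta_hidden = np.dot(w_ho, delta_output) * h * (1 - h)  # (3, 3)
    error_wrt_wih = np.dot(X, delta_hidden.T)                # (4, 3)

    # Vanilla gradient descent: step against the gradient
    w_ho = w_ho - lr * error_wrt_who
    w_ih = w_ih - lr * error_wrt_wih

    error_history.append(np.average(error))
    if epoch % 100 == 0:
        print(f"Error at epoch {epoch} is {np.average(error):.5f}")
```

Keeping `error_history` around makes it easy to plot the error against the epochs afterwards.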

Let’s check the weights after the training is done.

And also plot a graph to visualize how the training went.

One final thing we will do is to check how close the predictions are to our actual output.

Pretty close!

Further, the next thing we will do is to train our model on a different dataset, and visualize the performance by plotting a decision boundary after training.

Let’s get on to it!

We will normalize the input so that our model trains faster.

Now we will define our network.

We will update the following three hyperparameters, namely:

- Change hidden layer neurons to be 10
- Change the learning rate to be 0.1
- and train for more epochs

This is the error we get after every thousand epochs:

    Error at epoch 0 is 0.23478
    Error at epoch 1000 is 0.25000
    Error at epoch 2000 is 0.25000
    Error at epoch 3000 is 0.25000
    Error at epoch 4000 is 0.05129
    Error at epoch 5000 is 0.02163
    Error at epoch 6000 is 0.01157
    Error at epoch 7000 is 0.00775
    Error at epoch 8000 is 0.00689
    Error at epoch 9000 is 0.07556

We can also plot this error against the epochs.

Now, if we check the predictions and output manually, they seem pretty close.

Next, let’s visualize the performance by plotting the decision boundary.

It’s ok if you don’t follow the plotting code; you can use it as-is for now.

If you are curious, do post your questions in the comment section below.

The resulting plot lets us know how adept our neural network is at finding the pattern in the data and then classifying the points accordingly.

Here’s an exercise for you: try to take the same implementation we did, and implement it on a “blobs” dataset using scikit-learn. Do share your results with us!

## [Optional] Mathematical Perspective of Back Propagation Algorithm

Let $W_i$ be the weights between the input layer and the hidden layer, and let $W_h$ be the weights between the hidden layer and the output layer.

Now, $h = \sigma(u) = \sigma(W_i X)$, i.e. $h$ is a function of $u$, and $u$ is a function of $W_i$ and $X$. Here we represent our activation function as $\sigma$.

Similarly, $Y = \sigma(u') = \sigma(W_h h)$, i.e. $Y$ is a function of $u'$, and $u'$ is a function of $W_h$ and $h$.

We will be constantly referencing the above equations to calculate partial derivatives.

We are primarily interested in finding two terms, $\partial E/\partial W_i$ and $\partial E/\partial W_h$, i.e. the change in error on changing the weights between the input and the hidden layer, and the change in error on changing the weights between the hidden layer and the output layer.

But to calculate both these partial derivatives, we will need to use the chain rule of partial differentiation, since $E$ is a function of $Y$, $Y$ is a function of $u'$, and $u'$ is a function of $W_h$ and $h$ (and $h$ in turn depends on $W_i$).

Let’s put this property to good use and calculate the gradients.

$$\frac{\partial E}{\partial W_h} = \frac{\partial E}{\partial Y} \cdot \frac{\partial Y}{\partial u'} \cdot \frac{\partial u'}{\partial W_h} \qquad (1)$$

We know $E$ is of the form $E = (Y - t)^2/2$, so

$$\frac{\partial E}{\partial Y} = (Y - t)$$

Now, $\sigma$ is a sigmoid function and has an interesting derivative of the form $\sigma(1 - \sigma)$. I urge the readers to work this out on their side for verification. So,

$$\frac{\partial Y}{\partial u'} = \frac{\partial \sigma(u')}{\partial u'} = \sigma(u')\,(1 - \sigma(u'))$$

But $\sigma(u') = Y$, so

$$\frac{\partial Y}{\partial u'} = Y(1 - Y)$$

Now,

$$\frac{\partial u'}{\partial W_h} = \frac{\partial (W_h h)}{\partial W_h} = h$$

Replacing the values in equation (1) we get

$$\frac{\partial E}{\partial W_h} = (Y - t) \cdot Y(1 - Y) \cdot h$$

So, now we have computed the gradient between the hidden layer and the output layer.

It is time we calculate the gradient between the input layer and the hidden layer.

$$\frac{\partial E}{\partial W_i} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial u} \cdot \frac{\partial u}{\partial W_i}$$

But

$$\frac{\partial E}{\partial h} = \frac{\partial E}{\partial Y} \cdot \frac{\partial Y}{\partial u'} \cdot \frac{\partial u'}{\partial h}$$

Replacing this value in the above equation we get

$$\frac{\partial E}{\partial W_i} = \left[\frac{\partial E}{\partial Y} \cdot \frac{\partial Y}{\partial u'} \cdot \frac{\partial u'}{\partial h}\right] \cdot \frac{\partial h}{\partial u} \cdot \frac{\partial u}{\partial W_i} \qquad (2)$$

So, what was the benefit of first calculating the gradient between the hidden layer and the output layer?

As you can see in equation (2), we have already computed $\partial E/\partial Y$ and $\partial Y/\partial u'$, saving us space and computation time.

We will come to know in a while why this algorithm is called the backpropagation algorithm.

Let us compute the unknown derivatives in equation (2).

$$\frac{\partial u'}{\partial h} = \frac{\partial (W_h h)}{\partial h} = W_h$$

$$\frac{\partial h}{\partial u} = \frac{\partial \sigma(u)}{\partial u} = \sigma(u)\,(1 - \sigma(u))$$

But $\sigma(u) = h$, so

$$\frac{\partial h}{\partial u} = h(1 - h)$$

Now,

$$\frac{\partial u}{\partial W_i} = \frac{\partial (W_i X)}{\partial W_i} = X$$

Replacing all these values in equation (2) we get

$$\frac{\partial E}{\partial W_i} = \left[(Y - t) \cdot Y(1 - Y) \cdot W_h\right] \cdot h(1 - h) \cdot X$$

So, now that we have calculated both the gradients, the weights can be updated by stepping against the gradient:

$$W_h = W_h - \eta \, \frac{\partial E}{\partial W_h}$$

$$W_i = W_i - \eta \, \frac{\partial E}{\partial W_i}$$

where $\eta$ is the learning rate.

So coming back to the question: why is this algorithm called the Back Propagation Algorithm?

The reason is: if you notice the final form of $\partial E/\partial W_h$ and $\partial E/\partial W_i$, you will see the term $(Y - t)$, i.e. the output error, which is what we started with and then propagated back to the input layer for updating the weights.

So, where does this mathematics fit into the code?

- hiddenlayer_activations = $h$
- output = $Y$ and y = $t$
- E = $t - Y$ (the code computes y - output, i.e. $-\partial E/\partial Y$, which is why the code adds the update instead of subtracting it)
- slope_output_layer = $Y(1 - Y)$
- lr = $\eta$
- slope_hidden_layer = $h(1 - h)$
- wout = $W_h$

Now, you can easily relate the code to the mathematics.
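If you want to convince yourself that the two gradient formulas are right, a quick numerical check helps. The sketch below compares the analytic expressions (Y-t)·Y(1-Y)·h and [(Y-t)·Y(1-Y)·Wh]·h(1-h)·X against central finite differences for a single training example. The layer sizes, the random values, and the helper names `error` and `numerical_gradient` are all assumptions made for this check.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    return 1 / (1 + np.exp(-z))

# Assumed sizes: 4 inputs, 3 hidden units, 1 output, a single training example
X = rng.normal(size=4)
t = 1.0
Wi = rng.normal(size=(3, 4))  # input -> hidden weights
Wh = rng.normal(size=3)       # hidden -> output weights

def error(Wi, Wh):
    h = sigma(Wi @ X)
    Y = sigma(Wh @ h)
    return (Y - t) ** 2 / 2

# Analytic gradients from the derivation
h = sigma(Wi @ X)
Y = sigma(Wh @ h)
grad_Wh = (Y - t) * Y * (1 - Y) * h
grad_Wi = np.outer((Y - t) * Y * (1 - Y) * Wh * h * (1 - h), X)

def numerical_gradient(f, W, eps=1e-6):
    # Central finite differences, one entry of W at a time
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

num_Wh = numerical_gradient(lambda W: error(Wi, W), Wh)
num_Wi = numerical_gradient(lambda W: error(W, Wh), Wi)

print(np.allclose(grad_Wh, num_Wh, atol=1e-7))
print(np.allclose(grad_Wi, num_Wi, atol=1e-7))
```

The outer product in `grad_Wi` is how the scalar-looking product in the formula becomes a matrix with one entry per weight.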

I hope you now understand the working of neural networks: how forward and backward propagation work, the optimization algorithms (Full Batch and Stochastic Gradient Descent), how to update weights and biases, the visualization of each step in Excel, and, on top of that, the code in Python and R.

In my upcoming article, I’ll explain the applications of Neural Networks in Python and solve real-life challenges related to:

- Computer Vision
- Speech
- Natural Language Processing

I enjoyed writing this article and would love to learn from your feedback.