The network can be applied to a supervised learning problem with binary classification.
Figure 1. Example of neural network architecture

Notation

Superscript [l] denotes a quantity associated with the lᵗʰ layer.
Superscript (i) denotes a quantity associated with the iᵗʰ example.
Subscript i denotes the iᵗʰ entry of a vector.
This article was written assuming that the reader is already familiar with the concept of a neural network.
Otherwise, I recommend reading this introduction: https://towardsdatascience.com/how-to-build-your-own-neural-network-from-scratch-in-python-68998a08e4f6

Single neuron

Figure 2. Example of single neuron representation

A neuron computes a linear function (z = Wx + b) followed by an activation function.
We generally say that the output of a neuron is a = g(Wx + b) where g is the activation function (sigmoid, tanh, ReLU, …).
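As a minimal sketch of a = g(Wx + b), the snippet below evaluates one neuron with a sigmoid activation. The function name neuron_output and the numeric values are assumptions chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def neuron_output(W, x, b, g=sigmoid):
    """Compute a = g(Wx + b) for a single neuron."""
    z = np.dot(W, x) + b  # linear part
    return g(z)           # activation


# Hypothetical weights and inputs, for illustration only
W = np.array([0.5, -0.25, 0.1])
x = np.array([1.0, 2.0, 0.0])
b = 0.0
a = neuron_output(W, x, b)  # here z = 0.5 - 0.5 + 0.0 = 0
```

Swapping g for another function such as np.tanh changes the activation without touching the linear part.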
Dataset

Let’s assume that we have a very big dataset with weather data such as temperature, humidity, atmospheric pressure and the probability of rain.

Problem statement:

- a training set of m_train weather records labeled as rain (1) or not (0)
- a test set of m_test weather records labeled as rain or not
- each weather record consists of x1 = temperature, x2 = humidity, x3 = atmospheric pressure

One common preprocessing step in machine learning is to center and standardize your dataset, meaning that you subtract the mean of the whole numpy array from each example, and then divide each example by the standard deviation of the whole numpy array.
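The preprocessing step described above can be sketched as follows. The function name standardize, the toy values, and the layout (features in rows, examples in columns) are assumptions. Returning the mean and standard deviation so they can be reused on the test set is also an addition that is not stated in the text.

```python
import numpy as np


def standardize(X, mean=None, std=None):
    """Center and scale X using the statistics of the whole array.

    If mean/std are not given, they are computed from X itself.
    """
    if mean is None:
        mean = X.mean()
    if std is None:
        std = X.std()
    return (X - mean) / std, mean, std


# Hypothetical toy data: rows are temperature, humidity, pressure;
# columns are examples.
X_train = np.array([[20.0, 25.0, 30.0],
                    [60.0, 70.0, 80.0],
                    [1000.0, 1010.0, 1020.0]])
X_train_std, mu, sigma = standardize(X_train)
```

Because the three features have very different units, a common variant is to standardize each feature separately, for example with X.mean(axis=1, keepdims=True) and X.std(axis=1, keepdims=True).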
General methodology (building the parts of our algorithm)

We will follow the Deep Learning methodology to build the model:

1. Define the model structure (such as number of input features)
2. Initialize parameters and define hyperparameters:
   - number of iterations
   - number of layers L in the neural network
   - size of the hidden layers
   - learning rate α
3. Loop for num_iterations:
   - Forward propagation (calculate current loss)
   - Compute cost function
   - Backward propagation (calculate current gradient)
   - Update parameters (using parameters, and grads from backprop)
4. Use trained parameters to predict labels

Initialization

The initialization for a deeper L-layered neural network is more complicated because there are many more weight matrices and bias vectors.
I provide the tables below in order to help you keep the right dimensions of the structures.
Table 1. Dimensions of weight matrix W, bias vector b and activation Z for layer l

Table 2. Dimensions of weight matrix W, bias vector b and activation Z for the neural network for our example architecture

Table 2 helps us prepare correct dimensions for the matrices of our example neural network architecture from Figure 1.
Snippet 1. Initialization of the parameters

Initializing the parameters with small random numbers is a simple approach that usually gives a good enough starting point for our algorithm.
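A minimal sketch of this kind of initialization could look as follows. The name nn_architecture comes from the article. The hidden layer sizes, the dictionary keys, the 0.01 scale, and the function name initialize_parameters are assumptions for illustration, so this may differ from Snippet 1. The shapes follow the usual convention: W for layer l is (n[l], n[l-1]) and b is (n[l], 1).

```python
import numpy as np

# Assumed architecture: 3 inputs (temperature, humidity, pressure),
# three ReLU hidden layers and one sigmoid output unit (layer 4).
# The hidden layer sizes are hypothetical.
nn_architecture = [
    {"layer_size": 3, "activation": "none"},  # input layer
    {"layer_size": 5, "activation": "relu"},
    {"layer_size": 4, "activation": "relu"},
    {"layer_size": 3, "activation": "relu"},
    {"layer_size": 1, "activation": "sigmoid"},
]


def initialize_parameters(nn_architecture, seed=3):
    """Return W1..WL (small random values) and b1..bL (zeros)."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(nn_architecture)):
        n_curr = nn_architecture[l]["layer_size"]
        n_prev = nn_architecture[l - 1]["layer_size"]
        # W[l] has shape (n[l], n[l-1]); b[l] has shape (n[l], 1)
        parameters["W" + str(l)] = rng.standard_normal((n_curr, n_prev)) * 0.01
        parameters["b" + str(l)] = np.zeros((n_curr, 1))
    return parameters


parameters = initialize_parameters(nn_architecture)
```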
Remember:

- Different initialization techniques such as Zero, Random, He or Xavier lead to different results
- Random initialization makes sure different hidden units can learn different things (initializing all the weights to zero causes every neuron in each layer to learn the same thing)
- Don’t initialize to values that are too large

Activation functions

Activation functions give the neural networks non-linearity.
In our example, we will use sigmoid and ReLU.
Sigmoid outputs a value between 0 and 1 which makes it a very good choice for binary classification.
You can classify the output as 0 if it is less than 0.5 and as 1 if it is more than 0.5.
Snippet 2. Sigmoid and ReLU activation functions and their derivatives

In Snippet 2 you can see the vectorized implementation of the activation functions and their derivatives (https://en.wikipedia.org/wiki/Derivative).
The code is used in the calculations that follow.
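A vectorized version of these functions could be sketched as below. The function names and the (dA, Z) signature of the backward helpers are assumptions. Each backward helper multiplies the incoming gradient dA by the derivative of the activation at Z.

```python
import numpy as np


def sigmoid(Z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-Z))


def relu(Z):
    """Element-wise max(0, Z)."""
    return np.maximum(0, Z)


def sigmoid_backward(dA, Z):
    """dZ = dA * s * (1 - s), where s = sigmoid(Z)."""
    s = sigmoid(Z)
    return dA * s * (1 - s)


def relu_backward(dA, Z):
    """dZ = dA where Z > 0, and 0 elsewhere."""
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ
```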
Forward propagation

During forward propagation, the forward function for layer l needs to know which activation function the layer uses (sigmoid, tanh, ReLU, etc.).
Given the input signal from the previous layer, we compute Z and then apply the selected activation function.
Figure 3. Forward propagation for our example neural network

The linear forward module (vectorized over all the examples) computes the following equations:

Equation 1. Linear forward function

Snippet 3. Forward propagation module

We use “cache” (a Python dictionary that contains the A and Z values computed for particular layers) to pass variables computed during forward propagation to the corresponding backward propagation step.
It contains useful values for backward propagation to compute derivatives.
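The forward step can be sketched as follows, assuming the linear function for layer l has the usual form Z = W·A_prev + b followed by A = g(Z). The name linear_activation_forward comes from the article. Its signature, the helper L_model_forward, the cache key names, and the toy 3-2-1 network are assumptions.

```python
import numpy as np


def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))


def relu(Z):
    return np.maximum(0, Z)


def linear_activation_forward(A_prev, W, b, activation):
    """One layer: Z = W A_prev + b, then A = g(Z)."""
    Z = np.dot(W, A_prev) + b
    if activation == "sigmoid":
        A = sigmoid(Z)
    elif activation == "relu":
        A = relu(Z)
    else:
        raise ValueError("unsupported activation: " + activation)
    return A, Z


def L_model_forward(X, parameters, activations):
    """Run all layers; the cache keeps A0..AL and Z1..ZL for backprop."""
    cache = {"A0": X}
    A = X
    for l, activation in enumerate(activations, start=1):
        A, Z = linear_activation_forward(
            A, parameters["W" + str(l)], parameters["b" + str(l)], activation)
        cache["Z" + str(l)] = Z
        cache["A" + str(l)] = A
    return A, cache


# Hypothetical 3-2-1 network evaluated on two examples
rng = np.random.default_rng(0)
parameters = {
    "W1": rng.standard_normal((2, 3)) * 0.01, "b1": np.zeros((2, 1)),
    "W2": rng.standard_normal((1, 2)) * 0.01, "b2": np.zeros((1, 1)),
}
X = rng.standard_normal((3, 2))
AL, cache = L_model_forward(X, parameters, ["relu", "sigmoid"])
```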
Loss function

In order to monitor the learning process, we need to calculate the value of the cost function.
We will use the formula below to calculate the cost.
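Assuming the cost is the standard binary cross-entropy, J = -(1/m) Σ [y log(a) + (1 - y) log(1 - a)], its computation can be sketched as below. The function name compute_cost and the eps clipping are assumptions. The clipping is an added numerical safeguard against log(0), not part of the formula.

```python
import numpy as np


def compute_cost(AL, Y, eps=1e-12):
    """Binary cross-entropy averaged over the m examples.

    AL: predicted probabilities, shape (1, m).
    Y:  labels in {0, 1}, shape (1, m).
    """
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1 - eps)  # keep log() finite
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(cost)


# Two maximally uncertain predictions: each term equals -log(0.5)
AL = np.array([[0.5, 0.5]])
Y = np.array([[1, 0]])
cost = compute_cost(AL, Y)
```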
Equation 2. Cross-entropy cost

Snippet 4. Computation of the cost function

Backward propagation

Backpropagation is used to calculate the gradient of the loss function with respect to the parameters.
This algorithm is the recursive use of a “chain rule” known from differential calculus.
Equations used in backpropagation calculation:

Equation 3. Formulas for backward propagation calculation

The chain rule is a formula for calculating the derivatives of composite functions.
Composite functions are functions composed of functions inside other functions.

Equation 4. Chain rule examples

It is difficult to calculate the derivative of the loss without the chain rule (equation 5 as an example).

Equation 5. Loss function (with substituted data) and its derivative with respect to the first weight.
The first step in backpropagation for our neural network model is to calculate the derivative of our loss function with respect to Z from the last layer.
Equation 6 consists of two components, the derivative of the loss function from equation 2 (with respect to the activation function) and the derivative of the activation function “sigmoid” with respect to Z from the last layer.
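As a worked sketch of these two components, assume the per-example loss is the binary cross-entropy and the last layer uses sigmoid, so that A^{[4]} = σ(Z^{[4]}). The symbol names follow the notation section. The 1/m averaging is assumed to be applied later, when the gradients for W and b are computed.

```latex
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial A^{[4]}}
  &= -\frac{Y}{A^{[4]}} + \frac{1 - Y}{1 - A^{[4]}} \\
\frac{\partial A^{[4]}}{\partial Z^{[4]}}
  &= A^{[4]}\left(1 - A^{[4]}\right) \\
\frac{\partial \mathcal{L}}{\partial Z^{[4]}}
  &= \frac{\partial \mathcal{L}}{\partial A^{[4]}} \cdot
     \frac{\partial A^{[4]}}{\partial Z^{[4]}}
   = -Y\left(1 - A^{[4]}\right) + \left(1 - Y\right)A^{[4]}
   = A^{[4]} - Y
\end{aligned}
```

Under these assumptions the product of the two components collapses to the difference between the prediction and the label.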
Equation 6. The derivative of the loss function with respect to Z from the 4ᵗʰ layer

The result from equation 6 can be used to calculate the derivatives from equation 3:

Equation 7. The derivative of the loss function with respect to A from the 3ᵗʰ layer

The derivative of the loss function with respect to the activation function from the third layer (equation 7) is used in the further calculation.

Equation 8. The derivatives for the third layer

The result from equation 7 and the derivative of the activation function “ReLU” from the third layer are used to calculate the derivatives from equation 8 (the derivative of the loss function with respect to Z).
Following this, we apply the formulas from equation 3.
We make similar calculations for equations 9 and 10.

Equation 9. The derivatives for the second layer

Equation 10. The derivatives for the first layer

The general idea:

The derivative of the loss function with respect to Z from the lᵗʰ layer helps to calculate the derivative of the loss function with respect to A from the (l-1)ᵗʰ layer (the previous layer).
Then the result is used with the derivative of the activation function.
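The per-layer step of this idea can be sketched as follows. The name linear_activation_backward comes from the article. Its signature, the helper names, and the toy values are assumptions. The formulas used are the standard ones: dW = dZ·A_prevᵀ / m, db = mean of dZ over the examples, and dA_prev = Wᵀ·dZ.

```python
import numpy as np


def relu_backward(dA, Z):
    """dZ = dA where Z > 0, and 0 elsewhere."""
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ


def sigmoid_backward(dA, Z):
    """dZ = dA * s * (1 - s), where s = sigmoid(Z)."""
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)


def linear_activation_backward(dA, Z, A_prev, W, activation):
    """Gradients for one layer, given dA = dL/dA of that layer."""
    m = A_prev.shape[1]
    if activation == "relu":
        dZ = relu_backward(dA, Z)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, Z)
    else:
        raise ValueError("unsupported activation: " + activation)
    dW = np.dot(dZ, A_prev.T) / m                # same shape as W
    db = np.sum(dZ, axis=1, keepdims=True) / m   # same shape as b
    dA_prev = np.dot(W.T, dZ)                    # passed to layer l-1
    return dA_prev, dW, db


# Hypothetical single sigmoid layer with two examples
W = np.array([[0.1, -0.2, 0.3]])
b = np.zeros((1, 1))
X = np.array([[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]])
Y = np.array([[1, 0]])
Z = np.dot(W, X) + b
A = 1.0 / (1.0 + np.exp(-Z))
dA = -(Y / A) + (1 - Y) / (1 - A)  # derivative of cross-entropy w.r.t. A
dA_prev, dW, db = linear_activation_backward(dA, Z, X, W, "sigmoid")
```

For the sigmoid output layer, dZ reduces to A - Y, so dW here equals (A - Y)·Xᵀ / m.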
Figure 4. Backward propagation for our example neural network

Snippet 5. Backward propagation module

Update parameters

The goal of the function is to update the parameters of the model using gradient optimization.
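A plain gradient descent update, W = W - α·dW and b = b - α·db for every layer, can be sketched as below. The function name update_parameters and the dictionary key names ("W1", "dW1", and so on) are assumptions.

```python
import numpy as np


def update_parameters(parameters, grads, learning_rate):
    """One gradient descent step for every layer."""
    L = len(parameters) // 2  # each layer has one W and one b
    updated = {}
    for l in range(1, L + 1):
        updated["W" + str(l)] = (
            parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)])
        updated["b" + str(l)] = (
            parameters["b" + str(l)] - learning_rate * grads["db" + str(l)])
    return updated


# Hypothetical values for a single layer
parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.1, -0.2]]), "db1": np.array([[1.0]])}
new_parameters = update_parameters(parameters, grads, learning_rate=0.1)
```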
Snippet 6. Updating parameters values using gradient descent

Full model

The full implementation of the neural network model consists of the methods provided in the snippets.

Snippet 7. The full model of the neural network

In order to make a prediction, you only need to run a full forward propagation using the learned weight matrices and a set of test data.
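Prediction can be sketched as a forward pass followed by the 0.5 threshold on the sigmoid output. The function name predict, its signature, and the "trained" parameter values below are assumptions for illustration only.

```python
import numpy as np


def predict(X, parameters, activations, threshold=0.5):
    """Forward pass through all layers, then threshold the output."""
    A = X
    for l, activation in enumerate(activations, start=1):
        Z = np.dot(parameters["W" + str(l)], A) + parameters["b" + str(l)]
        if activation == "relu":
            A = np.maximum(0, Z)
        elif activation == "sigmoid":
            A = 1.0 / (1.0 + np.exp(-Z))
        else:
            raise ValueError("unsupported activation: " + activation)
    return (A > threshold).astype(int)


# Hypothetical parameters for a single sigmoid unit and two test examples
parameters = {"W1": np.array([[1.0, -1.0, 0.0]]), "b1": np.array([[0.0]])}
X_test = np.array([[2.0, 0.0],
                   [0.0, 3.0],
                   [1.0, 1.0]])
predictions = predict(X_test, parameters, ["sigmoid"])  # Z = [2, -3]
```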
You can modify nn_architecture in Snippet 1 to build a neural network with a different number of layers and sizes of the hidden layers.
If you want to use other activation functions, implement them together with their derivatives (Snippet 2).
The implemented functions can then be used to modify the linear_activation_forward method in Snippet 3 and the linear_activation_backward method in Snippet 5.
Further improvements

You can face the “overfitting” problem if the training dataset is not big enough.
It means that the learned network doesn’t generalize to new examples that it has never seen.
You can use regularization methods such as L2 regularization (which adds a penalty on the weights to your cost function) or dropout (which randomly shuts down some neurons in each iteration).
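As a sketch of the L2 variant, a common formulation adds (λ / 2m) times the sum of squared weights to the cross-entropy cost. The function name compute_cost_with_l2, the parameter name lambd, and the toy values are assumptions.

```python
import numpy as np


def compute_cost_with_l2(AL, Y, parameters, lambd):
    """Cross-entropy cost plus an L2 penalty on all weight matrices."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    L = len(parameters) // 2  # each layer has one W and one b
    squared_weights = sum(
        np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return float(cross_entropy + lambd / (2 * m) * squared_weights)


# Hypothetical values: the penalty is 0.4 / (2 * 2) * (1 + 4) = 0.5
AL = np.array([[0.5, 0.5]])
Y = np.array([[1, 0]])
parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.0]])}
cost = compute_cost_with_l2(AL, Y, parameters, lambd=0.4)
```

With this penalty, each weight gradient in backpropagation gains an extra term (λ / m)·W, while the bias gradients stay unchanged.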
We used Gradient Descent to update the parameters and minimize the cost.
You can learn more advanced optimization methods that can speed up learning and even get you to a better final value for the cost function, for example:

- Mini-batch gradient descent
- Momentum
- Adam optimizer