May 18 2024

How Neural Networks Learn: Understanding BackPropagation

Ahmed Zakaria · Deep Learning

Introduction

Imagine a neural network as a relay race where each runner represents a layer of the network. The goal of the relay race is to deliver the baton (the data) from the start to the finish line (the output) as accurately as possible.

Feedforward Pass (The Forward Run)

1- Start of the Race (Input Layer):

The race begins with the first runner receiving the baton (the input data 𝑥). This runner represents the input layer. Similarly, in a neural network the input layer receives the initial information, such as the pixel values of an image.

2- Passing the Baton (Hidden Layers):

The first runner sprints to the next runner and passes the baton on. Similarly, in a neural network the input layer passes the input data to the next layer, which adjusts it (a weighted sum followed by an activation).

Each runner (hidden layer) adds their own contribution (weights and biases) and applies some rules (activation functions) to modify the baton (data) before passing it on. The baton (data) might be transformed from raw pixel values into extracted features through several hidden layers. Each hidden layer refines the data, making it more relevant for the final decision.

3- Final Stretch (Output Layer):

The last runner (output layer) receives the baton (data) and makes a final adjustment (output layer computation) before crossing the finish line. This final runner determines the race outcome (the network’s prediction ŷ).

4- Finish Line (Loss Calculation):

The finish line judge (loss function) evaluates how well the baton was delivered (how close the prediction ŷ is to the actual result 𝑦).

Backpropagation (The Reverse Run)

5- Feedback (Backward Pass):

The race isn’t over yet! The feedback phase starts at the finish line, where the judge calculates the error (loss).

This error information is crucial for improving future races.

6- Reverse Run (Error Propagation):

The final runner (output layer) gets feedback on their performance (gradient of the loss with respect to their output).

This feedback (gradient) is passed back to the previous runner (hidden layer), indicating how they need to adjust their part of the baton passing.

Example: If the final runner (output layer) made an incorrect prediction, the error gradient informs how much and in what direction they need to change their weights.

7- Adjustments (Weight Updates):

Each runner (layer) makes adjustments to their strategy (weights) based on the feedback received from the runner they passed the baton to.

The process continues backward through all the runners (layers) until the runner right after the start (the first hidden layer) gets feedback. The input layer itself has no weights to adjust.

Example: Each hidden layer adjusts its weights and biases to reduce the overall error.

8- Training for the Next Race (Learning):

After adjustments, the runners are ready for the next race with the improved strategy (updated weights).

This process of running the race, receiving feedback, and adjusting continues iteratively, helping the runners (network) improve their performance over time.

Backpropagation

Backpropagation, short for “backward propagation of errors,” is the algorithm used to compute gradients when training artificial neural networks with supervised learning. It was popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in 1986, building on earlier work on reverse-mode differentiation. It computes how much each weight and bias contributes to the error, and those gradients are then used to adjust the weights and biases so that the error decreases.

Chain Rule

The chain rule is a fundamental principle in calculus used to compute the derivative of a composite function. In the context of neural networks, it helps in calculating the gradients of the loss function with respect to each weight.

If 𝑦=𝑓(𝑔(𝑥)), then the derivative of 𝑦 with respect to x is:

(1) $\begin{equation*} \frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx} \end{equation*}$
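The chain rule can be checked numerically. The sketch below is only an illustration: the inner function g(x) = 3x + 1 and the outer function f(u) = sin(u) are arbitrary choices, not taken from the article. It compares the chain-rule derivative with a central finite-difference estimate.

```python
import math

def g(x):
    return 3.0 * x + 1.0          # inner function (arbitrary choice)

def f(u):
    return math.sin(u)            # outer function (arbitrary choice)

def dy_dx(x):
    # Chain rule: dy/dx = dy/dg * dg/dx = cos(g(x)) * 3
    return math.cos(g(x)) * 3.0

def numerical_derivative(func, x, eps=1e-6):
    # Central finite difference, used only as an independent check
    return (func(x + eps) - func(x - eps)) / (2.0 * eps)

x0 = 0.5
analytic = dy_dx(x0)
numeric = numerical_derivative(lambda x: f(g(x)), x0)
print(f"chain rule: {analytic:.6f}, finite difference: {numeric:.6f}")
```

The two numbers should agree to several decimal places. The same kind of check is a common way to test hand-derived gradients.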

Example in Neural Networks

(2) $\begin{equation*} z = w \cdot x + b \end{equation*}$

(3) $\begin{equation*} a = h(z) \end{equation*}$

(4) $\begin{equation*} L = \text{loss}(a, y) \end{equation*}$

To update the weight 𝑤, we need

(5) $\begin{equation*} \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} \end{equation*}$

$\frac{\partial L}{\partial a}$ : The gradient of the loss with respect to the output of the activation function.

$\frac{\partial a}{\partial z}$ : The gradient of the activation function with respect to its input.

$\frac{\partial z}{\partial w}$ : The gradient of the neuron’s input with respect to the weight.
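The three factors can be computed one by one for a single neuron. This is a minimal sketch under stated assumptions: h is taken to be the sigmoid, the loss is taken to be ½(a − y)², and the values of w, x, b, and y are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, x, b, y):
    # Assumed loss: half squared error on the neuron's activation
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

# Hypothetical values, chosen only for illustration
w, x, b, y = 0.8, 1.5, -0.2, 1.0

z = w * x + b
a = sigmoid(z)

dL_da = a - y              # derivative of 0.5 * (a - y)^2 with respect to a
da_dz = a * (1.0 - a)      # derivative of the sigmoid with respect to z
dz_dw = x                  # derivative of w * x + b with respect to w
dL_dw = dL_da * da_dz * dz_dw

# Independent check with a central finite difference
eps = 1e-6
numeric = (loss_fn(w + eps, x, b, y) - loss_fn(w - eps, x, b, y)) / (2.0 * eps)
print(f"chain rule: {dL_dw:.6f}, finite difference: {numeric:.6f}")
```

Here the gradient comes out negative, which means that increasing w would lower the loss for this particular input.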

Simple Neural Network and Backpropagation Calculations

Figure 2: Simple Neural Network with 1 hidden layer composed of 2 neurons

Architecture

Component	Setting
Input layer	2 neurons
Hidden layer	2 neurons
Output layer	1 neuron
Loss function	Mean Squared Error (MSE), written as ½(y − ŷ)² for the single sample used here
Activation function	Sigmoid σ(z), in the hidden and output layers

Initial Values

Parameter	Values
Inputs	x₁ = 0.5, x₂ = 0.1
Target output	y = 0.6
Hidden neuron 1 weights	w₁₁ = 0.4, w₁₂ = 0.3
Hidden neuron 2 weights	w₂₁ = 0.6, w₂₂ = 0.9
Output neuron weights	w₃₁ = 0.5, w₃₂ = 0.7
Biases	b₁ = 0.1, b₂ = 0.2, b₃ = 0.3
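With the architecture and initial values above, the forward pass can be sketched in a few lines of plain Python. The variable names mirror the symbols in the tables; the script itself is an illustration rather than code from the article, and running it is a quick way to check the hand calculations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs, target, and initial parameters from the tables above
x1, x2, y = 0.5, 0.1, 0.6
w11, w12, b1 = 0.4, 0.3, 0.1   # hidden neuron 1
w21, w22, b2 = 0.6, 0.9, 0.2   # hidden neuron 2
w31, w32, b3 = 0.5, 0.7, 0.3   # output neuron

# Hidden layer: weighted sum, then sigmoid
z1 = w11 * x1 + w12 * x2 + b1
a1 = sigmoid(z1)
z2 = w21 * x1 + w22 * x2 + b2
a2 = sigmoid(z2)

# Output layer: weighted sum of the hidden activations, then sigmoid
z3 = w31 * a1 + w32 * a2 + b3
y_hat = sigmoid(z3)

# Half squared error for this single sample
loss = 0.5 * (y - y_hat) ** 2

for name, value in [("z1", z1), ("a1", a1), ("z2", z2), ("a2", a2),
                    ("z3", z3), ("y_hat", y_hat), ("loss", loss)]:
    print(f"{name} = {value:.4f}")
```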

Forward Pass

Hidden Layer Calculations (values are rounded for display; each step uses the unrounded result of the previous one):

For the first neuron:

(6) $\begin{equation*} z_1 = w_{11} x_1 + w_{12} x_2 + b_1 = 0.4 \cdot 0.5 + 0.3 \cdot 0.1 + 0.1 = 0.33 \end{equation*}$

(7) $\begin{equation*} a_1 = \sigma(z_1) = \frac{1}{1 + e^{-0.33}} \approx 0.5818 \end{equation*}$

For the second neuron:

(8) $\begin{equation*} z_2 = w_{21} x_1 + w_{22} x_2 + b_2 = 0.6 \cdot 0.5 + 0.9 \cdot 0.1 + 0.2 = 0.59 \end{equation*}$

(9) $\begin{equation*} a_2 = \sigma(z_2) = \frac{1}{1 + e^{-0.59}} \approx 0.6434 \end{equation*}$

Output Layer Calculation:

(10) $\begin{equation*} z_3 = w_{31} a_1 + w_{32} a_2 + b_3 = 0.5 \cdot 0.5818 + 0.7 \cdot 0.6434 + 0.3 \approx 1.0412 \end{equation*}$

(11) $\begin{equation*} \hat{y} = \sigma(z_3) = \frac{1}{1 + e^{-1.0412}} \approx 0.7391 \end{equation*}$

Loss Calculation:

(12) $\begin{equation*} L = \frac{1}{2} (y - \hat{y})^2 = \frac{1}{2} (0.6 - 0.7391)^2 \approx 0.0097 \end{equation*}$

Backward Pass

Output Layer:

Derivative of the loss with respect to ŷ

(13) $\begin{equation*} \frac{\partial L}{\partial \hat{y}} = \hat{y} - y = 0.7391 - 0.6 \approx 0.1391 \end{equation*}$

Derivative of the sigmoid activation function

(14) $\begin{equation*} \sigma'(z_3) = \hat{y} (1 - \hat{y}) \approx 0.7391 \cdot (1 - 0.7391) \approx 0.1928 \end{equation*}$

Gradient for 𝑧₃

(15) $\begin{equation*} \frac{\partial L}{\partial z_3} = \frac{\partial L}{\partial \hat{y}} \cdot \sigma'(z_3) \approx 0.1391 \cdot 0.1928 \approx 0.0268 \end{equation*}$

Figure 4: Gradient with respect to the output neuron

Gradients for weights 𝑤₃₁ and 𝑤₃₂ and bias 𝑏₃

(16) $\begin{equation*} \frac{\partial L}{\partial w_{31}} = \frac{\partial L}{\partial z_3} \cdot a_1 \approx 0.0268 \cdot 0.5818 \approx 0.0156 \end{equation*}$

(17) $\begin{equation*} \frac{\partial L}{\partial w_{32}} = \frac{\partial L}{\partial z_3} \cdot a_2 \approx 0.0268 \cdot 0.6434 \approx 0.0173 \end{equation*}$

(18) $\begin{equation*} \frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial z_3} \approx 0.0268 \end{equation*}$

Hidden Layer:

For the first neuron:

(19) $\begin{equation*} \frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial z_3} \cdot w_{31} \approx 0.0268 \cdot 0.5 \approx 0.0134 \end{equation*}$

(20) $\begin{equation*} \sigma'(z_1) = a_1 (1 - a_1) \approx 0.5818 \cdot (1 - 0.5818) \approx 0.2433 \end{equation*}$

(21) $\begin{equation*} \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \cdot \sigma'(z_1) \approx 0.0134 \cdot 0.2433 \approx 0.00326 \end{equation*}$

(22) $\begin{equation*} \frac{\partial L}{\partial w_{11}} = \frac{\partial L}{\partial z_1} \cdot x_1 \approx 0.00326 \cdot 0.5 \approx 0.00163 \end{equation*}$

(23) $\begin{equation*} \frac{\partial L}{\partial w_{12}} = \frac{\partial L}{\partial z_1} \cdot x_2 \approx 0.00326 \cdot 0.1 \approx 0.00033 \end{equation*}$

(24) $\begin{equation*} \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1} \approx 0.00326 \end{equation*}$

Figure 5: Gradient with respect to z₁

For each parameter (weight or bias) in the neural network, the chain rule is applied to calculate the gradient of the loss with respect to that parameter. This gradient is then used to update the parameter during optimization, for example with gradient descent.
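The same chain-rule pattern extends to all nine parameters of the example network, including the second hidden neuron. The sketch below is an illustration, not the article's own code: the dictionary layout and the function names `forward`, `backward`, and `numerical_grad` are assumptions made for this example. It compares each analytic gradient with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x1, x2, y):
    # Returns the hidden activations, the prediction, and the loss
    a1 = sigmoid(p["w11"] * x1 + p["w12"] * x2 + p["b1"])
    a2 = sigmoid(p["w21"] * x1 + p["w22"] * x2 + p["b2"])
    y_hat = sigmoid(p["w31"] * a1 + p["w32"] * a2 + p["b3"])
    loss = 0.5 * (y - y_hat) ** 2
    return a1, a2, y_hat, loss

def backward(p, x1, x2, y):
    a1, a2, y_hat, _ = forward(p, x1, x2, y)
    # Output neuron: dL/dz3 = (y_hat - y) * sigma'(z3)
    dz3 = (y_hat - y) * y_hat * (1.0 - y_hat)
    # Hidden neurons: pass dz3 back through w31 / w32 and the sigmoid
    dz1 = dz3 * p["w31"] * a1 * (1.0 - a1)
    dz2 = dz3 * p["w32"] * a2 * (1.0 - a2)
    return {
        "w31": dz3 * a1, "w32": dz3 * a2, "b3": dz3,
        "w11": dz1 * x1, "w12": dz1 * x2, "b1": dz1,
        "w21": dz2 * x1, "w22": dz2 * x2, "b2": dz2,
    }

def numerical_grad(p, name, x1, x2, y, eps=1e-6):
    # Central finite difference on a single parameter
    plus, minus = dict(p), dict(p)
    plus[name] += eps
    minus[name] -= eps
    return (forward(plus, x1, x2, y)[3] - forward(minus, x1, x2, y)[3]) / (2.0 * eps)

params = {
    "w11": 0.4, "w12": 0.3, "b1": 0.1,
    "w21": 0.6, "w22": 0.9, "b2": 0.2,
    "w31": 0.5, "w32": 0.7, "b3": 0.3,
}
x1, x2, y = 0.5, 0.1, 0.6

grads = backward(params, x1, x2, y)
for name in sorted(grads):
    check = numerical_grad(params, name, x1, x2, y)
    print(f"{name}: backprop = {grads[name]:.6f}, finite difference = {check:.6f}")
```

Backpropagation reuses `dz3` for every parameter, whereas the finite-difference check needs two extra forward passes per parameter. That reuse is what makes backpropagation efficient on larger networks.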

Weight and Bias Updates

Gradient Descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In the context of neural networks, gradient descent is commonly used to update the weights and biases of the network in order to minimize the loss function.

Figure 6: Gradient descent steps to minimize loss

Using a learning rate 𝜂=0.1:

Output layer weight 𝑤₃₁ (the remaining weights and biases are updated in the same way, each with its own gradient):

(25) $\begin{equation*} w_{31} := w_{31} - \eta \frac{\partial L}{\partial w_{31}} \approx 0.5 - 0.1 \times 0.0156 \approx 0.4984 \end{equation*}$

By repeating this process iteratively, the neural network learns to adjust its parameters in a way that minimizes the loss function and improves its performance on the given task. Each parameter is updated based on its gradient, which represents the direction and magnitude of the change needed to decrease the loss. This iterative optimization process allows the network to gradually improve its performance and learn to make better predictions.
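The iterative process can be sketched as a small training loop on the example network. This is a minimal illustration under stated assumptions: the `net` dictionary, the `train_step` function, and the choice of 500 steps are made up for this example, and a single training sample is reused at every step, which a real training run would not do.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(net, x, y, lr):
    """Run one forward pass, one backward pass, and one gradient descent update.

    Returns the loss measured before the update.
    """
    # Forward pass
    a = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["W_h"], net["b_h"])]
    y_hat = sigmoid(sum(w * ai for w, ai in zip(net["w_o"], a)) + net["b_o"])
    loss = 0.5 * (y - y_hat) ** 2

    # Backward pass (all gradients use the weights from before the update)
    dz_o = (y_hat - y) * y_hat * (1.0 - y_hat)
    dz_h = [dz_o * w * ai * (1.0 - ai) for w, ai in zip(net["w_o"], a)]

    # Gradient descent: parameter := parameter - lr * gradient
    for j, ai in enumerate(a):
        net["w_o"][j] -= lr * dz_o * ai
    net["b_o"] -= lr * dz_o
    for j, dz in enumerate(dz_h):
        for i, xi in enumerate(x):
            net["W_h"][j][i] -= lr * dz * xi
        net["b_h"][j] -= lr * dz
    return loss

net = {
    "W_h": [[0.4, 0.3], [0.6, 0.9]],   # hidden layer weights
    "b_h": [0.1, 0.2],                 # hidden layer biases
    "w_o": [0.5, 0.7],                 # output neuron weights
    "b_o": 0.3,                        # output neuron bias
}
x, y, lr = [0.5, 0.1], 0.6, 0.1

losses = [train_step(net, x, y, lr) for _ in range(500)]
print(f"loss at step 0:   {losses[0]:.6f}")
print(f"loss at step 499: {losses[-1]:.6f}")
```

The loss printed for the last step should be lower than the first, which is the "improved strategy" from the relay-race analogy expressed in numbers.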

Why is Backpropagation Important?

Backpropagation is crucial because it enables neural networks to learn from data and improve their performance over time. It efficiently computes gradients, making it feasible to train large networks with many parameters. Some key benefits include:

Scalability: Backpropagation can handle large networks with many layers, making it suitable for deep learning applications.
Flexibility: It can be used with various types of neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and more.
Effectiveness: By repeatedly moving the parameters in the direction that reduces the error, training with backpropagation typically makes the network’s predictions on the training data more accurate over time.

Challenges and Solutions

Despite its effectiveness, backpropagation faces several challenges:

Vanishing and Exploding Gradients: In very deep networks, gradients can become too small (vanishing) or too large (exploding), making training difficult. Solutions include using different activation functions (like ReLU) and techniques such as batch normalization.
Overfitting: Networks may perform well on training data but poorly on unseen data. Regularization techniques like dropout and weight decay help mitigate overfitting.
Computational Cost: Training large networks requires significant computational resources. Advances in hardware (GPUs, TPUs) and software optimizations have alleviated this issue.
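The vanishing gradient problem from the list above can be illustrated with a toy chain of scalar layers. This is only a sketch: the function `gradient_factor`, the input 0.5, and the weight 1.0 are hypothetical choices, and real networks have many units per layer. The point is that the sigmoid's derivative is at most 0.25, so the product of many such factors shrinks quickly, whereas the ReLU's derivative is 1 for positive inputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_factor(depth, activation, x=0.5, w=1.0):
    """Return d(output)/d(input) for a chain of `depth` scalar layers a = act(w * a_prev)."""
    a = x
    factor = 1.0
    for _ in range(depth):
        z = w * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local = a * (1.0 - a)          # sigmoid derivative, never above 0.25
        else:                              # "relu"
            a = max(0.0, z)
            local = 1.0 if z > 0 else 0.0  # ReLU derivative
        factor *= local * w                # chain rule: multiply the local derivatives
    return factor

for depth in (1, 5, 10, 20):
    s = gradient_factor(depth, "sigmoid")
    r = gradient_factor(depth, "relu")
    print(f"depth {depth:2d}: sigmoid factor = {s:.3e}, relu factor = {r:.3e}")
```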
