How Neural Networks Learn: Understanding Backpropagation
Introduction
Imagine a neural network as a relay race where each runner represents a layer of the network. The goal of the relay race is to deliver the baton (the data) from the start to the finish line (the output) as accurately as possible.

Feedforward Pass (The Forward Run)
1- Start of the Race (Input Layer):
The race begins with the first runner receiving the baton (the input data $x$). This runner represents the input layer. Similarly, in a neural network the input layer receives the initial information, such as the pixel values of an image.
2- Passing the Baton (Hidden Layers):
The first runner sprints to the next runner and hands over the baton. Similarly, in a neural network the input layer passes the input data to the next layer, which adjusts it (a weighted sum followed by an activation).
Each runner (hidden layer) adds their own contribution (weights and biases) and applies some rules (activation functions) to modify the baton (data) before passing it on. The baton (data) might be transformed from raw pixel values into extracted features through several hidden layers. Each hidden layer refines the data, making it more relevant for the final decision.
3- Final Stretch (Output Layer):
The last runner (output layer) receives the baton (data) and makes a final adjustment (output layer computation) before crossing the finish line. This final runner determines the race outcome (the network’s prediction $\hat{y}$).
4- Finish Line (Loss Calculation):
The finish line judge (loss function) evaluates how well the baton was delivered (how close the prediction $\hat{y}$ is to the actual result $y$).
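The four forward steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular library's implementation: the helper names (`sigmoid`, `dense`, `forward`), the layer sizes, and every weight value are made up for the example, and a squared-error loss stands in for the judge.

```python
import math

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One layer: weighted sum plus bias, then sigmoid, for each neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

def forward(x, layers):
    """Pass the 'baton' through every layer in order."""
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x

# Hypothetical network: 3 inputs, one hidden layer (2 neurons), one output neuron.
layers = [
    ([[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]], [0.0, 0.1]),  # hidden layer
    ([[0.6, -0.8]], [0.05]),                              # output layer
]
x = [1.0, 0.5, -1.5]      # made-up input
y = 1.0                   # made-up target
y_hat = forward(x, layers)[0]
loss = (y - y_hat) ** 2   # the "judge": squared error
print(f"prediction={y_hat:.4f}, loss={loss:.4f}")
```

Each call to `dense` is one runner: it receives the previous layer's output, applies its own weights, bias, and activation, and hands the result on.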
Backpropagation (The Reverse Run)
5- Feedback (Backward Pass):
The race isn’t over yet! The feedback phase starts at the finish line, where the judge calculates the error (loss).
This error information is crucial for improving future races.
6- Reverse Run (Error Propagation):
The final runner (output layer) gets feedback on their performance (gradient of the loss with respect to their output).
This feedback (gradient) is passed back to the previous runner (hidden layer), indicating how they need to adjust their part of the baton passing.
Example: If the final runner (output layer) made an incorrect prediction, the error gradient informs how much and in what direction they need to change their weights.
7- Adjustments (Weight Updates):
Each runner (layer) makes adjustments to their strategy (weights) based on the feedback received from the runner they passed the baton to.
The process continues backward through all the runners (layers) until the feedback reaches the first runner that has a strategy to adjust (the first hidden layer, since the input layer itself has no weights).
Example: Each hidden layer adjusts its weights and biases to reduce the overall error.
8- Training for the Next Race (Learning):
After adjustments, the runners are ready for the next race with the improved strategy (updated weights).
This process of running the race, receiving feedback, and adjusting continues iteratively, helping the runners (network) improve their performance over time.
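The run, feedback, adjust cycle can be sketched as a loop. The example below is a deliberately tiny stand-in, assuming a single sigmoid neuron with one input and a squared-error loss; the function name `train`, the learning rate, and all starting values are illustrative choices, not taken from the article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(x, y, w, b, lr=0.5, epochs=200):
    """Repeat forward run, feedback, and adjustment for one sigmoid neuron."""
    losses = []
    for _ in range(epochs):
        # Forward run: prediction and loss
        y_hat = sigmoid(w * x + b)
        losses.append((y - y_hat) ** 2)
        # Reverse run: gradient of the loss w.r.t. the weighted sum
        grad_z = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
        # Adjustment: step each parameter against its gradient
        w -= lr * grad_z * x
        b -= lr * grad_z
    return w, b, losses

# Made-up input, target, and starting parameters
w, b, losses = train(x=1.5, y=0.2, w=0.8, b=0.0)
print(f"first loss={losses[0]:.4f}, last loss={losses[-1]:.6f}")
```

Every pass through the loop is one "race": the recorded losses shrink as the neuron's weight and bias are adjusted.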
Backpropagation
Backpropagation, short for “backward propagation of errors,” is the algorithm used to compute the gradients needed to train artificial neural networks in supervised learning. It was popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in 1986, although the underlying idea appeared earlier. Its primary objective is to determine how each weight and bias in the network should change in order to reduce the error.
Chain Rule
The chain rule is a fundamental principle in calculus used to compute the derivative of a composite function. In the context of neural networks, it helps in calculating the gradients of the loss function with respect to each weight.
If $y = f(g(x))$, then the derivative of $y$ with respect to $x$ is:
$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x) \tag{1}$$
Example in Neural Networks
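Consider a single neuron with input $x$, weight $w$, bias $b$, sigmoid activation $\sigma$, and loss $L$ measured against the target $y$:
$$z = w x + b \tag{2}$$
$$a = \sigma(z) \tag{3}$$
$$L = L(a, y) \tag{4}$$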
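To update the weight $w$, we need
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} \tag{5}$$
$\frac{\partial L}{\partial a}$: The gradient of the loss with respect to the output of the activation function.
$\frac{\partial a}{\partial z}$: The gradient of the activation function with respect to its input.
$\frac{\partial z}{\partial w}$: The gradient of the neuron’s input with respect to the weight.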
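A quick way to see the three-factor product at work is to compute it for one neuron and compare it with a finite-difference estimate. The sketch below assumes a sigmoid activation and a squared-error loss $L = (y - a)^2$; the numbers and the helper name `loss_for` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for(w, x, b, y):
    """L = (y - a)^2 with a = sigmoid(w*x + b)."""
    return (y - sigmoid(w * x + b)) ** 2

# Made-up values for a single neuron with one input
w, x, b, y = 0.8, 1.5, -0.2, 1.0

z = w * x + b
a = sigmoid(z)

dL_da = 2.0 * (a - y)      # loss w.r.t. the activation output
da_dz = a * (1.0 - a)      # sigmoid derivative w.r.t. its input
dz_dw = x                  # weighted sum w.r.t. the weight
analytic = dL_da * da_dz * dz_dw

# Central finite difference as an independent check
eps = 1e-6
numeric = (loss_for(w + eps, x, b, y) - loss_for(w - eps, x, b, y)) / (2 * eps)

print(f"chain rule: {analytic:.8f}, finite difference: {numeric:.8f}")
```

The two values should agree to several decimal places, which is a common sanity check when implementing gradients by hand.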
Simple Neural Network and Backpropagation Calculations
Architecture
Component | Setting |
---|---|
Input Layer | 2 neurons |
Hidden Layer | 2 neurons |
Output Layer | 1 neuron |
Loss Function | Mean Squared Error (MSE) |
Activation Function | Sigmoid $\sigma(z)$ |
Initial Values
Quantity | Values |
---|---|
Inputs | $x_1 = 0.5$, $x_2 = 0.1$ |
Target output | $y = 0.6$ |
Weights | $w_{11} = 0.4$, $w_{12} = 0.3$; $w_{21} = 0.6$, $w_{22} = 0.9$; $w_{31} = 0.5$, $w_{32} = 0.7$ |
Biases | $b_1 = 0.1$, $b_2 = 0.2$, $b_3 = 0.3$ |
Forward Pass
Hidden Layer Calculations:
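For the first neuron (taking $w_{ij}$ as the weight into neuron $i$ from its $j$-th input):
$$z_1 = w_{11} x_1 + w_{12} x_2 + b_1 = 0.4(0.5) + 0.3(0.1) + 0.1 = 0.33 \tag{6}$$
$$a_1 = \sigma(z_1) = \frac{1}{1 + e^{-0.33}} \approx 0.5818 \tag{7}$$
For the second neuron:
$$z_2 = w_{21} x_1 + w_{22} x_2 + b_2 = 0.6(0.5) + 0.9(0.1) + 0.2 = 0.59 \tag{8}$$
$$a_2 = \sigma(z_2) = \frac{1}{1 + e^{-0.59}} \approx 0.6434 \tag{9}$$
Output Layer Calculation:
$$z_3 = w_{31} a_1 + w_{32} a_2 + b_3 = 0.5(0.5818) + 0.7(0.6434) + 0.3 \approx 1.041 \tag{10}$$
$$\hat{y} = \sigma(z_3) = \frac{1}{1 + e^{-1.041}} \approx 0.7391 \tag{11}$$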
Loss Calculation:
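Taking the MSE over the single output, with prediction $\hat{y} \approx 0.7391$:
$$L = (y - \hat{y})^2 = (0.6 - 0.7391)^2 \approx 0.0193 \tag{12}$$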
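The forward pass for this network can be reproduced in a few lines of Python. Two conventions affect the numbers and are assumptions of this sketch: $w_{ij}$ is read as the weight into neuron $i$ from its $j$-th input, and the loss is the squared error on the single output with no factor of 1/2. The variable names (`a1`, `a2`, `y_hat`) are chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the Initial Values table
x1, x2, y = 0.5, 0.1, 0.6
w11, w12, b1 = 0.4, 0.3, 0.1   # assumed: weights into hidden neuron 1
w21, w22, b2 = 0.6, 0.9, 0.2   # assumed: weights into hidden neuron 2
w31, w32, b3 = 0.5, 0.7, 0.3   # weights into the output neuron

z1 = w11 * x1 + w12 * x2 + b1
a1 = sigmoid(z1)
z2 = w21 * x1 + w22 * x2 + b2
a2 = sigmoid(z2)
z3 = w31 * a1 + w32 * a2 + b3
y_hat = sigmoid(z3)
loss = (y - y_hat) ** 2        # assumed: MSE over one output, no 1/2 factor

for name, value in [("z1", z1), ("a1", a1), ("z2", z2), ("a2", a2),
                    ("z3", z3), ("y_hat", y_hat), ("loss", loss)]:
    print(f"{name} = {value:.4f}")
```

If the loss were defined with a 1/2 factor instead, the printed loss would simply be halved; the activations and the prediction would not change.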
Backward Pass
Output Layer:
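Derivative of the loss with respect to $\hat{y}$ (for $L = (y - \hat{y})^2$ and $\hat{y} \approx 0.7391$)
$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y) = 2(0.7391 - 0.6) \approx 0.2782 \tag{13}$$
Derivative of the sigmoid activation function
$$\frac{\partial \hat{y}}{\partial z_3} = \hat{y}(1 - \hat{y}) = 0.7391(0.2609) \approx 0.1928 \tag{14}$$
Gradient for $z_3$
$$\frac{\partial L}{\partial z_3} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_3} \approx 0.2782(0.1928) \approx 0.0536 \tag{15}$$
Gradients for weights $w_{31}$ and $w_{32}$ and bias $b_3$ (where $a_1 \approx 0.5818$ and $a_2 \approx 0.6434$ are the hidden-layer outputs)
$$\frac{\partial L}{\partial w_{31}} = \frac{\partial L}{\partial z_3} \cdot a_1 \approx 0.0536(0.5818) \approx 0.0312 \tag{16}$$
$$\frac{\partial L}{\partial w_{32}} = \frac{\partial L}{\partial z_3} \cdot a_2 \approx 0.0536(0.6434) \approx 0.0345 \tag{17}$$
$$\frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial z_3} \approx 0.0536 \tag{18}$$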
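The output-layer gradients can be checked numerically with a short script. As a sketch, it assumes the loss is $L = (y - \hat{y})^2$ (no 1/2 factor) and that the first two weight pairs in the table feed hidden neurons 1 and 2 respectively; with a different loss scaling every gradient would be rescaled by the same constant.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2, y = 0.5, 0.1, 0.6
a1 = sigmoid(0.4 * x1 + 0.3 * x2 + 0.1)   # hidden neuron 1 (assumed weight layout)
a2 = sigmoid(0.6 * x1 + 0.9 * x2 + 0.2)   # hidden neuron 2
w31, w32, b3 = 0.5, 0.7, 0.3
y_hat = sigmoid(w31 * a1 + w32 * a2 + b3)

dL_dyhat = 2.0 * (y_hat - y)          # assumes L = (y - y_hat)^2
dyhat_dz3 = y_hat * (1.0 - y_hat)     # sigmoid derivative at z3
dL_dz3 = dL_dyhat * dyhat_dz3         # chain rule through the activation

grads = {
    "w31": dL_dz3 * a1,   # dz3/dw31 = a1
    "w32": dL_dz3 * a2,   # dz3/dw32 = a2
    "b3": dL_dz3,         # dz3/db3 = 1
}
for name, g in grads.items():
    print(f"dL/d{name} = {g:.4f}")
```

All three gradients are positive here because the prediction overshoots the target, so each of these parameters should decrease.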
Hidden Layer:
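For the first neuron, the error reaches its output $a_1 \approx 0.5818$ through the weight $w_{31}$ (using $\frac{\partial L}{\partial z_3} \approx 0.0536$, computed for $L = (y - \hat{y})^2$):
$$\frac{\partial L}{\partial a_1} = \frac{\partial L}{\partial z_3} \cdot w_{31} \approx 0.0536(0.5) = 0.0268 \tag{19}$$
$$\frac{\partial a_1}{\partial z_1} = a_1(1 - a_1) = 0.5818(0.4182) \approx 0.2433 \tag{20}$$
$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \approx 0.0268(0.2433) \approx 0.0065 \tag{21}$$
$$\frac{\partial L}{\partial w_{11}} = \frac{\partial L}{\partial z_1} \cdot x_1 \approx 0.0065(0.5) \approx 0.0033 \tag{22}$$
$$\frac{\partial L}{\partial w_{12}} = \frac{\partial L}{\partial z_1} \cdot x_2 \approx 0.0065(0.1) \approx 0.0007 \tag{23}$$
$$\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1} \approx 0.0065 \tag{24}$$
The second hidden neuron is handled the same way, with $w_{32}$ and $a_2$ in place of $w_{31}$ and $a_1$.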
For each parameter (weights and biases) in the neural network, the chain rule is applied to calculate its gradient with respect to the loss function. This gradient is then used to update the parameter during the optimization process, such as in gradient descent.
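One way to gain confidence in hand-derived hidden-layer gradients is to compare them against finite differences of the loss. The sketch below does this for the first hidden neuron's parameters. It assumes the same conventions as the rest of the example ($w_{ij}$ feeds neuron $i$ from input $j$, and $L = (y - \hat{y})^2$); the helper names `forward` and `numeric_grad` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x1=0.5, x2=0.1, y=0.6):
    """Return activations and loss for the 2-2-1 network (assumed weight layout)."""
    a1 = sigmoid(p["w11"] * x1 + p["w12"] * x2 + p["b1"])
    a2 = sigmoid(p["w21"] * x1 + p["w22"] * x2 + p["b2"])
    y_hat = sigmoid(p["w31"] * a1 + p["w32"] * a2 + p["b3"])
    return a1, a2, y_hat, (y - y_hat) ** 2   # assumes L = (y - y_hat)^2

params = {"w11": 0.4, "w12": 0.3, "b1": 0.1,
          "w21": 0.6, "w22": 0.9, "b2": 0.2,
          "w31": 0.5, "w32": 0.7, "b3": 0.3}

a1, a2, y_hat, _ = forward(params)
dL_dz3 = 2.0 * (y_hat - 0.6) * y_hat * (1.0 - y_hat)

# The error reaching hidden neuron 1 travels back through w31
dL_da1 = dL_dz3 * params["w31"]
dL_dz1 = dL_da1 * a1 * (1.0 - a1)
analytic = {"w11": dL_dz1 * 0.5, "w12": dL_dz1 * 0.1, "b1": dL_dz1}

def numeric_grad(name, eps=1e-6):
    """Central finite difference of the loss w.r.t. one parameter."""
    hi = dict(params, **{name: params[name] + eps})
    lo = dict(params, **{name: params[name] - eps})
    return (forward(hi)[3] - forward(lo)[3]) / (2 * eps)

for name, g in analytic.items():
    print(f"dL/d{name}: chain rule {g:.6f}, finite difference {numeric_grad(name):.6f}")
```

Finite differences need two forward passes per parameter, which is exactly the cost backpropagation avoids by reusing the intermediate gradients.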
Weight and Bias Updates
Gradient Descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In the context of neural networks, gradient descent is commonly used to update the weights and biases of the network in order to minimize the loss function.
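Using a learning rate $\eta = 0.1$, each parameter $\theta$ is updated as $\theta \leftarrow \theta - \eta \frac{\partial L}{\partial \theta}$.
Output layer weights and bias (with gradients $0.0312$, $0.0345$, and $0.0536$, computed for $L = (y - \hat{y})^2$):
$$w_{31} \leftarrow 0.5 - 0.1(0.0312) \approx 0.4969, \quad w_{32} \leftarrow 0.7 - 0.1(0.0345) \approx 0.6965, \quad b_3 \leftarrow 0.3 - 0.1(0.0536) \approx 0.2946 \tag{25}$$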
Repeating this process iteratively drives the loss down. Each parameter's gradient gives the direction and rate at which the loss increases, so stepping against it by a small amount reduces the loss, and the network's predictions on the given task gradually improve.
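Putting the pieces together, the whole cycle for the example network can be sketched as a single update function applied repeatedly. This is an illustration under stated assumptions: $w_{ij}$ feeds neuron $i$ from input $j$, the loss is $L = (y - \hat{y})^2$, training uses only the one example from the table, and the function name `train_step` and the count of 500 steps are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(p, x1, x2, y, lr):
    """Forward pass, backward pass, then one in-place gradient-descent update.

    Returns the loss measured before the update.
    """
    a1 = sigmoid(p["w11"] * x1 + p["w12"] * x2 + p["b1"])
    a2 = sigmoid(p["w21"] * x1 + p["w22"] * x2 + p["b2"])
    y_hat = sigmoid(p["w31"] * a1 + p["w32"] * a2 + p["b3"])
    loss = (y - y_hat) ** 2                          # assumed loss convention

    d3 = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)   # dL/dz3
    d1 = d3 * p["w31"] * a1 * (1.0 - a1)             # dL/dz1
    d2 = d3 * p["w32"] * a2 * (1.0 - a2)             # dL/dz2
    grads = {"w31": d3 * a1, "w32": d3 * a2, "b3": d3,
             "w11": d1 * x1, "w12": d1 * x2, "b1": d1,
             "w21": d2 * x1, "w22": d2 * x2, "b2": d2}

    for name, g in grads.items():
        p[name] -= lr * g                            # theta <- theta - lr * dL/dtheta
    return loss

params = {"w11": 0.4, "w12": 0.3, "b1": 0.1,
          "w21": 0.6, "w22": 0.9, "b2": 0.2,
          "w31": 0.5, "w32": 0.7, "b3": 0.3}

losses = [train_step(params, 0.5, 0.1, 0.6, lr=0.1)]
after_one = {k: params[k] for k in ("w31", "w32", "b3")}
losses += [train_step(params, 0.5, 0.1, 0.6, lr=0.1) for _ in range(499)]

print("after one step:", {k: round(v, 4) for k, v in after_one.items()})
print(f"loss: step 1 = {losses[0]:.4f}, step 500 = {losses[-1]:.2e}")
```

All gradients are computed before any parameter is changed, so the hidden-layer gradients use the same $w_{31}$ and $w_{32}$ that produced the prediction.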
Why is Backpropagation Important?
Backpropagation is crucial because it enables neural networks to learn from data and improve their performance over time. It efficiently computes gradients, making it feasible to train large networks with many parameters. Some key benefits include:
- Scalability: Backpropagation can handle large networks with many layers, making it suitable for deep learning applications.
- Flexibility: It can be used with various types of neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and more.
- Effectiveness: Combined with an optimizer such as gradient descent, backpropagation steadily reduces the training error, which typically makes the network’s predictions more accurate over time.
Challenges and Solutions
Despite its effectiveness, backpropagation faces several challenges:
- Vanishing and Exploding Gradients: In very deep networks, gradients can become too small (vanishing) or too large (exploding), making training difficult. Solutions include using different activation functions (like ReLU) and techniques such as batch normalization.
- Overfitting: Networks may perform well on training data but poorly on unseen data. Regularization techniques like dropout and weight decay help mitigate overfitting.
- Computational Cost: Training large networks requires significant computational resources. Advances in hardware (GPUs, TPUs) and software optimizations have alleviated this issue.
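The vanishing-gradient problem mentioned above can be illustrated with a rough back-of-the-envelope calculation. The sketch below is a simplification: it ignores the weight terms and assumes every layer sits at the same pre-activation value, so it only shows how the activation derivatives alone scale with depth. The function names are made up for the example.

```python
import math

def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_prime(z):
    """Derivative of ReLU: 1 for active units, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def backprop_factor(derivative, z, depth):
    """Product of `depth` activation derivatives, ignoring the weight terms."""
    return derivative(z) ** depth

for depth in (1, 5, 10, 20):
    s = backprop_factor(sigmoid_prime, 0.0, depth)   # best case for the sigmoid
    r = backprop_factor(relu_prime, 1.0, depth)      # active ReLU units
    print(f"depth {depth:2d}: sigmoid factor {s:.2e}, ReLU factor {r:.0f}")
```

Even in the sigmoid's best case the factor shrinks by at least 4x per layer, while active ReLU units pass the gradient through unchanged, which is one reason ReLU is a common choice in deep networks.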