I Use This When...
I need a neural network that can learn nonlinear patterns, not just one straight decision boundary. This is the base pattern behind dense neural nets and the starting point for understanding CNNs, RNNs, and Transformers.
History
Backprop: Rumelhart, Hinton, Williams (1986). The key insight: use the chain rule to compute gradients layer by layer, from output back to input. This made deep networks trainable.
Why It Exists
The "why" chain is:
- A single perceptron can only draw one linear boundary.
- Real patterns are often nonlinear, and XOR is the classic failure case.
- Adding hidden layers lets the model build intermediate features.
- But once many weights interact, we need a systematic way to assign blame.
- Backpropagation solves that credit-assignment problem.
So an MLP exists to represent nonlinear functions, and backprop exists to train that representation efficiently.
How It Works
Visual Intuition
Imagine three stages.
- the input layer receives raw values
- a hidden layer bends or re-expresses the space
- the output layer turns that transformed representation into a prediction
When the prediction is wrong, the error is not useful unless we can trace it back through all earlier layers. Backprop does exactly that: it sends the error signal backward, one layer at a time, so every weight gets a gradient.
Step by Step
- Run a forward pass through each layer
- Compute a loss from the final prediction
- Differentiate that loss with respect to the output layer
- Propagate gradients backward through each layer using the chain rule
- Update weights with gradient descent
The key operational idea is simple: each layer reuses the gradient from the layer after it, instead of recomputing everything from scratch.
Code
def relu(x):
return max(0.0, x)
def relu_grad(x):
return 1.0 if x > 0 else 0.0
# Sketch for one hidden neuron:
# z1 = w1 * x + b1
# a1 = relu(z1)
# z2 = w2 * a1 + b2
# loss = (z2 - y) ** 2
# then apply chain rule backward to get dloss/dw2, dloss/dw1, ...
The Math Inside
Forward pass for layer l:
z^(l) = W^(l) a^(l-1) + b^(l)
a^(l) = phi(z^(l))
a^(l-1): activations from the previous layerW^(l),b^(l): weights and bias for this layerphi: activation function such as ReLU, sigmoid, or tanh
Loss:
L = f(y, y_hat)
Backprop idea:
dL/dW = dL/da * da/dz * dz/dW
That single expression is the chain rule in action. In a deep network, you keep applying it from the output layer backward.
For each layer, the local gradient is combined with the incoming gradient from the next layer. That is why backprop scales to large networks: gradients are reused recursively instead of expanded manually.
Math Prerequisites
- Chain Rule - the core mechanism of backprop
- Partial Derivatives - one gradient entry per parameter
- Derivatives & Gradients - local slope and gradient vectors
- Gradient Descent - how gradients become parameter updates
Related
- Perceptron — The single-layer predecessor
- Chain Rule — The math that makes backprop work
- Gradient Descent — The optimization method
- CNN — Backprop applied to images
- Transformer — Modern architecture, same training principle