Chain Rule

I Use This When...

One quantity depends on another quantity, which depends on another one again. That nested dependency structure appears everywhere in neural networks because each layer is a function of the previous layer.

Why It Exists

The "why" chain is:

Real systems are built from stages, not one flat equation.
If the output changes, we want to know which earlier stage caused it.
A derivative through one stage is not enough.
We need a rule for composed functions.

The chain rule exists to let local changes flow through a pipeline of transformations.

Visual Intuition

Imagine a pipeline:

x -> u -> y

If changing x changes u, and changing u changes y, then changing x must change y through both links. The total effect is the product of those local effects.

That is exactly what backprop does in a neural net: the output error flows back through one layer, then the one before it, then the one before that.

How It Works

Break the computation into intermediate variables
Differentiate each local step
Multiply the local derivatives along the path
If multiple paths contribute, add those contributions

This is why neural-network implementations store intermediate values from the forward pass: the backward pass needs them.

The Math Inside

For one-variable composition

y = f(g(x))

the derivative is

dy/dx = (dy/dg) * (dg/dx)

For a slightly bigger example, if

u = 3x + 1

y = u^2

then

dy/du = 2u

du/dx = 3

so

dy/dx = 2u * 3 = 2(3x + 1) * 3

In multivariable settings, the same logic applies with partial derivatives. That is the version used by backpropagation.

Examples

Suppose

z = sigmoid(w x + b)

and the loss depends on z. To find dL/dw, you do not differentiate the whole expression from scratch every time. You chain the parts:

dL/dw = dL/dz * dz/d(wx+b) * d(wx+b)/dw

That pattern is the core of deep learning.

Code

def chain_rule_example(x):
    u = 3 * x + 1
    y = u ** 2
    dy_du = 2 * u
    du_dx = 3
    return dy_du * du_dx

Used In

MLP & Backprop — Backprop = chain rule
RNN / LSTM — Vanishing gradients = chain rule applied too many times
Derivatives & Gradients — Prerequisites