I Use This When...
I have a loss function I want to minimize, but solving for the minimum directly is hard, expensive, or impossible. This is the default training loop behind linear regression, neural networks, and most modern ML systems.
Why It Exists
The "why" chain is:
- A model is only useful if we can measure how wrong it is.
- Once we have a loss, we need a way to reduce it.
- Closed-form solutions are rare once models get bigger.
- The local slope (the gradient) tells us which small move increases the loss fastest.
- So we step in the opposite direction.
Gradient descent exists because "follow the downhill direction" scales farther than symbolic algebra.
Visual Intuition
Picture a ball on a smooth bowl. At any point on the bowl, the gradient is the direction of steepest ascent. If you want the ball to go down instead, move in the negative gradient direction.
In the linear regression demo, the bowl is not drawn directly. Instead, you see its effect: the line parameters move in the direction that lowers MSE.
-> Interactive Demo: Linear Regression
How It Works
- Start with current parameters, for example w and b
- Compute the gradient of the loss with respect to those parameters
- Multiply that gradient by a learning rate
- Subtract the result from the current parameters
- Repeat until loss stops improving enough
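The steps above can be sketched as a small loop. This is a minimal illustration, not a reference implementation: the function name `gradient_descent`, the tolerance `tol`, the cap `max_steps`, and the toy loss `(t - 3)^2` are all assumptions made for the example.

```python
def gradient_descent(grad, loss, theta, learning_rate=0.1, tol=1e-9, max_steps=10_000):
    """Plain gradient descent that stops once the loss improves by less than tol."""
    previous = loss(theta)
    for step in range(1, max_steps + 1):
        # Multiply the gradient by the learning rate and subtract it.
        theta = theta - learning_rate * grad(theta)
        current = loss(theta)
        # Stop when the improvement is too small (this also stops if the loss went up).
        if previous - current < tol:
            return theta, step
        previous = current
    return theta, max_steps


# Hypothetical one-parameter loss with its minimum at t = 3.
theta, steps = gradient_descent(
    grad=lambda t: 2 * (t - 3),
    loss=lambda t: (t - 3) ** 2,
    theta=0.0,
)
print(theta, steps)
```

The stopping test compares consecutive loss values; other criteria, such as a small gradient norm or a fixed step budget, are equally common choices.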
The learning rate controls step size:
- too small -> slow progress
- too large -> overshoot or diverge
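To make the two failure modes concrete, here is a small sketch that runs the same 20 steps on the toy loss (theta - 1)^2 with three learning rates. The helper name `run` and the specific rates 0.01, 0.1, and 1.1 are assumptions picked for illustration.

```python
def run(learning_rate, steps=20, theta=5.0):
    """Minimize (theta - 1)^2 with a fixed learning rate; return the final theta."""
    for _ in range(steps):
        gradient = 2 * (theta - 1)
        theta = theta - learning_rate * gradient
    return theta


results = {lr: run(lr) for lr in (0.01, 0.1, 1.1)}
for lr, final_theta in results.items():
    print(f"learning_rate={lr}: theta after 20 steps = {final_theta:.3f}")
```

On this loss each step multiplies the distance to the minimum by (1 - 2 * learning_rate), so 0.01 shrinks it slowly, 0.1 shrinks it quickly, and 1.1 makes it grow while flipping sign on every step.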
The Math Inside
Generic update rule:
theta_next = theta - alpha * grad(L(theta))
- theta: parameter vector
- L(theta): loss as a function of the parameters
- grad(L(theta)): gradient, the multi-dimensional slope
- alpha: learning rate
For linear regression with y_hat = w * x + b and MSE loss L = (1/n) * sum((w * x_i + b - y_i)^2), the chain rule gives:
dL/dw = (2/n) * sum((w * x_i + b - y_i) * x_i)
dL/db = (2/n) * sum(w * x_i + b - y_i)
Then update:
w <- w - alpha * dL/dw
b <- b - alpha * dL/db
This is the core loop of learning: compute error, compute gradient, step downhill.
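The linear regression updates above translate almost line for line into code. The sketch below uses made-up, noise-free data from y = 2x + 1; the data, the learning rate 0.05, and the 2000-step budget are assumptions chosen so the example converges, not recommended settings.

```python
# Hypothetical toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
n = len(xs)

w, b = 0.0, 0.0  # starting parameters
alpha = 0.05     # learning rate

for _ in range(2000):
    # Prediction error for each point: w * x_i + b - y_i
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    # Gradients of the MSE with respect to w and b
    dL_dw = (2 / n) * sum(e * x for e, x in zip(errors, xs))
    dL_db = (2 / n) * sum(errors)
    # Step downhill
    w = w - alpha * dL_dw
    b = b - alpha * dL_db

print(w, b)
```

Both gradients are computed from the same `errors` list before either parameter is updated, so w and b move together using the slope at the current point.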
Examples
If the loss surface is a simple bowl, gradient descent walks down to the bottom. If the surface is long and narrow, it zig-zags, because the gradient points mostly across the valley rather than along it. If the gradients are noisy, averaging over mini-batches and using optimizer variants like momentum or Adam help smooth the path.
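The zig-zag case can be seen on a small two-parameter example. This is a sketch under assumed values: the loss 0.5 * (x^2 + 25 * y^2), the start point (10, 1), and the learning rate 0.07 are all hypothetical, picked so the steep direction overshoots while the shallow one crawls.

```python
def grad(x, y):
    """Gradient of the hypothetical loss 0.5 * (x**2 + 25 * y**2)."""
    return x, 25 * y


x, y = 10.0, 1.0
learning_rate = 0.07
path = [(x, y)]

for _ in range(10):
    gx, gy = grad(x, y)
    x, y = x - learning_rate * gx, y - learning_rate * gy
    path.append((x, y))

for px, py in path:
    print(f"x={px:7.3f}  y={py:7.3f}")
```

In the printed path, y flips sign on every step (the zig-zag across the narrow direction) while x shrinks only slowly along the valley floor. Momentum-style optimizers aim to damp the first effect and speed up the second.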
Code
theta = 5.0           # starting guess
learning_rate = 0.1

for _ in range(20):
    gradient = 2 * (theta - 1)  # derivative of (theta - 1)^2
    theta = theta - learning_rate * gradient

print(theta)  # roughly 1.046, approaching the minimum at theta = 1
Used In
- Optimization (SGD → Adam) — Variants and improvements
- MLP & Backprop — Where gradients come from
- Loss Functions — What we're minimizing
- Derivatives & Gradients — The math behind it