Optimization: SGD → Momentum → Adam

I Use This When...

I know my model and loss already. The next question is how to move the parameters efficiently through a noisy high-dimensional loss surface. This page is about the practical optimizers that turn gradients into trainable systems.

History

SGD is the oldest of the three, tracing back to stochastic approximation (Robbins & Monro, 1951). Momentum followed (Polyak, 1964), then Adam (Kingma & Ba, 2015). Adam is now a common default for much of deep learning.

Why It Exists

Plain gradient descent is slow in practice: it zigzags across ravines, stalls near saddle points, and gets pushed around by noisy minibatch gradients. Each successor adds a trick to navigate the loss landscape faster.

How It Works

Visual Intuition

Imagine descending a long ravine:

plain SGD jitters because each minibatch gives a noisy slope
momentum acts like velocity that smooths the path
Adam rescales each parameter step using running estimates of past gradients

All three still go downhill, but they differ in how much they trust the local gradient signal.

Step by Step

Sample a minibatch of training data
Compute the gradient of the loss
Update running state if the optimizer has one
Convert gradient plus state into a parameter step
Repeat across many minibatches and epochs

Code

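for x_batch, y_batch in loader:          # sample a minibatch
    loss = model(x_batch, y_batch)       # forward pass; assumes the model returns the scalar loss
    loss.backward()                      # compute gradients of the loss w.r.t. the parameters
    optimizer.step()                     # update optimizer state and apply the parameter step
    optimizer.zero_grad()                # clear gradients so they do not accumulate into the next batch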
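
For readers who want to see the same loop with nothing hidden inside an optimizer object, here is a minimal pure-Python sketch of minibatch SGD on a toy 1-D linear regression. The data generator, the function names (make_data, minibatches, train_sgd) and the hyperparameter values are all illustrative assumptions, not part of this page's original code.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic data (assumed for illustration): y = 3x + 1 plus small Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x + 1.0 + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data

def minibatches(data, batch_size, rng):
    """Shuffle once per epoch, then yield consecutive slices as minibatches."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    for i in range(0, len(shuffled), batch_size):
        yield shuffled[i:i + batch_size]

def train_sgd(data, alpha=0.1, epochs=50, batch_size=16, seed=1):
    rng = random.Random(seed)
    w, b = 0.0, 0.0                                        # parameters theta = (w, b)
    for _ in range(epochs):                                # 5. repeat over epochs
        for batch in minibatches(data, batch_size, rng):   # 1. sample a minibatch
            # 2. gradient of the mean squared error on this batch
            gw = gb = 0.0
            for x, y in batch:
                err = (w * x + b) - y
                gw += 2.0 * err * x
                gb += 2.0 * err
            gw /= len(batch)
            gb /= len(batch)
            # 3. plain SGD keeps no running state, so there is nothing to update here
            # 4. convert the gradient into a parameter step
            w -= alpha * gw
            b -= alpha * gb
    return w, b

w, b = train_sgd(make_data())
print(w, b)  # should land near the true slope 3 and intercept 1
```

The numbered comments map onto the five steps listed under Step by Step. Swapping in momentum or Adam only changes steps 3 and 4.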

The Math Inside

SGD:

theta = theta - alpha grad(L)

Momentum:

v_t = beta v_{t-1} + grad(L)

theta = theta - alpha v_t
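
To make the ravine picture concrete, the sketch below runs the SGD and momentum updates above on a toy quadratic that is 100 times steeper in one direction than the other. The loss function, starting point, and the values alpha = 0.01 and beta = 0.9 are assumptions chosen for illustration. It uses the exact gradient rather than a noisy minibatch gradient, so it isolates the ravine effect only.

```python
def grad(theta):
    """Gradient of the toy ravine loss L(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = theta
    return (x, 100.0 * y)

def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 100.0 * y * y)

def run_sgd(theta, alpha, steps):
    for _ in range(steps):
        g = grad(theta)
        # theta = theta - alpha * grad(L)
        theta = tuple(p - alpha * gi for p, gi in zip(theta, g))
    return theta

def run_momentum(theta, alpha, beta, steps):
    v = (0.0, 0.0)  # velocity starts at zero
    for _ in range(steps):
        g = grad(theta)
        # v_t = beta * v_{t-1} + grad(L)
        v = tuple(beta * vi + gi for vi, gi in zip(v, g))
        # theta = theta - alpha * v_t
        theta = tuple(p - alpha * vi for p, vi in zip(theta, v))
    return theta

start = (10.0, 1.0)
loss_sgd = loss(run_sgd(start, alpha=0.01, steps=100))
loss_momentum = loss(run_momentum(start, alpha=0.01, beta=0.9, steps=100))
print(loss_sgd, loss_momentum)  # momentum ends with a much smaller loss here
```

The learning rate is capped by the steep direction, so plain SGD crawls along the shallow one. Momentum accumulates velocity along the shallow direction and covers far more ground in the same 100 steps.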

Adam keeps exponential moving averages of the gradient g_t = grad(L) (first moment) and of its elementwise square (second moment):

m_t = beta_1 m_{t-1} + (1-beta_1) g_t

v_t = beta_2 v_{t-1} + (1-beta_2) g_t^2

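Both averages start at zero, so Adam first corrects that bias: m_hat_t = m_t / (1 - beta_1^t) and v_hat_t = v_t / (1 - beta_2^t). It then updates theta = theta - alpha m_hat_t / (sqrt(v_hat_t) + epsilon), which scales each parameter's step by roughly m_t / sqrt(v_t).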
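
A single Adam step, including the bias correction from the Kingma & Ba paper, can be sketched in plain Python as follows. The function name adam_step and the list-based parameter representation are assumptions for illustration. The default values (alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the ones suggested in the paper.

```python
import math

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count; all vectors are plain lists."""
    # Moving averages of the gradient and of the squared gradient
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction: m and v start at zero, so early averages are too small
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Per-parameter step, scaled by the root of the second-moment estimate
    theta = [p - alpha * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Two parameters whose gradients differ by six orders of magnitude
theta = [1.0, 1.0]
g = [1000.0, -0.001]
theta1, m1, v1 = adam_step(theta, g, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
steps = [new - old for new, old in zip(theta1, theta)]
print(steps)  # both steps have magnitude close to alpha, with opposite signs
```

On the first step the corrected ratio m_hat / sqrt(v_hat) is just g / |g|, so each parameter moves by about alpha regardless of how large its gradient is. That is the per-parameter rescaling in its simplest form.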

Interpretation:

SGD is the clean baseline
momentum remembers direction
Adam adapts step sizes per parameter and often needs less learning-rate tuning to work acceptably out of the box

Math Prerequisites

Gradient Descent - the core update rule
Derivatives & Gradients - where the update direction comes from
Loss Functions - what we're optimizing
MLP & Backprop - how deep models produce gradients