AlexNet → VGG → ResNet

I Use This When...

I want to understand why deep learning suddenly dominated computer vision and why very deep CNNs became trainable once residual connections arrived. It is background for later models that rely on ImageNet-scale training, feature hierarchies, and ResNet-style skip connections.

History

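AlexNet (Krizhevsky et al., 2012): Won the ImageNet challenge (ILSVRC 2012) with 15.3% top-5 error against 26.2% for the runner-up, and sparked the deep learning revolution in vision.
VGG (Simonyan and Zisserman, 2014): Showed that depth matters by stacking small 3x3 convolutions into 16 to 19 weight layers.
ResNet (He et al., 2015): Skip connections made 152 layers trainable on ImageNet (and over 1,000 on CIFAR-10). The ensemble reached 3.57% top-5 error, below the roughly 5% commonly cited as a human estimate.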

Why It Exists

The "why" chain is:

CNNs already worked for vision, but earlier models were limited in scale.
AlexNet showed that a larger, deeper CNN trained on GPUs could win ImageNet by a wide margin.
VGG suggested that more depth could keep improving learned features.
But past a certain depth, plain networks became harder to optimize: even training error rose as layers were added, so the problem was optimization rather than overfitting.
ResNet solved that by letting layers learn corrections on top of an identity path.

This family exists because vision kept rewarding depth, but depth needed an architectural trick to remain trainable.

How It Works

Visual Intuition

Imagine feature extraction through layers:

early layers detect edges and color contrasts
middle layers combine them into textures and motifs
deeper layers assemble parts like eyes, wheels, or fur
final layers reason about whole objects
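To make "early layers detect edges" concrete, here is a minimal sketch of one filter sliding over a toy image. The kernel values are written by hand for illustration, whereas a trained CNN learns its first-layer filters from data. The helper name `conv2d_valid` and the toy image are assumptions made for this example. Like most deep learning libraries, it computes cross-correlation (no kernel flip).

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

# 4x6 toy image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-written filter that responds where brightness increases left to right.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

response = conv2d_valid(image, vertical_edge)
for row in response:
    print(row)
```

The response is zero over the flat regions and positive only in the columns that straddle the dark-to-bright boundary. Deeper layers apply the same operation to these response maps instead of raw pixels, which is how simple patterns get combined into textures and parts.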

Now imagine a residual block with a shortcut lane:

one path applies convolutions and nonlinearities
the other path carries the input forward unchanged
the two paths are added together

That shortcut makes it easier for the network to preserve useful information instead of relearning identity mappings from scratch.

The timeline node is here:

-> MLViz Node: AlexNet / ResNet

Step by Step

Pass an image through stacks of convolutions, activations, and pooling
Let early layers learn simple visual patterns
Let deeper layers combine them into more abstract concepts
For ResNet, add skip connections so each block learns F(x) and outputs F(x) + x
Train end to end with backprop on large labeled datasets

AlexNet proved scale worked. VGG pushed depth. ResNet made extreme depth stable.

Code

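def residual_block(x):
    # conv_bn_relu, conv_bn, and relu are placeholder helpers, not defined on this page
    out = conv_bn_relu(x)   # F(x), part 1: convolution, batch norm, ReLU
    out = conv_bn(out)      # F(x), part 2: convolution, batch norm, no activation yet
    # the add requires F(x) and x to have the same shape
    return relu(out + x)    # identity shortcut, then ReLU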
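The pseudocode above relies on helpers that are not defined here. As a runnable stand-in, the sketch below implements the same structure on a single-channel 1D signal in plain Python. It is a simplification under stated assumptions: batch norm is omitted, the convolution is 1D with an odd-length kernel and zero padding, and the kernels `k1` and `k2` are passed in by hand instead of being learned.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def conv1d_same(v, kernel):
    """1D cross-correlation, zero-padded so output length equals input length.

    Assumes an odd-length kernel.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(v) + [0.0] * pad
    return [sum(padded[i + k] * kernel[k] for k in range(len(kernel)))
            for i in range(len(v))]

def residual_block(x, k1, k2):
    out = relu(conv1d_same(x, k1))   # stands in for conv_bn_relu (no batch norm)
    out = conv1d_same(out, k2)       # stands in for conv_bn (no batch norm)
    # "same" padding keeps the length fixed, so the elementwise add is valid.
    return relu([o + xi for o, xi in zip(out, x)])  # F(x) + x, then ReLU

x = [1.0, 2.0, 3.0, 4.0]

# All-zero kernels give F(x) = 0, so a non-negative input passes through unchanged.
identity_out = residual_block(x, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

# Non-zero kernels add a correction on top of the input.
corrected_out = residual_block(x, [0.0, 1.0, 0.0], [0.0, 0.5, 0.0])

print(identity_out)
print(corrected_out)
```

The zero-kernel case shows why the shortcut helps: the block starts from "do nothing" and only has to learn the correction. The final ReLU means exact pass-through holds only for non-negative inputs in this sketch.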

The Math Inside

ResNet replaces a plain mapping with:

y = F(x) + x

x: input activation
F(x): learned residual transformation
x shortcut: identity path

If the best mapping is close to identity, it is easier to learn a small residual F(x) than to relearn the entire mapping from zero.
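The same point as a short worked equation. Here H(x) is notation introduced for this note to mean the mapping the block would ideally compute; the final ReLU used in the code is left out to keep the algebra simple.

```latex
\begin{aligned}
y &= F(x) + x \\
y = H(x) \;&\Longrightarrow\; F(x) = H(x) - x \\
H(x) = x \;&\Longrightarrow\; F(x) = 0 \\
H(x) = 1.05\,x \;&\Longrightarrow\; F(x) = 0.05\,x
\end{aligned}
```

A plain layer would have to reproduce H(x) = x exactly through its weights and nonlinearity. The residual form only has to push F(x) toward zero, which weights near zero already achieve.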

Why this mattered:

gradients flow more easily through the shortcut path
deep stacks become easier to optimize
accuracy kept improving at depths where plain networks degraded (for example, from 34 to 152 layers on ImageNet)
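The gradient claim can be checked on a toy model. The sketch below treats each layer as a scalar function and multiplies local derivatives along the chain rule. The per-layer derivative of 0.05 and the depth of 20 are assumptions chosen to make the effect visible. Real networks use normalization and careful initialization, so this illustrates the mechanism without reproducing their actual gradient magnitudes.

```python
def chain_gradient(local_derivs, residual):
    """d(output)/d(input) for a stack of scalar layers via the chain rule.

    Plain layer:    y = f(x)      ->  dy/dx = f'(x)
    Residual layer: y = f(x) + x  ->  dy/dx = f'(x) + 1
    """
    grad = 1.0
    for d in local_derivs:
        grad *= (d + 1.0) if residual else d
    return grad

depth = 20
local_derivs = [0.05] * depth  # assumed small per-layer derivative f'(x)

plain_grad = chain_gradient(local_derivs, residual=False)
residual_grad = chain_gradient(local_derivs, residual=True)

print(f"plain:    {plain_grad:.3e}")
print(f"residual: {residual_grad:.3f}")
```

In the plain stack the gradient is a product of twenty small numbers and vanishes, on the order of 1e-26. In the residual stack each factor is 1.05, so the product stays near 2.65. The "+ 1" contributed by the identity path is what keeps the signal alive even when each F contributes little.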

AlexNet's contribution was not residual learning, but showing that large CNNs trained on GPUs could suddenly dominate visual recognition.

Math Prerequisites

CNN - convolution and feature maps
MLP & Backprop - how deep nets train
Optimization - why deeper nets are hard to optimize
Loss Functions - the training objective on classification tasks

Related

CNN - The base architecture
ViT - Transformer-based alternative
Timeline - The 2012 turning point