I Use This When...
I want to understand why deep learning suddenly dominated computer vision and why deeper CNNs stopped collapsing once residual connections arrived. This page matters when ImageNet-scale vision, feature hierarchies, and ResNet-style skip connections show up in later models.
History
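- AlexNet (Krizhevsky 2012): Won the ImageNet 2012 challenge by a wide margin and sparked the deep learning revolution.
- VGG (Simonyan 2014): Showed that depth matters, stacking small 3x3 convolutions up to 19 layers.
- ResNet (He 2015): Skip connections made 152-layer networks trainable and reached about 3.6% top-5 error on ImageNet with an ensemble, below the roughly 5% estimated for a human annotator.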
Why It Exists
The "why" chain is:
- CNNs already worked for vision, but earlier models were limited in scale.
- AlexNet showed that deeper CNNs plus GPUs could crush ImageNet.
- VGG suggested that more depth could keep improving learned features.
- But past a certain depth, plain networks became harder to optimize, and accuracy degraded even on the training set, so the cause was not overfitting.
- ResNet solved that by letting layers learn corrections on top of an identity path.
This family exists because vision kept rewarding depth, but depth needed an architectural trick to remain trainable.
How It Works
Visual Intuition
Imagine feature extraction through layers:
- early layers detect edges and color contrasts
- middle layers combine them into textures and motifs
- deeper layers assemble parts like eyes, wheels, or fur
- final layers reason about whole objects
Now imagine a residual block with a shortcut lane:
- one path applies convolutions and nonlinearities
- the other path carries the input forward unchanged
- the two paths are added together
That shortcut makes it easier for the network to preserve useful information instead of relearning identity mappings from scratch.
The timeline node is here:
-> MLViz Node: AlexNet / ResNet
Step by Step
- Pass an image through stacks of convolutions, activations, and pooling
- Let early layers learn simple visual patterns
- Let deeper layers combine them into more abstract concepts
- For ResNet, add skip connections so each block learns F(x) and outputs F(x) + x
- Train end to end with backprop on large labeled datasets
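The first three steps above can be sketched on a tiny image in plain Python. This is only an illustration: the helper names (conv2d, relu, max_pool2x2) and the hand-set VERTICAL_EDGE filter are assumptions made for this example, and in a real CNN the filter values are learned by backprop rather than written by hand.

```python
# Toy forward pass: convolution -> ReLU -> max pooling on a tiny grayscale image.
# All names here are illustrative, not taken from any library.

def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation with stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(fmap):
    """Elementwise max(0, v) over a 2D feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling (a trailing odd row or column is dropped)."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]

# Hand-set filter that responds to a dark-to-bright vertical edge.
VERTICAL_EDGE = [[-1.0, 1.0],
                 [-1.0, 1.0]]

# 5x5 image: two dark columns (0.0) followed by three bright columns (1.0).
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]

features = max_pool2x2(relu(conv2d(image, VERTICAL_EDGE)))
print(features)
```

The pooled feature map is nonzero only in the column that covers the dark-to-bright boundary, which is the sense in which an early layer "detects an edge". Deeper layers would apply further convolutions to feature maps like this one instead of to raw pixels.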
AlexNet proved scale worked. VGG pushed depth. ResNet made extreme depth stable.
Code
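def residual_block(x):
    # conv_bn_relu, conv_bn, and relu are helper layers assumed to be defined elsewhere
    out = conv_bn_relu(x)   # first half of F(x): conv -> batch norm -> ReLU
    out = conv_bn(out)      # second half of F(x): conv -> batch norm, no ReLU yet
    return relu(out + x)    # add the identity shortcut, then apply ReLU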
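For readers who want something that runs, here is a minimal self-contained sketch of the same block on a 1D signal using only the standard library. Several things are assumptions made for illustration: the kernels are passed in explicitly, conv1d_same is a hypothetical helper that expects an odd kernel length, and normalize is a simplified stand-in for batch norm that standardizes a single signal without learned scale and shift.

```python
import math

def relu(xs):
    return [max(0.0, v) for v in xs]

def conv1d_same(xs, kernel):
    """1D cross-correlation, zero-padded so the output length equals the input length.
    Assumes an odd kernel length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(xs) + [0.0] * pad
    return [
        sum(padded[i + k] * kernel[k] for k in range(len(kernel)))
        for i in range(len(xs))
    ]

def normalize(xs, eps=1e-5):
    """Simplified stand-in for batch norm: zero mean, unit variance over one signal."""
    mean = sum(xs) / len(xs)
    var = sum((v - mean) ** 2 for v in xs) / len(xs)
    return [(v - mean) / math.sqrt(var + eps) for v in xs]

def conv_bn_relu(xs, kernel):
    return relu(normalize(conv1d_same(xs, kernel)))

def conv_bn(xs, kernel):
    return normalize(conv1d_same(xs, kernel))

def residual_block(xs, kernel1, kernel2):
    out = conv_bn_relu(xs, kernel1)                 # first half of F(x)
    out = conv_bn(out, kernel2)                     # second half of F(x), no ReLU yet
    return relu([o + x for o, x in zip(out, xs)])   # F(x) + x, then ReLU

# Non-negative input, as it would be after an earlier ReLU.
x = [0.5, 1.0, 2.0, 0.0]

# With the second kernel at zero, F(x) is zero and the block passes x through unchanged.
identity_out = residual_block(x, [0.2, 0.5, 0.2], [0.0, 0.0, 0.0])

# With a non-zero second kernel, the block adds a correction on top of x.
corrected_out = residual_block(x, [0.2, 0.5, 0.2], [0.1, 0.3, 0.1])
```

The zero-kernel case shows why the shortcut helps: the block can behave as an identity mapping simply by keeping its residual branch near zero, so adding the block cannot make the representation worse by construction.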
The Math Inside
ResNet replaces a plain mapping with:
y = F(x) + x
- x: input activation
- F(x): learned residual transformation
- x (shortcut): identity path
If the best mapping is close to identity, it is easier to learn a small residual F(x) than to relearn the entire mapping from zero.
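The same argument can be written out with the desired mapping named explicitly. The symbol H(x) does not appear elsewhere on this page and is introduced here only for illustration:

```latex
\begin{aligned}
\text{plain block:} \quad & y = H(x) \\
\text{residual block:} \quad & y = F(x) + x, \qquad F(x) = H(x) - x \\
\text{if } H(x) \approx x: \quad & F(x) \approx 0
\end{aligned}
```

Pushing F(x) toward zero only requires small weights in the residual branch, whereas a plain block would have to reproduce the identity through a stack of nonlinear layers.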
Why this mattered:
- gradients flow more easily through the shortcut path
- deep stacks become easier to optimize
- accuracy keeps improving with depth well past the point where plain networks start to degrade
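The gradient point can be seen in a deliberately simplified scalar model. The sketch below assumes each layer is linear, so the chain rule reduces to a product of per-layer factors. The function names, the depth of 50, and the weights 0.5 and 0.01 are arbitrary choices for illustration, and real networks with nonlinearities and normalization behave less cleanly.

```python
# Toy scalar chains: every "layer" is linear, so d(output)/d(input)
# is just the product of the per-layer derivatives.

def plain_gradient(depth, w):
    """Gradient through `depth` plain layers of the form h <- w * h."""
    grad = 1.0
    for _ in range(depth):
        grad *= w            # each plain layer contributes a factor w
    return grad

def residual_gradient(depth, w):
    """Gradient through `depth` residual layers of the form h <- h + w * h."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + w      # the shortcut contributes the constant 1
    return grad

depth = 50
plain = plain_gradient(depth, w=0.5)
residual = residual_gradient(depth, w=0.01)
print(f"plain: {plain:.3e}, residual: {residual:.3f}")
```

With these values the plain chain's gradient shrinks to the order of 1e-15, while the residual chain's stays near 1.6. The constant 1 from the shortcut keeps each factor close to one as long as the residual branch is small. The toy also shows the limit of the argument: large residual weights would make the product grow instead of vanish.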
AlexNet's contribution was not residual learning, but showing that large CNNs trained on GPUs could suddenly dominate visual recognition.
Math Prerequisites
- CNN - convolution and feature maps
- MLP & Backprop - how deep nets train
- Optimization - why deeper nets are hard to optimize
- Loss Functions - the training objective on classification tasks