Convolutional Neural Network (CNN)

I Use This When...

I need a model that understands spatial structure, especially for images, video frames, medical scans, or any grid-like signal. CNNs are useful when nearby patterns matter more than arbitrary long-distance feature interactions.

History

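Yann LeCun (1989): a convolutional network trained with backpropagation for handwritten digit recognition, the precursor to LeNet-5 (1998). Breakthrough: AlexNet (Krizhevsky, Sutskever, Hinton, 2012) won the ImageNet challenge (ILSVRC 2012) with a top-5 error of about 15% versus about 26% for the runner-up, using GPU training.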

Why It Exists

The "why" chain is:

Images are not just unordered lists of pixels.
Nearby pixels form edges, corners, textures, and shapes.
Fully connected layers ignore that local structure and spend a separate weight on every pixel-to-unit connection.
We want small reusable detectors that can slide across the image.

CNNs exist because vision needs locality and weight sharing, not one giant fully-connected block.
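
To put a rough number on "waste parameters", here is a small arithmetic sketch. The sizes (a 224x224 RGB image, a 1000-unit fully connected layer, a conv layer with 64 filters of 3x3) are assumptions picked for illustration, not values from this note.

```python
# Hypothetical sizes chosen for illustration only.
HEIGHT, WIDTH, CHANNELS = 224, 224, 3   # an RGB image
HIDDEN_UNITS = 1000                     # width of a fully connected layer
NUM_FILTERS, KERNEL = 64, 3             # a conv layer with 64 filters of 3x3


def dense_params(inputs: int, outputs: int) -> int:
    """One weight per (input, output) pair, plus one bias per output."""
    return inputs * outputs + outputs


def conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """One kernel x kernel x in_channels filter per output channel, plus biases."""
    return kernel * kernel * in_channels * out_channels + out_channels


dense = dense_params(HEIGHT * WIDTH * CHANNELS, HIDDEN_UNITS)
conv = conv_params(CHANNELS, NUM_FILTERS, KERNEL)

print(f"fully connected: {dense:,} parameters")  # 150,529,000
print(f"convolutional:   {conv:,} parameters")   # 1,792
```

The conv layer's count does not depend on the image size at all, because the same filters are reused at every position.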

How It Works

Visual Intuition

Imagine a tiny filter moving across an image.

one filter lights up on vertical edges
another responds to corners
deeper layers combine those low-level signals into parts and objects

The same detector is reused everywhere in the image, which is why CNNs can scale much better than naive fully connected models for vision.
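
A minimal sketch of that intuition in plain Python, assuming a single-channel image, stride 1, and no padding. The helper `correlate2d` and both kernels are hand-written for illustration; in a real CNN the kernel values are learned, not chosen by hand.

```python
def correlate2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[u][v] * image[i + u][j + v]
                for u in range(kh) for v in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# 5x6 image: dark (0) on the left half, bright (1) on the right half.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked detectors (illustrative, not learned).
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]
horizontal_edge = [[-1, -1, -1],
                   [0, 0, 0],
                   [1, 1, 1]]

vertical_map = correlate2d(image, vertical_edge)
horizontal_map = correlate2d(image, horizontal_edge)

for row in vertical_map:
    print(row)          # [0, 3, 3, 0] on every row: strong response at the edge
for row in horizontal_map:
    print(row)          # [0, 0, 0, 0] on every row: no horizontal edge present
```

The vertical-edge filter responds only in the columns where the window straddles the dark-to-bright boundary, and the horizontal-edge filter stays silent on the same image.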

The major timeline breakthrough for this family is here:

-> MLViz Node: AlexNet

Step by Step

Start with an image tensor
Apply small learnable filters across local neighborhoods
Produce feature maps showing where each pattern appears
Repeat with deeper layers to build more abstract patterns
Optionally use pooling or striding to reduce spatial size
Feed the final representation to a classifier or regressor

The architecture progressively turns raw pixels into structured features.
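
One way to follow the steps above is to track how the spatial size changes from layer to layer. The sketch below uses the standard output-size formula floor((n + 2p - k) / s) + 1; the 32x32 input and the particular stack of layers are assumptions made up for this example.

```python
def output_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after a conv or pooling layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


# Hypothetical stack, chosen only for illustration.
layers = [
    ("conv 3x3, padding 1", {"kernel": 3, "stride": 1, "padding": 1}),
    ("pool 2x2, stride 2", {"kernel": 2, "stride": 2, "padding": 0}),
    ("conv 3x3, padding 1", {"kernel": 3, "stride": 1, "padding": 1}),
    ("pool 2x2, stride 2", {"kernel": 2, "stride": 2, "padding": 0}),
]

size = 32  # assumed 32x32 input
sizes = [size]
for name, cfg in layers:
    size = output_size(size, **cfg)
    sizes.append(size)
    print(f"{name:<22} -> {size}x{size}")

# sizes is now [32, 32, 16, 16, 8]
```

With padding 1, a 3x3 convolution keeps the size unchanged, so in this stack only the pooling steps shrink the feature map.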

Code

# concept sketch
# x -> conv -> relu -> pool -> conv -> relu -> pool -> classifier
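
A runnable version of the sketch above, written in plain Python so it has no dependencies. Everything here is an assumption for illustration: one channel, one 3x3 filter per conv layer, a 14x14 random input, three output classes, and random untrained weights, so the printed probabilities carry no meaning. A real model would use a framework, many filters per layer, and trained weights.

```python
import math
import random


def conv2d(x, kernel, bias=0.0):
    """Single-channel convolution layer (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[bias + sum(kernel[u][v] * x[i + u][j + v]
                        for u in range(kh) for v in range(kw))
             for j in range(len(x[0]) - kw + 1)]
            for i in range(len(x) - kh + 1)]


def relu(x):
    """Elementwise max(0, value)."""
    return [[max(0.0, v) for v in row] for row in x]


def max_pool(x, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(x[i + u][j + v] for u in range(size) for v in range(size))
             for j in range(0, len(x[0]) - size + 1, size)]
            for i in range(0, len(x) - size + 1, size)]


def linear(features, weights, biases):
    """Fully connected layer: one row of weights per output class."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)  # fixed seed so runs are repeatable


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


x = rand_matrix(14, 14)                      # stand-in for a grayscale image
k1, k2 = rand_matrix(3, 3), rand_matrix(3, 3)
w_out, b_out = rand_matrix(3, 4), [0.0, 0.0, 0.0]

h = max_pool(relu(conv2d(x, k1)))            # 14x14 -> 12x12 -> 6x6
h = max_pool(relu(conv2d(h, k2)))            # 6x6 -> 4x4 -> 2x2
features = [v for row in h for v in row]     # flatten 2x2 -> 4 values
probs = softmax(linear(features, w_out, b_out))
print(probs)
```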

The Math Inside

In practice, a convolution layer applies a small kernel over local patches.

At each spatial location, the kernel computes a dot product with the input patch to produce one value in a feature map.

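Conceptually, for a single channel with stride 1 and no padding (strictly a cross-correlation, since the kernel is not flipped):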

feature_map[i, j] = sum_u sum_v K[u, v] * X[i+u, j+v]

X: input image or feature map
K: learnable kernel
output value: how strongly that pattern appears at location (i, j)

Two core ideas matter:

local receptive fields: each output sees only a neighborhood
weight sharing: the same kernel is reused across positions

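That combination keeps the parameter count small and makes the layer translation-equivariant (shifting the input shifts the feature map the same way), which is what makes CNNs effective for images.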
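
Both ideas can be checked on a toy example. The sketch below assumes a single channel, stride 1, and no padding; the helper names `correlate2d` and `shift_right`, the 5x7 image, and the 2x2 kernel are all made up for illustration. It shows that shifting the input shifts the feature map by the same amount (a consequence of weight sharing) and that a far-away pixel does not affect a given output (local receptive field).

```python
def correlate2d(image, kernel):
    """feature_map[i][j] = sum_u sum_v K[u][v] * X[i+u][j+v] (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[u][v] * image[i + u][j + v]
                 for u in range(kh) for v in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]


def shift_right(grid, cols):
    """Shift every row right by `cols`, filling the left with zeros."""
    return [[0] * cols + row[:len(row) - cols] for row in grid]


# 5x7 image that is zero except for a small pattern near the top-left.
image = [[0] * 7 for _ in range(5)]
image[1][1], image[1][2] = 1, 2
image[2][1], image[2][2] = 3, 4

kernel = [[1, 2],
          [3, 4]]  # arbitrary illustrative weights

out = correlate2d(image, kernel)

# Weight sharing: shifting the input by 2 columns shifts the feature map by 2 columns.
out_of_shifted = correlate2d(shift_right(image, 2), kernel)
shifted_out = shift_right(out, 2)
print(out_of_shifted == shifted_out)

# Local receptive field: out[0][0] only sees image[0..1][0..1],
# so changing the far corner leaves it untouched.
changed = [row[:] for row in image]
changed[4][6] = 99
out_changed = correlate2d(changed, kernel)
print(out_changed[0][0] == out[0][0])
```

The equality with the shifted output is exact here because the pattern stays away from the image border; near the border, padding choices decide what happens.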

Math Prerequisites

Dot Product - what each sliding filter computes locally
Vectors & Matrices - tensor and feature-map intuition
MLP & Backprop - how convolutional weights are trained
AlexNet → ResNet - the major CNN breakthrough line

MLP & Backprop - the training method
AlexNet → ResNet - landmark CNN models
ViT - a Transformer-based alternative to CNNs for vision
Dot Product - what convolution computes at each location