Vision Transformer (ViT)

I Use This When...

History

Dosovitskiy et al. (Google, 2020). 'An Image Is Worth 16x16 Words.' Proved Transformers can replace CNNs for vision — if you have enough data.

Why It Exists

CNNs have inductive bias for local patterns (convolution). ViT asks: what if we just split the image into patches and treat them as tokens? With enough pre-training data, it works better.

How It Works

Visual Intuition

Step by Step

Code

# Implementation sketch

The Math Inside

Split image into 16x16 patches. Flatten each patch → linear projection → add positional embedding. Feed sequence of patch embeddings into standard Transformer encoder.

Math Prerequisites

CNN — What ViT replaces
Transformer — The architecture
AlexNet → ResNet — CNN evolution