
Vision Transformer (ViT)

vit vision transformer image-classification deep-learning 2026-04-08

I Use This When...

<!-- Practical use case -->

History

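Dosovitskiy et al. (Google Research, 2020). 'An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.' Showed that a plain Transformer applied to image patches can match or beat strong CNNs on image classification, provided it is pre-trained on enough data.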

Why It Exists

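CNNs have a built-in inductive bias for local patterns (convolution: locality and translation equivariance). ViT asks: what if we drop that bias, split the image into patches, and treat the patches as tokens? Trained only on a mid-sized dataset such as ImageNet, ViT tends to trail a comparable ResNet, because it has to learn locality from data. With large-scale pre-training (ImageNet-21k, JFT-300M), it matches or beats the CNN baselines.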

How It Works

Visual Intuition

<!-- 3B1B-style animation description -->

Step by Step

<!-- Algorithm walkthrough -->

Code

# Implementation sketch
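
A minimal sketch of the ViT input pipeline, assuming NumPy and a channels-last (H, W, C) image. It covers only patchify → linear projection → [class] token → positional embedding, not the encoder itself. The function names `patchify` and `embed_patches` are made up for this note, and the random weights stand in for parameters that would be learned in a real model.

```python
import numpy as np


def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (N, patch_size * patch_size * C),
    where N = (H // patch_size) * (W // patch_size), in row-major patch order.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (N, P*P*C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(image, proj, cls_token, pos_embed, patch_size=16):
    """Build the encoder input: [class] token + projected patches + positions."""
    tokens = patchify(image, patch_size) @ proj                     # (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)   # (N + 1, D)
    return tokens + pos_embed                                       # (N + 1, D)


# Illustrative sizes: 224x224 RGB image, 16x16 patches, D = 768.
rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
D = 768
proj = rng.standard_normal((16 * 16 * 3, D)) * 0.02   # stand-in for the learned projection
cls_token = np.zeros(D)                                # stand-in for the learned [class] token
pos_embed = rng.standard_normal((197, D)) * 0.02       # stand-in for learned positions
seq = embed_patches(image, proj, cls_token, pos_embed)
print(seq.shape)  # (197, 768)
```

The reshape/transpose pair does the patch split without Python loops. `seq` is what a standard Transformer encoder would consume: 196 patch tokens plus one [class] token.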

The Math Inside

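Split the H×W×C image into non-overlapping P×P patches (P = 16 in the paper's title; a 224×224 image gives 14×14 = 196 patches). Flatten each patch → linear projection to dimension D → prepend a learnable [class] token → add learned positional embeddings. Feed the resulting sequence into a standard Transformer encoder; the encoder output at the [class] position goes to the classification head.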
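
Written out as equations, this looks roughly as follows. The symbols follow the notation of the ViT paper: P is the patch size, N the number of patches, D the model width, E the patch projection, x_class a learnable [class] token prepended to the sequence, MSA multi-head self-attention, LN layer norm. The 224×224 RGB input in the last lines is just an illustrative size.

```latex
\begin{aligned}
x &\in \mathbb{R}^{H \times W \times C}
  \;\longrightarrow\;
  x_p \in \mathbb{R}^{N \times (P^2 C)}, \qquad N = \frac{HW}{P^2} \\[4pt]
z_0 &= \bigl[\, x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E \,\bigr] + E_{\text{pos}},
  \qquad E \in \mathbb{R}^{(P^2 C) \times D},\;
  E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D} \\[4pt]
z'_\ell &= \mathrm{MSA}\bigl(\mathrm{LN}(z_{\ell-1})\bigr) + z_{\ell-1},
  \qquad \ell = 1, \dots, L \\
z_\ell &= \mathrm{MLP}\bigl(\mathrm{LN}(z'_\ell)\bigr) + z'_\ell,
  \qquad \ell = 1, \dots, L \\[4pt]
y &= \mathrm{LN}\bigl(z_L^0\bigr) \\[8pt]
\text{Example: } & H = W = 224,\; C = 3,\; P = 16
  \;\Rightarrow\; N = \frac{224 \cdot 224}{16^2} = 196, \quad P^2 C = 768, \quad
  \text{sequence length } N + 1 = 197
\end{aligned}
```

Self-attention cost grows quadratically with N, so halving the patch size quadruples N and raises the attention cost by roughly 16×.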

Math Prerequisites

<!-- Links to math wiki -->
