Transformer

I Use This When...

I need a model for text or other sequences where long-range relationships matter, and I want training to scale in parallel. This is the default architecture behind modern language models and many multimodal systems.

History

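Vaswani et al. (2017), 'Attention Is All You Need' (Google Brain, Google Research, and the University of Toronto). Originally proposed for machine translation, it has since become the dominant architecture for language models and has been adapted to images, audio, and multimodal tasks.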

Why It Exists

The "why" chain is:

RNNs process one step at a time.
That serial bottleneck makes training slow.
Long-range dependencies are hard when information must hop through many steps.
We want every token to access every other token directly.

The Transformer exists because attention gives every token direct access to its context and, unlike recurrence, lets all positions be processed in parallel during training.

How It Works

Visual Intuition

Imagine a sentence as a table of tokens.

Each token asks, "which other tokens matter for me?"
Attention scores determine how strongly it should look at each of them.
The token builds a new representation by mixing information from the important tokens.
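
To make the picture concrete, here is a minimal sketch of that "table" for a three-token sentence. The token vectors are made-up random numbers, and the vectors themselves stand in for queries, keys, and values (a real model would use learned projections), so the printed weights only illustrate the mechanics, not meaningful language structure.

```python
import numpy as np

tokens = ["the", "cat", "sat"]
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), 4))  # hypothetical token vectors, one row per token

# Each row of `scores` answers "which other tokens matter for me?"
scores = X @ X.T / np.sqrt(X.shape[-1])

# Softmax turns each row of scores into weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each token's new representation is a weighted mix of all token vectors.
mixed = weights @ X

for tok, row in zip(tokens, weights):
    print(tok, np.round(row, 2))
```

Each printed row shows how much one token looks at every token in the sentence, including itself.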

Unlike an RNN, the model does not march left to right just to create context. Context is available directly through attention.

The main timeline node is here:

-> MLViz Node: Transformer

Step by Step

Embed the tokens
Add positional information so order is not lost
Compute queries, keys, and values from the inputs
Use attention scores to mix information across tokens
Apply a feed-forward network to each position, with a residual connection and layer normalization around each sub-layer
Stack many layers to build richer representations
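
The steps above can be sketched end to end as a tiny encoder. This is a simplified illustration under stated assumptions, not a reference implementation: it uses a single attention head, random untrained weights, learned-style positional embeddings, and layer normalization without learned gain or bias. The residual connections and layer normalization follow the layout of the original paper. All names (`init_layer`, `encoder_layer`, `encode`) and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified: no learned gain or bias.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def init_layer(d_model, d_ff):
    s = 1.0 / np.sqrt(d_model)
    return {
        "W_Q": rng.normal(0, s, (d_model, d_model)),
        "W_K": rng.normal(0, s, (d_model, d_model)),
        "W_V": rng.normal(0, s, (d_model, d_model)),
        "W_1": rng.normal(0, s, (d_model, d_ff)),
        "W_2": rng.normal(0, 1.0 / np.sqrt(d_ff), (d_ff, d_model)),
    }

def encoder_layer(x, p):
    # Steps 3 and 4: queries, keys, values, then attention-weighted mixing.
    Q, K, V = x @ p["W_Q"], x @ p["W_K"], x @ p["W_V"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn)                      # residual + normalization
    # Step 5: position-wise feed-forward network with ReLU.
    ff = np.maximum(0, x @ p["W_1"]) @ p["W_2"]
    return layer_norm(x + ff)

# Illustrative sizes only.
vocab_size, max_len, d_model, d_ff, n_layers = 50, 16, 8, 32, 2
token_emb = rng.normal(0, 1, (vocab_size, d_model))
pos_emb = rng.normal(0, 1, (max_len, d_model))
layers = [init_layer(d_model, d_ff) for _ in range(n_layers)]

def encode(token_ids):
    ids = np.asarray(token_ids)
    x = token_emb[ids] + pos_emb[: len(ids)]      # steps 1 and 2
    for p in layers:                              # step 6: stack layers
        x = encoder_layer(x, p)
    return x

out = encode([3, 14, 15, 9])
print(out.shape)
```

The output keeps one vector per input token, so layers can be stacked without changing shapes.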

Encoder-only, decoder-only, and encoder-decoder variants all reuse this core attention pattern.
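
One practical difference between the variants is the attention mask. Encoder-style attention lets every token see every other token, while decoder-style (causal) attention hides future positions so a token can only look at itself and earlier tokens. The sketch below shows that difference on random inputs; the function name `attention_weights` and the `causal` flag are illustrative choices, not any library's API.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_weights(Q, K, causal=False):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if causal:
        n = scores.shape[0]
        # True above the diagonal marks "future" positions.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        # -inf scores become exactly zero weight after the softmax.
        scores = np.where(future, -np.inf, scores)
    return softmax(scores)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))

w_full = attention_weights(Q, K)                 # encoder-style: all-to-all
w_causal = attention_weights(Q, K, causal=True)  # decoder-style: no peeking ahead

print(np.round(w_causal, 2))
```

In the causal case the weight matrix is lower triangular, and the first token can only attend to itself.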

Code

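# concept sketch (NumPy): X is (n_tokens, d_model); W_Q, W_K are (d_model, d_k); W_V is (d_model, d_v)
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled dot-product similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # attention = softmax(Q K^T / sqrt(d_k)) V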

The Math Inside

Self-attention:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Q: queries
K: keys
V: values
Q K^T: similarity scores between tokens
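sqrt(d_k): scale factor; it keeps the dot products from growing with the key dimension d_k, so the softmax does not saturate and gradients stay usable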
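
A quick numerical check of why the scale factor is there. Assuming query and key entries are independent with zero mean and unit variance (an idealization, not something the model guarantees), the unscaled dot product has standard deviation of about sqrt(d_k), while the scaled one stays near 1. The helper `score_std` is a hypothetical name for this experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_std(d_k, scaled, n=5000):
    # Sample n random query/key pairs and measure the spread of their dot products.
    q = rng.normal(size=(n, d_k))
    k = rng.normal(size=(n, d_k))
    s = (q * k).sum(axis=-1)
    if scaled:
        s = s / np.sqrt(d_k)
    return float(s.std())

for d_k in (4, 64, 512):
    raw = score_std(d_k, scaled=False)
    scaled = score_std(d_k, scaled=True)
    print(f"d_k={d_k:4d}  raw std={raw:6.2f}  scaled std={scaled:5.2f}")
```

Large unscaled scores would push the softmax toward a near one-hot output, which is the instability the scaling avoids.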

Multi-head attention repeats this with different learned projections so the model can capture different types of relationships simultaneously.
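
A minimal sketch of multi-head attention, assuming the common layout where one large projection is split into `n_heads` equal slices, each head attends independently, and the head outputs are concatenated and passed through an output projection `W_O` (as in the original paper). Weights here are random and untrained, and the function name is illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    n, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    heads = softmax(scores) @ V                           # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model) # glue heads back together
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model = 5, 8
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=2)
print(out.shape)
```

Each head has its own (n, n) weight matrix, which is what lets different heads track different relationships between the same tokens.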

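Because self-attention on its own is permutation-equivariant (reordering the input tokens just reorders the outputs the same way), it has no built-in notion of position. Positional encodings or learned positional embeddings are added so sequence order is still represented.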
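
The order-blindness can be checked directly. In the sketch below, shuffling the input rows of plain self-attention just shuffles the output rows the same way; once sinusoidal positional encodings (the fixed scheme from the original paper) are added, the same shuffle changes the result. For brevity the attention here uses the inputs directly as queries, keys, and values, and `d_model` is assumed to be even.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    # Simplification: no learned projections.
    return softmax(X @ X.T / np.sqrt(X.shape[-1])) @ X

def sinusoidal_positions(n, d_model):
    pos = np.arange(n)[:, None]               # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]     # even feature indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
perm = np.array([4, 2, 0, 3, 1])              # a fixed reordering of the tokens
pe = sinusoidal_positions(5, 8)

# Without positions: shuffled input gives the same outputs, just shuffled.
order_blind = np.allclose(self_attention(X[perm]), self_attention(X)[perm])

# With positions: the same tokens in a different order give different outputs.
order_aware = not np.allclose(self_attention(X[perm] + pe),
                              self_attention(X + pe)[perm])

print(order_blind, order_aware)
```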

Math Prerequisites

Dot Product - attention scores come from vector similarity; each entry of Q K^T is a dot product
Vectors & Matrices - token batches and projections

Related

RNN / LSTM - what the Transformer replaced
BERT - encoder-only Transformer
GPT - decoder-only Transformer
ViT - Transformer for images