I Use This When...
I need a model for text or other sequences where long-range relationships matter, and I want training to scale in parallel. This is the default architecture behind modern language models and many multimodal systems.
History
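Vaswani et al. (2017), 'Attention Is All You Need' (Google Brain, Google Research, and the University of Toronto). Originally proposed for machine translation, it has since become the dominant architecture for language models and is widely used for vision, speech, and multimodal tasks.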
Why It Exists
The "why" chain is:
- RNNs process one step at a time.
- That serial bottleneck makes training slow.
- Long-range dependencies are hard when information must hop through many steps.
- We want every token to access every other token directly.
The Transformer exists because attention gives direct context access and much better parallelism than recurrence.
How It Works
Visual Intuition
Imagine a sentence as a table of tokens.
- each token asks, "which other tokens matter for me?"
- attention scores determine how strongly it should look at them
- the token builds a new representation by mixing information from the important tokens
Unlike an RNN, the model does not march left to right just to create context. Context is available directly through attention.
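To make the intuition concrete, here is a toy sketch of one token "looking at" the others. The three 2-d vectors are made-up values, and the raw vectors are used directly as queries, keys, and values (a real Transformer applies learned projections first), so treat this as an illustration only.

```python
import numpy as np

# Hypothetical 2-d vectors for three tokens; the numbers are invented for illustration.
tokens = ["the", "cat", "sat"]
vectors = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 0.9]])

query = vectors[2]                                  # "sat" asks which tokens matter for it
scores = vectors @ query                            # dot-product similarity with every token
weights = np.exp(scores) / np.exp(scores).sum()     # softmax: non-negative, sums to 1
new_sat = weights @ vectors                         # new representation: weighted mix of all tokens

for tok, w in zip(tokens, weights):
    print(f"{tok}: {w:.2f}")
```

In this toy setup "sat" puts more weight on "cat" than on "the", because their vectors point in similar directions.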
Step by Step
- Embed the tokens
- Add positional information so order is not lost
- Compute queries, keys, and values from the inputs
- Use attention scores to mix information across tokens
- Apply feed-forward layers to each position
- Stack many layers to build richer representations
Encoder-only, decoder-only, and encoder-decoder variants all reuse this core attention pattern.
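The steps above can be sketched end to end in NumPy. This is a simplified illustration, not a faithful Transformer: it uses a single attention head, random matrices in place of learned weights, and it omits the residual connections and layer normalization that real Transformers include. All sizes and names (`embedding_table`, `position_table`, `init_layer`, `layer_forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy sizes, chosen only to keep the example small.
vocab_size, seq_len, d_model, d_ff, n_layers = 50, 6, 16, 32, 2

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def init_layer():
    # Random matrices stand in for parameters that training would learn.
    def w(rows, cols):
        return rng.normal(0.0, 1.0 / np.sqrt(rows), size=(rows, cols))
    return {"W_Q": w(d_model, d_model), "W_K": w(d_model, d_model),
            "W_V": w(d_model, d_model),
            "W_1": w(d_model, d_ff), "W_2": w(d_ff, d_model)}

def layer_forward(X, p):
    # Compute queries, keys, and values from the inputs.
    Q, K, V = X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]
    # Use attention scores to mix information across tokens.
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    mixed = weights @ V
    # Apply a feed-forward network (ReLU) to each position independently.
    return np.maximum(0.0, mixed @ p["W_1"]) @ p["W_2"]

token_ids = rng.integers(0, vocab_size, size=seq_len)
embedding_table = rng.normal(size=(vocab_size, d_model))
position_table = rng.normal(size=(seq_len, d_model))

# Embed the tokens and add positional information.
X = embedding_table[token_ids] + position_table
# Stack several layers.
for params in [init_layer() for _ in range(n_layers)]:
    X = layer_forward(X, params)

print(X.shape)  # (6, 16)
```

The output keeps one vector per input position, which is what lets layers be stacked.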
Code
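# Minimal NumPy sketch of single-head self-attention.
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # scaled similarity, shape (n_tokens, n_tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # each output row mixes the value vectors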
The Math Inside
Self-attention:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
- Q: queries
- K: keys
- V: values
- Q K^T: similarity scores between tokens
- sqrt(d_k): scale factor for stability
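One way to see why the sqrt(d_k) factor helps: if the components of a query and a key are assumed to be independent with zero mean and unit variance, their dot product has a standard deviation of about sqrt(d_k), so unscaled scores grow with dimension and push the softmax toward near one-hot outputs. The sketch below checks this empirically with random vectors; the sample count and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000          # arbitrary number of random query/key pairs
results = {}

for d_k in (4, 64, 1024):
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    raw = (q * k).sum(axis=1)           # one dot product per query/key pair
    scaled = raw / np.sqrt(d_k)         # the scaling used in the attention formula
    results[d_k] = (raw.std(), scaled.std())
    print(f"d_k={d_k:5d}  raw std={raw.std():7.2f}  scaled std={scaled.std():.2f}")
```

The raw standard deviation grows roughly like sqrt(d_k), while the scaled one stays close to 1 regardless of dimension.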
Multi-head attention repeats this with different learned projections so the model can capture different types of relationships simultaneously.
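A minimal sketch of multi-head attention in NumPy follows. It assumes the common formulation in which the model dimension is split evenly across heads and the concatenated heads pass through an output projection `W_O`. That projection is not described in the text above, and the weights here are random placeholders rather than learned values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads          # assumes d_model is divisible by n_heads

    def split(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, seq_len, seq_len)
    weights = softmax(scores)                              # each head has its own attention pattern
    heads = weights @ V                                    # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2      # hypothetical toy sizes
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)
print(out.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

Each slice `weights[h]` is a separate token-to-token attention matrix, which is how different heads can track different relationships.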
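Because self-attention has no built-in notion of order (permuting the input tokens simply permutes the outputs in the same way), positional encodings or learned positional embeddings are added to the inputs so that sequence order is still represented.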
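The sketch below illustrates the order problem and one possible fix. It first checks that reversing the input tokens only reverses the outputs of plain self-attention, then adds the sinusoidal positional encoding from Vaswani et al. (2017) and checks that order now changes the result. The weights are random placeholders, and the helper names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def sinusoidal_positions(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle).
    # Assumes d_model is even.
    pos = np.arange(seq_len)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (even_dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                  # hypothetical toy sizes
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
perm = np.arange(seq_len)[::-1]          # reverse the token order

# Without positions: shuffled inputs give the same outputs, just shuffled.
order_blind = np.allclose(self_attention(X, W_Q, W_K, W_V)[perm],
                          self_attention(X[perm], W_Q, W_K, W_V))

# With positions added, the same tokens in a different order give different outputs.
pe = sinusoidal_positions(seq_len, d_model)
order_aware = not np.allclose(self_attention(X + pe, W_Q, W_K, W_V)[perm],
                              self_attention(X[perm] + pe, W_Q, W_K, W_V))

print(order_blind, order_aware)
```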
Math Prerequisites
- Dot Product - attention scores come from vector similarity
- Vectors & Matrices - token batches and projections
- RNN / LSTM - what the Transformer replaced
- GPT and BERT - major model families built from the same core
Related
- RNN / LSTM — What Transformer replaced
- BERT — Encoder-only Transformer
- GPT — Decoder-only Transformer
- ViT — Transformer for images
- Dot Product — QK^T is a dot product