I Use This When...
I need a model that generates text, code, or structured tokens one step at a time. GPT-style models are the standard choice when the task is completion, dialogue, generation, tool calling, or open-ended reasoning through token prediction.
History
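GPT-1 (Radford et al., OpenAI, 2018) introduced generative pretraining with a decoder-only Transformer. GPT-2 (2019) was initially withheld as 'too dangerous to release' and then published in stages. GPT-3 (2020) scaled to 175B parameters, where few-shot learning emerged. ChatGPT (2022) combined GPT-3.5 with RLHF and reached a reported 100M users in about 2 months. GPT-4 (2023) added multimodal (image and text) input.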
Why It Exists
The "why" chain is:
- We want a model that can continue text coherently.
- Bidirectional encoders are strong at understanding but not natural generators.
- If we train a model to predict the next token and then apply it repeatedly, training and generation share one objective.
- Scaling that objective turns out to produce broad language competence.
GPT exists because next-token prediction is a simple objective that unlocks general-purpose generation.
How It Works
Visual Intuition
Imagine text being written one token at a time.
- the model sees the prefix so far
- it predicts a distribution over the next token
- one token is chosen
- that token is appended to the context
- the process repeats
The model is always solving the same local task, but the repeated loop creates paragraphs, code, dialogue, and reasoning traces.
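The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real GPT: `TOY_NEXT`, `next_token_distribution`, and `generate` are hypothetical names, and the toy "model" only looks at the last token, whereas a real GPT conditions on the whole prefix.

```python
import random

# Hypothetical toy "model": maps the last token to a distribution over the
# next token. A real GPT would condition on the entire prefix.
TOY_NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
}


def next_token_distribution(prefix):
    """Return P(next token | prefix); this toy only uses the last token."""
    return TOY_NEXT[prefix[-1]]


def generate(prefix, max_new_tokens=10, seed=0):
    """Sample tokens one at a time, feeding each choice back into the context."""
    rng = random.Random(seed)
    context = list(prefix)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)            # model sees the prefix so far
        tokens, weights = zip(*dist.items())
        choice = rng.choices(tokens, weights=weights)[0]   # one token is chosen
        context.append(choice)                             # appended to the context
        if choice == "</s>":                               # stop at end-of-sequence
            break
    return context


result = generate(["<s>"])
print(" ".join(result))
```

Sampling is only one way to choose the token; always taking the most probable token (greedy decoding) fits the same loop.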
Step by Step
- Tokenize the input sequence
- Embed the tokens and their positions
- Run them through a decoder-only Transformer
- Use a causal mask so each position only attends to earlier positions
- Predict the next-token distribution
- Train by minimizing cross-entropy on the true next token
Inference just repeats the next-token step autoregressively.
Code
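# Concept sketch: `model`, `cross_entropy`, and `tokens` are placeholders.
loss = 0.0
for t in range(len(tokens) - 1):
    probs = model(tokens[: t + 1])                    # one distribution per prefix position
    loss += cross_entropy(probs[-1], tokens[t + 1])   # score the true next token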
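A runnable version of the sketch above, under loud assumptions: `toy_model` is a hypothetical stand-in for a GPT forward pass (it simply favors "previous token + 1"), and `cross_entropy` here is the plain negative log-probability of the target. Averaging over positions is a common convention rather than something the sketch specifies.

```python
import math

VOCAB_SIZE = 4


def toy_model(prefix):
    """Hypothetical stand-in for a GPT forward pass.

    Returns one probability distribution per prefix position. It puts 0.7 on
    "previous token + 1 (mod vocab)" and 0.1 on each other token.
    """
    outputs = []
    for tok in prefix:
        probs = [0.1] * VOCAB_SIZE
        probs[(tok + 1) % VOCAB_SIZE] = 0.7
        outputs.append(probs)
    return outputs


def cross_entropy(probs, target):
    """Negative log-probability assigned to the true next token."""
    return -math.log(probs[target])


def next_token_loss(tokens):
    """Average next-token cross-entropy over one sequence."""
    total = 0.0
    for t in range(len(tokens) - 1):
        probs = toy_model(tokens[: t + 1])                  # model sees tokens 0..t
        total += cross_entropy(probs[-1], tokens[t + 1])    # score token t + 1
    return total / (len(tokens) - 1)


predictable = next_token_loss([0, 1, 2, 3])   # each next token is "previous + 1"
surprising = next_token_loss([0, 0, 0, 0])    # the toy model gives these only 0.1
print(predictable, surprising)
```

The explicit loop is for clarity. Because of the causal mask, practical implementations usually get the predictions for every position from a single forward pass over the full sequence.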
The Math Inside
Autoregressive factorization:
P(x_1, ..., x_n) = product_{t=1}^{n} P(x_t | x_1, ..., x_{t-1})
This turns sequence modeling into repeated conditional prediction.
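As a small check of the factorization, the sketch below scores whole sequences by summing conditional log-probabilities. The `conditional` function is a made-up two-token model chosen for illustration, not anything from GPT itself.

```python
import itertools
import math


def conditional(prefix):
    """Return [P(next=0 | prefix), P(next=1 | prefix)] for a hypothetical toy model."""
    if not prefix:
        return [0.5, 0.5]
    # Toy rule: repeat the previous token with probability 0.8.
    return [0.8, 0.2] if prefix[-1] == 0 else [0.2, 0.8]


def sequence_log_prob(seq):
    """log P(x_1..x_n) as the sum over t of log P(x_t | x_1..x_{t-1})."""
    return sum(math.log(conditional(seq[:t])[seq[t]]) for t in range(len(seq)))


# Probability of the sequence (0, 0, 0): 0.5 * 0.8 * 0.8
p_000 = math.exp(sequence_log_prob((0, 0, 0)))

# Summing over every length-3 sequence should give a valid distribution.
total = sum(
    math.exp(sequence_log_prob(s))
    for s in itertools.product((0, 1), repeat=3)
)
print(p_000, total)
```

Working in log space turns the product into a sum, which is also why the training loss is a sum of per-token cross-entropy terms.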
GPT uses:
- decoder-only Transformer blocks
- causal attention mask so token t cannot look ahead
- cross-entropy loss on the next true token
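To make the causal mask concrete, here is a minimal single-head attention sketch in plain Python. It is simplified on purpose: queries, keys, and values are the raw input vectors (no learned projections), and `causal_self_attention` is a hypothetical helper name. Skipping positions after `i` has the same effect as setting their scores to negative infinity before the softmax.

```python
import math


def causal_self_attention(x):
    """Single-head attention with a causal mask over a list of vectors.

    Simplification: queries, keys, and values are the inputs themselves,
    so the only thing on display is the mask.
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Position i only scores positions 0..i; later ones are masked out.
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        peak = max(scores)                              # for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([
            sum(w / norm * x[j][k] for j, w in enumerate(weights))
            for k in range(d)
        ])
    return out


seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = seq[:2] + [[-5.0, 3.0]]   # alter only the last token

before = causal_self_attention(seq)
after = causal_self_attention(changed)
print(before[:2] == after[:2])      # earlier positions are unaffected by the change
```

Changing the final token leaves the outputs at earlier positions untouched, which is the property that lets one forward pass supply a valid next-token prediction at every position during training.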
Why this matters:
- one training objective covers many text tasks
- pretraining on huge corpora builds reusable representations
- instruction tuning and RLHF can then reshape the behavior for assistant use
The surprising part historically is that scale made this simple objective far more capable than many people expected.
Math Prerequisites
- Transformer - the architecture GPT is built on
- Cross-Entropy - the next-token training loss
- RLHF - how assistant behavior is aligned after pretraining
Related
- Transformer — The architecture
- BERT — Encoder-only alternative
- RLHF — How ChatGPT was aligned
- Timeline — The LLM era