GPT (Generative Pre-trained Transformer)

I Use This When...

I need a model that generates text, code, or structured tokens one step at a time. GPT-style models are the standard choice when the task is completion, dialogue, generation, tool calling, or open-ended reasoning through token prediction.

History

GPT-1 (Radford et al., OpenAI, 2018). GPT-2 (2019): staged release, widely described as 'too dangerous to release.' GPT-3 (2020): 175B params, few-shot learning emerges. ChatGPT (2022): GPT-3.5 + RLHF, reportedly about 100M users within 2 months. GPT-4 (2023): multimodal, accepting image and text input.

Why It Exists

The "why" chain is:

We want a model that can continue text coherently.
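Bidirectional encoders are strong at understanding but are not natural generators, because they are trained with context on both sides of each token.
If we train the model to predict the next token and then apply that prediction repeatedly, training and generation share a single objective.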
Scaling that objective turns out to produce broad language competence.

GPT exists because next-token prediction is a simple objective that unlocks general-purpose generation.

How It Works

Visual Intuition

Imagine text being written one token at a time.

the model sees the prefix so far
it predicts a distribution over the next token
one token is chosen
that token is appended to the context
the process repeats

The model is always solving the same local task, but the repeated loop creates paragraphs, code, dialogue, and reasoning traces.
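
The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than a real model: `TOY_TABLE`, `next_token_probs`, and `generate` are hypothetical names, and the toy table looks only at the last token, whereas a GPT-style model conditions on the whole prefix.

```python
import random

# Hypothetical toy "model": fixed next-token probabilities keyed by the last
# token only. A real GPT computes this distribution from the entire prefix.
TOY_TABLE = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}


def next_token_probs(prefix):
    """Return a distribution over the next token given the prefix so far."""
    return TOY_TABLE[prefix[-1]]


def generate(prefix, max_new_tokens=10, seed=0):
    """Sample tokens one at a time, feeding each choice back into the context."""
    rng = random.Random(seed)
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)          # predict a distribution
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights, k=1)[0]  # choose one token
        tokens.append(token)                      # append it to the context
        if token == "</s>":                       # stop at end-of-sequence
            break
    return tokens
```

Calling `generate(["<s>"])` returns a short token list that starts with `<s>` and ends with `</s>`. Sampling from the distribution is only one way to choose a token; always taking the most likely token (greedy decoding) fits the same loop.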

The timeline node is here:

-> MLViz Node: GPT

Step by Step

Tokenize the input sequence
Embed the tokens and their positions
Run them through a decoder-only Transformer
Use a causal mask so each position attends only to itself and earlier positions
Predict the next-token distribution
Train by minimizing cross-entropy on the true next token

Inference just repeats the next-token step autoregressively.
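
The causal mask in the steps above can be illustrated with a stripped-down attention function. This is a sketch under strong simplifying assumptions: a single head, no learned query/key/value projections (the input vectors play all three roles), and the hypothetical name `causal_self_attention`. Real implementations usually apply the mask by setting future scores to negative infinity before the softmax; skipping those positions, as done here, gives the same weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x):
    """Single-head attention where queries, keys, and values are all x itself.

    x is a list of n vectors (lists of floats). Position t attends only to
    positions 0..t, which is what the causal mask enforces.
    """
    n, d = len(x), len(x[0])
    out = []
    for t in range(n):
        # Score position t against itself and earlier positions only;
        # later positions are excluded, which is the effect of the mask.
        scores = [
            sum(q * k for q, k in zip(x[t], x[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output at t is the attention-weighted average of visible vectors.
        out.append([
            sum(w * x[s][j] for s, w in enumerate(weights))
            for j in range(d)
        ])
    return out
```

A quick way to see the mask at work: change the last input vector and compare outputs. The outputs at earlier positions stay identical, because those positions never looked at it.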

Code

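# concept sketch: `model` returns next-token logits for each prefix position
# loss = 0.0
# for t in range(len(tokens) - 1):
#     logits = model(tokens[: t + 1])                    # scores given the prefix so far
#     loss += cross_entropy(logits[-1], tokens[t + 1])   # penalize missing the true next token
# loss /= len(tokens) - 1                                # average over predicted positions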
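
Because of the causal mask, the per-prefix loop in the sketch above does not have to call the model once per prefix: a single forward pass can produce a prediction at every position, and the targets are just the inputs shifted by one. The following is a small stdlib-only illustration of that shifted-target loss. The names `log_softmax` and `next_token_loss` are hypothetical helpers, and the logits are supplied by hand instead of coming from a model.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw scores, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def next_token_loss(logits_per_position, tokens):
    """Mean cross-entropy of predicting tokens[t + 1] from position t.

    logits_per_position[t] holds one score per vocabulary id and is assumed
    to have been computed from the prefix tokens[: t + 1]. The final token
    has no target, so only len(tokens) - 1 positions are scored.
    """
    targets = tokens[1:]  # inputs shifted left by one position
    losses = [
        -log_softmax(logits_per_position[t])[target]
        for t, target in enumerate(targets)
    ]
    return sum(losses) / len(losses)
```

With uniform logits over a vocabulary of size 4, the loss equals log(4), the cost of guessing blindly; logits that favor the true next token drive it toward zero.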

The Math Inside

Autoregressive factorization:

P(x_1, ..., x_n) = product_t P(x_t | x_1, ..., x_{t-1})

This turns sequence modeling into repeated conditional prediction.
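
A worked example may make the factorization concrete. The sketch below uses a hypothetical table `COND` of conditional probabilities over a two-token vocabulary. As a simplifying assumption, each conditional depends only on the previous token, whereas GPT conditions on the full prefix; the chain-rule bookkeeping is the same either way.

```python
import math
from itertools import product

# Hypothetical conditionals P(next | previous); "<s>" marks the sequence start.
COND = {
    "<s>": {"a": 0.7, "b": 0.3},
    "a": {"a": 0.2, "b": 0.8},
    "b": {"a": 0.5, "b": 0.5},
}


def sequence_log_prob(tokens):
    """log P(x_1..x_n) as the sum of conditional log-probabilities."""
    prev, total = "<s>", 0.0
    for tok in tokens:
        total += math.log(COND[prev][tok])  # add log P(x_t | context)
        prev = tok
    return total


def total_probability(n):
    """Sum P(sequence) over every length-n sequence; should be close to 1."""
    return sum(
        math.exp(sequence_log_prob(seq)) for seq in product("ab", repeat=n)
    )
```

For the sequence a, b, b the product is 0.7 * 0.8 * 0.5, so `math.exp(sequence_log_prob(["a", "b", "b"]))` is about 0.28. `total_probability(3)` comes out at approximately 1, which confirms that the conditionals define a valid distribution over whole sequences.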

GPT uses:

decoder-only Transformer blocks
causal attention mask so token t cannot look ahead
cross-entropy loss on the true next token

Why this matters:

one training objective covers many text tasks
pretraining on huge corpora builds reusable representations
instruction tuning and RLHF can then reshape the behavior for assistant use

The surprising part historically is that scale made this simple objective far more capable than many people expected.

Math Prerequisites

Transformer - the architecture GPT is built on
Cross-Entropy - the next-token training loss
RLHF - how assistant behavior is aligned after pretraining

Transformer - The architecture
BERT - Encoder-only alternative
RLHF - How ChatGPT was aligned
Timeline - The LLM era