I Use This When...
I already have a capable pretrained language model, but I need it to behave more helpfully, more safely, or more in line with human preferences. RLHF applies when doing well on the pretraining objective does not by itself produce useful assistant behavior.
History
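- Christiano et al. (2017): the original framework for learning a reward from human preference comparisons.
- InstructGPT (Ouyang et al., OpenAI, 2022): applied the approach to GPT-3.
- ChatGPT (2022): RLHF at scale, and the release that brought RLHF-trained assistants to a mass audience.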
Why It Exists
The "why" chain is:
- Pretraining teaches prediction, not preference.
- A model can be fluent and still be unhelpful, evasive, or misaligned with user intent.
- Humans can compare outputs even when they cannot write a perfect reward function directly.
- We can learn a reward model from those comparisons.
- Then we can optimize the policy against that learned reward while staying near the base model.
RLHF exists because "predict text well" is not the same objective as "behave like a useful assistant."
How It Works
Visual Intuition
Imagine one prompt producing several candidate answers.
- humans rank which answer is better
- a reward model learns to score outputs the way those humans ranked them
- the language model is then optimized to produce higher-scoring outputs
- a KL penalty stops it from drifting too far from the pretrained model
So the reward in RLHF does not come from the environment, as it does in classic game-playing RL. It is learned from human preferences.
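The ranking-to-reward step above can be sketched in a few lines. This is a minimal illustration that assumes a Bradley-Terry preference model, a common choice for reward-model training; the response names and scores are made up, and a real reward model would be a neural network scoring (prompt, response) pairs.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first item of each pair ranks higher.
    return list(combinations(ranked_responses, 2))

def preference_probability(score_preferred, score_rejected):
    """Bradley-Terry probability that the preferred response wins."""
    return 1.0 / (1.0 + math.exp(-(score_preferred - score_rejected)))

def pairwise_loss(scores, pairs):
    """Mean negative log-likelihood of the human preferences under the scores."""
    losses = [-math.log(preference_probability(scores[w], scores[l])) for w, l in pairs]
    return sum(losses) / len(losses)

# Hypothetical human ranking for one prompt, best answer first.
ranking = ["answer_a", "answer_b", "answer_c"]
pairs = ranking_to_pairs(ranking)

# Two hypothetical reward models: one agrees with the humans, one is inverted.
agreeing_scores = {"answer_a": 2.0, "answer_b": 0.5, "answer_c": -1.0}
disagreeing_scores = {"answer_a": -1.0, "answer_b": 0.5, "answer_c": 2.0}

print("loss when scores agree with ranking:   ", pairwise_loss(agreeing_scores, pairs))
print("loss when scores contradict ranking:", pairwise_loss(disagreeing_scores, pairs))
```

Training the reward model means adjusting its parameters to lower this loss, so the scores that agree with the human ranking are the ones it is pushed toward.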
Step by Step
- Start with a pretrained base model
- Do supervised fine-tuning on instruction-response examples
- Collect ranked comparisons of model outputs
- Train a reward model on those comparisons
- Optimize the policy with PPO against that reward
- Add a KL penalty so the model stays close to the reference policy
The pipeline is really "pretraining + supervised fine-tuning + preference modeling + constrained policy optimization."
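To make the last two steps concrete, here is a toy sketch of optimizing a policy against a fixed reward with a KL penalty. It is deliberately not PPO: it uses plain gradient ascent on the exact objective over three candidate responses, whereas PPO works from sampled responses and adds clipping and a value function. The rewards, reference probabilities, beta, and learning rate are all made-up values.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def objective(policy, reference, rewards, beta):
    """Expected reward minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, reference))
    return expected_reward - beta * kl

rewards = [1.0, 0.2, -0.5]      # hypothetical reward-model scores for three responses
reference = [0.2, 0.5, 0.3]     # hypothetical reference-policy probabilities
beta = 0.5                      # assumed penalty strength
learning_rate = 1.0             # assumed step size

logits = [math.log(q) for q in reference]  # start the policy at the reference
history = []
for _ in range(2000):
    policy = softmax(logits)
    history.append(objective(policy, reference, rewards, beta))
    # Reward shaped by the per-response KL term.
    shaped = [r - beta * math.log(p / q) for r, p, q in zip(rewards, policy, reference)]
    baseline = sum(p * s for p, s in zip(policy, shaped))
    # Exact gradient of the objective with respect to each logit.
    logits = [l + learning_rate * p * (s - baseline)
              for l, p, s in zip(logits, policy, shaped)]

final_policy = softmax(logits)
print("objective at start:", round(history[0], 4))
print("objective at end:  ", round(history[-1], 4))
print("final policy:      ", [round(p, 3) for p in final_policy])
```

The policy shifts probability toward the highest-reward response but keeps some mass on the others, because moving further from the reference costs more KL than the extra reward is worth.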
Code
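# Concept sketch: per-sample KL-penalized reward (names are illustrative).
def penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    # reward: reward_model(prompt, response)
    # logp_policy - logp_ref: single-sample estimate of KL(policy || reference_policy)
    return reward - beta * (logp_policy - logp_ref)

# PPO then updates the policy to maximize the expected penalized reward.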
The Math Inside
Typical RLHF objective:
maximize E[r_theta(x, y)] - beta * KL(pi || pi_ref)
- r_theta(x, y): reward-model score for prompt x and response y
- pi: current policy
- pi_ref: reference or base policy
- beta: strength of the stay-close penalty
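Written out with explicit expectations, the same objective might look like the following. The prompt distribution \mathcal{D} is an assumption introduced here for illustration; the other symbols match the definitions above.

```latex
\[
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
\big[\, r_\theta(x, y) \,\big]
\;-\;
\beta\, \mathbb{E}_{x \sim \mathcal{D}}
\Big[ \mathrm{KL}\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
\]

% The KL term is itself an expectation over responses sampled from \pi:
\[
\mathrm{KL}\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
= \mathbb{E}_{y \sim \pi(\cdot \mid x)}
\big[ \log \pi(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \big]
\]

% so the two terms combine into a single expectation over a shaped reward:
\[
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
\Big[\, r_\theta(x, y)
- \beta \big( \log \pi(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \big) \Big]
\]
```

The last form is why the penalty can be applied per sampled response: each response's reward is reduced by beta times its log-probability ratio between the current and reference policies.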
Three-stage view:
- supervised fine-tuning
- reward-model training from ranked preferences
- PPO optimization with KL regularization
The KL term matters because otherwise the model may exploit weaknesses in the reward model and drift away from the base model's language competence.
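A small numerical sketch can show this effect. It relies on the fact that, over a finite set of responses, the maximizer of the KL-regularized objective is proportional to pi_ref(y) * exp(r(y) / beta). The three responses, their reference probabilities, and the flawed reward scores are invented for illustration: the third response stands in for degenerate text that the reward model mistakenly overrates.

```python
import math

def kl_regularized_optimum(reference, rewards, beta):
    """Maximizer of E[r] - beta * KL(pi || pi_ref) over a finite response set."""
    weights = [q * math.exp(r / beta) for q, r in zip(reference, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same finite set."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Hypothetical setup: the base model almost never emits the third response,
# but the (imperfect) reward model scores it highest.
reference = [0.60, 0.39, 0.01]
rewards = [1.0, 0.8, 3.0]

for beta in (10.0, 1.0, 0.1):
    policy = kl_regularized_optimum(reference, rewards, beta)
    drift = kl_divergence(policy, reference)
    print(f"beta={beta:>4}: policy={[round(p, 3) for p in policy]}, KL from reference={drift:.3f}")
```

With a large beta the optimized policy stays close to the reference. As beta shrinks, the policy puts more and more mass on the overrated response, which is the reward-model exploitation the KL term is meant to limit.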
Math Prerequisites
- GPT - the pretrained model family commonly aligned this way
- Policy Gradient / PPO - the optimization step used in many RLHF pipelines
- KL Divergence - why the alignment update is constrained
Related
- GPT — The model being aligned
- Policy Gradient / PPO — The optimization algorithm
- KL Divergence — Constraining the update