RLHF (Reinforcement Learning from Human Feedback)

I Use This When...

I already have a capable pretrained language model, but I need it to behave more helpfully, more safely, or more in line with human preferences. RLHF is used when raw pretraining objective quality is not the same thing as useful assistant behavior.

History

Christiano et al. (2017) — original framework. InstructGPT (Ouyang et al., OpenAI 2022) — applied to GPT-3. ChatGPT (2022) — RLHF at scale, changed everything.

Why It Exists

The "why" chain is:

Pretraining teaches prediction, not preference.
A model can be fluent and still be unhelpful, evasive, or misaligned with user intent.
Humans can compare outputs even when they cannot write a perfect reward function directly.
We can learn a reward model from those comparisons.
Then we can optimize the policy against that learned reward while staying near the base model.

RLHF exists because "predict text well" is not the same objective as "behave like a useful assistant."

How It Works

Visual Intuition

Imagine one prompt producing several candidate answers.

humans rank which answer is better
a reward model learns to score outputs the way those humans ranked them
the language model is then optimized to produce higher-scoring outputs
a KL penalty stops it from drifting too far from the pretrained model

So RLHF is not reward from the environment in the classic game sense. It is reward learned from human preferences.

The alignment milestone is here:

-> MLViz Node: RLHF

Step by Step

Start with a pretrained base model
Do supervised fine-tuning on instruction-response examples
Collect ranked comparisons of model outputs
Train a reward model on those comparisons
Optimize the policy with PPO against that reward
Add a KL penalty so the model stays close to the reference policy

The pipeline is really "pretraining + preference modeling + constrained policy optimization."

Code

# concept sketch
# reward = reward_model(prompt, response)
# objective = reward - beta * KL(policy || reference_policy)
# optimize policy with PPO

The Math Inside

Typical RLHF objective:

maximize E[r_theta(x, y)] - beta * KL(pi || pi_ref)

r_theta(x, y): reward-model score for prompt x and response y
pi: current policy
pi_ref: reference or base policy
beta: strength of the stay-close penalty

Three-stage view:

supervised fine-tuning
reward-model training from ranked preferences
PPO optimization with KL regularization

The KL term matters because otherwise the model may exploit weaknesses in the reward model and drift away from the base model's language competence.

Math Prerequisites

GPT - the pretrained model family commonly aligned this way
Policy Gradient / PPO - the optimization step used in many RLHF pipelines
KL Divergence - why the alignment update is constrained

GPT — The model being aligned
Policy Gradient / PPO — The optimization algorithm
KL Divergence — Constraining the update