Policy Gradient & PPO

I Use This When...

I want to optimize the policy directly, especially when the action space is continuous, large, or awkward for value-based methods. PPO is a common default when stable on-policy reinforcement learning is needed.

History

REINFORCE: Williams (1992). Actor-Critic: Sutton (1999). PPO: Schulman (OpenAI, 2017). PPO became the default RL algorithm — used in ChatGPT's RLHF.

Why It Exists

The "why" chain is:

Value-based methods estimate how good actions are.
But some problems care more about directly improving the action distribution itself.
Continuous actions make argmax over actions awkward.
Direct policy optimization can be simpler conceptually.
PPO exists because naive policy updates are unstable and can move too far.

Policy gradients exist to learn behavior directly. PPO exists to keep those updates under control.

How It Works

Visual Intuition

Imagine a policy as a probability distribution over actions.

if a trajectory leads to good outcomes, the policy should make those actions more likely
if outcomes are bad, it should make them less likely
but if updates are too large, the policy can collapse or become unstable

PPO adds a trust-region-like clip so the new policy cannot move too far from the old one in one step.

Step by Step

Roll out trajectories using the current policy
Estimate returns or advantages from those trajectories
Compute how the new policy differs from the old policy
Improve the policy using the policy-gradient objective
In PPO, clip the update ratio to prevent overly aggressive changes
Repeat

The clip is the key stabilizer that made PPO popular in practice.

Code

# concept sketch
# ratio = pi_theta(a|s) / pi_theta_old(a|s)
# objective = min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)

The Math Inside

Policy-gradient core idea:

grad J(theta) = E[grad log pi_theta(a|s) * R]

In practice, we replace raw return R with an advantage estimate A_t to reduce variance.

PPO objective:

L = min(r_t A_t, clip(r_t, 1 - eps, 1 + eps) A_t)

where

r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)

The clip prevents the new policy from drifting too far from the old one in one optimization round.

That small design choice is why PPO became a practical default in many RL systems, including RLHF pipelines.

Math Prerequisites

Q-Learning - value-based alternative
Gradient Descent - optimization step intuition
RLHF - major modern use of PPO

Q-Learning — Value-based alternative
DQN — Value-based + neural net
RLHF — PPO applied to LLM alignment