I Use This When...
I need to compare one probability distribution against another, especially when I care about how costly it is to use the wrong distribution to explain data. KL divergence shows up in RLHF, variational methods, t-SNE, and cross-entropy analysis.
Why It Exists
The "why" chain is:
- A model often predicts a distribution, not just one label.
- We need a way to say how far that predicted distribution is from the target.
- Simple subtraction of probabilities misses the information cost.
- We want a measure of how much extra surprise we pay when using Q instead of P.
KL divergence exists because "wrong probabilities" are not all equally wrong. Being confidently wrong should cost much more.
Visual Intuition
Imagine P is the real distribution and Q is your model.
- if Q puts high probability where P also does, the mismatch is small
- if Q ignores outcomes that P thinks are likely, the mismatch becomes large
So KL measures a directional penalty:
- KL(P || Q) asks how expensive it is to approximate P using Q
- reversing the order changes the answer
That asymmetry is one reason KL is not a true geometric distance (it also fails the triangle inequality).
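A small numeric sketch of the two bullets and the asymmetry. The helper name `kl` and the three distributions are made up for illustration, and the helper assumes Q is nonzero wherever P is nonzero.

```python
import math

def kl(p, q):
    """KL(P || Q) in nats. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]          # reference distribution (illustrative)
q_close = [0.6, 0.4]    # puts mass roughly where P does
q_far = [0.99, 0.01]    # nearly ignores an outcome P considers likely

small = kl(p, q_close)
large = kl(p, q_far)
reverse = kl(q_far, p)  # same pair as `large`, opposite direction

print(f"KL(P || Q_close) = {small:.3f}")
print(f"KL(P || Q_far)   = {large:.3f}")
print(f"KL(Q_far || P)   = {reverse:.3f}")
```

The second value is far larger than the first, and the third differs from the second even though it compares the same two distributions.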
How It Works
- Choose the reference distribution P
- Choose the approximating distribution Q
- Compare their probabilities outcome by outcome
- Weight the log-ratio by how much P cares about each outcome
- Sum or integrate over all outcomes
This means KL pays most attention to regions where the true distribution assigns substantial mass.
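The steps above can be sketched for the discrete case as follows. This is a minimal illustration, not a reference implementation: `kl_terms` is a hypothetical helper, the distributions are invented, and Q is assumed to be nonzero wherever P is nonzero.

```python
import math

def kl_terms(p, q):
    """Per-outcome contributions P(x) * log(P(x) / Q(x)), in nats."""
    terms = []
    for px, qx in zip(p, q):
        if px == 0:
            terms.append(0.0)  # outcomes P never produces contribute nothing
        else:
            terms.append(px * math.log(px / qx))  # log-ratio weighted by P(x)
    return terms

p = [0.7, 0.2, 0.1]  # reference distribution (illustrative)
q = [0.1, 0.2, 0.7]  # approximating distribution (illustrative)

terms = kl_terms(p, q)
total = sum(terms)

for px, qx, t in zip(p, q, terms):
    print(f"P={px:.1f}  Q={qx:.1f}  contribution={t:+.4f}")
print(f"KL(P || Q) = {total:.4f}")
```

The first outcome, where P has most of its mass and Q has little, dominates the total. An individual contribution can be negative (here the last one, where Q overestimates), but the sum never is.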
The Math
Discrete case:
KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
- P(x): true or target probability
- Q(x): model or approximate probability
Important facts:
- KL(P || Q) >= 0
- KL(P || Q) = 0 only when P = Q
- in general KL(P || Q) != KL(Q || P)
Connection to cross-entropy:
H(P, Q) = H(P) + KL(P || Q)
So minimizing cross-entropy is equivalent to minimizing KL divergence when the
target entropy H(P) is fixed.
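A quick numeric check of the identity H(P, Q) = H(P) + KL(P || Q). The function names and the two distributions are assumptions chosen for illustration; all quantities use the natural log.

```python
import math

def entropy(p):
    """H(P) = -sum P(x) log P(x), in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) log Q(x), in nats. Assumes q > 0 where p > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """KL(P || Q) = sum P(x) log(P(x) / Q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]  # target distribution (illustrative)
q = [0.5, 0.3, 0.2]  # model distribution (illustrative)

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl(p, q)
print(f"H(P, Q)            = {lhs:.6f}")
print(f"H(P) + KL(P || Q)  = {rhs:.6f}")
```

Because H(P) does not depend on Q, any change to Q moves H(P, Q) and KL(P || Q) by the same amount.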
Examples
Suppose the target distribution is:
- P(cat) = 0.9
- P(dog) = 0.1
If a model predicts:
- Q(cat) = 0.6
- Q(dog) = 0.4
the model is not just slightly off. It assigns too much weight to the less likely outcome (0.4 instead of 0.1), so KL(P || Q) comes out at about 0.23 nats with the natural log, rather than 0.
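The arithmetic for this cat/dog example can be worked through term by term. The variable names are illustrative, and the conversion to bits simply divides by log(2).

```python
import math

p = {"cat": 0.9, "dog": 0.1}  # target distribution from the example
q = {"cat": 0.6, "dog": 0.4}  # model prediction from the example

# One term per outcome: P(x) * log(P(x) / Q(x))
contributions = {x: p[x] * math.log(p[x] / q[x]) for x in p}

kl_nats = sum(contributions.values())
kl_bits = kl_nats / math.log(2)  # same quantity measured with log base 2

for outcome, value in contributions.items():
    print(f"{outcome}: {value:+.4f}")
print(f"KL(P || Q) = {kl_nats:.4f} nats = {kl_bits:.4f} bits")
```

The cat term is positive because Q underestimates cat, the dog term is negative because Q overestimates dog, and the positive term wins.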
Code
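import math

def kl_divergence(p, q):
    """Return KL(P || Q) in nats for two discrete distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if qi == 0:
            return math.inf  # Q rules out an outcome that P allows
        total += pi * math.log(pi / qi)
    return total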
Used In
- t-SNE — Minimizes KL divergence
- RLHF — KL penalty prevents model drift
- Cross-Entropy — Related: CE = Entropy + KL
- Entropy — Foundation