I Use This When...
I need to quantify uncertainty in a probability distribution. In ML, entropy shows up directly in decision trees, cross-entropy loss, and many places where we ask how mixed or how surprising a distribution is.
Why It Exists
The "why" chain is:
- Some probability distributions are predictable.
- Others are highly uncertain.
- We want one number that summarizes that uncertainty.
- Entropy is that number.
Entropy exists because a single probability value does not tell us how concentrated or spread out a whole distribution is.
Visual Intuition
Compare three class distributions:
- [1.0, 0.0] -> fully certain
- [0.9, 0.1] -> mostly certain
- [0.5, 0.5] -> maximally uncertain for two classes
Entropy is smallest in the first case and largest in the last. Decision trees use that fact to decide which split makes class labels less mixed.
How It Works
- Start with a probability distribution over outcomes
- For each outcome, measure its surprise with -log p
- Average that surprise using the probabilities themselves
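Rare events are more surprising than common events, but each outcome's surprise is weighted by its probability, so a very rare event adds little to the average even though its individual surprise is large.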
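The steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the distribution `[0.7, 0.2, 0.1]` and the helper name `surprise` are made up for the example, and base-2 logarithms are assumed so the result is in bits.

```python
import math

def surprise(p):
    """Surprise (self-information) of an outcome with probability p, in bits."""
    return -math.log2(p)

# Hypothetical distribution: one common outcome and two rarer ones.
probs = [0.7, 0.2, 0.1]

contributions = []
for p in probs:
    s = surprise(p)              # rarer outcome -> larger surprise
    contributions.append(p * s)  # weight the surprise by how often it occurs
    print(f"p={p:.1f}  surprise={s:.3f} bits  weighted={p * s:.3f}")

# Entropy is the probability-weighted average of the surprises.
entropy_bits = sum(contributions)
print(f"entropy = {entropy_bits:.3f} bits")
```

Printing the weighted column next to the raw surprise shows that the rarest outcome has the largest surprise but not necessarily the largest contribution to the total.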
The Math Inside
For discrete outcomes:
H(X) = - sum p_i log2(p_i)
- p_i: probability of outcome i
- higher entropy: more uncertainty
- lower entropy: more predictability
For binary classification with class probability p:
H(p) = -p log2(p) - (1 - p) log2(1 - p)
Some useful cases:
- p = 0 or 1 -> entropy is 0
- p = 0.5 -> entropy is maximal
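To check these cases numerically, here is a small sketch of the binary formula. The function name `binary_entropy` is an illustrative choice, and the endpoints rely on the usual convention that 0 * log2(0) is treated as 0.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-class distribution [p, 1 - p]."""
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log2(0) is treated as 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Tabulate a few class probabilities, including both endpoints.
for p in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(f"p={p:.1f}  H={binary_entropy(p):.3f}")
```

The values for p = 0.1 and p = 0.9 come out the same, since the formula is symmetric in p and 1 - p.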
In a decision tree, a split is good if the weighted average entropy of the child nodes is lower than the entropy of the parent node. That drop is information gain.
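As a rough sketch of that calculation, the code below compares two candidate splits of the same node. The labels, the split contents, and the helper names `entropy_of_labels` and `information_gain` are all hypothetical; real tree libraries handle this internally and may use other impurity criteria.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy in bits of the empirical class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy_of_labels(child) for child in children)
    return entropy_of_labels(parent) - weighted

# Hypothetical node with 4 "yes" and 4 "no" labels, split two different ways.
parent = ["yes"] * 4 + ["no"] * 4
clean_split = [["yes"] * 4, ["no"] * 4]  # both children are pure
weak_split = [["yes", "yes", "no", "no"],
              ["yes", "yes", "no", "no"]]  # both children are still 50/50

print(information_gain(parent, clean_split))
print(information_gain(parent, weak_split))
```

The clean split removes all of the parent's uncertainty, while the weak split removes none, so a tree builder comparing the two would prefer the first.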
Examples
- a pure node in a decision tree has entropy 0
- a 50/50 mixed node has high entropy
- cross-entropy compares a predicted distribution to the true one using the same idea of surprise
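The cross-entropy case can be sketched the same way. This is an illustration under stated assumptions: the function name `cross_entropy` and the example distributions are made up, base-2 logs are used to match the entropy formula here (ML libraries typically use natural logs), and predicted probabilities are assumed to be strictly positive wherever the true probability is positive.

```python
import math

def cross_entropy(true_probs, pred_probs):
    """Average surprise, in bits, of pred_probs when outcomes follow true_probs."""
    total = 0.0
    for t, q in zip(true_probs, pred_probs):
        if t > 0:  # outcomes that never occur add nothing
            total -= t * math.log2(q)
    return total

# Hypothetical one-hot label (class 0 is correct) and two predictions.
true_label = [1.0, 0.0]
confident_right = [0.9, 0.1]
confident_wrong = [0.1, 0.9]

print(cross_entropy(true_label, confident_right))
print(cross_entropy(true_label, confident_wrong))
```

A prediction that puts most of its probability on the correct class is only mildly surprised by the outcome, so its cross-entropy is small; the confidently wrong prediction is penalized much more heavily.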
Code
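import math

def entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    total = 0.0
    for p in probs:
        if p > 0:  # skip zero-probability outcomes: 0 * log2(0) is taken as 0
            total -= p * math.log2(p)
    return total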
Used In
- Decision Tree: Information Gain = entropy reduction
- Cross-Entropy: Classification loss
- KL Divergence: How much one distribution differs from another (asymmetric, so not a true distance)