I Use This When...
I want a fast probabilistic baseline for text or count-like features, especially when data is sparse and interpretability matters. Spam filtering and simple document classification are classic Naive Bayes territory.
History
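Naive Bayes was applied to machine learning problems in the 1960s and became widely known through email spam filtering in the late 1990s. It is called 'naive' because it assumes features are conditionally independent given the class.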
Why It Exists
The "why" chain is:
- classification can be phrased as "which label is most probable given the evidence?"
- Bayes' theorem tells us how to update belief from evidence
- but modeling the full joint feature distribution is expensive
- assuming features are conditionally independent makes the model tractable
Naive Bayes exists because a strong simplifying assumption can turn probability theory into a very practical classifier.
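To put rough numbers on the tractability argument, the sketch below counts free parameters per class, assuming every feature is binary (an assumption made only for illustration). The function names are hypothetical.

```python
def full_joint_params(n_features: int) -> int:
    # A full probability table over n binary features has 2**n cells per class;
    # one cell is determined by the rest because probabilities sum to 1.
    return 2 ** n_features - 1


def naive_bayes_params(n_features: int) -> int:
    # Under conditional independence, one Bernoulli parameter
    # P(x_i = 1 | class) per feature is enough.
    return n_features


for n in (10, 20, 30):
    print(n, full_joint_params(n), naive_bayes_params(n))
```

The full joint grows exponentially with the number of features, while the naive model grows linearly.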
How It Works
Visual Intuition
Imagine classifying an email.
- start with a prior spam probability
- inspect words like "lottery", "discount", or "meeting"
- each word nudges the posterior toward one class or the other
The model does not learn a geometric boundary directly. It compares how plausible the observed features are under each class.
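The nudging picture can be sketched in log-odds form: start from the prior odds of spam and add one log-likelihood ratio per observed word. All probabilities below are made-up values for illustration, not estimates from real data, and the function name is hypothetical. Only observed words contribute, which is a simplification.

```python
import math

# Hypothetical numbers for illustration only.
prior_spam = 0.3
# word -> (P(word | spam), P(word | not spam))
word_likelihoods = {
    "lottery": (0.05, 0.001),
    "discount": (0.04, 0.01),
    "meeting": (0.005, 0.03),
}


def spam_posterior(words, prior=prior_spam):
    # Start from the prior log-odds of spam versus not spam.
    log_odds = math.log(prior / (1 - prior))
    for w in words:
        p_spam, p_ham = word_likelihoods[w]
        # Each observed word adds its log-likelihood ratio.
        log_odds += math.log(p_spam / p_ham)
    # Convert log-odds back to a probability.
    return 1 / (1 + math.exp(-log_odds))


print(spam_posterior([]))            # just the prior
print(spam_posterior(["lottery"]))   # pushed toward spam
print(spam_posterior(["meeting"]))   # pushed toward not spam
```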
Step by Step
- Estimate class priors such as P(spam) and P(not spam)
- Estimate per-feature likelihoods such as P(word | class)
- For a new example, multiply the likelihoods (or, equivalently, sum the log-likelihoods) across features
- Combine them with the class prior
- Predict the class with the largest posterior score
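The steps above can be sketched end to end as a small multinomial-style classifier over word counts. This is a minimal illustration, not a reference implementation: the toy corpus, the function names `train` and `predict`, and the add-alpha (Laplace) smoothing are all assumptions introduced here. Smoothing is included so that a word unseen in one class does not produce log(0).

```python
import math
from collections import Counter, defaultdict


def train(docs, alpha=1.0):
    """docs: list of (tokens, label). alpha: add-alpha smoothing (assumed)."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    # Step 1: class priors from label frequencies.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    # Step 2: smoothed per-word likelihoods P(word | class).
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_lik, vocab


def predict(model, tokens):
    log_prior, log_lik, vocab = model
    # Steps 3-4: sum log-likelihoods and add the log prior.
    # Words outside the training vocabulary are skipped.
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    # Step 5: pick the class with the largest score.
    return max(scores, key=scores.get), scores


docs = [
    ("win lottery now".split(), "spam"),
    ("discount lottery offer".split(), "spam"),
    ("project meeting tomorrow".split(), "not spam"),
    ("meeting notes attached".split(), "not spam"),
]
model = train(docs)
print(predict(model, "lottery discount".split())[0])
print(predict(model, "meeting notes".split())[0])
```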
Code
def naive_bayes_score(log_prior, feature_log_probs):
    # Unnormalized log-posterior for one class:
    # log P(class) + sum of log P(x_i | class) over the features.
    return log_prior + sum(feature_log_probs)
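A possible way to use this function is to compute one score per class and pick the larger. The probabilities below are hypothetical values chosen for illustration; the function is repeated so the snippet runs on its own.

```python
import math


def naive_bayes_score(log_prior, feature_log_probs):
    return log_prior + sum(feature_log_probs)


# Hypothetical values: P(spam) = 0.3, and two observed words with
# assumed per-class likelihoods.
spam_score = naive_bayes_score(
    math.log(0.3), [math.log(0.05), math.log(0.04)]
)
ham_score = naive_bayes_score(
    math.log(0.7), [math.log(0.001), math.log(0.01)]
)

prediction = "spam" if spam_score > ham_score else "not spam"
print(prediction)
```

The scores are unnormalized log-posteriors, so only their ordering matters for the prediction.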
The Math Inside
Bayes gives:
P(class | x) proportional to P(x | class) P(class)
The naive assumption is:
P(x | class) = product over i of P(x_i | class)
So the classifier becomes:
argmax_c P(c) * product over i of P(x_i | c)
In practice we usually use logs, which turns the product into a sum and avoids numerical underflow:
argmax_c [log P(c) + sum over i of log P(x_i | c)]
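A quick sketch of why the log form is preferred: multiplying many small probabilities underflows in floating point, while the sum of their logs stays finite. The values are arbitrary and chosen only to trigger the effect; `math.prod` requires Python 3.8 or later.

```python
import math

# 1000 features, each with an (arbitrary) likelihood of 1e-5.
probs = [1e-5] * 1000

product = math.prod(probs)                  # true value is 1e-5000
log_sum = sum(math.log(p) for p in probs)   # about 1000 * log(1e-5)

print(product)   # the product is too small for a float
print(log_sum)   # the log-space score is still usable for argmax
```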
Why this matters:
- very fast to train
- works surprisingly well on bag-of-words text
- can struggle when feature dependence is strong
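The last point can be illustrated with the most extreme form of dependence: the same feature repeated. Because the model treats each copy as independent evidence, the posterior becomes more confident even though no new information was added. The numbers and the function name are hypothetical.

```python
import math


def posterior_spam(prior, likelihood_pairs):
    # likelihood_pairs: list of (P(x_i | spam), P(x_i | not spam)).
    log_odds = math.log(prior / (1 - prior))
    for p_spam, p_ham in likelihood_pairs:
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


evidence = (0.04, 0.01)  # assumed likelihoods for one feature

once = posterior_spam(0.3, [evidence])
duplicated = posterior_spam(0.3, [evidence] * 3)  # same feature, three copies

print(once, duplicated)
```

The predicted class is often still right in such cases, but the probability estimates tend to be overconfident.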
Math Prerequisites
- Bayes' Theorem - posterior from prior and likelihood
- Conditional Independence - the key simplifying assumption
- Distributions - Bernoulli, multinomial, or Gaussian variants
Related
- Bayes' Theorem - the foundation
- Logistic Regression - discriminative vs generative comparison
- Distributions - Gaussian Naive Bayes