Naive Bayes

I Use This When...

I want a fast probabilistic baseline for text or count-like features, especially when data is sparse and interpretability matters. Spam filtering and simple document classification are classic Naive Bayes territory.

History

Naive Bayes was applied to machine learning in the 1960s and was famously used for email spam filtering in the late 1990s. It is called "naive" because it assumes the features are independent of one another given the class.

Why It Exists

The "why" chain is:

classification can be phrased as "which label is most probable given the evidence?"
Bayes' theorem tells us how to update belief from evidence
but modeling the full joint feature distribution is expensive
assuming features are conditionally independent makes the model tractable

Naive Bayes exists because a strong simplifying assumption can turn probability theory into a very practical classifier.
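
To make the cost argument concrete, here is a rough parameter count. It assumes binary features, which is a simplification chosen for illustration, and the function names are hypothetical. An unrestricted joint table over n binary features needs 2^n - 1 free parameters per class, while the naive factorization needs only n.

```python
def full_joint_params(n_features: int) -> int:
    """Free parameters per class for an unrestricted joint over binary features."""
    # 2**n possible feature vectors, minus 1 because probabilities sum to 1
    return 2 ** n_features - 1


def naive_params(n_features: int) -> int:
    """Free parameters per class under conditional independence."""
    # one Bernoulli parameter P(x_i = 1 | class) per feature
    return n_features


for n in (10, 20, 30):
    print(n, full_joint_params(n), naive_params(n))
```

The gap grows exponentially with the number of features, which is why the independence assumption matters for vocabularies with thousands of words.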

How It Works

Visual Intuition

Imagine classifying an email.

start with a prior spam probability
inspect words like "lottery", "discount", or "meeting"
each word nudges the posterior toward one class or the other

The model does not learn a geometric boundary directly. It compares how plausible the observed features are under each class.
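
The "nudging" picture can be sketched with log-odds. All probabilities below are made-up numbers for illustration, and `spam_posterior` is a hypothetical helper rather than part of any library. Each word adds the log of its likelihood ratio to a running total that starts at the prior log-odds.

```python
import math

# Hypothetical numbers chosen only for illustration.
prior_spam = 0.3
# word: (P(word | spam), P(word | not spam))
word_likelihoods = {
    "lottery": (0.05, 0.001),
    "discount": (0.04, 0.01),
    "meeting": (0.002, 0.03),
}


def spam_posterior(words, prior=prior_spam, table=word_likelihoods):
    """Accumulate log-odds word by word, then map back to a probability."""
    log_odds = math.log(prior / (1.0 - prior))
    for word in words:
        if word not in table:
            continue  # this sketch ignores words it has no estimate for
        p_spam, p_ham = table[word]
        # positive values nudge toward spam, negative values toward not spam
        log_odds += math.log(p_spam / p_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))


print(spam_posterior(["lottery", "discount"]))
print(spam_posterior(["meeting"]))
```

With no words at all the function returns the prior, so the prior really is the starting point that the evidence moves.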

Step by Step

Estimate class priors such as P(spam) and P(not spam)
Estimate per-feature likelihoods such as P(word | class)
For a new example, multiply the likelihoods across features (or, equivalently, sum their logs)
Combine them with the class prior
Predict the class with the largest posterior score
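
The steps above can be sketched from scratch for word-count features. This is a minimal illustration, not a reference implementation: the `train` and `predict` names, the toy documents, and the add-one (Laplace) smoothing controlled by `alpha` are all assumptions added here. Smoothing is included so that a word unseen in one class does not produce log(0).

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log P(word | class) from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}

    # Step 1: class priors
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    # Step 2: per-word likelihoods with additive smoothing (assumption: alpha=1)
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik


def predict(tokens, log_prior, log_lik):
    """Steps 3 to 5: sum log-likelihoods, add the log prior, take the argmax."""
    scores = {}
    for c in log_prior:
        # words outside the training vocabulary are skipped in this sketch
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in tokens if w in log_lik[c])
    return max(scores, key=scores.get)


docs = [["win", "lottery", "now"], ["discount", "lottery"],
        ["meeting", "at", "noon"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_lik = train(docs, labels)
print(predict(["lottery", "discount"], log_prior, log_lik))
print(predict(["meeting", "noon"], log_prior, log_lik))
```

Training is a single counting pass over the data, which is where the speed of the method comes from.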

Code

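def naive_bayes_score(log_prior, feature_log_probs):
    """Return the unnormalized log-posterior score for one class."""
    # log P(c) + sum_i log P(x_i | c); summing logs avoids underflow
    return log_prior + sum(feature_log_probs)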
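
A usage sketch for the scoring function above, repeated here so the block runs on its own. The class names and probabilities are hypothetical. It scores each class, picks the largest, and then normalizes the scores with a log-sum-exp step to recover posterior probabilities.

```python
import math


def naive_bayes_score(log_prior, feature_log_probs):
    # same scoring rule as above: log P(c) + sum_i log P(x_i | c)
    return log_prior + sum(feature_log_probs)


# Hypothetical per-class inputs for one email containing two words.
candidates = {
    "spam": (math.log(0.3), [math.log(0.05), math.log(0.04)]),
    "not spam": (math.log(0.7), [math.log(0.001), math.log(0.01)]),
}

scores = {
    label: naive_bayes_score(log_prior, log_probs)
    for label, (log_prior, log_probs) in candidates.items()
}
prediction = max(scores, key=scores.get)

# Scores are unnormalized; subtract the log of the total to get posteriors.
m = max(scores.values())
log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
posterior = {label: math.exp(s - log_z) for label, s in scores.items()}

print(prediction, posterior)
```

The argmax does not need the normalization step; it only matters when calibrated-looking probabilities are wanted rather than a label.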

The Math Inside

Bayes gives:

P(c | x) proportional to P(x | c) P(c)

The naive assumption is:

P(x | c) = product_i P(x_i | c)

So the classifier becomes:

argmax_c P(c) product_i P(x_i | c)

In practice we usually use logs, which turn the product into a sum and avoid numerical underflow:

argmax_c [log P(c) + sum_i log P(x_i | c)]
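
A small demonstration of why the log form is preferred in practice. The feature count and the likelihood value are arbitrary choices for illustration. Multiplying many small probabilities underflows double-precision floats, while the sum of logs stays finite.

```python
import math

# 1000 features, each with likelihood 0.01 (hypothetical values)
probs = [0.01] * 1000

direct_product = math.prod(probs)           # underflows to zero in float64
log_sum = sum(math.log(p) for p in probs)   # stays finite

print(direct_product)
print(log_sum)
```

Because log is monotonic, the class that maximizes the sum of logs is the same class that would maximize the product if it could be computed exactly.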

Why this matters:

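very fast to train, since fitting reduces to counting or simple per-class averages
works surprisingly well on bag-of-words text
can struggle when feature dependence is strong, because correlated features are counted as separate evidence and the predicted probabilities become overconfident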
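
The effect of feature dependence can be seen in an extreme case: one feature copied several times. The numbers and the `posterior_spam` helper are hypothetical. The copies carry no new information, yet the model treats each one as independent evidence.

```python
import math


def posterior_spam(log_odds_prior, log_ratios):
    """P(spam | x) from prior log-odds plus per-feature log likelihood ratios."""
    log_odds = log_odds_prior + sum(log_ratios)
    return 1.0 / (1.0 + math.exp(-log_odds))


prior = 0.0              # log-odds of 0 means P(spam) = 0.5
ratio = math.log(3.0)    # one feature that is 3x likelier under spam

once = posterior_spam(prior, [ratio])
# The same feature copied five times adds nothing new,
# but it is counted as five independent pieces of evidence.
copied = posterior_spam(prior, [ratio] * 5)

print(once, copied)
```

The predicted label is the same in both cases, which is one reason the classifier can still rank classes well even when its probabilities are poorly calibrated.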

Math Prerequisites

Bayes' Theorem - posterior from prior and likelihood
Conditional Independence - the key simplifying assumption
Distributions - Bernoulli, multinomial, or Gaussian variants
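
The variants differ only in how the per-feature term log P(x_i | class) is computed. A minimal sketch of two of them follows; the function names are hypothetical and the parameters are assumed to have been estimated already for a given class.

```python
import math


def bernoulli_log_likelihood(x, p):
    """log P(x | class) for a binary feature with P(x = 1 | class) = p."""
    return math.log(p) if x else math.log(1.0 - p)


def gaussian_log_likelihood(x, mean, var):
    """log N(x; mean, var) for a continuous feature within one class."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


print(bernoulli_log_likelihood(1, 0.2))
print(gaussian_log_likelihood(0.0, 0.0, 1.0))
```

Either function can supply the per-feature terms that are summed with the log prior to score a class.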

Related

Bayes' Theorem - The foundation
Logistic Regression - Discriminative vs generative comparison
Distributions - Gaussian Naive Bayes