I Use This When...
I need a strong classifier with a clean geometric interpretation, especially on small or medium-sized structured datasets. SVMs are useful when the boundary should be robust, not just merely correct on the training data.
History
Vapnik & Cortes (1992). Based on statistical learning theory. Dominated ML before deep learning. Mathematically one of the most elegant classifiers.
Why It Exists
The "why" chain is:
- A perceptron can find a separating boundary if one exists.
- But many separating boundaries may exist.
- Some boundaries pass dangerously close to the data.
- We want the boundary with the widest safety margin.
SVM exists because "any separator" is weaker than "the separator with the best margin."
How It Works
Visual Intuition
Imagine two classes separated by many possible lines. The SVM does not choose the first line that works. It chooses the line that leaves the widest empty band between the classes.
Only a few training points actually touch that band. Those points are the support vectors. They are the examples that define the boundary.
The timeline node for this era is already wired here:
Step by Step
- Represent the boundary as
w . x + b = 0 - Require positive examples to be on one side and negative examples on the other
- Among all valid separators, choose the one with maximum margin
- Solve the resulting constrained optimization problem
- For nonlinear cases, replace dot products with a kernel
The key shift is from "fit a classifier" to "solve a geometry problem under constraints."
Code
# Concept sketch:
# model = SVC(kernel="linear", C=1.0)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
The Math Inside
Hard-margin SVM:
maximize the margin
margin = 2 / ||w||
subject to
y_i (w . x_i + b) >= 1
w: normal vector to the boundaryb: biasy_i in {-1, +1}: class labels- the constraints force every point to be on the correct side of the margin
Equivalent optimization form:
min (1/2) ||w||^2
subject to the same constraints.
Why minimize ||w||^2? Because a smaller ||w|| means a larger margin.
Kernel trick:
K(x, z) = phi(x) . phi(z)
This lets SVM behave as if it mapped points into a higher-dimensional feature space without explicitly constructing that space.
Math Prerequisites
- Dot Product - why hyperplanes and kernels are written with
w . x - Lagrange Multipliers - how the constrained problem is solved
- Constrained Optimization - margin constraints and dual form
- Convex Optimization - why SVM has a clean global optimum
Related
- Logistic Regression — Probabilistic alternative
- Lagrange Multipliers — How SVM is solved
- Dot Product — Why kernels work
- Perceptron — SVM's simpler ancestor