I Use This When...
My model fits the training set too well and fails to generalize. Regularization is what I reach for when I need to control model complexity, reduce overfitting, or make parameter estimates more stable.
History
L2 (Ridge): Hoerl and Kennard, 1970. L1 (Lasso): Tibshirani, 1996. Dropout: Hinton et al., 2012. Other techniques with a regularizing effect: early stopping, data augmentation, batch normalization.
Why It Exists
Models with many parameters can memorize training data. Regularization penalizes complexity, so each parameter has to justify itself by reducing the data loss more than it adds to the penalty.
How It Works
Visual Intuition
Imagine two curves on noisy data:
- one bends wildly to hit every training point
- the other ignores tiny fluctuations and keeps the larger pattern
Regularization is the pressure that prefers the second curve unless the extra complexity truly pays for itself.
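The two curves can be reproduced with a small experiment. This is a minimal sketch, assuming synthetic sine data, a degree-9 polynomial, and an L2 penalty as the regularizer; the data, the degree, the penalty strength, and the helper name `fit_polynomial` are all illustrative choices, not something the notes above prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy data (assumption): a smooth pattern plus small fluctuations.
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.shape)

# Degree-9 polynomial features give the model enough freedom to bend wildly.
degree = 9
X = np.vander(x, degree + 1, increasing=True)

def fit_polynomial(X, y, lam):
    """Least squares with an L2 penalty lam * ||w||^2 (lam = 0 means no penalty)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def train_mse(w):
    return float(np.mean((X @ w - y) ** 2))

w_wild = fit_polynomial(X, y, lam=0.0)      # chases every training point
w_smooth = fit_polynomial(X, y, lam=1e-2)   # keeps the larger pattern

print("coefficient norm, no penalty:", np.linalg.norm(w_wild))
print("coefficient norm, L2 penalty:", np.linalg.norm(w_smooth))
print("training MSE, no penalty:", train_mse(w_wild))
print("training MSE, L2 penalty:", train_mse(w_smooth))
```

The penalized fit gives up a little training error in exchange for smaller coefficients, which is what makes the curve less wiggly.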
Step by Step
- Start with a training loss
- Add a penalty or structural constraint
- Train the model with that modified objective
- Tune the regularization strength on validation data
- Keep the strongest setting whose validation error is still close to the best one observed
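The steps above can be sketched end to end with ridge regression. This is a rough illustration under stated assumptions: the data is synthetic, the grid of strengths is arbitrary, and the 5% tolerance used to pick the strongest acceptable setting is a hypothetical rule of thumb. The names `fit_ridge`, `mse`, and `tolerance` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical regression data: only the first 3 of 20 features carry signal.
n_samples, n_features = 120, 20
X = rng.standard_normal((n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.5 * rng.standard_normal(n_samples)

# Hold out validation data for tuning the strength.
X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

def fit_ridge(X, y, lam):
    """Squared-error loss plus lam * ||w||^2, minimized in closed form."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Evaluate a grid of regularization strengths on the validation split.
lambdas = np.logspace(-3, 3, 13)
val_errors = np.array(
    [mse(X_val, y_val, fit_ridge(X_train, y_train, lam)) for lam in lambdas]
)

# Keep the strongest setting whose validation error stays within a
# tolerance of the best one (the 5% tolerance is an assumption).
tolerance = 0.05
best_error = val_errors.min()
acceptable = lambdas[val_errors <= best_error * (1.0 + tolerance)]
chosen_lambda = acceptable.max()

print("best validation MSE:", best_error)
print("chosen lambda:", chosen_lambda)
```

Picking the largest acceptable strength rather than the exact minimizer is a design choice that favors the simpler model when validation errors are nearly tied.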
Code
# data_loss: unregularized training loss; lambda_l2: penalty strength; weights: parameter array
loss = data_loss + lambda_l2 * (weights**2).sum()
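For context, here is one way that line could sit inside a training loop. It is a sketch only, assuming a linear model, mean squared error as the data loss, and plain gradient descent; the function name `l2_regularized_loss`, the synthetic data, and the hyperparameters are illustrative assumptions.

```python
import numpy as np

def l2_regularized_loss(weights, X, y, lambda_l2):
    """Return (loss, gradient) for mean squared error plus an L2 penalty."""
    residual = X @ weights - y
    data_loss = np.mean(residual ** 2)
    loss = data_loss + lambda_l2 * (weights ** 2).sum()
    # Gradient of the data term plus gradient of the penalty (2 * lambda * w).
    grad = 2.0 * X.T @ residual / len(y) + 2.0 * lambda_l2 * weights
    return loss, grad

# Synthetic linear data (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(50)

# Plain gradient descent on the penalized objective.
weights = np.zeros(5)
learning_rate, lambda_l2 = 0.05, 0.1
for _ in range(500):
    loss, grad = l2_regularized_loss(weights, X, y, lambda_l2)
    weights -= learning_rate * grad

print("final loss:", loss)
print("weights:", weights)
```

The penalty contributes `2 * lambda_l2 * weights` to the gradient, so every step pulls each weight toward zero in proportion to its size.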
The Math Inside
Common forms:
$L_{\text{total}} = L_{\text{data}} + \lambda \sum_i w_i^2$ for L2
$L_{\text{total}} = L_{\text{data}} + \lambda \sum_i |w_i|$ for L1
Interpretation:
- L2 shrinks weights smoothly and prefers smaller parameter magnitudes
- L1 encourages sparsity and can drive some weights exactly to zero
- dropout randomly removes units during training so the network cannot rely too much on one path
- early stopping limits effective capacity by halting training before the model starts fitting noise in the training set
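The first three points can be checked numerically. The sketch below assumes the simplest possible setting for the penalties: a single weight `w` pulled toward a target `z` by a quadratic data term, where both minimizers have closed forms. The dropout function uses the common "inverted" variant that rescales surviving units, which is an assumption beyond what the notes state. All function names are hypothetical.

```python
import numpy as np

def l2_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * w**2: scales z toward zero."""
    return z / (1.0 + 2.0 * lam)

def l1_soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * |w|: small values become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def inverted_dropout(activations, drop_prob, rng, training=True):
    """Zero each unit with probability drop_prob and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask / (1.0 - drop_prob)

z = np.array([3.0, -0.4, 0.2, -2.0])
print(l2_shrink(z, lam=0.5))          # every entry is halved, none becomes exactly zero
print(l1_soft_threshold(z, lam=0.5))  # the two small entries become exactly zero

rng = np.random.default_rng(0)
hidden = np.ones(8)
print(inverted_dropout(hidden, drop_prob=0.5, rng=rng))  # entries are 0.0 or 2.0
```

L2 rescales every weight by the same factor, while L1 subtracts a fixed amount and clips at zero, which is why only L1 produces exact zeros.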
There is also a Bayesian reading:
- L2 regularization corresponds to MAP estimation with a zero-mean Gaussian prior on the weights
- L1 regularization corresponds to MAP estimation with a zero-mean Laplace prior on the weights
So regularization is not just a hack. It is one way of expressing a preference for simpler explanations.
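The correspondence can be written out in a few lines. This is a sketch under the assumption that $L_{\text{data}}$ is the negative log-likelihood and that the weights have independent zero-mean priors; the scale symbols $\sigma$ (Gaussian) and $b$ (Laplace) are introduced here for illustration.

```latex
% MAP estimation: maximize the posterior, i.e. minimize its negative log.
\hat{w}_{\text{MAP}}
  = \arg\max_w \; p(D \mid w)\, p(w)
  = \arg\min_w \; \underbrace{-\log p(D \mid w)}_{L_{\text{data}}} \; - \; \log p(w)

% Gaussian prior: p(w_i) \propto \exp\!\left(-\frac{w_i^2}{2\sigma^2}\right)
-\log p(w) = \frac{1}{2\sigma^2} \sum_i w_i^2 + \text{const}
\quad\Longrightarrow\quad
L_{\text{total}} = L_{\text{data}} + \lambda \sum_i w_i^2,
\qquad \lambda = \frac{1}{2\sigma^2}

% Laplace prior: p(w_i) \propto \exp\!\left(-\frac{|w_i|}{b}\right)
-\log p(w) = \frac{1}{b} \sum_i |w_i| + \text{const}
\quad\Longrightarrow\quad
L_{\text{total}} = L_{\text{data}} + \lambda \sum_i |w_i|,
\qquad \lambda = \frac{1}{b}
```

A narrower prior (smaller $\sigma$ or $b$) means a larger $\lambda$, that is, a stronger belief that the weights should stay near zero.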
Math Prerequisites
- Bias-Variance Tradeoff - why reducing variance helps
- MLE & MAP - regularization as a prior
- Loss Functions - where penalties are added
Related
- Bias-Variance Tradeoff - why regularization is needed
- Polynomial / Ridge / Lasso - regularization in regression
- MLP & Backprop - where dropout lives