I Use This When...
I need to define what "wrong" means for a model. The same prediction can be judged differently depending on whether I care about numeric distance, probability calibration, margin, or robustness to outliers.
History
Each loss function encodes an assumption about the problem. Choosing the right one is as important as choosing the right model.
Why It Exists
A model cannot train on vague feedback. It needs one scalar objective.
The "why" chain is:
- We need predictions.
- We need a way to score predictions.
- Different tasks define mistakes differently.
- So the model learns by minimizing a task-specific loss.
Loss functions exist because learning needs a target, not just a model.
How It Works
Visual Intuition
Imagine several candidate lines on the same scatter plot.
- A line with many small misses should score better than a line with a few huge misses.
- MSE makes huge misses hurt much more because the error is squared.
- That gives you a smooth surface gradient descent can follow.
See that directly in the linear regression demo:
-> Interactive Demo: Linear Regression
Step by Step
- Make a prediction
- Compare it to the true answer
- Convert that comparison into a number
- Average across the batch or dataset
- Use the gradient of that loss to update parameters
Code
def mse(y_true, y_pred):
errors = [(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]
return sum(errors) / len(errors)
def cross_entropy(y_true, y_prob):
eps = 1e-9
terms = [
-(yt * math.log(yp + eps) + (1 - yt) * math.log(1 - yp + eps))
for yt, yp in zip(y_true, y_prob)
]
return sum(terms) / len(terms)
The Math Inside
Common losses:
MSE = (1/n) * sum((y_i - y_hat_i)^2)
- best for many regression problems
- smooth and easy to differentiate
- large errors get punished heavily
CrossEntropy = -(1/n) * sum(y_i * log(p_i))
- common for probabilistic classification
- heavily punishes confident wrong predictions
Hinge = max(0, 1 - y * score)
- used in max-margin methods like SVM
- only cares once points violate the margin
For the first wiki bundle, the key object is MSE. It answers:
- which line fits best?
- how wrong is the current line?
- what gradient should training follow?
Math Prerequisites
- Derivatives and Gradients - how loss becomes a direction
- Gradient Descent - how minimization happens in practice
- Cross-Entropy - deeper view for classification tasks
Related
- Cross-Entropy — Information-theoretic view
- Gradient Descent — How loss is minimized
- Optimization (SGD → Adam) — Which optimizer to use