Why Math?
Most ML algorithms come down to optimizing an objective function. Understanding the math lets you:
- Debug models (why is loss not decreasing?)
- Choose the right algorithm (what assumptions does it make?)
- Read papers (the field communicates in math)
- Invent new approaches (you can't improve what you don't understand)
The Four Pillars
Probability & Statistics "How certain are we?"
→ Bayes, distributions, MLE
→ Used in: Naive Bayes, GMM, Bayesian methods
Linear Algebra "How do we represent and transform data?"
→ Vectors, matrices, eigenvalues
→ Used in: PCA, neural networks, SVD
Calculus & Optimization "How do we find the best parameters?"
→ Gradients, chain rule, gradient descent
→ Used in: nearly every model trained by minimizing a loss
Information Theory "How do we measure uncertainty?"
→ Entropy, KL divergence, cross-entropy
→ Used in: decision trees, loss functions, VAEs
Topics
Probability & Statistics
- Bayes' Theorem — Updating beliefs with evidence
- Probability Distributions — Gaussian, Bernoulli, Poisson, etc.
- MLE & MAP — Finding the parameters that best explain the data, without (MLE) or with (MAP) a prior
- Conditional Independence — The 'naive' in Naive Bayes
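
As a rough illustration of the first and third topics above, here is a minimal Python sketch of Bayes' theorem for a binary hypothesis, plus MLE and MAP estimates for a coin's bias. The function names, the test numbers (1% base rate, 95% sensitivity, 5% false-positive rate), and the Beta(2, 2) prior are all assumptions chosen for the example, not values from this guide.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) for a binary hypothesis H, using Bayes' theorem."""
    # Total probability of the evidence:
    # P(E) = P(E | H) P(H) + P(E | not H) P(not H)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


def bernoulli_mle(flips):
    """MLE of a coin's heads probability: the fraction of heads observed."""
    return sum(flips) / len(flips)


def bernoulli_map(flips, a, b):
    """MAP estimate under a Beta(a, b) prior (assumes a > 1 and b > 1)."""
    heads, n = sum(flips), len(flips)
    return (heads + a - 1) / (n + a + b - 2)


# Hypothetical diagnostic test: a positive result is weaker evidence than it looks.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161

flips = [1, 1, 1, 0]  # three heads, one tail
print(bernoulli_mle(flips))        # 0.75
print(bernoulli_map(flips, 2, 2))  # pulled toward 0.5 by the prior
```

With only four flips, the prior visibly pulls the MAP estimate away from the MLE. As more data arrives, the two estimates converge.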
Linear Algebra
- Vectors & Matrices — The language of data
- Eigenvalues & Eigenvectors — Directions a linear map only scales
- Dot Product & Projection — Similarity and shadows
- Matrix Decomposition (SVD) — Breaking matrices apart
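
To make "similarity and shadows" and the eigenvector idea concrete, here is a small pure-Python sketch. The helper names and the example vectors and matrix are assumptions made for illustration. In practice you would likely reach for a numerical library instead.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def project(u, v):
    """Projection of u onto v: the 'shadow' u casts along v."""
    scale = dot(u, v) / dot(v, v)
    return [scale * x for x in v]


def matvec(A, x):
    """Matrix-vector product, with A given as a list of rows."""
    return [dot(row, x) for row in A]


u, v = [3.0, 4.0], [1.0, 0.0]
print(cosine_similarity(u, v))  # 0.6
print(project(u, v))            # [3.0, 0.0]

# [1, 1] is an eigenvector of A with eigenvalue 3: A only scales it.
A = [[2.0, 1.0], [1.0, 2.0]]
print(matvec(A, [1.0, 1.0]))    # [3.0, 3.0]
```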
Calculus & Optimization
- Derivatives & Gradients — Slope in multiple dimensions
- Chain Rule — Why backpropagation works
- Partial Derivatives — Changing one variable at a time
- Taylor Approximation — Approximating a function near a point (XGBoost uses a second-order expansion of the loss)
- Gradient Descent — Walking downhill
- Convex Optimization — When every local minimum is a global minimum
- Lagrange Multipliers — Optimization with constraints (SVM)
- Constrained Optimization — The general framework
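
A minimal sketch of "walking downhill", assuming a simple convex function f(x, y) = (x - 3)^2 + 2(y + 1)^2 picked for this example. It also checks one analytic partial derivative against a finite-difference estimate. The learning rate and step count are arbitrary choices.

```python
def f(x, y):
    """A convex bowl with its minimum at (3, -1)."""
    return (x - 3.0) ** 2 + 2.0 * (y + 1.0) ** 2


def grad_f(x, y):
    """Analytic gradient: the partial derivatives with respect to x and y."""
    return 2.0 * (x - 3.0), 4.0 * (y + 1.0)


def numeric_partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of df/dx: nudge x, hold y fixed."""
    return (func(x + h, y) - func(x - h, y)) / (2.0 * h)


def gradient_descent(start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, starting from `start`."""
    x, y = start
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


# The analytic and numeric partials should agree closely at any point.
print(grad_f(0.0, 0.0)[0], numeric_partial_x(f, 0.0, 0.0))

x_min, y_min = gradient_descent(start=(0.0, 0.0))
print(round(x_min, 4), round(y_min, 4))  # 3.0 -1.0
```

Because this function is convex, any starting point leads to the same minimum. On a non-convex loss the result would depend on the start and the learning rate.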
Information Theory
- Entropy — Measuring average surprise
- KL Divergence — How one distribution differs from another (asymmetric, so not a true distance)
- Cross-Entropy — The most common classification loss
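
The three quantities above are closely related. The following sketch, using hypothetical discrete distributions and base-2 logarithms (so results are in bits), shows that KL divergence is cross-entropy minus entropy. It assumes `q` is nonzero wherever `p` is nonzero.

```python
import math


def entropy(p):
    """H(p) = -sum p_i log2 p_i, the average surprise in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p_i log2 q_i (assumes q_i > 0 wherever p_i > 0)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p); zero exactly when p and q match."""
    return cross_entropy(p, q) - entropy(p)


fair = [0.5, 0.5]
biased = [0.9, 0.1]

print(entropy(fair))                # 1.0 (one bit per fair coin flip)
print(kl_divergence(fair, biased))  # not equal to the line below
print(kl_divergence(biased, fair))  # KL is asymmetric
```

Minimizing cross-entropy against fixed labels is equivalent to minimizing KL divergence, since the entropy of the labels does not depend on the model.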