Why this model existed, and what broke before it
This route follows the exact chain from the plan: practical model, visual intuition, then the math underneath. Click any node to drill into a model page. The first live demo is linear regression, and the rest of the timeline is already wired so the history can grow without dead ends.
Flow
I use this when...
Why it exists, what failed before it, what changed after it.
See it move, inspect the math, then jump to the wiki.
Timeline
1805
Least Squares
Linear Regression
Turn a cloud of noisy points into a predictive line by minimizing squared error.
Prediction was not enough. The next question was whether a machine could separate classes too.
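A minimal sketch of the 1805 idea, assuming a single input feature and the closed-form solution for a straight line. The function name `fit_line` and the sample points are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y move together around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

With noisy points the same formula returns the line whose squared vertical errors sum to the smallest possible value.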
1901
PCA
Project to directions of variance
Principal Component Analysis compresses high-dimensional data by projecting it onto the directions that preserve the most variance.
Compression and visualization mattered, but many unlabeled problems still needed a way to discover groups without any targets.
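The PCA projection can be sketched for 2-D points as follows. This assumes power iteration on the covariance matrix as a stand-in for a full eigendecomposition; `first_principal_direction` is a hypothetical helper name.

```python
import math


def first_principal_direction(points, iterations=100):
    """Unit vector along the direction of greatest variance (2-D only)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    v = (1.0, 0.5)  # arbitrary starting vector (an assumption of this sketch)
    for _ in range(iterations):
        # Repeatedly applying the covariance matrix pulls v toward
        # its leading eigenvector.
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v, centered


points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
direction, centered = first_principal_direction(points)
# Each 2-D point is compressed to a single coordinate along that direction.
projections = [x * direction[0] + y * direction[1] for x, y in centered]
```

The sketch does not handle degenerate input such as identical points, where the covariance matrix is zero.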
1957
k-Means
Cluster around centroids
k-Means finds groups in unlabeled data by alternating between cluster assignment and centroid updates.
Unsupervised structure discovery mattered even before neural networks, especially when labels did not exist at all.
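The alternation in k-Means between assignment and centroid updates might look like this on 1-D data. The starting centroids and the fixed iteration count are assumptions of the sketch, not part of the original page.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(values, [0.0, 5.0])
```

No labels are used anywhere: the groups come only from distances between the values.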
1958
Perceptron
One neuron, one boundary
A single weighted sum plus threshold showed that machines could learn a classifier directly from data.
One neuron hits the XOR wall. Depth needs a way to assign credit through multiple layers.
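A rough sketch of the perceptron's weighted sum plus threshold, trained with the classic error-correction rule. The AND and XOR datasets, learning rate, and epoch count are illustrative choices.

```python
def predict(weights, bias, x):
    """Weighted sum followed by a hard threshold at zero."""
    total = weights[0] * x[0] + weights[1] * x[1] + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            # Nudge the boundary only when the prediction is wrong.
            weights[0] += lr * error * x[0]
            weights[1] += lr * error * x[1]
            bias += lr * error
    return weights, bias


and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_weights, and_bias = train_perceptron(and_samples)
xor_weights, xor_bias = train_perceptron(xor_samples)
and_ok = all(predict(and_weights, and_bias, x) == t for x, t in and_samples)
xor_ok = all(predict(xor_weights, xor_bias, x) == t for x, t in xor_samples)
```

AND is linearly separable, so one boundary fits it. XOR is not, so no setting of the two weights and bias can classify all four points, which is the wall mentioned above.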
1967
k-NN
Classify by neighborhood
k-Nearest Neighbors skips parameter fitting and predicts from the labels of nearby examples instead.
Local voting was intuitive, but later models tried to learn more global representations and decision rules from data.
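k-NN has no training step to show, only a lookup. A minimal sketch, assuming Euclidean distance and a simple majority vote (both common defaults, not something the page specifies):

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest examples."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Illustrative labeled points: two well-separated groups.
train = [
    ((1.0, 1.0), "a"), ((1.0, 2.0), "a"), ((2.0, 1.0), "a"),
    ((8.0, 8.0), "b"), ((8.0, 9.0), "b"), ((9.0, 8.0), "b"),
]
near_a = knn_predict(train, (2.0, 2.0))
near_b = knn_predict(train, (7.0, 7.0))
```

The whole training set is the model, so prediction cost grows with the number of stored examples.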
1986
Backpropagation
MLP and chain rule
Backprop made deep networks trainable by pushing error signals backward through each layer.
Neural nets came back, but many teams still needed models that were easier to explain and debug.
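To make "pushing error signals backward" concrete, here is a sketch of the chain rule on the smallest possible network: one tanh hidden unit and one linear output, both scalar. The network shape, squared-error loss, and starting weights are assumptions chosen for brevity.

```python
import math


def forward(x, w1, w2):
    hidden = math.tanh(w1 * x)  # hidden layer
    output = w2 * hidden        # output layer
    return hidden, output


def loss(x, target, w1, w2):
    _, output = forward(x, w1, w2)
    return 0.5 * (output - target) ** 2


def backward(x, target, w1, w2):
    """Gradients of the loss with respect to both weights via the chain rule."""
    hidden, output = forward(x, w1, w2)
    d_output = output - target           # dL/d(output)
    d_w2 = d_output * hidden             # output-layer weight
    d_hidden = d_output * w2             # error signal sent back one layer
    d_w1 = d_hidden * (1.0 - hidden ** 2) * x  # through tanh, then to w1
    return d_w1, d_w2


x, target, w1, w2 = 0.5, 1.0, 0.3, -0.8
grad_w1, grad_w2 = backward(x, target, w1, w2)
```

A quick sanity check is to compare these gradients against finite differences of `loss`; the two should agree closely.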
1986
Decision Trees
Split by questions
Instead of weights, learn a sequence of human-readable if/then splits that reduce uncertainty.
Greedy rules were practical, but researchers wanted stronger geometry and cleaner optimization theory.
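One decision tree split can be sketched as a search for the threshold that leaves the purest groups. This version assumes Gini impurity as the uncertainty measure and a single numeric feature; other tree learners use entropy instead.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Try midpoints between sorted values and keep the purest split."""
    best_threshold, best_score = None, float("inf")
    ordered = sorted(set(values))
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        # Weighted impurity of the two groups after the split.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]
threshold, score = best_split(values, labels)
```

The result reads as a rule a person can check: if the value is at most the threshold, answer "no", otherwise "yes". A full tree repeats this search inside each group.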
1989
Q-Learning
Learn from reward, not labels
Q-Learning estimates how valuable each action is in each state and improves behavior through trial, error, and delayed reward.
Reinforcement learning opened a second path beyond labeled supervision, while classical ML still pushed toward cleaner geometric classifiers.
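The Q-Learning update can be sketched on a tiny corridor where only the rightmost state pays a reward. To keep the run deterministic, this sketch sweeps every state-action pair instead of exploring by trial and error; the environment, `alpha`, and `gamma` values are all illustrative assumptions.

```python
def q_learning_corridor(n_states=4, sweeps=200, alpha=0.5, gamma=0.9):
    """Tabular Q-values for a corridor; actions are 0 = left, 1 = right."""
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(sweeps):
        for state in range(goal):  # the goal state is terminal
            for action in (0, 1):
                next_state = max(0, state - 1) if action == 0 else state + 1
                reward = 1.0 if next_state == goal else 0.0
                # Delayed reward enters through the best next-state value.
                future = 0.0 if next_state == goal else max(q[next_state])
                target = reward + gamma * future
                q[state][action] += alpha * (target - q[state][action])
    return q


q_table = q_learning_corridor()
# Greedy policy: in each non-terminal state, pick the higher-valued action.
greedy_policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(3)]
```

States further from the goal end up with smaller values for moving right, because the reward is discounted once per step.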
1992
SVM
Maximum margin geometry
Support Vector Machines look for the widest possible separating boundary, not just any boundary.
Classical ML sharpened geometry, but sequence problems still needed models that could carry context over time.
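The difference between "any boundary" and a wide one can be measured directly. The sketch below does not train an SVM; it only computes the geometric margin of two hand-picked separating lines, which is the quantity an SVM tries to maximize. Points and candidate lines are made up for illustration.

```python
import math


def geometric_margin(points, labels, w, b):
    """Smallest signed distance from any point to the line w.x + b = 0.

    Labels are +1 or -1; a positive result means every point is on
    the correct side of the boundary.
    """
    norm = math.hypot(w[0], w[1])
    return min(
        y * (w[0] * x1 + w[1] * x2 + b) / norm
        for (x1, x2), y in zip(points, labels)
    )


points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate boundaries that both separate the classes.
margin_diagonal = geometric_margin(points, labels, (1.0, 1.0), -2.5)
margin_horizontal = geometric_margin(points, labels, (0.0, 1.0), -1.0)
```

Both lines classify every point correctly, but the diagonal one leaves more room around the nearest points, so a margin-based learner would prefer it over the horizontal one.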
1997
LSTM
Sequence memory with gates
LSTM made recurrent networks much better at carrying information across long sequences by controlling what to keep, write, and forget.
Sequence models finally had memory, while tree ensembles were about to become the strongest default weapon for structured tabular problems.
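The keep, write, and forget controls of an LSTM can be sketched as one step of a cell with scalar state. The parameter dictionary and its key names are hypothetical; a real cell uses weight matrices and vectors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell."""
    forget = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])  # what to keep
    write = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])   # what to write
    read = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # what to expose
    candidate = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])
    c = forget * c_prev + write * candidate  # updated memory
    h = read * math.tanh(c)                  # updated output
    return h, c


# Illustrative parameters: forget gate held open, write gate held shut.
params = {key: 0.0 for key in
          ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug", "bo", "bg")}
params["bf"] = 20.0
params["bi"] = -20.0

h, c = 0.0, 0.7
for step_input in [0.3, -1.2, 0.8] * 10:
    h, c = lstm_step(step_input, h, c, params)
```

With these gate settings the cell state survives 30 steps almost unchanged, which is the long-range carrying behavior that plain recurrent networks struggled with.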
1999
Gradient Boosting
Fix the previous tree
Gradient boosting builds trees sequentially so each new learner focuses on the residual mistakes of the current ensemble.
Structured data kept rewarding smarter tree ensembles, while deep learning was about to explode in vision at scale.
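Gradient boosting's "fix the previous tree" loop can be sketched with one-split trees (stumps) fitted to residuals under squared error. The learning rate, round count, and data are illustrative, and `fit_stump` is a hypothetical helper rather than any library's API.

```python
def fit_stump(xs, residuals):
    """Best single-threshold predictor of the residuals (squared error)."""
    best = None
    ordered = sorted(set(xs))  # needs at least two distinct x values
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=50, lr=0.5):
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        # Each new stump is trained on what the ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
    return base, (lambda x: base + lr * sum(s(x) for s in stumps))


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.1]
base, model = boost(xs, ys)
baseline_error = mse(lambda x: base, xs, ys)
boosted_error = mse(model, xs, ys)
```

Comparing `boosted_error` with `baseline_error` shows how much training error the sequence of small corrections removes.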
2012
AlexNet
Deep learning wins at scale
Convolutional nets plus GPU training shattered ImageNet benchmarks and reset the field.
Deep learning was now winning perception, and the next question was whether neural nets could generate convincingly from scratch.
2014
GAN
Generate by adversarial play
GANs trained a generator and discriminator in opposition, making neural generation vivid but notoriously unstable.
Adversarial generation was powerful, but sequence modeling was about to be reorganized around attention instead of recurrence.
2017
Transformer
Attention replaces recurrence
Self-attention made long-range context easier to model and training much more parallel.
Once attention scaled, the next move was to pretrain giant language models and reuse them everywhere.
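Self-attention in the Transformer reduces to a weighted average in which each query scores every key. A minimal sketch of scaled dot-product attention on plain lists, assuming a single head and no learned projections:

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Each query returns a weighted average of all values."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Illustrative inputs: the query lines up with the first key.
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
queries = [[10.0, 0.0]]
outputs = attention(queries, keys, values)
```

Every query is scored against every key independently of position, so distant tokens are as reachable as adjacent ones and all queries can be processed in parallel.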
2018
BERT
Pretrain understanding
Bidirectional pretraining changed NLP from task-specific models to one large reusable foundation.
Understanding was powerful, but generation at scale ended up reshaping the user interface of AI.
2020
GPT
Next-token prediction at scale
Scaling a simple objective on huge text corpora produced flexible general-purpose behavior.
Language took off first. Generative image models soon followed with a very different training story.
2020
Diffusion
Generate by denoising
Learn how to reverse noise, then turn that reverse process into image generation.
Generation spread beyond text, but language models still needed preference shaping to become useful assistants.
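The denoising idea behind diffusion can be sketched without a neural network. Assuming a DDPM-style forward process with a hypothetical noise level `alpha_bar`, the code mixes a clean sample with Gaussian noise and then shows that knowing the noise is enough to undo the mix. A trained model would have to predict that noise instead of being handed it.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the clean sample with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, noise)]
    return xt, noise


def remove_noise(xt, noise, alpha_bar):
    """Invert the blend, given the noise (or a model's estimate of it)."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, noise)]


rng = random.Random(0)       # fixed seed so the sketch is repeatable
clean = [0.2, -0.5, 0.9]     # stand-in for image pixels
noisy, noise = add_noise(clean, alpha_bar=0.5, rng=rng)
recovered = remove_noise(noisy, noise, alpha_bar=0.5)
```

Generation runs this reversal in many small steps, starting from pure noise and substituting the model's noise estimate at each step.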