Why this model existed, and what broke before it

1957

k-Means

Cluster around centroids

k-Means finds groups in unlabeled data by alternating between cluster assignment and centroid updates.
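The alternation described above can be sketched in a few lines. This is a minimal illustration on 2-D points with hand-picked starting centroids; the function name `kmeans`, the toy data, and the rule for empty clusters are assumptions made for the example, not part of the original.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd-style k-means on 2-D points, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c, members in zip(centroids, clusters):
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(c)  # assumption: an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy data the two centroids settle at the means of the two visible groups. Results on real data depend heavily on the starting centroids.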

Unsupervised structure discovery mattered even before neural networks, especially when labels did not exist at all.

1958

Perceptron

One neuron, one boundary

A single weighted sum plus threshold showed that machines could learn a classifier directly from data.
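A rough sketch of the idea: a weighted sum, a threshold at zero, and a correction whenever the prediction is wrong. The helper names, the learning rate, and the logic-gate datasets are assumptions chosen for illustration.

```python
def predict(w, b, x1, x2):
    # Weighted sum plus threshold at zero.
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            error = target - predict(w, b, x1, x2)
            # The weights move only when the prediction is wrong (error is +1 or -1).
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b


and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
and_predictions = [predict(w, b, x1, x2) for (x1, x2), _ in and_gate]

# XOR is not linearly separable, so no single boundary can get all four cases right.
xor_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w_xor, b_xor = train_perceptron(xor_gate)
xor_errors = sum(predict(w_xor, b_xor, x1, x2) != t for (x1, x2), t in xor_gate)
```

The AND gate is learned within a few passes, while `xor_errors` stays above zero however long training runs.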

A single neuron cannot represent XOR, which is not linearly separable. Deeper networks needed a way to assign credit through multiple layers.

1967

k-NN

Classify by neighborhood

k-Nearest Neighbors skips parameter fitting and predicts from the labels of nearby examples instead.
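As a small sketch of "no fitting, just neighbors": keep the training examples, sort them by distance to the query, and let the closest k vote. The function name `knn_predict`, the squared Euclidean distance, and the toy data are assumptions for the example.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    # Sort stored examples by squared Euclidean distance to the query.
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    # Majority vote among the k closest labels.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
near_a = knn_predict(train, (2, 2))
near_b = knn_predict(train, (7, 7))
```

There is no training step at all. The cost moves to prediction time, where every query is compared against the stored examples.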

Local voting was intuitive, but later models tried to learn more global representations and decision rules from data.

1986

Backpropagation

MLP and chain rule

Backpropagation made multi-layer networks trainable by using the chain rule to push error signals backward through each layer.
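To make the backward pass concrete, here is a sketch for a hypothetical two-weight network, y = w2 * tanh(w1 * x), with squared error. The network shape, the names, and the numbers are assumptions. The finite-difference comparison at the end is only a sanity check on the chain-rule gradients.

```python
import math


def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # hidden activation
    y = w2 * h             # output
    return h, y


def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2


def backward(w1, w2, x, target):
    h, y = forward(w1, w2, x)
    dloss_dy = y - target                 # derivative of 0.5 * (y - target)^2
    grad_w2 = dloss_dy * h                # output layer: dy/dw2 = h
    dloss_dh = dloss_dy * w2              # error signal pushed back to the hidden unit
    grad_w1 = dloss_dh * (1 - h * h) * x  # tanh'(z) = 1 - tanh(z)^2 and dz/dw1 = x
    return grad_w1, grad_w2


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Numerical check: nudge each weight and measure how the loss changes.
eps = 1e-6
numeric_w1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
numeric_w2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
```

The same pattern extends layer by layer: each layer receives an error signal from the layer above, uses it for its own weights, and passes a transformed signal further down.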

Neural nets came back, but many teams still needed models that were easier to explain and debug.

1986

Decision Trees

Split by questions

Instead of weights, learn a sequence of human-readable if/then splits that reduce uncertainty.
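One common way to measure "reducing uncertainty" is information gain: the drop in label entropy after a split. The sketch below scores every candidate threshold and keeps the best one. Using entropy as the criterion, along with the helper names and the toy rows, is an assumption made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature, threshold):
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def best_split(rows, labels):
    # Greedy choice: try every observed value of every feature as a threshold.
    candidates = [(f, x[f]) for x in rows for f in range(len(x))]
    return max(candidates, key=lambda c: information_gain(rows, labels, c[0], c[1]))


rows = [(2.0, 7.0), (3.0, 1.0), (6.0, 8.0), (7.0, 2.0)]
labels = ["no", "no", "yes", "yes"]
feature, threshold = best_split(rows, labels)
gain = information_gain(rows, labels, feature, threshold)
```

Here the winning rule reads as "is feature 0 at most 3.0?", which separates the two classes completely. A full tree repeats this search inside each resulting branch.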

Greedy rules were practical, but researchers wanted stronger geometry and cleaner optimization theory.

1989

Q-Learning

Learn from reward, not labels

Q-Learning estimates how valuable each action is in each state and improves behavior through trial, error, and delayed reward.
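A minimal tabular sketch, assuming a made-up five-state corridor where only reaching the rightmost state pays a reward. The environment, the uniform random exploration, and the hyperparameters are all assumptions for the example.

```python
import random

N_STATES = 5      # states 0..4 on a line; state 4 is the goal
ACTIONS = (0, 1)  # 0 = step left, 1 = step right


def step(state, action):
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # explore at random; the update is off-policy
            nxt, reward, done = step(state, action)
            # Target: immediate reward plus the discounted value of the best next action.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

The reward arrives only at the end, yet the values of earlier states rise as the discounted target propagates backward, so the greedy policy learns to step right everywhere.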

Reinforcement learning opened a second path beyond labeled supervision, while classical ML still pushed toward cleaner geometric classifiers.

1992

SVM

Maximum margin geometry

Support Vector Machines look for the widest possible separating boundary, not just any boundary.
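The difference between "a boundary" and "the widest boundary" shows up when two separators that both classify the data correctly are compared by margin. This sketch only measures margins, it does not train an SVM. The data and the two candidate lines are assumptions picked by hand.

```python
import math


def geometric_margin(w, b, samples):
    """Smallest signed distance from any sample to the line w.x + b = 0."""
    norm = math.hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in samples)


samples = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 3.0), 1), ((5.0, 3.0), 1)]

# Both lines separate the classes, so both margins are positive.
margin_vertical = geometric_margin((1.0, 0.0), -3.0, samples)  # the line x = 3
margin_diagonal = geometric_margin((1.0, 1.0), -5.0, samples)  # the line x + y = 5
```

The diagonal line leaves more room on both sides. An SVM solver searches for the boundary that maximizes this quantity, and the points that sit exactly at the margin are the support vectors.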

Classical ML sharpened geometry, but sequence problems still needed models that could carry context over time.

1997

LSTM

Sequence memory with gates

LSTM made recurrent networks much better at carrying information across long sequences by controlling what to keep, write, and forget.
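The gating can be sketched with a single scalar cell, following the commonly used formulation with forget, input, and output gates. The parameter layout and the hand-set weights are assumptions for the example; a real LSTM learns them and works on vectors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new content to write
    g = math.tanh(pre("candidate"))   # the new content itself
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hand-set gates (hypothetical): forget gate wide open, input gate nearly shut.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for x in [0.3, -2.0, 5.0, 1.0] * 5:  # 20 steps of unrelated input
    h, c = lstm_step(x, h, c, params)
```

With these gate settings the cell state stays close to its starting value across all 20 steps, which is the "carry information across long sequences" behavior in miniature.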

Sequence models finally had memory, while tree ensembles were about to become the strongest default weapon for structured tabular problems.

1999

Gradient Boosting

Fix the previous tree

Gradient boosting builds trees sequentially so each new learner focuses on the residual mistakes of the current ensemble.
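For squared error, "focus on the residual mistakes" means fitting each new tree to what the current ensemble still gets wrong. The sketch below uses one-split stumps on 1-D inputs. The helper names, the learning rate, and the toy data are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump for squared error on 1-D inputs."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, rounds=20, lr=0.5):
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps, preds = [], [base] * len(xs)
    for _ in range(rounds):
        # Each new stump is trained on what the ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [x * x for x in xs]
mean_y = sum(ys) / len(ys)
mse_baseline = mse(lambda x: mean_y, xs, ys)
mse_one_round = mse(boost(xs, ys, rounds=1), xs, ys)
mse_twenty_rounds = mse(boost(xs, ys, rounds=20), xs, ys)
```

Training error falls as rounds are added. The learning rate below 1 makes each stump a partial correction, which is the usual guard against overfitting to the residuals.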

Structured data kept rewarding smarter tree ensembles, while deep learning was about to explode in vision at scale.

2012

AlexNet

Deep learning wins at scale

Convolutional nets plus GPU training shattered ImageNet benchmarks and reset the field.

Deep learning was now winning perception, and the next question was whether neural nets could generate convincingly from scratch.

2014

GAN

Generate by adversarial play

GANs trained a generator and discriminator in opposition, making neural generation vivid but notoriously unstable.

Adversarial generation was powerful, but sequence modeling was about to be reorganized around attention instead of recurrence.

2017

Transformer

Attention replaces recurrence

Self-attention made long-range context easier to model and training much more parallel.
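A bare-bones sketch of scaled dot-product self-attention. In a real Transformer the queries, keys, and values come from learned projections of the tokens; here the raw token vectors stand in for all three, which is an assumption made to keep the example short.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Every query scores every key, however far apart the positions are.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Position 0 attends to position 2 exactly as strongly as to itself, because their content matches and distance plays no role. Each query row is also independent of the others, which is what makes the computation easy to parallelize.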

Once attention scaled, the next move was to pretrain giant language models and reuse them everywhere.

2018

BERT

Pretrain understanding

Bidirectional pretraining changed NLP from task-specific models to one large reusable foundation.

Understanding was powerful, but generation at scale ended up reshaping the user interface of AI.

2020

GPT

Next-token prediction at scale

Scaling a simple objective on huge text corpora produced flexible general-purpose behavior.
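The objective itself fits in a few lines: predict the next token and measure the negative log-likelihood of what actually came next. The sketch below uses a bigram count table as a deliberately tiny stand-in for a neural language model; the toy corpus and the helper names are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts


def next_token_probs(counts, current):
    total = sum(counts[current].values())
    return {tok: n / total for tok, n in counts[current].items()}


def average_loss(counts, tokens):
    """Mean negative log-likelihood of each next token, the quantity training pushes down."""
    pairs = list(zip(tokens, tokens[1:]))
    return -sum(math.log(next_token_probs(counts, a)[b]) for a, b in pairs) / len(pairs)


corpus = "the cat sat on the mat the cat ran".split()
counts = train_bigram(corpus)
probs = next_token_probs(counts, "the")
most_likely = max(probs, key=probs.get)
loss_on_corpus = average_loss(counts, corpus)
```

A large language model replaces the count table with a network that conditions on far more context, but the training signal has the same shape.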

Language took off first. Generative image models soon followed with a very different training story.

2020

Diffusion

Generate by denoising

Learn to reverse a gradual noising process, then run that reverse process from pure noise to generate images.
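A sketch of the two halves, assuming the widely used formulation where a clean sample is mixed with Gaussian noise according to a schedule. The constant schedule, the helper names, and the three-number "image" are assumptions. In a trained model the noise estimate comes from a neural network; here the true noise stands in for it.

```python
import math
import random


def alpha_bar(t, betas):
    """Fraction of the original signal that survives after t noising steps."""
    product = 1.0
    for beta in betas[:t]:
        product *= 1.0 - beta
    return product


def add_noise(x0, t, betas, rng):
    """Jump straight to step t of the forward process: scaled signal plus Gaussian noise."""
    a = alpha_bar(t, betas)
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


def remove_noise(xt, noise_estimate, t, betas):
    """Invert the step above, given an estimate of the noise that was added."""
    a = alpha_bar(t, betas)
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(xt, noise_estimate)]


betas = [0.02] * 50  # hypothetical constant noise schedule
rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
xt, noise = add_noise(x0, 50, betas, rng)
recovered = remove_noise(xt, noise, 50, betas)  # a perfect noise estimate recovers x0
```

Training amounts to teaching a network to predict `noise` from `xt` and `t`. Generation then starts from pure noise and applies the learned estimate step by step.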

Generation spread beyond text, but language models still needed preference shaping to become useful assistants.

2022

RLHF

Align models to human preference

Reinforcement Learning from Human Feedback reshaped pretrained language models into instruction-following assistants by optimizing against learned human preferences.

Timeline

1805

Statistics

Least Squares

Linear Regression

Turn a cloud of noisy points into a predictive line by minimizing squared error.
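For a single input variable, the line that minimizes squared error has a closed form. The sketch below computes it directly; the function name `fit_line` and the sample points are assumptions for the example.

```python
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
slope, intercept = fit_line(xs, ys)
prediction_at_5 = slope * 5.0 + intercept
```

The points are noisy, but the fitted slope of about 1.97 and intercept of about 1.06 summarize them as one predictive line.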

Prediction was not enough. The next question was whether a machine could separate classes too.

1901

Statistics

PCA

Project to directions of variance

Principal Component Analysis compresses high-dimensional data by projecting it onto the directions that preserve the most variance.
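A sketch of finding the first direction of variance: center the data, build the covariance matrix, and run power iteration to find its dominant eigenvector. Power iteration is one of several ways to do this and is an assumption here, as are the helper names and the toy points.

```python
import math


def first_principal_component(points, iterations=100):
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    # Power iteration: repeated multiplication pulls v toward the top eigenvector.
    v = [1.0] * dim
    for _ in range(iterations):
        v = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return means, v


def project(points, means, v):
    """Compress each point to a single coordinate along direction v."""
    return [sum((p[j] - means[j]) * v[j] for j in range(len(v))) for p in points]


points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
means, direction = first_principal_component(points)
compressed = project(points, means, direction)
```

The points lie close to the diagonal, so the recovered direction is close to (0.71, 0.71), and one coordinate per point keeps almost all of the spread.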

Compression and visualization mattered, but many unlabeled problems still needed a way to discover groups without any targets.
