I Use This When...
I have unlabeled data and I want to group similar points together. k-Means is a
classic first tool for customer segments, embedding grouping, image color
compression, and quick exploratory clustering when I can guess a reasonable
number of clusters k.
History
Stuart Lloyd (1957, published 1982). Originally designed for pulse-code modulation at Bell Labs. One of the most widely used algorithms in all of ML.
Why It Exists
The "why" chain is:
- Many datasets have no labels at all.
- We still want to know whether points naturally form groups.
- We need a simple notion of a cluster center.
- We can assign each point to the nearest center, then recompute the centers.
k-Means exists because "repeat assign and update" is one of the simplest ways to turn unlabeled geometry into clusters.
How It Works
Visual Intuition
Imagine dots scattered across a plane and a few centroids dropped in random places.
- each point chooses the nearest centroid
- each centroid moves to the average of the points assigned to it
- the colored regions shift
- after several rounds, the centroids stabilize
The timeline node is already wired here:
Step by Step
- Choose the number of clusters
k - Initialize
kcentroids - Assign each point to the nearest centroid
- Update each centroid to the mean of its assigned points
- Repeat until assignments stop changing much
The algorithm alternates between grouping points and redefining what each group center means.
Code
centroids = init_centroids(X, k)
for _ in range(max_iters):
clusters = assign_to_nearest_centroid(X, centroids)
centroids = recompute_cluster_means(X, clusters, k)
The Math Inside
Objective:
J = sum_k sum_{x in C_k} ||x - mu_k||^2
C_k: points assigned to clusterkmu_k: centroid of clusterk||x - mu_k||^2: squared distance from a point to its centroid
So k-Means minimizes within-cluster spread.
Centroid update:
mu_k = (1 / |C_k|) * sum_{x in C_k} x
This is why the algorithm is called k-Means: each centroid is literally the mean of its assigned points.
The algorithm is not guaranteed to find the global optimum. It can land in a local optimum depending on initialization, which is why k-means++ and multiple restarts help.
Math Prerequisites
- Vectors & Matrices - how points and datasets are represented
- Dot Product - distance and geometric similarity
- PCA - often used before clustering or for visualization