k-Means Clustering

I Use This When...

I have unlabeled data and I want to group similar points together. k-Means is a classic first tool for customer segments, embedding grouping, image color compression, and quick exploratory clustering when I can guess a reasonable number of clusters k.

History

Stuart Lloyd (1957, published 1982). Originally designed for pulse-code modulation at Bell Labs. One of the most widely used algorithms in all of ML.

Why It Exists

The "why" chain is:

Many datasets have no labels at all.
We still want to know whether points naturally form groups.
We need a simple notion of a cluster center.
We can assign each point to the nearest center, then recompute the centers.

k-Means exists because "repeat assign and update" is one of the simplest ways to turn unlabeled geometry into clusters.

How It Works

Visual Intuition

Imagine dots scattered across a plane and a few centroids dropped in random places.

each point chooses the nearest centroid
each centroid moves to the average of the points assigned to it
the colored regions shift
after several rounds, the centroids stabilize

The timeline node is already wired here:

-> MLViz Node: k-Means

Step by Step

Choose the number of clusters k
Initialize k centroids
Assign each point to the nearest centroid
Update each centroid to the mean of its assigned points
Repeat until assignments stop changing much

The algorithm alternates between grouping points and redefining what each group center means.

Code

centroids = init_centroids(X, k)

for _ in range(max_iters):
    clusters = assign_to_nearest_centroid(X, centroids)
    centroids = recompute_cluster_means(X, clusters, k)

The Math Inside

Objective:

J = sum_k sum_{x in C_k} ||x - mu_k||^2

C_k: points assigned to cluster k
mu_k: centroid of cluster k
||x - mu_k||^2: squared distance from a point to its centroid

So k-Means minimizes within-cluster spread.

Centroid update:

mu_k = (1 / |C_k|) * sum_{x in C_k} x

This is why the algorithm is called k-Means: each centroid is literally the mean of its assigned points.

The algorithm is not guaranteed to find the global optimum. It can land in a local optimum depending on initialization, which is why k-means++ and multiple restarts help.

Math Prerequisites

Vectors & Matrices - how points and datasets are represented
Dot Product - distance and geometric similarity
PCA - often used before clustering or for visualization

GMM — Soft/probabilistic version
DBSCAN — Density-based alternative
PCA — Often used before k-Means