Q-Learning

I Use This When...

I need an agent to learn from reward instead of labeled answers, especially in discrete environments like grid worlds, simple games, or control problems with a manageable state-action space.

History

Christopher Watkins (1989). Proved convergence to optimal policy. TD-Gammon (Tesauro 1992) showed it could master backgammon via self-play.

Why It Exists

The "why" chain is:

In many environments, there is no correct label for each step.
The only signal is reward, often delayed.
We need a way to estimate whether taking one action now will pay off later.
Instead of learning the answer directly, we learn action values.

Q-Learning exists because reward-based decision making needs value estimates, not labeled targets.

How It Works

Visual Intuition

Imagine a grid world with a goal tile and a penalty tile.

the agent tries actions
some moves eventually lead to reward
others lead to dead ends or penalties
over time, each state-action pair gets a score

Those scores become a map of "how promising is this move from here?"

The reinforcement-learning timeline node is here:

-> MLViz Node: Q-Learning

Step by Step

Start with a Q-table initialized arbitrarily
Observe the current state s
Choose an action a, often with exploration
Receive reward r and next state s'
Update Q(s, a) toward the observed reward plus future estimate
Repeat across many episodes

The agent never sees the true optimal policy directly. It bootstraps toward it from its own estimates.

Code

Q[s, a] = Q[s, a] + alpha * (
    r + gamma * max_a_prime(Q[s_next, a_prime]) - Q[s, a]
)

The Math Inside

Bellman optimality target:

Q*(s, a) = r + gamma max_a' Q*(s', a')

Q-Learning update:

Q(s, a) <- Q(s, a) + alpha [r + gamma max_a' Q(s', a') - Q(s, a)]

alpha: learning rate
gamma: discount factor
max_a' Q(s', a'): best estimated future value

The bracketed term is the temporal-difference error. It measures how surprised the current estimate is after seeing one more step of experience.

With enough exploration and the right assumptions, tabular Q-Learning converges to the optimal action-value function.

Math Prerequisites

Gradient Descent - general update intuition, even though tabular Q-Learning is not gradient-based
DQN - what happens when the Q-table no longer scales
Policy Gradient / PPO - the direct policy alternative

DQN — Q-Learning + Neural Network
Policy Gradient / PPO — Alternative: learn policy directly
Timeline — RL in the broader AI story