I Use This When...
I need an agent to learn from reward instead of labeled answers, especially in discrete environments like grid worlds, simple games, or control problems with a manageable state-action space.
History
Christopher Watkins (1989). Proved convergence to optimal policy. TD-Gammon (Tesauro 1992) showed it could master backgammon via self-play.
Why It Exists
The "why" chain is:
- In many environments, there is no correct label for each step.
- The only signal is reward, often delayed.
- We need a way to estimate whether taking one action now will pay off later.
- Instead of learning the answer directly, we learn action values.
Q-Learning exists because reward-based decision making needs value estimates, not labeled targets.
How It Works
Visual Intuition
Imagine a grid world with a goal tile and a penalty tile.
- the agent tries actions
- some moves eventually lead to reward
- others lead to dead ends or penalties
- over time, each state-action pair gets a score
Those scores become a map of "how promising is this move from here?"
The reinforcement-learning timeline node is here:
Step by Step
- Start with a Q-table initialized arbitrarily
- Observe the current state
s - Choose an action
a, often with exploration - Receive reward
rand next states' - Update
Q(s, a)toward the observed reward plus future estimate - Repeat across many episodes
The agent never sees the true optimal policy directly. It bootstraps toward it from its own estimates.
Code
Q[s, a] = Q[s, a] + alpha * (
r + gamma * max_a_prime(Q[s_next, a_prime]) - Q[s, a]
)
The Math Inside
Bellman optimality target:
Q*(s, a) = r + gamma max_a' Q*(s', a')
Q-Learning update:
Q(s, a) <- Q(s, a) + alpha [r + gamma max_a' Q(s', a') - Q(s, a)]
alpha: learning rategamma: discount factormax_a' Q(s', a'): best estimated future value
The bracketed term is the temporal-difference error. It measures how surprised the current estimate is after seeing one more step of experience.
With enough exploration and the right assumptions, tabular Q-Learning converges to the optimal action-value function.
Math Prerequisites
- Gradient Descent - general update intuition, even though tabular Q-Learning is not gradient-based
- DQN - what happens when the Q-table no longer scales
- Policy Gradient / PPO - the direct policy alternative
Related
- DQN — Q-Learning + Neural Network
- Policy Gradient / PPO — Alternative: learn policy directly
- Timeline — RL in the broader AI story