Reinforcement Learning Lab
A DQN agent and a Connect-4 game-AI arena in one workbench.
3 search algorithms · play live in browser
The problem
Two strands of RL teaching usually live in different repos: value-based control on Gymnasium environments, and search-based game AI. This workbench puts them side by side: a DQN trainer that streams reward and loss live, and a Connect-4 arena where you can play against alpha-beta minimax, MCTS, and an AlphaZero-style self-play network from the same UI.
Who this is for
Anyone interviewing for an RL / game-AI role, and students wiring up their first agent who want to see one that already streams metrics.
Architecture
- Gymnasium training loop: DQN on CartPole / Acrobot; reward and loss pushed live over WebSocket as the run progresses.
- Train session manager: POST /api/train/start → session id; status + WS feed per session; stop endpoint.
- Connect-4 arena: the same UI plays you against alpha-beta minimax, MCTS, or an AlphaZero-style self-play network.
- Next.js front-end: live charts for reward / loss, plus the board for the arena.
Request / data flow
- 01 User selects env + hyperparams → POST /api/train/start opens a session.
- 02 Background trainer steps the env, updates the replay buffer, pushes reward / loss / epsilon to the WS each episode.
- 03 User can stop a session mid-run; the model is checkpointed.
- 04 Arena uses pure search (minimax / MCTS) or runs inference against the trained Connect-4 network.
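A minimal sketch of steps 02 and 03, assuming a Gymnasium-style `reset` / `step` interface. Everything named here is hypothetical: `ToyChain` is a five-state stand-in environment (not CartPole or Acrobot), a Q-table stands in for the DQN network so the example runs on the standard library alone, `emit` stands in for the WebSocket push, and `stop_requested` stands in for the stop endpoint. Checkpointing on stop is omitted.

```python
import random
from collections import deque


class ToyChain:
    """Hypothetical 5-state corridor with a Gymnasium-style reset/step API."""
    n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos, {}

    def step(self, action):
        self.t += 1
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.n_states - 1, self.pos + move))
        terminated = self.pos == self.n_states - 1
        truncated = self.t >= 50 and not terminated
        reward = 1.0 if terminated else -0.01
        return self.pos, reward, terminated, truncated, {}


def train(env, episodes, emit, stop_requested=lambda: False, seed=0):
    rng = random.Random(seed)
    # Assumption: a Q-table replaces the neural network of a real DQN.
    q = [[0.0] * env.n_actions for _ in range(env.n_states)]
    buffer = deque(maxlen=1000)  # replay buffer of past transitions
    epsilon, gamma, lr, batch_size = 1.0, 0.95, 0.1, 16
    for episode in range(episodes):
        if stop_requested():  # step 03: stop mid-run
            break
        obs, _ = env.reset()
        total_reward, losses, done = 0.0, [], False
        while not done:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = rng.randrange(env.n_actions)
            else:
                action = max(range(env.n_actions), key=lambda a: q[obs][a])
            next_obs, reward, terminated, truncated, _ = env.step(action)
            buffer.append((obs, action, reward, next_obs, terminated))
            obs, total_reward = next_obs, total_reward + reward
            done = terminated or truncated
            if len(buffer) >= batch_size:
                # Learn from a random minibatch of stored transitions.
                squared_error = 0.0
                for s, a, r, s2, term in rng.sample(list(buffer), batch_size):
                    target = r if term else r + gamma * max(q[s2])
                    td_error = target - q[s][a]
                    q[s][a] += lr * td_error
                    squared_error += td_error ** 2
                losses.append(squared_error / batch_size)
        epsilon = max(0.05, epsilon * 0.95)
        # Step 02: one telemetry message per episode.
        emit({
            "episode": episode,
            "reward": total_reward,
            "loss": sum(losses) / len(losses) if losses else None,
            "epsilon": epsilon,
        })
    return q
```

Calling `train(ToyChain(), 30, print)` prints one metrics dictionary per episode; passing a `stop_requested` callable that starts returning `True` ends the run at the next episode boundary.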
Key decisions
WebSocket for training telemetry, not polling.
Why: Watching loss climb out of a flat region is the whole pedagogical point; polling buries the dynamics.
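The push model can be sketched in-process with `asyncio`, assuming one queue per connected client. `TelemetryHub`, `trainer`, and `websocket_handler` are hypothetical names; `send` stands in for whatever the real WebSocket connection exposes, and the loss values are placeholders rather than real training output.

```python
import asyncio


class TelemetryHub:
    """Hypothetical fan-out: every subscriber gets its own queue."""

    def __init__(self):
        self._subscribers = set()

    def subscribe(self):
        queue = asyncio.Queue()
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue):
        self._subscribers.discard(queue)

    def publish(self, message):
        # Push to every subscriber as soon as the metric exists.
        for queue in self._subscribers:
            queue.put_nowait(message)


async def trainer(hub, episodes):
    for episode in range(episodes):
        await asyncio.sleep(0)  # stand-in for one episode of training
        hub.publish({"episode": episode, "loss": 1.0 / (episode + 1)})
    hub.publish(None)  # sentinel: the run has finished


async def websocket_handler(hub, send):
    queue = hub.subscribe()
    try:
        # Forward each message the moment it arrives; no polling interval.
        while (message := await queue.get()) is not None:
            await send(message)
    finally:
        hub.unsubscribe(queue)


async def demo():
    hub, received = TelemetryHub(), []

    async def send(message):  # stand-in for the socket's send call
        received.append(message)

    handler = asyncio.create_task(websocket_handler(hub, send))
    await asyncio.sleep(0)  # let the handler subscribe before training starts
    await trainer(hub, 5)
    await handler
    return received
```

Running `asyncio.run(demo())` returns the five per-episode messages in order. With polling, the client would instead see only whichever value happened to be current at each poll.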
Three Connect-4 agents in one arena.
Why: You can feel the difference between brute-force search, sample-based search, and a trained policy by playing the same board against each.
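For the search end of that comparison, here is a minimal depth-limited alpha-beta sketch in negamax form. The board encoding is an assumption (a 6x7 list of lists, `0` for empty, `1` and `-1` for the two players), as is the evaluation: non-terminal positions at the depth limit score 0, so this is not the arena's actual agent.

```python
ROWS, COLS = 6, 7


def legal_moves(board):
    # Centre-first ordering tends to produce earlier cutoffs.
    return [c for c in (3, 2, 4, 1, 5, 0, 6) if board[0][c] == 0]


def drop(board, col, player):
    """Return a new board with the player's disc dropped into col."""
    new = [row[:] for row in board]
    for r in range(ROWS - 1, -1, -1):
        if new[r][col] == 0:
            new[r][col] = player
            return new
    raise ValueError("column full")


def winner(board):
    """Return 1 or -1 if that player has four in a row, else 0."""
    for r in range(ROWS):
        for c in range(COLS):
            p = board[r][c]
            if p == 0:
                continue
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == p
                       for rr, cc in cells):
                    return p
    return 0


def alphabeta(board, depth, alpha, beta, player):
    """Score the position for `player`, who is to move; return (score, move)."""
    w = winner(board)
    if w != 0:
        # Adding the remaining depth makes quicker wins score higher.
        return (1 if w == player else -1) * (1000 + depth), None
    moves = legal_moves(board)
    if depth == 0 or not moves:
        return 0, None  # assumption: no heuristic at the depth limit
    best_score, best_move = -float("inf"), moves[0]
    for col in moves:
        child = drop(board, col, player)
        score, _ = alphabeta(child, depth - 1, -beta, -alpha, -player)
        score = -score  # the opponent's gain is our loss
        if score > best_score:
            best_score, best_move = score, col
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # the opponent already has a better option elsewhere
    return best_score, best_move
```

A call such as `alphabeta(board, 4, -float("inf"), float("inf"), 1)` returns the score and column for player `1`. MCTS swaps this exhaustive depth-limited tree for sampled playouts, and a trained network swaps it for learned evaluation, which is the contrast the arena lets you feel.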
Sessions instead of one global training run.
Why: Lets several runs proceed in parallel with different hyperparameters without contaminating each other.
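One way to get that isolation, sketched with threads and the standard library only. `SessionManager`, its method names, and `fake_run` are hypothetical and do not mirror the project's real endpoints; the point is that each session owns its hyperparameters, metrics list, and stop flag.

```python
import threading
import uuid


class SessionManager:
    """Hypothetical registry: one background thread and one state dict per run."""

    def __init__(self):
        self._sessions = {}
        self._lock = threading.Lock()

    def start(self, run, **hyperparams):
        session_id = uuid.uuid4().hex
        stop_event = threading.Event()
        state = {"status": "running", "metrics": [], "stop": stop_event}

        def target():
            # Each run only ever sees its own metrics list and stop flag.
            run(hyperparams, state["metrics"].append, stop_event.is_set)
            state["status"] = "stopped" if stop_event.is_set() else "finished"

        state["thread"] = threading.Thread(target=target, daemon=True)
        with self._lock:
            self._sessions[session_id] = state
        state["thread"].start()
        return session_id

    def status(self, session_id):
        state = self._sessions[session_id]
        return {"status": state["status"], "episodes": len(state["metrics"])}

    def stop(self, session_id):
        state = self._sessions[session_id]
        state["stop"].set()
        state["thread"].join()

    def wait(self, session_id):
        self._sessions[session_id]["thread"].join()


def fake_run(hyperparams, emit, stop_requested):
    """Stand-in trainer: emits one record per 'episode' until told to stop."""
    for episode in range(hyperparams["episodes"]):
        if stop_requested():
            break
        emit({"episode": episode, "lr": hyperparams["lr"]})
```

Starting two sessions with `manager.start(fake_run, episodes=3, lr=0.1)` and `manager.start(fake_run, episodes=5, lr=0.01)` yields two distinct ids whose statuses and metrics can be queried independently.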
Stack
If I rebuilt it
- Add PPO alongside DQN so the comparison covers on-policy vs off-policy.
- Persist past sessions to disk so a recruiter can browse a run they didn't start themselves.