A simple reinforcement learning example using tabular Q-learning to teach an agent to navigate a 5×5 grid from the top-left corner to the bottom-right corner, avoiding obstacles along the way.
train.mp4
The agent (🚘) starts at cell 0 and must reach the goal (🍺) at cell 24. Two obstacles (☠️) block cells 8 and 12. The agent learns by trial and error — exploring randomly at first, then gradually exploiting what it has learned.
Rewards:

- +10.0 for reaching the goal
- -1.0 for hitting a wall or obstacle
- -0.01 per step, to encourage finding short paths
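The reward scheme above can be sketched as a small function. The constants and names here (`GOAL`, `OBSTACLES`) are illustrative; the actual values live in `gridworld.py`:

```python
# Hypothetical reward function matching the README's reward scheme.
GOAL = 24          # bottom-right cell of the 5x5 grid
OBSTACLES = {8, 12}

def reward(next_state, bumped_wall):
    if next_state == GOAL:
        return 10.0                      # reached the goal
    if bumped_wall or next_state in OBSTACLES:
        return -1.0                      # hit a wall or obstacle
    return -0.01                         # per-step cost
```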
Each cell displays its current Q-value — the agent's learned estimate of how much future reward it can expect from that position. Higher Q-values (like 9.354 near the goal) indicate the agent has learned those cells are close to the reward.
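Those estimates come from the standard tabular Q-learning update rule. A minimal sketch, assuming the default hyperparameters from the table below (the `update` function and variable names are illustrative, not taken from `gridworld.py`):

```python
import numpy as np

n_states, n_actions = 25, 4
Q = np.zeros((n_states, n_actions))  # one row per cell, one column per action

def update(Q, s, a, r, s_next, lr=0.1, gamma=0.9):
    # Move Q(s, a) a fraction lr toward the bootstrapped target
    # r + gamma * max_a' Q(s_next, a')
    target = r + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])

# One hypothetical transition: stepping right out of the start cell
update(Q, s=0, a=3, r=-0.01, s_next=1)
```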
You need Python 3.9+ with numpy and jupyter. Choose one of the options below.
**Option A: conda**

```bash
conda create -n gridworld python=3.11 numpy jupyter -y
conda activate gridworld
```

**Option B: uv**

```bash
uv venv --python 3.11
source .venv/bin/activate
uv pip install numpy jupyter
```

Launch Jupyter:

```bash
jupyter notebook gridworld.ipynb
```

The notebook has four cells. Run them in order.
```python
%load_ext autoreload
%autoreload 2
```

This makes Jupyter automatically pick up any changes you make to `gridworld.py` without restarting the kernel.
```python
from gridworld import GridWorld, QLearner, train, test_policy
```

Imports the environment (`GridWorld`), the agent (`QLearner`), and the two main functions.
```python
env = GridWorld(size=5)
agent = QLearner(
    n_states=env.n_states,
    n_actions=env.n_actions,
    learning_rate=0.1,
    discount=0.9,
    epsilon=1.0,
)
train(env, agent, episodes=50, max_steps=100, render=True)
```

This creates the 5×5 grid and a Q-learning agent, then trains for 50 episodes. With `render=True` the grid animates live in the notebook — you'll see the agent stumbling around at first, then gradually finding shorter paths as the Q-values converge.
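During training the agent trades off exploration and exploitation via epsilon-greedy action selection. A minimal sketch of that strategy (the method names in `gridworld.py` may differ):

```python
import numpy as np

def choose_action(Q, state, epsilon, rng):
    # With probability epsilon, explore: pick a uniformly random action.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    # Otherwise exploit: pick the best-known action for this state.
    return int(Q[state].argmax())

rng = np.random.default_rng(0)
Q = np.zeros((25, 4))
Q[0, 3] = 1.0  # pretend "right" looks best from the start cell
print(choose_action(Q, 0, epsilon=0.0, rng=rng))  # epsilon=0 always exploits
```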
Parameters you can tweak:
| Parameter | Default | Effect |
|---|---|---|
| `size` | 5 | Grid dimensions (size × size) |
| `learning_rate` | 0.1 | How fast Q-values update toward new information |
| `discount` | 0.9 | How much the agent values future vs. immediate rewards |
| `epsilon` | 1.0 | Starting exploration rate (decays linearly to 0.01) |
| `episodes` | 50 | Number of training episodes |
| `max_steps` | 100 | Step limit per episode (prevents infinite loops) |
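The linear epsilon decay can be sketched as a one-line schedule. This assumes a decay from the starting rate to 0.01 over the training run; the exact schedule is defined in `gridworld.py`:

```python
def epsilon_at(episode, episodes=50, start=1.0, end=0.01):
    # Linearly interpolate from start (first episode) to end (last episode).
    frac = episode / max(episodes - 1, 1)
    return start - (start - end) * frac
```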
After training, the grid should look something like this — Q-values are highest near the goal and decrease with distance:
```python
path, total_reward = test_policy(env, agent, render=True)
print("Policy path: [(state, action)] = ", path)
```

Runs a single episode with exploration disabled (`epsilon=0`) so the agent always picks its best-known action. The grid shows directional arrows tracing the greedy path from start to goal:
The output also prints the path as a list of (state, action) pairs, where actions map to: 0=down, 1=up, 2=left, 3=right.
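For reference, the state and action arithmetic works out as follows, assuming states are numbered row-major (`state = row * 5 + col`) and the action mapping above. This is an illustrative sketch, not code from `gridworld.py`:

```python
SIZE = 5
MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, -1), 3: (0, 1)}  # down, up, left, right

def next_state(state, action):
    row, col = divmod(state, SIZE)
    dr, dc = MOVES[action]
    r, c = row + dr, col + dc
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        return state  # bumped a wall: stay in place
    return r * SIZE + c

print(next_state(0, 3))   # right from the start cell lands in cell 1
print(next_state(23, 3))  # right from cell 23 reaches the goal, cell 24
```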
test_policy.mp4
```
gridworld.py      # GridWorld environment, QLearner agent, train/test functions
gridworld.ipynb   # Interactive notebook to run everything
images/           # Screenshots for this README
video/            # Screen recordings of training and test runs
```

