A simple reinforcement learning example using tabular Q-learning to teach an agent to navigate a 5×5 grid from the top-left corner to the bottom-right corner, avoiding obstacles along the way.
train.mp4
The agent (🚘) starts at cell 0 and must reach the goal (🍺) at cell 24. Two obstacles (☠️) block cells 8 and 12. The agent learns by trial and error — exploring randomly at first, then gradually exploiting what it has learned.
Rewards:

- +10.0 for reaching the goal
- -1.0 for hitting a wall or obstacle
- -0.01 per step, to encourage finding short paths
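The reward scheme above can be sketched as a small function. The constants and names here (`GOAL`, `OBSTACLES`) are illustrative; the actual values live in `gridworld.py`:

```python
# Hypothetical reward function matching the README's reward scheme.
GOAL = 24          # bottom-right cell of the 5x5 grid
OBSTACLES = {8, 12}

def reward(next_state, bumped_wall):
    if next_state == GOAL:
        return 10.0                      # reached the goal
    if bumped_wall or next_state in OBSTACLES:
        return -1.0                      # hit a wall or obstacle
    return -0.01                         # per-step cost
```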
Each cell displays its current Q-value — the agent's learned estimate of how much future reward it can expect from that position. Higher Q-values (like 9.354 near the goal) indicate the agent has learned those cells are close to the reward.
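Those estimates come from the standard tabular Q-learning update rule. A minimal sketch, assuming the default hyperparameters from the table below (the `update` function and variable names are illustrative, not taken from `gridworld.py`):

```python
import numpy as np

n_states, n_actions = 25, 4
Q = np.zeros((n_states, n_actions))  # one row per cell, one column per action

def update(Q, s, a, r, s_next, lr=0.1, gamma=0.9):
    # Move Q(s, a) a fraction lr toward the bootstrapped target
    # r + gamma * max_a' Q(s_next, a')
    target = r + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])

# One hypothetical transition: stepping right out of the start cell
update(Q, s=0, a=3, r=-0.01, s_next=1)
```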
You need Python 3.9+ with numpy and jupyter. Choose one of the options below.
**Option A: conda**

```bash
conda create -n gridworld python=3.11 numpy jupyter -y
conda activate gridworld
```

**Option B: uv**

```bash
uv venv --python 3.11
source .venv/bin/activate
uv pip install numpy jupyter
```

Launch Jupyter:

```bash
jupyter notebook gridworld.ipynb
```

The notebook has four cells. Run them in order.
```python
%load_ext autoreload
%autoreload 2
```

This makes Jupyter automatically pick up any changes you make to `gridworld.py` without restarting the kernel.
```python
from gridworld import GridWorld, QLearner, train, test_policy
```

Imports the environment (`GridWorld`), the agent (`QLearner`), and the two main functions.
```python
env = GridWorld(size=5)
agent = QLearner(
    n_states=env.n_states,
    n_actions=env.n_actions,
    learning_rate=0.1,
    discount=0.9,
    epsilon=1.0,
)
train(env, agent, episodes=50, max_steps=100, render=True)
```

This creates the 5×5 grid and a Q-learning agent, then trains for 50 episodes. With `render=True` the grid animates live in the notebook — you'll see the agent stumbling around at first, then gradually finding shorter paths as the Q-values converge.
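During training the agent trades off exploration and exploitation via epsilon-greedy action selection. A minimal sketch of that strategy (the method names in `gridworld.py` may differ):

```python
import numpy as np

def choose_action(Q, state, epsilon, rng):
    # With probability epsilon, explore: pick a uniformly random action.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    # Otherwise exploit: pick the best-known action for this state.
    return int(Q[state].argmax())

rng = np.random.default_rng(0)
Q = np.zeros((25, 4))
Q[0, 3] = 1.0  # pretend "right" looks best from the start cell
print(choose_action(Q, 0, epsilon=0.0, rng=rng))  # epsilon=0 always exploits
```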
Parameters you can tweak:
| Parameter | Default | Effect |
|---|---|---|
| `size` | 5 | Grid dimensions (size × size) |
| `learning_rate` | 0.1 | How fast Q-values update toward new information |
| `discount` | 0.9 | How much the agent values future vs. immediate rewards |
| `epsilon` | 1.0 | Starting exploration rate (decays linearly to 0.01) |
| `episodes` | 50 | Number of training episodes |
| `max_steps` | 100 | Step limit per episode (prevents infinite loops) |
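The linear epsilon decay can be sketched as a one-line schedule. This assumes a decay from the starting rate to 0.01 over the training run; the exact schedule is defined in `gridworld.py`:

```python
def epsilon_at(episode, episodes=50, start=1.0, end=0.01):
    # Linearly interpolate from start (first episode) to end (last episode).
    frac = episode / max(episodes - 1, 1)
    return start - (start - end) * frac
```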
After training, the grid should look something like this — Q-values are highest near the goal and decrease with distance:
```python
path, total_reward = test_policy(env, agent, render=True)
print("Policy path: [(state, action)] = ", path)
```

Runs a single episode with exploration disabled (`epsilon=0`) so the agent always picks its best-known action. The grid shows directional arrows tracing the greedy path from start to goal:
The output also prints the path as a list of (state, action) pairs, where actions map to: 0=down, 1=up, 2=left, 3=right.
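For reference, the state and action arithmetic works out as follows, assuming states are numbered row-major (`state = row * 5 + col`) and the action mapping above. This is an illustrative sketch, not code from `gridworld.py`:

```python
SIZE = 5
MOVES = {0: (1, 0), 1: (-1, 0), 2: (0, -1), 3: (0, 1)}  # down, up, left, right

def next_state(state, action):
    row, col = divmod(state, SIZE)
    dr, dc = MOVES[action]
    r, c = row + dr, col + dc
    if not (0 <= r < SIZE and 0 <= c < SIZE):
        return state  # bumped a wall: stay in place
    return r * SIZE + c

print(next_state(0, 3))   # right from the start cell lands in cell 1
print(next_state(23, 3))  # right from cell 23 reaches the goal, cell 24
```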
test_policy.mp4
```
gridworld.py      # GridWorld environment, QLearner agent, train/test functions
gridworld.ipynb   # Interactive notebook to run everything
images/           # Screenshots for this README
video/            # Screen recordings of training and test runs
```

