A Deep Q-Network (DQN) implementation for learning optimal routing in dynamic transportation networks.
This project implements a reinforcement learning agent that learns to navigate through a transportation network with 1000+ nodes. The network simulates real-world conditions where edge weights (travel times) change dynamically, representing traffic variations.
This type of system could be applied to:
- EV Charging Scheduling: Finding optimal routes to charging stations considering wait times
- Battery Dispatch Optimization: Routing energy through grid networks with volatile prices
- General Logistics: Any scenario where conditions change and optimal paths aren't static
dqn_routing/
├── environment.py # Transportation network simulation
├── dqn_agent.py # DQN agent with experience replay
├── train.py # Training loop
├── evaluate.py # Comparison with greedy baseline
├── requirements.txt # Dependencies
└── README.md # This file
- Creates a network of 1,200 nodes using the Watts-Strogatz model, which has small-world properties similar to real road networks
- Edge weights represent travel times and change periodically to simulate traffic
- The agent receives reward = -travel_time (negative because we want to minimize time)
- A large bonus (+100) is given for reaching the goal
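The environment described above can be sketched roughly as follows. This is a minimal illustration, not the actual `environment.py` API: the function names (`make_network`, `perturb_traffic`, `step_reward`) and the weight ranges are assumptions, and it assumes `networkx` is installed.

```python
import random

import networkx as nx


def make_network(n_nodes=1200, k=6, p=0.1, seed=0):
    """Small-world graph with random travel-time weights (illustrative only)."""
    g = nx.watts_strogatz_graph(n_nodes, k, p, seed=seed)
    rng = random.Random(seed)
    for u, v in g.edges:
        g[u][v]["weight"] = rng.uniform(1.0, 10.0)  # travel time on this edge
    return g


def perturb_traffic(g, rng, scale=0.2):
    """Jitter edge weights to simulate periodic traffic changes."""
    for u, v in g.edges:
        w = g[u][v]["weight"]
        g[u][v]["weight"] = max(0.1, w * (1.0 + rng.uniform(-scale, scale)))


def step_reward(travel_time, reached_goal):
    """Reward = -travel_time, plus the +100 bonus on reaching the goal."""
    return -travel_time + (100.0 if reached_goal else 0.0)
```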
Key Components:
- Neural Network: Simple 3-layer MLP that takes state (current position + goal) and outputs Q-values for each possible action
- Experience Replay: Stores past transitions and samples random batches to break correlation between consecutive samples
- Target Network: Separate network that stabilizes training by providing consistent Q-value targets
- Epsilon-Greedy: Balances exploration (random actions) vs exploitation (best known action)
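Two of these components, experience replay and epsilon-greedy action selection, are small enough to sketch directly. This is an illustrative version, not the exact code in `dqn_agent.py`:

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=50_000):
        self.buf = deque(maxlen=capacity)  # old transitions fall off the front

    def push(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions that would otherwise destabilize training.
        return random.sample(list(self.buf), batch_size)


def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, else the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))
```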
Hyperparameters I Used:
- Learning rate: 0.0005 (lowered because the state space is large)
- Discount factor (gamma): 0.99
- Epsilon decay: 0.995 per episode
- Replay buffer: 50,000 transitions
- Batch size: 64
- Target update: Every 1000 training steps
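Collected as a config, the values above look like this. The dict keys, the epsilon floor (`epsilon_min`), and the starting epsilon of 1.0 are assumptions for illustration; only the numbers listed above come from the project:

```python
HYPERPARAMS = {
    "lr": 5e-4,               # lowered because the state space is large
    "gamma": 0.99,            # discount factor
    "epsilon_decay": 0.995,   # multiplicative decay per episode
    "epsilon_min": 0.05,      # assumed floor, not stated in the README
    "buffer_size": 50_000,    # replay capacity in transitions
    "batch_size": 64,
    "target_update": 1000,    # training steps between target-network syncs
    "episodes": 2000,
}


def decayed_epsilon(episode, start=1.0):
    """Epsilon after `episode` episodes of multiplicative decay."""
    eps = start * HYPERPARAMS["epsilon_decay"] ** episode
    return max(HYPERPARAMS["epsilon_min"], eps)
```

With these numbers, epsilon drops to roughly 0.6 after 100 episodes and hits the floor well before the 2000-episode run ends.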
- Runs for 2000 episodes by default
- Each episode: agent navigates from random start to random goal
- Tracks rewards, episode lengths, and loss
- Saves model and plots training curves
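The training loop has this overall shape. This is a shape-only sketch: `StubEnv` stands in for the real environment, and a random policy stands in for the DQN agent, so only the episode structure and epsilon annealing match the description above:

```python
import random


class StubEnv:
    """Placeholder with the reset/step interface the training loop assumes."""

    def reset(self):
        self.pos, self.goal, self.t = 0, 3, 0
        return (self.pos, self.goal)

    def step(self, action):
        self.pos = (self.pos + action) % 4
        self.t += 1
        done = self.pos == self.goal or self.t >= 20
        reward = -1.0 + (100.0 if self.pos == self.goal else 0.0)
        return (self.pos, self.goal), reward, done


def train(env, episodes=10, epsilon=1.0, decay=0.995):
    rewards = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = random.randrange(1, 4)  # stand-in for the DQN policy
            state, r, done = env.step(action)
            total += r
        rewards.append(total)
        epsilon = max(0.05, epsilon * decay)  # anneal exploration per episode
    return rewards
```

The real loop additionally stores transitions in the replay buffer, performs a gradient step per environment step, and periodically syncs the target network.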
Compares DQN against a greedy baseline that always picks the neighbor with lowest immediate travel time.
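The greedy baseline is simple enough to show in full. A sketch under the assumption that the network is held as a dict mapping each node to `{neighbor: travel_time}` (the actual `evaluate.py` may represent it differently):

```python
def greedy_route(adj, start, goal, max_steps=100):
    """Always move to the neighbor with the smallest immediate travel time.

    adj[node] maps each neighbor to the travel time of that edge.
    max_steps guards against the cycles this myopic policy can fall into.
    """
    path, pos, total = [start], start, 0.0
    for _ in range(max_steps):
        if pos == goal:
            return path, total
        nxt = min(adj[pos], key=adj[pos].get)  # cheapest immediate edge
        total += adj[pos][nxt]
        pos = nxt
        path.append(pos)
    return path, total  # gave up: greedy got stuck or ran out of steps
```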
```bash
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
python train.py
```

This will:
- Train the agent for 2000 episodes
- Print progress every 100 episodes
- Save the model to `dqn_routing_model.pt`
- Save training curves to `training_curves.png`
```bash
python evaluate.py
```

This compares DQN against the greedy baseline on 100 test episodes and reports:
- Average route completion time
- Average number of steps
- Success rate
- Percentage improvement
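The improvement metric is the relative reduction in average route time. A one-line sketch (the function name is illustrative, not necessarily what `evaluate.py` calls it):

```python
def pct_improvement(dqn_time, greedy_time):
    """Percentage reduction in average route time versus the baseline."""
    return 100.0 * (greedy_time - dqn_time) / greedy_time
```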
After training, the DQN agent should achieve roughly 25-30% lower total route time than the greedy baseline.
The greedy algorithm only looks at immediate cost (next edge weight), while DQN learns to:
- Consider future consequences (temporal credit assignment)
- Adapt to traffic patterns
- Sometimes take slightly longer immediate paths that lead to better overall routes
- State Representation: Using one-hot encoding for position works but is memory-intensive. For larger networks, embedding layers might be better.
- Action Masking: Since nodes have different numbers of neighbors, I mask invalid actions when computing Q-values. This was tricky to get right.
- Exploration: Epsilon decay matters: start fully random, then gradually trust the network more.
- Target Networks: These really do help stabilize training. Without them, the Q-values oscillate a lot.
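The action-masking idea mentioned above amounts to excluding non-neighbor actions before taking the argmax. A minimal numpy sketch (the function name is illustrative):

```python
import numpy as np


def masked_argmax(q_values, valid_actions):
    """Greedy action restricted to the node's actual neighbors.

    Invalid actions get Q = -inf so they can never be selected.
    """
    q = np.asarray(q_values, dtype=float)
    masked = np.full(q.shape, -np.inf)
    masked[valid_actions] = q[valid_actions]
    return int(np.argmax(masked))
```

The same trick applies when computing bootstrap targets: mask the next state's Q-values before the max, so the target never bootstraps from an impossible move.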
- State Space: One-hot encoding 1200 nodes means 2400-dimensional state. This is fine but doesn't scale to millions of nodes.
- No Hierarchical Learning: Real navigation uses hierarchies (highways → streets → alleys). Could add this.
- Simplified Dynamics: Traffic changes are random. Could use real traffic patterns.
- No Multi-Agent: Real roads have other agents affecting traffic.
- The replay buffer is a plain deque; prioritized replay might help
- Could try Double DQN to reduce overestimation
- Learning rate scheduling might improve convergence
- Could add dropout for regularization
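Of these, Double DQN is a particularly small change: the online network selects the next action and the target network evaluates it, which reduces the overestimation bias of vanilla DQN's max. A numpy sketch of the target computation (batched, with Q-values as 2-D arrays):

```python
import numpy as np


def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Bootstrap targets r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    best = np.argmax(next_q_online, axis=1)                 # online net selects
    bootstrap = next_q_target[np.arange(len(best)), best]   # target net evaluates
    return rewards + gamma * bootstrap * (1.0 - dones)      # no bootstrap at terminal
```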
- Mnih et al. (2015) - "Human-level control through deep reinforcement learning" (original DQN paper)
- Watts & Strogatz (1998) - "Collective dynamics of 'small-world' networks"
- Sutton & Barto (2018) - "Reinforcement Learning: An Introduction" (textbook)
This is a student project for educational purposes.