This is my reinforcement learning solver for the Alphabet 2048-style 2A1B game. The project includes:
- a Gymnasium environment for the game logic
- a PyTorch DQN agent (Double DQN + optional dueling/prioritized replay)
- train/eval scripts
- a web bridge to control the live browser game

I also added a core enhancement stack for better training quality:
- curriculum learning
- reward shaping
- decision diagnostics

Repository layout:

    2A1B_Solver/
    ├── rl2048/
    │   ├── env_alphabet2048.py
    │   ├── dqn_agent.py
    │   ├── curriculum.py
    │   ├── reward_shaper.py
    │   ├── diagnostics.py
    │   └── scripts/
    │       ├── train_dqn.py
    │       └── eval_dqn.py
    ├── scripts/
    │   ├── train_dqn.py
    │   ├── eval_dqn.py
    │   └── web_bridge_dom.py
    └── tests/

Setup:

    python -m venv .venv
    source .venv/bin/activate
    pip install -e .[test]

For browser play:

    pip install -e .[web]
    playwright install chromium

Baseline training:

    python -m rl2048.scripts.train_dqn \
        --total-steps 1000000 \
        --checkpoint-dir checkpoints \
        --eval-every 50000 \
        --eval-episodes 15 \
        --use-prioritized \
        --device cpu

Training with enhancements:

    python -m rl2048.scripts.train_dqn \
        --total-steps 500000 \
        --curriculum \
        --curriculum-type adaptive \
        --initial-density 2 \
        --max-density 12 \
        --use-reward-shaping

Standard eval:

    python -m rl2048.scripts.eval_dqn \
        --checkpoint checkpoints/best.pt \
        --episodes 50 \
        --target-letter K \
        --output eval_summary.json

Eval with diagnostics output:

    python -m rl2048.scripts.eval_dqn \
        --checkpoint checkpoints/best.pt \
        --episodes 50 \
        --save-diagnostics \
        --diagnostics-output diagnostics.json

Web bridge (live browser game):

    python -m rl2048.scripts.web_bridge_dom \
        --checkpoint checkpoints/best.pt \
        --url https://roguetex.github.io/2A1BGame/ \
        --delay 0.12 \
        --headless

Tests:

    pytest

The project is organised into four cooperating layers:

| Layer | Module | Responsibility |
|---|---|---|
| Environment | `env_alphabet2048.py` | Gymnasium-compliant game logic, step/reset, invalid-move masking |
| Agent | `dqn_agent.py` | Double DQN, optional dueling head, optional prioritised experience replay |
| Enhancement | `curriculum.py`, `reward_shaper.py` | Board-density curriculum scheduling; shaped reward signal on top of base env rewards |
| Diagnostics | `diagnostics.py` | Decision-point logging, Q-value confidence tracking, post-hoc JSON analysis |

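The Agent row names Double DQN and an optional dueling head without showing the arithmetic. The following is a rough, framework-free sketch of both ideas using plain Python lists instead of PyTorch tensors. The function names, argument names, and the `gamma` default are illustrative assumptions and are not taken from `dqn_agent.py`.

```python
from typing import List, Sequence


def double_dqn_target(
    reward: float,
    done: bool,
    next_q_online: Sequence[float],
    next_q_target: Sequence[float],
    gamma: float = 0.99,  # assumed discount factor, not the project's setting
) -> float:
    """Return the Double DQN bootstrap target for one transition."""
    if done:
        # Terminal transitions have no future value to bootstrap from.
        return reward
    # The online network chooses the next action...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...and the target network scores it, which curbs overestimation.
    return reward + gamma * next_q_target[best_action]


def dueling_q(value: float, advantages: Sequence[float]) -> List[float]:
    """Combine a state value and per-action advantages into Q-values."""
    mean_adv = sum(advantages) / len(advantages)
    # Subtracting the mean advantage keeps the V/A split identifiable.
    return [value + a - mean_adv for a in advantages]


# Toy Q-values for the four moves (up, right, down, left).
online = [0.2, 1.5, 0.7, -0.3]
target = [0.1, 0.9, 1.2, 0.0]
y = double_dqn_target(1.0, False, online, target)
q = dueling_q(1.0, [1.0, 2.0, 3.0, 2.0])
```

In this toy case the online network prefers action 1, so the target uses the target network's estimate for action 1 (0.9) rather than its own maximum (1.2).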

Data flows as follows during training:

    Env.reset()                     → Board state
    → Agent.select_action()         → Action
    → Env.step(action)              → (next_state, base_reward, done, info)
    → RewardShaper.shape_reward()   → shaped_reward
    → Agent.update()                → gradient step
    → Diagnostics.log_decision()    → decision record

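The call order above can be sketched as a single episode loop. Every class below is a hypothetical toy stand-in, not the project's real code: only the method names come from the data-flow listing, while the argument lists, the random policy, and the per-step penalty are assumptions made for illustration. The sketch keeps the four-value `step` result used in the listing; Gymnasium's own `step` returns `terminated` and `truncated` separately.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for the environment (not the real game)."""

    def __init__(self, episode_length: int = 5) -> None:
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return (0,) * 16  # flattened empty 4x4 board

    def step(self, action: int):
        self.t += 1
        next_state = (self.t,) + (0,) * 15
        base_reward = 1.0 if action == 0 else 0.0
        done = self.t >= self.episode_length
        return next_state, base_reward, done, {}


class ToyAgent:
    """Random policy; update() only counts calls in place of a gradient step."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.updates = 0

    def select_action(self, state) -> int:
        return self.rng.randrange(4)

    def update(self, state, action, reward, next_state, done) -> None:
        self.updates += 1


class ToyRewardShaper:
    def shape_reward(self, base_reward: float, info: dict) -> float:
        return base_reward - 0.01  # assumed small per-step cost


class ToyDiagnostics:
    def __init__(self) -> None:
        self.records = []

    def log_decision(self, state, action, reward) -> None:
        self.records.append({"action": action, "reward": reward})


def run_episode(env, agent, shaper, diagnostics) -> float:
    """Run one episode in the order given by the data-flow listing."""
    state = env.reset()
    done = False
    total = 0.0
    while not done:
        action = agent.select_action(state)
        next_state, base_reward, done, info = env.step(action)
        shaped = shaper.shape_reward(base_reward, info)
        agent.update(state, action, shaped, next_state, done)
        diagnostics.log_decision(state, action, shaped)
        total += shaped
        state = next_state
    return total


env, agent = ToyEnv(episode_length=5), ToyAgent(seed=0)
shaper, diagnostics = ToyRewardShaper(), ToyDiagnostics()
total = run_episode(env, agent, shaper, diagnostics)
```

Note that the agent learns from the shaped reward, not the base reward, which matches the position of `RewardShaper.shape_reward()` before `Agent.update()` in the listing.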

- State is a `4x4` board with values `0..26` (`0` is empty, `1` is A, ...).
- Actions use the project’s internal mapping (`0=up, 1=right, 2=down, 3=left`).
- Invalid moves are masked during inference and penalized in training.
- `rl2048/scripts/*` provides docs-style module entrypoints.
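
The encoding and inference-time masking described in these notes can be illustrated with a short, self-contained sketch. The helper names (`tile_to_letter`, `render`, `masked_greedy_action`) and the boolean `valid_mask` input are hypothetical; how the real environment exposes valid moves is not specified here, so the mask is simply passed in.

```python
from typing import List, Sequence

# Action mapping as stated in the notes above.
ACTION_NAMES = {0: "up", 1: "right", 2: "down", 3: "left"}


def tile_to_letter(value: int) -> str:
    """Map 0 to an empty marker and 1..26 to 'A'..'Z'."""
    if not 0 <= value <= 26:
        raise ValueError(f"tile value out of range: {value}")
    return "." if value == 0 else chr(ord("A") + value - 1)


def render(board: Sequence[Sequence[int]]) -> str:
    """Render a 4x4 board of tile values as rows of letters."""
    return "\n".join(" ".join(tile_to_letter(v) for v in row) for row in board)


def masked_greedy_action(q_values: Sequence[float], valid_mask: Sequence[bool]) -> int:
    """Pick the highest-Q action among the valid ones only."""
    valid: List[int] = [a for a, ok in enumerate(valid_mask) if ok]
    if not valid:
        raise ValueError("no valid moves available")
    return max(valid, key=lambda a: q_values[a])


board = [
    [1, 0, 0, 2],
    [0, 3, 0, 0],
    [0, 0, 11, 0],
    [0, 0, 0, 26],
]
q_values = [2.0, 0.5, 1.0, -1.0]
valid_mask = [False, True, True, False]  # assumed: up and left would not move any tile

action = masked_greedy_action(q_values, valid_mask)
print(render(board))
print(ACTION_NAMES[action])
```

Here "up" has the highest raw Q-value but is masked out, so the greedy choice falls to "down", the best of the remaining valid moves.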