A World Model with Hierarchical Memory for Robotics
Supreme replaces the MDN-RNN in the classic World Models architecture (Ha & Schmidhuber, 2018) with HOPE — a hierarchical memory system combining Self-Modifying Titans and a Continuum Memory System (CMS) from the Nested Learning paper. The agent learns a compressed world model of its environment and can train entirely inside its own hallucinated dreams.
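At a glance, the three components form a sense-predict-act loop: V compresses a frame to a latent, M updates its memory state, and C maps (latent, memory state) to an action. A toy sketch of that loop with numpy stand-ins only (the real V, M, and C live in supreme/models/; the dimensions match the config below and Reacher's 2-dim action space):

```python
import numpy as np

LATENT_DIM, HIDDEN_DIM, ACT_DIM = 32, 128, 2

def encode(frame):
    """V stand-in: ConvVAE encoder, 64x64x3 frame -> 32-dim latent."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((LATENT_DIM, frame.size)) * 0.01
    return W @ frame.ravel()

def memory_step(z, a, h):
    """M stand-in: HOPE updates its recurrent state from (latent, action)."""
    return np.tanh(0.9 * h + 0.1 * np.concatenate([z, a]).sum())  # placeholder dynamics

def act(z, h):
    """C stand-in: controller reads [z, h], as in the World Models recipe."""
    W = np.zeros((ACT_DIM, LATENT_DIM + HIDDEN_DIM))
    return np.tanh(W @ np.concatenate([z, h]))

frame = np.zeros((64, 64, 3))
h = np.zeros(HIDDEN_DIM)
z = encode(frame)        # V: frame -> latent
a = act(z, h)            # C: (latent, memory) -> action
h = memory_step(z, a, h) # M: advance the memory state
```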
- Python 3.12+
- macOS with Apple Silicon (M4 Air tested) or Linux with CUDA
- uv package manager
- ~2 GB disk space for data and checkpoints
# Clone the repository
git clone https://github.com/NeelakandanNC/supreme_v1.git
cd supreme_v1
# Install dependencies (uv reads pyproject.toml automatically)
uv sync

# Verify the installation
uv run python -c "
import mujoco; print(f'MuJoCo {mujoco.__version__}')
import torch; print(f'PyTorch {torch.__version__}, MPS: {torch.backends.mps.is_available()}')
import gymnasium; env = gymnasium.make('Reacher-v5'); print(f'Reacher-v5 obs: {env.observation_space.shape}, act: {env.action_space.shape}')
env.close(); print('All checks passed.')
"supreme_v1/
├── supreme/
│ ├── configs/
│ │ └── supreme.yaml # All hyperparameters
│ ├── models/
│ │ ├── vae.py # V model — ConvVAE
│ │ ├── hope_memory.py # M model — HOPE (Titans + CMS + MDN)
│ │ ├── controller.py # C model — MLP controller
│ │ └── supreme.py # Full agent combining V, M, C
│ ├── memory/
│ │ ├── self_modifying_titans.py # Self-Modifying Titans layer
│ │ ├── cms.py # Continuum Memory System (4 levels)
│ │ └── associative.py # Associative memory (Hebbian/Delta/Oja)
│ ├── optimizers/
│ │ ├── m3.py # Multi-scale Momentum Muon (M3)
│ │ └── dgd.py # Delta Gradient Descent
│ ├── training/
│ │ ├── collect_data.py # Random rollout data collection
│ │ ├── train_vae.py # Phase 1: train V
│ │ ├── train_memory.py # Phase 2: train M
│ │ ├── evolve_controller.py # Phase 3: CMA-ES for C
│ │ ├── dream_training.py # Phase 4: train C inside M's dream
│ │ ├── iterative.py # Phase 5: full iterative loop
│ │ ├── phase6_train.py # Phase 6: CMS inner-loop + ablations
│ │ ├── phase7_train.py # Phase 7: decoupled reward head
│ │ └── phase8_train.py # Phase 8: branched dream training
│ ├── envs/
│ │ └── robot_env.py # MuJoCo Reacher wrapper
│ ├── utils/
│ │ ├── mdn.py # Mixture Density Network utilities
│ │ ├── data.py # Dataset & dataloader helpers
│ │ └── visualization.py # Reconstruct frames, plot latents
│ └── tests/
│ ├── test_vae.py
│ ├── test_hope.py
│ ├── test_cms.py
│ └── test_supreme.py
├── checkpoints/ # Frozen model weights (gitignored)
├── data/ # HDF5 rollout data (gitignored)
├── validation/ # Per-phase validation outputs
├── Supreme.md # Full project overview & development log
├── modelcard.tex # LaTeX model card
├── pyproject.toml # Dependencies (managed by uv)
└── uv.lock # Locked dependency versions
Supreme trains in phases. Each phase builds on the previous one's frozen checkpoints. All commands use uv run to execute within the project's virtual environment.
Collect random rollouts and train a ConvVAE to compress 64x64 RGB frames into 32-dim latents.
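Phase 1's loss follows the original World Models recipe: pixel reconstruction plus a KL term floored at kl_tolerance × latent_dim ("free bits"), which is what the kl_tolerance: 0.5 setting in the config suggests. A minimal numpy sketch (an assumption about train_vae.py's internals, not a copy of it):

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar, latent_dim=32, kl_tolerance=0.5):
    """Reconstruction + free-bits KL, per the original World Models recipe."""
    recon = np.sum((x - x_recon) ** 2)                      # pixel-wise L2 reconstruction
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar)) # KL(q(z|x) || N(0, I))
    kl = max(kl, kl_tolerance * latent_dim)                 # floor: KL is never pushed below this
    return recon + kl
```

The floor keeps the optimizer from collapsing the posterior onto the prior once the KL is already "cheap enough", preserving informative latents.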
# Step 1: Collect data
uv run python -m supreme.training.collect_data \
--num_rollouts 2000 --steps_per_rollout 50
# Step 2: Train VAE
uv run python -m supreme.training.train_vae \
--h5_path data/reacher_random.h5 \
--latent_dim 32 --epochs 1
# Output: checkpoints/vae/vae_frozen.pt

Train the HOPE-based memory model to predict next-latent distributions.
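M's output head is an MDN: a mixture of diagonal Gaussians over the next latent, trained by negative log-likelihood. A minimal numpy sketch of that loss (the real utilities live in supreme/utils/mdn.py; this single-step version is illustrative only):

```python
import numpy as np

def mdn_nll(pi_logits, mu, log_sigma, z_next):
    """NLL of z_next under a mixture of K diagonal Gaussians.
    pi_logits: (K,), mu/log_sigma: (K, D), z_next: (D,)."""
    log_pi = pi_logits - np.log(np.sum(np.exp(pi_logits)))  # log-softmax over mixture weights
    sigma = np.exp(log_sigma)
    # per-component diagonal Gaussian log-density
    comp = -0.5 * np.sum(((z_next - mu) / sigma) ** 2
                         + 2 * log_sigma + np.log(2 * np.pi), axis=1)
    return -np.log(np.sum(np.exp(log_pi + comp)))           # -log sum_k pi_k N_k(z_next)
```

At dream time the same head is sampled (optionally sharpened or flattened by the config's temperature) to hallucinate the next latent instead of scoring a real one.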
uv run python -m supreme.training.train_memory \
--h5_path data/reacher_random.h5 \
--vae_path checkpoints/vae/vae_frozen.pt \
--hidden_dim 128 --epochs 20
# Output: checkpoints/memory/memory_best.pt

Train a minimal controller via CMA-ES in the real environment.
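CMA-ES treats the ~5K controller weights as a flat vector: sample a population, score each candidate by mean real-environment reward, and update the search distribution from the best scorers. The sketch below uses a simplified isotropic evolution strategy rather than full CMA-ES, which additionally adapts a covariance matrix and step size (typically via a library such as pycma):

```python
import numpy as np

def evolve(fitness, dim, pop_size=64, generations=100,
           sigma=0.1, elite_frac=0.25, seed=0):
    """Minimal (mu, lambda) evolution strategy: sample, rank, average elites.
    A stand-in for CMA-ES, which also adapts covariance and step size."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(generations):
        pop = mean + sigma * rng.standard_normal((pop_size, dim))
        scores = np.array([fitness(p) for p in pop])
        elites = pop[np.argsort(scores)[-n_elite:]]  # keep the highest-reward candidates
        mean = elites.mean(axis=0)
    return mean

# Toy check: maximize -||theta - 1||^2, whose optimum is the all-ones vector
best = evolve(lambda p: -np.sum((p - 1.0) ** 2), dim=8)
```

In Phase 3, fitness would be the mean reward of 8 real Reacher rollouts per candidate, which is why this phase is the wall-clock bottleneck (~58 min on an M4 Air).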
uv run python -m supreme.training.evolve_controller \
--vae_path checkpoints/vae/vae_frozen.pt \
--m_path checkpoints/memory/memory_best.pt \
--controller_hidden 32 \
--pop_size 64 --generations 100 --rollouts_per_eval 8
# Output: checkpoints/controller/controller_v2.pt
# Expected: ~58 min on M4 Air, eval reward ~ -9.88 (8-rollout mean)

Train the controller inside the world model's imagination using increasingly sophisticated techniques.
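The common core of Phases 4-8 is the dream rollout: C acts, M hallucinates the next latent and a reward, and no MuJoCo step is ever taken. A toy numpy sketch with stand-in dynamics and reward functions (the dream_rollout helper is illustrative, not the repo's API):

```python
import numpy as np

def dream_rollout(controller, dynamics, reward_head, z0, horizon=50):
    """Roll the controller forward inside the learned model: no env calls.
    dynamics stands in for sampling z_{t+1} from M's MDN head."""
    z, total = z0, 0.0
    for _ in range(horizon):
        a = controller(z)
        z = dynamics(z, a)           # hallucinated next latent
        total += reward_head(z, a)   # learned reward, not the env's
    return total

rng = np.random.default_rng(0)
A = 0.95 * np.eye(32)                # toy contractive latent dynamics
ret = dream_rollout(
    controller=lambda z: np.tanh(z[:2]),
    dynamics=lambda z, a: A @ z + 0.01 * rng.standard_normal(32),
    reward_head=lambda z, a: -float(np.sum(z[:2] ** 2)),
    z0=rng.standard_normal(32),
)
```

Because the gradient-free outer loop only ever sees this imagined return, every modeling error in dynamics and reward_head is exploitable, which is what Phase 8's branched rollouts and uncertainty penalty try to contain.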
# Phase 4: Dream training (train C inside M's hallucinated rollouts)
uv run python -m supreme.training.dream_training
# Phase 5: Iterative training (collect → retrain M → evolve C, 3 iterations)
uv run python -m supreme.training.iterative
# Phase 6: CMS inner-loop + reward head ablations
uv run python -m supreme.training.phase6_train
# Phase 7: Decoupled reward head + dream CMA-ES
uv run python -m supreme.training.phase7_train
# Phase 8: Multi-step fine-tuning + branched rollouts + uncertainty penalty
uv run python -m supreme.training.phase8_train
# Output: checkpoints/phase8/controller_phase8_best.pt

All hyperparameters live in supreme/configs/supreme.yaml:
# V model
vae:
  latent_dim: 32
  kl_tolerance: 0.5

# M model (HOPE)
memory:
  titans:
    hidden_dim: 128
    heads: 4
    conv_kernel: 4
  cms:
    frequencies: [1, 16, 1000000, 16000000]
    mlp_dim: 256
  mdn:
    num_gaussians: 5
    temperature: 1.0

# C model
controller:
  controller_hidden: 32   # 0 = linear, 32 = one hidden ELU layer

# CMA-ES
training:
  cma_population: 64
  cma_generations: 100
  num_rollout_per_eval: 8

# Environment
env:
  name: "Reacher-v5"
  frame_size: 64

# Hardware
hardware:
  device: "auto"          # auto-detect: mps > cuda > cpu

| Component | Minimum | Tested On |
|---|---|---|
| RAM | 8 GB | 16 GB (M4 Air) |
| GPU | Apple MPS or CUDA | Apple M4 MPS |
| Disk | 2 GB | ~1.5 GB used |
| Training time (full pipeline) | ~3 hours | ~4 hours (M4 Air) |
The entire agent (~6.7M parameters) fits comfortably on consumer hardware. CMA-ES operates only on the ~5K controller parameters while V and M handle all representational work.
| Agent | Real Reward | Training Mode |
|---|---|---|
| Random baseline | -43.17 +/- 3.44 | -- |
| Phase 3v2 (real CMA-ES) | -9.88 +/- 3.22 | Real environment |
| Phase 8 (branched dream) | -59.71 +/- 2.19 | Dream + real buffer |
The real-environment-trained controller (Phase 3v2, source: validation/controller_v2_validation.txt) is the strongest agent and the only one that beats random by a meaningful margin.
Honest disclosure (post-audit, 2026-04-09): the dream-trained Phase 8 controller (-59.71) currently performs worse than a random policy (-43.17). The 47-point improvement over Phase 7 (-107 -> -60) is real, but it does not yet bridge the dream-to-real transfer gap. Several methodology issues identified in audit_report.md (fixed-seed data collection, no train/val split, hidden-state distribution shift across M versions) likely contribute to this gap, and most are being fixed before the pipeline is re-run. Treat the current dream-training numbers as a research artifact, not a benchmark result.
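Two of those audit fixes are mechanical and worth sketching: vary the seed per rollout instead of resetting every rollout from one fixed seed, and hold out a validation split at the rollout level (not the frame level, to avoid leakage between temporally adjacent frames). A hypothetical numpy sketch; function names are illustrative, not the repo's API:

```python
import numpy as np

def split_rollouts(n_rollouts, val_frac=0.1, seed=0):
    """Shuffle rollout indices once, then split at the rollout level,
    so no trajectory contributes frames to both train and val."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rollouts)
    n_val = int(n_rollouts * val_frac)
    return idx[n_val:], idx[:n_val]  # train indices, val indices

def rollout_seeds(n_rollouts, base_seed=0):
    """One distinct seed per rollout (e.g. passed to env.reset(seed=...)),
    instead of replaying the same initial state every time."""
    return [base_seed + i for i in range(n_rollouts)]

train_idx, val_idx = split_rollouts(2000)
```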
Scientific limitation: hidden-state distribution shift across M versions
The controller is conditioned on the memory model's recurrent hidden state, so it implicitly depends on the exact M instance it was trained against. We measured this empirically:
| Controller | M used at evaluation | Real reward |
|---|---|---|
| Phase 3v2 | Phase 2 M (its training-time M) | -10.81 |
| Phase 3v2 | Phase 7 M | -73.88 |
The same set of controller weights drops by 63 points when paired with a re-trained M, even though both M models predict the same Reacher dynamics. The practical consequence is that the dream-to-real transfer story has two gaps, not one:
- Sim-to-real: M's learned dynamics differ from MuJoCo's.
- M-to-M: M's hidden-state geometry is not preserved across training runs, so even a perfect controller for "M_train" need not work on "M_eval".
Iterative training (Phase 5+) silently retrains M between iterations, which continually shifts the controller's input distribution. Until the M hidden-state distribution is stabilized (e.g., via a fixed bottleneck, distillation, or controller fine-tuning per M version), claims of the form "controller learned in dream transfers to real" are bounded above by this M-to-M gap. None of the current dream-trained numbers should be read as evidence for or against the sim-to-real question in isolation.
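As one illustration of the distillation option, a linear alignment map can be fit from the new M's hidden states to the old M's on shared rollouts, so a controller trained against the old geometry can consume aligned states from the re-trained M. This is a hypothetical mitigation sketch, not something the repo implements:

```python
import numpy as np

def fit_hidden_alignment(h_new, h_old):
    """Least-squares linear map W with h_new @ W ~ h_old, fitted on
    hidden states recorded from the same rollouts under both M versions."""
    W, *_ = np.linalg.lstsq(h_new, h_old, rcond=None)
    return W  # shape (d_new, d_old); apply as h_new @ W before the controller

# Toy check: pretend the re-trained M is a linear re-mix of the old geometry
rng = np.random.default_rng(0)
h_old = rng.standard_normal((500, 128))   # hidden states from the old M
R = rng.standard_normal((128, 128))
h_new = h_old @ R                          # same rollouts, new M's geometry
W = fit_hidden_alignment(h_new, h_old)
```

A purely linear map would of course only repair linear components of the drift; the measured 63-point drop may well require the stronger fixes (fixed bottleneck, or controller fine-tuning per M version).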
- Ha, D. & Schmidhuber, J. (2018). World Models. arXiv:1803.10122
- Behrouz et al. (2025). Nested Learning (HOPE architecture)
- Janner, M. et al. (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS
- Yu, T. et al. (2020). MOPO: Model-Based Offline Policy Optimization. NeurIPS
MIT