Supreme

A World Model with Hierarchical Memory for Robotics

Supreme replaces the MDN-RNN in the classic World Models architecture (Ha & Schmidhuber, 2018) with HOPE — a hierarchical memory system combining Self-Modifying Titans and a Continuum Memory System (CMS) from the Nested Learning paper. The agent learns a compressed model of its environment and can train its controller entirely inside its own hallucinated rollouts ("dreams").


Quick Start

Prerequisites

  • Python 3.12+
  • macOS with Apple Silicon (M4 Air tested) or Linux with CUDA
  • uv package manager
  • ~2 GB disk space for data and checkpoints

Installation

# Clone the repository
git clone https://github.com/NeelakandanNC/supreme_v1.git
cd supreme_v1

# Install dependencies (uv reads pyproject.toml automatically)
uv sync

Verify Environment

uv run python -c "
import mujoco; print(f'MuJoCo {mujoco.__version__}')
import torch; print(f'PyTorch {torch.__version__}, MPS: {torch.backends.mps.is_available()}')
import gymnasium; env = gymnasium.make('Reacher-v5'); print(f'Reacher-v5 obs: {env.observation_space.shape}, act: {env.action_space.shape}')
env.close(); print('All checks passed.')
"

Project Structure

supreme_v1/
├── supreme/
│   ├── configs/
│   │   └── supreme.yaml              # All hyperparameters
│   ├── models/
│   │   ├── vae.py                     # V model — ConvVAE
│   │   ├── hope_memory.py             # M model — HOPE (Titans + CMS + MDN)
│   │   ├── controller.py              # C model — MLP controller
│   │   └── supreme.py                 # Full agent combining V, M, C
│   ├── memory/
│   │   ├── self_modifying_titans.py   # Self-Modifying Titans layer
│   │   ├── cms.py                     # Continuum Memory System (4 levels)
│   │   └── associative.py             # Associative memory (Hebbian/Delta/Oja)
│   ├── optimizers/
│   │   ├── m3.py                      # Multi-scale Momentum Muon (M3)
│   │   └── dgd.py                     # Delta Gradient Descent
│   ├── training/
│   │   ├── collect_data.py            # Random rollout data collection
│   │   ├── train_vae.py               # Phase 1: train V
│   │   ├── train_memory.py            # Phase 2: train M
│   │   ├── evolve_controller.py       # Phase 3: CMA-ES for C
│   │   ├── dream_training.py          # Phase 4: train C inside M's dream
│   │   ├── iterative.py               # Phase 5: full iterative loop
│   │   ├── phase6_train.py            # Phase 6: CMS inner-loop + ablations
│   │   ├── phase7_train.py            # Phase 7: decoupled reward head
│   │   └── phase8_train.py            # Phase 8: branched dream training
│   ├── envs/
│   │   └── robot_env.py               # MuJoCo Reacher wrapper
│   ├── utils/
│   │   ├── mdn.py                     # Mixture Density Network utilities
│   │   ├── data.py                    # Dataset & dataloader helpers
│   │   └── visualization.py           # Reconstruct frames, plot latents
│   └── tests/
│       ├── test_vae.py
│       ├── test_hope.py
│       ├── test_cms.py
│       └── test_supreme.py
├── checkpoints/                       # Frozen model weights (gitignored)
├── data/                              # HDF5 rollout data (gitignored)
├── validation/                        # Per-phase validation outputs
├── Supreme.md                         # Full project overview & development log
├── modelcard.tex                      # LaTeX model card
├── pyproject.toml                     # Dependencies (managed by uv)
└── uv.lock                            # Locked dependency versions

Training Pipeline

Supreme trains in eight phases; each phase builds on the previous one's frozen checkpoints. All commands use uv run to execute within the project's virtual environment.

Phase 1 — Train the VAE (V model)

Collect random rollouts and train a ConvVAE to compress 64x64 RGB frames into 32-dim latents.

# Step 1: Collect data
uv run python -m supreme.training.collect_data \
    --num_rollouts 2000 --steps_per_rollout 50

# Step 2: Train VAE
uv run python -m supreme.training.train_vae \
    --h5_path data/reacher_random.h5 \
    --latent_dim 32 --epochs 1

# Output: checkpoints/vae/vae_frozen.pt
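To make the compression step concrete, here is a minimal NumPy sketch of two VAE ingredients this phase relies on: the reparameterization trick and the per-sample KL term. This is an illustrative sketch, not the repository's vae.py; only the shapes follow the latent_dim: 32 config.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I): the VAE reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    """KL(q(z|x) || N(0, I)), summed over latent dimensions (one value per sample)."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=-1)

rng = np.random.default_rng(0)
# A batch of 4 encodings of 32-dim latents (latent_dim matches the config).
mu = rng.standard_normal((4, 32)) * 0.1
logvar = rng.standard_normal((4, 32)) * 0.1
z = reparameterize(mu, logvar, rng)
kl = kl_divergence(mu, logvar)
print(z.shape, kl.shape)  # (4, 32) (4,)
```

The config's kl_tolerance: 0.5 is presumably used, as in the original World Models code, to clamp the KL term to a minimum ("free bits"), though how Supreme applies it is not shown here.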

Phase 2 — Train the World Model (M model)

Train the HOPE-based memory model to predict next-latent distributions.

uv run python -m supreme.training.train_memory \
    --h5_path data/reacher_random.h5 \
    --vae_path checkpoints/vae/vae_frozen.pt \
    --hidden_dim 128 --epochs 20

# Output: checkpoints/memory/memory_best.pt
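"Next-latent distributions" means the MDN head outputs mixture weights plus per-component means and scales, and a next latent is drawn by first picking a component, then sampling from its Gaussian. The sketch below is a hedged NumPy illustration (not the repository's mdn.py) using the config's num_gaussians: 5 and the temperature knob:

```python
import numpy as np

def sample_mdn(pi_logits, mu, log_sigma, temperature=1.0, rng=None):
    """Sample one next-latent vector from an MDN head.

    pi_logits: (K,) mixture logits; mu, log_sigma: (K, D) per-component parameters.
    Temperature > 1 flattens the mixture weights and widens the Gaussians,
    as in the World Models "dream temperature".
    """
    rng = rng or np.random.default_rng()
    logits = pi_logits / temperature
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                       # softmax over mixture components
    k = rng.choice(len(pi), p=pi)        # pick a component
    sigma = np.exp(log_sigma[k]) * np.sqrt(temperature)
    return mu[k] + sigma * rng.standard_normal(mu.shape[1])

rng = np.random.default_rng(0)
K, D = 5, 32                             # num_gaussians and latent_dim from the config
z_next = sample_mdn(rng.standard_normal(K),
                    rng.standard_normal((K, D)),
                    rng.standard_normal((K, D)) * 0.1,
                    temperature=1.0, rng=rng)
print(z_next.shape)  # (32,)
```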

Phase 3 — Train the Controller (C model)

Train a minimal controller via CMA-ES in the real environment.

uv run python -m supreme.training.evolve_controller \
    --vae_path checkpoints/vae/vae_frozen.pt \
    --m_path checkpoints/memory/memory_best.pt \
    --controller_hidden 32 \
    --pop_size 64 --generations 100 --rollouts_per_eval 8

# Output: checkpoints/controller/controller_v2.pt
# Expected: ~58 min on M4 Air, eval reward ~ -9.88 (8-rollout mean)
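CMA-ES maintains a search distribution over the controller's parameter vector and adapts a full covariance matrix from the elite samples (the pycma library is the usual implementation). The sketch below is a deliberately simplified isotropic evolution strategy that only illustrates the sample-select-update loop on a toy objective; it is not the repository's evolve_controller.py.

```python
import numpy as np

def evolve(fitness, dim, pop_size=64, generations=100, sigma=0.5,
           elite_frac=0.25, seed=0):
    """Simplified (mu, lambda)-style evolution strategy.

    Sample a population around the current mean, keep the elite fraction,
    move the mean to the elite average. CMA-ES additionally adapts a full
    covariance matrix; this sketch keeps a slowly annealed isotropic sigma.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(generations):
        pop = mean + sigma * rng.standard_normal((pop_size, dim))
        scores = np.array([fitness(x) for x in pop])
        elite = pop[np.argsort(scores)[-n_elite:]]  # higher fitness is better
        mean = elite.mean(axis=0)
        sigma *= 0.99  # slow annealing stands in for covariance adaptation
    return mean

# Toy objective: maximize -||x - 3||^2 (optimum at all-threes).
best = evolve(lambda x: -np.sum((x - 3.0) ** 2), dim=5)
print(np.round(best, 2))
```

In the real pipeline, fitness would be the mean real-environment reward over rollouts_per_eval rollouts, and dim would be the controller's ~5K parameters.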

Phases 4-8 — Dream Training

Train the controller inside the world model's imagination using increasingly sophisticated techniques.

# Phase 4: Dream training (train C inside M's hallucinated rollouts)
uv run python -m supreme.training.dream_training

# Phase 5: Iterative training (collect → retrain M → evolve C, 3 iterations)
uv run python -m supreme.training.iterative

# Phase 6: CMS inner-loop + reward head ablations
uv run python -m supreme.training.phase6_train

# Phase 7: Decoupled reward head + dream CMA-ES
uv run python -m supreme.training.phase7_train

# Phase 8: Multi-step fine-tuning + branched rollouts + uncertainty penalty
uv run python -m supreme.training.phase8_train

# Output: checkpoints/phase8/controller_phase8_best.pt
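The common core of Phases 4-8 is rolling the controller forward inside M instead of MuJoCo: encode once, then loop controller action -> world-model prediction -> predicted reward, with no environment steps. The sketch below illustrates that loop with hypothetical stand-ins; m_step, controller, and the reward computation are placeholders, not the repository's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def m_step(z, h, a, rng):
    """Stand-in world-model step: (z, h, action) -> (z_next, h_next, reward).
    The real M is the HOPE memory with an MDN head and a reward head."""
    h_next = np.tanh(h + 0.1 * np.concatenate([z[:8], a]))
    z_next = z + 0.05 * rng.standard_normal(z.shape)
    reward = -np.linalg.norm(z_next[:2])  # stand-in reward head
    return z_next, h_next, reward

def controller(params, z, h):
    """Linear controller on [z, h], as in the controller_hidden: 0 setting."""
    x = np.concatenate([z, h])
    W = params.reshape(2, x.size)        # 2 action dims, as in Reacher
    return np.tanh(W @ x)

def dream_return(params, horizon=50):
    """Roll the controller forward entirely inside the world model."""
    z, h = rng.standard_normal(32), np.zeros(10)
    total = 0.0
    for _ in range(horizon):
        a = controller(params, z, h)
        z, h, r = m_step(z, h, a, rng)
        total += r
    return total

params = rng.standard_normal(2 * 42) * 0.1
print(dream_return(params))
```

Phase 8's branched rollouts and uncertainty penalty would modify this loop (restarting branches from real buffer states and penalizing high-variance MDN predictions), but the inner imagine-and-score structure is the same.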

Configuration

All hyperparameters live in supreme/configs/supreme.yaml:

# V model
vae:
  latent_dim: 32
  kl_tolerance: 0.5

# M model (HOPE)
memory:
  titans:
    hidden_dim: 128
    heads: 4
    conv_kernel: 4
  cms:
    frequencies: [1, 16, 1000000, 16000000]
    mlp_dim: 256
  mdn:
    num_gaussians: 5
    temperature: 1.0

# C model
controller:
  controller_hidden: 32    # 0 = linear, 32 = one hidden ELU layer

# CMA-ES
training:
  cma_population: 64
  cma_generations: 100
  num_rollout_per_eval: 8

# Environment
env:
  name: "Reacher-v5"
  frame_size: 64

# Hardware
hardware:
  device: "auto"           # auto-detect: mps > cuda > cpu
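The device: "auto" setting resolves the compute device by preference order. A minimal sketch of that rule (pure Python; the real code would presumably consult torch.backends.mps.is_available() and torch.cuda.is_available()):

```python
def resolve_device(mps_available: bool, cuda_available: bool) -> str:
    """Mirror the device: "auto" rule: prefer mps, then cuda, then cpu."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"

print(resolve_device(True, False))   # mps
print(resolve_device(False, True))   # cuda
print(resolve_device(False, False))  # cpu
```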

Hardware Requirements

Component                      Minimum            Tested On
RAM                            8 GB               16 GB (M4 Air)
GPU                            Apple MPS or CUDA  Apple M4 MPS
Disk                           2 GB               ~1.5 GB used
Training time (full pipeline)  ~3 hours           ~4 hours (M4 Air)

The entire agent (~6.7M parameters) fits comfortably on consumer hardware. CMA-ES operates only on the ~5K controller parameters while V and M handle all representational work.


Results

Agent                     Real Reward      Training Mode
Random baseline           -43.17 +/- 3.44  --
Phase 3v2 (real CMA-ES)   -9.88 +/- 3.22   Real environment
Phase 8 (branched dream)  -59.71 +/- 2.19  Dream + real buffer

The real-environment-trained controller (Phase 3v2, source: validation/controller_v2_validation.txt) is the strongest agent and the only one that beats random by a meaningful margin.

Honest disclosure (post-audit, 2026-04-09): the dream-trained Phase 8 controller (-59.71) currently performs worse than a random policy (-43.17). The 47-point improvement over Phase 7 (-107 -> -60) is real, but it does not yet bridge the dream-to-real transfer gap. Several methodology issues identified in audit_report.md (fixed-seed data collection, no train/val split, hidden-state distribution shift across M versions) likely contribute to this gap, and most are being fixed before re-running the pipeline. Treat the current dream-training numbers as a research artifact, not a benchmark result.

Scientific limitation: hidden-state distribution shift across M versions

The controller is conditioned on the memory model's recurrent hidden state, so it implicitly depends on the exact M instance it was trained against. We measured this empirically:

Controller  M used at evaluation             Real reward
Phase 3v2   Phase 2 M (its training-time M)  -10.81
Phase 3v2   Phase 7 M                        -73.88

The same set of controller weights drops by 63 points when paired with a re-trained M, even though both M models predict the same Reacher dynamics. The practical consequence is that the dream-to-real transfer story has two gaps, not one:

  1. Sim-to-real: M's learned dynamics differ from MuJoCo's.
  2. M-to-M: M's hidden-state geometry is not preserved across training runs, so even a perfect controller for "M_train" need not work on "M_eval".

Iterative training (Phase 5+) silently retrains M between iterations, which continually shifts the controller's input distribution. Until the M hidden-state distribution is stabilized (e.g., via a fixed bottleneck, distillation, or controller fine-tuning per M version), claims of the form "controller learned in dream transfers to real" are bounded above by this M-to-M gap. None of the current dream-trained numbers should be read as evidence for or against the sim-to-real question in isolation.
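One way to quantify the M-to-M gap is to compare the hidden-state distributions produced by the two M versions directly. The sketch below uses a diagonal-covariance Fréchet distance on synthetic stand-in data; frechet_gaussian_distance and the sampled arrays are illustrative, not measurements from the repository:

```python
import numpy as np

def frechet_gaussian_distance(X, Y):
    """Fréchet distance between Gaussian fits of two hidden-state samples
    (diagonal-covariance approximation for simplicity). A cheap proxy for
    how far M_eval's hidden-state distribution has drifted from M_train's."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    var_x, var_y = X.var(0), Y.var(0)
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.sum(var_x + var_y - 2 * np.sqrt(var_x * var_y)))

rng = np.random.default_rng(0)
h_train = rng.standard_normal((1000, 128))           # hidden states under M_train
h_eval = 0.5 * rng.standard_normal((1000, 128)) + 1  # shifted/rescaled under M_eval
h_same = rng.standard_normal((1000, 128))            # a fresh sample from M_train
print(frechet_gaussian_distance(h_train, h_eval)
      > frechet_gaussian_distance(h_train, h_same))  # True
```

A large distance between M_train and M_eval hidden states, with a small distance between two samples from the same M, would localize the controller's performance drop to the input-distribution shift rather than to the dynamics predictions themselves.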


References

  • Ha, D. & Schmidhuber, J. (2018). World Models. arXiv:1803.10122
  • Behrouz et al. (2025). Nested Learning (HOPE architecture)
  • Janner, M. et al. (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS
  • Yu, T. et al. (2020). MOPO: Model-Based Offline Policy Optimization. NeurIPS

License

MIT

About

World Models (VMC) with the LSTM backbone replaced by HOPE hierarchical memory — Self-Modifying Titans + multi-frequency Continuum Memory — retaining the MDN output head. CMA-ES controller training in real and dreamed rollouts on MuJoCo Reacher-v5.
