A World Model with Hierarchical Memory for Robotics
Supreme replaces the MDN-RNN in the classic World Models architecture (Ha & Schmidhuber, 2018) with HOPE — a hierarchical memory system combining Self-Modifying Titans and a Continuum Memory System (CMS) from the Nested Learning paper. The agent learns a compressed world model of its environment and can train entirely inside its own hallucinated dreams.
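At a glance, the three components form a sense-predict-act loop: V compresses a frame to a latent, M updates its memory state, and C maps (latent, memory state) to an action. A toy sketch of that loop with numpy stand-ins only (the real V, M, and C live in supreme/models/; the dimensions match the config below and Reacher's 2-dim action space):

```python
import numpy as np

LATENT_DIM, HIDDEN_DIM, ACT_DIM = 32, 128, 2

def encode(frame):
    """V stand-in: ConvVAE encoder, 64x64x3 frame -> 32-dim latent."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((LATENT_DIM, frame.size)) * 0.01
    return W @ frame.ravel()

def memory_step(z, a, h):
    """M stand-in: HOPE updates its recurrent state from (latent, action)."""
    return np.tanh(0.9 * h + 0.1 * np.concatenate([z, a]).sum())  # placeholder dynamics

def act(z, h):
    """C stand-in: controller reads [z, h], as in the World Models recipe."""
    W = np.zeros((ACT_DIM, LATENT_DIM + HIDDEN_DIM))
    return np.tanh(W @ np.concatenate([z, h]))

frame = np.zeros((64, 64, 3))
h = np.zeros(HIDDEN_DIM)
z = encode(frame)        # V: frame -> latent
a = act(z, h)            # C: (latent, memory) -> action
h = memory_step(z, a, h) # M: advance the memory state
```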
- Python 3.12+
- macOS with Apple Silicon (M4 Air tested) or Linux with CUDA
- uv package manager
- ~2 GB disk space for data and checkpoints
# Clone the repository
git clone https://github.com/NeelakandanNC/supreme_v1.git
cd supreme_v1
# Install dependencies (uv reads pyproject.toml automatically)
uv sync

# Verify the installation
uv run python -c "
import mujoco; print(f'MuJoCo {mujoco.__version__}')
import torch; print(f'PyTorch {torch.__version__}, MPS: {torch.backends.mps.is_available()}')
import gymnasium; env = gymnasium.make('Reacher-v5'); print(f'Reacher-v5 obs: {env.observation_space.shape}, act: {env.action_space.shape}')
env.close(); print('All checks passed.')
"supreme_v1/
├── supreme/
│ ├── configs/
│ │ └── supreme.yaml # All hyperparameters
│ ├── models/
│ │ ├── vae.py # V model — ConvVAE
│ │ ├── hope_memory.py # M model — HOPE (Titans + CMS + MDN)
│ │ ├── controller.py # C model — MLP controller
│ │ └── supreme.py # Full agent combining V, M, C
│ ├── memory/
│ │ ├── self_modifying_titans.py # Self-Modifying Titans layer
│ │ ├── cms.py # Continuum Memory System (4 levels)
│ │ └── associative.py # Associative memory (Hebbian/Delta/Oja)
│ ├── optimizers/
│ │ ├── m3.py # Multi-scale Momentum Muon (M3)
│ │ └── dgd.py # Delta Gradient Descent
│ ├── training/
│ │ ├── collect_data.py # Random rollout data collection
│ │ ├── train_vae.py # Phase 1: train V
│ │ ├── train_memory.py # Phase 2: train M
│ │ ├── evolve_controller.py # Phase 3: CMA-ES for C
│ │ ├── dream_training.py # Phase 4: train C inside M's dream
│ │ ├── iterative.py # Phase 5: full iterative loop
│ │ ├── phase6_train.py # Phase 6: CMS inner-loop + ablations
│ │ ├── phase7_train.py # Phase 7: decoupled reward head
│ │ └── phase8_train.py # Phase 8: branched dream training
│ ├── envs/
│ │ └── robot_env.py # MuJoCo Reacher wrapper
│ ├── utils/
│ │ ├── mdn.py # Mixture Density Network utilities
│ │ ├── data.py # Dataset & dataloader helpers
│ │ └── visualization.py # Reconstruct frames, plot latents
│ └── tests/
│ ├── test_vae.py
│ ├── test_hope.py
│ ├── test_cms.py
│ └── test_supreme.py
├── checkpoints/ # Frozen model weights (gitignored)
├── data/ # HDF5 rollout data (gitignored)
├── validation/ # Per-phase validation outputs
├── Supreme.md # Full project overview & development log
├── modelcard.tex # LaTeX model card
├── pyproject.toml # Dependencies (managed by uv)
└── uv.lock # Locked dependency versions
Supreme trains in phases. Each phase builds on the previous one's frozen checkpoints. All commands use uv run to execute within the project's virtual environment.
Collect random rollouts and train a ConvVAE to compress 64x64 RGB frames into 32-dim latents.
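Phase 1's loss follows the original World Models recipe: pixel reconstruction plus a KL term floored at kl_tolerance × latent_dim ("free bits"), which is what the kl_tolerance: 0.5 setting in the config suggests. A minimal numpy sketch (an assumption about train_vae.py's internals, not a copy of it):

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar, latent_dim=32, kl_tolerance=0.5):
    """Reconstruction + free-bits KL, per the original World Models recipe."""
    recon = np.sum((x - x_recon) ** 2)                      # pixel-wise L2 reconstruction
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar)) # KL(q(z|x) || N(0, I))
    kl = max(kl, kl_tolerance * latent_dim)                 # floor: KL is never pushed below this
    return recon + kl
```

The floor keeps the optimizer from collapsing the posterior onto the prior once the KL is already "cheap enough", preserving informative latents.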
# Step 1: Collect data
uv run python -m supreme.training.collect_data \
--num_rollouts 2000 --steps_per_rollout 50
# Step 2: Train VAE
uv run python -m supreme.training.train_vae \
--h5_path data/reacher_random.h5 \
--latent_dim 32 --epochs 1
# Output: checkpoints/vae/vae_frozen.pt

Train the HOPE-based memory model to predict next-latent distributions.
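M's output head is an MDN: a mixture of diagonal Gaussians over the next latent, trained by negative log-likelihood. A minimal numpy sketch of that loss (the real utilities live in supreme/utils/mdn.py; this single-step version is illustrative only):

```python
import numpy as np

def mdn_nll(pi_logits, mu, log_sigma, z_next):
    """NLL of z_next under a mixture of K diagonal Gaussians.
    pi_logits: (K,), mu/log_sigma: (K, D), z_next: (D,)."""
    log_pi = pi_logits - np.log(np.sum(np.exp(pi_logits)))  # log-softmax over mixture weights
    sigma = np.exp(log_sigma)
    # per-component diagonal Gaussian log-density
    comp = -0.5 * np.sum(((z_next - mu) / sigma) ** 2
                         + 2 * log_sigma + np.log(2 * np.pi), axis=1)
    return -np.log(np.sum(np.exp(log_pi + comp)))           # -log sum_k pi_k N_k(z_next)
```

At dream time the same head is sampled (optionally sharpened or flattened by the config's temperature) to hallucinate the next latent instead of scoring a real one.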
uv run python -m supreme.training.train_memory \
--h5_path data/reacher_random.h5 \
--vae_path checkpoints/vae/vae_frozen.pt \
--hidden_dim 128 --epochs 20
# Output: checkpoints/memory/memory_best.pt

Train a minimal controller via CMA-ES in the real environment.
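CMA-ES treats the ~5K controller weights as a flat vector: sample a population, score each candidate by mean real-environment reward, and update the search distribution from the best scorers. The sketch below uses a simplified isotropic evolution strategy rather than full CMA-ES, which additionally adapts a covariance matrix and step size (typically via a library such as pycma):

```python
import numpy as np

def evolve(fitness, dim, pop_size=64, generations=100,
           sigma=0.1, elite_frac=0.25, seed=0):
    """Minimal (mu, lambda) evolution strategy: sample, rank, average elites.
    A stand-in for CMA-ES, which also adapts covariance and step size."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(generations):
        pop = mean + sigma * rng.standard_normal((pop_size, dim))
        scores = np.array([fitness(p) for p in pop])
        elites = pop[np.argsort(scores)[-n_elite:]]  # keep the highest-reward candidates
        mean = elites.mean(axis=0)
    return mean

# Toy check: maximize -||theta - 1||^2, whose optimum is the all-ones vector
best = evolve(lambda p: -np.sum((p - 1.0) ** 2), dim=8)
```

In Phase 3, fitness would be the mean reward of 8 real Reacher rollouts per candidate, which is why this phase is the wall-clock bottleneck (~58 min on an M4 Air).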
uv run python -m supreme.training.evolve_controller \
--vae_path checkpoints/vae/vae_frozen.pt \
--m_path checkpoints/memory/memory_best.pt \
--controller_hidden 32 \
--pop_size 64 --generations 100 --rollouts_per_eval 8
# Output: checkpoints/controller/controller_v2.pt
# Expected: ~58 min on M4 Air, eval reward ~ -9.88 (8-rollout mean)

Train the controller inside the world model's imagination using increasingly sophisticated techniques.
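The common core of Phases 4-8 is the dream rollout: C acts, M hallucinates the next latent and a reward, and no MuJoCo step is ever taken. A toy numpy sketch with stand-in dynamics and reward functions (the dream_rollout helper is illustrative, not the repo's API):

```python
import numpy as np

def dream_rollout(controller, dynamics, reward_head, z0, horizon=50):
    """Roll the controller forward inside the learned model: no env calls.
    dynamics stands in for sampling z_{t+1} from M's MDN head."""
    z, total = z0, 0.0
    for _ in range(horizon):
        a = controller(z)
        z = dynamics(z, a)           # hallucinated next latent
        total += reward_head(z, a)   # learned reward, not the env's
    return total

rng = np.random.default_rng(0)
A = 0.95 * np.eye(32)                # toy contractive latent dynamics
ret = dream_rollout(
    controller=lambda z: np.tanh(z[:2]),
    dynamics=lambda z, a: A @ z + 0.01 * rng.standard_normal(32),
    reward_head=lambda z, a: -float(np.sum(z[:2] ** 2)),
    z0=rng.standard_normal(32),
)
```

Because the gradient-free outer loop only ever sees this imagined return, every modeling error in dynamics and reward_head is exploitable, which is what Phase 8's branched rollouts and uncertainty penalty try to contain.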
# Phase 4: Dream training (train C inside M's hallucinated rollouts)
uv run python -m supreme.training.dream_training
# Phase 5: Iterative training (collect → retrain M → evolve C, 3 iterations)
uv run python -m supreme.training.iterative
# Phase 6: CMS inner-loop + reward head ablations
uv run python -m supreme.training.phase6_train
# Phase 7: Decoupled reward head + dream CMA-ES
uv run python -m supreme.training.phase7_train
# Phase 8: Multi-step fine-tuning + branched rollouts + uncertainty penalty
uv run python -m supreme.training.phase8_train
# Output: checkpoints/phase8/controller_phase8_best.pt

All hyperparameters live in supreme/configs/supreme.yaml:
# V model
vae:
  latent_dim: 32
  kl_tolerance: 0.5

# M model (HOPE)
memory:
  titans:
    hidden_dim: 128
    heads: 4
    conv_kernel: 4
  cms:
    frequencies: [1, 16, 1000000, 16000000]
    mlp_dim: 256
  mdn:
    num_gaussians: 5
    temperature: 1.0

# C model
controller:
  controller_hidden: 32   # 0 = linear, 32 = one hidden ELU layer

# CMA-ES
training:
  cma_population: 64
  cma_generations: 100
  num_rollout_per_eval: 8

# Environment
env:
  name: "Reacher-v5"
  frame_size: 64

# Hardware
hardware:
  device: "auto"          # auto-detect: mps > cuda > cpu

| Component | Minimum | Tested On |
|---|---|---|
| RAM | 8 GB | 16 GB (M4 Air) |
| GPU | Apple MPS or CUDA | Apple M4 MPS |
| Disk | 2 GB | ~1.5 GB used |
| Training time (full pipeline) | ~3 hours | ~4 hours (M4 Air) |
The entire agent (~6.7M parameters) fits comfortably on consumer hardware. CMA-ES operates only on the ~5K controller parameters while V and M handle all representational work.
| Agent | Real Reward | Training Mode |
|---|---|---|
| Random baseline | -43.17 +/- 3.44 | -- |
| Phase 3v2 (real CMA-ES) | -9.88 +/- 3.22 | Real environment |
| Phase 8 (branched dream) | -59.71 +/- 2.19 | Dream + real buffer |
The real-environment-trained controller (Phase 3v2, source: validation/controller_v2_validation.txt) is the strongest agent and the only one that beats random by a meaningful margin.
Honest disclosure (post-audit, 2026-04-09): the dream-trained Phase 8 controller (-59.71) currently performs worse than a random policy (-43.17). The 47-point improvement over Phase 7 (-107 -> -60) is real, but it does not yet bridge the dream-to-real transfer gap. Several methodology issues identified in audit_report.md (fixed-seed data collection, no train/val split, hidden-state distribution shift across M versions) likely contribute to this gap, and most are being fixed before the pipeline is re-run. Treat the current dream-training numbers as a research artifact, not a benchmark result.
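Two of those audit fixes are mechanical and worth sketching: vary the seed per rollout instead of resetting every rollout from one fixed seed, and hold out a validation split at the rollout level (not the frame level, to avoid leakage between temporally adjacent frames). A hypothetical numpy sketch; function names are illustrative, not the repo's API:

```python
import numpy as np

def split_rollouts(n_rollouts, val_frac=0.1, seed=0):
    """Shuffle rollout indices once, then split at the rollout level,
    so no trajectory contributes frames to both train and val."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rollouts)
    n_val = int(n_rollouts * val_frac)
    return idx[n_val:], idx[:n_val]  # train indices, val indices

def rollout_seeds(n_rollouts, base_seed=0):
    """One distinct seed per rollout (e.g. passed to env.reset(seed=...)),
    instead of replaying the same initial state every time."""
    return [base_seed + i for i in range(n_rollouts)]

train_idx, val_idx = split_rollouts(2000)
```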
Scientific limitation: hidden-state distribution shift across M versions
The controller is conditioned on the memory model's recurrent hidden state, so it implicitly depends on the exact M instance it was trained against. We measured this empirically:
| Controller | M used at evaluation | Real reward |
|---|---|---|
| Phase 3v2 | Phase 2 M (its training-time M) | -10.81 |
| Phase 3v2 | Phase 7 M | -73.88 |
The same set of controller weights drops by 63 points when paired with a re-trained M, even though both M models predict the same Reacher dynamics. The practical consequence is that the dream-to-real transfer story has two gaps, not one:
- Sim-to-real: M's learned dynamics differ from MuJoCo's.
- M-to-M: M's hidden-state geometry is not preserved across training runs, so even a perfect controller for "M_train" need not work on "M_eval".
Iterative training (Phase 5+) silently retrains M between iterations, which continually shifts the controller's input distribution. Until the M hidden-state distribution is stabilized (e.g., via a fixed bottleneck, distillation, or controller fine-tuning per M version), claims of the form "controller learned in dream transfers to real" are bounded above by this M-to-M gap. None of the current dream-trained numbers should be read as evidence for or against the sim-to-real question in isolation.
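As one illustration of the distillation option, a linear alignment map can be fit from the new M's hidden states to the old M's on shared rollouts, so a controller trained against the old geometry can consume aligned states from the re-trained M. This is a hypothetical mitigation sketch, not something the repo implements:

```python
import numpy as np

def fit_hidden_alignment(h_new, h_old):
    """Least-squares linear map W with h_new @ W ~ h_old, fitted on
    hidden states recorded from the same rollouts under both M versions."""
    W, *_ = np.linalg.lstsq(h_new, h_old, rcond=None)
    return W  # shape (d_new, d_old); apply as h_new @ W before the controller

# Toy check: pretend the re-trained M is a linear re-mix of the old geometry
rng = np.random.default_rng(0)
h_old = rng.standard_normal((500, 128))   # hidden states from the old M
R = rng.standard_normal((128, 128))
h_new = h_old @ R                          # same rollouts, new M's geometry
W = fit_hidden_alignment(h_new, h_old)
```

A purely linear map would of course only repair linear components of the drift; the measured 63-point drop may well require the stronger fixes (fixed bottleneck, or controller fine-tuning per M version).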
- Ha, D. & Schmidhuber, J. (2018). World Models. arXiv:1803.10122
- Behrouz et al. (2025). Nested Learning (HOPE architecture)
- Janner, M. et al. (2019). When to Trust Your Model: Model-Based Policy Optimization. NeurIPS
- Yu, T. et al. (2020). MOPO: Model-Based Offline Policy Optimization. NeurIPS
MIT