AlpaGym

AlpaGym is a reinforcement-learning framework for end-to-end autonomous-driving policies. It runs a policy in closed loop inside a simulator, scores the resulting drives, and trains on them — so the policy learns from the consequences of its own steering rather than from logged ground truth alone.

It stands on two systems: AlpaSim provides the closed-loop simulator (the environment), and Cosmos-RL provides distributed rollout and training orchestration (the trainer). AlpaGym is the harness that wires them to a driving policy and keeps the interfaces small enough to swap any one piece.

AlpaGym is in early but active development. It currently supports the Alpamayo 1.5 model with 10B parameters. Current work focuses on throughput and scaling, and on supporting more models and training algorithms.

Quick Start

Full setup — host dependencies, Hugging Face and W&B auth, and downloading and converting a model bundle — is in the Onboarding Guide. Once the host tools are installed and a converted checkpoint is under ./tmp/checkpoints/:

# from the repository root
uv sync --all-packages

uv run --no-sync --all-packages python -m alpagym_host.cli \
  experiment=alpamayo_1_5_local_2gpu_smoke \
  policy.model.path="$(pwd)/tmp/checkpoints/alpamayo-1.5-10B_alpagym_ckpt" \
  reward=progress_safety

This runs one episode against AlpaSim, computes a reward, takes one training step, and writes artifacts under tmp/alpagym-runs/. Hydra output is written to outputs/.

Note: The default 10B Alpamayo model requires two GPUs; see the Onboarding Guide for getting the checkpoint.

Configuration

See packages/host/src/alpagym_host/conf/default.yaml for the full set of configuration options. For example:

expected_valid_steps controls the simulation length.
dataset controls which scenes to simulate; it is passed through to AlpaSim.
alpasim controls the simulation parameters (see AlpaSim for details).
cosmos controls the topology and training parameters.
deploy=slurm and the topology=slurm_* presets are starting points for cluster deployments. They are not expected to run out of the box; set the partition, account, container image, cache paths, mounts, and AlpaSim Wizard deploy preset for your Slurm environment.
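
The Quick Start command sets options as key=value overrides with dotted keys such as policy.model.path. As a rough illustration of how a dotted override lands in a nested config, here is a minimal standard-library sketch. It is not Hydra's implementation: apply_overrides is a hypothetical helper, and the base config below is a placeholder rather than the actual contents of default.yaml.

```python
from copy import deepcopy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `key=value` overrides applied.

    Illustration only: values stay strings, and Hydra's real override
    grammar (config groups, type parsing, prefixes) is not modeled.
    """
    result = deepcopy(config)
    for item in overrides:
        dotted_key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate mappings on demand.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


# Placeholder config (assumption); the real options live in default.yaml.
base = {"policy": {"model": {"path": None}}, "reward": "placeholder"}
merged = apply_overrides(
    base,
    ["policy.model.path=/abs/path/to/checkpoint", "reward=progress_safety"],
)
```

The helper copies the input before editing it, so the base config is left untouched and the same defaults can be reused across runs.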

Architecture

AlpaGym is split into small workspace packages — a host control plane and a runtime harness. During a run, Cosmos-RL asks the runtime for rollouts; the runtime drives AlpaSim episodes over gRPC, accumulates observations, scores the completed episodes, and hands the artifacts back through Cosmos-RL to the trainer.
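
The closed loop described above (observe, act, advance, then score the finished episode) can be sketched in a few lines. This is a toy under stated assumptions, not AlpaGym's runtime: ToyLane, Episode, and run_episode are hypothetical names, the simulator is an in-process 1-D stand-in instead of AlpaSim over gRPC, and the policy and score functions are placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Episode:
    """Accumulated record of one closed-loop drive (hypothetical structure)."""
    observations: list[float] = field(default_factory=list)
    actions: list[float] = field(default_factory=list)
    reward: float = 0.0


class ToyLane:
    """Toy 1-D world: the observation is the lateral offset from lane center."""

    def __init__(self, start_offset: float, horizon: int) -> None:
        self.start_offset = start_offset
        self.horizon = horizon
        self.offset = start_offset
        self.t = 0

    def reset(self) -> float:
        self.offset = self.start_offset
        self.t = 0
        return self.offset

    def step(self, action: float) -> tuple[float, bool]:
        # The policy's own action changes what it observes next.
        self.offset += action
        self.t += 1
        return self.offset, self.t >= self.horizon


def run_episode(
    sim: ToyLane,
    policy: Callable[[float], float],
    score: Callable[[Episode], float],
    max_steps: int = 100,
) -> Episode:
    """Drive one episode in closed loop, then score it once it is complete."""
    episode = Episode()
    obs = sim.reset()
    for _ in range(max_steps):
        action = policy(obs)
        episode.observations.append(obs)
        episode.actions.append(action)
        obs, done = sim.step(action)
        if done:
            break
    episode.reward = score(episode)
    return episode


episode = run_episode(
    ToyLane(start_offset=1.0, horizon=4),
    policy=lambda obs: -0.5 * obs,  # steer halfway back to center each tick
    score=lambda ep: -sum(abs(o) for o in ep.observations),
)
```

Each observation depends on the previous action, which is the property that separates closed-loop rollouts from replaying logged data.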

Packages

Each workspace package has one job, and the workspace is designed to make adding new packages straightforward.

packages/host — the control plane. Owns the CLI, the typed Hydra config, AlpaSim checkout and bring-up, Slurm submission, and the per-run artifact directory. On a cluster it runs on the login node.
packages/runtime — the harness that runs inside the GPU container. Owns the Cosmos-RL adapters (cosmos/), the rollout loop (episode_runner/), the gRPC egodriver (alpasim/), batched GPU inference (inference/), the policies and model adapters (policies/), the reward terms (rewards/), and the episode transport (transport/).
packages/policies — the driving policies. Currently the Alpamayo 1.5 policy bundle, configs, tokenizer, and checkpoint conversion script.
packages/alpasim_configs — AlpaSim topology configs used by AlpaGym.
packages/plugins — utilities for discovering optional plugin packages.
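
The runtime package is described as owning a set of reward terms. One common way to structure such terms is as small functions over an episode summary, combined by a weighted sum. The sketch below illustrates that pattern only: the term definitions, the summary keys (route_fraction, collisions), and the weights are all assumptions, and it does not reproduce the actual progress_safety reward.

```python
from typing import Callable, Mapping

EpisodeSummary = Mapping[str, float]
RewardTerm = Callable[[EpisodeSummary], float]


def progress_term(summary: EpisodeSummary) -> float:
    """Fraction of the route completed, clamped to [0, 1] (hypothetical)."""
    return max(0.0, min(1.0, summary["route_fraction"]))


def safety_term(summary: EpisodeSummary) -> float:
    """1.0 for a collision-free episode, 0.0 otherwise (hypothetical)."""
    return 0.0 if summary["collisions"] > 0 else 1.0


def combine(terms: Mapping[str, tuple[RewardTerm, float]]) -> RewardTerm:
    """Build one scalar reward as a weighted sum of named terms."""

    def reward(summary: EpisodeSummary) -> float:
        return sum(weight * term(summary) for term, weight in terms.values())

    return reward


# Weights are placeholders chosen for the example.
reward_fn = combine(
    {"progress": (progress_term, 0.5), "safety": (safety_term, 0.5)}
)
clean = reward_fn({"route_fraction": 0.5, "collisions": 0.0})
crashed = reward_fn({"route_fraction": 1.0, "collisions": 1.0})
```

Keeping each term as a separate named function lets a term be reweighted or replaced without touching the others.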

End-to-End Run Flow

A run has two halves: the host prepares and launches, the runtime drives and trains.

On the host (control plane):

1. Compose the config with Hydra and freeze it to a per-run directory.
2. Resolve and cache the AlpaSim checkout.
3. Start AlpaSim's Wizard launcher and publish its gRPC endpoint.
4. Launch the Cosmos-RL runtime.
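
Freezing the composed config to a per-run directory means writing it out once so that a later process reads exactly the same values. A minimal sketch of that idea follows, assuming a JSON file and a caller-supplied run ID. The file name, the JSON format, and the helper names are hypothetical and are not taken from AlpaGym's host package.

```python
import json
import tempfile
from pathlib import Path


def freeze_config(config: dict, runs_root: Path, run_id: str) -> Path:
    """Write `config` into a fresh per-run directory and return the file path."""
    run_dir = runs_root / run_id
    # exist_ok=False: refuse to overwrite the record of an earlier run.
    run_dir.mkdir(parents=True, exist_ok=False)
    frozen = run_dir / "config.frozen.json"  # hypothetical file name
    frozen.write_text(json.dumps(config, indent=2, sort_keys=True))
    return frozen


def load_frozen(path: Path) -> dict:
    """Reload a frozen config exactly as it was written."""
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder values, not defaults from default.yaml.
    config = {"reward": "progress_safety", "expected_valid_steps": 10}
    frozen_path = freeze_config(config, Path(tmp), "run-0001")
    round_trip_ok = load_frozen(frozen_path) == config
```

Reloading from the frozen file rather than recomposing is what lets a second process, such as a cluster job, see the same settings the launcher resolved.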

In the runtime (GPU):

1. The rollout backend pulls scenes and runs episodes against AlpaSim. Each tick, AlpaSim sends camera, ego, and route observations to the egodriver gRPC server; the policy steps and returns a trajectory; AlpaSim advances the ego car and asks again.
2. Completed episodes are scored and written as artifacts.
3. The trainer consumes the artifacts and takes a GRPO step; updated weights sync back to the rollout workers.

Local runs do all of this on one machine. Slurm runs perform host steps 1–4 on the login node, then reload the frozen config and run the same lifecycle on the cluster.

Documentation

Onboarding Guide — host setup, auth, getting a model, and running locally.
Contributing — code style and review process.
