Minimal, hackable transformer experiments.

```bash
git clone <repo-url> && cd cambrian-stack
uv venv && source .venv/bin/activate
uv pip install -e ".[ml]"  # or ".[dev]" for tooling; base install skips torch
./scripts/train_baseline_transformer.sh                             # baseline AR
python -m cambrian_stack.train --config-name=diffusion_transformer  # diffusion
```

Outputs: checkpoints in `out/`, logs in `logs/`, optional W&B if `WANDB_API_KEY` is set.
This section explains how to run training jobs on SLURM-managed HPC clusters using Apptainer (formerly Singularity) containers.
Before you start, ensure you have:

- **Apptainer/Singularity container with PyTorch**
  - Example: NGC PyTorch container converted to SIF format
  - The container must have PyTorch with CUDA support
- **SLURM account with GPU partition access**
  - Know your account name (e.g., `mygroup-gpu`)
  - Know available partitions (e.g., `ghx4`, `ghx4-interactive`)
- **Environment file (optional but recommended)**
  - Create `.env` in the repo root with:

    ```bash
    WANDB_API_KEY=your_wandb_key
    HF_TOKEN=your_huggingface_token
    ```
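A `.env` file like the one above can be loaded by sourcing it with auto-export enabled. The sketch below is a self-contained demo of that mechanism using a throwaway temp file and dummy values; the actual sbatch scripts may load `.env` differently.

```bash
# Demo: write a throwaway .env and export its contents, the way a
# launcher script might pick up WANDB_API_KEY / HF_TOKEN.
tmpdir=$(mktemp -d)
printf 'WANDB_API_KEY=dummy-key\nHF_TOKEN=dummy-token\n' > "$tmpdir/.env"

set -a            # auto-export every variable assigned while sourcing
. "$tmpdir/.env"
set +a

echo "WANDB_API_KEY is set: ${WANDB_API_KEY:+yes}"
```

`set -a` marks every assignment in the sourced file for export, so child processes (like `python -m cambrian_stack.train`) inherit the keys.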
The SLURM scripts use these environment variables:

| Variable | Required | Description | Example |
|---|---|---|---|
| `SIF` | Yes | Path to PyTorch Apptainer/Singularity image | `/scratch/containers/pytorch.sif` |
| `CACHE_DIR` | No | Cache directory for HuggingFace, pip, etc. Defaults to `.cache/` in the repo | `/scratch/myuser/.cache` |
| `SBATCH_ACCOUNT` | No | Default SLURM account (alternative to the `-A` flag) | `mygroup-gpu` |
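The `CACHE_DIR` fallback described above can be expressed with a standard parameter-expansion default. A minimal sketch (the exact logic in the sbatch scripts may differ):

```bash
# Use CACHE_DIR from the environment if set; otherwise fall back to
# .cache/ under the current directory (the repo root when run from there).
CACHE_DIR="${CACHE_DIR:-$(pwd)/.cache}"
mkdir -p "$CACHE_DIR"
echo "Using cache dir: $CACHE_DIR"
```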
Batch submission with `sbatch` is the standard way to run training jobs.
**Step 1: Set your environment variables**

```bash
export SIF=/path/to/your/pytorch.sif
export CACHE_DIR=/path/to/cache  # optional, but recommended for shared systems
```

**Step 2: Submit the job**

```bash
# From the repo root directory:
sbatch -A <your-account> scripts/test_baseline.sbatch
```

**Step 3: Monitor the job**
```bash
# Check job status
squeue -u $USER

# Watch training progress (training logs go to stderr)
tail -f logs/slurm-baseline-test-<jobid>.err

# Watch job output (setup, GPU info, etc.)
tail -f logs/slurm-baseline-test-<jobid>.out
```

| Script | Description | Duration | GPUs |
|---|---|---|---|
| `scripts/test_baseline.sbatch` | Sanity check run | 1 hour | 1 |
| `scripts/train_nanochat_speedrun.sbatch` | Full training run | 24 hours | 2 |
Interactive sessions are useful for debugging, development, or when you want to see output in real time.
Step 1: Request an interactive GPU session
srun -A <your-account> \
-p ghx4-interactive \
--gres=gpu:1 \
--mem=32G \
--cpus-per-task=8 \
--time=02:00:00 \
--pty bashStep 2: Once on the GPU node, start the container
apptainer exec --nv /path/to/pytorch.sif bashThe --nv flag is required - it enables GPU access inside the container.
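A quick way to tell whether `--nv` took effect is to check for the NVIDIA tooling it bind-mounts into the container. This is a hypothetical sanity check, not part of the provided scripts:

```bash
# nvidia-smi is normally visible inside the container only when it was
# started with --nv on a node that has the NVIDIA driver installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  GPU_VISIBLE=yes
else
  GPU_VISIBLE=no
fi
echo "GPU tooling visible: $GPU_VISIBLE"
```

If this prints `no` inside the container, re-launch with `apptainer exec --nv ...`.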
**Step 3: Inside the container, set up and run**

```bash
cd /path/to/cambrian-stack

# Install the package (first time only, or after code changes)
pip install --user -e .
pip install --user accelerate

# Run training
python -m cambrian_stack.train training.max_steps=100
```

Alternatively, do all of the above as a single command:

```bash
srun -A <your-account> -p ghx4-interactive --gres=gpu:1 --mem=32G --time=02:00:00 \
    apptainer exec --nv /path/to/pytorch.sif bash -c '
        cd /path/to/cambrian-stack
        pip install --user -e . && pip install --user accelerate
        python -m cambrian_stack.train training.max_steps=100
    '
```

For convenience, create a personal wrapper script:
```bash
#!/bin/bash
# scripts/run_myname.sh - Personal wrapper for <your-name>
export SIF=/path/to/your/pytorch.sif
export CACHE_DIR=/path/to/your/cache
sbatch -A your-account scripts/test_baseline.sbatch "$@"
```

Make it executable: `chmod +x scripts/run_myname.sh`

Then run with: `./scripts/run_myname.sh`
For faster debugging cycles:

```bash
# Use a smaller model
python -m cambrian_stack.train model.depth=4

# Fewer training steps
python -m cambrian_stack.train training.max_steps=50 training.eval_every=25

# Disable saving and sampling
python -m cambrian_stack.train training.save_every=0 training.sample_every=0

# Combine for fast iteration
python -m cambrian_stack.train \
    model.depth=4 \
    training.max_steps=50 \
    training.eval_every=25 \
    training.save_every=0
```

Hydra configs live in `src/cambrian_stack/conf/` (grouped by experiment/model/data/training/logging/output). Example overrides:

```bash
python -m cambrian_stack.train training.max_steps=2000 training.eval_every=200
python -m cambrian_stack.train --config-name=baseline_transformer model.depth=6
```

- Autoregressive (GPT-style) and diffusion strategies are registered in `cambrian_stack/experiments`. Add new ideas by adding an experiment module + config, without touching the trainer.
- Build locally: `pip install -e ".[docs]" && sphinx-build -b html docs docs/_build/html`
- Deploy: the GitHub Actions workflow `docs.yml` publishes to Pages.
MIT