| 1 | +# Cluster / Remote Training Guide |
| 2 | + |
| 3 | +This document covers running pre-decoder training on remote GPU nodes |
| 4 | +using Docker containers, with optional SLURM integration. |
| 5 | +For local single-machine usage, see `README.md`. |
| 6 | + |
| 7 | +## Prerequisites |
| 8 | + |
| 9 | +- Docker with NVIDIA GPU support (`nvidia-docker` / `--gpus`) |
| 10 | +- One or more NVIDIA GPUs (H100, A100, or similar) |
| 11 | +- A persistent directory for checkpoints and logs |
| 12 | + |
| 13 | +## Quick start (Docker — recommended) |
| 14 | + |
| 15 | +### Option A: build the image once, reuse everywhere |
| 16 | + |
| 17 | +```bash |
| 18 | +# Build (once, from repo root) |
| 19 | +docker build -t predecoder-train . |
| 20 | + |
| 21 | +# Optionally, for a different CUDA version: |
| 22 | +docker build -t predecoder-train --build-arg TORCH_CUDA=cu124 . |
| 23 | + |
| 24 | +# Train |
| 25 | +docker run --rm --gpus all \ |
| 26 | + -v $(pwd):/app:ro \ |
| 27 | + -v $HOME/predecoder_outputs:/data \ |
| 28 | + -e SHARED_OUTPUT_DIR=/data \ |
| 29 | + predecoder-train |
| 30 | +``` |
| 31 | + |
| 32 | +The image includes Python 3.11, PyTorch with CUDA, and all training dependencies. |
| 33 | +Dependencies are baked in, so startup is fast and no internet access is needed at |
| 34 | +runtime. |
| 35 | + |
| 36 | +### Option B: install deps at runtime from a CUDA base image |
| 37 | + |
| 38 | +If you cannot pre-build the image (e.g. in a locked-down environment): |
| 39 | + |
| 40 | +```bash |
| 41 | +docker run --rm --gpus all \ |
| 42 | + -v $(pwd):/app:ro \ |
| 43 | + -v $HOME/predecoder_outputs:/data \ |
| 44 | + -e SHARED_OUTPUT_DIR=/data \ |
| 45 | + -e INSTALL_DIR=/opt/predecoder_env \ |
| 46 | + nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 \ |
| 47 | + bash -c 'apt-get update -qq && apt-get install -y -qq python3.11 python3.11-venv python3.11-dev curl git build-essential cmake >/dev/null 2>&1; bash /app/code/scripts/cluster_container_install_and_train.sh' |
| 48 | +``` |
| 49 | + |
| 50 | +This installs dependencies on every run, so it is slower. Use Option A when possible. |
| 51 | + |
| 52 | +## Quick start (SLURM + Docker) |
| 53 | + |
| 54 | +1. Build the image on a machine with Docker access: |
| 55 | + ```bash |
| 56 | + docker build -t predecoder-train . |
| 57 | + ``` |
| 58 | + |
| 59 | +2. Edit the `#SBATCH` directives in `code/scripts/sbatch_train.sh` for your cluster |
| 60 | + (partition name, GPU count, memory, time limit). |
| 61 | + |
| 62 | +3. Submit: |
| 63 | + ```bash |
| 64 | + export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs |
| 65 | + sbatch code/scripts/sbatch_train.sh |
| 66 | + ``` |
| 67 | + |
| 68 | +4. Monitor: |
| 69 | + ```bash |
| 70 | + tail -f predecoder_train_<jobid>.out |
| 71 | + ``` |
| 72 | + |
| 73 | +The sbatch script picks the first available option, in this order: pre-built image, base CUDA image, bare-metal fallback. |
| 74 | + |
| 75 | +## Script overview |
| 76 | + |
| 77 | +| Script | Purpose | |
| 78 | +|--------|---------| |
| 79 | +| `Dockerfile` | Builds `predecoder-train` image with all dependencies. | |
| 80 | +| `code/scripts/local_run.sh` | Core runner. Handles Hydra config, GPU detection, logging, checkpoints. Works everywhere. | |
| 81 | +| `code/scripts/cluster_install_deps.sh` | Installs Python 3.11+ and training dependencies into an isolated environment. | |
| 82 | +| `code/scripts/cluster_train.sh` | Sets up output dirs, exports env, then calls `local_run.sh`. Expects `SHARED_OUTPUT_DIR`. | |
| 83 | +| `code/scripts/cluster_container_install_and_train.sh` | Runs inside a Docker container: install deps (if needed), then train. | |
| 84 | +| `code/scripts/sbatch_train.sh` | SLURM submission script (template). Edit `#SBATCH` directives for your cluster. | |
| 85 | + |
| 86 | +### Call chain |
| 87 | + |
| 88 | +``` |
| 89 | +sbatch_train.sh (or: docker run ... predecoder-train) |
| 90 | + ├─ (pre-built image) → cluster_container_install_and_train.sh |
| 91 | + │ └─ cluster_train.sh → local_run.sh |
| 92 | + ├─ (base CUDA image) → cluster_container_install_and_train.sh |
| 93 | + │ ├─ cluster_install_deps.sh |
| 94 | + │ └─ cluster_train.sh → local_run.sh |
| 95 | + └─ (no Docker) → cluster_install_deps.sh |
| 96 | + → cluster_train.sh → local_run.sh |
| 97 | +``` |
| 98 | + |
| 99 | +## Quick start (bare-metal node, no Docker) |
| 100 | + |
| 101 | +If Docker is unavailable, you can install directly on the node: |
| 102 | + |
| 103 | +```bash |
| 104 | +# Install deps once |
| 105 | +export INSTALL_DIR=$HOME/predecoder_env |
| 106 | +bash code/scripts/cluster_install_deps.sh |
| 107 | + |
| 108 | +# Train |
| 109 | +export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs |
| 110 | +export PREDECODER_PYTHON=$INSTALL_DIR/venv/bin/python |
| 111 | +bash code/scripts/cluster_train.sh |
| 112 | +``` |
| 113 | + |
| 114 | +## Available training configs |
| 115 | + |
| 116 | +| Config file | Model | R | Noise | |
| 117 | +|-------------|-------|---|-------| |
| 118 | +| `conf/config_qec_decoder_r9_fp8.yaml` | Model 1 | 9 | Depolarizing p=0.006 | |
| 119 | +| `conf/config_qec_decoder_r13_fp8.yaml` | Model 4 | 13 | Depolarizing p=0.006 | |
| 120 | +| `conf/config_public.yaml` | Any | Varies | User-defined | |
| 121 | + |
| 122 | +Select a config by setting `CONFIG_NAME` (without the `.yaml` extension): |
| 123 | +```bash |
| 124 | +export CONFIG_NAME=config_qec_decoder_r13_fp8 |
| 125 | +``` |
| 126 | + |
| 127 | +## Environment variable reference |
| 128 | + |
| 129 | +### Core variables |
| 130 | + |
| 131 | +| Variable | Default | Description | |
| 132 | +|----------|---------|-------------| |
| 133 | +| `SHARED_OUTPUT_DIR` | *(required for cluster)* | Persistent directory for outputs, logs, checkpoints. | |
| 134 | +| `EXPERIMENT_NAME` | `qec-decoder-depolarizing-r9-fp8` | Subdirectory under `outputs/` for this run. Change this when changing configs. | |
| 135 | +| `CONFIG_NAME` | `config_qec_decoder_r9_fp8` | Hydra config name (file in `conf/` without `.yaml`). | |
| 136 | +| `WORKFLOW` | `train` | `train` or `inference`. | |
| 137 | +| `GPUS` | auto-detect | Number of GPUs. Must match SLURM `--gres=gpu:N`. | |
| 138 | +| `FRESH_START` | `0` | Set `1` to ignore existing checkpoints and start from scratch. | |
| 139 | + |
| 140 | +### Training overrides |
| 141 | + |
| 142 | +| Variable | Default | Description | |
| 143 | +|----------|---------|-------------| |
| 144 | +| `PREDECODER_TRAIN_EPOCHS` | `100` | Total number of training epochs. | |
| 145 | +| `PREDECODER_TRAIN_SAMPLES` | config-defined | Samples per epoch. Bypasses auto-scaling when set explicitly. | |
| 146 | +| `PREDECODER_LR_MILESTONES` | config-defined | Comma-separated LR schedule milestone fractions (e.g. `0.25,0.5,1.0`). | |
| 147 | +| `PREDECODER_TIMING_RUN` | unset | Set `1` for timing/benchmarking mode (disables some overhead). | |
| 148 | +| `PREDECODER_TORCH_COMPILE` | unset | `0` to disable `torch.compile`, `1` to enable. | |
| 149 | +| `PREDECODER_DISABLE_SDR` | unset | `1` to skip Syndrome Density Reduction computation (saves time). | |
| 150 | +| `TORCH_COMPILE` | unset | Alternative way to control `torch.compile` (`0`/`1`). | |
| 151 | +| `TORCH_COMPILE_MODE` | unset | `default`, `reduce-overhead`, or `max-autotune`. | |
| 152 | + |
| 153 | +### Infrastructure variables |
| 154 | + |
| 155 | +| Variable | Default | Description | |
| 156 | +|----------|---------|-------------| |
| 157 | +| `INSTALL_DIR` | `$HOME/predecoder_env` | Where `cluster_install_deps.sh` creates the Python environment. | |
| 158 | +| `PREDECODER_PYTHON` | auto-detect | Explicit path to the Python binary. | |
| 159 | +| `TORCH_CUDA` | `cu121` | PyTorch CUDA wheel tag (e.g. `cu121`, `cu124`, `cu130`). | |
| 160 | +| `DOCKER_IMAGE` | `predecoder-train` | Pre-built Docker image name. | |
| 161 | +| `DOCKER_BASE_IMAGE` | `nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04` | Fallback CUDA base image. | |
| 162 | +| `PREDECODER_BASE_OUTPUT_DIR` | `$SHARED_OUTPUT_DIR/outputs` | Override the outputs root (advanced). | |
| 163 | +| `PREDECODER_LOG_BASE_DIR` | `$SHARED_OUTPUT_DIR/logs` | Override the logs root (advanced). | |
| 164 | + |
| 165 | +## Example SLURM configurations |
| 166 | + |
| 167 | +### R=9, 1 GPU (Model 1) |
| 168 | + |
| 169 | +```bash |
| 170 | +export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs |
| 171 | +sbatch code/scripts/sbatch_train.sh |
| 172 | +``` |
| 173 | + |
| 174 | +### R=13, 1 GPU (Model 4) |
| 175 | + |
| 176 | +```bash |
| 177 | +export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs |
| 178 | +EXPERIMENT_NAME=qec-decoder-depolarizing-r13-fp8 \ |
| 179 | +CONFIG_NAME=config_qec_decoder_r13_fp8 \ |
| 180 | + sbatch code/scripts/sbatch_train.sh |
| 181 | +``` |
| 182 | + |
| 183 | +### R=13, 4 GPUs (Model 4) |
| 184 | + |
| 185 | +Override SLURM resources on the command line: |
| 186 | + |
| 187 | +```bash |
| 188 | +export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs |
| 189 | +EXPERIMENT_NAME=qec-decoder-depolarizing-r13-fp8-4gpu \ |
| 190 | +CONFIG_NAME=config_qec_decoder_r13_fp8 \ |
| 191 | +GPUS=4 FRESH_START=1 \ |
| 192 | + sbatch --partition=<your-4gpu-partition> \ |
| 193 | + --gres=gpu:4 --cpus-per-task=80 --mem=240G \ |
| 194 | + code/scripts/sbatch_train.sh |
| 195 | +``` |
| 196 | + |
| 197 | +### Resume a 1-GPU checkpoint on 4 GPUs |
| 198 | + |
| 199 | +When moving from 1 to N GPUs mid-training, pin the sample count and LR milestones |
| 200 | +explicitly so the schedule follows the original trajectory: |
| 201 | + |
| 202 | +```bash |
| 203 | +export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs |
| 204 | +EXPERIMENT_NAME=qec-decoder-depolarizing-r13-fp8 \ |
| 205 | +CONFIG_NAME=config_qec_decoder_r13_fp8 \ |
| 206 | +GPUS=4 \ |
| 207 | +PREDECODER_TRAIN_SAMPLES=8388608 \ |
| 208 | +PREDECODER_LR_MILESTONES="1.0,2.0,4.0" \ |
| 209 | + sbatch --partition=<your-4gpu-partition> \ |
| 210 | + --gres=gpu:4 --cpus-per-task=80 --mem=240G \ |
| 211 | + code/scripts/sbatch_train.sh |
| 212 | +``` |
| 213 | + |
| 214 | +Milestone rescaling rule: if the original milestones are `[m1, m2, m3]` and you |
| 215 | +multiply the GPU count by a factor `k`, the new milestones are `[m1*k, m2*k, m3*k]`. |
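
The rescaling arithmetic can be sketched in shell. `rescale_milestones` is a hypothetical helper, not part of the repo, and the starting milestones `0.25,0.5,1.0` are taken from the example value in the variable reference, so treat them as an assumption about your config:

```bash
# Hypothetical helper (not part of the repo): multiply each comma-separated
# milestone ($1) by the GPU scale factor k ($2).
rescale_milestones() {
  awk -v ms="$1" -v k="$2" 'BEGIN {
    n = split(ms, m, ",")
    out = ""
    for (i = 1; i <= n; i++) {
      s = sprintf("%g", m[i] * k)
      if (s !~ /[.e]/) s = s ".0"   # keep a decimal point, e.g. 4 -> 4.0
      out = out (i > 1 ? "," : "") s
    }
    print out
  }'
}

# Moving from 1 GPU to 4 GPUs, so k = 4.
PREDECODER_LR_MILESTONES="$(rescale_milestones "0.25,0.5,1.0" 4)"
export PREDECODER_LR_MILESTONES
echo "$PREDECODER_LR_MILESTONES"   # 1.0,2.0,4.0
```

With these inputs the result is the `1.0,2.0,4.0` value used in the resume example.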
| 216 | + |
| 217 | +## Multi-GPU training |
| 218 | + |
| 219 | +- Training uses PyTorch DDP (`torch.distributed.run`). Set `GPUS=N` and ensure N GPUs are visible. |
| 220 | +- Auto-scaling: with N GPUs, each GPU processes `num_samples / N` samples per epoch. |
| 221 | + To keep the *total* samples identical to a 1-GPU run, set `PREDECODER_TRAIN_SAMPLES` explicitly. |
| 222 | +- LR milestones are expressed as fractions of total steps. Changing GPU count changes total steps, |
| 223 | + so milestones may need rescaling (see the resume example above). |
| 224 | +- The `MASTER_PORT` is auto-selected if not set. Override it to avoid port conflicts |
| 225 | + when running multiple jobs on the same node. |
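
One way to avoid `MASTER_PORT` collisions between jobs on the same node is to derive the port from the SLURM job ID. This is a minimal sketch: `port_for_job` is a hypothetical helper, the 20000-29999 range is an assumption, and whether the variable reaches the training process inside a container depends on how the scripts forward the environment.

```bash
# Hypothetical helper: map a numeric job ID onto a port in 20000-29999.
# The range is an assumption; choose one your site allows.
port_for_job() {
  echo $(( 20000 + $1 % 10000 ))
}

# SLURM sets SLURM_JOB_ID inside a job; fall back to the shell PID elsewhere.
MASTER_PORT="$(port_for_job "${SLURM_JOB_ID:-$$}")"
export MASTER_PORT
```

The mapping does not guarantee the port is free, and two job IDs that differ by a multiple of 10000 map to the same port, so it reduces conflicts rather than ruling them out.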
| 226 | + |
| 227 | +## Resuming training |
| 228 | + |
| 229 | +Training auto-resumes from the latest checkpoint found in |
| 230 | +`$SHARED_OUTPUT_DIR/outputs/$EXPERIMENT_NAME/models/`. |
| 231 | + |
| 232 | +- Same experiment name = resume. Different experiment name = fresh run. |
| 233 | +- To force a clean restart on the same experiment: `export FRESH_START=1`. |
| 234 | +- A lock file prevents two SLURM jobs from writing to the same experiment directory concurrently. |
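
Before submitting, it can help to check what a job would find in the checkpoint directory. The sketch below lists the most recently modified entry in `models/`. `newest_entry` is a hypothetical helper, and "newest by modification time" is an assumption that may not match how the training code selects a checkpoint.

```bash
# Hypothetical helper: print the most recently modified entry in a directory,
# or nothing if the directory is missing or empty.
newest_entry() {
  ls -t "$1" 2>/dev/null | head -n 1
}

# Defaults mirror the values documented in the variable reference.
models_dir="${SHARED_OUTPUT_DIR:-$HOME/predecoder_outputs}/outputs/${EXPERIMENT_NAME:-qec-decoder-depolarizing-r9-fp8}/models"
latest="$(newest_entry "$models_dir")"

if [ -n "$latest" ]; then
  echo "Newest file in models/: $models_dir/$latest"
else
  echo "No files in $models_dir; a run would start from scratch"
fi
```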
| 235 | + |
| 236 | +## Output structure |
| 237 | + |
| 238 | +``` |
| 239 | +$SHARED_OUTPUT_DIR/ |
| 240 | +├── outputs/ |
| 241 | +│ └── <experiment_name>/ |
| 242 | +│ ├── models/ # Checkpoints and final model |
| 243 | +│ ├── tensorboard/ # TensorBoard logs |
| 244 | +│ ├── config/ # Config snapshots per run |
| 245 | +│ └── run.log # Latest run log |
| 246 | +└── logs/ |
| 247 | + └── <experiment_name>_<timestamp>/ |
| 248 | + └── train.log # Full stdout/stderr |
| 249 | +``` |
| 250 | + |
| 251 | +## Adapting to your cluster |
| 252 | + |
| 253 | +1. **Edit `#SBATCH` directives** in `sbatch_train.sh`: |
| 254 | + - `--partition=` your cluster's GPU partition |
| 255 | + - `--gres=gpu:N` matching your GPU count |
| 256 | + - `--cpus-per-task=`, `--mem=`, `--time=` as appropriate |
| 257 | + |
| 258 | +2. **CUDA version**: set `TORCH_CUDA=cuXXX` to a CUDA version your NVIDIA driver supports |
| 259 | + (e.g. `cu121` for CUDA 12.1, `cu124` for CUDA 12.4). |
| 260 | + |
| 261 | +3. **Docker base image**: set `DOCKER_BASE_IMAGE` if your cluster uses a different CUDA runtime. |
| 262 | + |
| 263 | +4. **File systems**: `SHARED_OUTPUT_DIR` should be on a shared/persistent filesystem |
| 264 | + visible from all nodes (NFS, Lustre, etc.). The sbatch script runs `chmod 1777` |
| 265 | + (world-writable with the sticky bit) for NFS compatibility when using Docker. |
| 266 | + |
| 267 | +5. **No Docker?** The scripts fall back to bare-metal install automatically. |
| 268 | + Ensure the node has internet access (for pip) or pre-install deps via `cluster_install_deps.sh`. |
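
To choose a `TORCH_CUDA` value, you can read the CUDA version the driver reports and convert it to a wheel tag. This is a sketch under assumptions: `cuda_to_torch_tag` is a hypothetical helper, and PyTorch publishes wheels only for some CUDA versions, so the tag it prints is a candidate to check against the available wheels, not a guaranteed match.

```bash
# Hypothetical helper: "12.4" -> "cu124". Only major.minor is used.
cuda_to_torch_tag() {
  echo "$1" | awk -F. '{ printf "cu%d%d\n", $1, $2 }'
}

# nvidia-smi prints "CUDA Version: X.Y" in its header: the highest CUDA
# version the installed driver supports.
if command -v nvidia-smi >/dev/null 2>&1; then
  driver_cuda="$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1)"
  if [ -n "$driver_cuda" ]; then
    echo "Driver supports CUDA $driver_cuda; candidate tag: $(cuda_to_torch_tag "$driver_cuda")"
  fi
fi
```

If no wheel exists for the printed tag, a lower tag that the driver still supports (for example `cu121` on a CUDA 12.4 driver) is the usual fallback.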
| 269 | + |
| 270 | +## Troubleshooting |
| 271 | + |
| 272 | +- **`SHARED_OUTPUT_DIR is not set`**: export it before running cluster scripts. |
| 273 | +- **Lock file conflict**: if a previous job crashed, remove `$SHARED_OUTPUT_DIR/.lock_<experiment>`. |
| 274 | +- **`steps_per_epoch=0`**: the sample count is too low for the batch size. Increase `PREDECODER_TRAIN_SAMPLES`. |
| 275 | +- **torch.compile segfaults**: set `PREDECODER_TORCH_COMPILE=0`. |
| 276 | +- **pip install fails in container**: ensure the base image has `python3.11-dev` and `build-essential`. |
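
For the `steps_per_epoch=0` case, a quick calculation shows when the sample count is too small. The formula below is an assumption, not taken from the training code: it treats the batch size as per-GPU and uses integer division. `steps_per_epoch` is a hypothetical helper and the batch size of 1024 is an illustrative value.

```bash
# Hypothetical helper: floor(samples / gpus / per_gpu_batch).
# Arguments: total samples, GPU count, per-GPU batch size (all assumptions).
steps_per_epoch() {
  echo $(( $1 / $2 / $3 ))
}

steps_per_epoch 8388608 4 1024   # 2048 steps
steps_per_epoch 2048 4 1024      # 0 steps: 512 samples per GPU is below one batch
```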