Commit 65e12c0

ivanbasov and claude authored
Add training scripts and configs for public model generation (#4)
* Add Docker-based training infrastructure and cluster scripts

  Dockerfile, SLURM sbatch template, and supporting shell scripts for running pre-decoder training on remote GPU nodes (Docker, bare-metal, or SLURM). Includes two production training configs (R=9, R=13), PREDECODER_LR_MILESTONES env override in train.py, and comprehensive TRAINING.md documentation.

  Made-with: Cursor

* Fix training script portability and documentation issues

  - sbatch_train.sh: resolve REPO_ROOT from script location, not $(pwd)
  - sbatch_train.sh: consolidate PREDECODER_DISABLE_SDR/TORCH_COMPILE defaults so Docker and bare-metal paths behave identically
  - sbatch_train.sh: log message before chmod 1777; add --nodes=1 to multi-GPU examples
  - cluster_install_deps.sh: arch-aware Miniconda URL (supports aarch64/ARM)
  - cluster_install_deps.sh: single TORCH_CUDA default (remove redundant fallback)
  - TRAINING.md: document SHARED_LOG_DIR; correct cluster defaults for SDR/compile vars
  - conf/config_qec_decoder_r{9,13}_fp8.yaml: note that training hyperparams come from internal defaults, point to config_public.yaml

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent e70d9ab commit 65e12c0

12 files changed

Lines changed: 873 additions & 2 deletions

.dockerignore

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+outputs/
+logs/
+frames_data/
+models/
+dev_history/
+.git/
+.venv*/
+venv/
+__pycache__/
+*.out
+*.err
+*.html

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -65,3 +65,8 @@ logs/
 /MR_COMMENT_DRAFT.md
 /MR_REVIEW_SUMMARY.md
 /dev_history/
+
+# SLURM job logs
+predecoder_train_*.out
+predecoder_train_*.err
+sbatch_logs/

Dockerfile

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+#
+# Pre-decoder training image.
+#
+# Build:
+# docker build -t predecoder-train .
+# docker build -t predecoder-train --build-arg TORCH_CUDA=cu124 . # different CUDA
+#
+# Run:
+# docker run --rm --gpus all \
+# -v $(pwd):/app:ro -v $HOME/predecoder_outputs:/data \
+# -e SHARED_OUTPUT_DIR=/data \
+# predecoder-train
+#
+# See TRAINING.md for the full environment variable reference.
+
+ARG BASE_IMAGE=nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
+FROM ${BASE_IMAGE}
+
+ARG PYTHON_VERSION=3.11
+ARG TORCH_CUDA=cu121
+
+ENV DEBIAN_FRONTEND=noninteractive \
+    PYTHONUNBUFFERED=1 \
+    PREDECODER_PYTHON=/opt/venv/bin/python
+
+RUN apt-get update -qq && \
+    apt-get install -y -qq --no-install-recommends \
+    python${PYTHON_VERSION} python${PYTHON_VERSION}-venv python${PYTHON_VERSION}-dev \
+    curl git coreutils build-essential cmake && \
+    apt-get clean && rm -rf /var/lib/apt/lists/*
+
+RUN python${PYTHON_VERSION} -m venv /opt/venv
+ENV PATH="/opt/venv/bin:$PATH"
+
+COPY code/requirements_public_inference.txt /tmp/requirements_public_inference.txt
+COPY code/requirements_public_train.txt /tmp/requirements_public_train.txt
+
+RUN pip install --no-cache-dir --upgrade pip setuptools wheel && \
+    pip install --no-cache-dir \
+    -r /tmp/requirements_public_train.txt \
+    --index-url "https://download.pytorch.org/whl/${TORCH_CUDA}" \
+    --extra-index-url https://pypi.org/simple && \
+    python -c "import torch; print('PyTorch', torch.__version__, '(CUDA build:', torch.version.cuda, ')')"
+
+WORKDIR /app
+CMD ["bash", "code/scripts/cluster_container_install_and_train.sh"]

TRAINING.md

Lines changed: 277 additions & 0 deletions
@@ -0,0 +1,277 @@
+# Cluster / Remote Training Guide
+
+This document covers running pre-decoder training on remote GPU nodes
+using Docker containers, with optional SLURM integration.
+For local single-machine usage, see `README.md`.
+
+## Prerequisites
+
+- Docker with NVIDIA GPU support (`nvidia-docker` / `--gpus`)
+- One or more NVIDIA GPUs (H100, A100, or similar)
+- A persistent directory for checkpoints and logs
+
+## Quick start (Docker — recommended)
+
+### Option A: build the image once, reuse everywhere
+
+```bash
+# Build (once, from repo root)
+docker build -t predecoder-train .
+
+# Optionally, for a different CUDA version:
+docker build -t predecoder-train --build-arg TORCH_CUDA=cu124 .
+
+# Train
+docker run --rm --gpus all \
+  -v $(pwd):/app:ro \
+  -v $HOME/predecoder_outputs:/data \
+  -e SHARED_OUTPUT_DIR=/data \
+  predecoder-train
+```
+
+The image includes Python 3.11, PyTorch with CUDA, and all training dependencies.
+Dependencies are baked in, so startup is fast and no internet access is needed at
+runtime.
+
+### Option B: install deps at runtime from a CUDA base image
+
+If you cannot pre-build the image (e.g. in a locked-down environment):
+
+```bash
+docker run --rm --gpus all \
+  -v $(pwd):/app:ro \
+  -v $HOME/predecoder_outputs:/data \
+  -e SHARED_OUTPUT_DIR=/data \
+  -e INSTALL_DIR=/opt/predecoder_env \
+  nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 \
+  bash -c 'apt-get update -qq && apt-get install -y -qq python3.11 python3.11-venv python3.11-dev curl git build-essential cmake >/dev/null 2>&1; bash /app/code/scripts/cluster_container_install_and_train.sh'
+```
+
+This installs dependencies on every run, so it is slower. Use Option A when possible.
+
+## Quick start (SLURM + Docker)
+
+1. Build the image on a machine with Docker access:
+   ```bash
+   docker build -t predecoder-train .
+   ```
+
+2. Edit the `#SBATCH` directives in `code/scripts/sbatch_train.sh` for your cluster
+   (partition name, GPU count, memory, time limit).
+
+3. Submit:
+   ```bash
+   export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
+   sbatch code/scripts/sbatch_train.sh
+   ```
+
+4. Monitor:
+   ```bash
+   tail -f predecoder_train_<jobid>.out
+   ```
+
+The sbatch script auto-detects: pre-built image > base CUDA image > bare-metal fallback.
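
The detection order stated above can be sketched as a small shell function. This is an illustration only, not the logic in `sbatch_train.sh`: the function name `select_runtime` and its three labels are hypothetical, and the sketch assumes only that `docker image inspect` exits non-zero when the named image is not available locally.

```bash
# Hypothetical helper: report which execution path would be taken.
# Prints one of: prebuilt | base-image | bare-metal (labels are illustrative).
select_runtime() {
    local image="${DOCKER_IMAGE:-predecoder-train}"

    # No docker CLI on PATH: nothing to detect, use the bare-metal path.
    if ! command -v docker >/dev/null 2>&1; then
        echo "bare-metal"
        return 0
    fi

    # `docker image inspect` fails if the image is not present locally.
    if docker image inspect "$image" >/dev/null 2>&1; then
        echo "prebuilt"
    else
        echo "base-image"
    fi
}
```

Calling `select_runtime` before submitting a job shows which path to expect. Note that the sketch also reports `base-image` when the Docker daemon is unreachable, so a real script would likely want a separate daemon check.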
+
+## Script overview
+
+| Script | Purpose |
+|--------|---------|
+| `Dockerfile` | Builds `predecoder-train` image with all dependencies. |
+| `code/scripts/local_run.sh` | Core runner. Handles Hydra config, GPU detection, logging, checkpoints. Works everywhere. |
+| `code/scripts/cluster_install_deps.sh` | Installs Python 3.11+ and training dependencies into an isolated environment. |
+| `code/scripts/cluster_train.sh` | Sets up output dirs, exports env, then calls `local_run.sh`. Expects `SHARED_OUTPUT_DIR`. |
+| `code/scripts/cluster_container_install_and_train.sh` | Runs inside a Docker container: install deps (if needed), then train. |
+| `code/scripts/sbatch_train.sh` | SLURM submission script (template). Edit `#SBATCH` directives for your cluster. |
+
+### Call chain
+
+```
+sbatch_train.sh (or: docker run ... predecoder-train)
+  ├─ (pre-built image) → cluster_container_install_and_train.sh
+  │     └─ cluster_train.sh → local_run.sh
+  ├─ (base CUDA image) → cluster_container_install_and_train.sh
+  │     ├─ cluster_install_deps.sh
+  │     └─ cluster_train.sh → local_run.sh
+  └─ (no Docker) → cluster_install_deps.sh
+                 → cluster_train.sh → local_run.sh
+```
+
+## Quick start (bare-metal node, no Docker)
+
+If Docker is unavailable, you can install directly on the node:
+
+```bash
+# Install deps once
+export INSTALL_DIR=$HOME/predecoder_env
+bash code/scripts/cluster_install_deps.sh
+
+# Train
+export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
+export PREDECODER_PYTHON=$INSTALL_DIR/venv/bin/python
+bash code/scripts/cluster_train.sh
+```
+
+## Available training configs
+
+| Config file | Model | R | Noise |
+|-------------|-------|---|-------|
+| `conf/config_qec_decoder_r9_fp8.yaml` | Model 1 | 9 | Depolarizing p=0.006 |
+| `conf/config_qec_decoder_r13_fp8.yaml` | Model 4 | 13 | Depolarizing p=0.006 |
+| `conf/config_public.yaml` | Any | Varies | User-defined |
+
+Select a config by setting `CONFIG_NAME` (without the `.yaml` extension):
+```bash
+export CONFIG_NAME=config_qec_decoder_r13_fp8
+```
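
Because `CONFIG_NAME` is given without its extension, a typo or a stray `.yaml` suffix is easy to miss until the job starts. A minimal pre-flight check might look like the sketch below. The function name `check_config` is hypothetical, and the config directory is passed as an argument because its location relative to the working directory is an assumption here.

```bash
# Hypothetical pre-flight check: fail early if the selected config is missing.
# $1 is the directory holding the YAML configs (assumed; defaults to "conf").
check_config() {
    local conf_dir="${1:-conf}"
    local name="${CONFIG_NAME:-config_qec_decoder_r9_fp8}"

    # CONFIG_NAME is given without the extension, so reject a trailing ".yaml".
    case "$name" in
        *.yaml)
            echo "CONFIG_NAME must not include the .yaml extension: $name" >&2
            return 2
            ;;
    esac

    if [ ! -f "$conf_dir/$name.yaml" ]; then
        echo "Config not found: $conf_dir/$name.yaml" >&2
        return 1
    fi

    # Print the resolved path so callers can log it.
    echo "$conf_dir/$name.yaml"
}
```

Running `check_config path/to/conf` before `sbatch` catches a wrong name before the job queues.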
+
+## Environment variable reference
+
+### Core variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `SHARED_OUTPUT_DIR` | *(required for cluster)* | Persistent directory for outputs, logs, checkpoints. |
+| `EXPERIMENT_NAME` | `qec-decoder-depolarizing-r9-fp8` | Subdirectory under `outputs/` for this run. Change this when changing configs. |
+| `CONFIG_NAME` | `config_qec_decoder_r9_fp8` | Hydra config name (file in `conf/` without `.yaml`). |
+| `WORKFLOW` | `train` | `train` or `inference`. |
+| `GPUS` | auto-detect | Number of GPUs. Must match SLURM `--gres=gpu:N`. |
+| `FRESH_START` | `0` | Set `1` to ignore existing checkpoints and start from scratch. |
+
+### Training overrides
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `PREDECODER_TRAIN_EPOCHS` | `100` | Total number of training epochs. |
+| `PREDECODER_TRAIN_SAMPLES` | config-defined | Samples per epoch. Bypasses auto-scaling when set explicitly. |
+| `PREDECODER_LR_MILESTONES` | config-defined | Comma-separated LR schedule milestone fractions (e.g. `0.25,0.5,1.0`). |
+| `PREDECODER_TIMING_RUN` | unset | Set `1` for timing/benchmarking mode (disables some overhead). |
+| `PREDECODER_TORCH_COMPILE` | `0` when run via `sbatch_train.sh`, otherwise unset | `0` to disable `torch.compile`, `1` to enable. |
+| `PREDECODER_DISABLE_SDR` | `1` when run via `sbatch_train.sh`, otherwise unset | `1` to skip Syndrome Density Reduction computation (saves time on cluster). |
+| `TORCH_COMPILE` | unset | Alternative way to control `torch.compile` (`0`/`1`). |
+| `TORCH_COMPILE_MODE` | unset | `default`, `reduce-overhead`, or `max-autotune`. |
+
+### Infrastructure variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `INSTALL_DIR` | `$HOME/predecoder_env` | Where `cluster_install_deps.sh` creates the Python environment. |
+| `PREDECODER_PYTHON` | auto-detect | Explicit path to the Python binary. |
+| `TORCH_CUDA` | `cu121` | PyTorch CUDA wheel tag (e.g. `cu121`, `cu124`, `cu130`). |
+| `DOCKER_IMAGE` | `predecoder-train` | Pre-built Docker image name. |
+| `DOCKER_BASE_IMAGE` | `nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04` | Fallback CUDA base image. |
+| `SHARED_LOG_DIR` | `$SHARED_OUTPUT_DIR/logs` | Override the logs root directory (advanced). |
+| `PREDECODER_BASE_OUTPUT_DIR` | `$SHARED_OUTPUT_DIR/outputs` | Override the outputs root (advanced). |
+| `PREDECODER_LOG_BASE_DIR` | `$SHARED_OUTPUT_DIR/logs` | Override the logs root (advanced, set by `cluster_train.sh` from `SHARED_LOG_DIR`). |
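
The directory defaults in the infrastructure table follow the usual shell fallback pattern. The sketch below mirrors the table only; it is not copied from `cluster_train.sh`, the function name `resolve_dirs` is hypothetical, and how the real script treats a pre-set `PREDECODER_LOG_BASE_DIR` is not stated, so here `SHARED_LOG_DIR` is assumed to take precedence.

```bash
# Hypothetical helper mirroring the documented defaults for output and log roots.
resolve_dirs() {
    if [ -z "${SHARED_OUTPUT_DIR:-}" ]; then
        echo "SHARED_OUTPUT_DIR is not set" >&2
        return 1
    fi

    # Outputs root: an explicit override wins, else $SHARED_OUTPUT_DIR/outputs.
    export PREDECODER_BASE_OUTPUT_DIR="${PREDECODER_BASE_OUTPUT_DIR:-$SHARED_OUTPUT_DIR/outputs}"

    # Logs root: SHARED_LOG_DIR if set, else $SHARED_OUTPUT_DIR/logs (assumed precedence).
    export PREDECODER_LOG_BASE_DIR="${SHARED_LOG_DIR:-$SHARED_OUTPUT_DIR/logs}"

    echo "outputs=$PREDECODER_BASE_OUTPUT_DIR"
    echo "logs=$PREDECODER_LOG_BASE_DIR"
}
```

Printing the resolved paths at the top of a job log makes it easier to tell afterwards where a run wrote its checkpoints.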
+
+## Example SLURM configurations
+
+### R=9, 1 GPU (Model 1)
+
+```bash
+export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
+sbatch code/scripts/sbatch_train.sh
+```
+
+### R=13, 1 GPU (Model 4)
+
+```bash
+export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
+EXPERIMENT_NAME=qec-decoder-depolarizing-r13-fp8 \
+CONFIG_NAME=config_qec_decoder_r13_fp8 \
+sbatch code/scripts/sbatch_train.sh
+```
+
+### R=13, 4 GPUs (Model 4)
+
+Override SLURM resources on the command line:
+
+```bash
+export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
+EXPERIMENT_NAME=qec-decoder-depolarizing-r13-fp8-4gpu \
+CONFIG_NAME=config_qec_decoder_r13_fp8 \
+GPUS=4 FRESH_START=1 \
+sbatch --partition=<your-4gpu-partition> \
+  --nodes=1 --gres=gpu:4 --cpus-per-task=80 --mem=240G \
+  code/scripts/sbatch_train.sh
+```
+
+### Resume a 1-GPU checkpoint on 4 GPUs
+
+When moving from 1 to N GPUs mid-training, fix the sample count and LR milestones
+so the schedule matches the original trajectory:
+
+```bash
+export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
+EXPERIMENT_NAME=qec-decoder-depolarizing-r13-fp8 \
+CONFIG_NAME=config_qec_decoder_r13_fp8 \
+GPUS=4 \
+PREDECODER_TRAIN_SAMPLES=8388608 \
+PREDECODER_LR_MILESTONES="1.0,2.0,4.0" \
+sbatch --partition=<your-4gpu-partition> \
+  --nodes=1 --gres=gpu:4 --cpus-per-task=80 --mem=240G \
+  code/scripts/sbatch_train.sh
+```
+
+The milestone rescaling formula: if original milestones are `[m1, m2, m3]` and you
+increase GPU count by factor `k`, new milestones are `[m1*k, m2*k, m3*k]`.
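
The rescaling rule is plain multiplication, so it can be scripted rather than worked out by hand. The sketch below uses `awk` for the floating-point arithmetic; `rescale_milestones` is a hypothetical helper name, not part of the repository. It prints whole numbers without a trailing `.0` (for example `1,2,4`), on the assumption that the training code parses each milestone as a float.

```bash
# Hypothetical helper: multiply each comma-separated milestone by factor k.
# Usage: rescale_milestones "0.25,0.5,1.0" 4
rescale_milestones() {
    local milestones="$1" factor="$2"
    awk -v ms="$milestones" -v k="$factor" 'BEGIN {
        n = split(ms, parts, ",")
        for (i = 1; i <= n; i++) {
            # %g drops trailing zeros, so 4.0 is printed as 4.
            printf "%s%g", (i > 1 ? "," : ""), parts[i] * k
        }
        printf "\n"
    }'
}
```

For the 1-to-4 GPU case with original milestones `0.25,0.5,1.0`, `rescale_milestones "0.25,0.5,1.0" 4` prints `1,2,4`, which corresponds to the `1.0,2.0,4.0` value used in the resume example.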
+
+## Multi-GPU training
+
+- Training uses PyTorch DDP (`torch.distributed.run`). Set `GPUS=N` and ensure N GPUs are visible.
+- Auto-scaling: with N GPUs, each GPU processes `num_samples / N` samples per epoch.
+  To keep the *total* samples identical to a 1-GPU run, set `PREDECODER_TRAIN_SAMPLES` explicitly.
+- LR milestones are expressed as fractions of total steps. Changing GPU count changes total steps,
+  so milestones may need rescaling (see the resume example above).
+- The `MASTER_PORT` is auto-selected if not set. Override it to avoid port conflicts
+  when running multiple jobs on the same node.
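
Two of the points above come down to small arithmetic. The sketch below shows a per-GPU sample count and one common way to derive a job-specific `MASTER_PORT`. Both helper names are hypothetical. Integer division for the per-GPU count and the 10000-29999 port range are assumptions, not behavior confirmed by the training scripts. `SLURM_JOB_ID` is the variable SLURM sets inside a job.

```bash
# Hypothetical helper: samples each GPU would process per epoch,
# assuming integer division of the total by the GPU count.
per_gpu_samples() {
    local total="$1" gpus="$2"
    echo $(( total / gpus ))
}

# Hypothetical helper: derive a port from the job ID so that two jobs
# on the same node are unlikely to collide. Falls back to the shell PID
# outside SLURM. The result lies in 10000-29999 (an assumed range).
pick_master_port() {
    local job_id="${1:-${SLURM_JOB_ID:-$$}}"
    echo $(( 10000 + job_id % 20000 ))
}
```

Inside a job script this could be used as `export MASTER_PORT="$(pick_master_port)"`. A derived port is not guaranteed to be free, so a real script may still want to probe it.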
+
+## Resuming training
+
+Training auto-resumes from the latest checkpoint found in
+`$SHARED_OUTPUT_DIR/outputs/$EXPERIMENT_NAME/models/`.
+
+- Same experiment name = resume. Different experiment name = fresh run.
+- To force a clean restart on the same experiment: `export FRESH_START=1`.
+- A lock file prevents two SLURM jobs from writing to the same experiment directory concurrently.
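
The resume rules above can be checked before submitting, to avoid silently continuing an old run. The sketch below is illustrative: `resume_mode` is a hypothetical helper, and it treats any entry in `models/` as a checkpoint, which is an assumption because checkpoint file naming is not described here.

```bash
# Hypothetical check: prints "resume" or "fresh" for a given experiment.
# $1 is the shared output root, $2 the experiment name.
resume_mode() {
    local output_root="$1" experiment="$2"
    local models_dir="$output_root/outputs/$experiment/models"

    if [ "${FRESH_START:-0}" = "1" ]; then
        # FRESH_START=1 ignores existing checkpoints.
        echo "fresh"
    elif [ -d "$models_dir" ] && [ -n "$(ls -A "$models_dir" 2>/dev/null)" ]; then
        # Any entry in models/ is treated as a checkpoint (assumption).
        echo "resume"
    else
        echo "fresh"
    fi
}
```

For example, `resume_mode "$SHARED_OUTPUT_DIR" "$EXPERIMENT_NAME"` prints `resume` when the experiment directory already holds files under `models/`.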
+
+## Output structure
+
+```
+$SHARED_OUTPUT_DIR/
+├── outputs/
+│   └── <experiment_name>/
+│       ├── models/          # Checkpoints and final model
+│       ├── tensorboard/     # TensorBoard logs
+│       ├── config/          # Config snapshots per run
+│       └── run.log          # Latest run log
+└── logs/
+    └── <experiment_name>_<timestamp>/
+        └── train.log        # Full stdout/stderr
+```
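
Given the layout above, the most recent `train.log` for an experiment can be located by modification time. The sketch below assumes only the `<experiment_name>_<timestamp>/train.log` pattern shown in the tree; `latest_train_log` is a hypothetical helper, and it sorts by directory mtime because the timestamp format is not specified.

```bash
# Hypothetical helper: print the train.log path of the newest run directory.
# $1 is the logs root (e.g. "$SHARED_OUTPUT_DIR/logs"), $2 the experiment name.
latest_train_log() {
    local log_root="$1" experiment="$2" newest

    # -t sorts newest first, -d lists the directories themselves.
    newest="$(ls -1td -- "$log_root/${experiment}_"*/ 2>/dev/null | head -n 1)"

    # No matching run directory: signal failure to the caller.
    [ -n "$newest" ] || return 1

    printf '%s\n' "${newest%/}/train.log"
}
```

A typical use is `tail -f "$(latest_train_log "$SHARED_OUTPUT_DIR/logs" "$EXPERIMENT_NAME")"`.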
+
+## Adapting to your cluster
+
+1. **Edit `#SBATCH` directives** in `sbatch_train.sh`:
+   - `--partition=` your cluster's GPU partition
+   - `--gres=gpu:N` matching your GPU count
+   - `--cpus-per-task=`, `--mem=`, `--time=` as appropriate
+
+2. **CUDA version**: set `TORCH_CUDA=cuXXX` to match your driver
+   (e.g. `cu121` for CUDA 12.1, `cu124` for CUDA 12.4).
+
+3. **Docker base image**: set `DOCKER_BASE_IMAGE` if your cluster uses a different CUDA runtime.
+
+4. **File systems**: `SHARED_OUTPUT_DIR` should be on a shared/persistent filesystem
+   visible from all nodes (NFS, Lustre, etc.). The sbatch script sets `chmod 1777` for
+   NFS compatibility when using Docker.
+
+5. **No Docker?** The scripts fall back to bare-metal install automatically.
+   Ensure the node has internet access (for pip) or pre-install deps via `cluster_install_deps.sh`.
+
+## Troubleshooting
+
+- **`SHARED_OUTPUT_DIR is not set`**: export it before running cluster scripts.
+- **Lock file conflict**: if a previous job crashed, remove `$SHARED_OUTPUT_DIR/.lock_<experiment>`.
+- **`steps_per_epoch=0`**: samples too low for the batch size. Increase `PREDECODER_TRAIN_SAMPLES`.
+- **torch.compile segfaults**: set `PREDECODER_TORCH_COMPILE=0`.
+- **pip install fails in container**: ensure the base image has `python3.11-dev` and `build-essential`.
