Commit 45bd63f

Add Docker-based training infrastructure and cluster scripts
Dockerfile, SLURM sbatch template, and supporting shell scripts for running pre-decoder training on remote GPU nodes (Docker, bare-metal, or SLURM). Includes two production training configs (R=9, R=13), PREDECODER_LR_MILESTONES env override in train.py, and comprehensive TRAINING.md documentation.

Made-with: Cursor
1 parent 62cf726 commit 45bd63f

12 files changed

Lines changed: 857 additions & 2 deletions

.dockerignore

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
outputs/
logs/
frames_data/
models/
dev_history/
.git/
.venv*/
venv/
__pycache__/
*.out
*.err
*.html

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -65,3 +65,8 @@ logs/
 /MR_COMMENT_DRAFT.md
 /MR_REVIEW_SUMMARY.md
 /dev_history/
+
+# SLURM job logs
+predecoder_train_*.out
+predecoder_train_*.err
+sbatch_logs/

Dockerfile

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
#
# Pre-decoder training image.
#
# Build:
#   docker build -t predecoder-train .
#   docker build -t predecoder-train --build-arg TORCH_CUDA=cu124 .   # different CUDA
#
# Run:
#   docker run --rm --gpus all \
#     -v $(pwd):/app:ro -v $HOME/predecoder_outputs:/data \
#     -e SHARED_OUTPUT_DIR=/data \
#     predecoder-train
#
# See TRAINING.md for the full environment variable reference.

ARG BASE_IMAGE=nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04
FROM ${BASE_IMAGE}

ARG PYTHON_VERSION=3.11
ARG TORCH_CUDA=cu121

ENV DEBIAN_FRONTEND=noninteractive \
    PYTHONUNBUFFERED=1 \
    PREDECODER_PYTHON=/opt/venv/bin/python

RUN apt-get update -qq && \
    apt-get install -y -qq --no-install-recommends \
        python${PYTHON_VERSION} python${PYTHON_VERSION}-venv python${PYTHON_VERSION}-dev \
        curl git coreutils build-essential cmake && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

RUN python${PYTHON_VERSION} -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

COPY code/requirements_public_inference.txt /tmp/requirements_public_inference.txt
COPY code/requirements_public_train.txt /tmp/requirements_public_train.txt

RUN pip install --no-cache-dir --upgrade pip setuptools wheel && \
    pip install --no-cache-dir \
        -r /tmp/requirements_public_train.txt \
        --index-url "https://download.pytorch.org/whl/${TORCH_CUDA}" \
        --extra-index-url https://pypi.org/simple && \
    python -c "import torch; print('PyTorch', torch.__version__, '(CUDA build:', torch.version.cuda, ')')"

WORKDIR /app
CMD ["bash", "code/scripts/cluster_container_install_and_train.sh"]

TRAINING.md

Lines changed: 276 additions & 0 deletions
@@ -0,0 +1,276 @@
# Cluster / Remote Training Guide

This document covers running pre-decoder training on remote GPU nodes
using Docker containers, with optional SLURM integration.
For local single-machine usage, see `README.md`.

## Prerequisites

- Docker with NVIDIA GPU support (`nvidia-docker` / `--gpus`)
- One or more NVIDIA GPUs (H100, A100, or similar)
- A persistent directory for checkpoints and logs

## Quick start (Docker, recommended)

### Option A: build the image once, reuse everywhere

```bash
# Build (once, from repo root)
docker build -t predecoder-train .

# Optionally, for a different CUDA version:
docker build -t predecoder-train --build-arg TORCH_CUDA=cu124 .

# Train
docker run --rm --gpus all \
  -v $(pwd):/app:ro \
  -v $HOME/predecoder_outputs:/data \
  -e SHARED_OUTPUT_DIR=/data \
  predecoder-train
```

The image includes Python 3.11, PyTorch with CUDA, and all training dependencies.
Dependencies are baked in, so startup is fast and no internet access is needed at
runtime.

### Option B: install deps at runtime from a CUDA base image

If you cannot pre-build the image (e.g. in a locked-down environment):

```bash
docker run --rm --gpus all \
  -v $(pwd):/app:ro \
  -v $HOME/predecoder_outputs:/data \
  -e SHARED_OUTPUT_DIR=/data \
  -e INSTALL_DIR=/opt/predecoder_env \
  nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 \
  bash -c 'apt-get update -qq && apt-get install -y -qq python3.11 python3.11-venv python3.11-dev curl git build-essential cmake >/dev/null 2>&1; bash /app/code/scripts/cluster_container_install_and_train.sh'
```

This installs dependencies on every run, so it is slower. Use Option A when possible.

## Quick start (SLURM + Docker)

1. Build the image on a machine with Docker access:
   ```bash
   docker build -t predecoder-train .
   ```

2. Edit the `#SBATCH` directives in `code/scripts/sbatch_train.sh` for your cluster
   (partition name, GPU count, memory, time limit).

3. Submit:
   ```bash
   export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
   sbatch code/scripts/sbatch_train.sh
   ```

4. Monitor:
   ```bash
   tail -f predecoder_train_<jobid>.out
   ```

The sbatch script picks the first available option, in this order: pre-built image,
base CUDA image, bare-metal fallback.
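
The selection order can be sketched in Python as below. This is an illustration only, not the logic in `sbatch_train.sh`: the function names are hypothetical, and it assumes that "Docker is available" means a `docker` executable on `PATH` and that "pre-built image" means `docker image inspect` succeeds for the image name.

```python
import shutil
import subprocess


def docker_available() -> bool:
    """True if a `docker` executable is on PATH."""
    return shutil.which("docker") is not None


def image_present(image: str) -> bool:
    """True if `docker image inspect` finds the image locally (exit code 0)."""
    result = subprocess.run(
        ["docker", "image", "inspect", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def choose_backend(image, has_docker=docker_available, has_image=image_present):
    """Return 'prebuilt', 'base', or 'bare-metal', in that priority order.

    The two probe functions are injectable so the decision can be exercised
    without a Docker daemon.
    """
    if not has_docker():
        return "bare-metal"  # no Docker at all: install directly on the node
    if has_image(image):
        return "prebuilt"    # dependencies already baked into the image
    return "base"            # assumed: fall back to the CUDA base image
```

Calling `choose_backend("predecoder-train")` on a node without Docker returns `"bare-metal"`.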

## Script overview

| Script | Purpose |
|--------|---------|
| `Dockerfile` | Builds `predecoder-train` image with all dependencies. |
| `code/scripts/local_run.sh` | Core runner. Handles Hydra config, GPU detection, logging, checkpoints. Works everywhere. |
| `code/scripts/cluster_install_deps.sh` | Installs Python 3.11+ and training dependencies into an isolated environment. |
| `code/scripts/cluster_train.sh` | Sets up output dirs, exports env, then calls `local_run.sh`. Expects `SHARED_OUTPUT_DIR`. |
| `code/scripts/cluster_container_install_and_train.sh` | Runs inside a Docker container: install deps (if needed), then train. |
| `code/scripts/sbatch_train.sh` | SLURM submission script (template). Edit `#SBATCH` directives for your cluster. |

### Call chain

```
sbatch_train.sh  (or: docker run ... predecoder-train)
  ├─ (pre-built image)  → cluster_container_install_and_train.sh
  │                         └─ cluster_train.sh → local_run.sh
  ├─ (base CUDA image)  → cluster_container_install_and_train.sh
  │                         ├─ cluster_install_deps.sh
  │                         └─ cluster_train.sh → local_run.sh
  └─ (no Docker)        → cluster_install_deps.sh
                        → cluster_train.sh → local_run.sh
```

## Quick start (bare-metal node, no Docker)

If Docker is unavailable, you can install directly on the node:

```bash
# Install deps once
export INSTALL_DIR=$HOME/predecoder_env
bash code/scripts/cluster_install_deps.sh

# Train
export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
export PREDECODER_PYTHON=$INSTALL_DIR/venv/bin/python
bash code/scripts/cluster_train.sh
```

## Available training configs

| Config file | Model | R | Noise |
|-------------|-------|---|-------|
| `conf/config_qec_decoder_r9_fp8.yaml` | Model 1 | 9 | Depolarizing p=0.006 |
| `conf/config_qec_decoder_r13_fp8.yaml` | Model 4 | 13 | Depolarizing p=0.006 |
| `conf/config_public.yaml` | Any | Varies | User-defined |

Select a config by setting `CONFIG_NAME` (without the `.yaml` extension):
```bash
export CONFIG_NAME=config_qec_decoder_r13_fp8
```

## Environment variable reference

### Core variables

| Variable | Default | Description |
|----------|---------|-------------|
| `SHARED_OUTPUT_DIR` | *(required for cluster)* | Persistent directory for outputs, logs, checkpoints. |
| `EXPERIMENT_NAME` | `qec-decoder-depolarizing-r9-fp8` | Subdirectory under `outputs/` for this run. Change this when changing configs. |
| `CONFIG_NAME` | `config_qec_decoder_r9_fp8` | Hydra config name (file in `conf/` without `.yaml`). |
| `WORKFLOW` | `train` | `train` or `inference`. |
| `GPUS` | auto-detect | Number of GPUs. Must match SLURM `--gres=gpu:N`. |
| `FRESH_START` | `0` | Set `1` to ignore existing checkpoints and start from scratch. |

### Training overrides

| Variable | Default | Description |
|----------|---------|-------------|
| `PREDECODER_TRAIN_EPOCHS` | `100` | Total number of training epochs. |
| `PREDECODER_TRAIN_SAMPLES` | config-defined | Samples per epoch. Bypasses auto-scaling when set explicitly. |
| `PREDECODER_LR_MILESTONES` | config-defined | Comma-separated LR schedule milestone fractions (e.g. `0.25,0.5,1.0`). |
| `PREDECODER_TIMING_RUN` | unset | Set `1` for timing/benchmarking mode (disables some overhead). |
| `PREDECODER_TORCH_COMPILE` | unset | `0` to disable `torch.compile`, `1` to enable. |
| `PREDECODER_DISABLE_SDR` | unset | `1` to skip Syndrome Density Reduction computation (saves time). |
| `TORCH_COMPILE` | unset | Alternative way to control `torch.compile` (`0`/`1`). |
| `TORCH_COMPILE_MODE` | unset | `default`, `reduce-overhead`, or `max-autotune`. |
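
As an illustration of how a comma-separated override such as `PREDECODER_LR_MILESTONES` might be read, here is a minimal sketch. The actual parsing in `train.py` is not shown in this commit view, so the function name, the fallback behaviour, and the "strictly increasing" sanity check are all assumptions.

```python
import os


def milestones_from_env(default, env=None):
    """Return LR milestone fractions, preferring PREDECODER_LR_MILESTONES.

    `default` stands in for the config-defined milestones. `env` defaults to
    os.environ and is injectable for testing. Hypothetical helper, not the
    project's code.
    """
    env = os.environ if env is None else env
    raw = env.get("PREDECODER_LR_MILESTONES", "").strip()
    if not raw:
        return list(default)  # variable unset or empty: keep config values
    values = [float(part) for part in raw.split(",") if part.strip()]
    # Assumed sanity check: milestones should come in increasing order.
    if any(b <= a for a, b in zip(values, values[1:])):
        raise ValueError(f"milestones must be strictly increasing: {values}")
    return values
```

With `PREDECODER_LR_MILESTONES=0.25,0.5,1.0` exported, the helper returns `[0.25, 0.5, 1.0]`; with the variable unset it returns the config default unchanged.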

### Infrastructure variables

| Variable | Default | Description |
|----------|---------|-------------|
| `INSTALL_DIR` | `$HOME/predecoder_env` | Where `cluster_install_deps.sh` creates the Python environment. |
| `PREDECODER_PYTHON` | auto-detect | Explicit path to the Python binary. |
| `TORCH_CUDA` | `cu121` | PyTorch CUDA wheel tag (e.g. `cu121`, `cu124`, `cu130`). |
| `DOCKER_IMAGE` | `predecoder-train` | Pre-built Docker image name. |
| `DOCKER_BASE_IMAGE` | `nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04` | Fallback CUDA base image. |
| `PREDECODER_BASE_OUTPUT_DIR` | `$SHARED_OUTPUT_DIR/outputs` | Override the outputs root (advanced). |
| `PREDECODER_LOG_BASE_DIR` | `$SHARED_OUTPUT_DIR/logs` | Override the logs root (advanced). |
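
The defaults for the two directory overrides can be read as "use the override if set, otherwise derive from `SHARED_OUTPUT_DIR`". A minimal sketch of that resolution follows; `resolve_dirs` is a hypothetical name, and the real scripts do this in shell.

```python
import os
from pathlib import Path


def resolve_dirs(env=None):
    """Return (outputs_root, logs_root) following the documented defaults."""
    env = os.environ if env is None else env
    shared = env.get("SHARED_OUTPUT_DIR")
    if not shared:
        # Mirrors the error message listed under Troubleshooting.
        raise RuntimeError("SHARED_OUTPUT_DIR is not set")
    outputs = Path(env.get("PREDECODER_BASE_OUTPUT_DIR") or Path(shared) / "outputs")
    logs = Path(env.get("PREDECODER_LOG_BASE_DIR") or Path(shared) / "logs")
    return outputs, logs
```

For example, with only `SHARED_OUTPUT_DIR=/data` set, this yields `/data/outputs` and `/data/logs`.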

## Example SLURM configurations

### R=9, 1 GPU (Model 1)

```bash
export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
sbatch code/scripts/sbatch_train.sh
```

### R=13, 1 GPU (Model 4)

```bash
export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
EXPERIMENT_NAME=qec-decoder-depolarizing-r13-fp8 \
CONFIG_NAME=config_qec_decoder_r13_fp8 \
sbatch code/scripts/sbatch_train.sh
```

### R=13, 4 GPUs (Model 4)

Override SLURM resources on the command line:

```bash
export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
EXPERIMENT_NAME=qec-decoder-depolarizing-r13-fp8-4gpu \
CONFIG_NAME=config_qec_decoder_r13_fp8 \
GPUS=4 FRESH_START=1 \
sbatch --partition=<your-4gpu-partition> \
  --gres=gpu:4 --cpus-per-task=80 --mem=240G \
  code/scripts/sbatch_train.sh
```

### Resume a 1-GPU checkpoint on 4 GPUs

When moving from 1 to N GPUs mid-training, fix the sample count and LR milestones
so the schedule matches the original trajectory:

```bash
export SHARED_OUTPUT_DIR=$HOME/predecoder_outputs
EXPERIMENT_NAME=qec-decoder-depolarizing-r13-fp8 \
CONFIG_NAME=config_qec_decoder_r13_fp8 \
GPUS=4 \
PREDECODER_TRAIN_SAMPLES=8388608 \
PREDECODER_LR_MILESTONES="1.0,2.0,4.0" \
sbatch --partition=<your-4gpu-partition> \
  --gres=gpu:4 --cpus-per-task=80 --mem=240G \
  code/scripts/sbatch_train.sh
```

Milestone rescaling rule: if the original milestones are `[m1, m2, m3]` and you
increase the GPU count by a factor of `k`, the new milestones are `[m1*k, m2*k, m3*k]`.
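
The rescaling rule is a plain element-wise multiplication. A small sketch follows; the helper names are hypothetical, and the input `0.25,0.5,1.0` is just the example value from the variable table, not necessarily what the r13 config defines.

```python
def rescale_milestones(milestones, k):
    """Multiply every milestone by the GPU-count factor k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [m * k for m in milestones]


def to_env_value(milestones):
    """Format milestones as the comma-separated string the env var expects."""
    return ",".join(str(m) for m in milestones)


# Going from 1 GPU to 4 GPUs (k = 4):
new = rescale_milestones([0.25, 0.5, 1.0], 4)
print(to_env_value(new))  # prints 1.0,2.0,4.0
```

The printed string is the value used for `PREDECODER_LR_MILESTONES` in the 4-GPU resume example.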

## Multi-GPU training

- Training uses PyTorch DDP (`torch.distributed.run`). Set `GPUS=N` and ensure N GPUs are visible.
- Auto-scaling: with N GPUs, each GPU processes `num_samples / N` samples per epoch.
  To keep the *total* samples identical to a 1-GPU run, set `PREDECODER_TRAIN_SAMPLES` explicitly.
- LR milestones are expressed as fractions of total steps. Changing GPU count changes total steps,
  so milestones may need rescaling (see the resume example above).
- The `MASTER_PORT` is auto-selected if not set. Override it to avoid port conflicts
  when running multiple jobs on the same node.
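
To make the auto-scaling arithmetic concrete, the sketch below computes per-GPU samples and steps per epoch. It assumes integer division and a per-GPU batch size; the batch size of 1024 is a made-up illustration value, and the trainer's actual formula is not shown in this commit view.

```python
def steps_per_epoch(num_samples, num_gpus, batch_size):
    """Steps each GPU runs per epoch under the assumed integer-division scheme."""
    if num_gpus <= 0 or batch_size <= 0:
        raise ValueError("num_gpus and batch_size must be positive")
    per_gpu_samples = num_samples // num_gpus  # each GPU gets num_samples / N
    return per_gpu_samples // batch_size       # 0 means too few samples


# 8388608 samples over 4 GPUs with a hypothetical per-GPU batch of 1024:
print(steps_per_epoch(8388608, 4, 1024))  # prints 2048
# Too few samples for the batch size gives zero steps:
print(steps_per_epoch(1000, 4, 512))      # prints 0
```

Under these assumptions, a result of zero corresponds to the `steps_per_epoch=0` condition, which is resolved by raising `PREDECODER_TRAIN_SAMPLES`.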

## Resuming training

Training auto-resumes from the latest checkpoint found in
`$SHARED_OUTPUT_DIR/outputs/$EXPERIMENT_NAME/models/`.

- Same experiment name = resume. Different experiment name = fresh run.
- To force a clean restart on the same experiment: `export FRESH_START=1`.
- A lock file prevents two SLURM jobs from writing to the same experiment directory concurrently.
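
The resume rule can be pictured as "pick the newest checkpoint unless `FRESH_START=1`". The sketch below is an assumption-laden illustration: the `*.pt` pattern and the use of modification time to define "latest" are guesses, and `find_resume_checkpoint` is a hypothetical name.

```python
from pathlib import Path


def find_resume_checkpoint(models_dir, fresh_start=False, pattern="*.pt"):
    """Return the most recently modified checkpoint, or None for a fresh run."""
    if fresh_start:
        return None  # FRESH_START=1: ignore existing checkpoints
    models_dir = Path(models_dir)
    if not models_dir.is_dir():
        return None  # new experiment name: nothing to resume from
    candidates = [p for p in models_dir.glob(pattern) if p.is_file()]
    # "Latest" is taken to mean newest modification time (an assumption).
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

A caller would pass `$SHARED_OUTPUT_DIR/outputs/$EXPERIMENT_NAME/models` as `models_dir` and start from scratch whenever the result is `None`.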

## Output structure

```
$SHARED_OUTPUT_DIR/
├── outputs/
│   └── <experiment_name>/
│       ├── models/          # Checkpoints and final model
│       ├── tensorboard/     # TensorBoard logs
│       ├── config/          # Config snapshots per run
│       └── run.log          # Latest run log
└── logs/
    └── <experiment_name>_<timestamp>/
        └── train.log        # Full stdout/stderr
```

## Adapting to your cluster

1. **Edit `#SBATCH` directives** in `sbatch_train.sh`:
   - `--partition=` your cluster's GPU partition
   - `--gres=gpu:N` matching your GPU count
   - `--cpus-per-task=`, `--mem=`, `--time=` as appropriate

2. **CUDA version**: set `TORCH_CUDA=cuXXX` to match your driver
   (e.g. `cu121` for CUDA 12.1, `cu124` for CUDA 12.4).

3. **Docker base image**: set `DOCKER_BASE_IMAGE` if your cluster uses a different CUDA runtime.

4. **File systems**: `SHARED_OUTPUT_DIR` should be on a shared/persistent filesystem
   visible from all nodes (NFS, Lustre, etc.). The sbatch script sets `chmod 1777` for
   NFS compatibility when using Docker.

5. **No Docker?** The scripts fall back to bare-metal install automatically.
   Ensure the node has internet access (for pip) or pre-install deps via `cluster_install_deps.sh`.
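
For the CUDA version step, the `cuXXX` tag follows the pattern shown in the examples (`cu121` for CUDA 12.1). A small helper can derive the tag and the matching wheel index URL used in the Dockerfile. The helper names are hypothetical, and a wheel index is not guaranteed to exist for every tag this produces.

```python
def torch_cuda_tag(cuda_version: str) -> str:
    """Turn a 'major.minor' CUDA version such as '12.1' into a tag such as 'cu121'."""
    major, minor = cuda_version.split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def torch_index_url(tag: str) -> str:
    """Build the PyTorch wheel index URL in the form the Dockerfile uses."""
    return f"https://download.pytorch.org/whl/{tag}"


print(torch_cuda_tag("12.4"))                    # prints cu124
print(torch_index_url(torch_cuda_tag("12.1")))   # prints https://download.pytorch.org/whl/cu121
```

The resulting tag is what would be passed as `--build-arg TORCH_CUDA=...` at image build time or exported as `TORCH_CUDA` for the install script.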

## Troubleshooting

- **`SHARED_OUTPUT_DIR is not set`**: export it before running cluster scripts.
- **Lock file conflict**: if a previous job crashed, remove `$SHARED_OUTPUT_DIR/.lock_<experiment>`.
- **`steps_per_epoch=0`**: samples too low for the batch size. Increase `PREDECODER_TRAIN_SAMPLES`.
- **torch.compile segfaults**: set `PREDECODER_TORCH_COMPILE=0`.
- **pip install fails in container**: ensure the base image has `python3.11-dev` and `build-essential`.
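
To show why a crashed job can leave a stale lock behind, here is a minimal sketch of an exclusive-create lock file. It is not the project's implementation (the scripts are shell and their locking code is not shown here); only the `.lock_<experiment>` path shape comes from the text above, and the function names are hypothetical.

```python
import os


def lock_path(shared_output_dir, experiment):
    """Path of the per-experiment lock file, following the documented naming."""
    return os.path.join(shared_output_dir, f".lock_{experiment}")


def acquire_lock(path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        # O_EXCL makes creation fail if another job already holds the lock.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_lock(path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

If a job dies before calling `release_lock`, the file stays on disk and the next `acquire_lock` returns `False`, which is the situation where the lock has to be removed by hand.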
