Reproducible streaming data order across runs (fixed data_seed + pinned num_workers) by ksd3 · Pull Request #70 · Smith42/astroPT

ksd3 · 2026-06-13T19:25:06Z

Make the streaming training data appear in the same example order for every run/config, so a scaling study compares model/objective on identical data.

1. scripts/train.py — fixed data_seed, shuffle before shard.
Previously the stream was shuffled after node-sharding with a rank-dependent seed (1337 + seed_offset), so the global order depended on GPU count and the model seed. Now it shuffles the full stream with a fixed data_seed = 1337 before sharding; each rank takes a deterministic disjoint slice. Model/DDP seeding unchanged; single-GPU behavior unchanged.

2. config/pythia-like/*.py — pin num_workers 32 → 8.
Byte-identical batch order also needs the same num_workers (DataLoader worker interleaving sets the order). 32/rank × 4 ranks ≈ 128 concurrent HF range-requests on the 764 GB stream → HTTP 429, forcing per-run overrides that break reproducibility. A fixed 8 (~32 streams/node) is 429-safe and identical across all 12 configs.

(The one-epoch max_iters change is split out into #71.)

Shuffle the streaming train set with a fixed `data_seed` BEFORE sharding across DDP ranks, rather than after sharding with a per-rank seed (`1337 + seed_offset`). The global example order now depends only on `data_seed` -- not on the GPU count or the per-rank model seed -- so every run/config in a scaling study trains on the identical example order, while each rank still takes a deterministic, disjoint slice of that one canonical order. Model-init/DDP seeding (`torch.manual_seed(1337 + seed_offset)`) is unchanged.
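A minimal sketch of the shuffle-before-shard idea, in plain Python rather than the actual `scripts/train.py` code. It assumes a finite list of example ids, a full (not buffered) shuffle, and strided slicing per rank; `canonical_order`, `rank_slice`, and `global_batches` are hypothetical names used only for illustration. It shows that when the global batch size is held constant, each optimizer step sees the same set of examples regardless of how many ranks split the work.

```python
import random

DATA_SEED = 1337  # fixed; independent of rank and of the model seed


def canonical_order(n_examples, data_seed=DATA_SEED):
    """One global shuffle that depends only on data_seed."""
    order = list(range(n_examples))
    random.Random(data_seed).shuffle(order)
    return order


def rank_slice(order, rank, world_size):
    """Assumption: each rank takes every world_size-th example (strided slice)."""
    return order[rank::world_size]


def global_batches(n_examples, world_size, global_batch):
    """Set of examples consumed at each step, pooled over all ranks."""
    per_rank = global_batch // world_size
    order = canonical_order(n_examples)
    shards = [rank_slice(order, r, world_size) for r in range(world_size)]
    steps = n_examples // global_batch
    return [
        sorted(x for s in shards for x in s[t * per_rank:(t + 1) * per_rank])
        for t in range(steps)
    ]


if __name__ == "__main__":
    one_gpu = global_batches(64, world_size=1, global_batch=8)
    four_gpu = global_batches(64, world_size=4, global_batch=8)
    print(one_gpu == four_gpu)
```

Shuffling after sharding with a per-rank seed would break this property, because each rank would then permute its own slice differently and the pooled batch at a given step would change with the rank count.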

Lower num_workers from 32 to 8 in every config/pythia-like config. They were already identical (so already reproducible), but 32 workers/rank x 4 ranks fans out to ~128 concurrent HF range-requests and rate-limits (HTTP 429) when streaming the 764 GB Smith42/galaxies set. 8/rank keeps ~32 streams/node -- 429-safe -- and, being a single fixed value across all 12 configs, means no launcher needs to override num_workers per-run (which would change DataLoader worker interleaving and so the example order). With the fixed data_seed shuffle, this makes the data order genuinely identical across the scaling study.
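The dependence of example order on num_workers can be illustrated with a simplified model, not the real DataLoader internals. The sketch assumes each worker reads a disjoint, strided subset of the dataset shards and that batches are yielded round-robin across workers; `worker_batches`, `loader_order`, and `concurrent_streams` are hypothetical helpers. It also reproduces the stream-count arithmetic quoted above.

```python
def worker_batches(shards, worker_id, num_workers, batch_size):
    """Batches from one worker. Assumption: worker w reads shards w, w + num_workers, ..."""
    items = [x for shard in shards[worker_id::num_workers] for x in shard]
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def loader_order(shards, num_workers, batch_size):
    """Batch order when workers are drained round-robin (simplified model)."""
    per_worker = [
        worker_batches(shards, w, num_workers, batch_size)
        for w in range(num_workers)
    ]
    out = []
    for step in range(max(len(b) for b in per_worker)):
        for batches in per_worker:
            if step < len(batches):
                out.append(batches[step])
    return out


def concurrent_streams(num_workers, ranks):
    """Upper bound on simultaneous streams: one per worker per rank."""
    return num_workers * ranks


if __name__ == "__main__":
    shards = [list(range(i, i + 4)) for i in range(0, 16, 4)]
    print(loader_order(shards, num_workers=2, batch_size=2))
    print(loader_order(shards, num_workers=4, batch_size=2))
    print(concurrent_streams(32, 4), concurrent_streams(8, 4))
```

Under this model the same shards produce a different batch sequence for 2 and 4 workers, which is why a per-run num_workers override would change the data order even with a fixed data_seed.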

ksd3 added 2 commits June 13, 2026 15:24

ksd3 changed the title ~~Deterministic streaming data order across runs (fixed data_seed, shuffle before shard)~~ Reproducible streaming data order across runs (fixed data_seed + pinned num_workers) Jun 13, 2026

ksd3 changed the title ~~Reproducible streaming data order across runs (fixed data_seed + pinned num_workers)~~ Scaling-study determinism: fixed data_seed, pinned num_workers, one-epoch budget Jun 13, 2026

ksd3 force-pushed the feat/deterministic-data-order branch from 95c393b to 80132c5 Compare June 13, 2026 19:55

ksd3 changed the title ~~Scaling-study determinism: fixed data_seed, pinned num_workers, one-epoch budget~~ Reproducible streaming data order across runs (fixed data_seed + pinned num_workers) Jun 13, 2026

ksd3 mentioned this pull request Jun 13, 2026

Exact one-epoch training from local data (padding-free distributed sampler) #72

Open

ksd3 force-pushed the feat/deterministic-data-order branch from 80132c5 to f21d4cd Compare June 13, 2026 21:50

ksd3 requested a review from Smith42 June 13, 2026 21:51

Smith42 mentioned this pull request Jun 14, 2026

Pythia configs: effective batch 320 + clean one-epoch stop #73

Merged

ksd3 wants to merge 2 commits into main from feat/deterministic-data-order
