Reproducible streaming data order across runs (fixed data_seed + pinned num_workers)#70

Open
ksd3 wants to merge 2 commits into main from feat/deterministic-data-order

ksd3 commented Jun 13, 2026

Collaborator

Make the streaming training data arrive in the same example order for every run and config, so that a scaling study compares models and objectives on identical data.

1. scripts/train.py: fixed data_seed, shuffle before shard.
Previously the stream was shuffled after node-sharding with a rank-dependent seed (1337 + seed_offset), so the global example order depended on both the GPU count and the model seed. Now the full stream is shuffled with a fixed data_seed = 1337 before sharding, and each rank takes a deterministic, disjoint slice of that single order. Model/DDP seeding is unchanged, and single-GPU behavior is unchanged.

2. config/pythia-like/*.py: pin num_workers 32 → 8.
Byte-identical batch order also requires the same num_workers in every run, because DataLoader worker interleaving determines the order in which batches are emitted. With 32 workers/rank × 4 ranks, a node opens roughly 128 concurrent HF range requests against the 764 GB stream and gets rate-limited (HTTP 429), which forces per-run num_workers overrides that break reproducibility. A fixed value of 8 (~32 streams/node) is 429-safe and identical across all 12 configs.
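
As a toy illustration of item 1, here is a minimal sketch of why shuffling before sharding makes the global order independent of the GPU count. It is not the code in scripts/train.py: the stream is modelled as an in-memory list, sharding is assumed to be a strided slice per rank, and all function names are hypothetical.

```python
import random


def shuffle_then_shard(examples, data_seed, rank, world_size):
    """New scheme: one global shuffle with a fixed seed, then a disjoint strided slice per rank."""
    order = list(examples)
    random.Random(data_seed).shuffle(order)
    return order[rank::world_size]


def shard_then_shuffle(examples, base_seed, rank, world_size):
    """Old scheme: slice per rank first, then shuffle each slice with a rank-dependent seed."""
    shard = list(examples)[rank::world_size]
    random.Random(base_seed + rank).shuffle(shard)
    return shard


def global_order(per_rank_shards):
    """Order in which examples are consumed when all ranks step in lockstep."""
    return [ex for step in zip(*per_rank_shards) for ex in step]


examples = list(range(16))
# The canonical order is what a single GPU sees with the fixed data seed.
canonical = shuffle_then_shard(examples, data_seed=1337, rank=0, world_size=1)

for world_size in (1, 2, 4):
    new = [shuffle_then_shard(examples, 1337, r, world_size) for r in range(world_size)]
    old = [shard_then_shuffle(examples, 1337, r, world_size) for r in range(world_size)]
    # The new scheme reproduces the canonical order at every GPU count;
    # the old scheme's order is whatever the per-rank shuffles happen to give.
    print(world_size, global_order(new) == canonical, global_order(old) == canonical)
```

In this model the new scheme matches the canonical order for every world size that divides the number of examples, and the old scheme matches it at world_size = 1 only by construction (the rank offset is 0), which mirrors the "single-GPU behavior unchanged" note above. A real streaming dataset typically shuffles through a bounded buffer rather than a full permutation, so this shows the ordering argument, not the exact mechanics.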

(The one-epoch max_iters change is split out into #71.)

ksd3 added 2 commits June 13, 2026 15:24
Shuffle the streaming train set with a fixed `data_seed` BEFORE sharding
across DDP ranks, rather than after sharding with a per-rank seed
(`1337 + seed_offset`). The global example order now depends only on
`data_seed` -- not on the GPU count or the per-rank model seed -- so every
run/config in a scaling study trains on the identical example order, while
each rank still takes a deterministic, disjoint slice of that one canonical
order. Model-init/DDP seeding (`torch.manual_seed(1337 + seed_offset)`) is
unchanged.

Lower num_workers from 32 to 8 in every config/pythia-like config. They
were already identical (so already reproducible), but 32 workers/rank x 4
ranks fans out to ~128 concurrent HF range-requests and rate-limits
(HTTP 429) when streaming the 764 GB Smith42/galaxies set. 8/rank keeps
~32 streams/node -- 429-safe -- and, being a single fixed value across all
12 configs, means no launcher needs to override num_workers per-run (which
would change DataLoader worker interleaving and so the example order). With
the fixed data_seed shuffle, this makes the data order genuinely identical
across the scaling study.
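
The fan-out arithmetic and the interleaving argument in this commit message can be sketched as follows. This is a simplified model, not PyTorch or the project's loader: it assumes dataset shards are assigned to workers round-robin and that the loader takes one batch from each worker in turn. The function names and the shard and batch counts are hypothetical.

```python
def concurrent_streams(num_workers, ranks_per_node):
    """Each DataLoader worker on each rank holds its own stream to the dataset host."""
    return num_workers * ranks_per_node


def batch_order(num_shards, batches_per_shard, num_workers):
    """Toy model of worker interleaving for one rank.

    Shards are dealt to workers round-robin; the loader then takes one
    batch from each worker in turn, skipping workers that have run dry.
    Each batch is identified as (shard_index, batch_index_within_shard).
    """
    worker_queues = [
        [(s, b) for s in range(w, num_shards, num_workers) for b in range(batches_per_shard)]
        for w in range(num_workers)
    ]
    order = []
    turn = 0
    while any(worker_queues):
        queue = worker_queues[turn % num_workers]
        if queue:
            order.append(queue.pop(0))
        turn += 1
    return order


print(concurrent_streams(32, 4), concurrent_streams(8, 4))  # 128 32
# Same data, same shuffle, different worker count: the batch order differs.
print(batch_order(8, 2, 2) == batch_order(8, 2, 4))  # False
```

Under these assumptions the set of batches is identical for any worker count, but the sequence is not, which is why a per-run num_workers override would undo the fixed data_seed shuffle.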
ksd3 changed the title from "Deterministic streaming data order across runs (fixed data_seed, shuffle before shard)" to "Reproducible streaming data order across runs (fixed data_seed + pinned num_workers)" on Jun 13, 2026
ksd3 changed the title from "Reproducible streaming data order across runs (fixed data_seed + pinned num_workers)" to "Scaling-study determinism: fixed data_seed, pinned num_workers, one-epoch budget" on Jun 13, 2026
ksd3 force-pushed the feat/deterministic-data-order branch from 95c393b to 80132c5 on June 13, 2026 19:55
ksd3 changed the title from "Scaling-study determinism: fixed data_seed, pinned num_workers, one-epoch budget" to "Reproducible streaming data order across runs (fixed data_seed + pinned num_workers)" on Jun 13, 2026
ksd3 force-pushed the feat/deterministic-data-order branch from 80132c5 to f21d4cd on June 13, 2026 21:50
ksd3 requested a review from Smith42 on June 13, 2026 21:51