15' + 1 hour record: Document-level shuffling #76
Conversation
Cycle WD in anti-phase with SWA's LR: low (0.77x base) at each SWA epoch start (high LR, exploration), high (1.23x base) at epoch end (near-zero LR, settling). Pre-SWA: hold at 1.0x then linear decay to 0.77x over 40% -> SWA start. Stacks on top of PR qlabs-eng#76 doc shuffling (3.204) for val loss ~3.200.
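A minimal sketch of that schedule as described, assuming the 40% point is measured as a fraction of the pre-SWA steps and that the multiplier scales a base weight decay; all names here are hypothetical, not this repo's code:

```python
def wd_multiplier(step, swa_start, swa_epoch_len,
                  lo=0.77, hi=1.23, decay_from=0.4):
    """Weight-decay multiplier cycled in anti-phase with SWA's LR."""
    if step < swa_start:
        # Pre-SWA: hold at 1.0x, then decay linearly to `lo` so the
        # first SWA epoch begins in the low-WD / high-LR phase.
        knee = int(decay_from * swa_start)
        if step < knee:
            return 1.0
        t = (step - knee) / max(1, swa_start - knee)
        return 1.0 + t * (lo - 1.0)
    # SWA phase: ramp `lo` -> `hi` within each epoch, opposite the LR decay.
    t = ((step - swa_start) % swa_epoch_len) / swa_epoch_len
    return lo + t * (hi - lo)
```

With a torch-style optimizer this would be applied each step as `optimizer.param_groups[0]["weight_decay"] = base_wd * wd_multiplier(step, ...)`.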
akshayvegesna left a comment:
Thank you, left a couple of minor comments. To merge the update to prepare_data, we also need to ensure that unlimited/train.py (and any other code) doesn't break. Can you update the Dataloader so that the no-document-level-shuffling path is equivalent to the previous behavior?
print()
verify_hash(val_path)
verify_hash(train_path)
for p in (val_path, train_path):
Can you add an assert that the hash doesn't change? It's nice to have confidence that nothing has changed upstream in the huggingface dataset.
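Something like the sketch below would do it (not the PR's code; the shard filenames and pinned digests are placeholders):

```python
import hashlib
import os

# Placeholder digests -- pin the real SHA-256 values once computed.
EXPECTED_SHA256 = {
    "val.bin": "<pinned sha256 hex digest>",
    "train.bin": "<pinned sha256 hex digest>",
}

def verify_hash(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    expected = EXPECTED_SHA256[os.path.basename(path)]
    assert h.hexdigest() == expected, (
        f"{path}: got {h.hexdigest()}, expected {expected} -- "
        "the upstream huggingface dataset may have changed"
    )
```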
@@ -12,12 +12,17 @@
import math
Since we don't have a record here, let's leave these file changes out until we do.
EVAL_TOKENS = 10_000_000
DATA_DIR = "fineweb_data"
BOS_ID = 50256  # <|endoftext|>
RUNS_DIR = "runs"
Let's add runs/ to .gitignore
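i.e. something like (assuming all run artifacts land under the runs/ directory from RUNS_DIR above):

```
# .gitignore
runs/
```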
def resolve_run_dir(run_name):
    if run_name:
        return run_name, os.path.join(RUNS_DIR, run_name)
    name = "".join(secrets.choice(string.ascii_lowercase + string.digits) for _ in range(6))
nit: I suggest we keep the datetime run id.
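For example (a sketch; the signature follows the diff above, while the timestamp format is my assumption):

```python
import os
from datetime import datetime

RUNS_DIR = "runs"  # as in the diff above

def resolve_run_dir(run_name):
    if run_name:
        return run_name, os.path.join(RUNS_DIR, run_name)
    # Datetime run id instead of a random suffix: sorts chronologically
    # and makes it obvious when a run was launched.
    name = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return name, os.path.join(RUNS_DIR, name)
```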
Alright, all have been changed. All the non-record dataloaders (unlimited/, two_hour/, research/) should be token-for-token identical to before. Thanks!
Thank you, looks good. Merging.
Instead of shuffling the order of a fixed set of batches every epoch, shuffle the order of the documents in the training set, then tokenize into batches and re-shuffle.
15' log file:
Old record: 3.345
New record: 3.332
1hr log file:
Old record: 3.211
New record: 3.204
This takes just one of the many components of #67 so that comparison to previous results is straightforward. It does not change the eval tokens or batching at all -- all it does is make training shuffle the documents before re-batching. This adds training diversity, since any given token is preceded by different tokens every epoch (see the sketch below).
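A minimal sketch of the idea, assuming documents arrive as arrays of token ids; the function and argument names (including the doc_shuffle flag mirroring --no-doc-shuffle) are illustrative, not the PR's actual code:

```python
import numpy as np

def epoch_batches(docs, seq_len, doc_shuffle=True, seed=0):
    """Build one epoch of (seq_len,)-shaped training batches.

    docs: list of 1-D arrays of token ids, one per document.
    With doc_shuffle=False this reduces to the old behavior:
    a fixed token stream whose batches are merely reordered.
    """
    rng = np.random.default_rng(seed)
    order = np.arange(len(docs))
    if doc_shuffle:
        rng.shuffle(order)  # new: reorder documents each epoch
    stream = np.concatenate([np.asarray(docs[i]) for i in order])
    n = len(stream) // seq_len
    batches = stream[: n * seq_len].reshape(n, seq_len)
    return batches[rng.permutation(n)]  # both paths re-shuffle batch order
```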
I also added a 2-hour record attempt, which was pre-empted while running. Submitting now, hoping someone can run it. The code is backwards compatible with --no-doc-shuffle, and I added some QOL things like saving the log, training file, and checkpoints to a run directory. This not only makes dev easier, but also replication. I hope people adopt this standard of attaching logs.