
15' + 1 hour record: Document-level shuffling #76

Merged
akshayvegesna merged 3 commits into qlabs-eng:main from samacqua:shuffle-train
Apr 24, 2026

Conversation

@samacqua
Contributor

Instead of shuffling the order of a fixed set of batches every epoch, shuffle the order of the documents in the training set, then tokenize into batches + re-shuffle.

15' log file:
Old record: 3.345
New record: 3.332

1hr log file:
Old record: 3.211
New record: 3.204

This takes one of the many components of #67 so that comparison to previous results is straightforward. It does not change the eval tokens or batching at all -- all it does is make the training loop shuffle the documents before re-batching. This adds training diversity, since any given token is preceded by different tokens every epoch.
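As a minimal sketch of the idea (the function and helper names here are hypothetical, not the actual train.py code), a per-epoch document-level shuffle might look like:

```python
import random

BOS_ID = 50256  # <|endoftext|>, matching the constant in train.py

def make_epoch_batches(docs, tokenize, seq_len, rng):
    """Hypothetical sketch: shuffle document order, tokenize into one
    flat stream with BOS separators, chop the stream into fixed-length
    sequences, then shuffle the sequence order as well."""
    docs = list(docs)
    rng.shuffle(docs)  # 1) shuffle documents, so context differs per epoch
    stream = []
    for doc in docs:
        stream.append(BOS_ID)  # BOS token separates documents
        stream.extend(tokenize(doc))
    # 2) re-batch the stream into fixed-length training sequences
    seqs = [stream[i:i + seq_len]
            for i in range(0, len(stream) - seq_len + 1, seq_len)]
    rng.shuffle(seqs)  # 3) re-shuffle the sequence order, as before
    return seqs
```

With this shape, two epochs over the same documents yield different token neighborhoods, which is the extra diversity the PR description refers to.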

I also added a 2-hour record run, which was pre-empted while running. Submitting it now in the hope that someone can run it. The code is backwards compatible via --no-doc-shuffle, and I added some QOL features such as saving the log, training file, and checkpoints for each run. This not only makes development easier but also helps replication. I hope people adopt this standard of attaching logs.

shmublu added a commit to shmublu/slowrun that referenced this pull request Apr 22, 2026
Cycle WD in anti-phase with SWA's LR: low (0.77x base) at each SWA epoch start
(high LR, exploration), high (1.23x base) at epoch end (near-zero LR, settling).
Pre-SWA: hold at 1.0x then linear decay to 0.77x over 40% -> SWA start.

Stacks on top of PR qlabs-eng#76 doc shuffling (3.204) for val loss ~3.200.
Contributor

@akshayvegesna akshayvegesna left a comment


Thank you, I left a couple of minor comments. To merge the update to prepare_data, we also need to ensure that unlimited/train.py (and any other code) doesn't break. Can you update the DataLoader with a no-document-level-shuffling version that is equivalent to the previous behavior, so it keeps working?

Comment thread prepare_data.py Outdated
print()
verify_hash(val_path)
verify_hash(train_path)
for p in (val_path, train_path):
Contributor


Can you add an assert that the hash doesn't change? It's nice to have confidence that nothing has changed upstream in the Hugging Face dataset.
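A sketch of what such a check could look like (the function shape is illustrative; the real verify_hash in prepare_data.py may differ):

```python
import hashlib

def verify_hash(path, expected_sha256):
    """Hash the file in chunks and assert it matches a pinned digest,
    so any upstream change to the dataset fails loudly."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    assert digest == expected_sha256, (
        f"{path}: hash mismatch ({digest} != {expected_sha256})"
    )
```

Pinning the expected digest in the script means a silently re-uploaded dataset shows up as a hard failure rather than a quiet change in results.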

Comment thread two_hour/train.py
@@ -12,12 +12,17 @@
import math
Contributor


Since we don't have a record here, let's leave these file changes out until we do.

Comment thread train.py
EVAL_TOKENS = 10_000_000
DATA_DIR = "fineweb_data"
BOS_ID = 50256 # <|endoftext|>
RUNS_DIR = "runs"
Contributor


Let's add runs/ to .gitignore

Comment thread train.py Outdated
def resolve_run_dir(run_name):
if run_name:
return run_name, os.path.join(RUNS_DIR, run_name)
name = "".join(secrets.choice(string.ascii_lowercase + string.digits) for _ in range(6))
Contributor


nit: I suggest we keep the datetime run id.
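A minimal sketch of that suggestion (reusing the RUNS_DIR constant from the diff; the exact format string is an assumption):

```python
import os
from datetime import datetime

RUNS_DIR = "runs"

def resolve_run_dir(run_name=None):
    """Fall back to a sortable datetime run id when no name is given,
    instead of a random lowercase/digit string."""
    name = run_name or datetime.now().strftime("%Y%m%d-%H%M%S")
    return name, os.path.join(RUNS_DIR, name)
```

A datetime id has the nice property that run directories sort chronologically in a plain directory listing.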

shmublu added a commit to shmublu/slowrun that referenced this pull request Apr 23, 2026
@samacqua
Contributor Author

Alright, all of these have been changed. All the non-record dataloaders (unlimited/, two_hour/, research/) should be token-for-token identical to before. Thanks!

@akshayvegesna
Contributor

Thank you, looks good. Merging.

@akshayvegesna akshayvegesna merged commit a473277 into qlabs-eng:main Apr 24, 2026