Record: 11L XSA4 + Tight SWA + FA3 + Two-Phase TTT (3-seed mean val_bpb=1.1227) #417
Open: EthanYangTW wants to merge 18 commits into openai:main from EthanYangTW:submission/fa3-twophase-ttt-3seed
+5,149 −0
Commits (18, all by EthanYangTW):

- 6ed2fa5 Add aggressive submission: QAT + BigramHash 12288 + stride 32
- bccc688 Add submission: QAT + BigramHash 12K + Stride 32
- 8a0ebf2 Add LoRA TTT (test-time training) to submission
- db5c5dd Remove TTT, bump BigramHash to 13312
- 65f54ac Revert BigramHash to 12288 (13312 over 16MB)
- 32790dd Add training log and update submission with 8xH100 results
- 4bd048c Fix SWA description: 50 steps not 25
- a3b1212 Update records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32…
- 38dff06 Add #315/#388 full stack: 11L, XSA4, Partial RoPE, LN Scale, EMA, Lat…
- 67fa031 Fix speed and artifact size: disable FA3, reduce BigramHash, EMA ever…
- 943597d Reduce BigramHash to 2048, increase pruning to 10% to fit under 16MB
- 308ed62 Remove BigramHash, increase pruning to 15% — must fit under 16MB
- 78f998e New record: 11L XSA4 + Tight SWA + TTT (based on PR #374)
- 4c37972 Fix TTT: compile + DDP + 3 epochs + batch 64 for speed
- e83a277 Fix FA3 import: add fallback to flash_attn and SDPA
- 573f735 New record: 11L XSA4 + Tight SWA + Two-Phase TTT (1.1258 BPB)
- 358a426 Update: FA3 Hopper + aggressive two-phase TTT (val_bpb=1.1216)
- 550a5db Record: 11L XSA4 + FA3 + Two-Phase TTT (3-seed mean 1.1227)
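Commit e83a277 adds an FA3 import fallback to flash_attn and SDPA. A minimal sketch of the usual import-cascade pattern this implies, assuming the commonly published module names (`flash_attn_interface` for the FA3 Hopper build, `flash_attn` for FlashAttention-2), not necessarily the PR's exact code:

```python
def resolve_attention_backend():
    """Import-cascade for the attention kernel: prefer FlashAttention-3
    (Hopper wheel), then FlashAttention-2, then fall back to PyTorch's
    built-in scaled_dot_product_attention (SDPA)."""
    try:
        from flash_attn_interface import flash_attn_func  # FA3 Hopper build
        return "fa3"
    except ImportError:
        pass
    try:
        from flash_attn import flash_attn_func            # FlashAttention-2
        return "flash_attn"
    except ImportError:
        pass
    # torch.nn.functional.scaled_dot_product_attention ships with recent
    # PyTorch, so callers can always rely on this last resort.
    return "sdpa"

backend = resolve_attention_backend()
```

The cascade fails soft: a box without Hopper wheels still trains, just on a slower kernel.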
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/README.md (new file, +31 lines):
# QAT + BigramHash(12288) + Stride 32

## Summary

Built on the current SOTA (`10L_Int5MLP_MuonWD04_SWA50`) with the following improvements:

- **QAT (Quantization-Aware Training):** STE fake-quantize during training — int5 for MLP layers, int6 for attention. Reduces post-quantization degradation.
- **BigramHash 12288:** Increased from 10240 to 12288 buckets for better bigram coverage.
- **Eval stride 32:** Reduced from 64 to 32 for more overlapping context windows during evaluation.
- **Magnitude pruning 5%:** Increased from 3% to improve compression ratio.
- **SWA every 50 steps:** Checkpoint averaging during warmdown.

## Architecture

- 10 transformer layers, dim=512, 8 heads, 4 KV heads
- 3x MLP with SmearGate
- BigramHash(12288) with bigram_dim=128
- Mixed quantization: int5 MLP, int6 attention
- zstd-22 compression

## Results

```
seed=2024: val_bpb=1.14443, artifact=15,902,583 bytes
```

## Command

```bash
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
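The QAT bullet in the README describes STE fake-quantization (int5 for MLP layers, int6 for attention). A minimal numpy sketch of the forward fake-quantize step, with illustrative weight values that are not from the PR; the actual training code presumably does this in PyTorch, where a straight-through estimator such as `w + (q - w).detach()` lets gradients bypass the rounding:

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor fake quantization: snap weights to a
    bits-wide signed integer grid, then rescale back to float.
    During QAT the rounding sits in the forward pass only; the
    straight-through estimator copies gradients past it."""
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax          # per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Illustrative weights (not from the PR): int5 keeps only 32 levels,
# so nearby values collapse onto the same grid point.
w = np.array([0.30, -0.12, 0.066, -0.244])
w5 = fake_quantize(w, bits=5)
```

Training against the quantized forward pass is what reduces the post-quantization degradation the README mentions: the optimizer learns weights that already sit well on the int5/int6 grid.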
records/track_10min_16mb/2026-03-21_QAT_BigramHash12K_Stride32/submission.json (new file, +9 lines):
```json
{
  "name": "QAT + BigramHash(12288) + Stride 32",
  "val_loss": 1.14443,
  "bytes_total": 15902583,
  "blurb": "10 layers, QAT with STE (int5 MLP / int6 attn), BigramHash 12288, eval stride 32, magnitude pruning 5%, SWA every 50 steps, zstd-22. Based on 10L_Int5MLP_MuonWD04_SWA50.",
  "author": "fbedev",
  "github_id": "fbedev",
  "date": "2026-03-21"
}
```
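The blurb's "SWA every 50 steps" refers to checkpoint averaging during warmdown. A minimal sketch of the running mean over checkpoints, using a plain dict of floats where the real code would average a PyTorch state_dict; this is an illustrative simplification, not the PR's code:

```python
def swa_update(avg_state, new_state, n_averaged):
    """Fold one more checkpoint into the running average of weights.
    Returns the updated average and the new checkpoint count."""
    if avg_state is None:                    # first checkpoint seeds the average
        return dict(new_state), 1
    for k, v in new_state.items():
        # Incremental mean: avg += (x - avg) / (n + 1)
        avg_state[k] += (v - avg_state[k]) / (n_averaged + 1)
    return avg_state, n_averaged + 1

# Average three toy "checkpoints" taken 50 steps apart.
avg, n = None, 0
for step_weights in ({"w": 1.0}, {"w": 3.0}, {"w": 5.0}):
    avg, n = swa_update(avg, step_weights, n)
```

The incremental form avoids keeping all checkpoints in memory: only the running average and a counter persist between the 50-step snapshots.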
Review comment: The PR description focuses on the 2026-03-22 two-phase TTT record, but this PR also adds a separate 2026-03-21 QAT/BigramHash record. If this extra record is intentional, it would help to mention it in the PR description (or split it into a separate PR) so reviewers understand the scope.