TrigramHash + XSA + TTT on thwu1 SOTA stack - val_bpb pending H100 #430
Draft
sahiee-dev wants to merge 1 commit into openai:main from
Conversation
Dropped QAT: 8% throughput penalty kills 600s budget (per PR openai#360). Three novel additions on thwu1 SOTA base (1.1428):
- TrigramHash(20480, dim=32): trigram embedding signal, bigram 10240→4096
- XSA: orthogonal self-value removal, last 4 layers, from PR openai#287
- TTT: 3-epoch SGD on val tokens before eval, all ranks, ~47s budget

Fixed rank bug: TTT runs on all 8 ranks independently (not rank 0 only).
Artifact: ~15.64MB. Smoke tests passing. H100 validation pending.
TrigramHash + XSA + TTT on thwu1 SOTA stack
Base: 10L Int5 MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 by thwu1 → 1.1428 val_bpb
Novel additions
TrigramHash(20480, dim=32)
Adds a trigram (t-2, t-1, t) embedding signal alongside BigramHash.
Captures 3-token phrase patterns and morphological structure that
bigrams cannot represent. Budget: the bigram table is reduced 10240→4096 to fund
the trigram table within the 16MB limit. Zero runtime overhead: a pure
embedding-table lookup, as sketched below.
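A minimal sketch of the idea, assuming a standard PyTorch embedding table. The bucket count and dim are from this PR; the hash-mixing constants, padding scheme, and module wiring are illustrative, not the actual implementation:

```python
import torch
import torch.nn as nn

class TrigramHash(nn.Module):
    """Hash (t-2, t-1, t) trigrams into a small embedding table."""
    def __init__(self, num_buckets: int = 20480, dim: int = 32):
        super().__init__()
        self.num_buckets = num_buckets
        self.emb = nn.Embedding(num_buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) int64 token ids
        pad = tokens.new_zeros(tokens.size(0), 2)   # positions 0,1 see dummy context
        padded = torch.cat([pad, tokens], dim=1)    # (B, T+2)
        t2, t1, t0 = padded[:, :-2], padded[:, 1:-1], padded[:, 2:]
        # Mix (t-2, t-1, t) with fixed odd multipliers, then bucket by modulo.
        h = (t0 * 1000003 + t1 * 998244353 + t2 * 754974721) % self.num_buckets
        return self.emb(h)                          # (B, T, dim), added to the token embedding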
XSA — Exclusive Self Attention (last 4 layers)
Removes the self-value bias from the attention output via an orthogonal projection.
GQA-aware implementation from PR #287, adapted for our transposed layout.
Zero parameter cost. Lets the last 4 layers attend more purely to
context rather than self-reinforcing their own value vectors; see the sketch below.
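A minimal sketch of one way to read "orthogonal self-value removal": project each position's attention output orthogonal to that position's own value vector. The function name, tensor layout, and eps are assumptions; PR #287 holds the actual GQA-aware, transposed-layout version:

```python
import torch

def remove_self_value(attn_out: torch.Tensor, v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # attn_out, v: (B, H, T, Dh). For GQA, expand v across query-head groups first,
    # e.g. v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1).
    # Per position and head, subtract the component of the output along its own value.
    coeff = (attn_out * v).sum(-1, keepdim=True) / (v * v).sum(-1, keepdim=True).clamp_min(eps)
    return attn_out - coeff * v
```

This is parameter-free, which matches the zero-parameter-cost claim above.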
TTT — Test Time Training
3-epoch SGD (lr=0.002, momentum=0.9) over the validation tokens before
eval, with the bottom 6 layers frozen. Runs identically on all 8 ranks:
deterministic, in-order SGD on identical data, so no broadcast is needed.
Original weights are restored after evaluation. Budget: ~47 seconds.
A sketch follows.
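A minimal sketch of the procedure, assuming a PyTorch model with an indexable `model.layers` stack and a forward pass that returns the LM loss. The lr/momentum/epochs/frozen-layer values are from this PR; everything else (names, `evaluate` helper) is illustrative:

```python
import copy
import torch

def ttt_adapt(model, val_batches, epochs=3, lr=0.002, momentum=0.9, n_frozen=6):
    """Adapt on validation tokens; return a weight snapshot for restoring after eval."""
    snapshot = copy.deepcopy(model.state_dict())
    for layer in model.layers[:n_frozen]:          # freeze the bottom 6 layers
        for p in layer.parameters():
            p.requires_grad_(False)
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                          lr=lr, momentum=momentum)
    model.train()
    for _ in range(epochs):
        for x, y in val_batches:                   # identical order on every rank -> no broadcast
            loss = model(x, targets=y)             # assumes model returns the LM loss
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    model.eval()
    return snapshot

# Usage: adapt, evaluate, then restore the original weights.
# snapshot = ttt_adapt(model, val_batches)
# val_bpb = evaluate(model)                        # hypothetical eval helper
# model.load_state_dict(snapshot)
```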
QAT was evaluated and dropped — a confirmed negative result (PR #360):
the 8% throughput penalty outweighs the regularization benefit within the 600s budget.
Artifact: ~15.64MB | Status: Draft — H100 validation pending
Smoke tests passing. 3-seed results and an ablation table will be added
before marking ready for review.