
TrigramHash + XSA + TTT on thwu1 SOTA stack - val_bpb pending H100 #430

Draft
sahiee-dev wants to merge 1 commit into openai:main from sahiee-dev:main

@sahiee-dev

TrigramHash + XSA + TTT on thwu1 SOTA stack

Base: 10L Int5 MLP + BigramHash(10240) + SWA(0.4) + WD=0.04 by thwu1 (1.1428 val_bpb)

Novel additions

TrigramHash(20480, dim=32)
Adds a trigram (t-2, t-1, t) embedding signal alongside BigramHash.
Captures 3-token phrase patterns and morphological structure that
bigrams cannot represent. Budget: BigramHash is reduced 10240→4096 to fund
the trigram table within 16MB. Zero runtime overhead: pure embedding table lookup.
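
A minimal NumPy sketch of a hashed trigram embedding, to make the lookup concrete. This is not the PR's code: the class name `TrigramHash`, the multipliers in the hash, the zero padding for the first two positions, and the init scale are all assumptions chosen for illustration. Only the table shape (20480 rows, dim=32) comes from the description above.

```python
import numpy as np

class TrigramHash:
    """Hashed trigram embedding: each (t-2, t-1, t) triple maps to one table row."""

    def __init__(self, num_rows=20480, dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.num_rows = num_rows
        # Assumed init scale; the real table would be a trained parameter.
        self.table = rng.normal(0.0, 0.02, size=(num_rows, dim)).astype(np.float32)

    def bucket_ids(self, tokens):
        t = np.asarray(tokens, dtype=np.int64)
        # Pad with two zeros so every position has a (t-2, t-1) history.
        padded = np.concatenate((np.zeros(2, dtype=np.int64), t))
        prev2, prev1, cur = padded[:-2], padded[1:-1], padded[2:]
        # Hypothetical hash: arbitrary odd multipliers mixed with XOR.
        mixed = (cur * 1000003) ^ (prev1 * 10007) ^ (prev2 * 101)
        return mixed % self.num_rows

    def __call__(self, tokens):
        # One row gather per position, no matmul involved.
        return self.table[self.bucket_ids(tokens)]

emb = TrigramHash()
vectors = emb([1, 2, 3, 9, 1, 2, 3])  # one 32-dim vector per position
```

Identical trigrams always land in the same row, while unrelated trigrams may collide. That collision rate is the trade-off behind the table-size budget.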

XSA: Exclusive Self Attention (last 4 layers)
Removes self-value bias from the attention output via orthogonal projection.
GQA-aware implementation from PR #287, adapted for our transposed layout.
Zero parameter cost. Lets the last 4 layers attend more purely to
context rather than self-reinforcing their own value vectors.
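
The orthogonal projection can be sketched as follows. This is an illustrative NumPy version, not the implementation from PR #287: it assumes a `(heads, seq, head_dim)` layout, assumes the projection is taken against each token's own value vector, and the name `xsa_project` and the `eps` term are hypothetical.

```python
import numpy as np

def xsa_project(y, v, eps=1e-6):
    """Remove from each attention output the component along that token's own value.

    y: (num_heads, seq, head_dim) attention output.
    v: (num_kv_heads, seq, head_dim) value vectors (GQA: fewer KV heads).
    """
    group = y.shape[0] // v.shape[0]
    # Query head h shares the value vectors of KV head h // group.
    v_full = np.repeat(v, group, axis=0)
    v_unit = v_full / (np.linalg.norm(v_full, axis=-1, keepdims=True) + eps)
    # Scalar projection of y onto the unit value direction, per head and token.
    coeff = np.sum(y * v_unit, axis=-1, keepdims=True)
    return y - coeff * v_unit

rng = np.random.default_rng(0)
y = rng.normal(size=(8, 5, 16))   # 8 query heads
v = rng.normal(size=(2, 5, 16))   # 2 KV heads, so groups of 4
out = xsa_project(y, v)
```

The operation introduces no learned weights, which is consistent with the zero parameter cost stated above.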

TTT: Test-Time Training
3-epoch SGD (lr=0.002, momentum=0.9) over validation tokens before
eval, with the bottom 6 layers frozen. Runs identically on all 8 ranks:
deterministic, in-order SGD on identical data, so no broadcast is needed.
Original weights are restored after evaluation. Budget: ~47 seconds.

QAT was evaluated and dropped: a confirmed negative result (PR #360), since the
8% throughput penalty outweighs the regularization benefit within the 600s budget.
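
The TTT procedure described above (frozen lower layers, fixed-order SGD with momentum, original weights left intact) can be sketched on a toy model. Everything here is an assumption for illustration: the two-matrix linear model stands in for the network, `"bottom"` stands in for the frozen layers, and `sgd_ttt` and `mse_loss_and_grads` are hypothetical names. Only `epochs=3`, `lr=0.002`, and `momentum=0.9` come from the description.

```python
import numpy as np

def sgd_ttt(params, frozen, grad_fn, batches, epochs=3, lr=0.002, momentum=0.9):
    """Adapt a copy of `params`; names in `frozen` are never updated."""
    adapted = {name: w.copy() for name, w in params.items()}
    velocity = {n: np.zeros_like(w) for n, w in params.items() if n not in frozen}
    for _ in range(epochs):
        for batch in batches:  # fixed order, so every run sees the same sequence
            grads = grad_fn(adapted, batch)
            for name in velocity:
                velocity[name] = momentum * velocity[name] + grads[name]
                adapted[name] -= lr * velocity[name]
    return adapted  # `params` itself is untouched, so nothing needs restoring

def mse_loss_and_grads(params, batch):
    """Toy stand-in for the language-model loss: y ~ x @ bottom @ top."""
    x, y = batch
    hidden = x @ params["bottom"]
    err = hidden @ params["top"] - y
    scale = 2.0 / err.size
    grads = {
        "bottom": scale * x.T @ (err @ params["top"].T),
        "top": scale * hidden.T @ err,
    }
    return float(np.mean(err ** 2)), grads

rng = np.random.default_rng(0)
params = {"bottom": 0.5 * rng.normal(size=(4, 3)), "top": np.zeros((3, 1))}
true_top = rng.normal(size=(3, 1))
batches = []
for _ in range(4):
    x = rng.normal(size=(16, 4))
    batches.append((x, x @ params["bottom"] @ true_top))

grad_fn = lambda p, b: mse_loss_and_grads(p, b)[1]
# Two independent runs play the role of two ranks with identical data.
rank0 = sgd_ttt(params, {"bottom"}, grad_fn, batches)
rank1 = sgd_ttt(params, {"bottom"}, grad_fn, batches)
```

Because both runs start from the same weights and consume the same batches in the same order, they reach the same adapted weights without any communication. On real multi-GPU hardware this additionally relies on deterministic kernels.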

Artifact: ~15.64MB | Status: Draft, H100 validation pending.
Smoke tests passing. 3-seed results and an ablation table will be added
before marking ready for review.

Dropped QAT: 8% throughput penalty kills 600s budget (per PR openai#360).

Three novel additions on thwu1 SOTA base (1.1428):
- TrigramHash(20480, dim=32): trigram embedding signal, bigram 10240->4096
- XSA: orthogonal self-value removal, last 4 layers, from PR openai#287
- TTT: 3-epoch SGD on val tokens before eval, all ranks, ~47s budget
  Fixed rank bug: TTT runs on all 8 ranks independently (not rank 0 only)

Artifact: ~15.64MB. Smoke tests passing. H100 validation pending.
sahiee-dev changed the title from "TrigramHash + XSA + TTT on thwu1 SOTA stack — val_bpb pending H100" to "TrigramHash + XSA + TTT on thwu1 SOTA stack - val_bpb pending H100" on Mar 22, 2026