Non-record: 11L mixed int5/int6 + working QAT + TTT (val_bpb=1.1466)#421

Open
vytautas-bunevicius wants to merge 3 commits into openai:main from vytautas-bunevicius:submission/sota-attempt

Conversation

@vytautas-bunevicius

Summary

Non-record submission stacking 8 techniques on PR #315 (1.1248):

  • Working QAT fix (PR #315's QAT was dead code due to torch.compile)
  • Mixed int5 (MLP) / int6 (attention) quantization + 3% magnitude pruning
  • Test-time training (3 epochs SGD post-quant, 83s on 8xH100)
  • BigramHash 10240 (up from 2048)
  • 64 learnable memory tokens
  • Backout connection (1 scalar param)
  • Per-head temperature (88 params)
  • Eval stride 32
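
The mixed-precision quantization and magnitude-pruning bullets above can be sketched roughly as follows. This is a minimal NumPy sketch, not the submission's actual code: the function names, per-tensor symmetric scaling, and global (rather than per-layer) pruning threshold are all assumptions.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, frac: float = 0.03) -> np.ndarray:
    """Zero out roughly the smallest-magnitude `frac` of weights."""
    k = int(frac * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor fake quantization to a signed `bits`-wide grid,
    e.g. bits=5 for MLP weights and bits=6 for attention weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    # Round to the integer grid, then map back to float ("fake" quant)
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
```

In a QAT setup the fake-quantize step would run inside the forward pass (with a straight-through gradient) rather than once post-hoc.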

val_bpb = 1.1466 on 8xH100 SXM. Ran with PyTorch SDPA instead of FA3 (110ms/step, 5129 steps instead of ~7000). Artifact: 14.7MB.
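
The test-time-training step (a few SGD epochs after quantization) can be illustrated with a minimal sketch. The linear least-squares model below is a hypothetical stand-in: the real run fine-tunes the quantized language model, not a linear probe, and the learning rate and objective here are assumptions.

```python
import numpy as np

def ttt_sgd(w0: np.ndarray, X: np.ndarray, y: np.ndarray,
            lr: float = 0.1, epochs: int = 3) -> np.ndarray:
    """Run a few full-batch SGD epochs on a least-squares objective,
    mimicking the post-quantization test-time-training pass."""
    w = w0.copy()
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)  # d/dw mean((Xw - y)^2)
        w -= lr * grad
    return w
```

The design point the bullet makes is that the adaptation budget is tiny (3 epochs, 83s on 8xH100), so a plain SGD loop suffices and the quantized weights stay close to their starting point.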

Test plan

  • Smoke test on 1xH100 (completed)
  • Full run on 8xH100 SXM (completed, 605s training + 340s eval)
  • Rerun with FlashAttention 3 for improved score
  • 3-seed reproducibility (single seed so far)

