Non-record: PrismLM v3 — DiffTransformer V2 + NorMuon + TrigramHash (val_bpb=1.1715)#418

Open
yashverms wants to merge 1 commit into openai:main from yashverms:prismlm-v3-non-record
Conversation

@yashverms

Summary

Non-record submission exploring three novel techniques not yet attempted in any merged or open PR, built on the proven PR #315 technique stack.

Novel Contributions

  1. DiffTransformer V2 Attention (last 2 layers) — noise-cancelled attention via differential softmax maps (Ye et al., ICLR 2025 Oral)
  2. NorMuon Optimizer — replaces Muon with per-neuron row normalization after Newton-Schulz orthogonalization, ~11% better compute efficiency
  3. TrigramHash + Context-Aware N-gram Gating — extends BigramHash with trigram patterns and a learned sigmoid gate that modulates n-gram signal based on hidden state (inspired by DeepSeek Engram)
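The core of contribution 1 is straightforward to sketch. A minimal, hedged illustration of differential attention in the style of Ye et al.: two softmax attention maps are computed from split query/key projections, and the second map (scaled by a learnable λ) is subtracted from the first to cancel common-mode attention noise. Function and argument names here are illustrative, not taken from the submitted `train_gpt.py`:

```python
import torch
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam):
    """Differential attention sketch: softmax map 1 minus lam * softmax map 2.

    All inputs are (batch, heads, seq, head_dim); lam is a scalar
    (learnable in the real architecture).
    """
    scale = q1.size(-1) ** -0.5
    a1 = F.softmax((q1 @ k1.transpose(-2, -1)) * scale, dim=-1)
    a2 = F.softmax((q2 @ k2.transpose(-2, -1)) * scale, dim=-1)
    # Subtracting the second map cancels attention mass both maps
    # agree on (the "noise"), sharpening the remaining pattern.
    return (a1 - lam * a2) @ v
```

With `lam=0` this degenerates to ordinary scaled-dot-product attention, which makes the noise-cancellation term easy to ablate.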

Architecture

  • 11 layers, 512 dim, 8/4 heads (GQA), MLP 3× (ReLU²)
  • XSA on last 6 layers, DiffAttn on last 2
  • Partial RoPE (16/64 dims), LN depth scaling, SmearGate
  • BigramHash(2048) + TrigramHash(2048) + context-aware gate
  • U-Net skips, tied embeddings, logit softcap
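Contribution 3 can also be sketched compactly. The following is an assumed illustration of the context-aware n-gram gating (class and hash constants are hypothetical, not from the submitted code): trigrams are hashed into a small embedding table, and a sigmoid gate computed from the hidden state decides per position how much n-gram signal to mix in:

```python
import torch
import torch.nn as nn

class GatedTrigramHash(nn.Module):
    """Hypothetical sketch of TrigramHash + context-aware gating."""

    def __init__(self, dim, n_buckets=2048):
        super().__init__()
        self.table = nn.Embedding(n_buckets, dim)  # hashed trigram embeddings
        self.gate = nn.Linear(dim, 1)              # learned sigmoid gate
        self.n_buckets = n_buckets

    def forward(self, tokens, hidden):
        # tokens: (B, T) int64 token ids; hidden: (B, T, dim)
        t0 = tokens
        t1 = tokens.roll(1, dims=1)  # previous token
        t2 = tokens.roll(2, dims=1)  # token before that
        # Cheap multiplicative hash of the trigram into n_buckets (constants illustrative).
        idx = (t0 * 1000003 + t1 * 10007 + t2) % self.n_buckets
        g = torch.sigmoid(self.gate(hidden))  # (B, T, 1), context-aware strength
        return hidden + g * self.table(idx)
```

The gate lets the model suppress the n-gram shortcut where the hidden state already carries enough context, which is the "Engram-inspired" part of the design.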

Results

| Metric | Value |
| --- | --- |
| val_bpb (post-quant) | 1.1715 (no sliding window) |
| Pre-quant val_bpb | 1.1607 |
| Steps | 4,600 (600 s wallclock) |
| Params | 27,518,587 |
| Artifact | 15,586,651 bytes (int6 + zstd-22) |
| GPU | 8×H100 SXM |

Gap Analysis

The score is ~0.029 bpb behind the merged SOTA (1.1428). Key factors: no sliding-window eval (~0.03 bpb), a small BigramHash (2048 vs 10240), NorMuon momentum of 0.95 vs the proven 0.99, and an SDPA fallback instead of Flash Attention 3. The submitted code fixes the eval issues (sliding window re-enabled, correct 16 MB decimal limit).
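Since the gap analysis hinges on the NorMuon momentum setting, here is a hedged sketch of what contribution 2 does (an assumed reconstruction, not the submitted implementation): a Muon-style Newton-Schulz orthogonalization of the momentum buffer, followed by the NorMuon addition of per-neuron (row) RMS normalization of the update:

```python
import torch

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration approximating the orthogonal
    factor of a 2-D gradient matrix (the Muon step)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T  # iterate on the wide orientation for stability
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def normuon_update(momentum_buf, grad, beta=0.95):
    """NorMuon sketch: Muon update, then normalize each output
    neuron's row to unit RMS (beta=0.95 as in this submission)."""
    momentum_buf.mul_(beta).add_(grad)
    O = newton_schulz(momentum_buf)
    row_rms = O.pow(2).mean(dim=1, keepdim=True).sqrt()
    return O / (row_rms + 1e-7)
```

The per-row normalization equalizes update magnitude across neurons, which is where the reported ~11% compute-efficiency gain is claimed to come from; switching `beta` from 0.95 to the proven 0.99 is a one-line change.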

Why This Is Interesting

  • First submission using Differential Attention in this competition
  • First submission using NorMuon optimizer
  • First submission with context-aware n-gram gating
  • Documents which 2026 architectural innovations transfer (or don't) to the 16MB parameter-constrained regime

Test plan

  • Training completes within 600s on 8×H100
  • Artifact under 16,000,000 bytes
  • Post-quant roundtrip evaluation produces valid val_bpb
  • Code is self-contained in train_gpt.py
  • Sliding window eval (re-enabled in submitted code, not yet run)
  • Multi-seed verification (single seed only in this submission)

Made with Cursor

…igramHash

Three novel techniques on top of PR openai#315's stack:
1. DiffTransformer V2 attention (last 2 layers) for noise-cancelled attention
2. NorMuon optimizer with per-neuron row normalization
3. TrigramHash + context-aware n-gram gating

11L/512d, XSA6, Partial RoPE, int6+zstd-22. Post-quant val_bpb=1.1715
(without sliding window eval). 8xH100, 600s, 15.59MB artifact.

Made-with: Cursor
