Non-record: PrismLM v3 — DiffTransformer V2 + NorMuon + TrigramHash (val_bpb=1.1715)#418

Open
yashverms wants to merge 1 commit into openai:main from yashverms:prismlm-v3-non-record
Conversation

@yashverms

Summary

Non-record submission exploring three novel techniques not yet attempted in any merged or open PR, built on the proven PR #315 technique stack.

Novel Contributions

  1. DiffTransformer V2 Attention (last 2 layers) — noise-cancelled attention via differential softmax maps (Ye et al., ICLR 2025 Oral)
  2. NorMuon Optimizer — replaces Muon with per-neuron row normalization after Newton-Schulz orthogonalization, ~11% better compute efficiency
  3. TrigramHash + Context-Aware N-gram Gating — extends BigramHash with trigram patterns and a learned sigmoid gate that modulates n-gram signal based on hidden state (inspired by DeepSeek Engram)
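The core of contribution 1 is straightforward to sketch. A minimal, hedged illustration of differential attention in the style of Ye et al.: two softmax attention maps are computed from split query/key projections, and the second map (scaled by a learnable λ) is subtracted from the first to cancel common-mode attention noise. Function and argument names here are illustrative, not taken from the submitted `train_gpt.py`:

```python
import torch
import torch.nn.functional as F

def diff_attention(q1, k1, q2, k2, v, lam):
    """Differential attention sketch: softmax map 1 minus lam * softmax map 2.

    All inputs are (batch, heads, seq, head_dim); lam is a scalar
    (learnable in the real architecture).
    """
    scale = q1.size(-1) ** -0.5
    a1 = F.softmax((q1 @ k1.transpose(-2, -1)) * scale, dim=-1)
    a2 = F.softmax((q2 @ k2.transpose(-2, -1)) * scale, dim=-1)
    # Subtracting the second map cancels attention mass both maps
    # agree on (the "noise"), sharpening the remaining pattern.
    return (a1 - lam * a2) @ v
```

With `lam=0` this degenerates to ordinary scaled-dot-product attention, which makes the noise-cancellation term easy to ablate.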

Architecture

  • 11 layers, 512 dim, 8/4 heads (GQA), MLP 3× (ReLU²)
  • XSA on last 6 layers, DiffAttn on last 2
  • Partial RoPE (16/64 dims), LN depth scaling, SmearGate
  • BigramHash(2048) + TrigramHash(2048) + context-aware gate
  • U-Net skips, tied embeddings, logit softcap
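Contribution 3 can also be sketched compactly. The following is an assumed illustration of the context-aware n-gram gating (class and hash constants are hypothetical, not from the submitted code): trigrams are hashed into a small embedding table, and a sigmoid gate computed from the hidden state decides per position how much n-gram signal to mix in:

```python
import torch
import torch.nn as nn

class GatedTrigramHash(nn.Module):
    """Hypothetical sketch of TrigramHash + context-aware gating."""

    def __init__(self, dim, n_buckets=2048):
        super().__init__()
        self.table = nn.Embedding(n_buckets, dim)  # hashed trigram embeddings
        self.gate = nn.Linear(dim, 1)              # learned sigmoid gate
        self.n_buckets = n_buckets

    def forward(self, tokens, hidden):
        # tokens: (B, T) int64 token ids; hidden: (B, T, dim)
        t0 = tokens
        t1 = tokens.roll(1, dims=1)  # previous token
        t2 = tokens.roll(2, dims=1)  # token before that
        # Cheap multiplicative hash of the trigram into n_buckets (constants illustrative).
        idx = (t0 * 1000003 + t1 * 10007 + t2) % self.n_buckets
        g = torch.sigmoid(self.gate(hidden))  # (B, T, 1), context-aware strength
        return hidden + g * self.table(idx)
```

The gate lets the model suppress the n-gram shortcut where the hidden state already carries enough context, which is the "Engram-inspired" part of the design.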

Results

| Metric | Value |
| --- | --- |
| val_bpb (post-quant) | 1.1715 (no sliding window) |
| Pre-quant val_bpb | 1.1607 |
| Steps | 4,600 (600 s wallclock) |
| Params | 27,518,587 |
| Artifact | 15,586,651 bytes (int6 + zstd-22) |
| GPU | 8×H100 SXM |

Gap Analysis

The score is ~0.029 bpb behind the merged SOTA (1.1428). Key factors: no sliding-window eval (~0.03 bpb), a small BigramHash (2048 vs 10240), NorMuon momentum of 0.95 vs the proven 0.99, and an SDPA fallback instead of Flash Attention 3. The submitted code fixes the eval issues (sliding window re-enabled, correct 16 MB decimal limit).
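Since the gap analysis hinges on the NorMuon momentum setting, here is a hedged sketch of what contribution 2 does (an assumed reconstruction, not the submitted implementation): a Muon-style Newton-Schulz orthogonalization of the momentum buffer, followed by the NorMuon addition of per-neuron (row) RMS normalization of the update:

```python
import torch

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration approximating the orthogonal
    factor of a 2-D gradient matrix (the Muon step)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T  # iterate on the wide orientation for stability
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def normuon_update(momentum_buf, grad, beta=0.95):
    """NorMuon sketch: Muon update, then normalize each output
    neuron's row to unit RMS (beta=0.95 as in this submission)."""
    momentum_buf.mul_(beta).add_(grad)
    O = newton_schulz(momentum_buf)
    row_rms = O.pow(2).mean(dim=1, keepdim=True).sqrt()
    return O / (row_rms + 1e-7)
```

The per-row normalization equalizes update magnitude across neurons, which is where the reported ~11% compute-efficiency gain is claimed to come from; switching `beta` from 0.95 to the proven 0.99 is a one-line change.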

Why This Is Interesting

  • First submission using Differential Attention in this competition
  • First submission using NorMuon optimizer
  • First submission with context-aware n-gram gating
  • Documents which 2026 architectural innovations transfer (or don't) to the 16MB parameter-constrained regime

Test plan

  • Training completes within 600s on 8×H100
  • Artifact under 16,000,000 bytes
  • Post-quant roundtrip evaluation produces valid val_bpb
  • Code is self-contained in train_gpt.py
  • Sliding window eval (re-enabled in submitted code, not yet run)
  • Multi-seed verification (single seed only in this submission)

Made with Cursor

…igramHash

Three novel techniques on top of PR openai#315's stack:
1. DiffTransformer V2 attention (last 2 layers) for noise-cancelled attention
2. NorMuon optimizer with per-neuron row normalization
3. TrigramHash + context-aware n-gram gating

11L/512d, XSA6, Partial RoPE, int6+zstd-22. Post-quant val_bpb=1.1715
(without sliding window eval). 8xH100, 600s, 15.59MB artifact.

Made-with: Cursor
