EBLS Learned Sharing (10min/16MB) #433

Open
Robby955 wants to merge 1 commit into openai:main from Robby955:submission/ebls-learned-sharing

Conversation

@Robby955

Summary

  • Val BPB: 1.3441 (post-quant) / 1.2105 (pre-quant)
  • Artifact: 16,224,826 bytes (int6+zstd-22)
  • Compute: 8×H100 SXM, 4572 steps, 10 min wallclock

Empirical Bayes Layer Sharing (EBLS): 3 shared transformer blocks × 3 virtual layers = 9 effective layers, with per-virtual-layer rank-8 LoRA deviations gated by learned shrinkage factors γ_i = σ(logit_i).
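A minimal numpy sketch of the gating idea, assuming hypothetical shapes and zero-initialized LoRA factors (names like `virtual_layer_weight` are illustrative, not from the submission's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, rank, n_virtual = 1024, 8, 9

# One shared weight matrix reused by every virtual layer (hypothetical shapes).
W_shared = rng.standard_normal((d, d)) * 0.02

# Per-virtual-layer rank-8 LoRA factors and shrinkage logits.
A = rng.standard_normal((n_virtual, rank, d)) * 0.02
B = np.zeros((n_virtual, d, rank))   # zero-init: training starts fully shared
logits = np.full(n_virtual, -4.0)    # gamma = sigmoid(logit) starts near 0

def virtual_layer_weight(i):
    # gamma_i gates how far virtual layer i may deviate from the shared
    # weights; gamma_i -> 0 recovers exact sharing, gamma_i -> 1 allows
    # the full rank-8 deviation B_i @ A_i.
    gamma = sigmoid(logits[i])
    return W_shared + gamma * (B[i] @ A[i])

x = rng.standard_normal(d)
y = virtual_layer_weight(0) @ x
```

With zero-initialized `B`, every virtual layer is initially identical to the shared block, so the gammas reported in the table below measure how much specialization training actually chose to buy.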

Key finding

The model discovers the optimal sharing pattern from data: MLP gammas converge to 0 (fully shared) across all virtual layers, while attention shows minimal specialization only in early layers. This provides empirical evidence for architectural choices that other submissions make by intuition.

| Virtual Layer | Attn γ | MLP γ  |
|---------------|--------|--------|
| 0             | 0.0035 | 0.0012 |
| 1             | 0.0013 | 0.0000 |
| 2             | 0.0012 | 0.0000 |
| 3–8           | 0.0000 | 0.0000 |

Architecture

  • 1024-dim, 16Q/4KV heads (GQA), 3× MLP with ReLU²
  • SmearGate, BigramHash(10240), U-Net skip connections
  • Int6 STE QAT + zstd-22, Muon+Adam, SWA
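For the quantization bullet, here is a sketch of the forward pass of int6 fake-quantization as typically used in STE-based QAT (the per-tensor `scale` and function name are assumptions, not the submission's actual code; in training, the backward pass would pass gradients straight through the rounding):

```python
import numpy as np

def quantize_int6_ste(w, scale):
    # Symmetric int6 range is [-32, 31]. Forward: round-to-nearest onto
    # the int6 grid, then dequantize. The straight-through estimator
    # treats this whole op as identity in the backward pass.
    q = np.clip(np.round(w / scale), -32, 31)
    return q * scale

w = np.array([0.5, -1.2, 0.031, 2.0])
scale = 2.0 / 31  # hypothetical per-tensor scale covering |w| <= 2.0
wq = quantize_int6_ste(w, scale)
```

Values within the representable range round to within half a quantization step, which is what keeps the post-quant BPB gap (1.3441 vs 1.2105 here) bounded.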

Technical writeup

Full method description with James-Stein statistical foundations: https://github.com/Robby955/parameter-golf-ebls
