records(non-record-16mb): JEPA-on-LM 14-run ablation (negative result)#2142

Open
eren23 wants to merge 1 commit into openai:main from eren23:submission-jepa-ablation-2026-05-02

Conversation


@eren23 eren23 commented May 2, 2026

Summary

Non-record submission documenting a comprehensive negative result:
JEPA auxiliary objectives do not improve val_bpb on parameter-golf
at the 17.06M-param / sp1024 / FineWeb scale. The cleanest recipe ties
the baseline exactly. Submitting to formalize the negative finding so future
JEPA submitters don't have to re-run the same grid.

  • Best JEPA variant (jepa-var-zero, alpha=0.001, VAR_WEIGHT=0):
    val_bpb = 1.2311 at step 50K — exact tie with same-seed baseline.
  • Same-seed JEPA-vs-baseline gap: +0.0007 to +0.0009 across two seeds (1337, 42).
  • Cross-seed baseline variance: 0.0022 — larger than the JEPA gap, statistically indistinguishable.
  • lambda matters by orders of magnitude: lambda=0.001 gives parity; lambda=0.005 costs ≥ +0.005 BPB; lambda=0.2 (a common JEPA paper default) costs +0.018 BPB.
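The noise-floor argument in the bullets above reduces to a one-line check (the numbers are copied from this PR; the variable names are mine):

```python
# Same-seed JEPA-vs-baseline gaps observed in this ablation (BPB).
jepa_gaps = [0.0007, 0.0009]   # seeds 1337 and 42

# Cross-seed baseline spread: |1.2311 - 1.2289| between seeds 1337 and 42.
cross_seed_baseline_spread = round(1.2311 - 1.2289, 4)  # 0.0022

# The JEPA effect is smaller than run-to-run seed noise, which is why
# the gap is reported as statistically indistinguishable from zero.
assert max(jepa_gaps) < cross_seed_baseline_spread
```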

What's new (param-count clean)

All 14 variants share one architectural backbone: 17,059,912-param BaselineGPT (9L, 512d, KV4, MLP_MULT=2, sp1024, relu_sq, tied embeds). JEPA variants add a single 65,536-param predictor MLP (model_dim → 64 → model_dim, zero-init on output) — total 17,125,448 params (+0.4%). No model-shape changes across the grid; only loss weights differ.
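A minimal sketch of what such a predictor head could look like. The PR only specifies the shape (model_dim → 64 → model_dim) and zero-init on the output; the bias-free layers are my inference from the stated 65,536-param count (2 · 512 · 64), and the class name is hypothetical:

```python
import torch
import torch.nn as nn

MODEL_DIM, BOTTLENECK = 512, 64

class JEPAPredictor(nn.Module):
    """Hypothetical predictor MLP: model_dim -> 64 -> model_dim.

    Bias-free so the parameter count is exactly 2 * 512 * 64 = 65,536,
    matching the +65,536-param delta in the PR. The output projection is
    zero-initialized so the auxiliary branch starts as a no-op and the
    backbone is untouched at step 0.
    """
    def __init__(self, dim: int = MODEL_DIM, hidden: int = BOTTLENECK):
        super().__init__()
        self.down = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(hidden, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init on output

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(h)))

pred = JEPAPredictor()
n_params = sum(p.numel() for p in pred.parameters())
assert n_params == 65_536  # matches 17,125,448 - 17,059,912
```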

14-run table — final val_bpb @ step 50K

| run | seed | config | step | val_bpb | Δ vs same-seed baseline |
|---|---|---|---|---|---|
| baseline-seed42 | 42 | control | 50K | 1.2289 | 0 |
| tiny-lambda-seed42 | 42 | alpha=0.001 | 50K | 1.2298 | +0.0009 |
| var-zero | 1337 | alpha=0.001, VAR_WEIGHT=0 | 50K | 1.2311 | 0.0000 (tie) |
| baseline-promo | 1337 | control | 50K | 1.2311 | 0 |
| tiny-lambda-v3 | 1337 | alpha=0.001 | 50K | 1.2318 | +0.0007 |
| half-lambda | 1337 | alpha=0.0005 | 50K | 1.2318 | +0.0007 |
| chunk16 | 1337 | alpha=0.001, JEPA_CHUNK=16 | 50K | 1.2318 | +0.0007 |
| aux+token-tiny | 1337 | alpha=beta=0.001 | 50K | 1.2361 | +0.0050 |
| tenth-lambda* | 1337 | alpha=0.0001 | 40K | 1.2362 | tied @40K |
| covar-v3 | 1337 | alpha=0.005, COVAR_WEIGHT=0.05 | 50K | 1.2374 | +0.0063 |
| token-only-tiny* | 1337 | beta=0.001 | 40K | 1.2408 | +0.0046 (40K) |
| injection-v2* | 1337 | alpha=0.005, INJECTION=1 | 40K | 1.2456 | +0.0094 (40K) |
| aux-v1 | 1337 | alpha=0.2 ("JEPA paper" default) | 50K | 1.2492 | +0.0181 |
| aux-low-v2* | 1337 | alpha=0.005 | 30K | 1.2553 | +0.0060 (30K) |

* wallclock cap hit before step 50K on slower hardware; column shows actual.

Three findings

  1. lambda matters most, by orders of magnitude. PR #832 (Non-record: Byte-level transformer + JEPA auxiliary loss, val_bpb 1.1903) used lambda=0.001 in its winning recipe; we confirm parity at that magnitude. Going to 0.005 already costs ≥0.005 BPB; going to 0.2 (a common JEPA paper default) costs 0.018 BPB.
  2. VICReg variance reg adds small harm at this lambda. With lambda already at the noise floor, the variance hinge `relu(1 - z_std)` injects a tiny asymmetric force that nudges JEPA away from baseline. Setting VAR_WEIGHT=0 recovers exact parity.
  3. Path B (token-decoder JEPA) hurts even at beta=0.001. Token CE competes with main CE for the tied LM head. Path A (hidden-state aux MSE) is benign at small lambda because it doesn't touch the LM head.
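As a sketch, the Path-A auxiliary term the findings describe could be wired like this. The function name, argument names, and exact reduction are my assumptions, not this repo's code; only the ingredients (hidden-state MSE against a detached target, a `relu(1 - z_std)` variance hinge, and the alpha/VAR_WEIGHT knobs) come from the PR:

```python
import torch
import torch.nn.functional as F

def jepa_aux_loss(h_pred: torch.Tensor,
                  h_target: torch.Tensor,
                  alpha: float = 0.001,
                  var_weight: float = 0.0) -> torch.Tensor:
    """Hypothetical Path-A auxiliary term.

    MSE between predicted and (stop-gradient) target hidden states,
    plus an optional VICReg-style variance hinge. var_weight=0 is the
    'var-zero' recipe that tied the baseline; any var_weight > 0 adds
    the asymmetric relu(1 - z_std) force that finding 2 blames for
    the tiny residual gap. Path A never touches the tied LM head.
    """
    h_target = h_target.detach()                # stop-gradient on target
    mse = F.mse_loss(h_pred, h_target)
    z_std = h_pred.std(dim=0)                   # per-dim std over the batch
    var_hinge = F.relu(1.0 - z_std).mean()      # variance regularizer
    return alpha * (mse + var_weight * var_hinge)

# total = main_ce + jepa_aux_loss(predictor(h_ctx), h_tgt)   # Path A
# Path B would instead add beta * CE through the tied LM head,
# which is what the table shows hurting even at beta=0.001.
```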

Reproducibility

Track and quant note

Track: `non-record-unlimited-compute-16mb`. The model artifact was not int8+zlib quantized for this submission — we're submitting an ablation finding, not a leaderboard ranking candidate. `val_bpb` reported is the pre-quant running value at step 50K. The bytes_total / bytes_model_int8_zlib fields in submission.json are null. If a finding-style submission is preferred under a different track, happy to relabel.

Refines

Test plan

  • Reviewer reads README.md in records/.../
  • Reviewer spot-checks train.log val_bpb at step 50K = 1.2311
  • Reviewer confirms param-count-clean comparison (only +65K predictor MLP)

Generated with Claude Code and Crucible plugin-based ML research platform.

Comprehensive ablation showing JEPA auxiliary objectives do not improve
val_bpb on parameter-golf at 17M / sp1024 / FineWeb scale. Cleanest
recipe (alpha=0.001, VAR_WEIGHT=0, MSE-only Path A) ties baseline
exactly at val_bpb=1.2311 (step 50K, promotion preset, 7200s wallclock).

- 14 runs at the same N (17.06M / 17.13M with predictor MLP, +0.4%).
- Two-seed paired baselines (1337, 42) -> 0.0022 noise floor.
- lambda sweep across 4 orders of magnitude (1e-4..0.2).
- Path A / Path B / injection / V-JEPA covariance ablation.

Refines findings from PR openai#896 (Manav Pandey) and PR openai#1330 (luciobaiocchi).
Architecture and full finding doc published at
https://github.com/eren23/crucible-community-tap (architectures/jepa_lm,
findings/parameter-golf-jepa-ablation).
@eren23 eren23 marked this pull request as ready for review May 2, 2026 06:57