records(non-record-16mb): JEPA-on-LM 14-run ablation (negative result)#2142

Open
eren23 wants to merge 1 commit into openai:main from eren23:submission-jepa-ablation-2026-05-02

Conversation


@eren23 eren23 commented May 2, 2026

Summary

Non-record submission documenting a comprehensive negative result:
JEPA auxiliary objectives do not improve val_bpb on parameter-golf
at the 17.06M-param / sp1024 / FineWeb scale. The cleanest recipe ties
the baseline exactly. Submitting to formalize the negative finding so future
JEPA submitters don't have to re-run the same grid.

  • Best JEPA variant (jepa-var-zero, alpha=0.001, VAR_WEIGHT=0):
    val_bpb = 1.2311 at step 50K — exact tie with same-seed baseline.
  • Same-seed JEPA-vs-baseline gap: +0.0007 to +0.0009 across two seeds (1337, 42).
  • Cross-seed baseline variance: 0.0022 — larger than the JEPA gap, statistically indistinguishable.
  • lambda matters by orders of magnitude: lambda=0.001 gives parity; lambda=0.005 costs ≥ +0.005 BPB; lambda=0.2 (a common JEPA paper default) costs +0.018 BPB.
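The noise-floor argument in the bullets above reduces to a one-line check (the numbers are copied from this PR; the variable names are mine):

```python
# Same-seed JEPA-vs-baseline gaps observed in this ablation (BPB).
jepa_gaps = [0.0007, 0.0009]   # seeds 1337 and 42

# Cross-seed baseline spread: |1.2311 - 1.2289| between seeds 1337 and 42.
cross_seed_baseline_spread = round(1.2311 - 1.2289, 4)  # 0.0022

# The JEPA effect is smaller than run-to-run seed noise, which is why
# the gap is reported as statistically indistinguishable from zero.
assert max(jepa_gaps) < cross_seed_baseline_spread
```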

What's new (param-count clean)

All 14 variants share one architectural backbone: 17,059,912-param BaselineGPT (9L, 512d, KV4, MLP_MULT=2, sp1024, relu_sq, tied embeds). JEPA variants add a single 65,536-param predictor MLP (model_dim → 64 → model_dim, zero-init on output) — total 17,125,448 params (+0.4%). No model-shape changes across the grid; only loss weights differ.
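A minimal sketch of what such a predictor head could look like. The PR only specifies the shape (model_dim → 64 → model_dim) and zero-init on the output; the bias-free layers are my inference from the stated 65,536-param count (2 · 512 · 64), and the class name is hypothetical:

```python
import torch
import torch.nn as nn

MODEL_DIM, BOTTLENECK = 512, 64

class JEPAPredictor(nn.Module):
    """Hypothetical predictor MLP: model_dim -> 64 -> model_dim.

    Bias-free so the parameter count is exactly 2 * 512 * 64 = 65,536,
    matching the +65,536-param delta in the PR. The output projection is
    zero-initialized so the auxiliary branch starts as a no-op and the
    backbone is untouched at step 0.
    """
    def __init__(self, dim: int = MODEL_DIM, hidden: int = BOTTLENECK):
        super().__init__()
        self.down = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(hidden, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init on output

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.up(torch.relu(self.down(h)))

pred = JEPAPredictor()
n_params = sum(p.numel() for p in pred.parameters())
assert n_params == 65_536  # matches 17,125,448 - 17,059,912
```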

14-run table — final val_bpb @ step 50K

| run | seed | config | step | val_bpb | Δ vs same-seed baseline |
|---|---|---|---|---|---|
| baseline-seed42 | 42 | control | 50K | 1.2289 | 0 |
| tiny-lambda-seed42 | 42 | alpha=0.001 | 50K | 1.2298 | +0.0009 |
| var-zero | 1337 | alpha=0.001, VAR_WEIGHT=0 | 50K | 1.2311 | 0.0000 (tie) |
| baseline-promo | 1337 | control | 50K | 1.2311 | 0 |
| tiny-lambda-v3 | 1337 | alpha=0.001 | 50K | 1.2318 | +0.0007 |
| half-lambda | 1337 | alpha=0.0005 | 50K | 1.2318 | +0.0007 |
| chunk16 | 1337 | alpha=0.001, JEPA_CHUNK=16 | 50K | 1.2318 | +0.0007 |
| aux+token-tiny | 1337 | alpha=beta=0.001 | 50K | 1.2361 | +0.0050 |
| tenth-lambda* | 1337 | alpha=0.0001 | 40K | 1.2362 | tied @40K |
| covar-v3 | 1337 | alpha=0.005, COVAR_WEIGHT=0.05 | 50K | 1.2374 | +0.0063 |
| token-only-tiny* | 1337 | beta=0.001 | 40K | 1.2408 | +0.0046 (40K) |
| injection-v2* | 1337 | alpha=0.005, INJECTION=1 | 40K | 1.2456 | +0.0094 (40K) |
| aux-v1 | 1337 | alpha=0.2 ("JEPA paper" default) | 50K | 1.2492 | +0.0181 |
| aux-low-v2* | 1337 | alpha=0.005 | 30K | 1.2553 | +0.0060 (30K) |

* wallclock cap hit before step 50K on slower hardware; column shows actual.

Three findings

  1. lambda matters most, by orders of magnitude. PR #832 (Non-record: Byte-level transformer + JEPA auxiliary loss, val_bpb 1.1903) used lambda=0.001 in its winning recipe; we confirm parity at that magnitude. Going to 0.005 already costs ≥0.005 BPB; going to 0.2 (a common JEPA paper default) costs 0.018 BPB.
  2. VICReg variance reg adds small harm at this lambda. With lambda already at the noise floor, the variance hinge `relu(1 - z_std)` injects a tiny asymmetric force that nudges JEPA away from baseline. Setting VAR_WEIGHT=0 recovers exact parity.
  3. Path B (token-decoder JEPA) hurts even at beta=0.001. Token CE competes with main CE for the tied LM head. Path A (hidden-state aux MSE) is benign at small lambda because it doesn't touch the LM head.
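As a sketch, the Path-A auxiliary term the findings describe could be wired like this. The function name, argument names, and exact reduction are my assumptions, not this repo's code; only the ingredients (hidden-state MSE against a detached target, a `relu(1 - z_std)` variance hinge, and the alpha/VAR_WEIGHT knobs) come from the PR:

```python
import torch
import torch.nn.functional as F

def jepa_aux_loss(h_pred: torch.Tensor,
                  h_target: torch.Tensor,
                  alpha: float = 0.001,
                  var_weight: float = 0.0) -> torch.Tensor:
    """Hypothetical Path-A auxiliary term.

    MSE between predicted and (stop-gradient) target hidden states,
    plus an optional VICReg-style variance hinge. var_weight=0 is the
    'var-zero' recipe that tied the baseline; any var_weight > 0 adds
    the asymmetric relu(1 - z_std) force that finding 2 blames for
    the tiny residual gap. Path A never touches the tied LM head.
    """
    h_target = h_target.detach()                # stop-gradient on target
    mse = F.mse_loss(h_pred, h_target)
    z_std = h_pred.std(dim=0)                   # per-dim std over the batch
    var_hinge = F.relu(1.0 - z_std).mean()      # variance regularizer
    return alpha * (mse + var_weight * var_hinge)

# total = main_ce + jepa_aux_loss(predictor(h_ctx), h_tgt)   # Path A
# Path B would instead add beta * CE through the tied LM head,
# which is what the table shows hurting even at beta=0.001.
```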

Reproducibility

Track and quant note

Track: `non-record-unlimited-compute-16mb`. The model artifact was not int8+zlib quantized for this submission — we're submitting an ablation finding, not a leaderboard ranking candidate. `val_bpb` reported is the pre-quant running value at step 50K. The bytes_total / bytes_model_int8_zlib fields in submission.json are null. If a finding-style submission is preferred under a different track, happy to relabel.

Refines

Test plan

  • Reviewer reads README.md in records/.../
  • Reviewer spot-checks train.log val_bpb at step 50K = 1.2311
  • Reviewer confirms param-count-clean comparison (only +65K predictor MLP)

Generated with Claude Code and Crucible plugin-based ML research platform.

Comprehensive ablation showing JEPA auxiliary objectives do not improve
val_bpb on parameter-golf at 17M / sp1024 / FineWeb scale. Cleanest
recipe (alpha=0.001, VAR_WEIGHT=0, MSE-only Path A) ties baseline
exactly at val_bpb=1.2311 (step 50K, promotion preset, 7200s wallclock).

- 14 runs at the same N (17.06M / 17.13M with predictor MLP, +0.4%).
- Two-seed paired baselines (1337, 42) -> 0.0022 noise floor.
- lambda sweep across 4 orders of magnitude (1e-4..0.2).
- Path A / Path B / injection / V-JEPA covariance ablation.

Refines findings from PR openai#896 (Manav Pandey) and PR openai#1330 (luciobaiocchi).
Architecture and full finding doc published at
https://github.com/eren23/crucible-community-tap (architectures/jepa_lm,
findings/parameter-golf-jepa-ablation).
@eren23 eren23 marked this pull request as ready for review May 2, 2026 06:57