results+paper(z-gap): Strategy E — multi-model P3 probing closes M5#5

Merged. heznpc merged 1 commit into main from chore/strategy-e-multimodel-probing-2026-05-21 on May 20, 2026.
@heznpc heznpc commented May 20, 2026

P3 cross-lingual probing run on the same 7-model set as Strategy D, with
per-cell binomial test against chance. New runner:
experiments/scripts/run_strategy_e_multimodel_probing.py
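The train-on-English, evaluate-elsewhere protocol can be sketched roughly as below. This is an illustration only, not the code in the runner: the PR does not state the probe type, so a nearest-centroid classifier stands in for it, and every name here (`fit_centroids`, `transfer_report`, the input layout) is hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def fit_centroids(vectors, labels):
    """Return {label: mean embedding} computed from the training set."""
    grouped = defaultdict(list)
    for vec, lab in zip(vectors, labels):
        grouped[lab].append(vec)
    # zip(*vecs) iterates over dimensions, so each centroid is a per-dimension mean.
    return {lab: [fmean(dim) for dim in zip(*vecs)] for lab, vecs in grouped.items()}


def predict(centroids, vec):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], vec))


def accuracy(centroids, vectors, labels):
    hits = sum(predict(centroids, v) == y for v, y in zip(vectors, labels))
    return hits / len(labels)


def transfer_report(train_set, eval_sets):
    """Fit on English only, then score each other language with the frozen probe.

    train_set: (vectors, labels) for English (assumed layout).
    eval_sets: {language: (vectors, labels)} for the non-English languages.
    """
    centroids = fit_centroids(*train_set)
    per_lang = {lang: accuracy(centroids, *data) for lang, data in eval_sets.items()}
    return {
        "en": accuracy(centroids, *train_set),   # analogous to cat_en / op_en
        "xfer_mean": fmean(per_lang.values()),   # analogous to cat_xfer / op_xfer
        "per_lang": per_lang,
    }
```

The point of the sketch is that the probe never sees non-English data during fitting, so the transfer column measures how well the English decision boundary carries over to the other languages.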

Results (probe trained on English, mean over 4 non-English langs):

  Model                       cat_en   cat_xfer   op_en   op_xfer
  ────────────────────────────────────────────────────────────────
  UniXcoder (code)             0.990     0.670    1.000    0.175
  MiniLM-L12 (NL)              1.000     0.900    1.000    0.858
  Nomic v1.5 (NL+code)         1.000     0.625    1.000    0.225
  E5-small (NL)                1.000     0.985    1.000    0.892
  E5-base (NL)                 1.000     0.978    1.000    0.958
  E5-large (NL)                1.000     0.995    1.000    0.978
  BGE-M3 (NL+code)             1.000     0.990    1.000    0.975

All non-trivial non-en cells: p < 1e-25 vs chance (binomial).
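A per-cell test of this kind could look like the sketch below. It assumes a one-sided exact test (accuracy greater than chance); the PR does not state the sidedness, the number of trials per cell, or the chance level, so `binom_tail_p` and its arguments are hypothetical.

```python
from math import comb


def binom_tail_p(successes, trials, chance):
    """One-sided exact p-value: P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, i) * chance**i * (1 - chance) ** (trials - i)
        for i in range(successes, trials + 1)
    )


# Toy example: 9 correct out of 10 with a 50% chance level.
# The tail is (C(10,9) + C(10,10)) / 2**10 = 11 / 1024.
p_value = binom_tail_p(9, 10, 0.5)
```

This direct summation is fine for small cells but converts large binomial coefficients to floats, so it can overflow for very large trial counts; `scipy.stats.binomtest(k, n, p, alternative="greater")` is the more robust choice in practice.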

Findings:

  • P3 is supported in multilingual NL models (MiniLM, E5 family, BGE-M3) but
    is MODEL-CLASS DEPENDENT. Code-trained (UniXcoder) and mixed NL+code (Nomic)
    models reach near-perfect accuracy on English, the probe's training
    language, but collapse on cross-lingual transfer (0.625-0.670 cat,
    0.175-0.225 op). This refines the paper's original P3 claim: cross-lingual
    Z_sem separability is a property of the multilingual NL training
    distribution, not an intrinsic property of every embedding space with
    R_code > 1.
  • Within the E5 family, P3 operation transfer converges upward with scale,
    echoing the Strategy D pattern under a fixed architecture and training
    recipe: 0.89 (384d) -> 0.96 (768d) -> 0.98 (1024d).

Paper:

  • §5.5 P3 Results table: 1 row (MiniLM only) -> 7 rows. Body text rewritten
    to surface the model-class dependence finding + E5 family P3
    scale-convergence echo.
  • Limitations "Z stratification" bullet: "not validated across model
    families" replaced with "supported on 7 models with model-class
    dependence"; remaining work narrowed to decoder-only LLM hidden states +
    tier2/tier3 OOD stimuli.

Decisions log:

  • planning/decisions.md: 2026-05-21 Strategy E entry covering the design
    rationale, the model-class dependence finding, and the scope of remaining
    P3 work after this PR.

Closes M5 from the 2026-05-21 pre-experiment review. The deferred portion of
C1 (contamination), to be addressed via OOD stimuli, is the next follow-up.

@heznpc heznpc merged commit 72cd8ea into main May 20, 2026
1 check passed
@heznpc heznpc deleted the chore/strategy-e-multimodel-probing-2026-05-21 branch May 20, 2026 17:52