results+paper(z-gap): Strategy E — multi-model P3 probing closes M5#5
P3 cross-lingual probing run on the same 7-model set as Strategy D, with
per-cell binomial test against chance. New runner:
experiments/scripts/run_strategy_e_multimodel_probing.py

Results (probe trained on English, mean over 4 non-English langs):

Model                  cat_en  cat_xfer  op_en  op_xfer
────────────────────────────────────────────────────────────────
UniXcoder (code)       0.990   0.670     1.000  0.175
MiniLM-L12 (NL)        1.000   0.900     1.000  0.858
Nomic v1.5 (NL+code)   1.000   0.625     1.000  0.225
E5-small (NL)          1.000   0.985     1.000  0.892
E5-base (NL)           1.000   0.978     1.000  0.958
E5-large (NL)          1.000   0.995     1.000  0.978
BGE-M3 (NL+code)       1.000   0.990     1.000  0.975

All non-trivial non-en cells: p < 1e-25 vs chance (binomial).

Findings:
- P3 is supported in multilingual NL models (MiniLM, E5 family, BGE-M3) but
  is MODEL-CLASS DEPENDENT. Code-trained (UniXcoder) and mixed NL+code (Nomic)
  models reach near-perfect English training accuracy but collapse on
  cross-lingual transfer (0.62-0.67 cat, 0.18-0.23 op). This refines the
  paper's original P3 claim: cross-lingual Z_sem separability is a property
  of the multilingual NL training distribution, not an intrinsic property of
  every embedding space with R_code > 1.
- E5 family P3 scale-convergence (echo of Strategy D pattern under fixed
  architecture/training recipe): operation transfer 0.89 (384d) -> 0.96
  (768d) -> 0.98 (1024d).

Paper:
- §5.5 P3 Results table: 1 row (MiniLM only) -> 7 rows. Body text rewritten
  to surface the model-class dependence finding + E5 family P3
  scale-convergence echo.
- Limitations "Z stratification" bullet: "not validated across model
  families" replaced with "supported on 7 models with model-class
  dependence"; remaining work narrowed to decoder-only LLM hidden states +
  tier2/tier3 OOD stimuli.

Decisions log:
- planning/decisions.md: 2026-05-21 Strategy E entry covering the design
  rationale, the model-class dependence finding, and the scope of remaining
  P3 work after this PR.

Closes M5 from the 2026-05-21 pre-experiment review. C1 (contamination)
deferred portion via OOD stimuli is the next follow-up.
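
For readers who want to sanity-check a "p vs chance (binomial)" cell, a minimal sketch of a one-sided exact binomial test follows. It is not the code in run_strategy_e_multimodel_probing.py. The function name `binomial_tail` and the example numbers (120 test items, 8 classes) are hypothetical, since this PR does not state the per-cell item count or the number of categories and operations.

```python
from math import comb


def binomial_tail(successes: int, trials: int, chance: float) -> float:
    """One-sided exact p-value: P(X >= successes) for X ~ Binomial(trials, chance)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must be strictly between 0 and 1")
    # Sum the binomial probability mass from the observed count upward.
    return sum(
        comb(trials, k) * chance**k * (1.0 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical cell: these counts are assumptions, not values from the PR.
    n_items = 120                # assumed number of probe test items in one cell
    n_classes = 8                # assumed number of classes, so chance = 1/8
    accuracy = 0.858             # a transfer accuracy of the kind reported above
    n_correct = round(accuracy * n_items)
    p_value = binomial_tail(n_correct, n_items, 1.0 / n_classes)
    print(f"{n_correct}/{n_items} correct, p = {p_value:.3e}")
```

The test is one-sided because the question is whether the probe beats chance, not whether it differs from chance in either direction. A cell whose accuracy sits near the chance rate will give a large p-value under this test regardless of the item count.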