results+paper(z-gap): Strategy D 7-model results — 35/35 cells, P1 partial via E5 family#4
Merged
Strategy D extension ran successfully after einops install:

  Model            en     ko     zh     ar     es     agg
  UniXcoder        1.22*  1.01*  1.08*  1.01*  1.05*  1.07
  MiniLM-L12       1.23*  1.12*  1.18*  1.10*  1.19*  1.16
  Nomic v1.5       1.24*  1.02*  1.03*  1.01*  1.07*  1.07
  E5-small [NEW]   1.22*  1.09*  1.13*  1.09*  1.14*  1.13
  E5-base [NEW]    1.22*  1.11*  1.13*  1.11*  1.16*  1.14
  E5-large         1.28*  1.16*  1.19*  1.16*  1.22*  1.20
  BGE-M3 [NEW]     1.21*  1.14*  1.16*  1.14*  1.16*  1.16

35/35 cells: R_code > 1, p < 0.05 after Holm-Bonferroni
Permutation null mean: R in [1.000, 1.005] across all cells (C2 baseline confirmed)

Findings:
- Cross-model robustness: alignment holds across code-trained (UniXcoder,
  Nomic), hybrid (BGE-M3), and NL-only (MiniLM, E5 family) architectures.
- E5 family scale-convergence (within same architecture/training recipe):
  1.13 (384d) -> 1.14 (768d) -> 1.20 (1024d). Partial P1 support: monotonic
  qualitatively but non-linear (small->base flat, base->large steep).
- D_train modulation re-confirmed at scale: English R_code 1.21-1.28,
  Korean/Arabic 1.01-1.16, tracking code-corpus language representation.

Paper:
- §5.5 Table: 4 rows -> 7 rows. Caption now reports 35/35 cells + null R
  range. Body text rewritten from "20/20" to "35/35" + new "Third pattern"
  paragraph on E5-family partial scale-convergence.

Dependencies:
- einops>=0.7 added to experiments/requirements.txt and pyproject.toml.
  Required by nomic-ai/nomic-embed-text-v1.5 trust_remote_code module; the
  M3 try/except wrap correctly isolated the first-run failure, allowing
  one targeted dep fix instead of a wholesale debug.

Decisions log:
- planning/decisions.md: 2026-05-21 entry documenting einops fix and the
  35/35 result.
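For readers unfamiliar with the correction named in the significance criterion, the following is a minimal sketch of the standard Holm-Bonferroni step-down rule. It is not the project's own implementation, and the function name `holm_bonferroni` and the example p-values are hypothetical; the per-cell p-values from this run are not reproduced here.

```python
from typing import List


def holm_bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject decision per hypothesis using Holm's step-down rule."""
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank+1)-th smallest p-value is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down: after the first failure, all larger p-values are kept.
            break
    return reject


# Hypothetical p-values for illustration only, NOT values from this run.
example_p = [0.0004, 0.03, 0.0011, 0.2]
decisions = holm_bonferroni(example_p)
print(decisions)
```

With 35 cells, the smallest p-value has to clear 0.05 / 35 and each subsequent one a slightly looser threshold, so a "35/35" outcome means every cell survived its own step of the procedure.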
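The "flat then steep" reading of the E5 scale trend can be checked with simple arithmetic on the aggregate column. The sketch below uses only the three reported aggregate values; the variable names are illustrative and nothing here comes from the experiment code.

```python
# Aggregate R_code values for the E5 family, copied from the results table.
e5_agg = {"E5-small": 1.13, "E5-base": 1.14, "E5-large": 1.20}

names = list(e5_agg)
# Difference between each model and the next larger one, rounded to the
# two decimals the table reports.
steps = {
    f"{a}->{b}": round(e5_agg[b] - e5_agg[a], 2)
    for a, b in zip(names, names[1:])
}
monotonic = all(step > 0 for step in steps.values())

print(steps, monotonic)
```

The small->base step is 0.01 and the base->large step is 0.06, which is why the trend is described as monotonic but non-linear.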
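The per-model try/except wrap mentioned in the dependency note can be sketched as follows. This is an assumed shape, not the repository's code: `run_all`, `fake_evaluate`, and the model labels are hypothetical stand-ins, and the simulated error only mimics a missing `einops` import.

```python
from typing import Callable, Dict, List, Tuple


def run_all(
    models: List[str], evaluate: Callable[[str], float]
) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Evaluate every model, recording failures instead of aborting the run."""
    results: Dict[str, float] = {}
    failures: Dict[str, str] = {}
    for name in models:
        try:
            results[name] = evaluate(name)
        except Exception as exc:  # deliberately broad: one bad model must not stop the rest
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_evaluate(name: str) -> float:
    """Hypothetical evaluator that simulates a missing dependency for one model."""
    if name == "nomic-v1.5":
        raise ModuleNotFoundError("No module named 'einops'")
    return 1.0


results, failures = run_all(["unixcoder", "nomic-v1.5", "e5-large"], fake_evaluate)
print(failures)
```

Because the failure is recorded per model, the remaining models still produce results and the error message points directly at the one dependency that needs fixing.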