heznpc · May 20, 2026
diff --git a/experiments/requirements.txt b/experiments/requirements.txt
@@ -12,3 +12,4 @@ kiwipiepy>=0.16
 tqdm>=4.65
 click>=8.1
 python-dotenv>=1.0
+einops>=0.7
diff --git a/paper/main.tex b/paper/main.tex
@@ -469,18 +469,21 @@ \subsection{Pilot Experiment and Results}\label{sec:pilot}
 \toprule
 \textbf{Model} & \textbf{en} & \textbf{ko} & \textbf{zh} & \textbf{ar} & \textbf{es} & \textbf{agg} \\
 \midrule
-UniXcoder (code)  & 1.22* & 1.01* & 1.08* & 1.01* & 1.05* & 1.07 \\
-MiniLM-L12 (NL)   & 1.23* & 1.12* & 1.18* & 1.10* & 1.19* & 1.16 \\
+UniXcoder (code)   & 1.22* & 1.01* & 1.08* & 1.01* & 1.05* & 1.07 \\
+MiniLM-L12 (NL)    & 1.23* & 1.12* & 1.18* & 1.10* & 1.19* & 1.16 \\
 Nomic v1.5         & 1.24* & 1.02* & 1.03* & 1.01* & 1.07* & 1.07 \\
+E5-small (NL)      & 1.22* & 1.09* & 1.13* & 1.09* & 1.14* & 1.13 \\
+E5-base (NL)       & 1.22* & 1.11* & 1.13* & 1.11* & 1.16* & 1.14 \\
 E5-large (NL)      & 1.28* & 1.16* & 1.19* & 1.16* & 1.22* & 1.20 \\
+BGE-M3 (NL+code)   & 1.21* & 1.14* & 1.16* & 1.14* & 1.16* & 1.16 \\
 \bottomrule
 \end{tabular}
-\caption*{\small * $p < 0.05$ after Holm-Bonferroni correction. All 20/20 cells show $R_{\text{code}} > 1$.}
+\caption*{\small * $p < 0.05$ after Holm-Bonferroni correction. All 35/35 cells show $R_{\text{code}} > 1$. Permutation-null distribution mean: $R \in [1.000, 1.005]$ across all cells (random-matching baseline).}
 \end{table}
 
-All 20 language-model cells show $R_{\text{code}} > 1$ ($p < 0.05$ after correction): NL descriptions are closer to their corresponding code than to mismatched code in every language and every model. The result is robust across code-trained (UniXcoder, Nomic) and NL-only (MiniLM, E5-large) architectures.
+All 35 language--model cells (7 models $\times$ 5 languages) show $R_{\text{code}} > 1$ ($p < 0.05$ after Holm-Bonferroni correction across the 35 cells): NL descriptions are closer to their corresponding code than to mismatched code in every language and every model. Under the permutation null (random matching), the mean $R$ falls in $[1.000, 1.005]$ for every cell, so the observed ratios are not an artifact of the metric itself. The result is robust across code-trained (UniXcoder, Nomic), hybrid (BGE-M3), and NL-only (MiniLM, E5 family) architectures.
 
-Two patterns emerge. First, \textbf{$\Dtrain$ modulates NL-code alignment}: English consistently shows the highest $R_{\text{code}}$ (1.22--1.28), while Korean and Arabic show the lowest (1.01--1.16), tracking language representation in code training corpora. Second, \textbf{NL-only models achieve higher $R_{\text{code}}$ than code-trained models}: E5-large (1.20 aggregate) and MiniLM (1.16) surpass UniXcoder (1.07) and Nomic (1.07).
+Three patterns emerge. First, \textbf{$\Dtrain$ modulates NL-code alignment}: English consistently shows the highest $R_{\text{code}}$ (1.21--1.28), while Korean and Arabic show the lowest (1.01--1.16), tracking language representation in code training corpora. Second, \textbf{NL-only models achieve higher $R_{\text{code}}$ than code-trained models}: E5-large (1.20 aggregate) and MiniLM (1.16) surpass UniXcoder (1.07) and Nomic (1.07). Third, \textbf{the E5 family (same architecture, identical training recipe, varying dimension) shows partial scale-convergence}: aggregate $R_{\text{code}}$ rises $1.13$ (small, 384d) $\to 1.14$ (base, 768d) $\to 1.20$ (large, 1024d); the small-to-base step is nearly flat ($+0.01$) while the base-to-large step is larger ($+0.06$), so P1's monotonic-with-scale prediction holds qualitatively but the gain is non-linear in this regime.
 
 \paragraph{Lexical overlap control.} A potential confound: NL descriptions share tokens with their code equivalents (``sort'' appears in both ``Sort the list'' and \texttt{sorted(lst)}). Token overlap correlates with $d_{\text{match}}$ (Spearman $\rho = -0.51$, $p < 0.001$ for MiniLM), confirming a lexical component. However, $R_{\text{code}} > 1$ survives two controls. First, for the 32/50 operations with \emph{zero} token overlap (after stemming), $R_{\text{code}}$ remains above 1 in all three models (1.06--1.18). Second, obfuscating variable names in code (\texttt{lst}$\to$\texttt{v0}, \texttt{s}$\to$\texttt{v0}) reduces $R_{\text{code}}$ by only 1.6--5.4\%, and all models retain $R_{\text{code}} > 1$. Lexical overlap inflates the effect but does not create it: the alignment is primarily semantic.
 

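The table caption and body text rely on a Holm-Bonferroni correction across the 35 cells. As a reference for reviewers, here is a minimal sketch of the standard Holm step-down procedure; the function name `holm_bonferroni` and the example p-values are hypothetical and are not taken from the experiment code in this PR.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected
    under Holm's step-down procedure at family-wise level alpha."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank+1)-th smallest p-value is compared to alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: every larger p-value is also retained
    return reject


# Hypothetical p-values, for illustration only.
example = [0.01, 0.04, 0.03]
print(holm_bonferroni(example))
```

With 35 cells, the smallest p-value is compared against 0.05/35, the next against 0.05/34, and so on, which is why every cell carrying an asterisk must clear a stricter threshold than the nominal 0.05.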
diff --git a/planning/decisions.md b/planning/decisions.md
@@ -66,3 +66,18 @@ Format: `## YYYY-MM-DD -- <short title>` with **Context**, **Decision**, **Why**
   - **M6 (Codestral Embed)**: Excluded for this session. `.env` has no `MISTRAL_API_KEY`, and the user constrained the session to Claude Code-accessible models. Sentence-transformers / open-source HF only.
 
 **Why**: The pre-experiment review caught contamination and baseline-framing issues that, if discovered after results were reported, would have required a paper revision plus a fresh experiment. Catching them before the cross-model extension lets a single PR carry the corrected framing and the new evidence simultaneously.
+
+---
+
+## 2026-05-21 -- Strategy D 7-model results + einops dependency fix
+
+**Context**: After the pre-experiment review PR (#3) merged, the Strategy D extension ran on 7 models. First run: 6/7 succeeded; Nomic v1.5 failed with `ImportError: einops` because its `trust_remote_code` module imports `einops` lazily and the package was not in `requirements.txt`. The M3 try/except wrap correctly isolated the failure so the other 6 models completed cleanly.
+
+**Decisions**:
+
+  - Added `einops>=0.7` to `experiments/requirements.txt` and `pyproject.toml` dependencies. The package is needed only by Nomic's remote-code path; a loose lower bound (`>=0.7`) rather than an exact pin is sufficient because the API has been stable since 0.6.
+  - Re-ran Strategy D with einops installed. All 7 models succeeded. Final matrix: **35/35 cells with $R_{\text{code}} > 1$ and $p < 0.05$ after Holm-Bonferroni**. Permutation-null mean ∈ [1.000, 1.005] across all cells (C2 baseline framing empirically confirmed).
+  - Paper §5.5 table expanded from 4 rows to 7. Body text revised from "20/20 cells" to "35/35 cells", and a third pattern added on the E5 family's partial scale-convergence ($1.13 \to 1.14 \to 1.20$ at $384/768/1024$d).
+  - The pretraining contamination caveat (C1) added in PR #3 stays unchanged: adding more models does not address contamination, only cross-model robustness.
+
+**Why**: The 7-model extension was the empirical contribution this session aimed to land. Catching einops as a soft-dep blocker (rather than as a paper-level claim error) preserved the cross-model robustness claim. The E5-family scale-convergence finding is a side effect of the extension that strengthens P1 in a way the previous mixed-family P1 test (MiniLM/mpnet/E5-large) could not.
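The decision entry above credits a per-model try/except wrap with isolating the Nomic failure. The actual M3 implementation is not shown in this diff; the following is only a sketch of that pattern, with `run_all`, `fake_run_one`, and the model names as hypothetical stand-ins.

```python
def run_all(model_names, run_one):
    """Run run_one(name) for each model, collecting failures per model
    instead of letting one exception abort the whole sweep."""
    results, failures = {}, {}
    for name in model_names:
        try:
            results[name] = run_one(name)
        except Exception as exc:  # ImportError from a lazy import lands here
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_run_one(name):
    """Hypothetical runner: simulates a missing optional dependency."""
    if name == "nomic-v1.5":
        raise ImportError("einops")
    return {"model": name, "status": "ok"}


results, failures = run_all(["minilm", "nomic-v1.5", "e5-large"], fake_run_one)
print(sorted(results), failures)
```

Recording the exception type and message per model makes a missing soft dependency visible in the run summary while the remaining models still complete.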
diff --git a/pyproject.toml b/pyproject.toml
@@ -47,6 +47,7 @@ dependencies = [
     "tqdm>=4.65",
     "click>=8.1",
     "python-dotenv>=1.0",
+    "einops>=0.7",  # required by nomic-ai/nomic-embed-text-v1.5 trust_remote_code module
 ]
 
 [project.urls]
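Because the `einops` import only fires inside the model's remote-code path, a missing install surfaces late in a run. A preflight check along the following lines could flag it before any model is loaded. This is a sketch, not part of this PR; `missing_modules` is a hypothetical helper, and it assumes top-level module names only.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in the
    current environment, without importing them."""
    return [m for m in module_names if importlib.util.find_spec(m) is None]


# "einops" is the soft dependency this PR adds; "json" is stdlib and always present.
print(missing_modules(["json", "einops"]))
```

The printed list is empty when `einops` is installed and contains `"einops"` otherwise, so a runner could fail fast with a clear message instead of losing one model mid-sweep.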