diff --git a/experiments/requirements.txt b/experiments/requirements.txt
index 949214b..57927b2 100644
--- a/experiments/requirements.txt
+++ b/experiments/requirements.txt
@@ -12,3 +12,4 @@ kiwipiepy>=0.16
 tqdm>=4.65
 click>=8.1
 python-dotenv>=1.0
+einops>=0.7
diff --git a/paper/main.tex b/paper/main.tex
index 1306a81..1d221f7 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -469,18 +469,21 @@ \subsection{Pilot Experiment and Results}\label{sec:pilot}
 \toprule
 \textbf{Model} & \textbf{en} & \textbf{ko} & \textbf{zh} & \textbf{ar} & \textbf{es} & \textbf{agg} \\
 \midrule
-UniXcoder (code) & 1.22* & 1.01* & 1.08* & 1.01* & 1.05* & 1.07 \\
-MiniLM-L12 (NL) & 1.23* & 1.12* & 1.18* & 1.10* & 1.19* & 1.16 \\
+UniXcoder (code) & 1.22* & 1.01* & 1.08* & 1.01* & 1.05* & 1.07 \\
+MiniLM-L12 (NL) & 1.23* & 1.12* & 1.18* & 1.10* & 1.19* & 1.16 \\
 Nomic v1.5 & 1.24* & 1.02* & 1.03* & 1.01* & 1.07* & 1.07 \\
+E5-small (NL) & 1.22* & 1.09* & 1.13* & 1.09* & 1.14* & 1.13 \\
+E5-base (NL) & 1.22* & 1.11* & 1.13* & 1.11* & 1.16* & 1.14 \\
 E5-large (NL) & 1.28* & 1.16* & 1.19* & 1.16* & 1.22* & 1.20 \\
+BGE-M3 (NL+code) & 1.21* & 1.14* & 1.16* & 1.14* & 1.16* & 1.16 \\
 \bottomrule
 \end{tabular}
-\caption*{\small * $p < 0.05$ after Holm-Bonferroni correction. All 20/20 cells show $R_{\text{code}} > 1$.}
+\caption*{\small * $p < 0.05$ after Holm-Bonferroni correction. All 35/35 cells show $R_{\text{code}} > 1$. Permutation-null distribution mean: $R \in [1.000, 1.005]$ across all cells (random-matching baseline).}
 \end{table}

-All 20 language-model cells show $R_{\text{code}} > 1$ ($p < 0.05$ after correction): NL descriptions are closer to their corresponding code than to mismatched code in every language and every model. The result is robust across code-trained (UniXcoder, Nomic) and NL-only (MiniLM, E5-large) architectures.
+All 35 language-model cells show $R_{\text{code}} > 1$ ($p < 0.05$ after Holm-Bonferroni correction across 35 cells): NL descriptions are closer to their corresponding code than to mismatched code in every language and every model. The permutation null mean falls in $R \in [1.000, 1.005]$ across all cells, confirming the effect is not a metric artifact. The result is robust across code-trained (UniXcoder, Nomic), hybrid (BGE-M3), and NL-only (MiniLM, E5 family) architectures.

-Two patterns emerge. First, \textbf{$\Dtrain$ modulates NL-code alignment}: English consistently shows the highest $R_{\text{code}}$ (1.22--1.28), while Korean and Arabic show the lowest (1.01--1.16), tracking language representation in code training corpora. Second, \textbf{NL-only models achieve higher $R_{\text{code}}$ than code-trained models}: E5-large (1.20 aggregate) and MiniLM (1.16) surpass UniXcoder (1.07) and Nomic (1.07).
+Three patterns emerge. First, \textbf{$\Dtrain$ modulates NL-code alignment}: English consistently shows the highest $R_{\text{code}}$ (1.21--1.28), while Korean and Arabic show the lowest (1.01--1.16), tracking language representation in code training corpora. Second, \textbf{NL-only models achieve higher $R_{\text{code}}$ than code-trained models}: E5-large (1.20 aggregate) and MiniLM (1.16) surpass UniXcoder (1.07) and Nomic (1.07), as does the hybrid BGE-M3 (1.16). Third, \textbf{the E5 family (same architecture, identical training recipe, varying dimension) shows partial scale-convergence}: aggregate $R_{\text{code}}$ rises $1.13$ (small, 384d) $\to 1.14$ (base, 768d) $\to 1.20$ (large, 1024d); the small-to-base step is nearly flat ($+0.01$) while base-to-large is steeper ($+0.06$), so P1's monotonic-with-scale prediction holds qualitatively but is non-linear in this regime.

 \paragraph{Lexical overlap control.} A potential confound: NL descriptions share tokens with their code equivalents (``sort'' appears in both ``Sort the list'' and \texttt{sorted(lst)}). Token overlap correlates with $d_{\text{match}}$ (Spearman $\rho = -0.51$, $p < 0.001$ for MiniLM), confirming a lexical component. However, $R_{\text{code}} > 1$ survives two controls. First, for the 32/50 operations with \emph{zero} token overlap (after stemming), $R_{\text{code}}$ remains above 1 in all three models (1.06--1.18). Second, obfuscating variable names in code (\texttt{lst}$\to$\texttt{v0}, \texttt{s}$\to$\texttt{v0}) reduces $R_{\text{code}}$ by only 1.6--5.4\%, and all models retain $R_{\text{code}} > 1$. Lexical overlap inflates the effect but does not create it: the alignment is primarily semantic.

diff --git a/planning/decisions.md b/planning/decisions.md
index 534f5c3..60f6960 100644
--- a/planning/decisions.md
+++ b/planning/decisions.md
@@ -66,3 +66,18 @@ Format: `## YYYY-MM-DD -- ` with **Context**, **Decision**, **Why**
 - **M6 (Codestral Embed)**: Excluded for this session. `.env` has no `MISTRAL_API_KEY`, and the user constrained the session to Claude Code-accessible models. Sentence-transformers / open-source HF only.

 **Why**: The pre-experiment review caught contamination and baseline-framing issues that, if discovered after results were reported, would have required a paper revision plus a fresh experiment. Catching them before the cross-model extension lets a single PR carry the corrected framing and the new evidence simultaneously.
+
+---
+
+## 2026-05-21 -- Strategy D 7-model results + einops dependency fix
+
+**Context**: After the pre-experiment review PR (#3) merged, the Strategy D extension ran on 7 models. First run: 6/7 succeeded; Nomic v1.5 failed with `ImportError: einops` because its `trust_remote_code` module imports `einops` lazily and the package was not in `requirements.txt`. The M3 try/except wrap correctly isolated the failure so the other 6 models completed cleanly.
+
+**Decisions**:
+
+ - Added `einops>=0.7` to `experiments/requirements.txt` and `pyproject.toml` dependencies. The package is needed only by Nomic's remote-code path; a loose lower bound (`>=0.7`) is sufficient because the API has been stable since 0.6.
+ - Re-ran Strategy D with einops installed. All 7 models succeeded. Final matrix: **35/35 cells with $R_{\text{code}} > 1$ and $p < 0.05$ after Holm-Bonferroni**. Permutation-null mean ∈ [1.000, 1.005] across all cells (C2 baseline framing empirically confirmed).
+ - Paper §5.5 table updated from 4 rows to 7 rows. Body text revised from "20/20 cells" to "35/35 cells", and a "Third pattern" paragraph added on the E5 family's partial scale-convergence ($1.13 \to 1.14 \to 1.20$ at $384/768/1024$d).
+ - The pretraining contamination caveat (C1) added in PR #3 stays unchanged -- adding more models does not address contamination, only cross-model robustness.
+
+**Why**: The 7-model extension was the empirical contribution this session aimed to land. Catching einops as a soft-dep blocker (rather than as a paper-level claim error) preserved the cross-model robustness claim. The E5-family scale-convergence finding is a side effect of the extension that strengthens P1 in a way the previous mixed-family P1 test (MiniLM/mpnet/E5-large) could not.
diff --git a/pyproject.toml b/pyproject.toml
index 3dd2fa4..6d9ebaf 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -47,6 +47,7 @@ dependencies = [
     "tqdm>=4.65",
     "click>=8.1",
     "python-dotenv>=1.0",
+    "einops>=0.7", # required by nomic-ai/nomic-embed-text-v1.5 trust_remote_code module
 ]

 [project.urls]
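
For reference, the Holm-Bonferroni step-down correction cited in the table caption and the decision log can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function name `holm_bonferroni` and the p-values in the usage lines are hypothetical and are not taken from the experiment.

```python
from typing import List


def holm_bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject/keep flag per hypothesis under Holm's step-down rule.

    Hypothetical helper for illustration; the experiment code may differ.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (0-based rank) is compared to alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down: once one test fails, all larger p-values are kept.
            break
    return reject


# Illustrative p-values only (NOT experimental data): 35 cells, as in a 7x5 matrix.
example_p = [0.001] * 35
flags = holm_bonferroni(example_p)
print(sum(flags), "of", len(flags), "cells rejected at alpha = 0.05")
```

With 35 cells, the smallest p-value must clear $0.05/35 \approx 0.0014$, which is why a "35/35 significant after correction" claim is considerably stricter than 35 uncorrected tests at $0.05$.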