Merged
1 change: 1 addition & 0 deletions experiments/requirements.txt
@@ -12,3 +12,4 @@ kiwipiepy>=0.16
tqdm>=4.65
click>=8.1
python-dotenv>=1.0
einops>=0.7
13 changes: 8 additions & 5 deletions paper/main.tex
@@ -469,18 +469,21 @@ \subsection{Pilot Experiment and Results}\label{sec:pilot}
\toprule
\textbf{Model} & \textbf{en} & \textbf{ko} & \textbf{zh} & \textbf{ar} & \textbf{es} & \textbf{agg} \\
\midrule
UniXcoder (code) & 1.22* & 1.01* & 1.08* & 1.01* & 1.05* & 1.07 \\
MiniLM-L12 (NL) & 1.23* & 1.12* & 1.18* & 1.10* & 1.19* & 1.16 \\
UniXcoder (code) & 1.22* & 1.01* & 1.08* & 1.01* & 1.05* & 1.07 \\
MiniLM-L12 (NL) & 1.23* & 1.12* & 1.18* & 1.10* & 1.19* & 1.16 \\
Nomic v1.5 & 1.24* & 1.02* & 1.03* & 1.01* & 1.07* & 1.07 \\
E5-small (NL) & 1.22* & 1.09* & 1.13* & 1.09* & 1.14* & 1.13 \\
E5-base (NL) & 1.22* & 1.11* & 1.13* & 1.11* & 1.16* & 1.14 \\
E5-large (NL) & 1.28* & 1.16* & 1.19* & 1.16* & 1.22* & 1.20 \\
BGE-M3 (NL+code) & 1.21* & 1.14* & 1.16* & 1.14* & 1.16* & 1.16 \\
\bottomrule
\end{tabular}
\caption*{\small * $p < 0.05$ after Holm-Bonferroni correction. All 20/20 cells show $R_{\text{code}} > 1$.}
\caption*{\small * $p < 0.05$ after Holm-Bonferroni correction. All 35/35 cells show $R_{\text{code}} > 1$. Permutation-null distribution mean: $R \in [1.000, 1.005]$ across all cells (random-matching baseline).}
\end{table}

All 20 language-model cells show $R_{\text{code}} > 1$ ($p < 0.05$ after correction): NL descriptions are closer to their corresponding code than to mismatched code in every language and every model. The result is robust across code-trained (UniXcoder, Nomic) and NL-only (MiniLM, E5-large) architectures.
All 35 language--model cells (5 languages $\times$ 7 models) show $R_{\text{code}} > 1$ ($p < 0.05$ after Holm-Bonferroni correction across 35 cells): NL descriptions are closer to their corresponding code than to mismatched code in every language and every model. The permutation-null mean falls in $R \in [1.000, 1.005]$ across all cells, so random matching yields $R \approx 1$ and the observed effect is not a metric artifact. The result is robust across code-trained (UniXcoder, Nomic), hybrid (BGE-M3), and NL-only (MiniLM, E5 family) architectures.

Two patterns emerge. First, \textbf{$\Dtrain$ modulates NL-code alignment}: English consistently shows the highest $R_{\text{code}}$ (1.22--1.28), while Korean and Arabic show the lowest (1.01--1.16), tracking language representation in code training corpora. Second, \textbf{NL-only models achieve higher $R_{\text{code}}$ than code-trained models}: E5-large (1.20 aggregate) and MiniLM (1.16) surpass UniXcoder (1.07) and Nomic (1.07).
Three patterns emerge. First, \textbf{$\Dtrain$ modulates NL-code alignment}: English consistently shows the highest $R_{\text{code}}$ (1.21--1.28), while Korean and Arabic show the lowest (1.01--1.16), tracking language representation in code training corpora. Second, \textbf{NL-only models achieve higher $R_{\text{code}}$ than code-trained models}: E5-large (1.20 aggregate) and MiniLM (1.16) surpass UniXcoder (1.07) and Nomic (1.07), and the hybrid BGE-M3 (1.16) matches MiniLM. Third, \textbf{the E5 family (same architecture, identical training recipe, varying dimension) shows partial scale-convergence}: aggregate $R_{\text{code}}$ rises from $1.13$ (small, 384d) to $1.14$ (base, 768d) to $1.20$ (large, 1024d). The small-to-base step is nearly flat while the base-to-large step is steep, so P1's monotonic-with-scale prediction holds qualitatively, but the increase is non-linear in this regime.

\paragraph{Lexical overlap control.} A potential confound: NL descriptions share tokens with their code equivalents (``sort'' appears in both ``Sort the list'' and \texttt{sorted(lst)}). Token overlap correlates with $d_{\text{match}}$ (Spearman $\rho = -0.51$, $p < 0.001$ for MiniLM), confirming a lexical component. However, $R_{\text{code}} > 1$ survives two controls. First, for the 32/50 operations with \emph{zero} token overlap (after stemming), $R_{\text{code}}$ remains above 1 in all three models (1.06--1.18). Second, obfuscating variable names in code (\texttt{lst}$\to$\texttt{v0}, \texttt{s}$\to$\texttt{v0}) reduces $R_{\text{code}}$ by only 1.6--5.4\%, and all models retain $R_{\text{code}} > 1$. Lexical overlap inflates the effect but does not create it: the alignment is primarily semantic.

15 changes: 15 additions & 0 deletions planning/decisions.md
@@ -66,3 +66,18 @@ Format: `## YYYY-MM-DD -- <short title>` with **Context**, **Decision**, **Why**
- **M6 (Codestral Embed)**: Excluded for this session. `.env` has no `MISTRAL_API_KEY`, and the user constrained the session to Claude Code-accessible models. Sentence-transformers / open-source HF only.

**Why**: The pre-experiment review caught contamination and baseline-framing issues that, if discovered after results were reported, would have required a paper revision plus a fresh experiment. Catching them before the cross-model extension lets a single PR carry the corrected framing and the new evidence simultaneously.

---

## 2026-05-21 -- Strategy D 7-model results + einops dependency fix

**Context**: After the pre-experiment review PR (#3) merged, the Strategy D extension ran on 7 models. First run: 6/7 succeeded; Nomic v1.5 failed with `ImportError: einops` because its `trust_remote_code` module imports `einops` lazily and the package was not in `requirements.txt`. The M3 try/except wrap correctly isolated the failure so the other 6 models completed cleanly.
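
The per-model isolation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual M3 wrapper: `run_all_models` and `fake_evaluate` are hypothetical names, and the returned value `1.0` is a placeholder rather than a real score.

```python
from typing import Callable, Dict, Iterable, Tuple


def run_all_models(
    model_names: Iterable[str],
    evaluate: Callable[[str], float],
) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Evaluate each model independently and record failures instead of aborting."""
    results: Dict[str, float] = {}
    failures: Dict[str, str] = {}
    for name in model_names:
        try:
            results[name] = evaluate(name)
        except Exception as exc:
            # Deliberately broad: one model's missing dependency (for example
            # a lazy import inside remote code) should not stop the others.
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures


def fake_evaluate(name: str) -> float:
    """Hypothetical stand-in for a real evaluation run."""
    if name == "needs-missing-dep":
        raise ImportError("No module named 'einops'")
    return 1.0  # placeholder value, not a measured result


results, failures = run_all_models(
    ["model-a", "needs-missing-dep", "model-b"], fake_evaluate
)
```

Recording the exception type and message per model makes a missing soft dependency visible in the run summary, while the remaining models still produce results.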

**Decisions**:

- Added `einops>=0.7` to `experiments/requirements.txt` and `pyproject.toml` dependencies. The package is needed only by Nomic's remote-code path; pinning loosely (`>=0.7`) is sufficient because the API has been stable since 0.6.
- Re-ran Strategy D with einops installed. All 7 models succeeded. Final matrix: **35/35 cells with $R_{\text{code}} > 1$ and $p < 0.05$ after Holm-Bonferroni**. Permutation-null mean ∈ [1.000, 1.005] across all cells (C2 baseline framing empirically confirmed).
- Paper §5.5 Table updated 4-row → 7-row. Body text revised from "20/20 cells" to "35/35 cells", and a "Third pattern" passage added on the E5 family's partial scale-convergence ($1.13 \to 1.14 \to 1.20$ at $384/768/1024$d).
- The pretraining contamination caveat (C1) added in PR #3 stays unchanged — adding more models does not address contamination, only cross-model robustness.

**Why**: The 7-model extension was the empirical contribution this session aimed to land. Catching einops as a soft-dep blocker (rather than as a paper-level claim error) preserved the cross-model robustness claim. The E5-family scale-convergence finding is a side effect of the extension that strengthens P1 in a way the previous mixed-family P1 test (MiniLM/mpnet/E5-large) could not.
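
For reference, the Holm-Bonferroni step-down procedure named in the decisions above can be sketched as below. This is a generic textbook version, assuming one p-value per cell; `holm_bonferroni` is a hypothetical helper, not the project's code, and the p-values in the usage lines are made up for illustration.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return a reject/keep flag per hypothesis using Holm's step-down rule."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared to alpha / (m - k + 1).
        threshold = alpha / (m - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            # Once one hypothesis is retained, all larger p-values are retained too.
            break
    return reject


# Illustrative p-values only (not results from the experiment).
small_example = holm_bonferroni([0.001, 0.04, 0.03, 0.2])
all_cells_pass = holm_bonferroni([0.001] * 35)
all_cells_fail = holm_bonferroni([0.002] * 35)
```

With 35 cells, the smallest p-value must be at most $0.05 / 35 \approx 0.0014$ before any cell can be declared significant, which is why uniform p-values of 0.001 all pass while uniform p-values of 0.002 all fail in this sketch.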
1 change: 1 addition & 0 deletions pyproject.toml
@@ -47,6 +47,7 @@ dependencies = [
"tqdm>=4.65",
"click>=8.1",
"python-dotenv>=1.0",
"einops>=0.7", # required by nomic-ai/nomic-embed-text-v1.5 trust_remote_code module
]

[project.urls]