docs+experiments(z-gap): pre-experiment review fixes (C1/C2/C3 + M1/M2/M3 + M4) by heznpc · Pull Request #3 · heznpc/z-gap

heznpc · 2026-05-20T17:21:28Z

Critical (paper integrity):

- C1 pretraining contamination caveat: new paragraph in paper §5.5 NL-Code
  Alignment + Limitations bullet. R_code > 1 reframed as "at least as strong
  as pretraining co-occurrence statistics would predict", not as independent
  evidence for Z_sem convergence beyond training-data overlap. Decisive
  separation deferred to tier2/tier3 OOD stimuli.
- C2 random-matching baseline framing: §5.5 protocol sentence now explicitly
  identifies the permutation test (n=10,000) as the random-matching baseline
  with null R ≈ 1. compute_per_language_R_code() now exports the null
  distribution mean/std/p95 to results JSON.
- C3 HuggingFace revision policy: documented in
  run_strategy_d_code_alignment.py header. The pilot accepts the risk of
  loading models from a floating main revision and relies on EmbeddingCache
  for embedding-level reproducibility; explicit revision= pin deferred as
  Minor TODO.
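
The random-matching baseline in C2 can be illustrated with a minimal sketch. It is not the code of compute_per_language_R_code(): the definition of R used here (mean matched NL-code similarity divided by mean mismatched similarity) is an assumption, and the function name, argument names, and output keys are hypothetical.

```python
import numpy as np


def permutation_null_summary(sim, n_perm=10_000, seed=0):
    """Summarise a random-matching null for an NL-code similarity matrix.

    sim[i, j] is the similarity between NL description i and code snippet j,
    so the diagonal holds the correctly matched pairs.
    ASSUMPTION: R = mean matched similarity / mean mismatched similarity.
    """
    rng = np.random.default_rng(seed)
    n = sim.shape[0]
    rows = np.arange(n)

    # Denominator: mean similarity over all mismatched (off-diagonal) pairs.
    off_diag_mean = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    observed = (np.trace(sim) / n) / off_diag_mean

    # Null: pair each NL description with a randomly permuted code snippet.
    null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(n)
        null[k] = sim[rows, perm].mean() / off_diag_mean

    # One-sided permutation p-value with the usual +1 correction.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return {
        "R_observed": float(observed),
        "null_mean": float(null.mean()),
        "null_std": float(null.std(ddof=1)),
        "null_p95": float(np.percentile(null, 95)),
        "p_value": float(p_value),
    }
```

Under this assumed definition the null sits close to 1 because a random pairing mostly draws mismatched pairs, which is what makes it usable as a random-matching reference point.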

Major:

- M1 stimulus complexity: new Limitations paragraph stating conclusions apply
  to stdlib-idiom-level operations only.
- M2 translation provenance: new Limitations paragraph stating that no formal
  inter-annotator agreement (IAA) was measured; translations were
  first-author + LLM-assisted + bilingual review.
- M3 model robustness wrap: per-model try/except in run loop; OOM /
  trust-remote-code / network failure of one model skips that cell instead of
  aborting the 7-model sweep.
- M4 prior art: a web search found no existing per-language × per-model
  NL-code matrix; "to our knowledge, first" qualifier added to §5.5.
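
The per-model wrap described in M3 might look roughly like the sketch below. The names run_sweep and run_one_model are hypothetical and stand in for whatever the actual run loop calls; only the skip-and-record pattern is taken from the description above.

```python
def run_sweep(model_names, run_one_model):
    """Run every model, recording failures instead of aborting the sweep.

    run_one_model is a hypothetical callable that takes a model name and
    returns that model's results; it may raise on OOM, remote-code, or
    network errors.
    """
    results = {}
    failed_models = []
    for name in model_names:
        try:
            results[name] = run_one_model(name)
        except Exception as exc:
            # Skip this model's cells and keep enough detail to debug later.
            # KeyboardInterrupt is not an Exception subclass, so Ctrl-C
            # still stops the whole sweep.
            failed_models.append(
                {"model": name, "error": f"{type(exc).__name__}: {exc}"}
            )
    return results, failed_models
```

Returning failed_models alongside the results lets the caller write it into the results JSON, so a skipped cell is visible rather than silently missing.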

Strategy D extension (this session's experiment):

- MODELS extended 4 -> 7: + E5-small, E5-base, BGE-M3. M5 (P3 multi-model
  probing) deferred to follow-up PR. M6 (Codestral Embed) excluded: no
  MISTRAL_API_KEY in this session.
- Run meta block (started/finished UTC, Python/torch/sentence-transformers
  versions, seed, n_perm, n_boot, failed_models) written to results JSON.
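
A run meta block of this kind could be assembled as in the following sketch. The key names, the helper names, and the seed value are assumptions chosen for illustration; only the list of recorded fields comes from the description above.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_meta(started, finished, seed, n_perm, n_boot, failed_models):
    """Collect provenance fields for the results JSON (key names are hypothetical)."""
    return {
        "started_utc": started.isoformat(),
        "finished_utc": finished.isoformat(),
        "python": platform.python_version(),
        "torch": package_version("torch"),
        "sentence_transformers": package_version("sentence-transformers"),
        "seed": seed,
        "n_perm": n_perm,
        "n_boot": n_boot,
        "failed_models": list(failed_models),
    }


started = datetime.now(timezone.utc)
# ... the model sweep would run here ...
finished = datetime.now(timezone.utc)

# seed=0 is a placeholder value, not the seed used in the experiment.
meta = build_run_meta(started, finished, seed=0, n_perm=10_000,
                      n_boot=10_000, failed_models=[])
payload = json.dumps({"meta": meta}, indent=2)
```

Reading versions through importlib.metadata avoids importing torch just to record its version, and a missing package shows up as null in the JSON instead of raising.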

Decisions log:

- planning/decisions.md: 2026-05-21 entry documenting all C/M fixes and the
  scope choices for M5/M6.


… falsified (#6)

C1 deferred portion (OOD test for the contamination caveat from PR #3) now closed. The pre-registered prediction was that multi-step / compositional OOD operations should show LOWER R_code than tier-1 stdlib 1-liners if the tier-1 effect was primarily pretraining memorization. Observed direction is the opposite: every model shows STRONGER alignment on OOD.

New runner: experiments/scripts/run_strategy_f_ood_alignment.py

- 50 OOD ops: 30 tier-2 multi-step (binary_search, BFS, merge_sort, ...) + 20 tier-3 compositional (bellman_ford, topological_sort, A*, ...)
- Same 7-model set, same statistics (permutation n=10k + bootstrap n=10k + Holm-Bonferroni) as Strategy D.

Results (OOD aggregate vs tier-1 aggregate):

Model              tier1   OOD     Δ
────────────────────────────────────
UniXcoder (code)   1.07    1.15    +0.08
MiniLM-L12 (NL)    1.16    1.31    +0.15
Nomic v1.5         1.07    1.16    +0.09
E5-small (NL)      1.13    1.28    +0.15
E5-base (NL)       1.14    1.31    +0.17
E5-large (NL)      1.20    1.33    +0.13
BGE-M3 (NL+code)   1.16    1.36    +0.20

35/35 OOD cells significant (p < 0.05 Holm-Bonferroni)
Cohen's d up to 4.12 (en, E5-large)
Permutation-null R in [1.004, 1.008]

Interpretation: multi-step algorithm NL descriptions are longer and more distinctive (mean 180 chars vs 55 for tier-1), and multi-line function bodies are stronger signal carriers than 1-liners. The embedding alignment exploits this richer surface form rather than being damaged by reduced co-occurrence frequency. NL-code alignment is NOT primarily memorization-driven.

Paper:

- §5.5 contamination caveat: "left to future work" framing removed; now points to OOD experiment below.
- §5.5 new "Out-of-distribution NL-code alignment" paragraph + 7×5 OOD table + tier1↔OOD aggregate comparison + interpretation.
- Limitations "Pretraining contamination" bullet renamed to "(partially addressed)" with summary of OOD result; residual matched-perplexity work remains future.

Decisions log:

- planning/decisions.md: 2026-05-21 Strategy F entry covering the pre-registered hypothesis structure (recorded in the runner docstring before running, not post-hoc), the observed direction, and the resulting paper revisions. Closes C1 deferred portion. C3 (revision SHA pin) + Minor TODOs remain.
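
The Holm-Bonferroni correction applied to the per-cell p-values is a standard step-down procedure, sketched below. This is a generic textbook version, not the runner's code; the function name and the example p-values are illustrative only.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/keep decision per hypothesis using Holm's step-down rule."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Once one test fails, every larger p-value is also retained.
            break
    return reject


# Illustrative p-values only, not taken from the experiment.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

With a 7×5 grid of cells, m would be 35, so the smallest p-value has to clear 0.05 / 35 before any later cell is considered.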

heznpc merged commit b6790ae into main May 20, 2026
1 check passed

heznpc deleted the chore/pre-experiment-review-2026-05-21 branch May 20, 2026 17:21

heznpc mentioned this pull request May 20, 2026

results+paper(z-gap): Strategy F OOD alignment — contamination caveat falsified #6

Merged

