docs+experiments(z-gap): pre-experiment review fixes (C1/C2/C3 + M1/M2/M3 + M4) #3

Merged
heznpc merged 1 commit into main from chore/pre-experiment-review-2026-05-21 on May 20, 2026

@heznpc heznpc commented May 20, 2026

Critical (paper integrity):

  • C1 pretraining contamination caveat: new paragraph in paper §5.5 NL-Code
    Alignment + Limitations bullet. R_code > 1 reframed as "at least as strong
    as pretraining co-occurrence statistics would predict", not as independent
    evidence for Z_sem convergence beyond training-data overlap. Decisive
    separation deferred to tier2/tier3 OOD stimuli.
  • C2 random-matching baseline framing: §5.5 protocol sentence now explicitly
    identifies permutation test (n=10,000) as the random-matching baseline with
    null R ≈ 1. compute_per_language_R_code() now exports the null distribution
    mean/std/p95 to results JSON.
  • C3 HuggingFace revision policy: documented in
    run_strategy_d_code_alignment.py header. Pilot accepts floating-main risk
    and relies on EmbeddingCache for embedding-level reproducibility; explicit
    revision= pin deferred as Minor TODO.
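
The random-matching baseline described in C2 can be illustrated with a minimal sketch. It assumes, purely for illustration, that R is the mean similarity of matched NL-code pairs divided by the mean similarity of all other pairs; the actual definition lives in compute_per_language_R_code() and may differ. The function names r_ratio and permutation_null and the output keys are hypothetical, not taken from the repository.

```python
import random
import statistics


def r_ratio(sim, pairing):
    """Mean similarity of paired cells divided by the mean of all other cells.

    sim is a square list-of-lists: sim[i][j] is the similarity between
    NL description i and code snippet j. pairing[i] is the code index
    treated as the "match" for description i.
    """
    n = len(sim)
    matched = [sim[i][pairing[i]] for i in range(n)]
    unmatched = [sim[i][j] for i in range(n) for j in range(n) if j != pairing[i]]
    return statistics.fmean(matched) / statistics.fmean(unmatched)


def permutation_null(sim, n_perm=10_000, seed=0):
    """Compare the observed R against R under randomly shuffled pairings."""
    rng = random.Random(seed)
    n = len(sim)
    observed = r_ratio(sim, list(range(n)))  # true pairing is the diagonal

    null = []
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)  # random matching of descriptions to snippets
        null.append(r_ratio(sim, perm))
    null.sort()

    # Add-one smoothing keeps the p-value strictly positive.
    p_value = (1 + sum(r >= observed for r in null)) / (1 + n_perm)
    return {
        "observed_R": observed,
        "null_mean": statistics.fmean(null),
        "null_std": statistics.pstdev(null),
        "null_p95": null[min(n_perm - 1, int(0.95 * n_perm))],
        "p_value": p_value,
    }
```

Under this definition a shuffled pairing gives R close to 1 on average, so the exported null mean/std/p95 let a reader judge how far an observed R sits above chance without rerunning the permutations.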

Major:

  • M1 stimulus complexity: new Limitations paragraph stating conclusions apply
    to stdlib-idiom-level operations only.
  • M2 translation provenance: new Limitations paragraph stating no formal IAA;
    translations were first-author + LLM-assisted + bilingual review.
  • M3 model robustness wrap: per-model try/except in run loop; OOM /
    trust-remote-code / network failure of one model skips that cell instead of
    aborting the 7-model sweep.
  • M4 prior art: web-search confirmed no per-language × per-model NL-code
    matrix exists; "to our knowledge, first" qualifier added to §5.5.
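
The per-model robustness wrap in M3 could look roughly like the sketch below. It is not the runner's actual code: run_sweep, run_one_model, and the shape of the failure records are hypothetical names chosen for illustration.

```python
def run_sweep(model_names, run_one_model):
    """Run every model; record failures instead of aborting the sweep.

    run_one_model is any callable that takes a model name and returns
    that model's results (hypothetical interface).
    """
    results = {}
    failed_models = []
    for name in model_names:
        try:
            results[name] = run_one_model(name)
        except Exception as exc:
            # Catch Exception rather than BaseException so that Ctrl-C
            # (KeyboardInterrupt) still stops the whole sweep.
            failed_models.append(
                {"model": name, "error": f"{type(exc).__name__}: {exc}"}
            )
    return results, failed_models
```

Keeping the failure list alongside the results makes a skipped cell visible in the output rather than silently missing.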

Strategy D extension (this session's experiment):

  • MODELS extended 4 -> 7: + E5-small, E5-base, BGE-M3. M5 (P3 multi-model
    probing) deferred to follow-up PR. M6 (Codestral Embed) excluded — no
    MISTRAL_API_KEY in this session.
  • Run meta block (started/finished UTC, Python/torch/sentence-transformers
    versions, seed, n_perm, n_boot, failed_models) written to results JSON.
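
A run meta block of this kind can be assembled with the standard library alone. The sketch below is an assumption about its shape: build_run_meta, package_version, and every key name are hypothetical, and only the listed fields come from the description above.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_meta(started, finished, seed, n_perm, n_boot, failed_models):
    """Collect provenance fields into a JSON-serializable dict."""
    return {
        "started_utc": started.isoformat(),
        "finished_utc": finished.isoformat(),
        "python": platform.python_version(),
        "torch": package_version("torch"),
        "sentence_transformers": package_version("sentence-transformers"),
        "seed": seed,
        "n_perm": n_perm,
        "n_boot": n_boot,
        "failed_models": failed_models,
    }


if __name__ == "__main__":
    start = datetime.now(timezone.utc)
    # ... the experiment would run here ...
    meta = build_run_meta(start, datetime.now(timezone.utc),
                          seed=0, n_perm=10_000, n_boot=10_000,
                          failed_models=[])
    print(json.dumps({"meta": meta}, indent=2))
```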

Decisions log:

  • planning/decisions.md: 2026-05-21 entry documenting all C/M fixes and the
    scope choices for M5/M6.

@heznpc heznpc merged commit b6790ae into main May 20, 2026
1 check passed
@heznpc heznpc deleted the chore/pre-experiment-review-2026-05-21 branch May 20, 2026 17:21
heznpc added a commit that referenced this pull request May 20, 2026
… falsified (#6)

C1 deferred portion (OOD test for the contamination caveat from PR #3) now
closed. The pre-registered prediction was that multi-step / compositional
OOD operations should show LOWER R_code than tier-1 stdlib 1-liners if the
tier-1 effect was primarily pretraining memorization. Observed direction is
the opposite: every model shows STRONGER alignment on OOD.

New runner: experiments/scripts/run_strategy_f_ood_alignment.py
  - 50 OOD ops: 30 tier-2 multi-step (binary_search, BFS, merge_sort, ...)
    + 20 tier-3 compositional (bellman_ford, topological_sort, A*, ...)
  - Same 7-model set, same statistics (permutation n=10k + bootstrap n=10k
    + Holm-Bonferroni) as Strategy D.
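
For readers unfamiliar with the multiple-comparison step, the Holm-Bonferroni procedure named above can be sketched as follows. This is the textbook step-down procedure, not code from the runner; the function name holm_bonferroni is hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-down procedure: sort p-values ascending, compare the k-th
    smallest (k = 0, 1, ...) against alpha / (m - k), and stop at the
    first one that fails. Everything after that point is not rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject
```

With 35 model × language cells, the smallest p-value is compared against 0.05 / 35 ≈ 0.0014, and the thresholds relax as each hypothesis is rejected.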

Results (OOD aggregate vs tier-1 aggregate):

  Model              tier1   OOD    Δ
  ────────────────────────────────────
  UniXcoder (code)   1.07    1.15  +0.08
  MiniLM-L12 (NL)    1.16    1.31  +0.15
  Nomic v1.5         1.07    1.16  +0.09
  E5-small (NL)      1.13    1.28  +0.15
  E5-base (NL)       1.14    1.31  +0.17
  E5-large (NL)      1.20    1.33  +0.13
  BGE-M3 (NL+code)   1.16    1.36  +0.20

  35/35 OOD cells significant (p < 0.05 Holm-Bonferroni)
  Cohen's d up to 4.12 (en, E5-large)
  Permutation-null R in [1.004, 1.008]

Interpretation: multi-step algorithm NL descriptions are longer and more
distinctive (mean 180 chars vs 55 for tier-1), and multi-line function
bodies carry more signal than 1-liners. The embedding alignment exploits
this richer surface form rather than being weakened by reduced
co-occurrence frequency. NL-code alignment is therefore NOT primarily
memorization-driven.

Paper:
- §5.5 contamination caveat: "left to future work" framing removed; now
  points to OOD experiment below.
- §5.5 new "Out-of-distribution NL-code alignment" paragraph + 7×5 OOD
  table + tier1↔OOD aggregate comparison + interpretation.
- Limitations "Pretraining contamination" bullet renamed to "(partially
  addressed)" with summary of OOD result; residual matched-perplexity
  work remains future work.

Decisions log:
- planning/decisions.md: 2026-05-21 Strategy F entry covering the
  pre-registered hypothesis structure (recorded in the runner docstring
  before running, not post-hoc), the observed direction, and the resulting
  paper revisions.

Closes C1 deferred portion. C3 (revision SHA pin) + Minor TODOs remain.