
results+paper(z-gap): Strategy F OOD alignment — contamination caveat falsified#6

Merged: heznpc merged 1 commit into main from chore/strategy-f-ood-alignment-2026-05-21 on May 20, 2026
heznpc (Owner) commented May 20, 2026

The deferred portion of C1 (the OOD test for the contamination caveat from
PR #3) is now closed. The pre-registered prediction: if the tier-1 effect
were primarily pretraining memorization, multi-step / compositional OOD
operations should show LOWER R_code than tier-1 stdlib 1-liners. The
observed direction is the opposite: every model shows STRONGER alignment on
OOD.

New runner: experiments/scripts/run_strategy_f_ood_alignment.py

  • 50 OOD ops: 30 tier-2 multi-step (binary_search, BFS, merge_sort, ...)
    + 20 tier-3 compositional (bellman_ford, topological_sort, A*, ...)
  • Same 7-model set, same statistics (permutation n=10k + bootstrap n=10k
    + Holm-Bonferroni) as Strategy D.
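
For readers unfamiliar with the correction named above, here is a minimal sketch of the standard Holm-Bonferroni step-down procedure. It is not taken from the runner script: the function name `holm_bonferroni` is hypothetical, and this PR does not state whether the correction is applied across all 35 cells at once or within smaller families.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected.

    Illustrative sketch of the textbook step-down procedure; the
    function name and interface are assumptions, not the runner's API.
    """
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Threshold loosens at each step: alpha/m, alpha/(m-1), ..., alpha.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # First failure stops the procedure; all larger p-values are retained.
            break
    return reject
```

If the family were all 35 cells, the smallest p-value would have to clear 0.05 / 35 ≈ 0.0014, which a 10k-permutation test can resolve because its smallest attainable p-value is roughly 1 / 10001.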

Results (OOD aggregate vs tier-1 aggregate):

  Model              tier1   OOD    Δ
  ────────────────────────────────────
  UniXcoder (code)   1.07    1.15  +0.08
  MiniLM-L12 (NL)    1.16    1.31  +0.15
  Nomic v1.5         1.07    1.16  +0.09
  E5-small (NL)      1.13    1.28  +0.15
  E5-base (NL)       1.14    1.31  +0.17
  E5-large (NL)      1.20    1.33  +0.13
  BGE-M3 (NL+code)   1.16    1.36  +0.20

35/35 OOD cells significant (p < 0.05 Holm-Bonferroni)
Cohen's d up to 4.12 (en, E5-large)
Permutation-null R in [1.004, 1.008]
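
To make the summary statistics above concrete, the sketch below shows one way a permutation null for a ratio statistic and a pooled-SD Cohen's d can be computed. This PR does not define R_code, so `alignment_ratio` (mean matched similarity divided by mean mismatched similarity) is an assumed stand-in, as are all function names. The PR also does not say which samples feed the reported Cohen's d.

```python
import random
from statistics import mean, variance


def alignment_ratio(sim, match):
    """ASSUMED statistic: mean matched similarity / mean mismatched similarity.

    sim[i][j] is the similarity between NL description i and code snippet j;
    match[i] is the code index treated as the partner of description i.
    """
    n = len(sim)
    matched = [sim[i][match[i]] for i in range(n)]
    mismatched = [sim[i][j] for i in range(n) for j in range(n) if j != match[i]]
    return mean(matched) / mean(mismatched)


def permutation_p_value(sim, n_perm=10_000, seed=0):
    """One-sided permutation test: shuffle the NL-to-code pairing."""
    rng = random.Random(seed)
    identity = list(range(len(sim)))
    observed = alignment_ratio(sim, identity)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)
        if alignment_ratio(sim, perm) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (sample variances)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5
```

Under this assumed definition a shuffled pairing yields a ratio close to 1, which is consistent with the reported permutation-null range of [1.004, 1.008], though the runner's actual statistic may differ.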

Interpretation: NL descriptions of multi-step algorithms are longer and
more distinctive (mean 180 chars vs 55 for tier-1), and multi-line function
bodies carry more signal than 1-liners. The embedding alignment exploits
this richer surface form rather than being damaged by reduced co-occurrence
frequency. NL-code alignment is NOT primarily memorization-driven.

Paper:

  • §5.5 contamination caveat: "left to future work" framing removed; now
    points to OOD experiment below.
  • §5.5 new "Out-of-distribution NL-code alignment" paragraph + 7×5 OOD
    table + tier1↔OOD aggregate comparison + interpretation.
  • Limitations "Pretraining contamination" bullet renamed to "(partially
    addressed)" with a summary of the OOD result; the residual
    matched-perplexity analysis remains future work.

Decisions log:

  • planning/decisions.md: 2026-05-21 Strategy F entry covering the pre-
    registered hypothesis structure (recorded in the runner docstring
    before running, not post-hoc), the observed direction, and the resulting
    paper revisions.

Closes C1 deferred portion. C3 (revision SHA pin) + Minor TODOs remain.

heznpc merged commit 0f6cabc into main on May 20, 2026
1 check passed
heznpc deleted the chore/strategy-f-ood-alignment-2026-05-21 branch on May 20, 2026 18:04