results+paper(z-gap): Strategy F OOD alignment — contamination caveat falsified#6
Merged
… falsified

C1 deferred portion (OOD test for the contamination caveat from PR #3) now
closed. The pre-registered prediction was that multi-step / compositional
OOD operations should show LOWER R_code than tier-1 stdlib 1-liners if the
tier-1 effect was primarily pretraining memorization. Observed direction is
the opposite: every model shows STRONGER alignment on OOD.

New runner: experiments/scripts/run_strategy_f_ood_alignment.py
- 50 OOD ops: 30 tier-2 multi-step (binary_search, BFS, merge_sort, ...) +
  20 tier-3 compositional (bellman_ford, topological_sort, A*, ...)
- Same 7-model set, same statistics (permutation n=10k + bootstrap n=10k +
  Holm-Bonferroni) as Strategy D.

Results (OOD aggregate vs tier-1 aggregate):

Model               tier1   OOD    Δ
──────────────────────────────────────
UniXcoder (code)    1.07    1.15   +0.08
MiniLM-L12 (NL)     1.16    1.31   +0.15
Nomic v1.5          1.07    1.16   +0.09
E5-small (NL)       1.13    1.28   +0.15
E5-base (NL)        1.14    1.31   +0.17
E5-large (NL)       1.20    1.33   +0.13
BGE-M3 (NL+code)    1.16    1.36   +0.20

35/35 OOD cells significant (p < 0.05 Holm-Bonferroni)
Cohen's d up to 4.12 (en, E5-large)
Permutation-null R in [1.004, 1.008]

Interpretation: multi-step algorithm NL descriptions are longer and more
distinctive (mean 180 chars vs 55 for tier-1), and multi-line function
bodies are stronger signal carriers than 1-liners. The embedding alignment
exploits this richer surface form rather than being damaged by reduced
co-occurrence frequency. NL-code alignment is NOT primarily
memorization-driven.

Paper:
- §5.5 contamination caveat: "left to future work" framing removed; now
  points to OOD experiment below.
- §5.5 new "Out-of-distribution NL-code alignment" paragraph + 7×5 OOD
  table + tier1↔OOD aggregate comparison + interpretation.
- Limitations "Pretraining contamination" bullet renamed to "(partially
  addressed)" with summary of OOD result; residual matched-perplexity
  work remains future.
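For readers unfamiliar with the statistics named in the runner description, the following is a minimal sketch of a one-sided permutation test followed by a Holm-Bonferroni step-down correction. It is not the code in run_strategy_f_ood_alignment.py. In particular, `ratio_of_means` is a hypothetical stand-in for R_code (this description does not define the statistic), and the similarity values in the demo are made up for illustration.

```python
import random


def ratio_of_means(matched, mismatched):
    """Hypothetical stand-in for R_code: mean matched / mean mismatched similarity."""
    return (sum(matched) / len(matched)) / (sum(mismatched) / len(mismatched))


def permutation_p_value(matched, mismatched, n_perm=10_000, seed=0):
    """One-sided permutation p-value for the ratio statistic.

    Group labels are shuffled n_perm times; the p-value is the share of
    shuffles whose ratio is at least as large as the observed one, with
    the usual add-one correction so it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = ratio_of_means(matched, mismatched)
    pooled = list(matched) + list(mismatched)
    k = len(matched)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ratio_of_means(pooled[:k], pooled[k:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def holm_bonferroni(p_values, alpha=0.05):
    """Return one reject/keep flag per p-value (Holm step-down procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are kept
    return reject


# Illustrative data only (not from the experiment).
demo_matched = [0.82, 0.79, 0.88, 0.91, 0.85, 0.80, 0.87, 0.90]
demo_mismatched = [0.61, 0.58, 0.66, 0.63, 0.60, 0.65, 0.59, 0.62]
demo_p = permutation_p_value(demo_matched, demo_mismatched, n_perm=2000)
demo_flags = holm_bonferroni([0.01, 0.04, 0.03])
print(demo_p, demo_flags)
```

Holm-Bonferroni controls the family-wise error rate across all cells while being uniformly less conservative than plain Bonferroni.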

Decisions log:
- planning/decisions.md: 2026-05-21 Strategy F entry covering the
  pre-registered hypothesis structure (recorded in the runner docstring
  before running, not post-hoc), the observed direction, and the resulting
  paper revisions.

Closes C1 deferred portion. C3 (revision SHA pin) + Minor TODOs remain.
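
The observed direction can be re-derived from the aggregate values reported in the results table. The sketch below copies those seven (tier-1, OOD) pairs and checks that every Δ is positive, which is the opposite of what the memorization hypothesis predicted. The dictionary and function names are illustrative, not taken from the runner.

```python
# (tier-1 aggregate, OOD aggregate) per model, copied from the results table.
AGGREGATES = {
    "UniXcoder (code)": (1.07, 1.15),
    "MiniLM-L12 (NL)": (1.16, 1.31),
    "Nomic v1.5": (1.07, 1.16),
    "E5-small (NL)": (1.13, 1.28),
    "E5-base (NL)": (1.14, 1.31),
    "E5-large (NL)": (1.20, 1.33),
    "BGE-M3 (NL+code)": (1.16, 1.36),
}


def ood_deltas(aggregates):
    """Return OOD minus tier-1 for each model, rounded to two decimals."""
    return {model: round(ood - tier1, 2) for model, (tier1, ood) in aggregates.items()}


deltas = ood_deltas(AGGREGATES)
for model, delta in deltas.items():
    print(f"{model:<18} {delta:+.2f}")

# The memorization hypothesis predicted negative deltas; check the sign per model.
all_stronger_on_ood = all(delta > 0 for delta in deltas.values())
print("every model stronger on OOD:", all_stronger_on_ood)
```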