results+paper(z-gap): Strategy F OOD alignment — contamination caveat falsified by heznpc · Pull Request #6 · heznpc/z-gap

heznpc · 2026-05-20T18:04:00Z

C1 deferred portion (OOD test for the contamination caveat from PR #3) now
closed. The pre-registered prediction was that multi-step / compositional
OOD operations should show LOWER R_code than tier-1 stdlib 1-liners if the
tier-1 effect was primarily pretraining memorization. Observed direction is
the opposite: every model shows STRONGER alignment on OOD.

New runner: experiments/scripts/run_strategy_f_ood_alignment.py

- 50 OOD ops: 30 tier-2 multi-step (binary_search, BFS, merge_sort, ...)
  + 20 tier-3 compositional (bellman_ford, topological_sort, A*, ...)
- Same 7-model set, same statistics (permutation n=10k + bootstrap n=10k
  + Holm-Bonferroni) as Strategy D.
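
For readers unfamiliar with the correction step named above, here is a minimal sketch of the Holm-Bonferroni step-down procedure. It is illustrative only: the function name `holm_bonferroni` and the example p-values are assumptions, not code or numbers taken from the runner.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Illustrative helper (hypothetical name), not the runner's own code.
    """
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Step-down threshold: alpha / (m - rank), with rank starting at 0.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Once one test fails, every larger p-value is retained too.
            break
    return reject


# Made-up p-values for four hypothetical cells.
example = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
print(example)
```

The procedure controls the family-wise error rate across all cells while being uniformly less conservative than a plain Bonferroni cut at alpha / m.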

Results (OOD aggregate vs tier-1 aggregate):

  Model              tier1   OOD     Δ
  ──────────────────────────────────────
  UniXcoder (code)   1.07    1.15   +0.08
  MiniLM-L12 (NL)    1.16    1.31   +0.15
  Nomic v1.5         1.07    1.16   +0.09
  E5-small (NL)      1.13    1.28   +0.15
  E5-base (NL)       1.14    1.31   +0.17
  E5-large (NL)      1.20    1.33   +0.13
  BGE-M3 (NL+code)   1.16    1.36   +0.20
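
The Δ column can be re-derived from the two aggregate columns. The snippet below is a small arithmetic check using only the values in the table; the variable names are illustrative.

```python
# (tier1, OOD) aggregates copied from the table above.
rows = {
    "UniXcoder (code)": (1.07, 1.15),
    "MiniLM-L12 (NL)": (1.16, 1.31),
    "Nomic v1.5": (1.07, 1.16),
    "E5-small (NL)": (1.13, 1.28),
    "E5-base (NL)": (1.14, 1.31),
    "E5-large (NL)": (1.20, 1.33),
    "BGE-M3 (NL+code)": (1.16, 1.36),
}

# Delta = OOD - tier1, rounded to the two decimals reported in the table.
deltas = {name: round(ood - tier1, 2) for name, (tier1, ood) in rows.items()}

for name, delta in deltas.items():
    print(f"{name:<18} {delta:+.2f}")

# Every model moves in the same direction (OOD above tier-1).
print(all(delta > 0 for delta in deltas.values()))
```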

35/35 OOD cells significant (p < 0.05 Holm-Bonferroni)
Cohen's d up to 4.12 (en, E5-large)
Permutation-null R in [1.004, 1.008]
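
To make the permutation-null figure easier to read, here is a hedged sketch of how a permutation test on a ratio statistic can work. It assumes, purely for illustration, that R is the mean similarity of matched NL-code pairs divided by the mean similarity of mismatched pairs; this PR does not define R, and the runner's actual definition may differ. The names `alignment_ratio` and `permutation_p_value` and the toy similarity matrix are hypothetical.

```python
import random


def alignment_ratio(sim, pairing):
    """Hypothetical R: mean matched similarity / mean mismatched similarity.

    sim[i][j] is the similarity between NL description i and code snippet j
    (assumed positive); pairing[i] is the snippet treated as i's match.
    """
    n = len(sim)
    matched = [sim[i][pairing[i]] for i in range(n)]
    mismatched = [sim[i][j] for i in range(n) for j in range(n) if j != pairing[i]]
    return (sum(matched) / len(matched)) / (sum(mismatched) / len(mismatched))


def permutation_p_value(sim, n_perm=10_000, seed=0):
    """Compare the observed ratio with ratios under shuffled pairings."""
    rng = random.Random(seed)
    identity = list(range(len(sim)))
    observed = alignment_ratio(sim, identity)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # break the true NL-code correspondence
        if alignment_ratio(sim, perm) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy 6x6 matrix: true pairs score 0.9, all other cells 0.5 (made-up data).
toy = [[0.9 if i == j else 0.5 for j in range(6)] for i in range(6)]
observed_r, p_value = permutation_p_value(toy, n_perm=500)
print(observed_r, p_value)
```

Under this assumed definition, shuffled pairings give ratios close to 1, which is why a null R just above 1 is the reference point for the observed values.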

Interpretation: multi-step algorithm NL descriptions are longer and more
distinctive (mean 180 chars vs 55 for tier-1), and multi-line function
bodies are stronger signal carriers than 1-liners. The embedding alignment
exploits this richer surface form rather than being damaged by reduced
co-occurrence frequency. NL-code alignment is NOT primarily
memorization-driven.

Paper:

- §5.5 contamination caveat: "left to future work" framing removed; now
  points to OOD experiment below.
- §5.5 new "Out-of-distribution NL-code alignment" paragraph + 7×5 OOD
  table + tier1↔OOD aggregate comparison + interpretation.
- Limitations "Pretraining contamination" bullet renamed to "(partially
  addressed)" with summary of OOD result; residual matched-perplexity
  work remains future.

Decisions log:

- planning/decisions.md: 2026-05-21 Strategy F entry covering the
  pre-registered hypothesis structure (recorded in the runner docstring
  before running, not post-hoc), the observed direction, and the resulting
  paper revisions.

Closes C1 deferred portion. C3 (revision SHA pin) + Minor TODOs remain.


heznpc merged commit 0f6cabc into main May 20, 2026
1 check passed

heznpc deleted the chore/strategy-f-ood-alignment-2026-05-21 branch May 20, 2026 18:04
