Skip to content

fix(z-gap): close all 15 findings from xhigh-recall code review#8

Merged
heznpc merged 1 commit into
mainfrom
chore/code-review-fixes-2026-05-21
May 28, 2026
Merged

fix(z-gap): close all 15 findings from xhigh-recall code review#8
heznpc merged 1 commit into
mainfrom
chore/code-review-fixes-2026-05-21

Conversation

@heznpc

@heznpc heznpc commented May 28, 2026

Copy link
Copy Markdown
Owner

Critical fixes (paper claim integrity):

  • V8: Strategy D/E/F sys.exit(2) on any failed model unless
    Z_GAP_ALLOW_PARTIAL_RESULTS=1. Holm-Bonferroni's "across 35 cells"
    claim can no longer be silently invalidated by a model dropout.
  • V1/V14: SentenceTransformerEmbedder.name now includes full repo path
    (org/name → org__name) and @<sha8> revision suffix. C3 closure
    (PR experiments(z-gap): frozen model registry with HuggingFace SHAs (closes C3) #7) now actually holds end-to-end against SHA bumps and across
    org/name basename collisions.
  • V2: run_cross_experiment_synthesis._normalize_results_envelope()
    unwraps {_meta, results} so legacy consumers keep working with the
    new D/E/F JSON shape; strategy_e and strategy_f added to known files.
  • V3: paper §5.5 / Limitations "20 cells / four models / 20/20" drift
    updated to "35 cells / seven models / 35/35 + OOD 35/35" in 3 places
    (L463 method, L516 P2-resolution, L663 Limitations).

Statistical fixes (correctness):

  • V5: compute_per_language_R_code substitutes NaN (not 1.0) when
    d_match_perm is empty or bootstrap mean_m ≤ 1e-10; np.nanmean for
    random_baseline_R aggregation. New null R range [1.0001, 1.0046]
    (tier1) / [1.0005, 1.0086] (OOD), unbiased by silent 1.0 imputations.
  • V6: permutation p-value uses (k+1)/(n+1) convention. No cell reports
    literal 0.0 anymore (verified post-rerun: min nonzero p = 0.0001
    across all 70 D+F cells). Reviewer push-back surface closed.

Robustness / pitfall fixes:

  • V7: Strategy D/E/F save JSON BEFORE generating figures; figures in
    try/except. Multi-hour compute no longer lost to matplotlib fail.
  • V9: MistralEmbedder Retry sets respect_retry_after_header=False;
    bounded by backoff_factor=1 (~31s worst case), eliminating the
    multi-hour silent stall mode on server-sent Retry-After.
  • V10: OpenAI client timeout 60s → 300s; legacy batch callers no longer
    regress on slow server-side processing.
  • V11: Strategy E replaces categories[op_id] with categories.get() +
    _label() helper. Empty test sets produce {skip:true} cells with
    NaN accuracy instead of crashing on clf.predict(np.array([])).
  • V12: load_ood_stimuli() asserts tier2/tier3 op_id uniqueness with the
    duplicate list in the error message. 50/50 unique today; future
    collision will fail loudly.
  • V13: EmbeddingCache._key() switched from |-joined string to a
    JSON-encoded payload hash. ['a|b','c'] and ['a','b|c'] now hash
    to distinct keys.
  • V18: SentenceTransformerEmbedder.dimension falls back to a single-text
    encode probe when the deprecated get_sentence_embedding_dimension()
    returns None. Nomic v1.5 no longer risks int(None) silent skip.
  • V20: synthesis script counts aggregate as a 6th language → explicit
    if lang == "aggregate": continue in the per-language counter.

Hygiene:

  • V4: Strategy D datetime.datetime.utcnow() → datetime.now(datetime.UTC),
    matching E/F and surviving future Python ≥3.13 removal.

Re-execution:

  • D/E/F rerun successfully (7/7 models each, ~5 min wall time).
  • 35/35 + multi-model P3 + 35/35 OOD all preserved.
  • 2-decimal R_code values unchanged except UniXcoder tier1 (1.0649 ≈ 1.06,
    was printed 1.07).
  • OOD Cohen's d_max E5-large 4.12 → E5-base 4.42; paper updated.

Decisions log:

  • planning/decisions.md: 2026-05-21 entry covering all 15 fixes with
    per-finding rationale and the re-execution outcome.

Critical fixes (paper claim integrity):
- V8: Strategy D/E/F sys.exit(2) on any failed model unless
      Z_GAP_ALLOW_PARTIAL_RESULTS=1. Holm-Bonferroni's "across 35 cells"
      claim can no longer be silently invalidated by a model dropout.
- V1/V14: SentenceTransformerEmbedder.name now includes full repo path
      (org/name → org__name) and `@<sha8>` revision suffix. C3 closure
      (PR #7) now actually holds end-to-end against SHA bumps and across
      org/name basename collisions.
- V2:  run_cross_experiment_synthesis._normalize_results_envelope()
      unwraps {_meta, results} so legacy consumers keep working with the
      new D/E/F JSON shape; strategy_e and strategy_f added to known files.
- V3:  paper §5.5 / Limitations "20 cells / four models / 20/20" drift
      updated to "35 cells / seven models / 35/35 + OOD 35/35" in 3 places
      (L463 method, L516 P2-resolution, L663 Limitations).

Statistical fixes (correctness):
- V5:  compute_per_language_R_code substitutes NaN (not 1.0) when
      d_match_perm is empty or bootstrap mean_m ≤ 1e-10; np.nanmean for
      random_baseline_R aggregation. New null R range [1.0001, 1.0046]
      (tier1) / [1.0005, 1.0086] (OOD), unbiased by silent 1.0 imputations.
- V6:  permutation p-value uses (k+1)/(n+1) convention. No cell reports
      literal 0.0 anymore (verified post-rerun: min nonzero p = 0.0001
      across all 70 D+F cells). Reviewer push-back surface closed.

Robustness / pitfall fixes:
- V7:  Strategy D/E/F save JSON BEFORE generating figures; figures in
      try/except. Multi-hour compute no longer lost to matplotlib fail.
- V9:  MistralEmbedder Retry sets respect_retry_after_header=False;
      bounded by backoff_factor=1 (~31s worst case), eliminating the
      multi-hour silent stall mode on server-sent Retry-After.
- V10: OpenAI client timeout 60s → 300s; legacy batch callers no longer
      regress on slow server-side processing.
- V11: Strategy E replaces categories[op_id] with categories.get() +
      _label() helper. Empty test sets produce {skip:true} cells with
      NaN accuracy instead of crashing on clf.predict(np.array([])).
- V12: load_ood_stimuli() asserts tier2/tier3 op_id uniqueness with the
      duplicate list in the error message. 50/50 unique today; future
      collision will fail loudly.
- V13: EmbeddingCache._key() switched from `|`-joined string to a
      JSON-encoded payload hash. ['a|b','c'] and ['a','b|c'] now hash
      to distinct keys.
- V18: SentenceTransformerEmbedder.dimension falls back to a single-text
      encode probe when the deprecated get_sentence_embedding_dimension()
      returns None. Nomic v1.5 no longer risks int(None) silent skip.
- V20: synthesis script counts aggregate as a 6th language → explicit
      `if lang == "aggregate": continue` in the per-language counter.

Hygiene:
- V4:  Strategy D datetime.datetime.utcnow() → datetime.now(datetime.UTC),
      matching E/F and surviving future Python ≥3.13 removal.

Re-execution:
- D/E/F rerun successfully (7/7 models each, ~5 min wall time).
- 35/35 + multi-model P3 + 35/35 OOD all preserved.
- 2-decimal R_code values unchanged except UniXcoder tier1 (1.0649 ≈ 1.06,
  was printed 1.07).
- OOD Cohen's d_max E5-large 4.12 → E5-base 4.42; paper updated.

Decisions log:
- planning/decisions.md: 2026-05-21 entry covering all 15 fixes with
  per-finding rationale and the re-execution outcome.
@heznpc heznpc merged commit 3dcd986 into main May 28, 2026
@heznpc heznpc deleted the chore/code-review-fixes-2026-05-21 branch May 28, 2026 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant