
fix: judging pipeline merges pareto_objectives (preserves composite scorer output) #55

Merged
ty13r merged 1 commit into main from fix/zero-pareto-scores-investigation
Apr 20, 2026

Conversation

ty13r (Owner) commented Apr 20, 2026

Summary

Investigation + fix for the correctness=0.00, consistency=0.00, token_efficiency=0.04 scores on the passing live atomic run. Two of the three near-zero scores are real signals, not bugs. One real bug was hiding underneath them.

Verdict per axis

| Axis | Value | Verdict |
| --- | --- | --- |
| correctness | 0.00 | Real signal. Haiku's generated solutions failed pytest on the generated challenges; expected behavior at the cheap tier. Every competitor's det_scores[*:tests] = 0.0, confirmed via DB query. |
| consistency | 0.00 | Intentional MVP stub. comparative.py:91 hardcodes it pending L6 work in v1.1. |
| token_efficiency | 0.04 | Real signal. 1 - trace_len/(MAX_TURNS*2); ~29 of 30 turns consumed means slow. |
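
For concreteness, a back-of-envelope check of that formula (MAX_TURNS=30 and trace_len=58 are assumptions consistent with "~29 of 30 turns consumed", not values read from the run's config):

```python
# Illustrative check of the token_efficiency formula above. The live run's
# exact trace_len isn't in this PR, so 58 is an assumed value consistent
# with roughly 29 turns at 2 trace entries per turn.
MAX_TURNS = 30
trace_len = 58

token_efficiency = 1 - trace_len / (MAX_TURNS * 2)
print(round(token_efficiency, 2))  # 0.03, in line with the reported ~0.04
```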

The actual bug — pipeline clobbering composite scores

Atomic mode scores genomes twice:

  1. variant_evolution writes the composite scorer's rich structural breakdown (composite, l0, compile, ast, template, brevity, behavioral) onto skill.pareto_objectives.
  2. Then run_judging_pipeline runs L1–L5 on per-challenge results. Its per-skill aggregation replaced skill.pareto_objectives wholesale with comparative.py's legacy 5-axis schema (correctness, token_efficiency, code_quality, trigger_accuracy, consistency).

Net: every atomic-run composite ended up scored only on the legacy schema. The l0/ast/template/brevity keys SKLD-bench had computed were silently clobbered between the two passes.

Fix

pipeline.py now merges instead of replaces — pre-existing skill-level keys win on conflict; aggregation only fills in keys the skill doesn't already carry. Atomic-mode runs now retain both schemas side-by-side; molecular-mode parity is unchanged because skill.pareto_objectives starts empty there.
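
A minimal sketch of the merge semantics (the function name and values here are illustrative, not the actual pipeline.py identifiers):

```python
# Sketch of the fixed aggregation behavior: start from the aggregated
# result-level keys, then overlay the skill's pre-existing keys so they
# win on any conflict. Names below are illustrative.
def merge_pareto_objectives(
    existing: dict[str, float], aggregated: dict[str, float]
) -> dict[str, float]:
    merged = dict(aggregated)   # legacy 5-axis keys as the base
    merged.update(existing)     # pre-existing skill-level keys win
    return merged

# Atomic mode: composite-scorer keys survive, legacy axes fill in alongside.
composite_keys = {"composite": 0.71, "l0": 0.90, "ast": 0.85}
legacy_axes = {"correctness": 0.0, "token_efficiency": 0.04, "consistency": 0.0}
merged = merge_pareto_objectives(composite_keys, legacy_axes)
assert merged["composite"] == 0.71 and merged["correctness"] == 0.0
```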

Test plan

  • uv run ruff check skillforge — clean
  • uv run mypy skillforge — 65 files pass
  • uv run pytest tests/ — 411 passed (+1), 2 skipped
  • New test: test_pipeline_preserves_preexisting_pareto_objectives asserts both schemas survive
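
A hedged sketch of what that test asserts; the real test drives run_judging_pipeline on repo fixtures, while this stand-in exercises the same contract with a minimal fake skill (all names below are illustrative):

```python
# Stand-in for the new test's assertion: a skill entering aggregation with
# composite-scorer keys must leave with both schemas intact.
from dataclasses import dataclass, field


@dataclass
class FakeSkill:
    pareto_objectives: dict[str, float] = field(default_factory=dict)


def aggregate_into(skill: FakeSkill, aggregated: dict[str, float]) -> None:
    # The fixed behavior: fill in only missing keys, never overwrite.
    for key, value in aggregated.items():
        skill.pareto_objectives.setdefault(key, value)


def test_pipeline_preserves_preexisting_pareto_objectives():
    skill = FakeSkill(pareto_objectives={"composite": 0.71, "l0": 0.90})
    aggregate_into(skill, {"correctness": 0.0, "token_efficiency": 0.04})

    # Composite-scorer keys survive...
    assert skill.pareto_objectives["composite"] == 0.71
    # ...and the legacy axes are filled in alongside them.
    assert skill.pareto_objectives["token_efficiency"] == 0.04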

Not fixing (by design, documented above)

  • consistency=0.0 — retained as the MVP stub until L6 ships.
  • correctness=0.0 on cheap Haiku — a real fragility signal, not a scoring bug.
  • token_efficiency low when traces are long — correct behavior.

🤖 Generated with Claude Code

Atomic-mode runs go through two scoring passes: variant_evolution
writes composite-scorer keys (composite/l0/compile/ast/template/
brevity/behavioral) onto each SkillGenome, then run_judging_pipeline
runs L1-L5 on the per-challenge results. The pipeline's per-skill
aggregation step rebuilt skill.pareto_objectives wholesale from result-
level keys (comparative.py's legacy {correctness, token_efficiency,
code_quality, trigger_accuracy, consistency}), silently clobbering the
richer structural keys the composite scorer had written.

Net effect: every atomic-run composite ended up scored only on the
legacy 5-axis schema. The last live run's composite showed
correctness=0 / consistency=0 / token_efficiency=0.04 with no sign
of the (actually useful) l0/ast/template/brevity breakdown that the
SKLD-bench composite scorer had already computed.

Fix: pipeline.py now MERGES instead of replaces — pre-existing
skill-level keys win on conflict; aggregation only fills in keys
the skill doesn't already carry. Atomic-mode runs now retain both
schemas side-by-side on skill.pareto_objectives; molecular-mode
parity is unaffected because skill.pareto_objectives starts empty
there and the aggregation is the only source.

Also noting for the reader: the three "zero" values in the last
live run are NOT bugs after this fix:

  consistency=0.0        — L6 is intentionally an MVP stub
                           (comparative.py:91); v1.1 will populate it.
  token_efficiency≈0.04  — genuine signal: 1 - trace_len/(MAX_TURNS*2)
                           ≈ 0.04 means the competitor used most of
                           the turn budget. Slower is worse.
  correctness=0.0        — genuine: Haiku's generated solutions failed
                           pytest on the generated challenges. Expected
                           behavior at the cheap tier; the signal is
                           real and load-bearing.

Covered by one new unit test
(test_pipeline_preserves_preexisting_pareto_objectives), bringing the
suite to 411 passing tests (+1).

QA: ruff + mypy + 411 pytest all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ty13r ty13r merged commit c95d9a6 into main Apr 20, 2026
2 checks passed
@ty13r ty13r deleted the fix/zero-pareto-scores-investigation branch April 20, 2026 05:55