
fix: judging pipeline merges pareto_objectives (preserves composite scorer output) #55

Merged
ty13r merged 1 commit into main from fix/zero-pareto-scores-investigation
Apr 20, 2026

Conversation

ty13r (Owner) commented Apr 20, 2026

Summary

Investigation + fix for the correctness=0.00, consistency=0.00, token_efficiency=0.04 scores on the passing live atomic run. Two of the three near-zero scores are real signals, not bugs. One real bug was hiding underneath them.

Verdict per axis

| Axis | Value | Verdict |
| --- | --- | --- |
| correctness | 0.00 | Real signal. Haiku's generated solutions failed pytest on the generated challenges; expected behavior at the cheap tier. Every competitor's det_scores[*:tests] = 0.0, confirmed via DB query. |
| consistency | 0.00 | Intentional MVP stub. comparative.py:91 hardcodes it pending L6 work in v1.1. |
| token_efficiency | 0.04 | Real signal. 1 - trace_len/(MAX_TURNS*2); ~29 of 30 turns consumed means slow. |
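
For concreteness, a back-of-envelope check of that formula (MAX_TURNS=30 and trace_len=58 are assumptions consistent with "~29 of 30 turns consumed", not values read from the run's config):

```python
# Illustrative check of the token_efficiency formula above. The live run's
# exact trace_len isn't in this PR, so 58 is an assumed value consistent
# with roughly 29 turns at 2 trace entries per turn.
MAX_TURNS = 30
trace_len = 58

token_efficiency = 1 - trace_len / (MAX_TURNS * 2)
print(round(token_efficiency, 2))  # 0.03, in line with the reported ~0.04
```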

The actual bug — pipeline clobbering composite scores

Atomic mode scores genomes twice:

  1. variant_evolution writes the composite scorer's rich structural breakdown (composite, l0, compile, ast, template, brevity, behavioral) onto skill.pareto_objectives.
  2. Then run_judging_pipeline runs L1–L5 on per-challenge results. Its per-skill aggregation replaced skill.pareto_objectives wholesale with comparative.py's legacy 5-axis schema (correctness, token_efficiency, code_quality, trigger_accuracy, consistency).

Net: every atomic-run composite ended up scored only on the legacy schema. The l0/ast/template/brevity keys SKLD-bench had computed were silently clobbered between the two passes.

Fix

pipeline.py now merges instead of replaces — pre-existing skill-level keys win on conflict; aggregation only fills in keys the skill doesn't already carry. Atomic-mode runs now retain both schemas side-by-side; molecular-mode parity is unchanged because skill.pareto_objectives starts empty there.
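
A minimal sketch of the merge semantics (the function name and values here are illustrative, not the actual pipeline.py identifiers):

```python
# Sketch of the fixed aggregation behavior: start from the aggregated
# result-level keys, then overlay the skill's pre-existing keys so they
# win on any conflict. Names below are illustrative.
def merge_pareto_objectives(
    existing: dict[str, float], aggregated: dict[str, float]
) -> dict[str, float]:
    merged = dict(aggregated)   # legacy 5-axis keys as the base
    merged.update(existing)     # pre-existing skill-level keys win
    return merged

# Atomic mode: composite-scorer keys survive, legacy axes fill in alongside.
composite_keys = {"composite": 0.71, "l0": 0.90, "ast": 0.85}
legacy_axes = {"correctness": 0.0, "token_efficiency": 0.04, "consistency": 0.0}
merged = merge_pareto_objectives(composite_keys, legacy_axes)
assert merged["composite"] == 0.71 and merged["correctness"] == 0.0
```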

Test plan

  • uv run ruff check skillforge — clean
  • uv run mypy skillforge — 65 files pass
  • uv run pytest tests/ — 411 passed (+1), 2 skipped
  • New test: test_pipeline_preserves_preexisting_pareto_objectives asserts both schemas survive
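
A hedged sketch of what that test asserts; the real test drives run_judging_pipeline on repo fixtures, while this stand-in exercises the same contract with a minimal fake skill (all names below are illustrative):

```python
# Stand-in for the new test's assertion: a skill entering aggregation with
# composite-scorer keys must leave with both schemas intact.
from dataclasses import dataclass, field


@dataclass
class FakeSkill:
    pareto_objectives: dict[str, float] = field(default_factory=dict)


def aggregate_into(skill: FakeSkill, aggregated: dict[str, float]) -> None:
    # The fixed behavior: fill in only missing keys, never overwrite.
    for key, value in aggregated.items():
        skill.pareto_objectives.setdefault(key, value)


def test_pipeline_preserves_preexisting_pareto_objectives():
    skill = FakeSkill(pareto_objectives={"composite": 0.71, "l0": 0.90})
    aggregate_into(skill, {"correctness": 0.0, "token_efficiency": 0.04})

    # Composite-scorer keys survive...
    assert skill.pareto_objectives["composite"] == 0.71
    # ...and the legacy axes are filled in alongside them.
    assert skill.pareto_objectives["token_efficiency"] == 0.04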

Not fixing (by design, documented above)

  • consistency=0.0 — retained as the MVP stub until L6 ships.
  • correctness=0.0 on cheap Haiku — a real fragility signal, not a scoring bug.
  • token_efficiency low when traces are long — correct behavior.

🤖 Generated with Claude Code

Atomic-mode runs go through two scoring passes: variant_evolution
writes composite-scorer keys (composite/l0/compile/ast/template/
brevity/behavioral) onto each SkillGenome, then run_judging_pipeline
runs L1-L5 on the per-challenge results. The pipeline's per-skill
aggregation step rebuilt skill.pareto_objectives wholesale from result-
level keys (comparative.py's legacy {correctness, token_efficiency,
code_quality, trigger_accuracy, consistency}), silently clobbering the
richer structural keys the composite scorer had written.

Net effect: every atomic-run composite ended up scored only on the
legacy 5-axis schema. The last live run's composite showed
correctness=0 / consistency=0 / token_efficiency=0.04 with no sign
of the (actually useful) l0/ast/template/brevity breakdown that the
SKLD-bench composite scorer had already computed.

Fix: pipeline.py now MERGES instead of replaces — pre-existing
skill-level keys win on conflict; aggregation only fills in keys
the skill doesn't already carry. Atomic-mode runs now retain both
schemas side-by-side on skill.pareto_objectives; molecular-mode
parity is unaffected because skill.pareto_objectives starts empty
there and the aggregation is the only source.

Also noting for the reader: the three "zero" values in the last
live run are NOT bugs after this fix:

  consistency=0.0        — L6 is intentionally an MVP stub
                           (comparative.py:91); v1.1 will populate it.
  token_efficiency≈0.04  — genuine signal: 1 - trace_len/(MAX_TURNS*2)
                           ≈ 0.04 means the competitor used most of
                           the turn budget. Slower is worse.
  correctness=0.0        — genuine: Haiku's generated solutions failed
                           pytest on the generated challenges. Expected
                           behavior at the cheap tier; the signal is
                           real and load-bearing.

Covered by one new unit test
(test_pipeline_preserves_preexisting_pareto_objectives), bringing the
suite to 411 passing tests (+1).

QA: ruff + mypy + 411 pytest all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ty13r ty13r merged commit c95d9a6 into main Apr 20, 2026
2 checks passed
@ty13r ty13r deleted the fix/zero-pareto-scores-investigation branch April 20, 2026 05:55