Atomic-mode runs go through two scoring passes: variant_evolution
writes composite-scorer keys (composite/l0/compile/ast/template/
brevity/behavioral) onto each SkillGenome, then run_judging_pipeline
runs L1-L5 on the per-challenge results. The pipeline's per-skill
aggregation step rebuilt skill.pareto_objectives wholesale from result-
level keys (comparative.py's legacy {correctness, token_efficiency,
code_quality, trigger_accuracy, consistency}), silently clobbering the
richer structural keys the composite scorer had written.
Net effect: every atomic-run composite ended up scored only on the
legacy 5-axis schema. The last live run's composite showed
correctness=0 / consistency=0 / token_efficiency=0.04 with no sign
of the (actually useful) l0/ast/template/brevity breakdown that the
SKLD-bench composite scorer had already computed.
Fix: pipeline.py now MERGES instead of replaces — pre-existing
skill-level keys win on conflict; aggregation only fills in keys
the skill doesn't already carry. Atomic-mode runs now retain both
schemas side-by-side on skill.pareto_objectives; molecular-mode
parity is unaffected because skill.pareto_objectives starts empty
there and the aggregation is the only source.
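The merge rule can be sketched as follows. `merge_pareto_objectives` is a hypothetical helper name (not the actual pipeline.py code), and the numeric values are placeholders; only the key names come from the two schemas described above:

```python
def merge_pareto_objectives(existing: dict[str, float],
                            aggregated: dict[str, float]) -> dict[str, float]:
    """Merge aggregation output into a skill's pareto_objectives.

    Pre-existing skill-level keys win on conflict; the aggregation
    only fills in keys the skill doesn't already carry.
    """
    merged = dict(aggregated)   # start from the aggregation's keys...
    merged.update(existing)     # ...then let pre-existing keys override
    return merged

# Atomic mode: composite-scorer keys already on the skill survive,
# and the legacy axes are filled in alongside them.
composite_keys = {"composite": 0.61, "l0": 0.8, "ast": 0.7,
                  "template": 0.5, "brevity": 0.4}
legacy_keys = {"correctness": 0.0, "token_efficiency": 0.04,
               "consistency": 0.0}
merged = merge_pareto_objectives(composite_keys, legacy_keys)

# Molecular mode: the skill starts empty, so aggregation is the only source.
molecular = merge_pareto_objectives({}, legacy_keys)
```

The old behavior was the equivalent of `merged = dict(aggregated)` with no `update` step, which is what clobbered the composite-scorer keys.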
Also noting for the reader: the three "zero" values in the last
live run are NOT bugs after this fix:
- consistency=0.0 — L6 is intentionally an MVP stub
  (comparative.py:91); v1.1 will populate it.
- token_efficiency=0.04 — genuine signal: the score is
  1 - trace_len/(MAX_TURNS*2), so ≈0.04 means the competitor used
  most of the turn budget. Slower is worse.
- correctness=0.0 — genuine: Haiku's generated solutions failed
  pytest on the generated challenges. Expected behavior at the
  cheap tier; the signal is real and load-bearing.
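The token_efficiency arithmetic works out as in the sketch below. `MAX_TURNS = 15` is an assumed value for illustration only (the real constant lives in the scorer config); 29 consumed half-turns against a 30-unit budget is the "~29 of 30" case from the live run:

```python
MAX_TURNS = 15          # assumed for illustration, not the repo's constant
trace_len = 29          # half-turns the competitor actually consumed

# Higher is better: an agent that finishes quickly keeps most of the
# budget; one that burns nearly all of it scores near zero.
token_efficiency = 1 - trace_len / (MAX_TURNS * 2)
# 29 of 30 budget units spent -> a score near zero; slower is worse.
```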
Covered by one new unit test (test_pipeline_preserves_preexisting_pareto_objectives),
bringing the total to 411 passing (+1).
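The shape of that regression test is roughly the following sketch. `SkillGenome` and `aggregate_into` here are minimal stand-ins, not the repo's actual classes or signatures:

```python
from dataclasses import dataclass, field


@dataclass
class SkillGenome:
    """Minimal stand-in for the real genome class."""
    pareto_objectives: dict = field(default_factory=dict)


def aggregate_into(skill: SkillGenome, aggregated: dict) -> None:
    """Stand-in for the pipeline's per-skill aggregation step:
    fill in only the keys the skill doesn't already carry."""
    for key, value in aggregated.items():
        skill.pareto_objectives.setdefault(key, value)


def test_pipeline_preserves_preexisting_pareto_objectives():
    # A skill variant_evolution already scored with the composite schema.
    skill = SkillGenome(pareto_objectives={"composite": 0.6, "l0": 0.8})
    aggregate_into(skill, {"correctness": 0.0, "composite": 0.1})
    # Pre-existing composite keys survive the aggregation pass...
    assert skill.pareto_objectives["composite"] == 0.6
    # ...and the legacy axes were filled in alongside them.
    assert skill.pareto_objectives["correctness"] == 0.0
```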
QA: ruff + mypy + 411 pytest all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Investigation + fix for the `correctness=0.00`, `consistency=0.00`, `token_efficiency=0.04` scores on the passing live atomic run. Two of the three zeros are real signals, not bugs. One real bug was hiding underneath them.

Verdict per axis

- `correctness`: `det_scores[*:tests] = 0.0`, confirmed via DB query.
- `consistency`: `comparative.py:91` hardcodes it pending L6 work in v1.1.
- `token_efficiency`: `1 - trace_len/(MAX_TURNS*2)`; ~29 of 30 turns consumed means slow.

The actual bug: pipeline clobbering composite scores
Atomic mode scores genomes twice:
- `variant_evolution` writes the composite scorer's rich structural breakdown (composite, l0, compile, ast, template, brevity, behavioral) onto `skill.pareto_objectives`.
- `run_judging_pipeline` runs L1–L5 on per-challenge results. Its per-skill aggregation replaced `skill.pareto_objectives` wholesale with `comparative.py`'s legacy 5-axis schema (correctness, token_efficiency, code_quality, trigger_accuracy, consistency).

Net: every atomic-run composite ended up scored only on the legacy schema. The `l0`/`ast`/`template`/`brevity` keys SKLD-bench had computed were silently clobbered between the two passes.

Fix
`pipeline.py` now merges instead of replaces — pre-existing skill-level keys win on conflict; aggregation only fills in keys the skill doesn't already carry. Atomic-mode runs now retain both schemas side-by-side; molecular-mode parity is unchanged because `skill.pareto_objectives` starts empty there.

Test plan

- `uv run ruff check skillforge` — clean
- `uv run mypy skillforge` — 65 files pass
- `uv run pytest tests/` — 411 passed (+1), 2 skipped
- `test_pipeline_preserves_preexisting_pareto_objectives` asserts both schemas survive

Not fixing (by design, documented above)
- `consistency=0.0` — retained as the MVP stub until L6 ships.
- `correctness=0.0` on cheap Haiku — a real fragility signal, not a scoring bug.
- `token_efficiency` low when traces are long — correct behavior.

🤖 Generated with Claude Code