Status
Known limitation — not currently planned for fix. Elixir is the
intended scope for the foreseeable future. Filing for visibility so
any non-Elixir evolution attempts don't look like unexplained bugs
and so the fix path is documented if/when we expand beyond Elixir.
The gap
The 6-layer composite scorer (scripts/scoring/composite_scorer.py)
is hardcoded to the 7 Elixir lighthouse families:
SCAFFOLD_PATHS = {
    "elixir-phoenix-liveview": ...,
    "elixir-ecto-schema-changeset": ...,
    "elixir-ecto-query-writer": ...,
    "elixir-ecto-sandbox-test": ...,
    "elixir-security-linter": ...,
    "elixir-oban-worker": ...,
    "elixir-pattern-match-refactor": ...,
}
For any family the Taxonomist classifies outside this set — a Python
spec, a Dockerfile skill, a YAML linter, etc. — the scorer returns
_FALLBACK, which zeros every structural axis (l0, compile, ast,
template, brevity, behavioral).
The atomic run's judging pipeline then writes those zeros onto
skill.pareto_objectives. After #55's merge-not-replace fix, the L4
legacy schema (correctness, code_quality, token_efficiency,
trigger_accuracy, consistency) fills in alongside, but the
structural keys stay at zero.
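The behavior described above can be sketched roughly as follows. This is an illustration, not the code in composite_scorer.py: the shape of `_FALLBACK` is inferred from the description, and `score_family` and `merge_objectives` are hypothetical names standing in for the scorer lookup and the merge step from #55.

```python
STRUCTURAL_AXES = ("l0", "compile", "ast", "template", "brevity", "behavioral")

# Stand-in for the real table; the values are elided here, as in the issue.
SCAFFOLD_PATHS = {"elixir-oban-worker": ...}

# Assumed shape of _FALLBACK: every structural axis pinned to zero.
_FALLBACK = {axis: 0.0 for axis in STRUCTURAL_AXES}


def score_family(family: str) -> dict:
    """Hypothetical lookup: unknown families get the all-zero fallback."""
    if family not in SCAFFOLD_PATHS:
        return dict(_FALLBACK)
    raise NotImplementedError("real Elixir scoring is out of scope for this sketch")


def merge_objectives(structural: dict, legacy: dict) -> dict:
    """Merge-not-replace: keep the structural keys and add the legacy ones."""
    merged = dict(structural)
    merged.update(legacy)
    return merged


objectives = merge_objectives(
    score_family("pytest-data-validation-fixtures"),
    {"correctness": 0.0, "code_quality": 0.98},
)
```

In this sketch the legacy keys arrive intact, but nothing in `objectives` records that the structural zeros came from the fallback rather than from a real measurement.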
How it looks to a user
From live run #4 (pytest-data-validation-fixtures spec, 2026-04-20):
composite=0.00 l0=0.00 compile=0.00 ast=0.00 template=1.00 brevity=0.00
correctness=0.00 code_quality=0.98 token_efficiency=0.06 trigger_accuracy=1.00
template=1.00 is the only structural axis that works, because its
default rubric happens to have no Elixir-specific keywords. Everything
else is a dishonest zero.
On the run detail page: FitnessRadar, PerDimensionFitnessBar, and
the "best fitness" headline number all read from these keys. The user
can't distinguish "my skill failed" from "SKLD didn't grade it".
Why it matters (when we revisit)
- The homepage claims "6-layer composite scoring". Today that's
"6-layer for Elixir; partial signal for everything else."
- Atomic-mode winner selection runs off pareto_objectives; without
structural signal, non-Elixir evolution is nearly blind.
- The zeros are visible on every non-Elixir run detail page.
Fix path (for future reference)
Three tiers, ordered by cost. See plans/GAP-composite-scorer-scope.md
(local on main once pushed) for the full write-up.
Tier 1 — Honest signal (~0.5 day) — cheapest visible fix.
Non-Elixir runs render "not scored" instead of zeros. Return None
sentinels from _FALLBACK; preserve them through
scores_to_pareto_objectives; frontend renders "not scored" for
None and shows a "this family isn't in SKLD-bench" banner.
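A minimal sketch of what Tier 1 might look like on the Python side. The real signature of scores_to_pareto_objectives is not shown in this issue, so the version below is an assumption. `render_axis` is a hypothetical Python stand-in for the frontend formatting logic.

```python
from typing import Optional

STRUCTURAL_AXES = ("l0", "compile", "ast", "template", "brevity", "behavioral")

# Assumed Tier 1 shape: None means "not scored", distinct from a measured 0.0.
_FALLBACK: dict[str, Optional[float]] = {axis: None for axis in STRUCTURAL_AXES}


def scores_to_pareto_objectives(
    scores: dict[str, Optional[float]],
) -> dict[str, Optional[float]]:
    """Pass scores through, keeping None instead of coercing it to 0.0."""
    return {
        key: (None if value is None else float(value))
        for key, value in scores.items()
    }


def render_axis(value: Optional[float]) -> str:
    """Hypothetical display helper mirroring what the frontend would do."""
    return "not scored" if value is None else f"{value:.2f}"


objectives = scores_to_pareto_objectives({**_FALLBACK, "code_quality": 0.98})
labels = {key: render_axis(value) for key, value in objectives.items()}
```

The key design point is that every consumer between the scorer and the UI has to tolerate None; any `value or 0.0` style default along the way would quietly turn the sentinel back into a zero.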
Tier 2 — Language-agnostic backends (3-4 days per language) —
split the scorer into per-layer dispatch:
| Layer | Generalization strategy |
|---|---|
| L0 string match | Already generic; each family owns a score.py |
| Compile | Dispatch on file extension (py → py_compile, ts → tsc --noEmit, etc.) |
| AST quality | Per-language walkers (Python ast, TS compiler, Elixir Code) |
| Behavioral tests | Per-family test runner via verification_method |
| Template | Per-family YAML rubric (data, not code) |
| Brevity | Already language-agnostic |
Replace SCAFFOLD_PATHS / NAMESPACE_MAPS dicts with
taxonomy/<lang>/<family>/evaluation/config.yaml per family.
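As one illustration of per-layer dispatch, the compile layer could be keyed on file extension along these lines. This is a sketch under assumptions: `COMPILE_BACKENDS`, `compile_python`, and `compile_score` are hypothetical names, only the Python backend is filled in, and returning None for an unsupported extension is a choice made for this example.

```python
import os
import py_compile
import tempfile
from pathlib import Path
from typing import Callable, Optional


def compile_python(path: Path) -> bool:
    """Byte-compile a .py file into a scratch directory; True if it compiles."""
    with tempfile.TemporaryDirectory() as scratch:
        try:
            py_compile.compile(
                str(path),
                cfile=os.path.join(scratch, "out.pyc"),
                doraise=True,
            )
            return True
        except py_compile.PyCompileError:
            return False


# Hypothetical registry: one entry per supported extension.
# A ".ts" backend could shell out to `tsc --noEmit` in the same way.
COMPILE_BACKENDS: dict[str, Callable[[Path], bool]] = {
    ".py": compile_python,
}


def compile_score(path: Path) -> Optional[float]:
    """Return 1.0 or 0.0 from the matching backend, or None if there is none."""
    backend = COMPILE_BACKENDS.get(path.suffix)
    if backend is None:
        return None
    return 1.0 if backend(path) else 0.0
```

Adding a language then means registering one more function in the table, rather than editing the scorer's control flow.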
Tier 3 — Per-family onboarding contract. Every new family must
ship with scaffold/skld_bench/ + evaluation/config.yaml +
evaluation/score.py + evaluation/templates.yaml. Enforce in CI.
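The CI enforcement could be as small as a directory walk. The sketch below assumes families live at taxonomy/<lang>/<family>/ as described above; `REQUIRED`, `missing_artifacts`, and `check_taxonomy` are hypothetical names, not existing code.

```python
from pathlib import Path

# The four artifacts named in the onboarding contract.
REQUIRED = (
    "scaffold/skld_bench",
    "evaluation/config.yaml",
    "evaluation/score.py",
    "evaluation/templates.yaml",
)


def missing_artifacts(family_dir: Path) -> list[str]:
    """List the required paths that do not exist under one family directory."""
    return [rel for rel in REQUIRED if not (family_dir / rel).exists()]


def check_taxonomy(root: Path) -> dict[str, list[str]]:
    """Map each <lang>/<family> directory under root to its missing artifacts."""
    problems: dict[str, list[str]] = {}
    for family_dir in sorted(p for p in root.glob("*/*") if p.is_dir()):
        missing = missing_artifacts(family_dir)
        if missing:
            problems[family_dir.relative_to(root).as_posix()] = missing
    return problems
```

A CI job would call `check_taxonomy(Path("taxonomy"))` and fail the build if the returned dict is non-empty.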
Source references
- scripts/scoring/composite_scorer.py - the scoped scorer
- skillforge/engine/scorer.py::score_competitor - async wrapper
- skillforge/engine/variant_evolution/dimension.py - call site
- skillforge/agents/judge/pipeline.py - #55's merge-not-replace fix ("fix: judging pipeline merges pareto_objectives (preserves composite scorer output)")
- taxonomy/elixir/SCHEMAS.md - what the per-family shape looks like today
Related
- journal/017-clean-code-overhaul.md - Wave-2 through Wave-6 refactor