
Known limitation: composite scorer is Elixir-scoped #58

@ty13r

Description

Status

Known limitation — no fix is currently planned; Elixir is the
intended scope for the foreseeable future. Filed for visibility, so
that non-Elixir evolution attempts don't look like unexplained bugs
and so the fix path is documented if/when we expand beyond Elixir.

The gap

The 6-layer composite scorer (scripts/scoring/composite_scorer.py)
is hardcoded to the 7 Elixir lighthouse families:

SCAFFOLD_PATHS = {
    "elixir-phoenix-liveview": ...,
    "elixir-ecto-schema-changeset": ...,
    "elixir-ecto-query-writer": ...,
    "elixir-ecto-sandbox-test": ...,
    "elixir-security-linter": ...,
    "elixir-oban-worker": ...,
    "elixir-pattern-match-refactor": ...,
}

For any family the Taxonomist classifies outside this set — a Python
spec, a Dockerfile skill, a YAML linter, etc. — the scorer returns
_FALLBACK, which zeros every structural axis (l0, compile, ast,
template, brevity, behavioral).
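
For orientation, a minimal sketch of that miss path, assuming
dict-shaped scores (SCAFFOLD_PATHS and _FALLBACK are the scorer's
real names; the function and helper below are hypothetical):

# Assumed shape of the fallback: every structural axis pinned to 0.0.
_FALLBACK = {"l0": 0.0, "compile": 0.0, "ast": 0.0,
             "template": 0.0, "brevity": 0.0, "behavioral": 0.0}

def structural_scores(family: str, artifact_dir: str) -> dict:
    scaffold = SCAFFOLD_PATHS.get(family)
    if scaffold is None:
        # Any family outside the 7 Elixir lighthouses lands here:
        # no scaffold to grade against, so flat zeros come back.
        return dict(_FALLBACK)
    # _score_against_scaffold is a hypothetical stand-in for the
    # real per-layer grading.
    return _score_against_scaffold(scaffold, artifact_dir)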

The atomic run's judging pipeline then writes those zeros onto
skill.pareto_objectives. After #55's merge-not-replace fix, the L4
legacy schema (correctness, code_quality, token_efficiency,
trigger_accuracy, consistency) fills in alongside, but the
structural keys stay at zero.
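
If that write is a plain dict union, the post-#55 behavior sketches
as follows (pareto_objectives is the real field; everything else
here is hypothetical):

def write_objectives(skill, structural: dict, legacy: dict) -> None:
    # Merge, don't replace: legacy L4 keys fill in alongside the
    # structural keys, which keep their fallback zeros.
    skill.pareto_objectives = {**skill.pareto_objectives,
                               **structural, **legacy}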

How it looks to a user

From live run #4 (pytest-data-validation-fixtures spec, 2026-04-20):

composite=0.00   l0=0.00   compile=0.00   ast=0.00   template=1.00   brevity=0.00
correctness=0.00 code_quality=0.98 token_efficiency=0.06 trigger_accuracy=1.00

template=1.00 is the only structural axis that works, because its
default rubric happens to have no Elixir-specific keywords. Everything
else is a dishonest zero.

On the run detail page: FitnessRadar, PerDimensionFitnessBar, and
the "best fitness" headline number all read from these keys. The user
can't distinguish "my skill failed" from "SKLD didn't grade it".

Why it matters (when we revisit)

  1. The homepage claims "6-layer composite scoring". Today that's
     "6-layer for Elixir; partial signal for everything else."
  2. Atomic-mode winner selection runs off pareto_objectives — without
     structural signal, non-Elixir evolution is nearly blind.
  3. Visible on every non-Elixir run detail page.

Fix path (for future reference)

Three tiers, ordered by cost. See
plans/GAP-composite-scorer-scope.md
(local on main once pushed) for the full write-up.

Tier 1 — Honest signal (~0.5 day) — the cheapest visible fix.
Non-Elixir runs render "not scored" instead of zeros: return None
sentinels from _FALLBACK, preserve them through
scores_to_pareto_objectives, and have the frontend render "not
scored" for None plus a "this family isn't in SKLD-bench" banner.
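
Sketched minimally, assuming the scores flow as plain dicts (only
_FALLBACK and scores_to_pareto_objectives are real names):

from typing import Optional

STRUCTURAL_AXES = ("l0", "compile", "ast", "template", "brevity", "behavioral")

# None means "not graded", which the frontend can tell apart from 0.0.
_FALLBACK: dict[str, Optional[float]] = {axis: None for axis in STRUCTURAL_AXES}

def scores_to_pareto_objectives(scores: dict) -> dict:
    # Preserve None sentinels instead of coercing them to 0.0.
    return {axis: scores.get(axis) for axis in STRUCTURAL_AXES}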

Tier 2 — Language-agnostic backends (3-4 days per language) —
split the scorer into per-layer dispatch:

Layer              Generalization strategy
L0 string match    Already generic; each family owns a score.py
Compile            Dispatch on file extension (py_compile, tsc --noEmit, etc.)
AST quality        Per-language walkers (Python ast, TS compiler, Elixir Code)
Behavioral tests   Per-family test runner via verification_method
Template           Per-family YAML rubric (data, not code)
Brevity            Already language-agnostic
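
To make the compile row concrete, a hedged sketch of extension
dispatch (py_compile, tsc, and elixirc are real tools; the function
and its wiring into the scorer are assumptions):

import subprocess
import py_compile
from typing import Optional

def compile_layer(path: str) -> Optional[float]:
    if path.endswith(".py"):
        try:
            py_compile.compile(path, doraise=True)  # stdlib syntax check
            return 1.0
        except py_compile.PyCompileError:
            return 0.0
    if path.endswith(".ts"):
        ok = subprocess.run(["tsc", "--noEmit", path]).returncode == 0
        return 1.0 if ok else 0.0
    if path.endswith((".ex", ".exs")):
        ok = subprocess.run(["elixirc", path]).returncode == 0
        return 1.0 if ok else 0.0
    return None  # unknown extension: Tier 1's "not scored"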

Replace the SCAFFOLD_PATHS / NAMESPACE_MAPS dicts with a per-family
taxonomy/<lang>/<family>/evaluation/config.yaml.
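
A sketch of the data-driven lookup that would replace those dicts
(the path shape is from the line above; the loader is hypothetical):

from pathlib import Path
from typing import Optional
import yaml  # PyYAML

def load_family_config(lang: str, family: str) -> Optional[dict]:
    cfg = Path("taxonomy") / lang / family / "evaluation" / "config.yaml"
    if not cfg.is_file():
        return None  # not onboarded yet; fall back to Tier 1's "not scored"
    return yaml.safe_load(cfg.read_text())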

Tier 3 — Per-family onboarding contract. Every new family must
ship with scaffold/skld_bench/ + evaluation/config.yaml +
evaluation/score.py + evaluation/templates.yaml. Enforce in CI.
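
That contract is cheap to enforce; a hypothetical CI gate could be
as small as this (required paths taken from the contract above):

import sys
from pathlib import Path

REQUIRED = ("scaffold/skld_bench", "evaluation/config.yaml",
            "evaluation/score.py", "evaluation/templates.yaml")

# taxonomy/<lang>/<family>/ directories, per the Tier 2 layout.
missing = [f"{fam}: {rel}"
           for fam in Path("taxonomy").glob("*/*") if fam.is_dir()
           for rel in REQUIRED if not (fam / rel).exists()]

if missing:
    print("families missing onboarding artifacts:")
    print("\n".join(f"  {m}" for m in missing))
    sys.exit(1)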

Labels: documentation, enhancement
