
Known limitation: composite scorer is Elixir-scoped #58

@ty13r

Description

Status

Known limitation — no fix is currently planned; Elixir is the
intended scope for the foreseeable future. Filed for visibility, so
that non-Elixir evolution attempts don't look like unexplained bugs
and so the fix path is documented if/when we expand beyond Elixir.

The gap

The 6-layer composite scorer (scripts/scoring/composite_scorer.py)
is hardcoded to the 7 Elixir lighthouse families:

SCAFFOLD_PATHS = {
    "elixir-phoenix-liveview": ...,
    "elixir-ecto-schema-changeset": ...,
    "elixir-ecto-query-writer": ...,
    "elixir-ecto-sandbox-test": ...,
    "elixir-security-linter": ...,
    "elixir-oban-worker": ...,
    "elixir-pattern-match-refactor": ...,
}

For any family the Taxonomist classifies outside this set — a Python
spec, a Dockerfile skill, a YAML linter, etc. — the scorer returns
_FALLBACK, which zeros every structural axis (l0, compile, ast,
template, brevity, behavioral).
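
For orientation, a minimal sketch of that miss path, assuming
dict-shaped scores (SCAFFOLD_PATHS and _FALLBACK are the scorer's
real names; the function and helper below are hypothetical):

# Assumed shape of the fallback: every structural axis pinned to 0.0.
_FALLBACK = {"l0": 0.0, "compile": 0.0, "ast": 0.0,
             "template": 0.0, "brevity": 0.0, "behavioral": 0.0}

def structural_scores(family: str, artifact_dir: str) -> dict:
    scaffold = SCAFFOLD_PATHS.get(family)
    if scaffold is None:
        # Any family outside the 7 Elixir lighthouses lands here:
        # no scaffold to grade against, so flat zeros come back.
        return dict(_FALLBACK)
    # _score_against_scaffold is a hypothetical stand-in for the
    # real per-layer grading.
    return _score_against_scaffold(scaffold, artifact_dir)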

The atomic run's judging pipeline then writes those zeros onto
skill.pareto_objectives. After #55's merge-not-replace fix, the L4
legacy schema (correctness, code_quality, token_efficiency,
trigger_accuracy, consistency) fills in alongside, but the
structural keys stay at zero.
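
If that write is a plain dict union, the post-#55 behavior sketches
as follows (pareto_objectives is the real field; everything else
here is hypothetical):

def write_objectives(skill, structural: dict, legacy: dict) -> None:
    # Merge, don't replace: legacy L4 keys fill in alongside the
    # structural keys, which keep their fallback zeros.
    skill.pareto_objectives = {**skill.pareto_objectives,
                               **structural, **legacy}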

How it looks to a user

From live run #4 (pytest-data-validation-fixtures spec, 2026-04-20):

composite=0.00   l0=0.00   compile=0.00   ast=0.00   template=1.00   brevity=0.00
correctness=0.00 code_quality=0.98 token_efficiency=0.06 trigger_accuracy=1.00

template=1.00 is the only structural axis that works, because its
default rubric happens to have no Elixir-specific keywords. Everything
else is a dishonest zero.

On the run detail page: FitnessRadar, PerDimensionFitnessBar, and
the "best fitness" headline number all read from these keys. The user
can't distinguish "my skill failed" from "SKLD didn't grade it".

Why it matters (when we revisit)

  1. The homepage claims "6-layer composite scoring". Today that's
     "6-layer for Elixir; partial signal for everything else."
  2. Atomic-mode winner selection runs off pareto_objectives — without
     structural signal, non-Elixir evolution is nearly blind.
  3. Visible on every non-Elixir run detail page.

Fix path (for future reference)

Three tiers, ordered by cost. See
plans/GAP-composite-scorer-scope.md
(local on main once pushed) for the full write-up.

Tier 1 — Honest signal (~0.5 day) — the cheapest visible fix.
Non-Elixir runs render "not scored" instead of zeros: return None
sentinels from _FALLBACK, preserve them through
scores_to_pareto_objectives, and have the frontend render "not
scored" for None plus a "this family isn't in SKLD-bench" banner.
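
Sketched minimally, assuming the scores flow as plain dicts (only
_FALLBACK and scores_to_pareto_objectives are real names):

from typing import Optional

STRUCTURAL_AXES = ("l0", "compile", "ast", "template", "brevity", "behavioral")

# None means "not graded", which the frontend can tell apart from 0.0.
_FALLBACK: dict[str, Optional[float]] = {axis: None for axis in STRUCTURAL_AXES}

def scores_to_pareto_objectives(scores: dict) -> dict:
    # Preserve None sentinels instead of coercing them to 0.0.
    return {axis: scores.get(axis) for axis in STRUCTURAL_AXES}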

Tier 2 — Language-agnostic backends (3-4 days per language) —
split the scorer into per-layer dispatch:

Layer              Generalization strategy
L0 string match    Already generic; each family owns a score.py
Compile            Dispatch on file extension (py_compile, tsc --noEmit, etc.)
AST quality        Per-language walkers (Python ast, TS compiler, Elixir Code)
Behavioral tests   Per-family test runner via verification_method
Template           Per-family YAML rubric (data, not code)
Brevity            Already language-agnostic
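
To make the compile row concrete, a hedged sketch of extension
dispatch (py_compile, tsc, and elixirc are real tools; the function
and its wiring into the scorer are assumptions):

import subprocess
import py_compile
from typing import Optional

def compile_layer(path: str) -> Optional[float]:
    if path.endswith(".py"):
        try:
            py_compile.compile(path, doraise=True)  # stdlib syntax check
            return 1.0
        except py_compile.PyCompileError:
            return 0.0
    if path.endswith(".ts"):
        ok = subprocess.run(["tsc", "--noEmit", path]).returncode == 0
        return 1.0 if ok else 0.0
    if path.endswith((".ex", ".exs")):
        ok = subprocess.run(["elixirc", path]).returncode == 0
        return 1.0 if ok else 0.0
    return None  # unknown extension: Tier 1's "not scored"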

Replace the SCAFFOLD_PATHS / NAMESPACE_MAPS dicts with a per-family
taxonomy/<lang>/<family>/evaluation/config.yaml.
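
A sketch of the data-driven lookup that would replace those dicts
(the path shape is from the line above; the loader is hypothetical):

from pathlib import Path
from typing import Optional
import yaml  # PyYAML

def load_family_config(lang: str, family: str) -> Optional[dict]:
    cfg = Path("taxonomy") / lang / family / "evaluation" / "config.yaml"
    if not cfg.is_file():
        return None  # not onboarded yet; fall back to Tier 1's "not scored"
    return yaml.safe_load(cfg.read_text())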

Tier 3 — Per-family onboarding contract. Every new family must
ship with scaffold/skld_bench/ + evaluation/config.yaml +
evaluation/score.py + evaluation/templates.yaml. Enforce in CI.
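
That contract is cheap to enforce; a hypothetical CI gate could be
as small as this (required paths taken from the contract above):

import sys
from pathlib import Path

REQUIRED = ("scaffold/skld_bench", "evaluation/config.yaml",
            "evaluation/score.py", "evaluation/templates.yaml")

# taxonomy/<lang>/<family>/ directories, per the Tier 2 layout.
missing = [f"{fam}: {rel}"
           for fam in Path("taxonomy").glob("*/*") if fam.is_dir()
           for rel in REQUIRED if not (fam / rel).exists()]

if missing:
    print("families missing onboarding artifacts:")
    print("\n".join(f"  {m}" for m in missing))
    sys.exit(1)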

Labels: documentation, enhancement
