Skip to content

Whole-skill outcome-eval harness + map-task body hardening + skill eval-sets#160

Merged
azalio merged 6 commits into
mainfrom
organ-pipe-pyrite
Jun 5, 2026
Merged

Whole-skill outcome-eval harness + map-task body hardening + skill eval-sets#160
azalio merged 6 commits into
mainfrom
organ-pipe-pyrite

Conversation

@azalio

@azalio azalio commented Jun 5, 2026

Copy link
Copy Markdown
Owner

What

Adds a whole-skill optimization capability — measuring and improving a skill's body (its SKILL.md logic), not just its trigger description: — and applies it to the pilot skill map-task. Approach B (harness measures → human edits), body-only mutation.

Harness + fixtures (reusable flow for any skill)

  • tests/skills_eval/whole_skill/spike_runner.py — seeds an isolated temp (.claude + .map/scripts + fixture repo, git init), runs claude -p "/map-task ST-001", and scores the outcome with a hybrid metric: deterministic gates (scope fidelity via git diff, task-pass via the fixture's tests) + a trace-cited LLM judge. QUALITY = gate·(0.5 + 0.5·judge). expected_outcome-aware (complete|blocked).
  • Fixtures under tests/skills_eval/fixtures/whole_skill/: a scope-trap and an impossible/blocker mini-repo (each a real git project + committed .map/ plan/blueprint).

Key empirical finding (2 fixtures, 12 runs, 2 llm-council consults)

Generic scope/blocker prose in a thin-orchestration body is low-leverage: Body-Good vs Body-Bad (rules stripped) scored identically 1.0 — those behaviors are enforced by the shared actor/monitor agents, not the body. The body's real lever is sequencing, context relay, retry/termination, and the final report contract. (To move scope/correctness you must optimize the shared agent prompts — out of this PR's body-only scope.)

map-task body hardening (body-owned surfaces) + regression proof

  • Formalized the Outcome Report (required fields) and added the previously-missing BLOCKED outcome report; explicit retry-exhaustion / impossible-in-scope termination (stop + report blocker, never fake-complete or expand scope).
  • Fixed real defects: dead "What this command CANNOT do" reference, placeholder example, awkward artifact section.
  • Regression-proved: QUALITY 1.0 on both fixtures ×3 after the edit (no outcome regression). Honest claim: cleaner/more-complete body, not a coding-quality gain.

Also

  • docs/whole-skill-optimization-{notes,flow}.md (method + the reusable flow for other skills) and docs/SKILL-EVAL.md (description-tuning engine guide) + USAGE pointer.
  • 13 description-optimize eval-sets for the remaining /map-* skills.
  • Tooling hygiene: exclude the intentionally-broken whole_skill fixture mini-repos from pytest/ruff/pyright/mypy.

Validation

make check: 2257 passed, 3 skipped; ruff/mypy/pyright clean; check-render byte-identical. Whole-skill measurements run via claude -p (subscription auth).

🤖 Generated with Claude Code

azalio and others added 6 commits June 5, 2026 03:32
…rdening + skill eval-sets

Whole-skill optimization (pilot: map-task), Approach B (measure -> human-edit), body-only:

- Harness: tests/skills_eval/whole_skill/spike_runner.py - seeds an isolated temp
  (.claude + .map/scripts + fixture repo, git init), runs `claude -p /map-task ST-001`
  with hybrid scoring (deterministic scope/task gates from git diff + a trace-cited LLM
  judge; QUALITY = gate*(0.5+0.5*judge)); expected_outcome-aware (complete|blocked).
- Fixtures: scope-trap (F2) + impossible/blocker (F3) under
  tests/skills_eval/fixtures/whole_skill/ (real mini-repos + committed .map plan/blueprint).
- Finding (2 fixtures, 12 runs, 2 llm-council consults): generic scope/blocker PROSE in a
  thin-orchestration body is LOW-LEVERAGE - body-good == body-bad == QUALITY 1.0, because the
  shared actor/monitor agents enforce those behaviors. The body's real lever is sequencing,
  context relay, retry/termination, and the final report contract.
- map-task body hardening (body-owned surfaces): formalized the Outcome Report (COMPLETE +
  a new BLOCKED schema with required fields), explicit retry-exhaustion/impossible-in-scope
  termination; fixed a dead reference, a placeholder example, and an awkward artifact section.
  Regression-proved: QUALITY 1.0 on F1+F3 (no outcome regression). Honest claim: cleaner/more
  complete body, no regression - not a coding-quality gain (that needs the shared agent prompts).
- Docs: docs/whole-skill-optimization-{notes,flow}.md (method + reusable flow for other skills);
  docs/SKILL-EVAL.md (description-tuning engine guide) + USAGE.md pointer.
- 13 description-optimize eval-sets for the remaining /map-* skills (tests/skills_eval/fixtures/).
- Tooling hygiene: exclude the whole_skill fixture mini-repos from pytest/ruff/pyright/mypy
  (they are intentionally-broken seeded repos).

make check: 2257 passed, 3 skipped; ruff/mypy/pyright clean; check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ow-leverage (body AND actor)

Extend the whole-skill harness with `--degrade {body,actor,monitor}` and add a strong
scope-pressure fixture (the obvious one-line fix is out-of-scope), then ablate the ACTOR prompt:

- spike_runner.py: `--degrade` targets which prompt the 'bad' variant degrades; `_degrade_actor`
  strips actor.md's Mutation Boundary section + the quick-ref NEVER-scope clause (seed-only);
  `_degrade_monitor` stub (best-effort).
- Fixture tests/skills_eval/fixtures/whole_skill/map_task_scope_pressure: RATE trap — changing the
  shared RATE in config.py is the tempting out-of-scope fix; the in-scope fix is a surcharge in utils.py.

Result (current actor vs --degrade actor, 3 runs each): BOTH kept scope perfectly (config.py never
touched; only utils.py edited). The QUALITY delta (0.80 vs 1.00) is judge NOISE (inverted; the judge
penalized the current actor for lacking verbose scope-reasoning prose despite perfect actual scope).

Consolidated across 3 ablations (body + actor, 18 runs): prose-level scope discipline is low-leverage;
scope is governed by the blueprint affected_files contract + base-model competence + the mechanical
mutation-boundary/test-gate/monitor. Methodology note recorded: for scope, trust the deterministic
gate — the scope_discipline judge dimension is verbosity-biased. Full log in
docs/whole-skill-optimization-notes.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on_boundary

Option B (verify/strengthen the mechanical scope lever). The mutation-boundary
validator is correct + warn-only by design (strict via MAP_STRICT_SCOPE), but its
tests only exercised committed/staged extra files. Add a regression test proving a
NEW out-of-scope file the actor creates but never `git add`s (porcelain '??') is
still flagged as unexpected/warning — the real-world scope-leak case.

Notes: documented the lever verification + the strengthening options (default-strict
vs warn->actor-feedback vs single-subtask-strict) for a policy decision.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…leaks (self-healing)

Option (ii) for strengthening the mechanical scope lever. Previously a non-strict
scope leak detected by validate_mutation_boundary only produced a warn-log and the
MONITOR gate passed silently (hard-block only under MAP_STRICT_SCOPE). Now, at
validate_step("2.4"), a `warning` routes back to the Actor as feedback the FIRST
time it is seen per subtask (valid=False + actionable "revert the out-of-scope
changes OR escalate for a contract update"), so the actor self-corrects in the
existing retry loop — without a hard block.

- StepState.scope_feedback_subtasks (persisted, to_dict/from_dict) bounds the nudge
  to ONCE per subtask, so a persistent false positive (affected_files drift) cannot
  burn the retry budget — after the single nudge the gate passes.
- Strict-mode (MAP_STRICT_SCOPE=1) hard-reject path is unchanged.
- Edited the templates_src .jinja source and re-rendered (.map/scripts + templates).
- Tests: test_warning_routes_feedback_to_actor_once (orchestrator, incl. once-guard
  pass-through) + test_warning_on_untracked_new_out_of_scope_file (validator).

make check: 2259 passed, 3 skipped; check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng an empty subtask)

Apply the warn->actor-feedback pattern to MONITOR correctness validation. The
unenforced gap: MONITOR closing a subtask that declares affected_files but changed
NOTHING (empty diff) — a subtask that "completes" having done nothing.

At validate_step("2.4"), reusing the validate_mutation_boundary report: if
`expected` (declared affected_files) is non-empty but `actual` (changed files) is
empty, route back to the Actor once (valid=False + "False-progress: implement the
change or report a blocker"), bounded by StepState.progress_feedback_subtasks so a
re-validate passes (no retry-burn). Reuses the existing scope git machinery; no new
diff logic.

- StepState.progress_feedback_subtasks (persisted, to_dict/from_dict).
- Test: test_false_progress_routes_feedback_when_nothing_changed (+ once-guard pass-through).
- Edited templates_src .jinja + re-rendered.

make check: 2260 passed, 3 skipped; check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
map-skill-eval documented only trigger-description tuning. Add a section that
directs anyone improving a skill's BODY/logic (outcome quality) to the worked,
reusable flow + harness instead of starting from scratch:
docs/whole-skill-optimization-flow.md (+ notes), tests/skills_eval/whole_skill/
spike_runner.py, and the key finding (prose scope/correctness discipline is
low-leverage for thin orchestrators; the levers are the affected_files contract +
mechanical validators; prose pays off for report format + trigger description).

Edited the .jinja source + re-rendered. make check: 2260 passed, check-render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@azalio azalio merged commit effd649 into main Jun 5, 2026
6 checks passed
azalio added a commit that referenced this pull request Jun 6, 2026
…at) (#161)

* fix(skill-eval): trigger-eval no longer mis-records executing skills as non-trigger

The description-optimizer dispatcher ran `claude -p <prompt>` to completion and
parsed the transcript for the first `Skill` tool_use. For EXECUTING skills this
corrupted the signal: a positive prompt for map-check runs the full `make check`
suite, and map-task/map-efficient dispatch sub-agents — each overran the 120s
per-call timeout, was treated as a transient failure, retried 3x, and finally
recorded as a FALSE non-trigger. Latent until now: PR #159 applied only map-plan
(fast text-only planner); the executing-skill eval-sets from PR #160 were never run.

Trigger detection only needs the first `Skill` tool_use (the activation decision),
which lands in the transcript early (~30s) — not the skill body executing. Fixes:

- `--disallowed-tools` (Bash/Edit/Write/NotebookEdit/Task/Agent/WebFetch/WebSearch)
  on the claude -p argv: the body cannot run slow/mutating/network work or spawn
  sub-agents (which would orphan children and burn quota after the parent is
  killed), while the skill still TRIGGERS. Read-only tools and `Skill` stay allowed.
- Timeout is now TERMINAL, not retried: a 90s overrun re-run behaves identically —
  retrying turned every executing-skill positive into ~3x wall-clock (267s observed).
- On timeout, recover the trigger from the transcript (located by cwd slug, since
  the kill yields no envelope/session_id) instead of a false non-trigger, with a
  settle-poll to defeat the flush/visibility race right at the kill instant.
- Robust cwd→project-dir slug (`_cwd_to_project_slug`, shared by both locators):
  replicates Claude Code's transform — replaces `/`, `.`, AND `_` (any non
  alnum/dash) with `-`. The old `replace("/","-").replace(".","-")` missed `_`,
  so any `tempfile.mkdtemp()` suffix containing `_` (e.g. `mapeval-s_u5zv32` →
  project dir `…-mapeval-s-u5zv32`) failed to locate — an intermittent false
  non-trigger that surfaced live in the sweep. Also handles macOS `/var`→
  `/private/var` (tries raw + resolved cwd) and a slugified name-glob fallback.
- Default per-call timeout 120s → 90s (well above the ~30s trigger latency).

Validated live: map-check 75s (natural completion, triggers), map-efficient 90s
(single attempt, trigger recovered — was 267s/3-attempts), negatives ~31s non-trigger.
11 regression tests incl. the symlink-slug and underscore-slug cases.
make check green (2271 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(skill-eval): optimizer corrupted block-scalar SKILL.md descriptions → skill never triggered

`_set_frontmatter_description` replaced only the single line starting with
`description:`, leaving any YAML block-scalar body orphaned. For a skill whose
frontmatter declares:

    description: |
      Run quality gates ... use map-plan or map-efficient.

patching produced:

    description: "Run quality gates ... map-efficient.\n"
      Run quality gates ... map-efficient.    ← orphaned, indented, now invalid

That is malformed YAML, so Claude Code failed to register the skill in the seeded
candidate tree — it never auto-triggered, and EVERY eval cell (including the iter-0
baseline) recorded a false non-trigger. The optimizer would then "improve" a
description that was actually fine, against an all-zero baseline. It escaped
notice because PR #159 applied only map-plan (single-line description); all other
map-* skills use `description: |` block scalars and were silently broken.

Fix: when the description value is a block scalar (`|`/`>`) or any indented
multi-line value, also consume the continuation lines (indented deeper than the
key) before substituting the single double-quoted replacement line. Preserves the
key's indentation and every sibling key/body line.

Validated: candidate-seeding path now triggers map-check (was None). Two regression
tests (block `|` and folded `>`) assert the patched frontmatter re-parses as valid
YAML with exactly the new description and no orphaned body. make check green (2273).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(skill-eval): --model/--runs flags + model-tier trigger experiment

Adds `--model <alias>` and `--runs N` to `mapify skill-eval run` (and a `model`
param on ClaudeSubprocessDispatcher). `--model` pins the claude -p model tier
(haiku/sonnet/opus); default omits the flag → CLI session model (unchanged
behaviour). `--runs` averages out single-pass noise when comparing models.

Motivated by Murin 2026 (arXiv:2606.05970): "model choice dominates prompt
phrasing; a larger model redistributes agreement rather than uniformly raising
it." Tested the analog for skill trigger ROUTING (map-check, n=9/model):

  haiku 6/9 (67%, 26s)  ·  sonnet 7/9 (78%, 51s)  ·  opus 6/9 (67%, 51s)

Finding: model tier does NOT reliably improve trigger routing — Opus ties Haiku
and trails Sonnet; the spread is within noise and cell-by-cell it's
redistribution, not improvement (exactly Murin's pattern). One positive prompt
was missed by ALL tiers — a description gap no model size fixes. The lever for
trigger accuracy is the `description:` (the optimizer sweep), not the model;
execution-quality tiering is a separate, untested question. Full write-up in
docs/model-tier-trigger-experiment.md. 2 regression tests for the new flags.
make check green (2274 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(skill-eval): firm-up model-tier results revise the pilot conclusion

3 passes on map-check (n=27) + map-explain + map-task (n=9 each). The pilot's
single-pass n=9 (haiku 67% ≈ opus 67%) was noise. Authoritative matrix:

  skill         haiku    sonnet   opus
  map-check     59%      89%      78%
  map-explain   33%      33%      44%
  map-task      33%      67%      78%
  overall       49%      73%      71%

Revised findings: (1) model tier DOES matter — Haiku is consistently weakest
(−24pp vs Sonnet), not the free lunch the pilot suggested; (2) "bigger is better"
does NOT hold — Opus (71%) ties/trails Sonnet (73%) and they redistribute across
skills (Murin's exact pattern), so Opus buys nothing over Sonnet for routing;
(3) the description is the ceiling — map-explain caps at 33-44% across all tiers.
Recommendation for routing: Sonnet (sweet spot); the lever is the description
(optimizer sweep), not model size.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(whole-skill): model-lever knobs + a discriminating outcome fixture

Foundation for optimizing skill OUTCOME quality (not trigger descriptions),
covering the model lever and a fixture where quality actually varies.

spike_runner.py:
- --agent-model AGENT=MODEL (repeatable) rewrites a seeded agent's `model:`
  frontmatter — the precise EXECUTION-quality lever (the actor writes the code,
  so its model is the code-quality lever; sub-agents use their own model:, not
  the orchestrator's --model).
- --orchestrator-model passes --model to the top-level claude -p running the body.

Fixture map_task_semver:
- Existing whole_skill fixtures (scope/blocker traps) all score QUALITY=1.0 — they
  can't discriminate code quality. This one is genuinely non-trivial: semver 2.0.0
  compare() with pre-release precedence, numeric-vs-lexical identifier comparison,
  and build-metadata handling. A naive implementation passes the easy tests but
  fails the edge cases, so the test gate (task_pass) and QUALITY VARY with the lever.
- Validated: stub → 12 failed; a correct reference impl → 12 passed; orchestrator
  accepts the hand-crafted plan+blueprint (resume_single_subtask ST-001 → RESEARCH).

make check green (2274 passed); fixture excluded from gate per whole_skill rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): model lever on OUTCOME is absorbed by the gate (semver fixture)

Sweep on the discriminating map_task_semver fixture, full /map-task with
--agent-model actor=haiku|sonnet|opus (×2): all 6 runs task_pass=True (correct
semver). The one QUALITY=0.5 is a judge API-529, not a code failure. The test
gate + MONITOR retry loop drive even haiku to passing, and haiku is not slower.

Conclusion (execution-data-backed): outcome levers rank CONTRACT/GATES (#3) ≫
MODEL (#2) ≫ PROSE (#1). Invest in validation_criteria / validators / test
coverage to lift skill outcome quality — not actor model or prose. Model may
still matter on weakly-gated tasks (untested); QUALITY saturates on gated tasks
so a finer signal needs retry_count logging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(whole-skill): capture serial retry_count in the outcome harness

On a well-gated task QUALITY saturates (every model tier passes the test gate),
so the model effect hides in HOW MANY actor retries the MONITOR loop needed.
spike_runner now reads retry_count / clean_retry_count / contaminated_retry_count
from the run's step_state.json (before temp cleanup) and records + prints them,
so "passes, but only after more iterations" is visible in future model/prompt
sweeps. Addresses the method caveat from the semver model sweep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(whole-skill): weakly-gated fixture + hidden-suite scoring (model-effect probe)

To measure the model effect that a strong gate hides, add a WEAKLY-gated fixture
and post-run hidden scoring:

- spike_runner: run_hidden_tests() injects a comprehensive suite the workflow
  never saw (manifest hidden_test_src→dest + hidden_test_cmd) and records
  hidden_pass / n_passed / n_failed — the deterministic 'true code quality'
  signal, no judge noise.
- Fixture map_task_semver_weakgate: the workflow sees ONLY a thin 3-test gate
  (which a naive impl passes); the hidden 8-edge-case suite scores what the gate
  did not enforce.

Validated end-to-end (no quota): STUB → weak-gate fail; NAIVE → weak-gate PASS
but hidden 2p/6f; CORRECT → both pass. So hidden_pass discriminates code quality
that the weak gate misses — the probe for 'does the actor model matter when the
gate can't rescue a weak implementation'. make check green (2274).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(whole-skill): vague-contract fixture to isolate the model lever

map_task_semver_vague = weak gate + VAGUE aag_contract (only "compare two semver
strings, return -1/0/1" — the precedence rules are NOT spelled out), so the actor
must rely on its own semver knowledge. Reuses the weak basic gate + hidden
8-edge-case suite. This is the one configuration where the actor model COULD
matter for outcome (vague spec the gate can't rescue); the prior fixtures kept the
contract complete and found model-insensitivity. Orchestrator-accepted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): outcome is invariant to model/gate/contract within competence

Two more outcome configs (weak gate; weak gate + vague contract), hidden
8-edge-case suite, actor=haiku|sonnet|opus ×2 — all 18 runs produced correct
semver (retries=0). Even a vague contract + weak gate + haiku gets it right
first try (semver is within every tier's competence).

Definitive outcome finding: for a task within model competence, outcome quality
is invariant to model tier, test-gate strength, and contract completeness. The
execution-model lever only bites ABOVE the weak model's competence (threshold not
reached here). Evidence-based outcome-lever ranking: contract/gates ≫ model
(only beyond-competence) ≫ prose. Invest in contract + mechanical gates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(whole-skill): capability-discriminating calc fixture (expression evaluator)

Web research (EvalPlus / HumanEval-Pro / arXiv 2511.04355, 2510.13908) shows the
model-capability gap appears on EDGE-CASE-sensitive code, not isolated functions
like semver. The canonical discriminator is an arithmetic expression evaluator —
three traps that violate left-to-right intuition: right-associative ** (2**3**2),
** binding tighter than unary minus (-2**2 == -4 in Python), overloaded unary
minus, negative exponents, error contract.

map_task_calc: weak gate = 3 trivial expressions (a naive left-assoc evaluator
passes); hidden 13-case suite probes the traps. Validated: STUB → gate fail;
NAIVE (left-assoc **, tight unary) → weak gate PASS but hidden 9p/3f; CORRECT →
12p/0f. Contract states the rules (right-assoc, ** tighter than unary) WITHOUT
leaking worked answers, so it tests implementation competence. Probe for the
execution-model threshold the semver fixtures were too easy to reach.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(whole-skill): vague-contract calc fixture (decisive model-threshold probe)

map_task_calc_vague = the calc evaluator with a VAGUE contract: it says only
"standard precedence" and does NOT spell out right-associative ** or unary rules.
So passing 2**3**2==512 requires the model to KNOW right-associativity from
training, not read it from the spec. Hidden suite keeps only language-unambiguous
cases (drops -2**2). Validated: a naive left-assoc impl passes the weak gate but
fails hidden (9p/2f). If haiku fails these while sonnet/opus pass, the execution-
model competence threshold is finally located; if haiku still passes, haiku 4.5
is genuinely competent on self-contained subtasks and the model is not the lever.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): capability probe — haiku matches bigger models on subtasks

Built the canonical edge-case discriminator (arithmetic expression evaluator:
right-assoc **, ** tighter than unary minus, negative exponent, errors) per web
research. Sweeps: FULL-contract calc → haiku 12/12; VAGUE-contract calc (rules
NOT stated) → haiku 11/11,11/11, sonnet 10/11,11/11, opus 11/11,11/11. Even on
the edge-case-hard task with a vague contract, haiku 4.5 matched/edged the bigger
models (the one miss was sonnet's — noise). No bigger-is-better ordering.

Why: the tier gap lives on multi-step/multi-file/concurrency tasks, not single
self-contained functions — and map-task decomposes into well-scoped one-file
subtasks, the regime where haiku is competent. Within map-task's granularity the
model is not the outcome lever (actor can run on haiku — cost win). Final ranking:
contract/gates ≫ model (only beyond-competence) ≫ prose.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(whole-skill): acceptEdits in outcome harness + use orchestrator-model

Two harness corrections after verifying transcripts (council-flagged confounds):

1. The real model lever is the ORCHESTRATOR model: in headless `claude -p`,
   map-task runs INLINE on the session model and does NOT spawn the actor/monitor
   sub-agents (0 Task dispatches in the transcript), so the earlier
   `--agent-model actor=...` override was a no-op — every "tier" actually ran on
   claude-opus-4-8. Use `--orchestrator-model` (= claude -p --model) to vary the
   model that does the work.

2. Add `--permission-mode acceptEdits` to the run: in headless default mode,
   file-edit permission prompts auto-deny, and weaker/less-agentic models stall
   ("I need permission to edit") — observed haiku hitting 4 perm-denials and
   giving up (0/11) while opus wrote freely (and even dispatched sub-agents).
   That is a permission/agency artifact, not code quality. With acceptEdits,
   haiku writes its code and passes 10/11 — confirming the gap on a well-scoped
   subtask is small. NOT a full bypass: only edits are auto-accepted; the temp is
   an isolated throwaway.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): CORRECTED model-outcome result (two confounds removed)

Transcript verification (prompted by llm-council) exposed two confounds that
invalidated the earlier "model doesn't matter for outcome": (1) --agent-model was
a no-op (headless map-task runs inline on the orchestrator/session model, 0
sub-agent dispatches — all sweeps were opus-4-8); (2) permission stalls made haiku
return 0/11 despite knowing the algorithm. Fixes: --orchestrator-model +
--permission-mode acceptEdits.

CLEAN result (calc_vague, hidden /11, n=4): haiku 9.8/11 (variable), sonnet
10.5/11, opus 10.7/11. Modest but consistent gradient opus ≥ sonnet > haiku;
haiku more variable on hard edge cases (~1 test / ~8% gap on a single subtask),
plus a separate agency gap. Earlier "model irrelevant for outcome" RETRACTED (it
was opus measured 3x). Methodology lesson: verify the manipulation reached the
transcript before trusting an agentic-harness result.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): DECISIVE — actor model matters for code (direct-invocation method)

Cleanest measurement (read the agent prompt, call claude -p with that prompt +
task + chosen model; no orchestrator/dispatch). calc_vague, hidden /11, n=3:
haiku 8/10/10 (mean 9.3, never perfect, drops edge cases), sonnet 11/11/11,
opus 11/11 (+1 broken run). The ACTOR model materially affects code quality —
haiku systematically worse; use sonnet+ for actor. The two confounds (no-op
--agent-model, permission-stall) had hidden this; "haiku ≈ opus" RETRACTED for
the actor role.

Probes also showed: headless skills don't auto-dispatch sub-agents (run inline),
and even forced dispatches run on the session model — so per-agent model: is inert
in headless. Architectural fix: orchestrate agents by direct claude -p
--append-system-prompt <agent.md> --model <tier> so per-agent models actually take
effect. Then which-agent/model-per-skill is deterministic. Prototype: .map/actor_probe.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(actor): cut 247 lines of worked examples (A/B-validated, -23%)

The actor prompt carried a <Actor_Reference_Examples> block — 4 full worked
implementations (~247 lines, ~22% of the prompt). The output structure is already
specified earlier ("Required Output Structure"), so the examples were illustrative
bloat loaded on every actor invocation.

A/B test (direct actor-prompt invocation, calc_vague hidden /11, n=4):
  A uncut (1095L): haiku 9.3, sonnet 9.5
  B cut   (846L):  haiku 10.0, sonnet 11.0
B >= A on both models — the cut does not regress (if anything helps), at -23%
prompt size / ~7.7KB saved per invocation. Validated via .map/actor_probe.py
(renders the agent template, runs it as the system prompt with a chosen model,
scores against a hidden edge-case suite). make check green (2274), render
byte-identical across all trees.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): A/B agent polish — actor examples cuttable, monitor's are not

Built per-agent A/B harnesses (render agent template -> claude -p
--append-system-prompt --model -> score). Actor: cut 247-line examples block,
A/B B>=A (haiku 9.3->10.0, sonnet 9.5->11.0), kept (d78acd5). Monitor: cut 134
lines of good/bad examples, A/B showed recall preserved (4/4) but clean-pass on
identical correct code A=6/6 vs B=4/6 -> reverted. The monitor's examples
calibrate its accept threshold (gate agent => false-positive calibration matters).
Lesson: "cut prompt examples" is safe for a generative agent, NOT for a gate agent;
each agent needs its own A/B gate, and harness coverage bounds what's cuttable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(synthesizer): cut 95 lines of Step-7 strategy worked-examples (A/B-validated)

The synthesizer's Step 7 carried two `**Example**:` worked-code blocks
(base_enhance / fresh_generation, ~95 lines) illustrating strategies whose
numbered-step LOGIC is already stated. A capable model applies the steps without
the example code.

A/B (built .map/synth_probe.py: feed 3 variant calc solutions — 1 correct + 2
buggy — the synthesizer must extract correct decisions and WRITE unified code,
scored by the calc hidden suite, n=3):
  A (HEAD, 1161L): 33/33 hidden  ·  B (cut, 1066L): 33/33 hidden  -> B == A
Non-regressive, -95 lines (-8%). Confirms the rule from the actor cut: pure
worked-examples are cuttable in GENERATIVE agents; guidance/patterns and
gate-calibration examples are NOT (monitor + actor-patterns cuts were A/B-rejected
and reverted). make check green (2274), render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): agent-polish validated rule (2 wins, 2 rejects) + coverage limit

Worked-examples are cuttable in generative agents (actor -247L, synthesizer -95L;
both A/B B>=A) but NOT in gate agents or for guidance/patterns (monitor, actor
error-patterns; both A/B-rejected, reverted). Remaining generative agents
(task-decomposer/predictor/reflector/research-agent) output fuzzy artifacts with
no clean deterministic gate, so further cuts can't satisfy the A/B rule without
building lower-confidence gates first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(task-decomposer): cut 151-line REFERENCE EXAMPLES (A/B-validated, -14%)

The decomposer carried a "## REFERENCE EXAMPLES" section (Example A: a full 139-line
worked CRUD blueprint) — pure illustration; the JSON Schema definition (separate
section) stays. A capable model produces a valid blueprint from the schema + 5-phase
process without the worked example.

A/B (built .map/decomp_probe.py: feed a known 5-part feature, score blueprint on
schema-validity + requirement-coverage + acyclic deps + sane count; opus, n=3):
  A (HEAD, 1078L): 3/3 PASS (coverage 5/5)  ·  B (cut, 927L): 3/3 PASS (coverage 5/5)
B == A, non-regressive, -151 lines (-14%). Third generative-agent worked-example cut
to validate (after actor -247L, synthesizer -95L); gate-calibration/guidance cuts
remain A/B-rejected (monitor, actor-patterns). make check green, render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(predictor): cut 417-line <examples> block (A/B-validated, -21%)

The predictor carried an <examples> block (Examples 1-3: full worked impact
analyses, ~417 lines incl. 107L/91L/46L JSON outputs) — illustration; the JSON
Schema, risk rubric, and "Good vs Bad Predictions" guidance stay.

A/B (built .map/pred_probe.py: feed a known breaking + high-blast-radius signature
change across 8 callers; gate on the deterministic core — breaking_changes detected
AND affected components identified; sonnet, n=3):
  A (HEAD, 2003L): 3/3 PASS  ·  B (cut, 1586L): 3/3 PASS
B == A, non-regressive, -417 lines (-21%, the largest cut). Fourth generative-agent
worked-example cut validated (actor -247, synthesizer -95, decomposer -151, predictor
-417 = 910 lines total). Gate covers breaking+affected detection (predictor's core);
risk-level (subjective) was rated identically by both versions. make check green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(reflector): cut 139-line COMPLETE EXAMPLES (A/B-validated)

The reflector's "# COMPLETE EXAMPLES" section (4 worked lesson-extraction examples,
139 lines) is illustration; the lesson-format rules + decision frameworks stay.

A/B (built .map/reflect_probe.py: feed a clear SQL-injection failure outcome; gate on
whether the reflector extracts a lesson naming the root-cause concept
injection/parameterization; sonnet, n=3):
  A (HEAD): 3/3 PASS  ·  B (cut): 3/3 PASS
B == A, non-regressive, -139 lines. Fifth generative-agent worked-example cut.
TOTAL across 5 generative agents: actor -247, synthesizer -95, task-decomposer -151,
predictor -417, reflector -139 = 1049 lines of worked-example bloat removed, all
A/B-validated B>=A. Gate agents (monitor etc.) NOT cut — examples calibrate the
accept threshold (monitor A/B-rejected). (reflector gate is keyword-presence, fuzzier
than the code-output gates.) make check green, render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): comprehensive agent-polish tally (5 wins, 1049 lines, A/B-validated)

Swept all generative agents for worked-example/REFERENCE blocks, A/B-gated each:
KEPT (B>=A): actor -247, synthesizer -95, task-decomposer -151, predictor -417,
reflector -139 = 1049 lines. REJECTED (B<A, reverted): monitor examples (calibrate
accept threshold), actor error-patterns/decision-tree (scaffold weaker models). Gate
agents untouched (examples calibrate); research-agent lean. Final rule + reusable
harnesses (.map/*_probe.py) documented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant