Whole-skill outcome-eval harness + map-task body hardening + skill eval-sets by azalio · Pull Request #160 · azalio/map-framework

azalio · 2026-06-05T00:33:31Z

What

Adds a whole-skill optimization capability — measuring and improving a skill's body (its SKILL.md logic), not just its trigger description: — and applies it to the pilot skill map-task. Approach B (harness measures → human edits), body-only mutation.

Harness + fixtures (reusable flow for any skill)

tests/skills_eval/whole_skill/spike_runner.py — seeds an isolated temp (.claude + .map/scripts + fixture repo, git init), runs claude -p "/map-task ST-001", and scores the outcome with a hybrid metric: deterministic gates (scope fidelity via git diff, task-pass via the fixture's tests) + a trace-cited LLM judge. QUALITY = gate·(0.5 + 0.5·judge). expected_outcome-aware (complete|blocked).
Fixtures under tests/skills_eval/fixtures/whole_skill/: a scope-trap and an impossible/blocker mini-repo (each a real git project + committed .map/ plan/blueprint).

Key empirical finding (2 fixtures, 12 runs, 2 llm-council consults)

Generic scope/blocker prose in a thin-orchestration body is low-leverage: Body-Good vs Body-Bad (rules stripped) scored identically 1.0 — those behaviors are enforced by the shared actor/monitor agents, not the body. The body's real lever is sequencing, context relay, retry/termination, and the final report contract. (To move scope/correctness you must optimize the shared agent prompts — out of this PR's body-only scope.)

map-task body hardening (body-owned surfaces) + regression proof

Formalized the Outcome Report (required fields) and added the previously-missing BLOCKED outcome report; explicit retry-exhaustion / impossible-in-scope termination (stop + report blocker, never fake-complete or expand scope).
Fixed real defects: dead "What this command CANNOT do" reference, placeholder example, awkward artifact section.
Regression-proved: QUALITY 1.0 on both fixtures ×3 after the edit (no outcome regression). Honest claim: cleaner/more-complete body, not a coding-quality gain.

Also

docs/whole-skill-optimization-{notes,flow}.md (method + the reusable flow for other skills) and docs/SKILL-EVAL.md (description-tuning engine guide) + USAGE pointer.
13 description-optimize eval-sets for the remaining /map-* skills.
Tooling hygiene: exclude the intentionally-broken whole_skill fixture mini-repos from pytest/ruff/pyright/mypy.

Validation

make check: 2257 passed, 3 skipped; ruff/mypy/pyright clean; check-render byte-identical. Whole-skill measurements run via claude -p (subscription auth).

🤖 Generated with Claude Code

…rdening + skill eval-sets Whole-skill optimization (pilot: map-task), Approach B (measure -> human-edit), body-only: - Harness: tests/skills_eval/whole_skill/spike_runner.py - seeds an isolated temp (.claude + .map/scripts + fixture repo, git init), runs `claude -p /map-task ST-001` with hybrid scoring (deterministic scope/task gates from git diff + a trace-cited LLM judge; QUALITY = gate*(0.5+0.5*judge)); expected_outcome-aware (complete|blocked). - Fixtures: scope-trap (F2) + impossible/blocker (F3) under tests/skills_eval/fixtures/whole_skill/ (real mini-repos + committed .map plan/blueprint). - Finding (2 fixtures, 12 runs, 2 llm-council consults): generic scope/blocker PROSE in a thin-orchestration body is LOW-LEVERAGE - body-good == body-bad == QUALITY 1.0, because the shared actor/monitor agents enforce those behaviors. The body's real lever is sequencing, context relay, retry/termination, and the final report contract. - map-task body hardening (body-owned surfaces): formalized the Outcome Report (COMPLETE + a new BLOCKED schema with required fields), explicit retry-exhaustion/impossible-in-scope termination; fixed a dead reference, a placeholder example, and an awkward artifact section. Regression-proved: QUALITY 1.0 on F1+F3 (no outcome regression). Honest claim: cleaner/more complete body, no regression - not a coding-quality gain (that needs the shared agent prompts). - Docs: docs/whole-skill-optimization-{notes,flow}.md (method + reusable flow for other skills); docs/SKILL-EVAL.md (description-tuning engine guide) + USAGE.md pointer. - 13 description-optimize eval-sets for the remaining /map-* skills (tests/skills_eval/fixtures/). - Tooling hygiene: exclude the whole_skill fixture mini-repos from pytest/ruff/pyright/mypy (they are intentionally-broken seeded repos). make check: 2257 passed, 3 skipped; ruff/mypy/pyright clean; check-render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ow-leverage (body AND actor) Extend the whole-skill harness with `--degrade {body,actor,monitor}` and add a strong scope-pressure fixture (the obvious one-line fix is out-of-scope), then ablate the ACTOR prompt: - spike_runner.py: `--degrade` targets which prompt the 'bad' variant degrades; `_degrade_actor` strips actor.md's Mutation Boundary section + the quick-ref NEVER-scope clause (seed-only); `_degrade_monitor` stub (best-effort). - Fixture tests/skills_eval/fixtures/whole_skill/map_task_scope_pressure: RATE trap — changing the shared RATE in config.py is the tempting out-of-scope fix; the in-scope fix is a surcharge in utils.py. Result (current actor vs --degrade actor, 3 runs each): BOTH kept scope perfectly (config.py never touched; only utils.py edited). The QUALITY delta (0.80 vs 1.00) is judge NOISE (inverted; the judge penalized the current actor for lacking verbose scope-reasoning prose despite perfect actual scope). Consolidated across 3 ablations (body + actor, 18 runs): prose-level scope discipline is low-leverage; scope is governed by the blueprint affected_files contract + base-model competence + the mechanical mutation-boundary/test-gate/monitor. Methodology note recorded: for scope, trust the deterministic gate — the scope_discipline judge dimension is verbosity-biased. Full log in docs/whole-skill-optimization-notes.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…on_boundary Option B (verify/strengthen the mechanical scope lever). The mutation-boundary validator is correct + warn-only by design (strict via MAP_STRICT_SCOPE), but its tests only exercised committed/staged extra files. Add a regression test proving a NEW out-of-scope file the actor creates but never `git add`s (porcelain '??') is still flagged as unexpected/warning — the real-world scope-leak case. Notes: documented the lever verification + the strengthening options (default-strict vs warn->actor-feedback vs single-subtask-strict) for a policy decision. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…leaks (self-healing) Option (ii) for strengthening the mechanical scope lever. Previously a non-strict scope leak detected by validate_mutation_boundary only produced a warn-log and the MONITOR gate passed silently (hard-block only under MAP_STRICT_SCOPE). Now, at validate_step("2.4"), a `warning` routes back to the Actor as feedback the FIRST time it is seen per subtask (valid=False + actionable "revert the out-of-scope changes OR escalate for a contract update"), so the actor self-corrects in the existing retry loop — without a hard block. - StepState.scope_feedback_subtasks (persisted, to_dict/from_dict) bounds the nudge to ONCE per subtask, so a persistent false positive (affected_files drift) cannot burn the retry budget — after the single nudge the gate passes. - Strict-mode (MAP_STRICT_SCOPE=1) hard-reject path is unchanged. - Edited the templates_src .jinja source and re-rendered (.map/scripts + templates). - Tests: test_warning_routes_feedback_to_actor_once (orchestrator, incl. once-guard pass-through) + test_warning_on_untracked_new_out_of_scope_file (validator). make check: 2259 passed, 3 skipped; check-render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ng an empty subtask) Apply the warn->actor-feedback pattern to MONITOR correctness validation. The unenforced gap: MONITOR closing a subtask that declares affected_files but changed NOTHING (empty diff) — a subtask that "completes" having done nothing. At validate_step("2.4"), reusing the validate_mutation_boundary report: if `expected` (declared affected_files) is non-empty but `actual` (changed files) is empty, route back to the Actor once (valid=False + "False-progress: implement the change or report a blocker"), bounded by StepState.progress_feedback_subtasks so a re-validate passes (no retry-burn). Reuses the existing scope git machinery; no new diff logic. - StepState.progress_feedback_subtasks (persisted, to_dict/from_dict). - Test: test_false_progress_routes_feedback_when_nothing_changed (+ once-guard pass-through). - Edited templates_src .jinja + re-rendered. make check: 2260 passed, 3 skipped; check-render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

map-skill-eval documented only trigger-description tuning. Add a section that directs anyone improving a skill's BODY/logic (outcome quality) to the worked, reusable flow + harness instead of starting from scratch: docs/whole-skill-optimization-flow.md (+ notes), tests/skills_eval/whole_skill/ spike_runner.py, and the key finding (prose scope/correctness discipline is low-leverage for thin orchestrators; the levers are the affected_files contract + mechanical validators; prose pays off for report format + trigger description). Edited the .jinja source + re-rendered. make check: 2260 passed, check-render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…at) (#161) * fix(skill-eval): trigger-eval no longer mis-records executing skills as non-trigger The description-optimizer dispatcher ran `claude -p <prompt>` to completion and parsed the transcript for the first `Skill` tool_use. For EXECUTING skills this corrupted the signal: a positive prompt for map-check runs the full `make check` suite, and map-task/map-efficient dispatch sub-agents — each overran the 120s per-call timeout, was treated as a transient failure, retried 3x, and finally recorded as a FALSE non-trigger. Latent until now: PR #159 applied only map-plan (fast text-only planner); the executing-skill eval-sets from PR #160 were never run. Trigger detection only needs the first `Skill` tool_use (the activation decision), which lands in the transcript early (~30s) — not the skill body executing. Fixes: - `--disallowed-tools` (Bash/Edit/Write/NotebookEdit/Task/Agent/WebFetch/WebSearch) on the claude -p argv: the body cannot run slow/mutating/network work or spawn sub-agents (which would orphan children and burn quota after the parent is killed), while the skill still TRIGGERS. Read-only tools and `Skill` stay allowed. - Timeout is now TERMINAL, not retried: a 90s overrun re-run behaves identically — retrying turned every executing-skill positive into ~3x wall-clock (267s observed). - On timeout, recover the trigger from the transcript (located by cwd slug, since the kill yields no envelope/session_id) instead of a false non-trigger, with a settle-poll to defeat the flush/visibility race right at the kill instant. - Robust cwd→project-dir slug (`_cwd_to_project_slug`, shared by both locators): replicates Claude Code's transform — replaces `/`, `.`, AND `_` (any non alnum/dash) with `-`. The old `replace("/","-").replace(".","-")` missed `_`, so any `tempfile.mkdtemp()` suffix containing `_` (e.g. `mapeval-s_u5zv32` → project dir `…-mapeval-s-u5zv32`) failed to locate — an intermittent false non-trigger that surfaced live in the sweep. Also handles macOS `/var`→ `/private/var` (tries raw + resolved cwd) and a slugified name-glob fallback. - Default per-call timeout 120s → 90s (well above the ~30s trigger latency). Validated live: map-check 75s (natural completion, triggers), map-efficient 90s (single attempt, trigger recovered — was 267s/3-attempts), negatives ~31s non-trigger. 11 regression tests incl. the symlink-slug and underscore-slug cases. make check green (2271 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(skill-eval): optimizer corrupted block-scalar SKILL.md descriptions → skill never triggered `_set_frontmatter_description` replaced only the single line starting with `description:`, leaving any YAML block-scalar body orphaned. For a skill whose frontmatter declares: description: | Run quality gates ... use map-plan or map-efficient. patching produced: description: "Run quality gates ... map-efficient.\n" Run quality gates ... map-efficient. ← orphaned, indented, now invalid That is malformed YAML, so Claude Code failed to register the skill in the seeded candidate tree — it never auto-triggered, and EVERY eval cell (including the iter-0 baseline) recorded a false non-trigger. The optimizer would then "improve" a description that was actually fine, against an all-zero baseline. It escaped notice because PR #159 applied only map-plan (single-line description); all other map-* skills use `description: |` block scalars and were silently broken. Fix: when the description value is a block scalar (`|`/`>`) or any indented multi-line value, also consume the continuation lines (indented deeper than the key) before substituting the single double-quoted replacement line. Preserves the key's indentation and every sibling key/body line. Validated: candidate-seeding path now triggers map-check (was None). Two regression tests (block `|` and folded `>`) assert the patched frontmatter re-parses as valid YAML with exactly the new description and no orphaned body. make check green (2273). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(skill-eval): --model/--runs flags + model-tier trigger experiment Adds `--model <alias>` and `--runs N` to `mapify skill-eval run` (and a `model` param on ClaudeSubprocessDispatcher). `--model` pins the claude -p model tier (haiku/sonnet/opus); default omits the flag → CLI session model (unchanged behaviour). `--runs` averages out single-pass noise when comparing models. Motivated by Murin 2026 (arXiv:2606.05970): "model choice dominates prompt phrasing; a larger model redistributes agreement rather than uniformly raising it." Tested the analog for skill trigger ROUTING (map-check, n=9/model): haiku 6/9 (67%, 26s) · sonnet 7/9 (78%, 51s) · opus 6/9 (67%, 51s) Finding: model tier does NOT reliably improve trigger routing — Opus ties Haiku and trails Sonnet; the spread is within noise and cell-by-cell it's redistribution, not improvement (exactly Murin's pattern). One positive prompt was missed by ALL tiers — a description gap no model size fixes. The lever for trigger accuracy is the `description:` (the optimizer sweep), not the model; execution-quality tiering is a separate, untested question. Full write-up in docs/model-tier-trigger-experiment.md. 2 regression tests for the new flags. make check green (2274 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(skill-eval): firm-up model-tier results revise the pilot conclusion 3 passes on map-check (n=27) + map-explain + map-task (n=9 each). The pilot's single-pass n=9 (haiku 67% ≈ opus 67%) was noise. Authoritative matrix: skill haiku sonnet opus map-check 59% 89% 78% map-explain 33% 33% 44% map-task 33% 67% 78% overall 49% 73% 71% Revised findings: (1) model tier DOES matter — Haiku is consistently weakest (−24pp vs Sonnet), not the free lunch the pilot suggested; (2) "bigger is better" does NOT hold — Opus (71%) ties/trails Sonnet (73%) and they redistribute across skills (Murin's exact pattern), so Opus buys nothing over Sonnet for routing; (3) the description is the ceiling — map-explain caps at 33-44% across all tiers. Recommendation for routing: Sonnet (sweet spot); the lever is the description (optimizer sweep), not model size. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(whole-skill): model-lever knobs + a discriminating outcome fixture Foundation for optimizing skill OUTCOME quality (not trigger descriptions), covering the model lever and a fixture where quality actually varies. spike_runner.py: - --agent-model AGENT=MODEL (repeatable) rewrites a seeded agent's `model:` frontmatter — the precise EXECUTION-quality lever (the actor writes the code, so its model is the code-quality lever; sub-agents use their own model:, not the orchestrator's --model). - --orchestrator-model passes --model to the top-level claude -p running the body. Fixture map_task_semver: - Existing whole_skill fixtures (scope/blocker traps) all score QUALITY=1.0 — they can't discriminate code quality. This one is genuinely non-trivial: semver 2.0.0 compare() with pre-release precedence, numeric-vs-lexical identifier comparison, and build-metadata handling. A naive implementation passes the easy tests but fails the edge cases, so the test gate (task_pass) and QUALITY VARY with the lever. - Validated: stub → 12 failed; a correct reference impl → 12 passed; orchestrator accepts the hand-crafted plan+blueprint (resume_single_subtask ST-001 → RESEARCH). make check green (2274 passed); fixture excluded from gate per whole_skill rules. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): model lever on OUTCOME is absorbed by the gate (semver fixture) Sweep on the discriminating map_task_semver fixture, full /map-task with --agent-model actor=haiku|sonnet|opus (×2): all 6 runs task_pass=True (correct semver). The one QUALITY=0.5 is a judge API-529, not a code failure. The test gate + MONITOR retry loop drive even haiku to passing, and haiku is not slower. Conclusion (execution-data-backed): outcome levers rank CONTRACT/GATES (#3) ≫ MODEL (#2) ≫ PROSE (#1). Invest in validation_criteria / validators / test coverage to lift skill outcome quality — not actor model or prose. Model may still matter on weakly-gated tasks (untested); QUALITY saturates on gated tasks so a finer signal needs retry_count logging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(whole-skill): capture serial retry_count in the outcome harness On a well-gated task QUALITY saturates (every model tier passes the test gate), so the model effect hides in HOW MANY actor retries the MONITOR loop needed. spike_runner now reads retry_count / clean_retry_count / contaminated_retry_count from the run's step_state.json (before temp cleanup) and records + prints them, so "passes, but only after more iterations" is visible in future model/prompt sweeps. Addresses the method caveat from the semver model sweep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(whole-skill): weakly-gated fixture + hidden-suite scoring (model-effect probe) To measure the model effect that a strong gate hides, add a WEAKLY-gated fixture and post-run hidden scoring: - spike_runner: run_hidden_tests() injects a comprehensive suite the workflow never saw (manifest hidden_test_src→dest + hidden_test_cmd) and records hidden_pass / n_passed / n_failed — the deterministic 'true code quality' signal, no judge noise. - Fixture map_task_semver_weakgate: the workflow sees ONLY a thin 3-test gate (which a naive impl passes); the hidden 8-edge-case suite scores what the gate did not enforce. Validated end-to-end (no quota): STUB → weak-gate fail; NAIVE → weak-gate PASS but hidden 2p/6f; CORRECT → both pass. So hidden_pass discriminates code quality that the weak gate misses — the probe for 'does the actor model matter when the gate can't rescue a weak implementation'. make check green (2274). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(whole-skill): vague-contract fixture to isolate the model lever map_task_semver_vague = weak gate + VAGUE aag_contract (only "compare two semver strings, return -1/0/1" — the precedence rules are NOT spelled out), so the actor must rely on its own semver knowledge. Reuses the weak basic gate + hidden 8-edge-case suite. This is the one configuration where the actor model COULD matter for outcome (vague spec the gate can't rescue); the prior fixtures kept the contract complete and found model-insensitivity. Orchestrator-accepted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): outcome is invariant to model/gate/contract within competence Two more outcome configs (weak gate; weak gate + vague contract), hidden 8-edge-case suite, actor=haiku|sonnet|opus ×2 — all 18 runs produced correct semver (retries=0). Even a vague contract + weak gate + haiku gets it right first try (semver is within every tier's competence). Definitive outcome finding: for a task within model competence, outcome quality is invariant to model tier, test-gate strength, and contract completeness. The execution-model lever only bites ABOVE the weak model's competence (threshold not reached here). Evidence-based outcome-lever ranking: contract/gates ≫ model (only beyond-competence) ≫ prose. Invest in contract + mechanical gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(whole-skill): capability-discriminating calc fixture (expression evaluator) Web research (EvalPlus / HumanEval-Pro / arXiv 2511.04355, 2510.13908) shows the model-capability gap appears on EDGE-CASE-sensitive code, not isolated functions like semver. The canonical discriminator is an arithmetic expression evaluator — three traps that violate left-to-right intuition: right-associative ** (2**3**2), ** binding tighter than unary minus (-2**2 == -4 in Python), overloaded unary minus, negative exponents, error contract. map_task_calc: weak gate = 3 trivial expressions (a naive left-assoc evaluator passes); hidden 13-case suite probes the traps. Validated: STUB → gate fail; NAIVE (left-assoc **, tight unary) → weak gate PASS but hidden 9p/3f; CORRECT → 12p/0f. Contract states the rules (right-assoc, ** tighter than unary) WITHOUT leaking worked answers, so it tests implementation competence. Probe for the execution-model threshold the semver fixtures were too easy to reach. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(whole-skill): vague-contract calc fixture (decisive model-threshold probe) map_task_calc_vague = the calc evaluator with a VAGUE contract: it says only "standard precedence" and does NOT spell out right-associative ** or unary rules. So passing 2**3**2==512 requires the model to KNOW right-associativity from training, not read it from the spec. Hidden suite keeps only language-unambiguous cases (drops -2**2). Validated: a naive left-assoc impl passes the weak gate but fails hidden (9p/2f). If haiku fails these while sonnet/opus pass, the execution- model competence threshold is finally located; if haiku still passes, haiku 4.5 is genuinely competent on self-contained subtasks and the model is not the lever. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): capability probe — haiku matches bigger models on subtasks Built the canonical edge-case discriminator (arithmetic expression evaluator: right-assoc **, ** tighter than unary minus, negative exponent, errors) per web research. Sweeps: FULL-contract calc → haiku 12/12; VAGUE-contract calc (rules NOT stated) → haiku 11/11,11/11, sonnet 10/11,11/11, opus 11/11,11/11. Even on the edge-case-hard task with a vague contract, haiku 4.5 matched/edged the bigger models (the one miss was sonnet's — noise). No bigger-is-better ordering. Why: the tier gap lives on multi-step/multi-file/concurrency tasks, not single self-contained functions — and map-task decomposes into well-scoped one-file subtasks, the regime where haiku is competent. Within map-task's granularity the model is not the outcome lever (actor can run on haiku — cost win). Final ranking: contract/gates ≫ model (only beyond-competence) ≫ prose. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(whole-skill): acceptEdits in outcome harness + use orchestrator-model Two harness corrections after verifying transcripts (council-flagged confounds): 1. The real model lever is the ORCHESTRATOR model: in headless `claude -p`, map-task runs INLINE on the session model and does NOT spawn the actor/monitor sub-agents (0 Task dispatches in the transcript), so the earlier `--agent-model actor=...` override was a no-op — every "tier" actually ran on claude-opus-4-8. Use `--orchestrator-model` (= claude -p --model) to vary the model that does the work. 2. Add `--permission-mode acceptEdits` to the run: in headless default mode, file-edit permission prompts auto-deny, and weaker/less-agentic models stall ("I need permission to edit") — observed haiku hitting 4 perm-denials and giving up (0/11) while opus wrote freely (and even dispatched sub-agents). That is a permission/agency artifact, not code quality. With acceptEdits, haiku writes its code and passes 10/11 — confirming the gap on a well-scoped subtask is small. NOT a full bypass: only edits are auto-accepted; the temp is an isolated throwaway. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): CORRECTED model-outcome result (two confounds removed) Transcript verification (prompted by llm-council) exposed two confounds that invalidated the earlier "model doesn't matter for outcome": (1) --agent-model was a no-op (headless map-task runs inline on the orchestrator/session model, 0 sub-agent dispatches — all sweeps were opus-4-8); (2) permission stalls made haiku return 0/11 despite knowing the algorithm. Fixes: --orchestrator-model + --permission-mode acceptEdits. CLEAN result (calc_vague, hidden /11, n=4): haiku 9.8/11 (variable), sonnet 10.5/11, opus 10.7/11. Modest but consistent gradient opus ≥ sonnet > haiku; haiku more variable on hard edge cases (~1 test / ~8% gap on a single subtask), plus a separate agency gap. Earlier "model irrelevant for outcome" RETRACTED (it was opus measured 3x). Methodology lesson: verify the manipulation reached the transcript before trusting an agentic-harness result. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): DECISIVE — actor model matters for code (direct-invocation method) Cleanest measurement (read the agent prompt, call claude -p with that prompt + task + chosen model; no orchestrator/dispatch). calc_vague, hidden /11, n=3: haiku 8/10/10 (mean 9.3, never perfect, drops edge cases), sonnet 11/11/11, opus 11/11 (+1 broken run). The ACTOR model materially affects code quality — haiku systematically worse; use sonnet+ for actor. The two confounds (no-op --agent-model, permission-stall) had hidden this; "haiku ≈ opus" RETRACTED for the actor role. Probes also showed: headless skills don't auto-dispatch sub-agents (run inline), and even forced dispatches run on the session model — so per-agent model: is inert in headless. Architectural fix: orchestrate agents by direct claude -p --append-system-prompt <agent.md> --model <tier> so per-agent models actually take effect. Then which-agent/model-per-skill is deterministic. Prototype: .map/actor_probe.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(actor): cut 247 lines of worked examples (A/B-validated, -23%) The actor prompt carried a <Actor_Reference_Examples> block — 4 full worked implementations (~247 lines, ~22% of the prompt). The output structure is already specified earlier ("Required Output Structure"), so the examples were illustrative bloat loaded on every actor invocation. A/B test (direct actor-prompt invocation, calc_vague hidden /11, n=4): A uncut (1095L): haiku 9.3, sonnet 9.5 B cut (846L): haiku 10.0, sonnet 11.0 B >= A on both models — the cut does not regress (if anything helps), at -23% prompt size / ~7.7KB saved per invocation. Validated via .map/actor_probe.py (renders the agent template, runs it as the system prompt with a chosen model, scores against a hidden edge-case suite). make check green (2274), render byte-identical across all trees. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): A/B agent polish — actor examples cuttable, monitor's are not Built per-agent A/B harnesses (render agent template -> claude -p --append-system-prompt --model -> score). Actor: cut 247-line examples block, A/B B>=A (haiku 9.3->10.0, sonnet 9.5->11.0), kept (d78acd5). Monitor: cut 134 lines of good/bad examples, A/B showed recall preserved (4/4) but clean-pass on identical correct code A=6/6 vs B=4/6 -> reverted. The monitor's examples calibrate its accept threshold (gate agent => false-positive calibration matters). Lesson: "cut prompt examples" is safe for a generative agent, NOT for a gate agent; each agent needs its own A/B gate, and harness coverage bounds what's cuttable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(synthesizer): cut 95 lines of Step-7 strategy worked-examples (A/B-validated) The synthesizer's Step 7 carried two `**Example**:` worked-code blocks (base_enhance / fresh_generation, ~95 lines) illustrating strategies whose numbered-step LOGIC is already stated. A capable model applies the steps without the example code. A/B (built .map/synth_probe.py: feed 3 variant calc solutions — 1 correct + 2 buggy — the synthesizer must extract correct decisions and WRITE unified code, scored by the calc hidden suite, n=3): A (HEAD, 1161L): 33/33 hidden · B (cut, 1066L): 33/33 hidden -> B == A Non-regressive, -95 lines (-8%). Confirms the rule from the actor cut: pure worked-examples are cuttable in GENERATIVE agents; guidance/patterns and gate-calibration examples are NOT (monitor + actor-patterns cuts were A/B-rejected and reverted). make check green (2274), render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): agent-polish validated rule (2 wins, 2 rejects) + coverage limit Worked-examples are cuttable in generative agents (actor -247L, synthesizer -95L; both A/B B>=A) but NOT in gate agents or for guidance/patterns (monitor, actor error-patterns; both A/B-rejected, reverted). Remaining generative agents (task-decomposer/predictor/reflector/research-agent) output fuzzy artifacts with no clean deterministic gate, so further cuts can't satisfy the A/B rule without building lower-confidence gates first. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(task-decomposer): cut 151-line REFERENCE EXAMPLES (A/B-validated, -14%) The decomposer carried a "## REFERENCE EXAMPLES" section (Example A: a full 139-line worked CRUD blueprint) — pure illustration; the JSON Schema definition (separate section) stays. A capable model produces a valid blueprint from the schema + 5-phase process without the worked example. A/B (built .map/decomp_probe.py: feed a known 5-part feature, score blueprint on schema-validity + requirement-coverage + acyclic deps + sane count; opus, n=3): A (HEAD, 1078L): 3/3 PASS (coverage 5/5) · B (cut, 927L): 3/3 PASS (coverage 5/5) B == A, non-regressive, -151 lines (-14%). Third generative-agent worked-example cut to validate (after actor -247L, synthesizer -95L); gate-calibration/guidance cuts remain A/B-rejected (monitor, actor-patterns). make check green, render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(predictor): cut 417-line <examples> block (A/B-validated, -21%) The predictor carried an <examples> block (Examples 1-3: full worked impact analyses, ~417 lines incl. 107L/91L/46L JSON outputs) — illustration; the JSON Schema, risk rubric, and "Good vs Bad Predictions" guidance stay. A/B (built .map/pred_probe.py: feed a known breaking + high-blast-radius signature change across 8 callers; gate on the deterministic core — breaking_changes detected AND affected components identified; sonnet, n=3): A (HEAD, 2003L): 3/3 PASS · B (cut, 1586L): 3/3 PASS B == A, non-regressive, -417 lines (-21%, the largest cut). Fourth generative-agent worked-example cut validated (actor -247, synthesizer -95, decomposer -151, predictor -417 = 910 lines total). Gate covers breaking+affected detection (predictor's core); risk-level (subjective) was rated identically by both versions. make check green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(reflector): cut 139-line COMPLETE EXAMPLES (A/B-validated) The reflector's "# COMPLETE EXAMPLES" section (4 worked lesson-extraction examples, 139 lines) is illustration; the lesson-format rules + decision frameworks stay. A/B (built .map/reflect_probe.py: feed a clear SQL-injection failure outcome; gate on whether the reflector extracts a lesson naming the root-cause concept injection/parameterization; sonnet, n=3): A (HEAD): 3/3 PASS · B (cut): 3/3 PASS B == A, non-regressive, -139 lines. Fifth generative-agent worked-example cut. TOTAL across 5 generative agents: actor -247, synthesizer -95, task-decomposer -151, predictor -417, reflector -139 = 1049 lines of worked-example bloat removed, all A/B-validated B>=A. Gate agents (monitor etc.) NOT cut — examples calibrate the accept threshold (monitor A/B-rejected). (reflector gate is keyword-presence, fuzzier than the code-output gates.) make check green, render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): comprehensive agent-polish tally (5 wins, 1049 lines, A/B-validated) Swept all generative agents for worked-example/REFERENCE blocks, A/B-gated each: KEPT (B>=A): actor -247, synthesizer -95, task-decomposer -151, predictor -417, reflector -139 = 1049 lines. REJECTED (B<A, reverted): monitor examples (calibrate accept threshold), actor error-patterns/decision-tree (scaffold weaker models). Gate agents untouched (examples calibrate); research-agent lean. Final rule + reusable harnesses (.map/*_probe.py) documented. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

azalio and others added 6 commits June 5, 2026 03:32

azalio merged commit effd649 into main Jun 5, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Whole-skill outcome-eval harness + map-task body hardening + skill eval-sets#160

Whole-skill outcome-eval harness + map-task body hardening + skill eval-sets#160
azalio merged 6 commits into
mainfrom
organ-pipe-pyrite

azalio commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

azalio commented Jun 5, 2026

What

Harness + fixtures (reusable flow for any skill)

Key empirical finding (2 fixtures, 12 runs, 2 llm-council consults)

map-task body hardening (body-owned surfaces) + regression proof

Also

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant