Phase F: skill-eval description optimizer + HTML viewer (+ telegram isolation, applied map-plan description) by azalio · Pull Request #159 · azalio/map-framework

azalio · 2026-06-04T19:23:05Z

Phase F — Skill-Eval Description Optimizer + HTML Viewer (+ fixes)

Builds the optimization/reporting layer on top of the shipped F.1 eval engine, plus two fixes surfaced by a real end-to-end run.

What's in it

Phase F (9 subtasks, each Monitor-approved):

eval_schema.py: OptimizeIterationRecord / OptimizeResult / ProposerFn (round-trip to_dict/from_dict).
proposer.py: claude -p description proposer (argv-list, MAP_INVOKED_BY, None on all failures).
description_optimizer.py: anti-overfit optimizer — deterministic hashlib 60/40 split, selection by held-out TEST pass-rate (overfit candidates flagged and never selected), per-iteration resume=False jsonl isolation, candidate .claude/ re-seed cleaned in finally. Clock-free (caller supplies run_ts).
viewer.py: jinja2 HTML report — autoescape=True (XSS-safe), difflib per-iteration diff, overfit rows red.
apply_patcher.py: fail-loud block-scalar frontmatter --apply patcher → re-render to byte-identical trees, scoped git add, path-safe.
optimize / view CLI subcommands (exit 2/0/1, dry-run budget, --open best-effort, JSON+HTML artifacts).
3 optimizer eval-set fixtures (≥8 entries) + authoring note; no-anthropic guard over all 4 new modules.

Fix 1 — eval/telegram isolation (f46e78d): the eval claude -p subprocess inherited the user's telegram-bridge SessionStart hook (tg listen), which could block on the Telegram long-poll until the dispatch timeout (a triggered-skill cell then mis-records as a non-trigger). The dispatcher now sets TG_STATE_DIR to a config-less path so any tg/tg listen the agent runs exits immediately instead of blocking. (Best-effort: the SessionStart injection still appears via the plugin hook env, so a perfectly-obedient agent can still spend a few turns on it; the hard block is removed.)

Fix 2 — apply optimized map-plan description + correct the description cap (b1c515f): a real optimize run on map-plan selected a candidate that doubled the held-out TEST pass-rate (25%→50%, no overfit); applied it. The winner is 622 chars, which the old self-imposed 250-char test rejected — but 250 was a transient Claude Code v2.1.86 cap (raised to 1536 in v2.1.105, replaced by a usage-ranked listing budget in v2.1.129+). The Agent Skills spec maximum is 1024 chars and the model loads the full description for triggering, so the test now enforces 1024 and the proposer caps candidates at 1024 (prompt instruction + hard rejection of over-limit proposals). Refs: anthropics/claude-code #40121, #47627; code.claude.com/docs/en/skills.

Real run evidence (2 skills, not mocked)

map-plan: winner = iter 1 — baseline 20%/25% (train/test) → candidate 20%/50%, selected on held-out test, not overfit.
map-debug: no improvement (full tie → baseline retained), 25%.

Verification

make check (lint + pyright 0/0/0 + full pytest) and make check-render green on a clean run (2247→2256 tests as the suite grew across subtasks).
Final Verifier (Ralph loop): PASS.

Note for reviewers: on the author's machine, two timing-sensitive hook tests flake under heavy local load + an unusually large active-session transcript (test_hook_inventory_smoke::test_every_configured_hook_execs_via_shebang — resolve_session_id scanning a multi-MB transcript; test_safety_guardrails::test_execution_under_100ms — 116ms vs 100ms). Both are unrelated to this diff (it touches no hooks/memory) and pass in clean CI.

🤖 Generated with Claude Code

… in eval_schema Contract-first typed structs for the description optimizer (AC-7): round-trip to_dict/from_dict mirroring EvalResultRecord's _MISSING-tolerant pattern, flat per-iteration token totals, overfit/selected flags, and per-iteration jsonl paths so the viewer renders without re-parsing. ProposerFn aliases the Callable[[str, list[EvalResultRecord]], str | None] proposer interface. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Default ProposerFn: propose_description() builds an argv list for `claude -p --output-format json` (never a shell string), sets MAP_INVOKED_BY on the subprocess env (INV-2/HC-7), reads only .result, and returns the stripped candidate text or None on non-zero exit / timeout / OSError / malformed JSON / empty .result. Untrusted eval-record text is passed as a discrete argv element — no shell interpolation. No anthropic import, no ANTHROPIC_API_KEY, no --model flag (INV-1/AC-10/D2). Adds the no-anthropic source-scan guard (test_inv1_no_anthropic_optimize.py, extended in ST-009). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ction optimize() runs an N-iteration loop (iter 0 = baseline) over a deterministic 60/40 split and selects the candidate that maximises held-out TEST pass-rate, never the overfit one (train↑/test↓ flagged overfit, structurally excluded from selection). Determinism without random/datetime: split_train_test keys a hashlib.sha256 ordering by seed (clock-free; run_ts supplied by caller). Each candidate is evaluated by re-seeding a throwaway .claude/ copy, patching ONLY the SKILL.md frontmatter description (fail-loud), and running runner.run_eval over train+test to per-iteration isolated jsonl paths (resume=False); the temp seed is removed in finally — production .claude/ and templates_src/ untouched. Proposer returning None records a proposal_failed iteration and the loop continues with baseline eligible; full-tie => baseline wins (no_improvement). Also: guard iterations>=1 (avoid latent IndexError) + tests for the fail-loud frontmatter raise paths; extend the no-anthropic guard to cover this module. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

render_html()/render_to_path() render a pure HTML report from the typed OptimizeResult (Producer-Owns-Parse, no dispatch/subprocess): one row per iteration with a difflib unified-diff vs the prior iteration, train/test pass-rates, token totals, the selected iteration marked, and overfit rows (train↑/test↓) highlighted red. Security: candidate_description is untrusted claude -p output, so the renderer uses jinja2 Environment(autoescape=True) plus explicit |e on every dynamic slot — an embedded <script> is escaped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…er orchestration patch_skill_description() rewrites ONLY the YAML frontmatter `description:` block scalar of a SKILL.md(.jinja) — fail-loud (no partial write) on missing frontmatter/anchor, written as a `|-` block scalar that round-trips exactly. apply_optimized_description() edits the single-source templates_src/skills/<skill>/SKILL.md.jinja then re-renders both providers so generated trees stay byte-identical (INV-5/check-render), staging only the patched source + existing gate trees (scoped `git add --`, never `-A`, never commits). Path-safety (under templates_src, reject `.git/`) runs before any FS touch. Two distinct no-op messages (baseline wins vs winner==current); baseline never written back; skill-rules.json never auto-patched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`mapify skill-eval optimize <skill> --eval-set [--iterations --apply --open --dry-run]` drives the ST-003 optimizer (proposer + viewer + apply patcher), persisting OptimizeResult JSON + HTML under .map/eval-runs/<skill>/. Strict exit-code order: <5 entries -> Exit(2) (>=5 minimum) before any dispatcher; --dry-run prints the call budget + 'model: default (resolved by claude CLI)' and Exit(0) with zero quota; claude-absent -> Exit(1). run_ts is generated at the CLI boundary (clock-free core). `view <skill> [--result --open]` renders a stored result. --open is best-effort (never errors the run). Without --apply, nothing outside .map/ is touched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ng note New map-plan (9), map-efficient (9), and larger map-debug (10) eval-sets for `skill-eval optimize`, each splittable 60/40 with n_test>=3 so the held-out test pass-rate is meaningful (a <5-entry set degenerates to a 0/1 coin-flip). README documents the purpose + >=8 sizing rationale. The 3-entry map_debug_eval_set.json smoke fixture is left untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…docs Edited the .jinja sources (SKILL.md.jinja + skill-rules.json.jinja) to document `skill-eval optimize <skill> --eval-set --iterations --apply --open --dry-run` (anti-overfit held-out selection, propose-only default, two no-op cases) and `skill-eval view <skill> --result --open`, plus new skill-rules keywords/ intentPatterns; requires-cmd stays ["claude"]. Ran `make render-templates`; check-render + skills-consistency + template-render green. Also re-syncs the generated .map/scripts/map_orchestrator.py to its .jinja source: the gitignore-tolerance in record_subtask_result was already in templates_src but the committed generated file was stale (.map/ is not in check-render's gate, so the drift went uncaught). render-templates corrects it. Follow-up (noted): add .map/scripts/ to check-render's gate to catch such drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… green The INV-1/AC-10 AST guard now covers proposer, description_optimizer, viewer, and apply_patcher (no `import anthropic`, no ANTHROPIC_API_KEY). Integration gate verified: `make check` green (2247 passed), `make check-render` green, pyright 0/0/0 on every touched Phase F source + test file. uv-run confirmed to resolve to this worktree, so the green reflects the real Phase F code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tener The eval dispatcher's claude -p subprocess inherits the user's telegram-bridge plugin SessionStart hook, which injects an "always-listen — run `tg listen`" instruction. When the eval agent obeys it, `tg listen` blocks on the Telegram long-poll until the dispatch timeout, so a triggered-skill cell mis-records as a non-trigger (and the run wall-clock explodes — observed cells hitting the full 120s/3600s timeout). Fix: `_eval_subprocess_env(cwd)` sets TG_STATE_DIR to a config-less path inside the throwaway eval cwd. Any `tg listen`/`tg send` the agent runs inherits this env, finds no config.json, and exits immediately (`die`) instead of blocking — neutralising the hang. The operator's real ~/.claude/telegram config is never touched (per-subprocess override on a path removed with the temp cwd). Plugin hooks run in a restricted env so the cosmetic injection may still appear, but it is inert. MAP_INVOKED_BY guard preserved. Verified end-to-end: previously-hanging prompts now finish in ~80-160s through the real ClaudeSubprocessDispatcher with the skill still triggering. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…on cap to the spec (1024) Applies the optimizer's winning map-plan description (the candidate selected on held-out TEST pass-rate: 25%->50% vs baseline, no overfit) to the single-source .jinja and re-renders byte-identical trees. The winner is 622 chars, which the old self-imposed 250-char test rejected. That 250 was a transient Claude Code v2.1.86 listing cap — raised to 1536 in v2.1.105 and replaced by a usage-ranked listing budget in v2.1.129+. The Agent Skills spec maximum for `description` is 1024 chars, and the model loads the FULL description for triggering. So 250 was stale; the test now enforces the real 1024-char spec limit. The proposer is constrained to <=1024 chars (prompt instruction + hard rejection of an over-limit candidate) so the optimizer only ever proves a shippable description. Refs: anthropics/claude-code #40121, #47627; code.claude.com/docs/en/skills. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…at) (#161) * fix(skill-eval): trigger-eval no longer mis-records executing skills as non-trigger The description-optimizer dispatcher ran `claude -p <prompt>` to completion and parsed the transcript for the first `Skill` tool_use. For EXECUTING skills this corrupted the signal: a positive prompt for map-check runs the full `make check` suite, and map-task/map-efficient dispatch sub-agents — each overran the 120s per-call timeout, was treated as a transient failure, retried 3x, and finally recorded as a FALSE non-trigger. Latent until now: PR #159 applied only map-plan (fast text-only planner); the executing-skill eval-sets from PR #160 were never run. Trigger detection only needs the first `Skill` tool_use (the activation decision), which lands in the transcript early (~30s) — not the skill body executing. Fixes: - `--disallowed-tools` (Bash/Edit/Write/NotebookEdit/Task/Agent/WebFetch/WebSearch) on the claude -p argv: the body cannot run slow/mutating/network work or spawn sub-agents (which would orphan children and burn quota after the parent is killed), while the skill still TRIGGERS. Read-only tools and `Skill` stay allowed. - Timeout is now TERMINAL, not retried: a 90s overrun re-run behaves identically — retrying turned every executing-skill positive into ~3x wall-clock (267s observed). - On timeout, recover the trigger from the transcript (located by cwd slug, since the kill yields no envelope/session_id) instead of a false non-trigger, with a settle-poll to defeat the flush/visibility race right at the kill instant. - Robust cwd→project-dir slug (`_cwd_to_project_slug`, shared by both locators): replicates Claude Code's transform — replaces `/`, `.`, AND `_` (any non alnum/dash) with `-`. The old `replace("/","-").replace(".","-")` missed `_`, so any `tempfile.mkdtemp()` suffix containing `_` (e.g. `mapeval-s_u5zv32` → project dir `…-mapeval-s-u5zv32`) failed to locate — an intermittent false non-trigger that surfaced live in the sweep. Also handles macOS `/var`→ `/private/var` (tries raw + resolved cwd) and a slugified name-glob fallback. - Default per-call timeout 120s → 90s (well above the ~30s trigger latency). Validated live: map-check 75s (natural completion, triggers), map-efficient 90s (single attempt, trigger recovered — was 267s/3-attempts), negatives ~31s non-trigger. 11 regression tests incl. the symlink-slug and underscore-slug cases. make check green (2271 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(skill-eval): optimizer corrupted block-scalar SKILL.md descriptions → skill never triggered `_set_frontmatter_description` replaced only the single line starting with `description:`, leaving any YAML block-scalar body orphaned. For a skill whose frontmatter declares: description: | Run quality gates ... use map-plan or map-efficient. patching produced: description: "Run quality gates ... map-efficient.\n" Run quality gates ... map-efficient. ← orphaned, indented, now invalid That is malformed YAML, so Claude Code failed to register the skill in the seeded candidate tree — it never auto-triggered, and EVERY eval cell (including the iter-0 baseline) recorded a false non-trigger. The optimizer would then "improve" a description that was actually fine, against an all-zero baseline. It escaped notice because PR #159 applied only map-plan (single-line description); all other map-* skills use `description: |` block scalars and were silently broken. Fix: when the description value is a block scalar (`|`/`>`) or any indented multi-line value, also consume the continuation lines (indented deeper than the key) before substituting the single double-quoted replacement line. Preserves the key's indentation and every sibling key/body line. Validated: candidate-seeding path now triggers map-check (was None). Two regression tests (block `|` and folded `>`) assert the patched frontmatter re-parses as valid YAML with exactly the new description and no orphaned body. make check green (2273). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(skill-eval): --model/--runs flags + model-tier trigger experiment Adds `--model <alias>` and `--runs N` to `mapify skill-eval run` (and a `model` param on ClaudeSubprocessDispatcher). `--model` pins the claude -p model tier (haiku/sonnet/opus); default omits the flag → CLI session model (unchanged behaviour). `--runs` averages out single-pass noise when comparing models. Motivated by Murin 2026 (arXiv:2606.05970): "model choice dominates prompt phrasing; a larger model redistributes agreement rather than uniformly raising it." Tested the analog for skill trigger ROUTING (map-check, n=9/model): haiku 6/9 (67%, 26s) · sonnet 7/9 (78%, 51s) · opus 6/9 (67%, 51s) Finding: model tier does NOT reliably improve trigger routing — Opus ties Haiku and trails Sonnet; the spread is within noise and cell-by-cell it's redistribution, not improvement (exactly Murin's pattern). One positive prompt was missed by ALL tiers — a description gap no model size fixes. The lever for trigger accuracy is the `description:` (the optimizer sweep), not the model; execution-quality tiering is a separate, untested question. Full write-up in docs/model-tier-trigger-experiment.md. 2 regression tests for the new flags. make check green (2274 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(skill-eval): firm-up model-tier results revise the pilot conclusion 3 passes on map-check (n=27) + map-explain + map-task (n=9 each). The pilot's single-pass n=9 (haiku 67% ≈ opus 67%) was noise. Authoritative matrix: skill haiku sonnet opus map-check 59% 89% 78% map-explain 33% 33% 44% map-task 33% 67% 78% overall 49% 73% 71% Revised findings: (1) model tier DOES matter — Haiku is consistently weakest (−24pp vs Sonnet), not the free lunch the pilot suggested; (2) "bigger is better" does NOT hold — Opus (71%) ties/trails Sonnet (73%) and they redistribute across skills (Murin's exact pattern), so Opus buys nothing over Sonnet for routing; (3) the description is the ceiling — map-explain caps at 33-44% across all tiers. Recommendation for routing: Sonnet (sweet spot); the lever is the description (optimizer sweep), not model size. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(whole-skill): model-lever knobs + a discriminating outcome fixture Foundation for optimizing skill OUTCOME quality (not trigger descriptions), covering the model lever and a fixture where quality actually varies. spike_runner.py: - --agent-model AGENT=MODEL (repeatable) rewrites a seeded agent's `model:` frontmatter — the precise EXECUTION-quality lever (the actor writes the code, so its model is the code-quality lever; sub-agents use their own model:, not the orchestrator's --model). - --orchestrator-model passes --model to the top-level claude -p running the body. Fixture map_task_semver: - Existing whole_skill fixtures (scope/blocker traps) all score QUALITY=1.0 — they can't discriminate code quality. This one is genuinely non-trivial: semver 2.0.0 compare() with pre-release precedence, numeric-vs-lexical identifier comparison, and build-metadata handling. A naive implementation passes the easy tests but fails the edge cases, so the test gate (task_pass) and QUALITY VARY with the lever. - Validated: stub → 12 failed; a correct reference impl → 12 passed; orchestrator accepts the hand-crafted plan+blueprint (resume_single_subtask ST-001 → RESEARCH). make check green (2274 passed); fixture excluded from gate per whole_skill rules. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): model lever on OUTCOME is absorbed by the gate (semver fixture) Sweep on the discriminating map_task_semver fixture, full /map-task with --agent-model actor=haiku|sonnet|opus (×2): all 6 runs task_pass=True (correct semver). The one QUALITY=0.5 is a judge API-529, not a code failure. The test gate + MONITOR retry loop drive even haiku to passing, and haiku is not slower. Conclusion (execution-data-backed): outcome levers rank CONTRACT/GATES (#3) ≫ MODEL (#2) ≫ PROSE (#1). Invest in validation_criteria / validators / test coverage to lift skill outcome quality — not actor model or prose. Model may still matter on weakly-gated tasks (untested); QUALITY saturates on gated tasks so a finer signal needs retry_count logging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(whole-skill): capture serial retry_count in the outcome harness On a well-gated task QUALITY saturates (every model tier passes the test gate), so the model effect hides in HOW MANY actor retries the MONITOR loop needed. spike_runner now reads retry_count / clean_retry_count / contaminated_retry_count from the run's step_state.json (before temp cleanup) and records + prints them, so "passes, but only after more iterations" is visible in future model/prompt sweeps. Addresses the method caveat from the semver model sweep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(whole-skill): weakly-gated fixture + hidden-suite scoring (model-effect probe) To measure the model effect that a strong gate hides, add a WEAKLY-gated fixture and post-run hidden scoring: - spike_runner: run_hidden_tests() injects a comprehensive suite the workflow never saw (manifest hidden_test_src→dest + hidden_test_cmd) and records hidden_pass / n_passed / n_failed — the deterministic 'true code quality' signal, no judge noise. - Fixture map_task_semver_weakgate: the workflow sees ONLY a thin 3-test gate (which a naive impl passes); the hidden 8-edge-case suite scores what the gate did not enforce. Validated end-to-end (no quota): STUB → weak-gate fail; NAIVE → weak-gate PASS but hidden 2p/6f; CORRECT → both pass. So hidden_pass discriminates code quality that the weak gate misses — the probe for 'does the actor model matter when the gate can't rescue a weak implementation'. make check green (2274). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(whole-skill): vague-contract fixture to isolate the model lever map_task_semver_vague = weak gate + VAGUE aag_contract (only "compare two semver strings, return -1/0/1" — the precedence rules are NOT spelled out), so the actor must rely on its own semver knowledge. Reuses the weak basic gate + hidden 8-edge-case suite. This is the one configuration where the actor model COULD matter for outcome (vague spec the gate can't rescue); the prior fixtures kept the contract complete and found model-insensitivity. Orchestrator-accepted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): outcome is invariant to model/gate/contract within competence Two more outcome configs (weak gate; weak gate + vague contract), hidden 8-edge-case suite, actor=haiku|sonnet|opus ×2 — all 18 runs produced correct semver (retries=0). Even a vague contract + weak gate + haiku gets it right first try (semver is within every tier's competence). Definitive outcome finding: for a task within model competence, outcome quality is invariant to model tier, test-gate strength, and contract completeness. The execution-model lever only bites ABOVE the weak model's competence (threshold not reached here). Evidence-based outcome-lever ranking: contract/gates ≫ model (only beyond-competence) ≫ prose. Invest in contract + mechanical gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(whole-skill): capability-discriminating calc fixture (expression evaluator) Web research (EvalPlus / HumanEval-Pro / arXiv 2511.04355, 2510.13908) shows the model-capability gap appears on EDGE-CASE-sensitive code, not isolated functions like semver. The canonical discriminator is an arithmetic expression evaluator — three traps that violate left-to-right intuition: right-associative ** (2**3**2), ** binding tighter than unary minus (-2**2 == -4 in Python), overloaded unary minus, negative exponents, error contract. map_task_calc: weak gate = 3 trivial expressions (a naive left-assoc evaluator passes); hidden 13-case suite probes the traps. Validated: STUB → gate fail; NAIVE (left-assoc **, tight unary) → weak gate PASS but hidden 9p/3f; CORRECT → 12p/0f. Contract states the rules (right-assoc, ** tighter than unary) WITHOUT leaking worked answers, so it tests implementation competence. Probe for the execution-model threshold the semver fixtures were too easy to reach. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(whole-skill): vague-contract calc fixture (decisive model-threshold probe) map_task_calc_vague = the calc evaluator with a VAGUE contract: it says only "standard precedence" and does NOT spell out right-associative ** or unary rules. So passing 2**3**2==512 requires the model to KNOW right-associativity from training, not read it from the spec. Hidden suite keeps only language-unambiguous cases (drops -2**2). Validated: a naive left-assoc impl passes the weak gate but fails hidden (9p/2f). If haiku fails these while sonnet/opus pass, the execution- model competence threshold is finally located; if haiku still passes, haiku 4.5 is genuinely competent on self-contained subtasks and the model is not the lever. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): capability probe — haiku matches bigger models on subtasks Built the canonical edge-case discriminator (arithmetic expression evaluator: right-assoc **, ** tighter than unary minus, negative exponent, errors) per web research. Sweeps: FULL-contract calc → haiku 12/12; VAGUE-contract calc (rules NOT stated) → haiku 11/11,11/11, sonnet 10/11,11/11, opus 11/11,11/11. Even on the edge-case-hard task with a vague contract, haiku 4.5 matched/edged the bigger models (the one miss was sonnet's — noise). No bigger-is-better ordering. Why: the tier gap lives on multi-step/multi-file/concurrency tasks, not single self-contained functions — and map-task decomposes into well-scoped one-file subtasks, the regime where haiku is competent. Within map-task's granularity the model is not the outcome lever (actor can run on haiku — cost win). Final ranking: contract/gates ≫ model (only beyond-competence) ≫ prose. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(whole-skill): acceptEdits in outcome harness + use orchestrator-model Two harness corrections after verifying transcripts (council-flagged confounds): 1. The real model lever is the ORCHESTRATOR model: in headless `claude -p`, map-task runs INLINE on the session model and does NOT spawn the actor/monitor sub-agents (0 Task dispatches in the transcript), so the earlier `--agent-model actor=...` override was a no-op — every "tier" actually ran on claude-opus-4-8. Use `--orchestrator-model` (= claude -p --model) to vary the model that does the work. 2. Add `--permission-mode acceptEdits` to the run: in headless default mode, file-edit permission prompts auto-deny, and weaker/less-agentic models stall ("I need permission to edit") — observed haiku hitting 4 perm-denials and giving up (0/11) while opus wrote freely (and even dispatched sub-agents). That is a permission/agency artifact, not code quality. With acceptEdits, haiku writes its code and passes 10/11 — confirming the gap on a well-scoped subtask is small. NOT a full bypass: only edits are auto-accepted; the temp is an isolated throwaway. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): CORRECTED model-outcome result (two confounds removed) Transcript verification (prompted by llm-council) exposed two confounds that invalidated the earlier "model doesn't matter for outcome": (1) --agent-model was a no-op (headless map-task runs inline on the orchestrator/session model, 0 sub-agent dispatches — all sweeps were opus-4-8); (2) permission stalls made haiku return 0/11 despite knowing the algorithm. Fixes: --orchestrator-model + --permission-mode acceptEdits. CLEAN result (calc_vague, hidden /11, n=4): haiku 9.8/11 (variable), sonnet 10.5/11, opus 10.7/11. Modest but consistent gradient opus ≥ sonnet > haiku; haiku more variable on hard edge cases (~1 test / ~8% gap on a single subtask), plus a separate agency gap. Earlier "model irrelevant for outcome" RETRACTED (it was opus measured 3x). Methodology lesson: verify the manipulation reached the transcript before trusting an agentic-harness result. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): DECISIVE — actor model matters for code (direct-invocation method) Cleanest measurement (read the agent prompt, call claude -p with that prompt + task + chosen model; no orchestrator/dispatch). calc_vague, hidden /11, n=3: haiku 8/10/10 (mean 9.3, never perfect, drops edge cases), sonnet 11/11/11, opus 11/11 (+1 broken run). The ACTOR model materially affects code quality — haiku systematically worse; use sonnet+ for actor. The two confounds (no-op --agent-model, permission-stall) had hidden this; "haiku ≈ opus" RETRACTED for the actor role. Probes also showed: headless skills don't auto-dispatch sub-agents (run inline), and even forced dispatches run on the session model — so per-agent model: is inert in headless. Architectural fix: orchestrate agents by direct claude -p --append-system-prompt <agent.md> --model <tier> so per-agent models actually take effect. Then which-agent/model-per-skill is deterministic. Prototype: .map/actor_probe.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(actor): cut 247 lines of worked examples (A/B-validated, -23%) The actor prompt carried a <Actor_Reference_Examples> block — 4 full worked implementations (~247 lines, ~22% of the prompt). The output structure is already specified earlier ("Required Output Structure"), so the examples were illustrative bloat loaded on every actor invocation. A/B test (direct actor-prompt invocation, calc_vague hidden /11, n=4): A uncut (1095L): haiku 9.3, sonnet 9.5 B cut (846L): haiku 10.0, sonnet 11.0 B >= A on both models — the cut does not regress (if anything helps), at -23% prompt size / ~7.7KB saved per invocation. Validated via .map/actor_probe.py (renders the agent template, runs it as the system prompt with a chosen model, scores against a hidden edge-case suite). make check green (2274), render byte-identical across all trees. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): A/B agent polish — actor examples cuttable, monitor's are not Built per-agent A/B harnesses (render agent template -> claude -p --append-system-prompt --model -> score). Actor: cut 247-line examples block, A/B B>=A (haiku 9.3->10.0, sonnet 9.5->11.0), kept (d78acd5). Monitor: cut 134 lines of good/bad examples, A/B showed recall preserved (4/4) but clean-pass on identical correct code A=6/6 vs B=4/6 -> reverted. The monitor's examples calibrate its accept threshold (gate agent => false-positive calibration matters). Lesson: "cut prompt examples" is safe for a generative agent, NOT for a gate agent; each agent needs its own A/B gate, and harness coverage bounds what's cuttable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(synthesizer): cut 95 lines of Step-7 strategy worked-examples (A/B-validated) The synthesizer's Step 7 carried two `**Example**:` worked-code blocks (base_enhance / fresh_generation, ~95 lines) illustrating strategies whose numbered-step LOGIC is already stated. A capable model applies the steps without the example code. A/B (built .map/synth_probe.py: feed 3 variant calc solutions — 1 correct + 2 buggy — the synthesizer must extract correct decisions and WRITE unified code, scored by the calc hidden suite, n=3): A (HEAD, 1161L): 33/33 hidden · B (cut, 1066L): 33/33 hidden -> B == A Non-regressive, -95 lines (-8%). Confirms the rule from the actor cut: pure worked-examples are cuttable in GENERATIVE agents; guidance/patterns and gate-calibration examples are NOT (monitor + actor-patterns cuts were A/B-rejected and reverted). make check green (2274), render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): agent-polish validated rule (2 wins, 2 rejects) + coverage limit Worked-examples are cuttable in generative agents (actor -247L, synthesizer -95L; both A/B B>=A) but NOT in gate agents or for guidance/patterns (monitor, actor error-patterns; both A/B-rejected, reverted). Remaining generative agents (task-decomposer/predictor/reflector/research-agent) output fuzzy artifacts with no clean deterministic gate, so further cuts can't satisfy the A/B rule without building lower-confidence gates first. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(task-decomposer): cut 151-line REFERENCE EXAMPLES (A/B-validated, -14%) The decomposer carried a "## REFERENCE EXAMPLES" section (Example A: a full 139-line worked CRUD blueprint) — pure illustration; the JSON Schema definition (separate section) stays. A capable model produces a valid blueprint from the schema + 5-phase process without the worked example. A/B (built .map/decomp_probe.py: feed a known 5-part feature, score blueprint on schema-validity + requirement-coverage + acyclic deps + sane count; opus, n=3): A (HEAD, 1078L): 3/3 PASS (coverage 5/5) · B (cut, 927L): 3/3 PASS (coverage 5/5) B == A, non-regressive, -151 lines (-14%). Third generative-agent worked-example cut to validate (after actor -247L, synthesizer -95L); gate-calibration/guidance cuts remain A/B-rejected (monitor, actor-patterns). make check green, render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(predictor): cut 417-line <examples> block (A/B-validated, -21%) The predictor carried an <examples> block (Examples 1-3: full worked impact analyses, ~417 lines incl. 107L/91L/46L JSON outputs) — illustration; the JSON Schema, risk rubric, and "Good vs Bad Predictions" guidance stay. A/B (built .map/pred_probe.py: feed a known breaking + high-blast-radius signature change across 8 callers; gate on the deterministic core — breaking_changes detected AND affected components identified; sonnet, n=3): A (HEAD, 2003L): 3/3 PASS · B (cut, 1586L): 3/3 PASS B == A, non-regressive, -417 lines (-21%, the largest cut). Fourth generative-agent worked-example cut validated (actor -247, synthesizer -95, decomposer -151, predictor -417 = 910 lines total). Gate covers breaking+affected detection (predictor's core); risk-level (subjective) was rated identically by both versions. make check green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * polish(reflector): cut 139-line COMPLETE EXAMPLES (A/B-validated) The reflector's "# COMPLETE EXAMPLES" section (4 worked lesson-extraction examples, 139 lines) is illustration; the lesson-format rules + decision frameworks stay. A/B (built .map/reflect_probe.py: feed a clear SQL-injection failure outcome; gate on whether the reflector extracts a lesson naming the root-cause concept injection/parameterization; sonnet, n=3): A (HEAD): 3/3 PASS · B (cut): 3/3 PASS B == A, non-regressive, -139 lines. Fifth generative-agent worked-example cut. TOTAL across 5 generative agents: actor -247, synthesizer -95, task-decomposer -151, predictor -417, reflector -139 = 1049 lines of worked-example bloat removed, all A/B-validated B>=A. Gate agents (monitor etc.) NOT cut — examples calibrate the accept threshold (monitor A/B-rejected). (reflector gate is keyword-presence, fuzzier than the code-output gates.) make check green, render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(whole-skill): comprehensive agent-polish tally (5 wins, 1049 lines, A/B-validated) Swept all generative agents for worked-example/REFERENCE blocks, A/B-gated each: KEPT (B>=A): actor -247, synthesizer -95, task-decomposer -151, predictor -417, reflector -139 = 1049 lines. REJECTED (B<A, reverted): monitor examples (calibrate accept threshold), actor error-patterns/decision-tree (scaffold weaker models). Gate agents untouched (examples calibrate); research-agent lean. Final rule + reusable harnesses (.map/*_probe.py) documented. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

azalio and others added 11 commits June 4, 2026 14:27

azalio merged commit 1b1d064 into main Jun 4, 2026
6 checks passed

azalio deleted the smoke-cobalt branch June 4, 2026 19:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase F: skill-eval description optimizer + HTML viewer (+ telegram isolation, applied map-plan description)#159

Phase F: skill-eval description optimizer + HTML viewer (+ telegram isolation, applied map-plan description)#159
azalio merged 11 commits into
mainfrom
smoke-cobalt

azalio commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

azalio commented Jun 4, 2026

Phase F — Skill-Eval Description Optimizer + HTML Viewer (+ fixes)

What's in it

Real run evidence (2 skills, not mocked)

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant