Skip to content

Phase F: skill-eval description optimizer + HTML viewer (+ telegram isolation, applied map-plan description)#159

Merged
azalio merged 11 commits into
mainfrom
smoke-cobalt
Jun 4, 2026
Merged

Phase F: skill-eval description optimizer + HTML viewer (+ telegram isolation, applied map-plan description)#159
azalio merged 11 commits into
mainfrom
smoke-cobalt

Conversation

@azalio

@azalio azalio commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Phase F — Skill-Eval Description Optimizer + HTML Viewer (+ fixes)

Builds the optimization/reporting layer on top of the shipped F.1 eval engine, plus two fixes surfaced by a real end-to-end run.

What's in it

Phase F (9 subtasks, each Monitor-approved):

  • eval_schema.py: OptimizeIterationRecord / OptimizeResult / ProposerFn (round-trip to_dict/from_dict).
  • proposer.py: claude -p description proposer (argv-list, MAP_INVOKED_BY, None on all failures).
  • description_optimizer.py: anti-overfit optimizer — deterministic hashlib 60/40 split, selection by held-out TEST pass-rate (overfit candidates flagged and never selected), per-iteration resume=False jsonl isolation, candidate .claude/ re-seed cleaned in finally. Clock-free (caller supplies run_ts).
  • viewer.py: jinja2 HTML report — autoescape=True (XSS-safe), difflib per-iteration diff, overfit rows red.
  • apply_patcher.py: fail-loud block-scalar frontmatter --apply patcher → re-render to byte-identical trees, scoped git add, path-safe.
  • optimize / view CLI subcommands (exit 2/0/1, dry-run budget, --open best-effort, JSON+HTML artifacts).
  • 3 optimizer eval-set fixtures (≥8 entries) + authoring note; no-anthropic guard over all 4 new modules.

Fix 1 — eval/telegram isolation (f46e78d): the eval claude -p subprocess inherited the user's telegram-bridge SessionStart hook (tg listen), which could block on the Telegram long-poll until the dispatch timeout (a triggered-skill cell then mis-records as a non-trigger). The dispatcher now sets TG_STATE_DIR to a config-less path so any tg/tg listen the agent runs exits immediately instead of blocking. (Best-effort: the SessionStart injection still appears via the plugin hook env, so a perfectly-obedient agent can still spend a few turns on it; the hard block is removed.)

Fix 2 — apply optimized map-plan description + correct the description cap (b1c515f): a real optimize run on map-plan selected a candidate that doubled the held-out TEST pass-rate (25%→50%, no overfit); applied it. The winner is 622 chars, which the old self-imposed 250-char test rejected — but 250 was a transient Claude Code v2.1.86 cap (raised to 1536 in v2.1.105, replaced by a usage-ranked listing budget in v2.1.129+). The Agent Skills spec maximum is 1024 chars and the model loads the full description for triggering, so the test now enforces 1024 and the proposer caps candidates at 1024 (prompt instruction + hard rejection of over-limit proposals). Refs: anthropics/claude-code #40121, #47627; code.claude.com/docs/en/skills.

Real run evidence (2 skills, not mocked)

  • map-plan: winner = iter 1 — baseline 20%/25% (train/test) → candidate 20%/50%, selected on held-out test, not overfit.
  • map-debug: no improvement (full tie → baseline retained), 25%.

Verification

  • make check (lint + pyright 0/0/0 + full pytest) and make check-render green on a clean run (2247→2256 tests as the suite grew across subtasks).
  • Final Verifier (Ralph loop): PASS.

Note for reviewers: on the author's machine, two timing-sensitive hook tests flake under heavy local load + an unusually large active-session transcript (test_hook_inventory_smoke::test_every_configured_hook_execs_via_shebangresolve_session_id scanning a multi-MB transcript; test_safety_guardrails::test_execution_under_100ms — 116ms vs 100ms). Both are unrelated to this diff (it touches no hooks/memory) and pass in clean CI.

🤖 Generated with Claude Code

azalio and others added 11 commits June 4, 2026 14:27
… in eval_schema

Contract-first typed structs for the description optimizer (AC-7): round-trip
to_dict/from_dict mirroring EvalResultRecord's _MISSING-tolerant pattern, flat
per-iteration token totals, overfit/selected flags, and per-iteration jsonl
paths so the viewer renders without re-parsing. ProposerFn aliases the
Callable[[str, list[EvalResultRecord]], str | None] proposer interface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Default ProposerFn: propose_description() builds an argv list for
`claude -p --output-format json` (never a shell string), sets MAP_INVOKED_BY
on the subprocess env (INV-2/HC-7), reads only .result, and returns the
stripped candidate text or None on non-zero exit / timeout / OSError /
malformed JSON / empty .result. Untrusted eval-record text is passed as a
discrete argv element — no shell interpolation. No anthropic import, no
ANTHROPIC_API_KEY, no --model flag (INV-1/AC-10/D2). Adds the no-anthropic
source-scan guard (test_inv1_no_anthropic_optimize.py, extended in ST-009).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ction

optimize() runs an N-iteration loop (iter 0 = baseline) over a deterministic
60/40 split and selects the candidate that maximises held-out TEST pass-rate,
never the overfit one (train↑/test↓ flagged overfit, structurally excluded
from selection). Determinism without random/datetime: split_train_test keys a
hashlib.sha256 ordering by seed (clock-free; run_ts supplied by caller). Each
candidate is evaluated by re-seeding a throwaway .claude/ copy, patching ONLY
the SKILL.md frontmatter description (fail-loud), and running runner.run_eval
over train+test to per-iteration isolated jsonl paths (resume=False); the temp
seed is removed in finally — production .claude/ and templates_src/ untouched.
Proposer returning None records a proposal_failed iteration and the loop
continues with baseline eligible; full-tie => baseline wins (no_improvement).

Also: guard iterations>=1 (avoid latent IndexError) + tests for the fail-loud
frontmatter raise paths; extend the no-anthropic guard to cover this module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
render_html()/render_to_path() render a pure HTML report from the typed
OptimizeResult (Producer-Owns-Parse, no dispatch/subprocess): one row per
iteration with a difflib unified-diff vs the prior iteration, train/test
pass-rates, token totals, the selected iteration marked, and overfit rows
(train↑/test↓) highlighted red. Security: candidate_description is untrusted
claude -p output, so the renderer uses jinja2 Environment(autoescape=True)
plus explicit |e on every dynamic slot — an embedded <script> is escaped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er orchestration

patch_skill_description() rewrites ONLY the YAML frontmatter `description:` block
scalar of a SKILL.md(.jinja) — fail-loud (no partial write) on missing
frontmatter/anchor, written as a `|-` block scalar that round-trips exactly.
apply_optimized_description() edits the single-source
templates_src/skills/<skill>/SKILL.md.jinja then re-renders both providers so
generated trees stay byte-identical (INV-5/check-render), staging only the
patched source + existing gate trees (scoped `git add --`, never `-A`, never
commits). Path-safety (under templates_src, reject `.git/`) runs before any FS
touch. Two distinct no-op messages (baseline wins vs winner==current); baseline
never written back; skill-rules.json never auto-patched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`mapify skill-eval optimize <skill> --eval-set [--iterations --apply --open
--dry-run]` drives the ST-003 optimizer (proposer + viewer + apply patcher),
persisting OptimizeResult JSON + HTML under .map/eval-runs/<skill>/. Strict
exit-code order: <5 entries -> Exit(2) (>=5 minimum) before any dispatcher;
--dry-run prints the call budget + 'model: default (resolved by claude CLI)'
and Exit(0) with zero quota; claude-absent -> Exit(1). run_ts is generated at
the CLI boundary (clock-free core). `view <skill> [--result --open]` renders a
stored result. --open is best-effort (never errors the run). Without --apply,
nothing outside .map/ is touched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng note

New map-plan (9), map-efficient (9), and larger map-debug (10) eval-sets for
`skill-eval optimize`, each splittable 60/40 with n_test>=3 so the held-out
test pass-rate is meaningful (a <5-entry set degenerates to a 0/1 coin-flip).
README documents the purpose + >=8 sizing rationale. The 3-entry
map_debug_eval_set.json smoke fixture is left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…docs

Edited the .jinja sources (SKILL.md.jinja + skill-rules.json.jinja) to document
`skill-eval optimize <skill> --eval-set --iterations --apply --open --dry-run`
(anti-overfit held-out selection, propose-only default, two no-op cases) and
`skill-eval view <skill> --result --open`, plus new skill-rules keywords/
intentPatterns; requires-cmd stays ["claude"]. Ran `make render-templates`;
check-render + skills-consistency + template-render green.

Also re-syncs the generated .map/scripts/map_orchestrator.py to its .jinja
source: the gitignore-tolerance in record_subtask_result was already in
templates_src but the committed generated file was stale (.map/ is not in
check-render's gate, so the drift went uncaught). render-templates corrects it.
Follow-up (noted): add .map/scripts/ to check-render's gate to catch such drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… green

The INV-1/AC-10 AST guard now covers proposer, description_optimizer, viewer,
and apply_patcher (no `import anthropic`, no ANTHROPIC_API_KEY). Integration
gate verified: `make check` green (2247 passed), `make check-render` green,
pyright 0/0/0 on every touched Phase F source + test file. uv-run confirmed to
resolve to this worktree, so the green reflects the real Phase F code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tener

The eval dispatcher's claude -p subprocess inherits the user's telegram-bridge
plugin SessionStart hook, which injects an "always-listen — run `tg listen`"
instruction. When the eval agent obeys it, `tg listen` blocks on the Telegram
long-poll until the dispatch timeout, so a triggered-skill cell mis-records as a
non-trigger (and the run wall-clock explodes — observed cells hitting the full
120s/3600s timeout).

Fix: `_eval_subprocess_env(cwd)` sets TG_STATE_DIR to a config-less path inside
the throwaway eval cwd. Any `tg listen`/`tg send` the agent runs inherits this
env, finds no config.json, and exits immediately (`die`) instead of blocking —
neutralising the hang. The operator's real ~/.claude/telegram config is never
touched (per-subprocess override on a path removed with the temp cwd). Plugin
hooks run in a restricted env so the cosmetic injection may still appear, but it
is inert. MAP_INVOKED_BY guard preserved.

Verified end-to-end: previously-hanging prompts now finish in ~80-160s through
the real ClaudeSubprocessDispatcher with the skill still triggering.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on cap to the spec (1024)

Applies the optimizer's winning map-plan description (the candidate selected on
held-out TEST pass-rate: 25%->50% vs baseline, no overfit) to the single-source
.jinja and re-renders byte-identical trees.

The winner is 622 chars, which the old self-imposed 250-char test rejected. That
250 was a transient Claude Code v2.1.86 listing cap — raised to 1536 in v2.1.105
and replaced by a usage-ranked listing budget in v2.1.129+. The Agent Skills
spec maximum for `description` is 1024 chars, and the model loads the FULL
description for triggering. So 250 was stale; the test now enforces the real
1024-char spec limit.

The proposer is constrained to <=1024 chars (prompt instruction + hard rejection
of an over-limit candidate) so the optimizer only ever proves a shippable
description. Refs: anthropics/claude-code #40121, #47627; code.claude.com/docs/en/skills.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@azalio azalio merged commit 1b1d064 into main Jun 4, 2026
6 checks passed
@azalio azalio deleted the smoke-cobalt branch June 4, 2026 19:30
azalio added a commit that referenced this pull request Jun 6, 2026
…at) (#161)

* fix(skill-eval): trigger-eval no longer mis-records executing skills as non-trigger

The description-optimizer dispatcher ran `claude -p <prompt>` to completion and
parsed the transcript for the first `Skill` tool_use. For EXECUTING skills this
corrupted the signal: a positive prompt for map-check runs the full `make check`
suite, and map-task/map-efficient dispatch sub-agents — each overran the 120s
per-call timeout, was treated as a transient failure, retried 3x, and finally
recorded as a FALSE non-trigger. Latent until now: PR #159 applied only map-plan
(fast text-only planner); the executing-skill eval-sets from PR #160 were never run.

Trigger detection only needs the first `Skill` tool_use (the activation decision),
which lands in the transcript early (~30s) — not the skill body executing. Fixes:

- `--disallowed-tools` (Bash/Edit/Write/NotebookEdit/Task/Agent/WebFetch/WebSearch)
  on the claude -p argv: the body cannot run slow/mutating/network work or spawn
  sub-agents (which would orphan children and burn quota after the parent is
  killed), while the skill still TRIGGERS. Read-only tools and `Skill` stay allowed.
- Timeout is now TERMINAL, not retried: a 90s overrun re-run behaves identically —
  retrying turned every executing-skill positive into ~3x wall-clock (267s observed).
- On timeout, recover the trigger from the transcript (located by cwd slug, since
  the kill yields no envelope/session_id) instead of a false non-trigger, with a
  settle-poll to defeat the flush/visibility race right at the kill instant.
- Robust cwd→project-dir slug (`_cwd_to_project_slug`, shared by both locators):
  replicates Claude Code's transform — replaces `/`, `.`, AND `_` (any non
  alnum/dash) with `-`. The old `replace("/","-").replace(".","-")` missed `_`,
  so any `tempfile.mkdtemp()` suffix containing `_` (e.g. `mapeval-s_u5zv32` →
  project dir `…-mapeval-s-u5zv32`) failed to locate — an intermittent false
  non-trigger that surfaced live in the sweep. Also handles macOS `/var`→
  `/private/var` (tries raw + resolved cwd) and a slugified name-glob fallback.
- Default per-call timeout 120s → 90s (well above the ~30s trigger latency).

Validated live: map-check 75s (natural completion, triggers), map-efficient 90s
(single attempt, trigger recovered — was 267s/3-attempts), negatives ~31s non-trigger.
11 regression tests incl. the symlink-slug and underscore-slug cases.
make check green (2271 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(skill-eval): optimizer corrupted block-scalar SKILL.md descriptions → skill never triggered

`_set_frontmatter_description` replaced only the single line starting with
`description:`, leaving any YAML block-scalar body orphaned. For a skill whose
frontmatter declares:

    description: |
      Run quality gates ... use map-plan or map-efficient.

patching produced:

    description: "Run quality gates ... map-efficient.\n"
      Run quality gates ... map-efficient.    ← orphaned, indented, now invalid

That is malformed YAML, so Claude Code failed to register the skill in the seeded
candidate tree — it never auto-triggered, and EVERY eval cell (including the iter-0
baseline) recorded a false non-trigger. The optimizer would then "improve" a
description that was actually fine, against an all-zero baseline. It escaped
notice because PR #159 applied only map-plan (single-line description); all other
map-* skills use `description: |` block scalars and were silently broken.

Fix: when the description value is a block scalar (`|`/`>`) or any indented
multi-line value, also consume the continuation lines (indented deeper than the
key) before substituting the single double-quoted replacement line. Preserves the
key's indentation and every sibling key/body line.

Validated: candidate-seeding path now triggers map-check (was None). Two regression
tests (block `|` and folded `>`) assert the patched frontmatter re-parses as valid
YAML with exactly the new description and no orphaned body. make check green (2273).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(skill-eval): --model/--runs flags + model-tier trigger experiment

Adds `--model <alias>` and `--runs N` to `mapify skill-eval run` (and a `model`
param on ClaudeSubprocessDispatcher). `--model` pins the claude -p model tier
(haiku/sonnet/opus); default omits the flag → CLI session model (unchanged
behaviour). `--runs` averages out single-pass noise when comparing models.

Motivated by Murin 2026 (arXiv:2606.05970): "model choice dominates prompt
phrasing; a larger model redistributes agreement rather than uniformly raising
it." Tested the analog for skill trigger ROUTING (map-check, n=9/model):

  haiku 6/9 (67%, 26s)  ·  sonnet 7/9 (78%, 51s)  ·  opus 6/9 (67%, 51s)

Finding: model tier does NOT reliably improve trigger routing — Opus ties Haiku
and trails Sonnet; the spread is within noise and cell-by-cell it's
redistribution, not improvement (exactly Murin's pattern). One positive prompt
was missed by ALL tiers — a description gap no model size fixes. The lever for
trigger accuracy is the `description:` (the optimizer sweep), not the model;
execution-quality tiering is a separate, untested question. Full write-up in
docs/model-tier-trigger-experiment.md. 2 regression tests for the new flags.
make check green (2274 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(skill-eval): firm-up model-tier results revise the pilot conclusion

3 passes on map-check (n=27) + map-explain + map-task (n=9 each). The pilot's
single-pass n=9 (haiku 67% ≈ opus 67%) was noise. Authoritative matrix:

  skill         haiku    sonnet   opus
  map-check     59%      89%      78%
  map-explain   33%      33%      44%
  map-task      33%      67%      78%
  overall       49%      73%      71%

Revised findings: (1) model tier DOES matter — Haiku is consistently weakest
(−24pp vs Sonnet), not the free lunch the pilot suggested; (2) "bigger is better"
does NOT hold — Opus (71%) ties/trails Sonnet (73%) and they redistribute across
skills (Murin's exact pattern), so Opus buys nothing over Sonnet for routing;
(3) the description is the ceiling — map-explain caps at 33-44% across all tiers.
Recommendation for routing: Sonnet (sweet spot); the lever is the description
(optimizer sweep), not model size.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(whole-skill): model-lever knobs + a discriminating outcome fixture

Foundation for optimizing skill OUTCOME quality (not trigger descriptions),
covering the model lever and a fixture where quality actually varies.

spike_runner.py:
- --agent-model AGENT=MODEL (repeatable) rewrites a seeded agent's `model:`
  frontmatter — the precise EXECUTION-quality lever (the actor writes the code,
  so its model is the code-quality lever; sub-agents use their own model:, not
  the orchestrator's --model).
- --orchestrator-model passes --model to the top-level claude -p running the body.

Fixture map_task_semver:
- Existing whole_skill fixtures (scope/blocker traps) all score QUALITY=1.0 — they
  can't discriminate code quality. This one is genuinely non-trivial: semver 2.0.0
  compare() with pre-release precedence, numeric-vs-lexical identifier comparison,
  and build-metadata handling. A naive implementation passes the easy tests but
  fails the edge cases, so the test gate (task_pass) and QUALITY VARY with the lever.
- Validated: stub → 12 failed; a correct reference impl → 12 passed; orchestrator
  accepts the hand-crafted plan+blueprint (resume_single_subtask ST-001 → RESEARCH).

make check green (2274 passed); fixture excluded from gate per whole_skill rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): model lever on OUTCOME is absorbed by the gate (semver fixture)

Sweep on the discriminating map_task_semver fixture, full /map-task with
--agent-model actor=haiku|sonnet|opus (×2): all 6 runs task_pass=True (correct
semver). The one QUALITY=0.5 is a judge API-529, not a code failure. The test
gate + MONITOR retry loop drive even haiku to passing, and haiku is not slower.

Conclusion (execution-data-backed): outcome levers rank CONTRACT/GATES (#3) ≫
MODEL (#2) ≫ PROSE (#1). Invest in validation_criteria / validators / test
coverage to lift skill outcome quality — not actor model or prose. Model may
still matter on weakly-gated tasks (untested); QUALITY saturates on gated tasks
so a finer signal needs retry_count logging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(whole-skill): capture serial retry_count in the outcome harness

On a well-gated task QUALITY saturates (every model tier passes the test gate),
so the model effect hides in HOW MANY actor retries the MONITOR loop needed.
spike_runner now reads retry_count / clean_retry_count / contaminated_retry_count
from the run's step_state.json (before temp cleanup) and records + prints them,
so "passes, but only after more iterations" is visible in future model/prompt
sweeps. Addresses the method caveat from the semver model sweep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(whole-skill): weakly-gated fixture + hidden-suite scoring (model-effect probe)

To measure the model effect that a strong gate hides, add a WEAKLY-gated fixture
and post-run hidden scoring:

- spike_runner: run_hidden_tests() injects a comprehensive suite the workflow
  never saw (manifest hidden_test_src→dest + hidden_test_cmd) and records
  hidden_pass / n_passed / n_failed — the deterministic 'true code quality'
  signal, no judge noise.
- Fixture map_task_semver_weakgate: the workflow sees ONLY a thin 3-test gate
  (which a naive impl passes); the hidden 8-edge-case suite scores what the gate
  did not enforce.

Validated end-to-end (no quota): STUB → weak-gate fail; NAIVE → weak-gate PASS
but hidden 2p/6f; CORRECT → both pass. So hidden_pass discriminates code quality
that the weak gate misses — the probe for 'does the actor model matter when the
gate can't rescue a weak implementation'. make check green (2274).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(whole-skill): vague-contract fixture to isolate the model lever

map_task_semver_vague = weak gate + VAGUE aag_contract (only "compare two semver
strings, return -1/0/1" — the precedence rules are NOT spelled out), so the actor
must rely on its own semver knowledge. Reuses the weak basic gate + hidden
8-edge-case suite. This is the one configuration where the actor model COULD
matter for outcome (vague spec the gate can't rescue); the prior fixtures kept the
contract complete and found model-insensitivity. Orchestrator-accepted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): outcome is invariant to model/gate/contract within competence

Two more outcome configs (weak gate; weak gate + vague contract), hidden
8-edge-case suite, actor=haiku|sonnet|opus ×2 — all 18 runs produced correct
semver (retries=0). Even a vague contract + weak gate + haiku gets it right
first try (semver is within every tier's competence).

Definitive outcome finding: for a task within model competence, outcome quality
is invariant to model tier, test-gate strength, and contract completeness. The
execution-model lever only bites ABOVE the weak model's competence (threshold not
reached here). Evidence-based outcome-lever ranking: contract/gates ≫ model
(only beyond-competence) ≫ prose. Invest in contract + mechanical gates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(whole-skill): capability-discriminating calc fixture (expression evaluator)

Web research (EvalPlus / HumanEval-Pro / arXiv 2511.04355, 2510.13908) shows the
model-capability gap appears on EDGE-CASE-sensitive code, not isolated functions
like semver. The canonical discriminator is an arithmetic expression evaluator —
three traps that violate left-to-right intuition: right-associative ** (2**3**2),
** binding tighter than unary minus (-2**2 == -4 in Python), overloaded unary
minus, negative exponents, error contract.

map_task_calc: weak gate = 3 trivial expressions (a naive left-assoc evaluator
passes); hidden 13-case suite probes the traps. Validated: STUB → gate fail;
NAIVE (left-assoc **, tight unary) → weak gate PASS but hidden 9p/3f; CORRECT →
12p/0f. Contract states the rules (right-assoc, ** tighter than unary) WITHOUT
leaking worked answers, so it tests implementation competence. Probe for the
execution-model threshold the semver fixtures were too easy to reach.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(whole-skill): vague-contract calc fixture (decisive model-threshold probe)

map_task_calc_vague = the calc evaluator with a VAGUE contract: it says only
"standard precedence" and does NOT spell out right-associative ** or unary rules.
So passing 2**3**2==512 requires the model to KNOW right-associativity from
training, not read it from the spec. Hidden suite keeps only language-unambiguous
cases (drops -2**2). Validated: a naive left-assoc impl passes the weak gate but
fails hidden (9p/2f). If haiku fails these while sonnet/opus pass, the execution-
model competence threshold is finally located; if haiku still passes, haiku 4.5
is genuinely competent on self-contained subtasks and the model is not the lever.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): capability probe — haiku matches bigger models on subtasks

Built the canonical edge-case discriminator (arithmetic expression evaluator:
right-assoc **, ** tighter than unary minus, negative exponent, errors) per web
research. Sweeps: FULL-contract calc → haiku 12/12; VAGUE-contract calc (rules
NOT stated) → haiku 11/11,11/11, sonnet 10/11,11/11, opus 11/11,11/11. Even on
the edge-case-hard task with a vague contract, haiku 4.5 matched/edged the bigger
models (the one miss was sonnet's — noise). No bigger-is-better ordering.

Why: the tier gap lives on multi-step/multi-file/concurrency tasks, not single
self-contained functions — and map-task decomposes into well-scoped one-file
subtasks, the regime where haiku is competent. Within map-task's granularity the
model is not the outcome lever (actor can run on haiku — cost win). Final ranking:
contract/gates ≫ model (only beyond-competence) ≫ prose.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(whole-skill): acceptEdits in outcome harness + use orchestrator-model

Two harness corrections after verifying transcripts (council-flagged confounds):

1. The real model lever is the ORCHESTRATOR model: in headless `claude -p`,
   map-task runs INLINE on the session model and does NOT spawn the actor/monitor
   sub-agents (0 Task dispatches in the transcript), so the earlier
   `--agent-model actor=...` override was a no-op — every "tier" actually ran on
   claude-opus-4-8. Use `--orchestrator-model` (= claude -p --model) to vary the
   model that does the work.

2. Add `--permission-mode acceptEdits` to the run: in headless default mode,
   file-edit permission prompts auto-deny, and weaker/less-agentic models stall
   ("I need permission to edit") — observed haiku hitting 4 perm-denials and
   giving up (0/11) while opus wrote freely (and even dispatched sub-agents).
   That is a permission/agency artifact, not code quality. With acceptEdits,
   haiku writes its code and passes 10/11 — confirming the gap on a well-scoped
   subtask is small. NOT a full bypass: only edits are auto-accepted; the temp is
   an isolated throwaway.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): CORRECTED model-outcome result (two confounds removed)

Transcript verification (prompted by llm-council) exposed two confounds that
invalidated the earlier "model doesn't matter for outcome": (1) --agent-model was
a no-op (headless map-task runs inline on the orchestrator/session model, 0
sub-agent dispatches — all sweeps were opus-4-8); (2) permission stalls made haiku
return 0/11 despite knowing the algorithm. Fixes: --orchestrator-model +
--permission-mode acceptEdits.

CLEAN result (calc_vague, hidden /11, n=4): haiku 9.8/11 (variable), sonnet
10.5/11, opus 10.7/11. Modest but consistent gradient opus ≥ sonnet > haiku;
haiku more variable on hard edge cases (~1 test / ~8% gap on a single subtask),
plus a separate agency gap. Earlier "model irrelevant for outcome" RETRACTED (it
was opus measured 3x). Methodology lesson: verify the manipulation reached the
transcript before trusting an agentic-harness result.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): DECISIVE — actor model matters for code (direct-invocation method)

Cleanest measurement (read the agent prompt, call claude -p with that prompt +
task + chosen model; no orchestrator/dispatch). calc_vague, hidden /11, n=3:
haiku 8/10/10 (mean 9.3, never perfect, drops edge cases), sonnet 11/11/11,
opus 11/11 (+1 broken run). The ACTOR model materially affects code quality —
haiku systematically worse; use sonnet+ for actor. The two confounds (no-op
--agent-model, permission-stall) had hidden this; "haiku ≈ opus" RETRACTED for
the actor role.

Probes also showed: headless skills don't auto-dispatch sub-agents (run inline),
and even forced dispatches run on the session model — so per-agent model: is inert
in headless. Architectural fix: orchestrate agents by direct claude -p
--append-system-prompt <agent.md> --model <tier> so per-agent models actually take
effect. Then which-agent/model-per-skill is deterministic. Prototype: .map/actor_probe.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(actor): cut 247 lines of worked examples (A/B-validated, -23%)

The actor prompt carried a <Actor_Reference_Examples> block — 4 full worked
implementations (~247 lines, ~22% of the prompt). The output structure is already
specified earlier ("Required Output Structure"), so the examples were illustrative
bloat loaded on every actor invocation.

A/B test (direct actor-prompt invocation, calc_vague hidden /11, n=4):
  A uncut (1095L): haiku 9.3, sonnet 9.5
  B cut   (846L):  haiku 10.0, sonnet 11.0
B >= A on both models — the cut does not regress (if anything helps), at -23%
prompt size / ~7.7KB saved per invocation. Validated via .map/actor_probe.py
(renders the agent template, runs it as the system prompt with a chosen model,
scores against a hidden edge-case suite). make check green (2274), render
byte-identical across all trees.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): A/B agent polish — actor examples cuttable, monitor's are not

Built per-agent A/B harnesses (render agent template -> claude -p
--append-system-prompt --model -> score). Actor: cut 247-line examples block,
A/B B>=A (haiku 9.3->10.0, sonnet 9.5->11.0), kept (d78acd5). Monitor: cut 134
lines of good/bad examples, A/B showed recall preserved (4/4) but clean-pass on
identical correct code A=6/6 vs B=4/6 -> reverted. The monitor's examples
calibrate its accept threshold (gate agent => false-positive calibration matters).
Lesson: "cut prompt examples" is safe for a generative agent, NOT for a gate agent;
each agent needs its own A/B gate, and harness coverage bounds what's cuttable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(synthesizer): cut 95 lines of Step-7 strategy worked-examples (A/B-validated)

The synthesizer's Step 7 carried two `**Example**:` worked-code blocks
(base_enhance / fresh_generation, ~95 lines) illustrating strategies whose
numbered-step LOGIC is already stated. A capable model applies the steps without
the example code.

A/B (built .map/synth_probe.py: feed 3 variant calc solutions — 1 correct + 2
buggy — the synthesizer must extract correct decisions and WRITE unified code,
scored by the calc hidden suite, n=3):
  A (HEAD, 1161L): 33/33 hidden  ·  B (cut, 1066L): 33/33 hidden  -> B == A
Non-regressive, -95 lines (-8%). Confirms the rule from the actor cut: pure
worked-examples are cuttable in GENERATIVE agents; guidance/patterns and
gate-calibration examples are NOT (monitor + actor-patterns cuts were A/B-rejected
and reverted). make check green (2274), render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): agent-polish validated rule (2 wins, 2 rejects) + coverage limit

Worked-examples are cuttable in generative agents (actor -247L, synthesizer -95L;
both A/B B>=A) but NOT in gate agents or for guidance/patterns (monitor, actor
error-patterns; both A/B-rejected, reverted). Remaining generative agents
(task-decomposer/predictor/reflector/research-agent) output fuzzy artifacts with
no clean deterministic gate, so further cuts can't satisfy the A/B rule without
building lower-confidence gates first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(task-decomposer): cut 151-line REFERENCE EXAMPLES (A/B-validated, -14%)

The decomposer carried a "## REFERENCE EXAMPLES" section (Example A: a full 139-line
worked CRUD blueprint) — pure illustration; the JSON Schema definition (separate
section) stays. A capable model produces a valid blueprint from the schema + 5-phase
process without the worked example.

A/B (built .map/decomp_probe.py: feed a known 5-part feature, score blueprint on
schema-validity + requirement-coverage + acyclic deps + sane count; opus, n=3):
  A (HEAD, 1078L): 3/3 PASS (coverage 5/5)  ·  B (cut, 927L): 3/3 PASS (coverage 5/5)
B == A, non-regressive, -151 lines (-14%). Third generative-agent worked-example cut
to validate (after actor -247L, synthesizer -95L); gate-calibration/guidance cuts
remain A/B-rejected (monitor, actor-patterns). make check green, render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(predictor): cut 417-line <examples> block (A/B-validated, -21%)

The predictor carried an <examples> block (Examples 1-3: full worked impact
analyses, ~417 lines incl. 107L/91L/46L JSON outputs) — illustration; the JSON
Schema, risk rubric, and "Good vs Bad Predictions" guidance stay.

A/B (built .map/pred_probe.py: feed a known breaking + high-blast-radius signature
change across 8 callers; gate on the deterministic core — breaking_changes detected
AND affected components identified; sonnet, n=3):
  A (HEAD, 2003L): 3/3 PASS  ·  B (cut, 1586L): 3/3 PASS
B == A, non-regressive, -417 lines (-21%, the largest cut). Fourth generative-agent
worked-example cut validated (actor -247, synthesizer -95, decomposer -151, predictor
-417 = 910 lines total). Gate covers breaking+affected detection (predictor's core);
risk-level (subjective) was rated identically by both versions. make check green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* polish(reflector): cut 139-line COMPLETE EXAMPLES (A/B-validated)

The reflector's "# COMPLETE EXAMPLES" section (4 worked lesson-extraction examples,
139 lines) is illustration; the lesson-format rules + decision frameworks stay.

A/B (built .map/reflect_probe.py: feed a clear SQL-injection failure outcome; gate on
whether the reflector extracts a lesson naming the root-cause concept
injection/parameterization; sonnet, n=3):
  A (HEAD): 3/3 PASS  ·  B (cut): 3/3 PASS
B == A, non-regressive, -139 lines. Fifth generative-agent worked-example cut.
TOTAL across 5 generative agents: actor -247, synthesizer -95, task-decomposer -151,
predictor -417, reflector -139 = 1049 lines of worked-example bloat removed, all
A/B-validated B>=A. Gate agents (monitor etc.) NOT cut — examples calibrate the
accept threshold (monitor A/B-rejected). (reflector gate is keyword-presence, fuzzier
than the code-output gates.) make check green, render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(whole-skill): comprehensive agent-polish tally (5 wins, 1049 lines, A/B-validated)

Swept all generative agents for worked-example/REFERENCE blocks, A/B-gated each:
KEPT (B>=A): actor -247, synthesizer -95, task-decomposer -151, predictor -417,
reflector -139 = 1049 lines. REJECTED (B<A, reverted): monitor examples (calibrate
accept threshold), actor error-patterns/decision-tree (scaffold weaker models). Gate
agents untouched (examples calibrate); research-agent lean. Final rule + reusable
harnesses (.map/*_probe.py) documented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant