
Harden skill-eval + A/B-polish MAP agents (−1049 lines of example bloat) #161

Merged: azalio merged 24 commits into main from gila-copper on Jun 6, 2026

azalio (Owner) commented Jun 6, 2026

Summary

Hardens the skill-eval engine, builds an A/B outcome harness for MAP agents, and uses it to polish all five generative agents by removing worked-example bloat — every cut gated on an empirical before/after measurement (kept only if B ≥ A).

24 commits across three themes:

1. skill-eval bug fixes

  • Dispatcher no longer mis-records executing skills as non-triggers (9a180ee): added --disallowed-tools, terminal-timeout handling, and transcript recovery by cwd→project-slug (handles the macOS /var → /private/var symlink and _ in mkdtemp names). Default timeout lowered from 120s to 90s.
  • Optimizer no longer corrupts block-scalar descriptions (20f70d7): _set_frontmatter_description now consumes block-scalar continuation lines instead of leaving orphaned indented body (was producing invalid YAML).
  • --model / --runs flags on skill_eval_run (e81123e) + model-tier trigger experiment.

2. Whole-skill outcome harness + model investigation

  • spike_runner.py: --agent-model, --orchestrator-model, --permission-mode acceptEdits, serial retry_count capture, run_hidden_tests scoring.
  • Discriminating fixtures (semver / weak-gate / vague-contract / calc / calc-vague).
  • Key correction: an early "model doesn't affect outcome" result was a confound: --agent-model is inert in headless mode (sub-agents are never dispatched), so the lever is --orchestrator-model. The re-run shows the actor model does matter for code quality (hidden-suite mean out of 11: haiku ~9.8 < sonnet ~10.5 ≤ opus ~10.7); outcome is robust on well-scoped subtasks within model competence.

3. Agent prompt polish (A/B-validated)

Removed 1049 lines of worked-example bloat from generative agents — each gated on a deterministic harness:

  agent            cut    verdict
  actor            −247   haiku 9.3→10.0, sonnet 9.5→11.0 ✅
  predictor        −417   3/3 = 3/3 (breaking+affected) ✅
  task-decomposer  −151   3/3 = 3/3 (schema+coverage+acyclic) ✅
  reflector        −139   3/3 = 3/3 (lesson extraction) ✅
  synthesizer      −95    33/33 = 33/33 ✅

Validated rule: pure worked-examples/REFERENCE blocks are safely removable from generative agents; guidance/patterns and gate-calibration examples are load-bearing. Two cuts were A/B-rejected and reverted (monitor examples calibrate the accept threshold; actor error-patterns scaffold weaker models). Gate agents (monitor/evaluator/debate-arbiter/final-verifier) left intact by design; research-agent already lean.

All agent edits were made in templates_src/**.jinja and propagated via make render-templates; the make check-render byte-identity check holds.

Note

  • Includes one validated map-debug description tweak (4 lines × 3 rendered trees) — the single applied result from the original description sweep.
  • Per-agent A/B harnesses are session prototypes under .map/*_probe.py (gitignored); methodology recorded in docs/whole-skill-optimization-notes.md. Productizing them is a follow-up.

Test plan

  • make check (lint + type + render parity + full suite) green in CI
  • pytest tests/test_template_render.py byte-identity
  • pytest tests/skills_eval/ (dispatcher timeout + optimizer regressions added)

🤖 Generated with Claude Code

azalio and others added 24 commits June 5, 2026 12:46
…as non-trigger

The description-optimizer dispatcher ran `claude -p <prompt>` to completion and
parsed the transcript for the first `Skill` tool_use. For EXECUTING skills this
corrupted the signal: a positive prompt for map-check runs the full `make check`
suite, and map-task/map-efficient dispatch sub-agents — each overran the 120s
per-call timeout, was treated as a transient failure, retried 3x, and finally
recorded as a FALSE non-trigger. Latent until now: PR #159 applied only map-plan
(fast text-only planner); the executing-skill eval-sets from PR #160 were never run.

Trigger detection only needs the first `Skill` tool_use (the activation decision),
which lands in the transcript early (~30s) — not the skill body executing. Fixes:

- `--disallowed-tools` (Bash/Edit/Write/NotebookEdit/Task/Agent/WebFetch/WebSearch)
  on the claude -p argv: the body cannot run slow/mutating/network work or spawn
  sub-agents (which would orphan children and burn quota after the parent is
  killed), while the skill still TRIGGERS. Read-only tools and `Skill` stay allowed.
- Timeout is now TERMINAL, not retried: a re-run after a 90s overrun behaves
  identically, so retrying only turned every executing-skill positive into ~3x
  wall-clock (267s observed).
- On timeout, recover the trigger from the transcript (located by cwd slug, since
  the kill yields no envelope/session_id) instead of a false non-trigger, with a
  settle-poll to defeat the flush/visibility race right at the kill instant.
- Robust cwd→project-dir slug (`_cwd_to_project_slug`, shared by both locators):
  replicates Claude Code's transform — replaces `/`, `.`, AND `_` (any non
  alnum/dash) with `-`. The old `replace("/","-").replace(".","-")` missed `_`,
  so any `tempfile.mkdtemp()` suffix containing `_` (e.g. `mapeval-s_u5zv32` →
  project dir `…-mapeval-s-u5zv32`) failed to locate — an intermittent false
  non-trigger that surfaced live in the sweep. Also handles macOS `/var`→
  `/private/var` (tries raw + resolved cwd) and a slugified name-glob fallback.
- Default per-call timeout 120s → 90s (well above the ~30s trigger latency).
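
The slug transform described in the list above can be sketched as follows. This is a minimal illustration, not the project's `_cwd_to_project_slug`: the function names, the `/tmp` parent directory, and the candidate ordering are assumptions made for the example; only the replacement rule (any character that is not alphanumeric or `-` becomes `-`) and the raw-plus-resolved cwd lookup come from the commit message.

```python
import os
import re


def cwd_to_project_slug(cwd: str) -> str:
    """Replace every character that is not alphanumeric or '-' with '-'."""
    return re.sub(r"[^A-Za-z0-9-]", "-", cwd)


def old_slug(cwd: str) -> str:
    """The earlier transform quoted in the commit message; it misses '_'."""
    return cwd.replace("/", "-").replace(".", "-")


def candidate_slugs(cwd: str) -> list[str]:
    """Raw cwd first, then the symlink-resolved cwd (e.g. /var -> /private/var)."""
    slugs: list[str] = []
    for path in (cwd, os.path.realpath(cwd)):
        slug = cwd_to_project_slug(path)
        if slug not in slugs:
            slugs.append(slug)
    return slugs


if __name__ == "__main__":
    # Hypothetical mkdtemp-style path; the suffix mirrors the example above.
    print(old_slug("/tmp/mapeval-s_u5zv32"))
    print(cwd_to_project_slug("/tmp/mapeval-s_u5zv32"))
```

With the old transform the underscore survives, so the computed name would not match a project directory in which it had been replaced by `-`.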

Validated live: map-check 75s (natural completion, triggers), map-efficient 90s
(single attempt, trigger recovered — was 267s/3-attempts), negatives ~31s non-trigger.
11 regression tests incl. the symlink-slug and underscore-slug cases.
make check green (2271 passed).
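
A rough sketch of the terminal-timeout behaviour with a settle-poll, under stated assumptions: `run_with_terminal_timeout`, `recover_trigger`, and the `settle_s` / `poll_s` parameters are hypothetical names, and the real dispatcher's return shape is not shown in this PR. The point is only the control flow: one attempt, no retry, and a short polling window after the kill to give the transcript time to become visible.

```python
import subprocess
import sys
import time


def run_with_terminal_timeout(argv, timeout_s, recover_trigger,
                              settle_s=2.0, poll_s=0.25):
    """Run argv once. On timeout, do not retry; poll recover_trigger() instead.

    recover_trigger is a caller-supplied function (hypothetical) that returns
    the triggered skill name found in the transcript, or None if none yet.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child has been killed; give the transcript a moment to flush.
        deadline = time.monotonic() + settle_s
        while True:
            skill = recover_trigger()
            if skill is not None or time.monotonic() >= deadline:
                return {"timed_out": True, "triggered_skill": skill}
            time.sleep(poll_s)
    return {"timed_out": False, "stdout": proc.stdout}


if __name__ == "__main__":
    # Stand-in for a slow `claude -p` call: a child that sleeps past the limit.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_terminal_timeout(slow, 0.3, lambda: "map-check"))
```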

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ns → skill never triggered

`_set_frontmatter_description` replaced only the single line starting with
`description:`, leaving any YAML block-scalar body orphaned. For a skill whose
frontmatter declares:

    description: |
      Run quality gates ... use map-plan or map-efficient.

patching produced:

    description: "Run quality gates ... map-efficient.\n"
      Run quality gates ... map-efficient.    ← orphaned, indented, now invalid

That is malformed YAML, so Claude Code failed to register the skill in the seeded
candidate tree — it never auto-triggered, and EVERY eval cell (including the iter-0
baseline) recorded a false non-trigger. The optimizer would then "improve" a
description that was actually fine, against an all-zero baseline. It escaped
notice because PR #159 applied only map-plan (single-line description); all other
map-* skills use `description: |` block scalars and were silently broken.

Fix: when the description value is a block scalar (`|`/`>`) or any indented
multi-line value, also consume the continuation lines (indented deeper than the
key) before substituting the single double-quoted replacement line. Preserves the
key's indentation and every sibling key/body line.

Validated: candidate-seeding path now triggers map-check (was None). Two regression
tests (block `|` and folded `>`) assert the patched frontmatter re-parses as valid
YAML with exactly the new description and no orphaned body. make check green (2273).
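
The fix can be illustrated with a small line-based sketch. It is not the project's `_set_frontmatter_description`; the function name, the use of `json.dumps` to produce the double-quoted value, and the handling of blank lines inside the scalar are assumptions chosen to keep the example short.

```python
import json


def _indent(line: str) -> int:
    return len(line) - len(line.lstrip(" "))


def set_frontmatter_description(frontmatter: str, new_description: str) -> str:
    """Replace the first `description:` value, including block-scalar bodies."""
    lines = frontmatter.split("\n")
    out, i, done = [], 0, False
    while i < len(lines):
        line = lines[i]
        if not done and line.lstrip(" ").startswith("description:"):
            key_indent = _indent(line)
            # json.dumps yields a double-quoted string, which YAML also accepts.
            out.append(" " * key_indent + "description: " + json.dumps(new_description))
            # Consume continuation lines: anything indented deeper than the key.
            # Blank lines are consumed only when more body follows them.
            j = end = i + 1
            while j < len(lines):
                if lines[j].strip() == "":
                    j += 1
                elif _indent(lines[j]) > key_indent:
                    j += 1
                    end = j
                else:
                    break
            i, done = end, True
            continue
        out.append(line)
        i += 1
    return "\n".join(out)
```

Consuming the deeper-indented lines is what prevents the orphaned body shown above; sibling keys at the same indentation are left untouched.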

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds `--model <alias>` and `--runs N` to `mapify skill-eval run` (and a `model`
param on ClaudeSubprocessDispatcher). `--model` pins the claude -p model tier
(haiku/sonnet/opus); default omits the flag → CLI session model (unchanged
behaviour). `--runs` averages out single-pass noise when comparing models.

Motivated by Murin 2026 (arXiv:2606.05970): "model choice dominates prompt
phrasing; a larger model redistributes agreement rather than uniformly raising
it." Tested the analog for skill trigger ROUTING (map-check, n=9/model):

  haiku 6/9 (67%, 26s)  ·  sonnet 7/9 (78%, 51s)  ·  opus 6/9 (67%, 51s)

Finding: model tier does NOT reliably improve trigger routing — Opus ties Haiku
and trails Sonnet; the spread is within noise and cell-by-cell it's
redistribution, not improvement (exactly Murin's pattern). One positive prompt
was missed by ALL tiers — a description gap no model size fixes. The lever for
trigger accuracy is the `description:` (the optimizer sweep), not the model;
execution-quality tiering is a separate, untested question. Full write-up in
docs/model-tier-trigger-experiment.md. 2 regression tests for the new flags.
make check green (2274 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3 passes on map-check (n=27) + map-explain + map-task (n=9 each). The pilot's
single-pass n=9 (haiku 67% ≈ opus 67%) was noise. Authoritative matrix:

  skill         haiku    sonnet   opus
  map-check     59%      89%      78%
  map-explain   33%      33%      44%
  map-task      33%      67%      78%
  overall       49%      73%      71%

Revised findings: (1) model tier DOES matter — Haiku is consistently weakest
(−24pp vs Sonnet), not the free lunch the pilot suggested; (2) "bigger is better"
does NOT hold — Opus (71%) ties/trails Sonnet (73%) and they redistribute across
skills (Murin's exact pattern), so Opus buys nothing over Sonnet for routing;
(3) the description is the ceiling — map-explain caps at 33-44% across all tiers.
Recommendation for routing: Sonnet (sweet spot); the lever is the description
(optimizer sweep), not model size.
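
The "overall" row is consistent with pooling all cells rather than averaging the three percentages. The check below is a reconstruction, assuming n=27 per model for map-check and n=9 per model for the other two skills, with hit counts recovered by rounding the published percentages; the commit message does not state the pooling method explicitly.

```python
# Cells per model for each skill (assumed from "n=27" and "n=9 each").
N = {"map-check": 27, "map-explain": 9, "map-task": 9}

# Published per-skill trigger rates.
RATE = {
    "haiku": {"map-check": 0.59, "map-explain": 0.33, "map-task": 0.33},
    "sonnet": {"map-check": 0.89, "map-explain": 0.33, "map-task": 0.67},
    "opus": {"map-check": 0.78, "map-explain": 0.44, "map-task": 0.78},
}


def pooled_percent(rates: dict) -> int:
    """Recover hit counts by rounding, then pool them across skills."""
    hits = sum(round(rates[skill] * N[skill]) for skill in N)
    return round(100 * hits / sum(N.values()))


POOLED = {model: pooled_percent(rates) for model, rates in RATE.items()}

if __name__ == "__main__":
    print(POOLED)
```

Because map-check carries three times as many cells, it dominates the pooled figure; an unweighted mean of the three percentages would give different numbers.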

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Foundation for optimizing skill OUTCOME quality (not trigger descriptions),
covering the model lever and a fixture where quality actually varies.

spike_runner.py:
- --agent-model AGENT=MODEL (repeatable) rewrites a seeded agent's `model:`
  frontmatter — the precise EXECUTION-quality lever (the actor writes the code,
  so its model is the code-quality lever; sub-agents use their own model:, not
  the orchestrator's --model).
- --orchestrator-model passes --model to the top-level claude -p running the body.

Fixture map_task_semver:
- Existing whole_skill fixtures (scope/blocker traps) all score QUALITY=1.0 — they
  can't discriminate code quality. This one is genuinely non-trivial: semver 2.0.0
  compare() with pre-release precedence, numeric-vs-lexical identifier comparison,
  and build-metadata handling. A naive implementation passes the easy tests but
  fails the edge cases, so the test gate (task_pass) and QUALITY VARY with the lever.
- Validated: stub → 12 failed; a correct reference impl → 12 passed; orchestrator
  accepts the hand-crafted plan+blueprint (resume_single_subtask ST-001 → RESEARCH).
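
For readers unfamiliar with why a naive implementation fails here, the following is a compact sketch of semver 2.0.0 precedence. It is not the fixture's reference implementation; the helper names are illustrative, input validation is omitted, and only the -1/0/1 return convention is taken from this PR.

```python
def _cmp(a, b) -> int:
    return (a > b) - (a < b)


def _parse(version: str):
    version = version.split("+", 1)[0]          # build metadata is ignored
    core, _, pre = version.partition("-")       # pre-release follows first '-'
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch), (pre.split(".") if pre else [])


def _compare_identifier(a: str, b: str) -> int:
    a_num, b_num = a.isdigit(), b.isdigit()
    if a_num and b_num:
        return _cmp(int(a), int(b))             # numeric, not lexical
    if a_num:
        return -1                               # numeric sorts below alphanumeric
    if b_num:
        return 1
    return _cmp(a, b)                           # ASCII order


def compare(a: str, b: str) -> int:
    core_a, pre_a = _parse(a)
    core_b, pre_b = _parse(b)
    if core_a != core_b:
        return _cmp(core_a, core_b)
    if not pre_a or not pre_b:
        # A release outranks any pre-release of the same core version.
        return _cmp(not pre_a, not pre_b)
    for x, y in zip(pre_a, pre_b):
        result = _compare_identifier(x, y)
        if result:
            return result
    return _cmp(len(pre_a), len(pre_b))         # more identifiers wins on a tie
```

A string comparison of `beta.2` against `beta.11` gets the order backwards, which is the kind of edge case the easy tests do not reach.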

make check green (2274 passed); fixture excluded from gate per whole_skill rules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mver fixture)

Sweep on the discriminating map_task_semver fixture, full /map-task with
--agent-model actor=haiku|sonnet|opus (×2): all 6 runs task_pass=True (correct
semver). The one QUALITY=0.5 is a judge API-529, not a code failure. The test
gate + MONITOR retry loop drive even haiku to passing, and haiku is not slower.

Conclusion (execution-data-backed): outcome levers rank CONTRACT/GATES (#3) ≫
MODEL (#2) ≫ PROSE (#1). Invest in validation_criteria / validators / test
coverage to lift skill outcome quality — not actor model or prose. Model may
still matter on weakly-gated tasks (untested); QUALITY saturates on gated tasks
so a finer signal needs retry_count logging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On a well-gated task QUALITY saturates (every model tier passes the test gate),
so the model effect hides in HOW MANY actor retries the MONITOR loop needed.
spike_runner now reads retry_count / clean_retry_count / contaminated_retry_count
from the run's step_state.json (before temp cleanup) and records + prints them,
so "passes, but only after more iterations" is visible in future model/prompt
sweeps. Addresses the method caveat from the semver model sweep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-effect probe)

To measure the model effect that a strong gate hides, add a WEAKLY-gated fixture
and post-run hidden scoring:

- spike_runner: run_hidden_tests() injects a comprehensive suite the workflow
  never saw (manifest hidden_test_src→dest + hidden_test_cmd) and records
  hidden_pass / n_passed / n_failed — the deterministic 'true code quality'
  signal, no judge noise.
- Fixture map_task_semver_weakgate: the workflow sees ONLY a thin 3-test gate
  (which a naive impl passes); the hidden 8-edge-case suite scores what the gate
  did not enforce.

Validated end-to-end (no quota): STUB → weak-gate fail; NAIVE → weak-gate PASS
but hidden 2p/6f; CORRECT → both pass. So hidden_pass discriminates code quality
that the weak gate misses — the probe for 'does the actor model matter when the
gate can't rescue a weak implementation'. make check green (2274).
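
A hedged sketch of post-run hidden scoring, assuming the hidden suite is run with pytest and that the manifest exposes `hidden_test_src`, `hidden_test_dest`, and `hidden_test_cmd` keys (the exact key for the destination is a guess from "hidden_test_src→dest"). The real `run_hidden_tests` in spike_runner.py may differ in signature and parsing.

```python
import re
import shlex
import shutil
import subprocess
from pathlib import Path


def parse_pytest_counts(output: str) -> tuple[int, int]:
    """Extract (n_passed, n_failed) from a pytest summary line."""
    passed = re.search(r"(\d+) passed", output)
    failed = re.search(r"(\d+) failed", output)
    return (int(passed.group(1)) if passed else 0,
            int(failed.group(1)) if failed else 0)


def run_hidden_tests(workdir: Path, manifest: dict) -> dict:
    """Inject the hidden suite after the workflow finished, then score it."""
    dest = workdir / manifest["hidden_test_dest"]       # assumed key name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(manifest["hidden_test_src"], dest)
    proc = subprocess.run(shlex.split(manifest["hidden_test_cmd"]),
                          cwd=workdir, capture_output=True, text=True)
    n_passed, n_failed = parse_pytest_counts(proc.stdout)
    return {
        "hidden_pass": proc.returncode == 0 and n_failed == 0,
        "n_passed": n_passed,
        "n_failed": n_failed,
    }
```

Copying the suite in only after the run is what keeps it hidden: the workflow under test never had the file in its working tree.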

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
map_task_semver_vague = weak gate + VAGUE aag_contract (only "compare two semver
strings, return -1/0/1" — the precedence rules are NOT spelled out), so the actor
must rely on its own semver knowledge. Reuses the weak basic gate + hidden
8-edge-case suite. This is the one configuration where the actor model COULD
matter for outcome (vague spec the gate can't rescue); the prior fixtures kept the
contract complete and found model-insensitivity. Orchestrator-accepted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… competence

Two more outcome configs (weak gate; weak gate + vague contract), hidden
8-edge-case suite, actor=haiku|sonnet|opus ×2 — all 18 runs produced correct
semver (retries=0). Even a vague contract + weak gate + haiku gets it right
first try (semver is within every tier's competence).

Definitive outcome finding: for a task within model competence, outcome quality
is invariant to model tier, test-gate strength, and contract completeness. The
execution-model lever only bites ABOVE the weak model's competence (threshold not
reached here). Evidence-based outcome-lever ranking: contract/gates ≫ model
(only beyond-competence) ≫ prose. Invest in contract + mechanical gates.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… evaluator)

Web research (EvalPlus / HumanEval-Pro / arXiv 2511.04355, 2510.13908) shows the
model-capability gap appears on EDGE-CASE-sensitive code, not isolated functions
like semver. The canonical discriminator is an arithmetic expression evaluator
whose traps violate left-to-right intuition: right-associative ** (2**3**2),
** binding tighter than unary minus (-2**2 == -4 in Python), overloaded unary
minus, negative exponents, and an error contract.

map_task_calc: weak gate = 3 trivial expressions (a naive left-assoc evaluator
passes); hidden 12-case suite probes the traps. Validated: STUB → gate fail;
NAIVE (left-assoc **, tight unary) → weak gate PASS but hidden 9p/3f; CORRECT →
12p/0f. Contract states the rules (right-assoc, ** tighter than unary) WITHOUT
leaking worked answers, so it tests implementation competence. Probe for the
execution-model threshold the semver fixtures were too easy to reach.
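
To make the traps concrete, here is a small recursive-descent evaluator that follows Python's grammar for `**` and unary minus. It is an illustrative sketch, not the fixture's reference solution: the names are hypothetical, numbers are parsed as floats, and the error contract (a plain `ValueError` for syntax errors) is an assumption.

```python
import re

_TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|(\*\*|[-+*/()]))")


def tokenize(src: str) -> list:
    tokens, pos, src = [], 0, src.rstrip()
    while pos < len(src):
        match = _TOKEN.match(src, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        number, op = match.groups()
        tokens.append(float(number) if number is not None else op)
        pos = match.end()
    return tokens


class _Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self):
        if self.i >= len(self.tokens):
            raise ValueError("unexpected end of input")
        self.i += 1
        return self.tokens[self.i - 1]

    def expr(self):                      # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            op, rhs = self.take(), self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):                      # term := unary (('*' | '/') unary)*
        value = self.unary()
        while self.peek() in ("*", "/"):
            op, rhs = self.take(), self.unary()
            value = value * rhs if op == "*" else value / rhs
        return value

    def unary(self):                     # unary := ('-' | '+') unary | power
        if self.peek() in ("-", "+"):
            op, value = self.take(), self.unary()
            return -value if op == "-" else value
        return self.power()

    def power(self):                     # power := atom ('**' unary)?
        base = self.atom()
        if self.peek() == "**":
            self.take()
            return base ** self.unary()  # recursing here gives right-assoc
        return base

    def atom(self):                      # atom := NUMBER | '(' expr ')'
        tok = self.take()
        if isinstance(tok, float):
            return tok
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise ValueError("expected ')'")
            return value
        raise ValueError(f"unexpected token {tok!r}")


def evaluate(src: str) -> float:
    parser = _Parser(tokenize(src))
    value = parser.expr()
    if parser.peek() is not None:
        raise ValueError("trailing input")
    return value
```

Placing `power` below `unary` in the grammar is what makes `-2**2` evaluate to -4, and letting the exponent be a `unary` is what allows `2**-1`. A left-to-right loop over `**` would instead return 64 for `2**3**2`.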

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…old probe)

map_task_calc_vague = the calc evaluator with a VAGUE contract: it says only
"standard precedence" and does NOT spell out right-associative ** or unary rules.
So passing 2**3**2==512 requires the model to KNOW right-associativity from
training, not read it from the spec. Hidden suite keeps only language-unambiguous
cases (drops -2**2). Validated: a naive left-assoc impl passes the weak gate but
fails hidden (9p/2f). If haiku fails these while sonnet/opus pass, the execution-
model competence threshold is finally located; if haiku still passes, haiku 4.5
is genuinely competent on self-contained subtasks and the model is not the lever.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…subtasks

Built the canonical edge-case discriminator (arithmetic expression evaluator:
right-assoc **, ** tighter than unary minus, negative exponent, errors) per web
research. Sweeps: FULL-contract calc → haiku 12/12; VAGUE-contract calc (rules
NOT stated) → haiku 11/11, 11/11; sonnet 10/11, 11/11; opus 11/11, 11/11. Even on
the edge-case-hard task with a vague contract, haiku 4.5 matched/edged the bigger
models (the one miss was sonnet's — noise). No bigger-is-better ordering.

Why: the tier gap lives on multi-step/multi-file/concurrency tasks, not single
self-contained functions — and map-task decomposes into well-scoped one-file
subtasks, the regime where haiku is competent. Within map-task's granularity the
model is not the outcome lever (actor can run on haiku — cost win). Final ranking:
contract/gates ≫ model (only beyond-competence) ≫ prose.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odel

Two harness corrections after verifying transcripts (council-flagged confounds):

1. The real model lever is the ORCHESTRATOR model: in headless `claude -p`,
   map-task runs INLINE on the session model and does NOT spawn the actor/monitor
   sub-agents (0 Task dispatches in the transcript), so the earlier
   `--agent-model actor=...` override was a no-op — every "tier" actually ran on
   claude-opus-4-8. Use `--orchestrator-model` (= claude -p --model) to vary the
   model that does the work.

2. Add `--permission-mode acceptEdits` to the run: in headless default mode,
   file-edit permission prompts auto-deny, and weaker/less-agentic models stall
   ("I need permission to edit") — observed haiku hitting 4 perm-denials and
   giving up (0/11) while opus wrote freely (and even dispatched sub-agents).
   That is a permission/agency artifact, not code quality. With acceptEdits,
   haiku writes its code and passes 10/11 — confirming the gap on a well-scoped
   subtask is small. NOT a full bypass: only edits are auto-accepted; the temp is
   an isolated throwaway.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ved)

Transcript verification (prompted by llm-council) exposed two confounds that
invalidated the earlier "model doesn't matter for outcome": (1) --agent-model was
a no-op (headless map-task runs inline on the orchestrator/session model, 0
sub-agent dispatches — all sweeps were opus-4-8); (2) permission stalls made haiku
return 0/11 despite knowing the algorithm. Fixes: --orchestrator-model +
--permission-mode acceptEdits.

CLEAN result (calc_vague, hidden /11, n=4): haiku 9.8/11 (variable), sonnet
10.5/11, opus 10.7/11. Modest but consistent gradient opus ≥ sonnet > haiku;
haiku more variable on hard edge cases (~1 test / ~8% gap on a single subtask),
plus a separate agency gap. Earlier "model irrelevant for outcome" RETRACTED (it
was opus measured 3x). Methodology lesson: verify the manipulation reached the
transcript before trusting an agentic-harness result.
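
One way to apply that methodology lesson is a transcript check run before any scoring. The sketch below assumes a JSONL transcript whose events carry a `message` object with a `model` field and a `content` list of blocks, some of type `tool_use` with a `name`; that schema is an assumption for illustration and should be checked against real transcripts. The function names are hypothetical.

```python
import json


def summarize_transcript(lines) -> dict:
    """Collect the models that produced messages and count Task dispatches."""
    models, task_dispatches = set(), 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # tolerate a partially flushed line
        message = event.get("message") if isinstance(event, dict) else None
        if not isinstance(message, dict):
            continue
        if message.get("model"):
            models.add(message["model"])
        content = message.get("content")
        if isinstance(content, list):
            for block in content:
                if (isinstance(block, dict)
                        and block.get("type") == "tool_use"
                        and block.get("name") == "Task"):
                    task_dispatches += 1
    return {"models": sorted(models), "task_dispatches": task_dispatches}


def manipulation_reached(summary: dict, expected: str) -> bool:
    """True only if every observed model id contains the expected tier name."""
    return bool(summary["models"]) and all(expected in m for m in summary["models"])
```

Under these assumptions, a sweep labelled "haiku" whose transcript shows only an opus model id and zero Task dispatches would fail the check instead of being scored.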

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…vocation method)

Cleanest measurement (read the agent prompt, call claude -p with that prompt +
task + chosen model; no orchestrator/dispatch). calc_vague, hidden /11, n=3:
haiku 8/10/10 (mean 9.3, never perfect, drops edge cases), sonnet 11/11/11,
opus 11/11 (+1 broken run). The ACTOR model materially affects code quality —
haiku systematically worse; use sonnet+ for actor. The two confounds (no-op
--agent-model, permission-stall) had hidden this; "haiku ≈ opus" RETRACTED for
the actor role.

Probes also showed: headless skills don't auto-dispatch sub-agents (run inline),
and even forced dispatches run on the session model — so per-agent model: is inert
in headless. Architectural fix: orchestrate agents by direct claude -p
--append-system-prompt <agent.md> --model <tier> so per-agent models actually take
effect. Which agent runs, and on which model, then becomes deterministic per skill.
Prototype: .map/actor_probe.py.
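
A minimal sketch of the direct-invocation argv, using only flags named in this PR (`-p`, `--append-system-prompt`, `--model`, `--permission-mode acceptEdits`). The function names are hypothetical, and passing the rendered agent file's text as the flag value is an assumed reading of `<agent.md>`; this is not the code in .map/actor_probe.py.

```python
from pathlib import Path


def build_agent_argv(agent_prompt: str, task: str, model: str) -> list[str]:
    """Build a claude -p command that runs one agent prompt on one model tier."""
    return [
        "claude", "-p", task,
        "--append-system-prompt", agent_prompt,
        "--model", model,
        # Auto-accept file edits so weaker models do not stall on prompts.
        "--permission-mode", "acceptEdits",
    ]


def build_agent_argv_from_file(agent_md: Path, task: str, model: str) -> list[str]:
    """Same as above, reading the rendered agent prompt from disk."""
    return build_agent_argv(agent_md.read_text(encoding="utf-8"), task, model)


if __name__ == "__main__":
    print(build_agent_argv("<rendered actor prompt>", "Implement compare()", "sonnet"))
```

Building the argv as a list keeps the prompt text out of shell quoting, and the model tier is set on the one process that does the work.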

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The actor prompt carried a <Actor_Reference_Examples> block — 4 full worked
implementations (~247 lines, ~22% of the prompt). The output structure is already
specified earlier ("Required Output Structure"), so the examples were illustrative
bloat loaded on every actor invocation.

A/B test (direct actor-prompt invocation, calc_vague hidden /11, n=4):
  A uncut (1095L): haiku 9.3, sonnet 9.5
  B cut   (846L):  haiku 10.0, sonnet 11.0
B >= A on both models — the cut does not regress (if anything helps), at -23%
prompt size / ~7.7KB saved per invocation. Validated via .map/actor_probe.py
(renders the agent template, runs it as the system prompt with a chosen model,
scores against a hidden edge-case suite). make check green (2274), render
byte-identical across all trees.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r's are not

Built per-agent A/B harnesses (render agent template -> claude -p
--append-system-prompt --model -> score). Actor: cut 247-line examples block,
A/B B>=A (haiku 9.3->10.0, sonnet 9.5->11.0), kept (d78acd5). Monitor: cut 134
lines of good/bad examples, A/B showed recall preserved (4/4) but clean-pass on
identical correct code A=6/6 vs B=4/6 -> reverted. The monitor's examples
calibrate its accept threshold (gate agent => false-positive calibration matters).
Lesson: "cut prompt examples" is safe for a generative agent, NOT for a gate agent;
each agent needs its own A/B gate, and harness coverage bounds what's cuttable.
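
The keep-or-revert rule can be written down as a one-line gate. The sketch below is an illustration with a hypothetical function name; the metric values are the ones reported in this commit message, with the monitor's fractions expressed as rates.

```python
def keep_cut(a: dict, b: dict) -> bool:
    """Keep the cut only if B is at least as good as A on every metric."""
    if a.keys() != b.keys():
        raise ValueError("A and B must report the same metrics")
    return all(b[name] >= a[name] for name in a)


# Actor: mean hidden-suite score per model, uncut (A) vs cut (B).
ACTOR_A = {"haiku": 9.3, "sonnet": 9.5}
ACTOR_B = {"haiku": 10.0, "sonnet": 11.0}

# Monitor: recall on bad code and clean-pass rate on identical correct code.
MONITOR_A = {"recall": 4 / 4, "clean_pass": 6 / 6}
MONITOR_B = {"recall": 4 / 4, "clean_pass": 4 / 6}

if __name__ == "__main__":
    print("actor cut kept:", keep_cut(ACTOR_A, ACTOR_B))
    print("monitor cut kept:", keep_cut(MONITOR_A, MONITOR_B))
```

Requiring every metric to hold is what catches the monitor case: recall alone would have passed the cut, while the clean-pass rate shows the regression.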

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(A/B-validated)

The synthesizer's Step 7 carried two `**Example**:` worked-code blocks
(base_enhance / fresh_generation, ~95 lines) illustrating strategies whose
numbered-step LOGIC is already stated. A capable model applies the steps without
the example code.

A/B (built .map/synth_probe.py: feed 3 variant calc solutions — 1 correct + 2
buggy — the synthesizer must extract correct decisions and WRITE unified code,
scored by the calc hidden suite, n=3):
  A (HEAD, 1161L): 33/33 hidden  ·  B (cut, 1066L): 33/33 hidden  -> B == A
Non-regressive, -95 lines (-8%). Confirms the rule from the actor cut: pure
worked-examples are cuttable in GENERATIVE agents; guidance/patterns and
gate-calibration examples are NOT (monitor + actor-patterns cuts were A/B-rejected
and reverted). make check green (2274), render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…coverage limit

Worked-examples are cuttable in generative agents (actor -247L, synthesizer -95L;
both A/B B>=A) but NOT in gate agents or for guidance/patterns (monitor, actor
error-patterns; both A/B-rejected, reverted). Remaining generative agents
(task-decomposer/predictor/reflector/research-agent) output fuzzy artifacts with
no clean deterministic gate, so further cuts can't satisfy the A/B rule without
building lower-confidence gates first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed, -14%)

The decomposer carried a "## REFERENCE EXAMPLES" section (Example A: a full 139-line
worked CRUD blueprint) — pure illustration; the JSON Schema definition (separate
section) stays. A capable model produces a valid blueprint from the schema + 5-phase
process without the worked example.

A/B (built .map/decomp_probe.py: feed a known 5-part feature, score blueprint on
schema-validity + requirement-coverage + acyclic deps + sane count; opus, n=3):
  A (HEAD, 1078L): 3/3 PASS (coverage 5/5)  ·  B (cut, 927L): 3/3 PASS (coverage 5/5)
B == A, non-regressive, -151 lines (-14%). Third generative-agent worked-example cut
to validate (after actor -247L, synthesizer -95L); gate-calibration/guidance cuts
remain A/B-rejected (monitor, actor-patterns). make check green, render byte-identical.
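
The acyclic-dependency part of that gate can be checked with Kahn's algorithm. The sketch assumes each subtask is a dict with an `id` and an optional `depends_on` list; those field names are hypothetical and the blueprint's actual JSON Schema is not shown in this PR.

```python
from collections import deque


def is_acyclic(subtasks: list[dict]) -> bool:
    """True if the dependency graph has no cycles and no unknown ids."""
    deps = {task["id"]: set(task.get("depends_on", [])) for task in subtasks}
    if any(dep not in deps for wanted in deps.values() for dep in wanted):
        return False                                  # dangling dependency
    remaining = {task_id: len(wanted) for task_id, wanted in deps.items()}
    dependents = {task_id: [] for task_id in deps}
    for task_id, wanted in deps.items():
        for dep in wanted:
            dependents[dep].append(task_id)
    ready = deque(task_id for task_id, count in remaining.items() if count == 0)
    visited = 0
    while ready:
        current = ready.popleft()
        visited += 1
        for nxt in dependents[current]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)
    # Any node never reaching zero unmet dependencies sits on a cycle.
    return visited == len(deps)
```

A blueprint such as ST-001 → ST-002 → ST-003 passes, while one in which ST-001 also depends on ST-003 fails because no subtask is ever ready.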

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The predictor carried an <examples> block (Examples 1-3: full worked impact
analyses, ~417 lines incl. 107L/91L/46L JSON outputs) — illustration; the JSON
Schema, risk rubric, and "Good vs Bad Predictions" guidance stay.

A/B (built .map/pred_probe.py: feed a known breaking + high-blast-radius signature
change across 8 callers; gate on the deterministic core — breaking_changes detected
AND affected components identified; sonnet, n=3):
  A (HEAD, 2003L): 3/3 PASS  ·  B (cut, 1586L): 3/3 PASS
B == A, non-regressive, -417 lines (-21%, the largest cut). Fourth generative-agent
worked-example cut validated (actor -247, synthesizer -95, decomposer -151, predictor
-417 = 910 lines total). Gate covers breaking+affected detection (predictor's core);
risk-level (subjective) was rated identically by both versions. make check green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The reflector's "# COMPLETE EXAMPLES" section (4 worked lesson-extraction examples,
139 lines) is illustration; the lesson-format rules + decision frameworks stay.

A/B (built .map/reflect_probe.py: feed a clear SQL-injection failure outcome; gate on
whether the reflector extracts a lesson naming the root-cause concept
injection/parameterization; sonnet, n=3):
  A (HEAD): 3/3 PASS  ·  B (cut): 3/3 PASS
B == A, non-regressive, -139 lines. Fifth generative-agent worked-example cut.
TOTAL across 5 generative agents: actor -247, synthesizer -95, task-decomposer -151,
predictor -417, reflector -139 = 1049 lines of worked-example bloat removed, all
A/B-validated B>=A. Gate agents (monitor etc.) NOT cut — examples calibrate the
accept threshold (monitor A/B-rejected). (reflector gate is keyword-presence, fuzzier
than the code-output gates.) make check green, render byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…es, A/B-validated)

Swept all generative agents for worked-example/REFERENCE blocks, A/B-gated each:
KEPT (B>=A): actor -247, synthesizer -95, task-decomposer -151, predictor -417,
reflector -139 = 1049 lines. REJECTED (B<A, reverted): monitor examples (calibrate
accept threshold), actor error-patterns/decision-tree (scaffold weaker models). Gate
agents untouched (examples calibrate); research-agent lean. Final rule + reusable
harnesses (.map/*_probe.py) documented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@azalio azalio merged commit 72f74c6 into main Jun 6, 2026
6 checks passed