Harden skill-eval + A/B-polish MAP agents (−1049 lines of example bloat) by azalio · Pull Request #161 · azalio/map-framework

azalio · 2026-06-06T20:00:36Z

Summary

Hardens the skill-eval engine, builds an A/B outcome harness for MAP agents, and uses it to polish all five generative agents by removing worked-example bloat — every cut gated on an empirical before/after measurement (kept only if B ≥ A).

24 commits across three themes:

1. skill-eval bug fixes

Dispatcher no longer mis-records executing skills as non-triggers (9a180ee): added --disallowed-tools, terminal-timeout handling, and transcript recovery by cwd→project-slug (handles macOS /var→/private/var and _ in mkdtemp names). Default timeout 120→90s.
Optimizer no longer corrupts block-scalar descriptions (20f70d7): _set_frontmatter_description now consumes block-scalar continuation lines instead of leaving orphaned indented body (was producing invalid YAML).
--model / --runs flags on skill_eval_run (e81123e) + model-tier trigger experiment.

2. Whole-skill outcome harness + model investigation

spike_runner.py: --agent-model, --orchestrator-model, --permission-mode acceptEdits, serial retry_count capture, run_hidden_tests scoring.
Discriminating fixtures (semver / weak-gate / vague-contract / calc / calc-vague).
Key correction: an early "model doesn't affect outcome" result was a confound — --agent-model is inert in headless (sub-agents never dispatched), so the lever is --orchestrator-model. Re-run shows actor model does matter for code quality (haiku ~9.8 < sonnet/opus ~10.7); outcome is robust on well-scoped subtasks within model competence.

3. Agent prompt polish (A/B-validated)

Removed 1049 lines of worked-example bloat from generative agents — each gated on a deterministic harness:

agent	cut	verdict
actor	−247	haiku 9.3→10.0, sonnet 9.5→11.0 ✅
predictor	−417	3/3 = 3/3 (breaking+affected) ✅
task-decomposer	−151	3/3 = 3/3 (schema+coverage+acyclic) ✅
reflector	−139	3/3 = 3/3 (lesson extraction) ✅
synthesizer	−95	33/33 = 33/33 ✅

Validated rule: pure worked-examples/REFERENCE blocks are safely removable from generative agents; guidance/patterns and gate-calibration examples are load-bearing. Two cuts were A/B-rejected and reverted (monitor examples calibrate the accept threshold; actor error-patterns scaffold weaker models). Gate agents (monitor/evaluator/debate-arbiter/final-verifier) left intact by design; research-agent already lean.

All agent edits made in templates_src/**.jinja and propagated via make render-templates; make check-render byte-identity holds.

Note

Includes one validated map-debug description tweak (4 lines × 3 rendered trees) — the single applied result from the original description sweep.
Per-agent A/B harnesses are session prototypes under .map/*_probe.py (gitignored); methodology recorded in docs/whole-skill-optimization-notes.md. Productizing them is a follow-up.

Test plan

make check (lint + type + render parity + full suite) green in CI
pytest tests/test_template_render.py byte-identity
pytest tests/skills_eval/ (dispatcher timeout + optimizer regressions added)

🤖 Generated with Claude Code

…as non-trigger The description-optimizer dispatcher ran `claude -p <prompt>` to completion and parsed the transcript for the first `Skill` tool_use. For EXECUTING skills this corrupted the signal: a positive prompt for map-check runs the full `make check` suite, and map-task/map-efficient dispatch sub-agents — each overran the 120s per-call timeout, was treated as a transient failure, retried 3x, and finally recorded as a FALSE non-trigger. Latent until now: PR #159 applied only map-plan (fast text-only planner); the executing-skill eval-sets from PR #160 were never run. Trigger detection only needs the first `Skill` tool_use (the activation decision), which lands in the transcript early (~30s) — not the skill body executing. Fixes: - `--disallowed-tools` (Bash/Edit/Write/NotebookEdit/Task/Agent/WebFetch/WebSearch) on the claude -p argv: the body cannot run slow/mutating/network work or spawn sub-agents (which would orphan children and burn quota after the parent is killed), while the skill still TRIGGERS. Read-only tools and `Skill` stay allowed. - Timeout is now TERMINAL, not retried: a 90s overrun re-run behaves identically — retrying turned every executing-skill positive into ~3x wall-clock (267s observed). - On timeout, recover the trigger from the transcript (located by cwd slug, since the kill yields no envelope/session_id) instead of a false non-trigger, with a settle-poll to defeat the flush/visibility race right at the kill instant. - Robust cwd→project-dir slug (`_cwd_to_project_slug`, shared by both locators): replicates Claude Code's transform — replaces `/`, `.`, AND `_` (any non alnum/dash) with `-`. The old `replace("/","-").replace(".","-")` missed `_`, so any `tempfile.mkdtemp()` suffix containing `_` (e.g. `mapeval-s_u5zv32` → project dir `…-mapeval-s-u5zv32`) failed to locate — an intermittent false non-trigger that surfaced live in the sweep. Also handles macOS `/var`→ `/private/var` (tries raw + resolved cwd) and a slugified name-glob fallback. - Default per-call timeout 120s → 90s (well above the ~30s trigger latency). Validated live: map-check 75s (natural completion, triggers), map-efficient 90s (single attempt, trigger recovered — was 267s/3-attempts), negatives ~31s non-trigger. 11 regression tests incl. the symlink-slug and underscore-slug cases. make check green (2271 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ns → skill never triggered `_set_frontmatter_description` replaced only the single line starting with `description:`, leaving any YAML block-scalar body orphaned. For a skill whose frontmatter declares: description: | Run quality gates ... use map-plan or map-efficient. patching produced: description: "Run quality gates ... map-efficient.\n" Run quality gates ... map-efficient. ← orphaned, indented, now invalid That is malformed YAML, so Claude Code failed to register the skill in the seeded candidate tree — it never auto-triggered, and EVERY eval cell (including the iter-0 baseline) recorded a false non-trigger. The optimizer would then "improve" a description that was actually fine, against an all-zero baseline. It escaped notice because PR #159 applied only map-plan (single-line description); all other map-* skills use `description: |` block scalars and were silently broken. Fix: when the description value is a block scalar (`|`/`>`) or any indented multi-line value, also consume the continuation lines (indented deeper than the key) before substituting the single double-quoted replacement line. Preserves the key's indentation and every sibling key/body line. Validated: candidate-seeding path now triggers map-check (was None). Two regression tests (block `|` and folded `>`) assert the patched frontmatter re-parses as valid YAML with exactly the new description and no orphaned body. make check green (2273). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds `--model <alias>` and `--runs N` to `mapify skill-eval run` (and a `model` param on ClaudeSubprocessDispatcher). `--model` pins the claude -p model tier (haiku/sonnet/opus); default omits the flag → CLI session model (unchanged behaviour). `--runs` averages out single-pass noise when comparing models. Motivated by Murin 2026 (arXiv:2606.05970): "model choice dominates prompt phrasing; a larger model redistributes agreement rather than uniformly raising it." Tested the analog for skill trigger ROUTING (map-check, n=9/model): haiku 6/9 (67%, 26s) · sonnet 7/9 (78%, 51s) · opus 6/9 (67%, 51s) Finding: model tier does NOT reliably improve trigger routing — Opus ties Haiku and trails Sonnet; the spread is within noise and cell-by-cell it's redistribution, not improvement (exactly Murin's pattern). One positive prompt was missed by ALL tiers — a description gap no model size fixes. The lever for trigger accuracy is the `description:` (the optimizer sweep), not the model; execution-quality tiering is a separate, untested question. Full write-up in docs/model-tier-trigger-experiment.md. 2 regression tests for the new flags. make check green (2274 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

3 passes on map-check (n=27) + map-explain + map-task (n=9 each). The pilot's single-pass n=9 (haiku 67% ≈ opus 67%) was noise. Authoritative matrix: skill haiku sonnet opus map-check 59% 89% 78% map-explain 33% 33% 44% map-task 33% 67% 78% overall 49% 73% 71% Revised findings: (1) model tier DOES matter — Haiku is consistently weakest (−24pp vs Sonnet), not the free lunch the pilot suggested; (2) "bigger is better" does NOT hold — Opus (71%) ties/trails Sonnet (73%) and they redistribute across skills (Murin's exact pattern), so Opus buys nothing over Sonnet for routing; (3) the description is the ceiling — map-explain caps at 33-44% across all tiers. Recommendation for routing: Sonnet (sweet spot); the lever is the description (optimizer sweep), not model size. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Foundation for optimizing skill OUTCOME quality (not trigger descriptions), covering the model lever and a fixture where quality actually varies. spike_runner.py: - --agent-model AGENT=MODEL (repeatable) rewrites a seeded agent's `model:` frontmatter — the precise EXECUTION-quality lever (the actor writes the code, so its model is the code-quality lever; sub-agents use their own model:, not the orchestrator's --model). - --orchestrator-model passes --model to the top-level claude -p running the body. Fixture map_task_semver: - Existing whole_skill fixtures (scope/blocker traps) all score QUALITY=1.0 — they can't discriminate code quality. This one is genuinely non-trivial: semver 2.0.0 compare() with pre-release precedence, numeric-vs-lexical identifier comparison, and build-metadata handling. A naive implementation passes the easy tests but fails the edge cases, so the test gate (task_pass) and QUALITY VARY with the lever. - Validated: stub → 12 failed; a correct reference impl → 12 passed; orchestrator accepts the hand-crafted plan+blueprint (resume_single_subtask ST-001 → RESEARCH). make check green (2274 passed); fixture excluded from gate per whole_skill rules. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…mver fixture) Sweep on the discriminating map_task_semver fixture, full /map-task with --agent-model actor=haiku|sonnet|opus (×2): all 6 runs task_pass=True (correct semver). The one QUALITY=0.5 is a judge API-529, not a code failure. The test gate + MONITOR retry loop drive even haiku to passing, and haiku is not slower. Conclusion (execution-data-backed): outcome levers rank CONTRACT/GATES (#3) ≫ MODEL (#2) ≫ PROSE (#1). Invest in validation_criteria / validators / test coverage to lift skill outcome quality — not actor model or prose. Model may still matter on weakly-gated tasks (untested); QUALITY saturates on gated tasks so a finer signal needs retry_count logging. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

On a well-gated task QUALITY saturates (every model tier passes the test gate), so the model effect hides in HOW MANY actor retries the MONITOR loop needed. spike_runner now reads retry_count / clean_retry_count / contaminated_retry_count from the run's step_state.json (before temp cleanup) and records + prints them, so "passes, but only after more iterations" is visible in future model/prompt sweeps. Addresses the method caveat from the semver model sweep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-effect probe) To measure the model effect that a strong gate hides, add a WEAKLY-gated fixture and post-run hidden scoring: - spike_runner: run_hidden_tests() injects a comprehensive suite the workflow never saw (manifest hidden_test_src→dest + hidden_test_cmd) and records hidden_pass / n_passed / n_failed — the deterministic 'true code quality' signal, no judge noise. - Fixture map_task_semver_weakgate: the workflow sees ONLY a thin 3-test gate (which a naive impl passes); the hidden 8-edge-case suite scores what the gate did not enforce. Validated end-to-end (no quota): STUB → weak-gate fail; NAIVE → weak-gate PASS but hidden 2p/6f; CORRECT → both pass. So hidden_pass discriminates code quality that the weak gate misses — the probe for 'does the actor model matter when the gate can't rescue a weak implementation'. make check green (2274). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

map_task_semver_vague = weak gate + VAGUE aag_contract (only "compare two semver strings, return -1/0/1" — the precedence rules are NOT spelled out), so the actor must rely on its own semver knowledge. Reuses the weak basic gate + hidden 8-edge-case suite. This is the one configuration where the actor model COULD matter for outcome (vague spec the gate can't rescue); the prior fixtures kept the contract complete and found model-insensitivity. Orchestrator-accepted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… competence Two more outcome configs (weak gate; weak gate + vague contract), hidden 8-edge-case suite, actor=haiku|sonnet|opus ×2 — all 18 runs produced correct semver (retries=0). Even a vague contract + weak gate + haiku gets it right first try (semver is within every tier's competence). Definitive outcome finding: for a task within model competence, outcome quality is invariant to model tier, test-gate strength, and contract completeness. The execution-model lever only bites ABOVE the weak model's competence (threshold not reached here). Evidence-based outcome-lever ranking: contract/gates ≫ model (only beyond-competence) ≫ prose. Invest in contract + mechanical gates. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… evaluator) Web research (EvalPlus / HumanEval-Pro / arXiv 2511.04355, 2510.13908) shows the model-capability gap appears on EDGE-CASE-sensitive code, not isolated functions like semver. The canonical discriminator is an arithmetic expression evaluator — three traps that violate left-to-right intuition: right-associative ** (2**3**2), ** binding tighter than unary minus (-2**2 == -4 in Python), overloaded unary minus, negative exponents, error contract. map_task_calc: weak gate = 3 trivial expressions (a naive left-assoc evaluator passes); hidden 13-case suite probes the traps. Validated: STUB → gate fail; NAIVE (left-assoc **, tight unary) → weak gate PASS but hidden 9p/3f; CORRECT → 12p/0f. Contract states the rules (right-assoc, ** tighter than unary) WITHOUT leaking worked answers, so it tests implementation competence. Probe for the execution-model threshold the semver fixtures were too easy to reach. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…old probe) map_task_calc_vague = the calc evaluator with a VAGUE contract: it says only "standard precedence" and does NOT spell out right-associative ** or unary rules. So passing 2**3**2==512 requires the model to KNOW right-associativity from training, not read it from the spec. Hidden suite keeps only language-unambiguous cases (drops -2**2). Validated: a naive left-assoc impl passes the weak gate but fails hidden (9p/2f). If haiku fails these while sonnet/opus pass, the execution- model competence threshold is finally located; if haiku still passes, haiku 4.5 is genuinely competent on self-contained subtasks and the model is not the lever. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…subtasks Built the canonical edge-case discriminator (arithmetic expression evaluator: right-assoc **, ** tighter than unary minus, negative exponent, errors) per web research. Sweeps: FULL-contract calc → haiku 12/12; VAGUE-contract calc (rules NOT stated) → haiku 11/11,11/11, sonnet 10/11,11/11, opus 11/11,11/11. Even on the edge-case-hard task with a vague contract, haiku 4.5 matched/edged the bigger models (the one miss was sonnet's — noise). No bigger-is-better ordering. Why: the tier gap lives on multi-step/multi-file/concurrency tasks, not single self-contained functions — and map-task decomposes into well-scoped one-file subtasks, the regime where haiku is competent. Within map-task's granularity the model is not the outcome lever (actor can run on haiku — cost win). Final ranking: contract/gates ≫ model (only beyond-competence) ≫ prose. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…odel Two harness corrections after verifying transcripts (council-flagged confounds): 1. The real model lever is the ORCHESTRATOR model: in headless `claude -p`, map-task runs INLINE on the session model and does NOT spawn the actor/monitor sub-agents (0 Task dispatches in the transcript), so the earlier `--agent-model actor=...` override was a no-op — every "tier" actually ran on claude-opus-4-8. Use `--orchestrator-model` (= claude -p --model) to vary the model that does the work. 2. Add `--permission-mode acceptEdits` to the run: in headless default mode, file-edit permission prompts auto-deny, and weaker/less-agentic models stall ("I need permission to edit") — observed haiku hitting 4 perm-denials and giving up (0/11) while opus wrote freely (and even dispatched sub-agents). That is a permission/agency artifact, not code quality. With acceptEdits, haiku writes its code and passes 10/11 — confirming the gap on a well-scoped subtask is small. NOT a full bypass: only edits are auto-accepted; the temp is an isolated throwaway. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ved) Transcript verification (prompted by llm-council) exposed two confounds that invalidated the earlier "model doesn't matter for outcome": (1) --agent-model was a no-op (headless map-task runs inline on the orchestrator/session model, 0 sub-agent dispatches — all sweeps were opus-4-8); (2) permission stalls made haiku return 0/11 despite knowing the algorithm. Fixes: --orchestrator-model + --permission-mode acceptEdits. CLEAN result (calc_vague, hidden /11, n=4): haiku 9.8/11 (variable), sonnet 10.5/11, opus 10.7/11. Modest but consistent gradient opus ≥ sonnet > haiku; haiku more variable on hard edge cases (~1 test / ~8% gap on a single subtask), plus a separate agency gap. Earlier "model irrelevant for outcome" RETRACTED (it was opus measured 3x). Methodology lesson: verify the manipulation reached the transcript before trusting an agentic-harness result. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…vocation method) Cleanest measurement (read the agent prompt, call claude -p with that prompt + task + chosen model; no orchestrator/dispatch). calc_vague, hidden /11, n=3: haiku 8/10/10 (mean 9.3, never perfect, drops edge cases), sonnet 11/11/11, opus 11/11 (+1 broken run). The ACTOR model materially affects code quality — haiku systematically worse; use sonnet+ for actor. The two confounds (no-op --agent-model, permission-stall) had hidden this; "haiku ≈ opus" RETRACTED for the actor role. Probes also showed: headless skills don't auto-dispatch sub-agents (run inline), and even forced dispatches run on the session model — so per-agent model: is inert in headless. Architectural fix: orchestrate agents by direct claude -p --append-system-prompt <agent.md> --model <tier> so per-agent models actually take effect. Then which-agent/model-per-skill is deterministic. Prototype: .map/actor_probe.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The actor prompt carried a <Actor_Reference_Examples> block — 4 full worked implementations (~247 lines, ~22% of the prompt). The output structure is already specified earlier ("Required Output Structure"), so the examples were illustrative bloat loaded on every actor invocation. A/B test (direct actor-prompt invocation, calc_vague hidden /11, n=4): A uncut (1095L): haiku 9.3, sonnet 9.5 B cut (846L): haiku 10.0, sonnet 11.0 B >= A on both models — the cut does not regress (if anything helps), at -23% prompt size / ~7.7KB saved per invocation. Validated via .map/actor_probe.py (renders the agent template, runs it as the system prompt with a chosen model, scores against a hidden edge-case suite). make check green (2274), render byte-identical across all trees. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…r's are not Built per-agent A/B harnesses (render agent template -> claude -p --append-system-prompt --model -> score). Actor: cut 247-line examples block, A/B B>=A (haiku 9.3->10.0, sonnet 9.5->11.0), kept (d78acd5). Monitor: cut 134 lines of good/bad examples, A/B showed recall preserved (4/4) but clean-pass on identical correct code A=6/6 vs B=4/6 -> reverted. The monitor's examples calibrate its accept threshold (gate agent => false-positive calibration matters). Lesson: "cut prompt examples" is safe for a generative agent, NOT for a gate agent; each agent needs its own A/B gate, and harness coverage bounds what's cuttable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(A/B-validated) The synthesizer's Step 7 carried two `**Example**:` worked-code blocks (base_enhance / fresh_generation, ~95 lines) illustrating strategies whose numbered-step LOGIC is already stated. A capable model applies the steps without the example code. A/B (built .map/synth_probe.py: feed 3 variant calc solutions — 1 correct + 2 buggy — the synthesizer must extract correct decisions and WRITE unified code, scored by the calc hidden suite, n=3): A (HEAD, 1161L): 33/33 hidden · B (cut, 1066L): 33/33 hidden -> B == A Non-regressive, -95 lines (-8%). Confirms the rule from the actor cut: pure worked-examples are cuttable in GENERATIVE agents; guidance/patterns and gate-calibration examples are NOT (monitor + actor-patterns cuts were A/B-rejected and reverted). make check green (2274), render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…coverage limit Worked-examples are cuttable in generative agents (actor -247L, synthesizer -95L; both A/B B>=A) but NOT in gate agents or for guidance/patterns (monitor, actor error-patterns; both A/B-rejected, reverted). Remaining generative agents (task-decomposer/predictor/reflector/research-agent) output fuzzy artifacts with no clean deterministic gate, so further cuts can't satisfy the A/B rule without building lower-confidence gates first. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ed, -14%) The decomposer carried a "## REFERENCE EXAMPLES" section (Example A: a full 139-line worked CRUD blueprint) — pure illustration; the JSON Schema definition (separate section) stays. A capable model produces a valid blueprint from the schema + 5-phase process without the worked example. A/B (built .map/decomp_probe.py: feed a known 5-part feature, score blueprint on schema-validity + requirement-coverage + acyclic deps + sane count; opus, n=3): A (HEAD, 1078L): 3/3 PASS (coverage 5/5) · B (cut, 927L): 3/3 PASS (coverage 5/5) B == A, non-regressive, -151 lines (-14%). Third generative-agent worked-example cut to validate (after actor -247L, synthesizer -95L); gate-calibration/guidance cuts remain A/B-rejected (monitor, actor-patterns). make check green, render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The predictor carried an <examples> block (Examples 1-3: full worked impact analyses, ~417 lines incl. 107L/91L/46L JSON outputs) — illustration; the JSON Schema, risk rubric, and "Good vs Bad Predictions" guidance stay. A/B (built .map/pred_probe.py: feed a known breaking + high-blast-radius signature change across 8 callers; gate on the deterministic core — breaking_changes detected AND affected components identified; sonnet, n=3): A (HEAD, 2003L): 3/3 PASS · B (cut, 1586L): 3/3 PASS B == A, non-regressive, -417 lines (-21%, the largest cut). Fourth generative-agent worked-example cut validated (actor -247, synthesizer -95, decomposer -151, predictor -417 = 910 lines total). Gate covers breaking+affected detection (predictor's core); risk-level (subjective) was rated identically by both versions. make check green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The reflector's "# COMPLETE EXAMPLES" section (4 worked lesson-extraction examples, 139 lines) is illustration; the lesson-format rules + decision frameworks stay. A/B (built .map/reflect_probe.py: feed a clear SQL-injection failure outcome; gate on whether the reflector extracts a lesson naming the root-cause concept injection/parameterization; sonnet, n=3): A (HEAD): 3/3 PASS · B (cut): 3/3 PASS B == A, non-regressive, -139 lines. Fifth generative-agent worked-example cut. TOTAL across 5 generative agents: actor -247, synthesizer -95, task-decomposer -151, predictor -417, reflector -139 = 1049 lines of worked-example bloat removed, all A/B-validated B>=A. Gate agents (monitor etc.) NOT cut — examples calibrate the accept threshold (monitor A/B-rejected). (reflector gate is keyword-presence, fuzzier than the code-output gates.) make check green, render byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…es, A/B-validated) Swept all generative agents for worked-example/REFERENCE blocks, A/B-gated each: KEPT (B>=A): actor -247, synthesizer -95, task-decomposer -151, predictor -417, reflector -139 = 1049 lines. REJECTED (B<A, reverted): monitor examples (calibrate accept threshold), actor error-patterns/decision-tree (scaffold weaker models). Gate agents untouched (examples calibrate); research-agent lean. Final rule + reusable harnesses (.map/*_probe.py) documented. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

azalio and others added 24 commits June 5, 2026 12:46

azalio merged commit 72f74c6 into main Jun 6, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harden skill-eval + A/B-polish MAP agents (−1049 lines of example bloat)#161

Harden skill-eval + A/B-polish MAP agents (−1049 lines of example bloat)#161
azalio merged 24 commits into
mainfrom
gila-copper

azalio commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

azalio commented Jun 6, 2026

Summary

1. skill-eval bug fixes

2. Whole-skill outcome harness + model investigation

3. Agent prompt polish (A/B-validated)

Note

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant