allenai · jbragg · Jun 11, 2026 · Jun 2, 2026
diff --git a/DEVELOPER.md b/DEVELOPER.md
@@ -354,6 +354,12 @@ CI enforces that `plugins/asta` stays in sync — PRs fail if someone edits a sk
 3. Run `make build-plugins`
 4. To promote to core later: remove the `metadata.internal` block and rebuild
 
+#### Validating a behavior change
+
+When a change affects what an agent *does* — a routing `description:` or a step it follows — validate it before merging: reproduce the target behavior as an eval case and compare baseline vs. change with a paired eval. Where that behavior can't be fully captured in-sandbox, skip the new case and just run the existing cases it could affect, to catch regressions. The [`improve-skills`](plugins/asta-preview/skills/improve-skills/SKILL.md) skill walks this end to end; use it for any skill change you PR.
+
+Worked examples: [#60](https://github.com/allenai/asta-plugins/pull/60) (description rewrite), [#63](https://github.com/allenai/asta-plugins/pull/63) (new skill + cases), [#67](https://github.com/allenai/asta-plugins/pull/67) (multi-skill fix with ablation).
+
 ### Hooks
 
 Hooks in `hooks/` are bash scripts that can auto-approve tool usage:

diff --git a/README.md b/README.md
@@ -96,46 +96,18 @@ The [`workspace`](plugins/asta-preview/skills/workspace/SKILL.md) skill lets use
 
 ## Benchmarking
 
-[`agent-baselines`](https://github.com/allenai/agent-baselines)'s [`inspect-swe`](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe) solver runs Asta skills against any [Inspect](https://inspect.aisi.org.uk/)-compatible eval suite via `-S skills=<path>`.
+[`agent-baselines`](https://github.com/allenai/agent-baselines) solvers (e.g. [`inspect-swe`](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe)) run Asta skills against any [Inspect](https://inspect.aisi.org.uk/)-compatible eval suite via `-S skills=<path>`: [Swapping in local skills](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#swapping-in-local-skills) points it at the canonical `plugins/asta-preview/skills` tree (edit it directly). [Demo](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#demo) is a worked example on [AstaBench](https://github.com/allenai/asta-bench), a scientific research suite for AI agents.
 
-To run a benchmark, see [Demo](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#demo), which runs the [`astabench`](https://github.com/allenai/asta-bench) science-agent suite with default skills. For your own edits, use [Swapping in local skills](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#swapping-in-local-skills) (run `make build-plugins` first, then point at the regenerated tree).
+## Improving skills (or just reporting a problem)
 
-For measuring the effect of a skill change, run a paired comparison via [Comparing two configurations](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#comparing-two-configurations). Both arms must end up with the same `-S version=<cc-ver>` and same `ASTA_IMAGE=…@sha256:…`, so a typical flow is: run baseline with defaults, capture what resolved, pin the PR arm to match.
+Invoke the [`improve-skills`](plugins/asta-preview/skills/improve-skills/SKILL.md) skill if you:
 
-```bash
-# 1. Run baseline arm — defaults (`:latest`, `version=auto`) are fine.
-#    See the linked recipe above for the full `astabench eval` command;
-#    swap in `-S skills=…/asta-plugins-baseline/…` and a baseline log dir.
-
-# 2. Capture what resolved (these get reused in step 3):
-eval "$(inspect log dump logs/baseline/*.eval | jq -er '.samples[0].metadata
-    | "ASTA_IMAGE=\(.asta_image)\nAGENT_VERSION=\(.agent_version)"')"
-export ASTA_IMAGE AGENT_VERSION
-
-# 3. Run the PR arm. ASTA_IMAGE is read from the env (already exported), so
-#    no flag is needed for the image. Add `-S version="$AGENT_VERSION"` and
-#    the PR branch's `-S skills=…/asta-plugins/…`, into a different log dir.
-
-# 4. Map the @sha256:… digest in $ASTA_IMAGE to a readable release tag
-#    by matching the eval timestamp against this repo's release tags:
-git tag --sort=-creatordate -l 'v*' | head
-```
-
-Record the pins in the PR description like `claude_code 2.1.142 · sonnet-4-6 · ghcr.io/allenai/asta:v0.17.2` (`@sha256:bf92d6a2…`) — tag for readability, digest for strict reproducibility.
-
-See [#60](https://github.com/allenai/asta-plugins/pull/60) for a worked example against existing cases, and [#63](https://github.com/allenai/asta-plugins/pull/63) for a worked example of adding new per-skill cases.
-
-When a comparison includes a configuration that isn't a regular commit on a PR branch (an ablation, an A/B variant, etc.), preserve it as an annotated git tag under `experiments/PR-<num>/<description>` so reviewers can check it out and reproduce. Tag after the PR is open so the number is known:
-
-```bash
-git tag -a experiments/PR-123/workspace-ablate-artifacts-tightening \
-  -m "PR #123's workspace branch with plugins/asta-preview/skills/artifacts/SKILL.md reverted to main. Used to measure view_agent_output routing dependency on the artifacts tightening."
-git push origin experiments/PR-123/workspace-ablate-artifacts-tightening
-```
+- Observed an agent doing the wrong thing while using a skill (or not doing what you asked).
+- Want an agent to be able to do something it currently can't (extend a skill, or add a new one).
 
-Tags survive branch deletion. Listable per-PR with `git tag -l 'experiments/PR-123/*'`. Link the tag from the PR description.
+Hand off at whatever depth you reach: a reported problem, a failing test for a fixer to pick up, or a fix you've validated with a paired eval.
 
-External contributors push the tag to their fork (no write access here) and link to the fork's tag URL — same convention, different remote.
+Contributors changing a skill (including regression-checking before merging) follow the same workflow — see [DEVELOPER.md](DEVELOPER.md#validating-a-behavior-change).
 
 ## Development
 

diff --git a/plugins/asta-preview/skills/improve-skills/SKILL.md b/plugins/asta-preview/skills/improve-skills/SKILL.md
@@ -0,0 +1,142 @@
+---
+name: improve-skills
+description: Report or fix a behavior gap in Asta skills. Use when an agent does the wrong thing, doesn't do what was asked, or lacks a capability that's wanted.
+metadata:
+  internal: true
+allowed-tools: Bash(git clone *) Bash(git fetch *) Bash(git worktree *) Bash(git checkout *) Bash(git add *) Bash(git commit *) Bash(git tag *) Bash(gh issue view *) Bash(jq *) Bash(eval_env *) Bash(asta auth print-token *) Read Write Edit
+---
+
+# Improve Asta skills
+
+## 1. Capture the issue (bug or feature idea)
+
+Write the gap to `<project-root>/.asta/improve-skills/<slug>.md` (absolute path, dated slug): the skill(s), prompt, current vs. desired behavior, and relevant setup state. Seed it from an existing issue/doc/thread if one exists (`gh issue view <num> --repo allenai/asta-plugins --json body --jq .body > <project-root>/.asta/improve-skills/<slug>.md`).
+
+> **If stopping here (report):** file it as a new issue (skip if it already is one): `gh issue create --repo allenai/asta-plugins --title "<summary>" --body-file <project-root>/.asta/improve-skills/<slug>.md`.
+
+## 2. Assemble the case set
+
+Set a dev root for the repos, then add or reuse cases in the `asta_skills` suite (under `$DEV_ROOT/asta-bench-private`):
+
+```bash
+export DEV_ROOT=<your dev dir>
+[ -d "$DEV_ROOT/asta-bench-private" ] || git clone git@github.com:allenai/asta-bench-private "$DEV_ROOT/asta-bench-private"
+```
+
+Two groups:
+
+- **Cases for the behavior** (skip if regression-checking only). Reuse existing cases that fit; otherwise add new entries on a branch in `asta-bench-private` (see the [`asta_skills` eval README](https://github.com/allenai/asta-bench-private/blob/main/astabench/ai2/evals/asta_skills/README.md) for case format and metric handlers).
+- **Regression guards.** Existing cases that could be affected. Start by searching the suite's `data.json` for cases referencing the changed skill(s), but use judgment — cross-skill effects matter (a description change can shift routing on cases targeting *other* skills; see asta-plugins#67 for a routing-competition example).
+
+## 3. Run baseline
+
+Set up a clean `origin/main` worktree for the baseline:
+
+```bash
+[ -d "$DEV_ROOT/agent-baselines" ] || git clone git@github.com:allenai/agent-baselines "$DEV_ROOT/agent-baselines"
+[ -d "$DEV_ROOT/asta-plugins" ]    || git clone git@github.com:allenai/asta-plugins "$DEV_ROOT/asta-plugins"
+export ASTA_TOKEN="$(asta auth print-token --raw --refresh)"
+git -C "$DEV_ROOT/asta-plugins" fetch -q origin
+git -C "$DEV_ROOT/asta-plugins" worktree add --detach "$DEV_ROOT/asta-plugins-main" origin/main 2>/dev/null \
+    || git -C "$DEV_ROOT/asta-plugins-main" checkout -q --detach origin/main
+export ASTA_MAIN="$DEV_ROOT/asta-plugins-main"
+```
+
+From `$DEV_ROOT/agent-baselines`, define `eval_env` to run each `inspect` command in the `solvers/inspect-swe` project:
+
+```bash
+cd "$DEV_ROOT/agent-baselines"
+eval_env() { uv run --project solvers/inspect-swe --no-group astabench \
+    --with "$DEV_ROOT/asta-bench-private" --frozen --reinstall-package astabench-private -- "$@"; }
+```
+
+Run the full case set (behavior + guards) in one eval:
+
+```bash
+eval_env inspect eval astabench/asta_skills \
+    --sample-id <case_id_1>,<case_id_2>,... \
+    --solver agent_baselines/solvers/inspect_swe/agent.py@inspect_swe_solver \
+    --sandbox docker:solvers/inspect-swe/sandbox_compose.yaml \
+    --model anthropic/claude-sonnet-4-6 \
+    --epochs 3 --working-limit 600 \
+    -S agent=claude_code \
+    -S strict_reproducibility=true \
+    -S skills=$ASTA_MAIN/plugins/asta-preview/skills \
+    --log-dir logs/baseline
+# model & agent above are the asta-plugins low-cost convention
+# -S version and the ASTA_IMAGE env var are left unset for baseline -> auto / :latest
+```
+
+Read per-case-per-metric scores: `eval_env inspect log dump --header-only "$(ls -t logs/baseline/*.eval | head -1)" | jq '.reductions[0].samples'`. For every arm you run (here and in steps 5–6), also read the transcript (`eval_env inspect view --log-dir logs`, or the sample `messages`), not just the score — confirm the metric reflects the intended behavior, not a coincidence (the gap reproduces, any lift is real, a held guard isn't masking change). For a new case, its gap-capturing metric should score below ceiling; if at ceiling, the gap isn't reproducing — investigate.
+
+> **If stopping here (reproducible test + baseline, TDD red):** file an issue per step 1 with a `## Reproducible test` section — the baseline numbers, the transcript read (so the fixer inherits the *why*, not just the number), and the case(s): link the asta-bench-private PR if you added any (push the branch, open the PR), or name the existing case(s) you reused.
+
+## 4. Make the skill change(s)
+
+From the baseline transcript, hypothesize why the skill(s) produce the gap and what will close it.
+
+```bash
+# Worktree for the PR branch:
+git -C "$DEV_ROOT/asta-plugins" worktree add "$DEV_ROOT/asta-plugins-<pr-branch>" -b <pr-branch>
+```
+
+Then, in `$DEV_ROOT/asta-plugins-<pr-branch>`, edit existing skill(s) or add a new one under `plugins/asta-preview/skills/<name>/` (the canonical tree — edit it directly). Fixes may span multiple skills. Commit and push:
+
+```bash
+cd "$DEV_ROOT/asta-plugins-<pr-branch>"
+git add -A && git commit -m "<summary>" && git push -u origin <pr-branch>
+```
+
+## 5. Run the PR arm (same case set)
+
+Back in `$DEV_ROOT/agent-baselines`, pin to what the baseline resolved:
+
+```bash
+eval "$(eval_env inspect log dump "$(ls -t logs/baseline/*.eval | head -1)" | jq -er '.samples[0].metadata
+    | "ASTA_IMAGE=\(.asta_image)\nAGENT_VERSION=\(.agent_version)"')"
+export ASTA_IMAGE   # env var the sandbox compose substitutes into `image: ${ASTA_IMAGE}` (no flag sets the image); AGENT_VERSION stays a shell var for -S version below
+```
+
+Rerun the step-3 eval with the exported `ASTA_IMAGE` and:
+
+- `-S version="$AGENT_VERSION"`
+- `-S skills=$DEV_ROOT/asta-plugins-<pr-branch>/plugins/asta-preview/skills`
+- `--log-dir logs/pr-arm`
+
+Confirm new-case metrics fire and the regression guards hold at baseline. If a guard drops, your change regressed that case — find the cause and fix it (or explain it if the drop is benign) before opening the PR.
+
+## 6. Ablation
+
+For a multi-part fix, hypothesize which parts carry the result and which may be unnecessary or harmful, then ablate to test (e.g. asta-plugins#67).
+
+Create the variant — a worktree off the PR branch with one part undone:
+
+```bash
+git -C "$DEV_ROOT/asta-plugins-<pr-branch>" worktree add "$DEV_ROOT/asta-plugins-ablate" -b ablate-<short-desc>
+cd "$DEV_ROOT/asta-plugins-ablate"
+git checkout main -- plugins/asta-preview/skills/<skill>/SKILL.md   # undo one part: a whole file, or hand-edit to undo part of one
+git add -A && git commit -m "ablate: <description>"
+```
+
+Rerun the step-5 eval on the variant (keeping its `-S version` pin and exported `ASTA_IMAGE`) with:
+
+- `-S skills=$DEV_ROOT/asta-plugins-ablate/plugins/asta-preview/skills`
+- `--log-dir logs/ablation`
+
+A drop from the PR arm means that part is load-bearing — keep it. No drop means it isn't pulling weight — drop it from the PR (or note why it stays). If undoing a part recovers a regressed guard, it was causing that regression — drop or rework it.
+
+## 7. Open the PR(s)
+
+Draft a body per repo you touched, in `<project-root>/.asta/improve-skills/` as `<slug>-<repo>-pr.md` — concise, linking the companion rather than restating it:
+
+- **asta-plugins** (always): what it fixes — `Resolves #<issue>` if one exists, else the gap and the behavior it addresses — plus **Validation**: pinned setup (agent version, model, image tag + `@sha256:`), an arms table (baseline, PR, and ablation if you ran one) across cases and metrics, and the transcript read.
+- **asta-bench-private** (only if you changed cases): the case(s) and what they measure; defer results to the companion.
+
+Create each with `gh pr create --repo allenai/<repo> --body-file <project-root>/.asta/improve-skills/<slug>-<repo>-pr.md`, then cross-link their numbers with `gh pr edit`.
+
+If you ran an ablation, tag its variant and link it from the asta-plugins PR's validation section (`https://github.com/allenai/asta-plugins/tree/experiments/PR-<num>/<short-desc>`) so reviewers can reproduce it:
+
+```bash
+git -C "$DEV_ROOT/asta-plugins-ablate" tag -a experiments/PR-<num>/<short-desc> -m "PR #<num> minus <part> — ablation arm."
+git -C "$DEV_ROOT/asta-plugins-ablate" push origin refs/tags/experiments/PR-<num>/<short-desc>
+```