diff --git a/DEVELOPER.md b/DEVELOPER.md index dcc374a..3474a96 100644 --- a/DEVELOPER.md +++ b/DEVELOPER.md @@ -354,6 +354,12 @@ CI enforces that `plugins/asta` stays in sync — PRs fail if someone edits a sk 3. Run `make build-plugins` 4. To promote to core later: remove the `metadata.internal` block and rebuild +#### Validating a behavior change + +When a change affects what an agent *does* — a routing `description:` or a step it follows — validate it before merging: reproduce the target behavior as an eval case and compare baseline vs. change with a paired eval. Where that behavior can't be fully captured in-sandbox, skip the new case and just run the existing cases it could affect, to catch regressions. The [`improve-skills`](plugins/asta-preview/skills/improve-skills/SKILL.md) skill walks this end to end; use it for any skill change you PR. + +Worked examples: [#60](https://github.com/allenai/asta-plugins/pull/60) (description rewrite), [#63](https://github.com/allenai/asta-plugins/pull/63) (new skill + cases), [#67](https://github.com/allenai/asta-plugins/pull/67) (multi-skill fix with ablation). + ### Hooks Hooks in `hooks/` are bash scripts that can auto-approve tool usage: diff --git a/README.md b/README.md index ad96a5e..f67fe2c 100644 --- a/README.md +++ b/README.md @@ -96,46 +96,18 @@ The [`workspace`](plugins/asta-preview/skills/workspace/SKILL.md) skill lets use ## Benchmarking -[`agent-baselines`](https://github.com/allenai/agent-baselines)'s [`inspect-swe`](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe) solver runs Asta skills against any [Inspect](https://inspect.aisi.org.uk/)-compatible eval suite via `-S skills=`. +[`agent-baselines`](https://github.com/allenai/agent-baselines) solvers (e.g. 
[`inspect-swe`](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe)) run Asta skills against any [Inspect](https://inspect.aisi.org.uk/)-compatible eval suite via `-S skills=`: [Swapping in local skills](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#swapping-in-local-skills) points it at the canonical `plugins/asta-preview/skills` tree (edit it directly). [Demo](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#demo) is a worked example on [AstaBench](https://github.com/allenai/asta-bench), a scientific research suite for AI agents. -To run a benchmark, see [Demo](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#demo), which runs the [`astabench`](https://github.com/allenai/asta-bench) science-agent suite with default skills. For your own edits, use [Swapping in local skills](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#swapping-in-local-skills) (run `make build-plugins` first, then point at the regenerated tree). +## Improving skills (or just reporting a problem) -For measuring the effect of a skill change, run a paired comparison via [Comparing two configurations](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#comparing-two-configurations). Both arms must end up with the same `-S version=` and same `ASTA_IMAGE=…@sha256:…`, so a typical flow is: run baseline with defaults, capture what resolved, pin the PR arm to match. +Invoke the [`improve-skills`](plugins/asta-preview/skills/improve-skills/SKILL.md) skill if you: -```bash -# 1. Run baseline arm — defaults (`:latest`, `version=auto`) are fine. -# See the linked recipe above for the full `astabench eval` command; -# swap in `-S skills=…/asta-plugins-baseline/…` and a baseline log dir. - -# 2. 
Capture what resolved (these get reused in step 3): -eval "$(inspect log dump logs/baseline/*.eval | jq -er '.samples[0].metadata - | "ASTA_IMAGE=\(.asta_image)\nAGENT_VERSION=\(.agent_version)"')" -export ASTA_IMAGE AGENT_VERSION - -# 3. Run the PR arm. ASTA_IMAGE is read from the env (already exported), so -# no flag is needed for the image. Add `-S version="$AGENT_VERSION"` and -# the PR branch's `-S skills=…/asta-plugins/…`, into a different log dir. - -# 4. Map the @sha256:… digest in $ASTA_IMAGE to a readable release tag -# by matching the eval timestamp against this repo's release tags: -git tag --sort=-creatordate -l 'v*' | head -``` - -Record the pins in the PR description like `claude_code 2.1.142 · sonnet-4-6 · ghcr.io/allenai/asta:v0.17.2` (`@sha256:bf92d6a2…`) — tag for readability, digest for strict reproducibility. - -See [#60](https://github.com/allenai/asta-plugins/pull/60) for a worked example against existing cases, and [#63](https://github.com/allenai/asta-plugins/pull/63) for a worked example of adding new per-skill cases. - -When a comparison includes a configuration that isn't a regular commit on a PR branch (an ablation, an A/B variant, etc.), preserve it as an annotated git tag under `experiments/PR-/` so reviewers can check it out and reproduce. Tag after the PR is open so the number is known: - -```bash -git tag -a experiments/PR-123/workspace-ablate-artifacts-tightening \ - -m "PR #123's workspace branch with plugins/asta-preview/skills/artifacts/SKILL.md reverted to main. Used to measure view_agent_output routing dependency on the artifacts tightening." -git push origin experiments/PR-123/workspace-ablate-artifacts-tightening -``` +- Observed an agent doing the wrong thing while using a skill (or not doing what you asked). +- Want an agent to be able to do something it currently can't (extend a skill, or add a new one). -Tags survive branch deletion. Listable per-PR with `git tag -l 'experiments/PR-123/*'`. 
Link the tag from the PR description. +Hand off at whatever depth you reach: a reported problem, a failing test for a fixer to pick up, or a fix you've validated with a paired eval. -External contributors push the tag to their fork (no write access here) and link to the fork's tag URL — same convention, different remote. +Contributors changing a skill (including regression-checking before merging) follow the same workflow — see [DEVELOPER.md](DEVELOPER.md#validating-a-behavior-change). ## Development diff --git a/plugins/asta-preview/skills/improve-skills/SKILL.md b/plugins/asta-preview/skills/improve-skills/SKILL.md new file mode 100644 index 0000000..eca8408 --- /dev/null +++ b/plugins/asta-preview/skills/improve-skills/SKILL.md @@ -0,0 +1,142 @@ +--- +name: improve-skills +description: Report or fix a behavior gap in Asta skills. Use when an agent does the wrong thing, doesn't do what was asked, or lacks a capability that's wanted. +metadata: + internal: true +allowed-tools: Bash(git clone *) Bash(git fetch *) Bash(git worktree *) Bash(git checkout *) Bash(git add *) Bash(git commit *) Bash(git tag *) Bash(gh issue view *) Bash(jq *) Bash(eval_env *) Bash(asta auth print-token *) Read Write Edit +--- + +# Improve Asta skills + +## 1. Capture the issue (bug or feature idea) + +Write the gap to `/.asta/improve-skills/.md` (absolute path, dated slug): the skill(s), prompt, current vs. desired behavior, and relevant setup state. Seed it from an existing issue/doc/thread if one exists (`gh issue view --repo allenai/asta-plugins --json body --jq .body > /.asta/improve-skills/.md`). + +> **If stopping here (report):** file it as a new issue (skip if it already is one): `gh issue create --repo allenai/asta-plugins --title "" --body-file /.asta/improve-skills/.md`. + +## 2. 
Assemble the case set + +Set a dev root for the repos, then add or reuse cases in the `asta_skills` suite (under `$DEV_ROOT/asta-bench-private`): + +```bash +export DEV_ROOT= +[ -d "$DEV_ROOT/asta-bench-private" ] || git clone git@github.com:allenai/asta-bench-private "$DEV_ROOT/asta-bench-private" +``` + +Two groups: + +- **Cases for the behavior** (skip if regression-checking only). Reuse existing cases that fit; otherwise add new entries on a branch in `asta-bench-private` (see the [`asta_skills` eval README](https://github.com/allenai/asta-bench-private/blob/main/astabench/ai2/evals/asta_skills/README.md) for case format and metric handlers). +- **Regression guards.** Existing cases that could be affected. Start by searching the suite's `data.json` for cases referencing the changed skill(s), but use judgment — cross-skill effects matter (a description change can shift routing on cases targeting *other* skills; see asta-plugins#67 for a routing-competition example). + +## 3. Run baseline + +Set up a clean `origin/main` worktree for the baseline: + +```bash +[ -d "$DEV_ROOT/agent-baselines" ] || git clone git@github.com:allenai/agent-baselines "$DEV_ROOT/agent-baselines" +[ -d "$DEV_ROOT/asta-plugins" ] || git clone git@github.com:allenai/asta-plugins "$DEV_ROOT/asta-plugins" +export ASTA_TOKEN="$(asta auth print-token --raw --refresh)" +git -C "$DEV_ROOT/asta-plugins" fetch -q origin +git -C "$DEV_ROOT/asta-plugins" worktree add --detach "$DEV_ROOT/asta-plugins-main" origin/main 2>/dev/null \ + || git -C "$DEV_ROOT/asta-plugins-main" checkout -q --detach origin/main +export ASTA_MAIN="$DEV_ROOT/asta-plugins-main" +``` + +From `$DEV_ROOT/agent-baselines`, define `eval_env` to run each `inspect` command in the `solvers/inspect-swe` project: + +```bash +cd "$DEV_ROOT/agent-baselines" +eval_env() { uv run --project solvers/inspect-swe --no-group astabench \ + --with "$DEV_ROOT/asta-bench-private" --frozen --reinstall-package astabench-private -- "$@"; } +``` + +Run 
the full case set (behavior + guards) in one eval: + +```bash +eval_env inspect eval astabench/asta_skills \ + --sample-id ,,... \ + --solver agent_baselines/solvers/inspect_swe/agent.py@inspect_swe_solver \ + --sandbox docker:solvers/inspect-swe/sandbox_compose.yaml \ + --model anthropic/claude-sonnet-4-6 \ + --epochs 3 --working-limit 600 \ + -S agent=claude_code \ + -S strict_reproducibility=true \ + -S skills=$ASTA_MAIN/plugins/asta-preview/skills \ + --log-dir logs/baseline +# model & agent above are the asta-plugins low-cost convention +# -S version and the ASTA_IMAGE env var are left unset for baseline -> auto / :latest +``` + +Read per-case-per-metric scores: `eval_env inspect log dump --header-only "$(ls -t logs/baseline/*.eval | head -1)" | jq '.reductions[0].samples'`. For every arm you run (here and in steps 5–6), also read the transcript (`eval_env inspect view --log-dir logs`, or the sample `messages`), not just the score — confirm the metric reflects the intended behavior, not a coincidence (the gap reproduces, any lift is real, a held guard isn't masking change). For a new case, its gap-capturing metric should score below ceiling; if at ceiling, the gap isn't reproducing — investigate. + +> **If stopping here (reproducible test + baseline, TDD red):** file an issue per step 1 with a `## Reproducible test` section — the baseline numbers, the transcript read (so the fixer inherits the *why*, not just the number), and the case(s): link the asta-bench-private PR if you added any (push the branch, open the PR), or name the existing case(s) you reused. + +## 4. Make the skill change(s) + +From the baseline transcript, hypothesize why the skill(s) produce the gap and what will close it. 
+ +```bash +# Worktree for the PR branch: +git -C "$DEV_ROOT/asta-plugins" worktree add "$DEV_ROOT/asta-plugins-" -b +``` + +Then, in `$DEV_ROOT/asta-plugins-`, edit existing skill(s) or add a new one under `plugins/asta-preview/skills//` (the canonical tree — edit it directly). Fixes may span multiple skills. Commit and push: + +```bash +cd "$DEV_ROOT/asta-plugins-" +git add -A && git commit -m "" && git push -u origin +``` + +## 5. Run the PR arm (same case set) + +Back in `$DEV_ROOT/agent-baselines`, pin to what the baseline resolved: + +```bash +eval "$(eval_env inspect log dump "$(ls -t logs/baseline/*.eval | head -1)" | jq -er '.samples[0].metadata + | "ASTA_IMAGE=\(.asta_image)\nAGENT_VERSION=\(.agent_version)"')" +export ASTA_IMAGE # env var the sandbox compose substitutes into `image: ${ASTA_IMAGE}` (no flag sets the image); AGENT_VERSION stays a shell var for -S version below +``` + +Rerun the step-3 eval with the exported `ASTA_IMAGE` and: + +- `-S version="$AGENT_VERSION"` +- `-S skills=$DEV_ROOT/asta-plugins-/plugins/asta-preview/skills` +- `--log-dir logs/pr-arm` + +Confirm new-case metrics fire and the regression guards hold at baseline. If a guard drops, your change regressed that case — find the cause and fix it (or explain it if the drop is benign) before opening the PR. + +## 6. Ablation + +For a multi-part fix, hypothesize which parts carry the result and which may be unnecessary or harmful, then ablate to test (e.g. asta-plugins#67). 
+ +Create the variant — a worktree off the PR branch with one part undone: + +```bash +git -C "$DEV_ROOT/asta-plugins-" worktree add "$DEV_ROOT/asta-plugins-ablate" -b ablate- +cd "$DEV_ROOT/asta-plugins-ablate" +git checkout main -- plugins/asta-preview/skills//SKILL.md # undo one part: a whole file, or hand-edit to undo part of one +git add -A && git commit -m "ablate: " +``` + +Rerun the step-5 eval on the variant (keeping its `-S version` pin and exported `ASTA_IMAGE`) with: + +- `-S skills=$DEV_ROOT/asta-plugins-ablate/plugins/asta-preview/skills` +- `--log-dir logs/ablation` + +A drop from the PR arm means that part is load-bearing — keep it. No drop means it isn't pulling weight — drop it from the PR (or note why it stays). If undoing a part recovers a regressed guard, it was causing that regression — drop or rework it. + +## 7. Open the PR(s) + +Draft a body per repo you touched, in `/.asta/improve-skills/` as `--pr.md` — concise, linking the companion rather than restating it: + +- **asta-plugins** (always): what it fixes — `Resolves #` if one exists, else the gap and the behavior it addresses — plus **Validation**: pinned setup (agent version, model, image tag + `@sha256:`), an arms table (baseline, PR, and ablation if you ran one) across cases and metrics, and the transcript read. +- **asta-bench-private** (only if you changed cases): the case(s) and what they measure; defer results to the companion. + +Create each with `gh pr create --repo allenai/ --body-file /.asta/improve-skills/--pr.md`, then cross-link their numbers with `gh pr edit`. + +If you ran an ablation, tag its variant and link it from the asta-plugins PR's validation section (`https://github.com/allenai/asta-plugins/tree/experiments/PR-/`) so reviewers can reproduce it: + +```bash +git -C "$DEV_ROOT/asta-plugins-ablate" tag -a experiments/PR-/ -m "PR # minus — ablation arm." +git -C "$DEV_ROOT/asta-plugins-ablate" push origin refs/tags/experiments/PR-/ +```
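The tag name used above follows an `experiments/PR-<number>/<slug>` layout, and a malformed name is easy to push by accident. As a minimal sketch, the helper below composes and sanity-checks that name before tagging. `experiment_tag` is a hypothetical function, not part of the skill or of git, and the character rules it enforces (numeric PR number, lowercase/digit/hyphen slug) are assumptions chosen for illustration rather than a documented convention.

```bash
# Hypothetical helper (an illustration, not part of improve-skills):
# compose the tag name for an ablation variant as experiments/PR-<number>/<slug>.
experiment_tag() {
  et_pr="$1"
  et_slug="$2"

  # Assumption: a PR number is a non-empty run of decimal digits.
  case "$et_pr" in
    ''|*[!0-9]*)
      echo "experiment_tag: PR number must be numeric, got '$et_pr'" >&2
      return 1
      ;;
  esac

  # Assumption: a slug is non-empty, does not start with a hyphen,
  # and contains only lowercase letters, digits, and hyphens.
  case "$et_slug" in
    ''|-*|*[!a-z0-9-]*)
      echo "experiment_tag: invalid slug '$et_slug'" >&2
      return 1
      ;;
  esac

  printf 'experiments/PR-%s/%s\n' "$et_pr" "$et_slug"
}
```

It could then feed the tagging step, for example `git tag -a "$(experiment_tag 123 workspace-ablate-artifacts-tightening)" -m "..."`, where `123` and the slug are placeholder values. Validating up front keeps a typo from producing a tag that reviewers cannot find with `git tag -l 'experiments/PR-123/*'`.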