Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions DEVELOPER.md
Original file line number Diff line number Diff line change
Expand Up @@ -354,6 +354,12 @@ CI enforces that `plugins/asta` stays in sync — PRs fail if someone edits a sk
3. Run `make build-plugins`
4. To promote to core later: remove the `metadata.internal` block and rebuild

#### Validating a behavior change

When a change affects what an agent *does* — a routing `description:` or a step it follows — validate it before merging: reproduce the target behavior as an eval case and compare baseline vs. change with a paired eval. Where that behavior can't be fully captured in-sandbox, skip the new case and just run the existing cases it could affect, to catch regressions. The [`improve-skills`](plugins/asta-preview/skills/improve-skills/SKILL.md) skill walks this end to end; use it for any skill change you PR.

Worked examples: [#60](https://github.com/allenai/asta-plugins/pull/60) (description rewrite), [#63](https://github.com/allenai/asta-plugins/pull/63) (new skill + cases), [#67](https://github.com/allenai/asta-plugins/pull/67) (multi-skill fix with ablation).

### Hooks

Hooks in `hooks/` are bash scripts that can auto-approve tool usage:
Expand Down
42 changes: 7 additions & 35 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -96,46 +96,18 @@ The [`workspace`](plugins/asta-preview/skills/workspace/SKILL.md) skill lets use

## Benchmarking

[`agent-baselines`](https://github.com/allenai/agent-baselines)'s [`inspect-swe`](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe) solver runs Asta skills against any [Inspect](https://inspect.aisi.org.uk/)-compatible eval suite via `-S skills=<path>`.
[`agent-baselines`](https://github.com/allenai/agent-baselines) solvers (e.g. [`inspect-swe`](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe)) run Asta skills against any [Inspect](https://inspect.aisi.org.uk/)-compatible eval suite via `-S skills=<path>`: [Swapping in local skills](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#swapping-in-local-skills) points it at the canonical `plugins/asta-preview/skills` tree (edit it directly). [Demo](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#demo) is a worked example on [AstaBench](https://github.com/allenai/asta-bench), a scientific research suite for AI agents.

To run a benchmark, see [Demo](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#demo), which runs the [`astabench`](https://github.com/allenai/asta-bench) science-agent suite with default skills. For your own edits, use [Swapping in local skills](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#swapping-in-local-skills) (run `make build-plugins` first, then point at the regenerated tree).
## Improving skills (or just reporting a problem)

For measuring the effect of a skill change, run a paired comparison via [Comparing two configurations](https://github.com/allenai/agent-baselines/tree/main/solvers/inspect-swe#comparing-two-configurations). Both arms must end up with the same `-S version=<cc-ver>` and same `ASTA_IMAGE=…@sha256:…`, so a typical flow is: run baseline with defaults, capture what resolved, pin the PR arm to match.
Invoke the [`improve-skills`](plugins/asta-preview/skills/improve-skills/SKILL.md) skill if you:

```bash
# 1. Run baseline arm — defaults (`:latest`, `version=auto`) are fine.
# See the linked recipe above for the full `astabench eval` command;
# swap in `-S skills=…/asta-plugins-baseline/…` and a baseline log dir.

# 2. Capture what resolved (these get reused in step 3):
eval "$(inspect log dump logs/baseline/*.eval | jq -er '.samples[0].metadata
| "ASTA_IMAGE=\(.asta_image)\nAGENT_VERSION=\(.agent_version)"')"
export ASTA_IMAGE AGENT_VERSION

# 3. Run the PR arm. ASTA_IMAGE is read from the env (already exported), so
# no flag is needed for the image. Add `-S version="$AGENT_VERSION"` and
# the PR branch's `-S skills=…/asta-plugins/…`, into a different log dir.

# 4. Map the @sha256:… digest in $ASTA_IMAGE to a readable release tag
# by matching the eval timestamp against this repo's release tags:
git tag --sort=-creatordate -l 'v*' | head
```

Record the pins in the PR description like `claude_code 2.1.142 · sonnet-4-6 · ghcr.io/allenai/asta:v0.17.2` (`@sha256:bf92d6a2…`) — tag for readability, digest for strict reproducibility.

See [#60](https://github.com/allenai/asta-plugins/pull/60) for a worked example against existing cases, and [#63](https://github.com/allenai/asta-plugins/pull/63) for a worked example of adding new per-skill cases.

When a comparison includes a configuration that isn't a regular commit on a PR branch (an ablation, an A/B variant, etc.), preserve it as an annotated git tag under `experiments/PR-<num>/<description>` so reviewers can check it out and reproduce. Tag after the PR is open so the number is known:

```bash
git tag -a experiments/PR-123/workspace-ablate-artifacts-tightening \
-m "PR #123's workspace branch with plugins/asta-preview/skills/artifacts/SKILL.md reverted to main. Used to measure view_agent_output routing dependency on the artifacts tightening."
git push origin experiments/PR-123/workspace-ablate-artifacts-tightening
```
- Observed an agent doing the wrong thing while using a skill (or not doing what you asked).
- Want an agent to be able to do something it currently can't (extend a skill, or add a new one).

Tags survive branch deletion. Listable per-PR with `git tag -l 'experiments/PR-123/*'`. Link the tag from the PR description.
Hand off at whatever depth you reach: a reported problem, a failing test for a fixer to pick up, or a fix you've validated with a paired eval.

External contributors push the tag to their fork (no write access here) and link to the fork's tag URL — same convention, different remote.
Contributors changing a skill (including regression-checking before merging) follow the same workflow — see [DEVELOPER.md](DEVELOPER.md#validating-a-behavior-change).

## Development

Expand Down
142 changes: 142 additions & 0 deletions plugins/asta-preview/skills/improve-skills/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
---
name: improve-skills
description: Report or fix a behavior gap in Asta skills. Use when an agent does the wrong thing, doesn't do what was asked, or lacks a capability that's wanted.
metadata:
internal: true
allowed-tools: Bash(git clone *) Bash(git fetch *) Bash(git worktree *) Bash(git checkout *) Bash(git add *) Bash(git commit *) Bash(git tag *) Bash(gh issue view *) Bash(jq *) Bash(eval_env *) Bash(asta auth print-token *) Read Write Edit
---

# Improve Asta skills

## 1. Capture the issue (bug or feature idea)

Write the gap to `<project-root>/.asta/improve-skills/<slug>.md` (absolute path, dated slug): the skill(s), prompt, current vs. desired behavior, and relevant setup state. Seed it from an existing issue/doc/thread if one exists (`gh issue view <num> --repo allenai/asta-plugins --json body --jq .body > <project-root>/.asta/improve-skills/<slug>.md`).

> **If stopping here (report):** file it as a new issue (skip if it already is one): `gh issue create --repo allenai/asta-plugins --title "<summary>" --body-file <project-root>/.asta/improve-skills/<slug>.md`.

## 2. Assemble the case set

Set a dev root for the repos, then add or reuse cases in the `asta_skills` suite (under `$DEV_ROOT/asta-bench-private`):

```bash
export DEV_ROOT=<your dev dir>
[ -d "$DEV_ROOT/asta-bench-private" ] || git clone git@github.com:allenai/asta-bench-private "$DEV_ROOT/asta-bench-private"
```

Two groups:

- **Cases for the behavior** (skip if regression-checking only). Reuse existing cases that fit; otherwise add new entries on a branch in `asta-bench-private` (see the [`asta_skills` eval README](https://github.com/allenai/asta-bench-private/blob/main/astabench/ai2/evals/asta_skills/README.md) for case format and metric handlers).
- **Regression guards.** Existing cases that could be affected. Start by searching the suite's `data.json` for cases referencing the changed skill(s), but use judgment — cross-skill effects matter (a description change can shift routing on cases targeting *other* skills; see asta-plugins#67 for a routing-competition example).

## 3. Run baseline

Set up a clean `origin/main` worktree for the baseline:

```bash
[ -d "$DEV_ROOT/agent-baselines" ] || git clone git@github.com:allenai/agent-baselines "$DEV_ROOT/agent-baselines"
[ -d "$DEV_ROOT/asta-plugins" ] || git clone git@github.com:allenai/asta-plugins "$DEV_ROOT/asta-plugins"
export ASTA_TOKEN="$(asta auth print-token --raw --refresh)"
git -C "$DEV_ROOT/asta-plugins" fetch -q origin
git -C "$DEV_ROOT/asta-plugins" worktree add --detach "$DEV_ROOT/asta-plugins-main" origin/main 2>/dev/null \
|| git -C "$DEV_ROOT/asta-plugins-main" checkout -q --detach origin/main
export ASTA_MAIN="$DEV_ROOT/asta-plugins-main"
```

From `$DEV_ROOT/agent-baselines`, define `eval_env` to run each `inspect` command in the `solvers/inspect-swe` project:

```bash
cd "$DEV_ROOT/agent-baselines"
eval_env() { uv run --project solvers/inspect-swe --no-group astabench \
--with "$DEV_ROOT/asta-bench-private" --frozen --reinstall-package astabench-private -- "$@"; }
```

Run the full case set (behavior + guards) in one eval:

```bash
eval_env inspect eval astabench/asta_skills \
--sample-id <case_id_1>,<case_id_2>,... \
--solver agent_baselines/solvers/inspect_swe/agent.py@inspect_swe_solver \
--sandbox docker:solvers/inspect-swe/sandbox_compose.yaml \
--model anthropic/claude-sonnet-4-6 \
--epochs 3 --working-limit 600 \
-S agent=claude_code \
-S strict_reproducibility=true \
-S skills=$ASTA_MAIN/plugins/asta-preview/skills \
--log-dir logs/baseline
# model & agent above are the asta-plugins low-cost convention
# -S version and the ASTA_IMAGE env var are left unset for baseline -> auto / :latest
```

Read per-case-per-metric scores: `eval_env inspect log dump --header-only "$(ls -t logs/baseline/*.eval | head -1)" | jq '.reductions[0].samples'`. For every arm you run (here and in steps 5–6), also read the transcript (`eval_env inspect view --log-dir logs`, or the sample `messages`), not just the score — confirm the metric reflects the intended behavior, not a coincidence (the gap reproduces, any lift is real, a held guard isn't masking change). For a new case, its gap-capturing metric should score below ceiling; if at ceiling, the gap isn't reproducing — investigate.

> **If stopping here (reproducible test + baseline, TDD red):** file an issue per step 1 with a `## Reproducible test` section — the baseline numbers, the transcript read (so the fixer inherits the *why*, not just the number), and the case(s): link the asta-bench-private PR if you added any (push the branch, open the PR), or name the existing case(s) you reused.

## 4. Make the skill change(s)

From the baseline transcript, hypothesize why the skill(s) produce the gap and what will close it.

```bash
# Worktree for the PR branch:
git -C "$DEV_ROOT/asta-plugins" worktree add "$DEV_ROOT/asta-plugins-<pr-branch>" -b <pr-branch>
```

Then, in `$DEV_ROOT/asta-plugins-<pr-branch>`, edit existing skill(s) or add a new one under `plugins/asta-preview/skills/<name>/` (the canonical tree — edit it directly). Fixes may span multiple skills. Commit and push:

```bash
cd "$DEV_ROOT/asta-plugins-<pr-branch>"
git add -A && git commit -m "<summary>" && git push -u origin <pr-branch>
```

## 5. Run the PR arm (same case set)

Back in `$DEV_ROOT/agent-baselines`, pin to what the baseline resolved:

```bash
eval "$(eval_env inspect log dump "$(ls -t logs/baseline/*.eval | head -1)" | jq -er '.samples[0].metadata
| "ASTA_IMAGE=\(.asta_image)\nAGENT_VERSION=\(.agent_version)"')"
export ASTA_IMAGE # env var the sandbox compose substitutes into `image: ${ASTA_IMAGE}` (no flag sets the image); AGENT_VERSION stays a shell var for -S version below
```

Rerun the step-3 eval with the exported `ASTA_IMAGE` and:

- `-S version="$AGENT_VERSION"`
- `-S skills=$DEV_ROOT/asta-plugins-<pr-branch>/plugins/asta-preview/skills`
- `--log-dir logs/pr-arm`

Confirm new-case metrics fire and the regression guards hold at baseline. If a guard drops, your change regressed that case — find the cause and fix it (or explain it if the drop is benign) before opening the PR.

## 6. Ablation

For a multi-part fix, hypothesize which parts carry the result and which may be unnecessary or harmful, then ablate to test (e.g. asta-plugins#67).

Create the variant — a worktree off the PR branch with one part undone:

```bash
git -C "$DEV_ROOT/asta-plugins-<pr-branch>" worktree add "$DEV_ROOT/asta-plugins-ablate" -b ablate-<short-desc>
cd "$DEV_ROOT/asta-plugins-ablate"
git checkout main -- plugins/asta-preview/skills/<skill>/SKILL.md # undo one part: a whole file, or hand-edit to undo part of one
git add -A && git commit -m "ablate: <description>"
```

Rerun the step-5 eval on the variant (keeping its `-S version` pin and exported `ASTA_IMAGE`) with:

- `-S skills=$DEV_ROOT/asta-plugins-ablate/plugins/asta-preview/skills`
- `--log-dir logs/ablation`

A drop from the PR arm means that part is load-bearing — keep it. No drop means it isn't pulling weight — drop it from the PR (or note why it stays). If undoing a part recovers a regressed guard, it was causing that regression — drop or rework it.

## 7. Open the PR(s)

Draft a body per repo you touched, in `<project-root>/.asta/improve-skills/` as `<slug>-<repo>-pr.md` — concise, linking the companion rather than restating it:

- **asta-plugins** (always): what it fixes — `Resolves #<issue>` if one exists, else the gap and the behavior it addresses — plus **Validation**: pinned setup (agent version, model, image tag + `@sha256:`), an arms table (baseline, PR, and ablation if you ran one) across cases and metrics, and the transcript read.
- **asta-bench-private** (only if you changed cases): the case(s) and what they measure; defer results to the companion.

Create each with `gh pr create --repo allenai/<repo> --body-file <project-root>/.asta/improve-skills/<slug>-<repo>-pr.md`, then cross-link their numbers with `gh pr edit`.

If you ran an ablation, tag its variant and link it from the asta-plugins PR's validation section (`https://github.com/allenai/asta-plugins/tree/experiments/PR-<num>/<short-desc>`) so reviewers can reproduce it:

```bash
git -C "$DEV_ROOT/asta-plugins-ablate" tag -a experiments/PR-<num>/<short-desc> -m "PR #<num> minus <part> — ablation arm."
git -C "$DEV_ROOT/asta-plugins-ablate" push origin refs/tags/experiments/PR-<num>/<short-desc>
```
Loading