improve-skills: report, fix, validate — a workflow for Asta skills by jbragg · Pull Request #75 · allenai/asta-plugins

jbragg · 2026-06-10T05:08:20Z

Adds an improve-skills skill (preview, internal) that turns a behavior gap (a skill that misbehaves or lacks a capability you want) into a reproducing eval case and, carried further, a fix validated by paired evals. The work can be handed off at either the report stage or the validated-fix stage. Also adds an "Improving skills" section to the README (the user-facing report/fix path) and a pointer in DEVELOPER.md (the contributor flow: changing a skill and regression-checking it before merging). See those for the workflow.

Companion: allenai/asta-bench-private#229 — the validating eval case improve_skills_report_problem.

Validation

claude_code 2.1.153 · sonnet-4-6 · ghcr.io/allenai/asta:v0.18.0 @sha256:fef908d7a573…, 3 epochs, baseline (main) vs this PR.

| case | metric | baseline | PR |
| --- | --- | --- | --- |
| `improve_skills_report_problem` | `improve_skills_activated` | 0/3 | 3/3 |
| `improve_skills_report_problem` | `capture_written` | 0/3 | 3/3 |
| `view_agent_output` (guard) | `workspace_skill_activated` | 0/3 | 3/3 |

capture_written is the real signal: with the skill, the agent invokes improve-skills and writes a dated capture under .asta/improve-skills/ (followed by a gh issue create attempt, which is blocked in the sandbox). Without it the agent improvises: in the baseline runs it either edited the workspace skill in-sandbox or just explained the problem in chat, and never produced a durable, reportable capture.
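To make the capture_written signal concrete, here is a minimal sketch of what such a check could look like. It is not the grader from the companion eval case: the function name, the workspace argument, and the rule "any file under .asta/improve-skills/ counts" are assumptions for illustration. The capture's dated file-name format is not given here, so the sketch does not try to match it.

```python
from pathlib import Path


def capture_written(workspace: Path) -> bool:
    """Return True if at least one file exists under .asta/improve-skills/.

    Assumption: any file below that directory counts as a capture; the real
    metric may be stricter (for example, checking the dated file name).
    """
    capture_dir = workspace / ".asta" / "improve-skills"
    if not capture_dir.is_dir():
        return False
    return any(path.is_file() for path in capture_dir.rglob("*"))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        print(capture_written(root))  # no capture directory yet
        target = root / ".asta" / "improve-skills"
        target.mkdir(parents=True)
        # Hypothetical file name, used only for this demo.
        (target / "example-capture.md").write_text("problem report\n")
        print(capture_written(root))
```

A check of this shape separates the two baseline behaviours described above (editing the skill in place, or answering in chat) from the PR arm, since neither baseline behaviour leaves a file in that directory.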

The view_agent_output guard (a precaution) holds: the workspace skill fires 3/3 in the PR arm. (In the baseline it fires 0/3 because the agent asks for context instead of opening the project; with improve-skills present it invokes workspace, which is not a regression.)
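The pass counts reported above are per-arm tallies over 3 epochs. A minimal sketch of that tally follows, assuming a hypothetical record layout of (arm, case, metric, passed); this is not the asta-bench output format, and the records simply mirror the 0/3 and 3/3 figures from the table.

```python
from collections import defaultdict

EPOCHS = 3

# (case, metric) pairs taken from the validation table.
METRICS = [
    ("improve_skills_report_problem", "improve_skills_activated"),
    ("improve_skills_report_problem", "capture_written"),
    ("view_agent_output", "workspace_skill_activated"),
]

# Hypothetical per-epoch records: every metric fails in the baseline arm
# and passes in the PR arm, matching the reported results.
RESULTS = [
    (arm, case, metric, arm == "pr")
    for arm in ("baseline", "pr")
    for case, metric in METRICS
    for _ in range(EPOCHS)
]


def tally(results):
    """Map (arm, case, metric) to a 'passes/runs' string."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passes, runs]
    for arm, case, metric, passed in results:
        entry = counts[(arm, case, metric)]
        entry[0] += int(passed)
        entry[1] += 1
    return {key: f"{p}/{n}" for key, (p, n) in counts.items()}


if __name__ == "__main__":
    for (arm, case, metric), score in sorted(tally(RESULTS).items()):
        print(f"{arm:8s} {case:32s} {metric:28s} {score}")
```

With only 3 epochs per arm, a tally like this shows a clear 0/3 versus 3/3 split but says little about smaller effects.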

Scope: only the report path (routing + capture) runs in-sandbox; filing the issue (GitHub auth) and the fix's paired-eval loop (the Docker eval harness) need infra the sandbox lacks.

rodneykinney

Do you have a transcript from a session using this skill? I'd be interested to see one. I'd like to give it a trial run myself and get some more informed feedback.

The steps are pretty complicated, so I would be tempted to decompose it into a router + workflow. The existence of "If stopping here" instructions is another clue that this would be helpful.

I'm noticing that we actually have three categories of skills now, and the internal: true/false flag is feeling overloaded. This skill and the research-challenge skill are really dev skills, related to the development of the plugins, while the other internal skills are really beta: less-stable skills meant for external users.
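One way to picture the three categories described above is an explicit category field in place of the boolean flag. The following is a minimal sketch only: SkillCategory, Skill, and the "stable" label for non-internal skills are hypothetical names, not the plugin's actual metadata schema.

```python
from dataclasses import dataclass
from enum import Enum


class SkillCategory(Enum):
    """Hypothetical replacement for the overloaded internal: true/false flag."""

    STABLE = "stable"  # assumption: label for skills that are not internal
    BETA = "beta"      # less-stable skills meant for external users
    DEV = "dev"        # skills for developing the plugins themselves


@dataclass(frozen=True)
class Skill:
    name: str
    category: SkillCategory

    @property
    def internal(self) -> bool:
        # Backward-compatible view: both beta and dev map to internal: true.
        return self.category is not SkillCategory.STABLE


if __name__ == "__main__":
    for skill in (
        Skill("improve-skills", SkillCategory.DEV),
        Skill("research-challenge", SkillCategory.DEV),
    ):
        print(skill.name, skill.category.value, skill.internal)
```

Deriving internal from the category would let existing consumers of the boolean keep working while new code distinguishes dev from beta.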

rodneykinney · 2026-06-10T17:46:53Z

I used this skill to address research-challenge feedback. It created #76, which looks pretty good. The issue it corrected wasn't reproducible in the eval sandbox, so I wasn't able to test that workflow. I'll run another case to understand how the asta-bench integration works.

jbragg · 2026-06-11T15:05:21Z

> The steps are pretty complicated, so I would be tempted to decompose it into a router + workflow. The existence of "If stopping here" instructions is another clue that this would be helpful.
> I'm noticing that we actually have three categories of skills now, and the internal: true/false flag is feeling overloaded. This skill and the research-challenge skill are really dev skills, related to the development of the plugins, while the other internal skills are really beta: less-stable skills meant for external users.

I agree decomposing could help, particularly if users frequently stop early and just report problems, but I'm not sure that path will be exercised often. It would also be nice to do a proper empirical comparison of performance if and when we decompose.

I think we want end users to be able to share feedback to help improve the system. Maybe this should eventually be made possible through a mechanism that's not a skill, and most users probably won't want to patch our code (but I like allowing for that). Certainly those skills don't need to be active all the time if having them present harms performance.

rodneykinney

I agree that we don't need to restructure the skill until we have some empirical way of validating the change.

We can have an offline discussion about the skill feedback/improvement workflow. Right now, this and the research-challenge skill both reference private repos, so they are not even executable by external users.

improve-skills: report, fix, validate workflow for Asta skills

cf973c3

jbragg force-pushed the improve-skills branch from 6acf6e4 to cf973c3 on June 10, 2026 05:28

jbragg requested a review from rodneykinney June 10, 2026 05:29

rodneykinney reviewed Jun 10, 2026


rodneykinney self-requested a review June 11, 2026 16:37

rodneykinney approved these changes Jun 11, 2026


jbragg merged commit c98f1cd into main Jun 11, 2026
7 checks passed

jbragg deleted the improve-skills branch June 11, 2026 16:52

jbragg merged 1 commit into main from improve-skills
