improve-skills: report, fix, validate — a workflow for Asta skills#75

Merged: jbragg merged 1 commit into main from improve-skills on Jun 11, 2026
@jbragg jbragg commented Jun 10, 2026

Collaborator

Adds an improve-skills skill (preview, internal) that turns a behavior gap (a skill that misbehaves or lacks a capability you want) into a reproducing eval case and, carried further, into a fix validated by paired evals. The work can be handed off at either the report stage or the validated-fix stage. Also adds an "Improving skills" README section (the user-facing report/fix path) and a DEVELOPER.md pointer (the contributor flow: changing a skill and regression-checking it before merging). See those for the workflow.

Companion: allenai/asta-bench-private#229 — the validating eval case improve_skills_report_problem.

Validation

claude_code 2.1.153 · sonnet-4-6 · ghcr.io/allenai/asta:v0.18.0 @sha256:fef908d7a573…, 3 epochs, baseline (main) vs this PR.

| case | metric | baseline | PR |
| --- | --- | --- | --- |
| improve_skills_report_problem | improve_skills_activated | 0/3 | 3/3 |
| improve_skills_report_problem | capture_written | 0/3 | 3/3 |
| view_agent_output (guard) | workspace_skill_activated | 0/3 | 3/3 |

capture_written is the real signal. With the skill, the agent invokes improve-skills and writes a dated capture under .asta/improve-skills/, then attempts gh issue create, which is blocked in the sandbox. Without it, the agent improvises: in the baseline runs it either edited the workspace skill in-sandbox or just explained the problem in chat, never producing a durable, reportable capture.
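As a rough illustration of what a capture_written-style check might look for, the sketch below tests whether a workspace contains at least one dated entry under .asta/improve-skills/. The actual grader lives in the companion eval case and is not shown in this PR; the function name, the YYYY-MM-DD date pattern, and the choice to match on file or directory names are all assumptions.

```python
import re
from pathlib import Path

# Assumption: "dated" means the entry name contains a YYYY-MM-DD date.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def capture_written(workspace: Path) -> bool:
    """Return True if a dated capture exists under .asta/improve-skills/.

    Hypothetical helper for illustration; not the eval's real grader.
    """
    capture_dir = workspace / ".asta" / "improve-skills"
    if not capture_dir.is_dir():
        return False
    # Accept any file or directory, at any depth, whose name carries a date.
    return any(DATE_RE.search(entry.name) for entry in capture_dir.rglob("*"))
```

A check like this only passes when something durable lands on disk, which is why it separates the PR arm from baseline runs that explained the problem in chat.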

The view_agent_output guard (a precaution) holds: workspace fires 3/3 in the PR arm. The baseline is 0/3 because the agent asks for context instead of opening the project; with improve-skills present it invokes workspace, so the change is not a regression.
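The "not a regression" reading can be made mechanical. A minimal sketch, assuming a regression means the PR arm passes fewer epochs than the baseline arm on the same metric; the MetricResult type is hypothetical and the numbers are copied from the results table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    """One metric of one eval case, compared across two arms (hypothetical type)."""

    case: str
    metric: str
    baseline_passes: int
    pr_passes: int
    epochs: int

    def regressed(self) -> bool:
        # Assumption: a regression is strictly fewer passing epochs in the PR arm.
        return self.pr_passes < self.baseline_passes

    def row(self) -> str:
        return (
            f"{self.case} {self.metric} "
            f"{self.baseline_passes}/{self.epochs} {self.pr_passes}/{self.epochs}"
        )


# Values taken from the Validation table (3 epochs per arm).
RESULTS = [
    MetricResult("improve_skills_report_problem", "improve_skills_activated", 0, 3, 3),
    MetricResult("improve_skills_report_problem", "capture_written", 0, 3, 3),
    MetricResult("view_agent_output", "workspace_skill_activated", 0, 3, 3),
]

regressions = [r for r in RESULTS if r.regressed()]
```

With only 3 epochs per arm, a comparison like this flags clear drops but says little about small differences.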

Scope: only the report path (routing + capture) runs in-sandbox; filing the issue (GitHub auth) and the fix's paired-eval loop (the Docker eval harness) need infra the sandbox lacks.

@jbragg jbragg requested a review from rodneykinney June 10, 2026 05:29

@rodneykinney rodneykinney left a comment

Member

Do you have a transcript from a session using this skill? I'd be interested to see one. I'd like to give it a trial run myself and get some more informed feedback.

The steps are pretty complicated, so I would be tempted to decompose it into a router + workflow. The existence of "If stopping here" instructions is another clue that this would be helpful.

I'm noticing that we actually have three categories of skills now, and the internal: true/false flag is feeling overloaded. This, and the research-challenge skill are really dev skills, related to the development of the plugins, while the other internal skills are really beta: less-stable skills meant for external users
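To make the overloading concrete, here is a hypothetical sketch (not the plugin's actual schema) in which a three-valued category replaces the boolean internal flag. The SkillCategory enum, its value names, and the reading of internal: false as "stable" are illustrative assumptions; only the two dev-skill names come from the comment above.

```python
from enum import Enum


class SkillCategory(Enum):
    """Hypothetical three-way split; names are assumptions, not repo fields."""

    STABLE = "stable"  # assumed meaning of internal: false
    BETA = "beta"      # less-stable skills meant for external users
    DEV = "dev"        # skills for developing the plugins themselves


def categories_for_internal_flag(internal: bool) -> set:
    """Show which categories a boolean flag can stand for."""
    if internal:
        # The overload: one flag value covers two distinct categories.
        return {SkillCategory.BETA, SkillCategory.DEV}
    return {SkillCategory.STABLE}


# The two skills the comment identifies as dev skills.
DEV_SKILLS = {
    "improve-skills": SkillCategory.DEV,
    "research-challenge": SkillCategory.DEV,
}
```

Under this reading, internal: true cannot distinguish a beta skill from a dev skill, which is the ambiguity the comment points at.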

@rodneykinney

Member

Used this skill to address research-challenge feedback. It created #76, which looks pretty good. The issue it corrected wasn't reproducible in the eval sandbox, so I wasn't able to test that workflow. I'll run another case to understand how the asta-bench integration works.

@jbragg

jbragg commented Jun 11, 2026

Collaborator Author

> The steps are pretty complicated, so I would be tempted to decompose it into a router + workflow. The existence of "If stopping here" instructions is another clue that this would be helpful.
>
> I'm noticing that we actually have three categories of skills now, and the internal: true/false flag is feeling overloaded. This, and the research-challenge skill are really dev skills, related to the development of the plugins, while the other internal skills are really beta: less-stable skills meant for external users

I agree that decomposing could help, particularly if stopping early and just reporting problems turns out to be common. But I'm not sure that path will be exercised frequently. It would also be nice to do a proper empirical comparison of performance if/when we decompose.

I think we want end users to be able to share feedback to help improve the system. Maybe this should eventually be made possible through a mechanism that's not a skill, and most users probably won't want to patch our code (but I like allowing for that). Certainly those skills don't need to be active all the time if having them present harms performance.

@rodneykinney rodneykinney self-requested a review June 11, 2026 16:37

@rodneykinney rodneykinney left a comment

Member

I agree that we don't need to restructure the skill until we have some empirical way of validating the change.

We can have an offline discussion about the skill feedback/improvement workflow. Right now, this and the research-challenge skill both reference private repos, so they are not even executable by external users.

@jbragg jbragg merged commit c98f1cd into main Jun 11, 2026
7 checks passed
@jbragg jbragg deleted the improve-skills branch June 11, 2026 16:52