improve-skills: report, fix, validate — a workflow for Asta skills#75
rodneykinney
left a comment
Do you have a transcript from a session using this skill? I'd be interested to see one. I'd like to give it a trial run myself and get some more informed feedback.
The steps are pretty complicated, so I would be tempted to decompose it into a router + workflow. The existence of "If stopping here" instructions is another clue that this would be helpful.
I'm noticing that we actually have three categories of skills now, and the internal: true/false flag is feeling overloaded. This skill and the research-challenge skill are really dev skills, related to the development of the plugins, while the other internal skills are really beta: less-stable skills meant for external users.
Used this skill to address research-challenge feedback. It created #76, which looks pretty good. The issue it corrected wasn't reproducible in the eval sandbox, so I wasn't able to test that workflow. I'll run another case to understand how the
I agree decomposing could help, particularly if users frequently stop early and just report problems. But I'm not sure that path will be exercised frequently. It would also be nice to do a proper empirical comparison of performance if/when we decompose. I think we want end users to be able to share feedback to help improve the system. Maybe this should eventually be made possible through a mechanism that's not a skill, and most users probably won't want to patch our code (but I like allowing for that). Certainly those skills don't need to be active all the time if having them present harms performance.
rodneykinney
left a comment
I agree that we don't need to restructure the skill until we have some empirical way of validating the change.
We can have an offline discussion about the skill feedback/improvement workflow. Right now, this and the research-challenge skill both reference private repos, so they are not even executable by external users.
Adds an `improve-skills` skill (preview, internal) that turns a behavior gap (a skill that misbehaves or lacks a capability you want) into a reproducing eval case and, carried further, a paired-eval-validated fix, handed off at either the report or the validated-fix stage. Also adds an "Improving skills" README section (the user-facing report/fix path) and a DEVELOPER.md pointer (the contributor flow: changing a skill and regression-checking it before merging). See those for the workflow.
Companion: allenai/asta-bench-private#229, the validating eval case `improve_skills_report_problem`.
Validation
`claude_code 2.1.153` · `sonnet-4-6` · `ghcr.io/allenai/asta:v0.18.0 @sha256:fef908d7a573…`, 3 epochs, baseline (main) vs this PR.
Cases and metrics: `improve_skills_report_problem` (`improve_skills_activated`, `capture_written`); `view_agent_output` (guard) (`workspace_skill_activated`).
`capture_written` is the real signal: with the skill, the agent invokes improve-skills and writes a dated capture under `.asta/improve-skills/` (then a blocked `gh issue create`). Without it the agent improvises: in the baseline runs it edited the workspace skill in-sandbox, or just explained the problem in chat, never producing a durable, reportable capture.
The `view_agent_output` guard (a precaution) holds: workspace fires 3/3 in the PR arm. (Baseline 0/3: the agent asks for context instead of opening the project; with improve-skills present it invokes workspace, so this is not a regression.)
Scope: only the report path (routing + capture) runs in-sandbox; filing the issue (GitHub auth) and the fix's paired-eval loop (the Docker eval harness) need infra the sandbox lacks.
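For readers who want to tabulate a baseline-vs-PR comparison like the one above, here is a minimal Python sketch. It is not the eval harness's code: the function names `fire_counts` and `compare_arms` and the per-epoch boolean layout are assumptions made for illustration. The only data used is the `workspace_skill_activated` result stated above (baseline 0/3, PR arm 3/3).

```python
from typing import Dict, List


def fire_counts(epochs: List[bool]) -> str:
    """Format per-epoch fired/not-fired flags as 'fired/total'."""
    return f"{sum(epochs)}/{len(epochs)}"


def compare_arms(
    baseline: Dict[str, List[bool]],
    pr: Dict[str, List[bool]],
) -> Dict[str, Dict[str, str]]:
    """Pair up each metric's fire count in the baseline arm and the PR arm."""
    metrics = sorted(set(baseline) | set(pr))
    return {
        metric: {
            "baseline": fire_counts(baseline.get(metric, [])),
            "pr": fire_counts(pr.get(metric, [])),
        }
        for metric in metrics
    }


# Hypothetical per-epoch layout; the counts come from the description:
# workspace fires 0/3 on baseline (main) and 3/3 in the PR arm.
baseline = {"workspace_skill_activated": [False, False, False]}
pr = {"workspace_skill_activated": [True, True, True]}

summary = compare_arms(baseline, pr)
for metric, arms in summary.items():
    print(f"{metric}: baseline {arms['baseline']}, PR {arms['pr']}")
```

Keeping the raw per-epoch flags, rather than only the totals, makes it easy to add epochs later without changing how the comparison is reported.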