Summary
The agent-plugin-review skill passes 6/9 eval tests against pi-cli (mean score 0.722). Three tests fail consistently:
| Test | Score | Issue |
| --- | --- | --- |
| detect-relative-file-paths | 0.500 | Partially detected: the skill mentions a leading / but the agent doesn't consistently flag it |
| detect-repeated-inputs | 0.000 | Missed: the agent doesn't suggest a top-level input for repeated file references |
| detect-missing-hard-gates | 0.000 | Missed: the agent doesn't flag missing artifact existence checks between phases |
Approach
Use the agentv-bench eval-driven iteration loop:
- Analyze the failing test transcripts to understand what the agent does instead
- Identify which SKILL.md instructions are unclear or missing
- Make targeted edits to the skill
- Re-run evals to verify improvement
- Repeat until all 9 pass
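The stopping condition of this loop can be sketched in TypeScript. This is a minimal illustration, not AgentV's API: the `EvalResult` shape, the `PASS_THRESHOLD` value, and the placeholder names for the passing tests are assumptions. Only the three failing scores come from this issue. Assuming the six passing tests each score 1.0, the mean reproduces the reported 0.722.

```typescript
// Hypothetical result shape; AgentV's real output format may differ.
interface EvalResult {
  test: string;
  score: number; // between 0.0 and 1.0
}

// Assumed pass threshold; the issue does not state the real one.
const PASS_THRESHOLD = 0.8;

// Names of the tests that still score below the threshold.
function failingTests(
  results: EvalResult[],
  threshold: number = PASS_THRESHOLD,
): string[] {
  return results.filter((r) => r.score < threshold).map((r) => r.test);
}

// Arithmetic mean of all scores (0 for an empty run).
function meanScore(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, r) => sum + r.score, 0);
  return total / results.length;
}

// The three failing scores are taken from this issue; the six passing
// tests are placeholders assumed to score 1.0.
const results: EvalResult[] = [
  ...Array.from({ length: 6 }, (_, i) => ({
    test: `passing-test-${i + 1}`,
    score: 1.0,
  })),
  { test: "detect-relative-file-paths", score: 0.5 },
  { test: "detect-repeated-inputs", score: 0.0 },
  { test: "detect-missing-hard-gates", score: 0.0 },
];

// The iteration loop stops once this list is empty.
const stillFailing = failingTests(results);
console.log(stillFailing, meanScore(results).toFixed(3));
```

Tracking the list of failing test names, rather than only the mean, makes it visible when an edit to SKILL.md fixes one test but regresses another.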
Possible improvements
- Relative file paths: Add an explicit checklist item about checking type: file values in eval YAML
- Repeated inputs: Add guidance about the top-level input field from AgentV eval docs
- Hard gates: Make the workflow-checklist.md more prescriptive about what to look for (artifact existence checks at the start of each phase skill)
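To make the hard-gate item concrete, here is a minimal TypeScript sketch of an artifact existence check at the start of a phase. It is only an illustration of the pattern the reviewer should look for: the function names, phase names, and artifact paths are hypothetical, and real phase skills may express the gate as prose instructions rather than code.

```typescript
import { existsSync } from "node:fs";

// Hypothetical helper: returns the required artifacts that are missing.
// `exists` is injectable so the check can be exercised without touching disk.
function missingArtifacts(
  required: string[],
  exists: (path: string) => boolean = existsSync,
): string[] {
  return required.filter((path) => !exists(path));
}

// Hard gate: refuse to start a phase when an artifact from an earlier
// phase is absent, instead of silently continuing.
function assertPhaseCanStart(
  phase: string,
  required: string[],
  exists: (path: string) => boolean = existsSync,
): void {
  const missing = missingArtifacts(required, exists);
  if (missing.length > 0) {
    throw new Error(
      `Cannot start phase "${phase}": missing ${missing.join(", ")}`,
    );
  }
}
```

A phase skill without an equivalent check at its start is what the detect-missing-hard-gates test expects the agent to flag.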
Eval command
bun run --filter @agentv/core build && bun apps/cli/src/cli.ts eval evals/agentic-engineering/agent-plugin-review.eval.yaml --target pi-cli
Note: the @agentv/core dist must be rebuilt before running if the core source was modified; the build step at the start of the command above handles this.
Related