
Improve agent-plugin-review skill to pass remaining 3 eval tests #779

@christso

Summary

The agent-plugin-review skill passes 6/9 eval tests against pi-cli (mean score 0.722). Three tests fail consistently:

| Test | Score | Issue |
|---|---|---|
| detect-relative-file-paths | 0.500 | Partially detected: the skill mentions the leading / but the agent doesn't consistently flag it |
| detect-repeated-inputs | 0.000 | Missed: the agent doesn't suggest a top-level input for repeated file references |
| detect-missing-hard-gates | 0.000 | Missed: the agent doesn't flag missing artifact existence checks between phases |
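As a sanity check on the reported mean, the figure can be reproduced from the scores above. This is a minimal sketch that assumes each of the six passing tests scores 1.0, which the issue does not state explicitly.

```typescript
// Assumption: every passing test scores 1.0 (the issue lists only the failing scores).
const passingScores: number[] = new Array<number>(6).fill(1.0);
const failingScores: number[] = [0.5, 0.0, 0.0];

// Arithmetic mean of a non-empty list of scores.
function meanScore(scores: number[]): number {
  const total = scores.reduce((sum, s) => sum + s, 0);
  return total / scores.length;
}

const mean = meanScore(passingScores.concat(failingScores));
console.log(mean.toFixed(3)); // prints "0.722" (6.5 / 9)
```

Under that assumption, passing all three remaining tests would move the mean from 0.722 to 1.000.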

Approach

Use the agentv-bench eval-driven iteration loop:

  1. Analyze the failing test transcripts to understand what the agent does instead
  2. Identify which SKILL.md instructions are unclear or missing
  3. Make targeted edits to the skill
  4. Re-run evals to verify improvement
  5. Repeat until all 9 pass

Possible improvements

  • Relative file paths: Add an explicit checklist item about checking type: file values in eval YAML
  • Repeated inputs: Add guidance about the top-level input field from AgentV eval docs
  • Hard gates: Make the workflow-checklist.md more prescriptive about what to look for (artifact existence checks at the start of each phase skill)
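To make the last two checks concrete, here is a rough sketch of what a reviewer (human or agent) would be looking for. The `EvalTest` shape, the function names, and the file names are all hypothetical and are not taken from the AgentV schema or the skill itself; the existence check is passed in as a predicate so the logic stays independent of any filesystem API.

```typescript
// Hypothetical shape for illustration only; not the AgentV eval schema.
interface EvalTest {
  id: string;
  files: string[]; // file references used by this test
}

// Repeated inputs: return file references that appear in every test.
// These are candidates for hoisting into a single shared, top-level input.
function repeatedFileRefs(tests: EvalTest[]): string[] {
  if (tests.length < 2) return [];
  const counts = new Map<string, number>();
  tests.forEach((test) => {
    // De-duplicate within one test so a file is counted once per test.
    const unique = test.files.filter((f, i) => test.files.indexOf(f) === i);
    unique.forEach((file) => counts.set(file, (counts.get(file) ?? 0) + 1));
  });
  const repeated: string[] = [];
  counts.forEach((n, file) => {
    if (n === tests.length) repeated.push(file);
  });
  return repeated.sort();
}

// Hard gate: list the required artifacts that are absent before a phase starts.
// An empty result means the phase may proceed.
function missingArtifacts(
  required: string[],
  exists: (path: string) => boolean,
): string[] {
  return required.filter((path) => !exists(path));
}

// Usage with made-up file names.
const sampleTests: EvalTest[] = [
  { id: "a", files: ["SKILL.md", "a.yaml"] },
  { id: "b", files: ["SKILL.md", "b.yaml"] },
];
console.log(repeatedFileRefs(sampleTests)); // only "SKILL.md" is shared by both tests

const present = ["plan.md"];
const missing = missingArtifacts(
  ["plan.md", "design.md"],
  (p) => present.indexOf(p) !== -1,
);
console.log(missing); // only "design.md" is missing
```

A phase skill that lacks a check equivalent to `missingArtifacts` at its start is the kind of gap the detect-missing-hard-gates test expects the review to flag.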

Eval command

bun run --filter @agentv/core build && bun apps/cli/src/cli.ts eval evals/agentic-engineering/agent-plugin-review.eval.yaml --target pi-cli

Note: the first half of the command rebuilds the @agentv/core dist; this is required before running the eval if the core source was modified.

Related

Status: Done
