bench: re-run with-superpowers vs without-superpowers after grader fix #989

@christso

Description

Objective

Re-run the with-superpowers vs without-superpowers skill benchmarks now that the LLM grader bug (#982, PR #983) is fixed.

Context

The initial benchmark runs in agentv-bench-skills produced misleading results because the with-superpowers eval used prompt: in assertions, which triggered bug #982 (bare prompt replacing entire grader template). The scores showed with-superpowers performing worse, but this was a grader artifact, not a real skill degradation.

Previous results (tainted by bug)

Experiment           Target  Pass Rate  Avg Score
without-superpowers  gemini  100%       1.000
without-superpowers  azure   50%        0.735
with-superpowers     gemini  50%        0.750
with-superpowers     azure   0%         0.250
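
To make the apparent regression concrete, the sketch below subtracts the without-superpowers average score from the with-superpowers average score for each target, using only the tainted numbers from the table above. The function names `previous_results` and `score_deltas` are hypothetical helpers for illustration and are not part of agentv.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Rows copied from the table above: experiment, target, average score.
previous_results() {
  cat <<'EOF'
without-superpowers gemini 1.000
without-superpowers azure 0.735
with-superpowers gemini 0.750
with-superpowers azure 0.250
EOF
}

# Print "<target> <delta>" per target, where
# delta = with-superpowers score - without-superpowers score.
score_deltas() {
  awk '
    $1 == "without-superpowers" { base[$2] = $3 }
    $1 == "with-superpowers"    { treated[$2] = $3 }
    END {
      for (t in base) printf "%s %+.3f\n", t, treated[t] - base[t]
    }
  ' | sort
}

previous_results | score_deltas
# prints:
# azure -0.485
# gemini -0.250
```

Both deltas are negative, which is the pattern the grader bug produced. The same calculation can be repeated on the re-run scores to see whether the gap persists once the fix is applied.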

Steps

  1. Merge PR #983, "fix(core): extract system messages in prompt builder for LLM grader" (the grader fix)
  2. Install superpowers: cd agentv-bench-skills && ./scripts/setup-superpowers.sh
  3. Re-run both experiments against at least 2 providers:
    agentv eval evals/reasoning/logic-puzzle.EVAL.yaml --target gemini --experiment without-superpowers
    agentv eval evals/reasoning/logic-puzzle.EVAL.yaml --target gemini --experiment with-superpowers
    agentv eval evals/reasoning/logic-puzzle.EVAL.yaml --target azure --experiment without-superpowers
    agentv eval evals/reasoning/logic-puzzle.EVAL.yaml --target azure --experiment with-superpowers
  4. Verify the comparison view in studio shows accurate results
  5. Document findings
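
The four commands in step 3 can be generated with a loop over targets and experiments, as in the minimal sketch below. It uses only the `agentv eval` invocation and the `--target` and `--experiment` flags shown above. The `run_matrix` function and the `DRY_RUN` switch are hypothetical conveniences, not agentv features. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

EVAL_FILE="evals/reasoning/logic-puzzle.EVAL.yaml"
TARGETS=(gemini azure)
EXPERIMENTS=(without-superpowers with-superpowers)

# Run (or, in dry-run mode, print) one eval per target/experiment pair.
run_matrix() {
  local dry_run="${DRY_RUN:-1}"   # hypothetical switch: 1 = print only
  local target experiment
  for target in "${TARGETS[@]}"; do
    for experiment in "${EXPERIMENTS[@]}"; do
      local cmd=(agentv eval "$EVAL_FILE" --target "$target" --experiment "$experiment")
      if [ "$dry_run" = "1" ]; then
        printf '%s\n' "${cmd[*]}"
      else
        "${cmd[@]}"
      fi
    done
  done
}

run_matrix
```

Run it from the agentv-bench-skills directory so the relative eval path resolves. Adding a third provider only requires appending its name to `TARGETS`.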

Blocked by

Acceptance criteria

  • Clean benchmark data with grader fix applied
  • Comparison matrix in studio shows valid with/without-superpowers comparison
  • Results committed to agentv-bench-skills repo

Metadata

Labels

in-progress: claimed by an agent; do not duplicate work

Status

Done
