bench: re-run with-superpowers vs without-superpowers after grader fix

## Objective

Re-run the `with-superpowers` vs `without-superpowers` skill benchmarks once the fix for the LLM grader bug (#982, PR #983) is merged.

## Context

The initial benchmark runs in `agentv-bench-skills` produced misleading results. The `with-superpowers` eval used `prompt:` in its assertions, which triggered bug #982: a bare prompt replaced the entire grader template. The scores showed `with-superpowers` performing worse, but that was a grader artifact, not a real skill degradation.
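
To see which other evals might have hit the same bug, one option is a plain text search for `prompt:` keys in eval files. This is only a rough sketch: the `find_prompt_evals` helper is hypothetical, the `*.EVAL.yaml` suffix and `evals` directory are assumed from the path used in the steps, and the match is textual, so it will also flag `prompt:` keys that are not inside assertions.

```bash
#!/usr/bin/env bash
set -euo pipefail

# List eval files under a directory that contain a `prompt:` key.
# Assumption: eval files end in .EVAL.yaml, as in
# evals/reasoning/logic-puzzle.EVAL.yaml.
find_prompt_evals() {
  local dir="${1:-evals}"
  # Nothing to search if the directory is missing.
  [ -d "$dir" ] || return 0
  # grep exits 1 when nothing matches; treat that as "no files", not an error.
  grep -rlE --include='*.EVAL.yaml' '^[[:space:]-]*prompt:' "$dir" || true
}

find_prompt_evals "${1:-evals}"
```

Any file it lists is a candidate for re-running after the fix, not proof that its earlier scores were wrong.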

## Previous results (tainted by bug)

| Experiment | Target | Pass Rate | Avg Score |
|---|---|---|---|
| without-superpowers | gemini | 100% | 1.000 |
| without-superpowers | azure | 50% | 0.735 |
| with-superpowers | gemini | 50% | 0.750 |
| with-superpowers | azure | 0% | 0.250 |
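
To put a number on the apparent regression, the per-target gap in average score (`with-superpowers` minus `without-superpowers`) can be computed from the table. This is a small sketch: the values are copied from the table, and the `scores` and `score_gaps` helpers are hypothetical names used only for illustration. Since the inputs are tainted by the grader bug, the gaps measure the artifact, not real skill performance.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Average scores copied from the table: experiment, target, avg score.
scores() {
  cat <<'EOF'
without-superpowers gemini 1.000
without-superpowers azure 0.735
with-superpowers gemini 0.750
with-superpowers azure 0.250
EOF
}

# For each target, print with-superpowers minus without-superpowers.
score_gaps() {
  scores | awk '
    $1 == "without-superpowers" { base[$2] = $3 }
    $1 == "with-superpowers"    { treat[$2] = $3 }
    END {
      for (t in base) printf "%s %+.3f\n", t, treat[t] - base[t]
    }
  ' | sort
}

score_gaps
# azure -0.485
# gemini -0.250
```

The same calculation on the re-run data gives a direct before/after check on whether the gap closes.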

## Steps

1. Merge PR #983 (grader fix)
2. Install superpowers: `cd agentv-bench-skills && ./scripts/setup-superpowers.sh`
3. Re-run both experiments against at least two providers (`gemini` and `azure` here):
   ```bash
   agentv eval evals/reasoning/logic-puzzle.EVAL.yaml --target gemini --experiment without-superpowers
   agentv eval evals/reasoning/logic-puzzle.EVAL.yaml --target gemini --experiment with-superpowers
   agentv eval evals/reasoning/logic-puzzle.EVAL.yaml --target azure --experiment without-superpowers
   agentv eval evals/reasoning/logic-puzzle.EVAL.yaml --target azure --experiment with-superpowers
   ```
4. Verify the comparison view in studio shows accurate results
5. Document findings
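
The four commands in step 3 can also be generated by a loop over targets and experiments, which makes it easier to add a third provider later. This is a sketch, assuming the same eval file and flags as above. The `run_matrix` function and the `DRY_RUN` variable are hypothetical helpers, not part of `agentv`. By default it only prints the commands.

```bash
#!/usr/bin/env bash
set -euo pipefail

EVAL_FILE="evals/reasoning/logic-puzzle.EVAL.yaml"
TARGETS=(gemini azure)
EXPERIMENTS=(without-superpowers with-superpowers)

# Print each command by default; set DRY_RUN=0 to execute them instead.
run_matrix() {
  local target experiment
  for target in "${TARGETS[@]}"; do
    for experiment in "${EXPERIMENTS[@]}"; do
      local cmd=(agentv eval "$EVAL_FILE" --target "$target" --experiment "$experiment")
      if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "${cmd[*]}"
      else
        "${cmd[@]}"
      fi
    done
  done
}

run_matrix
```

Run it with `DRY_RUN=0` from inside `agentv-bench-skills` to execute the runs. Because of `set -e`, the script stops at the first failing run.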

## Blocked by

- #982 / PR #983 (grader fix must be merged first)

## Acceptance criteria

- [ ] Clean benchmark data with grader fix applied
- [ ] Comparison matrix in studio shows valid with/without-superpowers comparison
- [ ] Results committed to agentv-bench-skills repo

bench: re-run with-superpowers vs without-superpowers after grader fix #989
