Objective
Re-run the with-superpowers vs without-superpowers skill benchmarks now that the LLM grader bug (#982, PR #983) is fixed.
Context
The initial benchmark runs in agentv-bench-skills produced misleading results because the with-superpowers eval used prompt: in assertions, which triggered bug #982 (bare prompt replacing entire grader template). The scores showed with-superpowers performing worse, but this was a grader artifact, not a real skill degradation.
Previous results (tainted by bug)
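To make the taint concrete, here is a minimal sketch of how a bare prompt replacing the entire grader template can distort scores. This is not the agentv code: the template text, the function names, and the "fixed" variant are all hypothetical, and the actual change in PR #983 may differ.

```python
from typing import Optional

# Hypothetical grader template; the real agentv template is not shown in this issue.
DEFAULT_GRADER_TEMPLATE = (
    "You are a grader. Score the answer from 0 to 1.\n"
    "Criteria: {criteria}\n"
    "Answer: {answer}\n"
    'Reply as JSON: {{"score": <number>, "reason": <string>}}'
)


def build_grader_prompt_buggy(answer: str, prompt: Optional[str] = None) -> str:
    """Assumed buggy behavior: a bare prompt replaces the whole template."""
    if prompt is not None:
        # The answer under test and the output format never reach the grader.
        return prompt
    return DEFAULT_GRADER_TEMPLATE.format(criteria="(default)", answer=answer)


def build_grader_prompt_fixed(answer: str, prompt: Optional[str] = None) -> str:
    """One plausible fix: use the prompt as criteria inside the template."""
    criteria = prompt if prompt is not None else "(default)"
    return DEFAULT_GRADER_TEMPLATE.format(criteria=criteria, answer=answer)
```

In this sketch, only evals that set a prompt are affected, which matches the pattern described above: the with-superpowers eval was graded without the surrounding template, so its scores were not comparable to the without-superpowers run.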
Steps
cd agentv-bench-skills && ./scripts/setup-superpowers.sh
Blocked by
Acceptance criteria