Problem Statement
ASM is indexing more and more skill sources, but there is no standard benchmark set of prompts to compare how different skill repos perform on common tasks.
Without a benchmark, it is hard to answer questions like:
- which repo is strongest for code review?
- which repo is strongest for planning / product / docs / release tasks?
- which repos overlap heavily?
- which repo should a user prefer for a given kind of task?
Proposed Solution
Add a prompt-based benchmark suite for popular tasks, then use it to compare skill repos.
Example benchmark categories:
- code review
- bug triage
- implementation planning
- product/PRD work
- release management
- documentation generation
- research / analysis
The benchmark should:
- define representative prompts for each task category
- run/evaluate them against different skill repos
- track which skills are selected and how consistently
- produce comparison output across repos
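The run/track/compare steps above could be sketched roughly as follows. This is a minimal illustration, not ASM's actual API: `select_skill` stands in for whatever routing call ASM exposes, and all class and field names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BenchmarkPrompt:
    category: str  # e.g. "code review"
    text: str      # the representative prompt for that category


@dataclass
class TrialResult:
    repo: str
    # skill name -> how many runs selected it
    selections: Counter = field(default_factory=Counter)


def run_benchmark(prompt, repos, select_skill, runs=5):
    """Route `prompt` against each repo `runs` times and tally which
    skill was selected each time.

    `select_skill(repo, prompt)` is a placeholder for the real routing
    mechanism; repeating it lets us measure selection consistency.
    """
    results = []
    for repo in repos:
        result = TrialResult(repo=repo)
        for _ in range(runs):
            result.selections[select_skill(repo, prompt)] += 1
        results.append(result)
    return results
```

Running the same prompt multiple times per repo is what makes the consistency and overlap questions answerable, rather than relying on a single routing decision.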
Example repos to compare:
- official Anthropic skills
- Superpower
- other major public skill repos indexed by ASM
Suggested Output
For each benchmark task:
- prompt text / task category
- repo(s) evaluated
- selected skill(s)
- frequency / consistency across runs
- overlap / ambiguity notes
- summary comparison by repo
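One hypothetical shape for a per-task output row, covering the fields listed above (all field names are illustrative; `selections` is assumed to be a tally of skill choices across repeated runs):

```python
from collections import Counter


def comparison_record(category, prompt_text, repo, selections):
    """Build one benchmark output row.

    `selections` is a Counter mapping selected skill -> number of runs
    in which it was chosen for this prompt/repo pair.
    """
    total = sum(selections.values())
    top_skill, top_count = selections.most_common(1)[0]
    return {
        "category": category,
        "prompt": prompt_text,
        "repo": repo,
        "selected_skills": sorted(selections),
        "top_skill": top_skill,
        # share of runs that picked the most common skill
        "consistency": top_count / total,
        # more than one skill chosen across runs signals overlap/ambiguity
        "ambiguous": len(selections) > 1,
    }
```

A flat record like this is easy to aggregate into the per-repo summary comparison, and to serialize for catalog pages later.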
Optional future metrics:
- success/fit rating
- routing entropy / confidence proxy
- task coverage score by repo
- recommended repo(s) by task type
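The "routing entropy / confidence proxy" metric could be as simple as Shannon entropy over the selection frequencies from repeated runs. A sketch, assuming the same skill-selection tally as above:

```python
import math
from collections import Counter


def routing_entropy(selections):
    """Shannon entropy (in bits) of skill-selection frequencies.

    0.0 means the repo routed the prompt to the same skill on every
    run; higher values mean the routing was less decisive, which can
    serve as an inverse confidence proxy.
    """
    total = sum(selections.values())
    probs = [count / total for count in selections.values()]
    return -sum(p * math.log2(p) for p in probs)
```

For example, a repo that picks one skill every time scores 0.0 bits, while a 50/50 split between two skills scores 1.0 bit.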
Alternatives Considered
- manually testing repos one by one
- comparing only repo descriptions/readmes
- relying only on search/routing heuristics
None of these approaches provides a repeatable, like-for-like benchmark grounded in real usage scenarios.
Use Cases
- A user wants to know which skill repo is best for a common task like code review
- ASM wants stronger data for ranking and recommendation
- Maintainers want to compare overlap and specialization across repos
- Skill authors want a public benchmark for routing quality and task coverage
Additional Context
This could become the foundation for:
- repo comparison pages in the catalog
- “best repo for this task” recommendations
- benchmark-driven discovery/ranking
- evaluation datasets for skill routing behavior