[FEATURE] Add prompt benchmark suite to compare skill repos on popular tasks #114

@luongnv89

Description

Problem Statement

ASM indexes a growing number of skill sources, but there is no standard benchmark set of prompts for comparing how different skill repos perform on common tasks.

Without a benchmark, it is hard to answer questions like:

  • Which repo is strongest for code review?
  • Which repo is strongest for planning, product, docs, or release tasks?
  • Which repos overlap heavily?
  • Which repo should a user prefer for a given kind of task?

Proposed Solution

Add a prompt-based benchmark suite for popular tasks, then use it to compare skill repos.

Example benchmark categories:

  • code review
  • bug triage
  • implementation planning
  • product/PRD work
  • release management
  • documentation generation
  • research / analysis

The benchmark should:

  1. define representative prompts for each task category
  2. run/evaluate them against different skill repos
  3. track which skills are selected and how consistently
  4. produce comparison output across repos
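The four steps above could be sketched roughly as follows. This is a minimal illustration, not ASM's actual data model: `BenchmarkPrompt`, `RunResult`, and the function names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical shape of one benchmark prompt (step 1); ASM's real schema may differ.
@dataclass
class BenchmarkPrompt:
    category: str   # e.g. "code-review"
    text: str       # the prompt evaluated against each skill repo

# One evaluation run (steps 2-3): which repo was tested, which skill was selected.
@dataclass
class RunResult:
    repo: str
    selected_skill: str

def tally_selections(results: list[RunResult]) -> dict[str, Counter]:
    """Count, per repo, how often each skill was selected across runs (step 3)."""
    tallies: dict[str, Counter] = {}
    for r in results:
        tallies.setdefault(r.repo, Counter())[r.selected_skill] += 1
    return tallies

def consistency(counter: Counter) -> float:
    """Fraction of runs that picked the most common skill (step 4 input).
    1.0 means every run routed to the same skill."""
    total = sum(counter.values())
    return counter.most_common(1)[0][1] / total if total else 0.0
```

A comparison report (step 4) would then aggregate `consistency` per repo and per category.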

Example repos to compare:

  • official Anthropic skills
  • Superpower
  • other major public skill repos indexed by ASM

Suggested Output

For each benchmark task:

  • prompt text / task category
  • repo(s) evaluated
  • selected skill(s)
  • frequency / consistency across runs
  • overlap / ambiguity notes
  • summary comparison by repo
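One possible serialization of a single benchmark result, covering the fields listed above. All field names and repo identifiers here are illustrative placeholders, not a proposed final schema:

```python
import json

# Illustrative record for one benchmark task; values are made up for the example.
record = {
    "category": "code-review",
    "prompt": "Review this diff for correctness and style issues.",
    "repos_evaluated": ["example-official-skills", "example-community-skills"],
    "selections": {
        "example-official-skills": {"skill": "code-review", "runs": 10, "consistency": 0.9},
        "example-community-skills": {"skill": "pr-review", "runs": 10, "consistency": 0.7},
    },
    "overlap_notes": "Both repos route to a review skill; naming and scope differ.",
}

print(json.dumps(record, indent=2))
```

Emitting JSON per task would make the summary comparison by repo straightforward to build downstream.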

Optional future metrics:

  • success/fit rating
  • routing entropy / confidence proxy
  • task coverage score by repo
  • recommended repo(s) by task type
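The routing-entropy proxy mentioned above could be computed from the same per-run selection data. A minimal sketch, assuming selections are recorded as a list of skill names per prompt:

```python
import math
from collections import Counter

def routing_entropy(selections: list[str]) -> float:
    """Shannon entropy (in bits) of the skill-selection distribution for one prompt.
    0.0 means every run picked the same skill (decisive routing); higher values
    mean the router spread its choices across more skills (ambiguous routing)."""
    counts = Counter(selections)
    total = len(selections)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Low entropy across a category would suggest the repo has one clearly matching skill; high entropy would flag overlap or ambiguity worth noting in the comparison output.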

Alternatives Considered

  • manually testing repos one by one
  • comparing only repo descriptions/readmes
  • relying only on search/routing heuristics

These approaches do not provide a repeatable benchmark for real usage scenarios.

Use Cases

  1. A user wants to know which skill repo is best for a common task like code review
  2. ASM wants stronger data for ranking and recommendation
  3. Maintainers want to compare overlap and specialization across repos
  4. Skill authors want a public benchmark for routing quality and task coverage

Additional Context

This could become the foundation for:

  • repo comparison pages in the catalog
  • “best repo for this task” recommendations
  • benchmark-driven discovery/ranking
  • evaluation datasets for skill routing behavior

Metadata

Assignees: none
Labels: feature (New feature or request), skill-discovery (Skill search and discovery)
Project status: Backlog
Milestone: none