[FEATURE] Add prompt benchmark suite to compare skill repos on popular tasks #114

@luongnv89

Description

Problem Statement

ASM indexes a growing number of skill sources, but there is no standard benchmark set of prompts for comparing how different skill repos perform on common tasks.

Without a benchmark, it is hard to answer questions like:

  • Which repo is strongest for code review?
  • Which repo is strongest for planning, product, docs, or release tasks?
  • Which repos overlap heavily?
  • Which repo should a user prefer for a given kind of task?

Proposed Solution

Add a prompt-based benchmark suite for popular tasks, then use it to compare skill repos.

Example benchmark categories:

  • code review
  • bug triage
  • implementation planning
  • product/PRD work
  • release management
  • documentation generation
  • research / analysis

The benchmark should:

  1. define representative prompts for each task category
  2. run/evaluate them against different skill repos
  3. track which skills are selected and how consistently
  4. produce comparison output across repos
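The four steps above could be sketched roughly as follows. This is a minimal illustration, not ASM's actual data model: `BenchmarkPrompt`, `RunResult`, and the function names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical shape of one benchmark prompt (step 1); ASM's real schema may differ.
@dataclass
class BenchmarkPrompt:
    category: str   # e.g. "code-review"
    text: str       # the prompt evaluated against each skill repo

# One evaluation run (steps 2-3): which repo was tested, which skill was selected.
@dataclass
class RunResult:
    repo: str
    selected_skill: str

def tally_selections(results: list[RunResult]) -> dict[str, Counter]:
    """Count, per repo, how often each skill was selected across runs (step 3)."""
    tallies: dict[str, Counter] = {}
    for r in results:
        tallies.setdefault(r.repo, Counter())[r.selected_skill] += 1
    return tallies

def consistency(counter: Counter) -> float:
    """Fraction of runs that picked the most common skill (step 4 input).
    1.0 means every run routed to the same skill."""
    total = sum(counter.values())
    return counter.most_common(1)[0][1] / total if total else 0.0
```

A comparison report (step 4) would then aggregate `consistency` per repo and per category.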

Example repos to compare:

  • official Anthropic skills
  • Superpower
  • other major public skill repos indexed by ASM

Suggested Output

For each benchmark task:

  • prompt text / task category
  • repo(s) evaluated
  • selected skill(s)
  • frequency / consistency across runs
  • overlap / ambiguity notes
  • summary comparison by repo
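One possible serialization of a single benchmark result, covering the fields listed above. All field names and repo identifiers here are illustrative placeholders, not a proposed final schema:

```python
import json

# Illustrative record for one benchmark task; values are made up for the example.
record = {
    "category": "code-review",
    "prompt": "Review this diff for correctness and style issues.",
    "repos_evaluated": ["example-official-skills", "example-community-skills"],
    "selections": {
        "example-official-skills": {"skill": "code-review", "runs": 10, "consistency": 0.9},
        "example-community-skills": {"skill": "pr-review", "runs": 10, "consistency": 0.7},
    },
    "overlap_notes": "Both repos route to a review skill; naming and scope differ.",
}

print(json.dumps(record, indent=2))
```

Emitting JSON per task would make the summary comparison by repo straightforward to build downstream.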

Optional future metrics:

  • success/fit rating
  • routing entropy / confidence proxy
  • task coverage score by repo
  • recommended repo(s) by task type
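The routing-entropy proxy mentioned above could be computed from the same per-run selection data. A minimal sketch, assuming selections are recorded as a list of skill names per prompt:

```python
import math
from collections import Counter

def routing_entropy(selections: list[str]) -> float:
    """Shannon entropy (in bits) of the skill-selection distribution for one prompt.
    0.0 means every run picked the same skill (decisive routing); higher values
    mean the router spread its choices across more skills (ambiguous routing)."""
    counts = Counter(selections)
    total = len(selections)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Low entropy across a category would suggest the repo has one clearly matching skill; high entropy would flag overlap or ambiguity worth noting in the comparison output.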

Alternatives Considered

  • manually testing repos one by one
  • comparing only repo descriptions/readmes
  • relying only on search/routing heuristics

These approaches do not provide a repeatable benchmark for real usage scenarios.

Use Cases

  1. A user wants to know which skill repo is best for a common task like code review
  2. ASM wants stronger data for ranking and recommendation
  3. Maintainers want to compare overlap and specialization across repos
  4. Skill authors want a public benchmark for routing quality and task coverage

Additional Context

This could become the foundation for:

  • repo comparison pages in the catalog
  • “best repo for this task” recommendations
  • benchmark-driven discovery/ranking
  • evaluation datasets for skill routing behavior

Metadata

Assignees: none
Labels: feature (New feature or request), skill-discovery (Skill search and discovery)
Project status: Backlog
Milestone: none