
[FEATURE] Add prompt-based skill routing evaluation #113

@luongnv89


Problem Statement

Users want to know which skill is actually selected when they give a certain kind of prompt.

Today, skill discovery is mostly static:

  • browse/search skills
  • inspect descriptions
  • guess which one might be selected

But there is no practical way to evaluate real skill-routing behavior from a prompt and measure how stable or ambiguous that routing is.

This matters because LLM-based selection is stochastic. A single run does not tell us enough. We need repeated runs and measurable routing output.

Proposed Solution

Add a prompt-based skill routing evaluation feature.

Example workflow:

  1. User provides a prompt
  2. ASM runs that prompt through a selected tool/runtime multiple times
  3. ASM records which skills are selected across runs
  4. ASM reports the results as a routing/evaluation summary

Example command shape:

  • asm eval-route --prompt "..." --tool openclaw --runs 20

or similar.
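The repeated-run loop behind a command like this can be sketched in a few lines. This is a minimal illustration, not the proposed implementation: `select_skill` is a hypothetical callable standing in for "send the prompt through the selected tool/runtime and return the name of the skill it routed to".

```python
from collections import Counter
from typing import Callable

def eval_route(prompt: str, runs: int,
               select_skill: Callable[[str], str]) -> Counter:
    """Run the routing step `runs` times and tally selected skills.

    `select_skill` is a placeholder for the real tool/runtime call
    (e.g. whatever `--tool openclaw` would dispatch to).
    """
    counts = Counter()
    for _ in range(runs):
        # Each run is an independent sample, since LLM routing is stochastic.
        counts[select_skill(prompt)] += 1
    return counts
```

With a real backend plugged in, the resulting `Counter` is exactly the raw data the evaluation summary below would be computed from.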

Suggested Output

For each evaluated prompt, ASM should report:

  • selected skill(s)
  • frequency across runs
  • top-1 / top-k selection rate
  • variance / ambiguity indicators
  • tool/runtime used
  • run count and evaluation settings
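A sketch of how the per-prompt report could be derived from run tallies. Field names and the 50% ambiguity threshold are illustrative assumptions, not a spec:

```python
from collections import Counter

def routing_summary(counts: Counter, runs: int) -> dict:
    """Turn raw selection tallies into a routing summary.

    `counts` maps skill name -> number of runs that selected it.
    The 0.5 majority cutoff for the ambiguity flag is an assumption.
    """
    ranked = counts.most_common()
    top_skill, top_count = ranked[0]
    return {
        "frequencies": {skill: n / runs for skill, n in ranked},
        "top_1": top_skill,
        "top_1_rate": top_count / runs,
        # Flag prompts where no single skill wins a clear majority.
        "ambiguous": top_count / runs < 0.5,
    }
```

Top-k rate, tool/runtime name, and evaluation settings would be added alongside these fields in the real report.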

Optional future metrics:

  • entropy / routing confidence proxy
  • confusion between overlapping skills
  • comparison across multiple tools/models
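The entropy proxy mentioned above is straightforward to compute from the same tallies. A sketch (standard Shannon entropy in bits over the empirical selection distribution):

```python
import math
from collections import Counter

def routing_entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of the empirical skill-selection distribution.

    0.0 means every run selected the same skill; higher values indicate
    more ambiguous routing (e.g. 1.0 for a 50/50 split between two skills).
    """
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Confusion between overlapping skills and cross-tool comparison would build on the same per-prompt tallies, aggregated over a prompt set.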

Why This Matters

This would turn skill selection into something measurable instead of guesswork.

It helps:

  • users understand what skill will likely be used
  • authors evaluate whether a skill is being selected as intended
  • maintainers detect overlap and routing ambiguity
  • ASM collect real-world routing data for search, ranking, and trigger improvements

Alternatives Considered

  • manually testing prompts one by one
  • guessing from skill descriptions alone
  • relying only on semantic search or trigger text matching

These approaches do not show real routing behavior under repeated usage.

Use Cases

  1. A user wants to know which skill is most likely to be used for a task
  2. A skill author wants to verify routing behavior for representative prompts
  3. A maintainer wants to detect competing/ambiguous skills
  4. ASM wants to build a benchmark-style dataset for skill routing quality

Additional Context

This is similar in spirit to recent tool-use / routing evaluation work:

  • repeated-run evaluation matters because LLM routing is stochastic
  • benchmark-style tool invocation evaluation is now common in agent/tooling research

So this should be designed as a lightweight routing-eval feature, not just a one-off debug command.

Metadata

Assignees: no one assigned
Labels: feature (New feature or request), skill-discovery (Skill search and discovery)
Projects: none (Status: Backlog)
Milestone: no milestone
Relationships: none yet
Development: no branches or pull requests