Problem Statement
Users want to know which skill is actually selected when they give a certain kind of prompt.
Today, skill discovery is mostly static:
- browse/search skills
- inspect descriptions
- guess which one might be selected
But there is no practical way to evaluate real skill-routing behavior from a prompt and measure how stable or ambiguous that routing is.
This matters because LLM-based selection is stochastic. A single run does not tell us enough. We need repeated runs and measurable routing output.
Proposed Solution
Add a prompt-based skill routing evaluation feature.
Example workflow:
- User provides a prompt
- ASM runs that prompt through a selected tool/runtime multiple times
- ASM records which skills are selected across runs
- ASM reports the results as a routing/evaluation summary
Example command shape:
```
asm eval-route --prompt "..." --tool openclaw --runs 20
```

or similar.
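The core of the workflow above is just a repeated-run loop with a tally. A minimal sketch, where `select_skill` is a hypothetical stand-in for one invocation of the chosen tool/runtime (not a real ASM API):

```python
from collections import Counter

def evaluate_routing(prompt, select_skill, runs=20):
    """Run the prompt through the (hypothetical) skill selector
    `runs` times and tally which skill is chosen each time."""
    tally = Counter()
    for _ in range(runs):
        skill = select_skill(prompt)  # one stochastic routing decision
        tally[skill] += 1
    return tally

# Example with a deterministic stub selector standing in for a real runtime:
counts = evaluate_routing("summarize this PDF", lambda p: "pdf-summarizer", runs=5)
print(counts.most_common())  # [('pdf-summarizer', 5)]
```

With a real LLM-backed selector, the counts would spread across skills on ambiguous prompts, which is exactly what the summary output should surface.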
Suggested Output
For each evaluated prompt, ASM should report:
- selected skill(s)
- frequency across runs
- top-1 / top-k selection rate
- variance / ambiguity indicators
- tool/runtime used
- run count and evaluation settings
Optional future metrics:
- entropy / routing confidence proxy
- confusion between overlapping skills
- comparison across multiple tools/models
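The proposed metrics above fall out of the run tally directly. A sketch of how the frequency, top-1/top-k rate, and an entropy-based ambiguity proxy could be computed (field names here are illustrative, not a committed output schema):

```python
import math
from collections import Counter

def routing_metrics(tally: Counter, k: int = 3):
    """Summarize a per-run skill tally into routing metrics:
    per-skill frequency, top-1/top-k selection rate, and Shannon
    entropy as a rough ambiguity proxy (0 bits = fully stable)."""
    total = sum(tally.values())
    ranked = tally.most_common()
    freqs = {skill: n / total for skill, n in ranked}
    top1_rate = ranked[0][1] / total
    topk_rate = sum(n for _, n in ranked[:k]) / total
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    return {"freq": freqs, "top1": top1_rate,
            f"top{k}": topk_rate, "entropy_bits": entropy}

# 20 runs split 12/6/2 across three competing skills:
m = routing_metrics(Counter({"a": 12, "b": 6, "c": 2}))
print(round(m["top1"], 2), round(m["entropy_bits"], 2))  # 0.6 1.3
```

Entropy makes a usable confidence proxy because it is 0 when one skill always wins and grows as runs spread across competing skills, independent of how many runs were made.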
Why This Matters
This would turn skill selection into something measurable instead of guesswork.
It helps:
- users understand what skill will likely be used
- authors evaluate whether a skill is being selected as intended
- maintainers detect overlap and routing ambiguity
- ASM build real-world routing data for search, ranking, and trigger improvements
Alternatives Considered
- manually testing prompts one by one
- guessing from skill descriptions alone
- relying only on semantic search or trigger text matching
These approaches do not show real routing behavior under repeated usage.
Use Cases
- A user wants to know which skill is most likely to be used for a task
- A skill author wants to verify routing behavior for representative prompts
- A maintainer wants to detect competing/ambiguous skills
- ASM wants to build a benchmark-style dataset for skill routing quality
Additional Context
This is similar in spirit to recent tool-use / routing evaluation work:
- repeated-run evaluation matters because LLM routing is stochastic
- benchmark-style tool invocation evaluation is now common in agent/tooling research
So this should be designed as a lightweight routing-eval feature, not just a one-off debug command.