
[FEATURE] Add prompt-based skill routing evaluation #113

@luongnv89


Problem Statement

Users want to know which skill is actually selected when they give a certain kind of prompt.

Today, skill discovery is mostly static:

  • browse/search skills
  • inspect descriptions
  • guess which one might be selected

But there is no practical way to evaluate real skill-routing behavior from a prompt and measure how stable or ambiguous that routing is.

This matters because LLM-based selection is stochastic. A single run does not tell us enough. We need repeated runs and measurable routing output.

Proposed Solution

Add a prompt-based skill routing evaluation feature.

Example workflow:

  1. User provides a prompt
  2. ASM runs that prompt through a selected tool/runtime multiple times
  3. ASM records which skills are selected across runs
  4. ASM reports the results as a routing/evaluation summary

Example command shape:

  • asm eval-route --prompt "..." --tool openclaw --runs 20

or similar.
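The repeated-run loop behind a command like this can be sketched in a few lines. This is a minimal illustration, not the proposed implementation: `select_skill` is a hypothetical callable standing in for "send the prompt through the selected tool/runtime and return the name of the skill it routed to".

```python
from collections import Counter
from typing import Callable

def eval_route(prompt: str, runs: int,
               select_skill: Callable[[str], str]) -> Counter:
    """Run the routing step `runs` times and tally selected skills.

    `select_skill` is a placeholder for the real tool/runtime call
    (e.g. whatever `--tool openclaw` would dispatch to).
    """
    counts = Counter()
    for _ in range(runs):
        # Each run is an independent sample, since LLM routing is stochastic.
        counts[select_skill(prompt)] += 1
    return counts
```

With a real backend plugged in, the resulting `Counter` is exactly the raw data the evaluation summary below would be computed from.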

Suggested Output

For each evaluated prompt, ASM should report:

  • selected skill(s)
  • frequency across runs
  • top-1 / top-k selection rate
  • variance / ambiguity indicators
  • tool/runtime used
  • run count and evaluation settings
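A sketch of how the per-prompt report could be derived from run tallies. Field names and the 50% ambiguity threshold are illustrative assumptions, not a spec:

```python
from collections import Counter

def routing_summary(counts: Counter, runs: int) -> dict:
    """Turn raw selection tallies into a routing summary.

    `counts` maps skill name -> number of runs that selected it.
    The 0.5 majority cutoff for the ambiguity flag is an assumption.
    """
    ranked = counts.most_common()
    top_skill, top_count = ranked[0]
    return {
        "frequencies": {skill: n / runs for skill, n in ranked},
        "top_1": top_skill,
        "top_1_rate": top_count / runs,
        # Flag prompts where no single skill wins a clear majority.
        "ambiguous": top_count / runs < 0.5,
    }
```

Top-k rate, tool/runtime name, and evaluation settings would be added alongside these fields in the real report.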

Optional future metrics:

  • entropy / routing confidence proxy
  • confusion between overlapping skills
  • comparison across multiple tools/models
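The entropy proxy mentioned above is straightforward to compute from the same tallies. A sketch (standard Shannon entropy in bits over the empirical selection distribution):

```python
import math
from collections import Counter

def routing_entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of the empirical skill-selection distribution.

    0.0 means every run selected the same skill; higher values indicate
    more ambiguous routing (e.g. 1.0 for a 50/50 split between two skills).
    """
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Confusion between overlapping skills and cross-tool comparison would build on the same per-prompt tallies, aggregated over a prompt set.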

Why This Matters

This would turn skill selection into something measurable instead of guesswork.

It helps:

  • users understand what skill will likely be used
  • authors evaluate whether a skill is being selected as intended
  • maintainers detect overlap and routing ambiguity
  • ASM collect real-world routing data for search, ranking, and trigger improvements

Alternatives Considered

  • manually testing prompts one by one
  • guessing from skill descriptions alone
  • relying only on semantic search or trigger text matching

These approaches do not show real routing behavior under repeated usage.

Use Cases

  1. A user wants to know which skill is most likely to be used for a task
  2. A skill author wants to verify routing behavior for representative prompts
  3. A maintainer wants to detect competing/ambiguous skills
  4. ASM wants to build a benchmark-style dataset for skill routing quality

Additional Context

This is similar in spirit to recent tool-use / routing evaluation work:

  • repeated-run evaluation matters because LLM routing is stochastic
  • benchmark-style tool invocation evaluation is now common in agent/tooling research

So this should be designed as a lightweight routing-eval feature, not just a one-off debug command.

Metadata

Assignees: no one assigned
Labels: feature (New feature or request), skill-discovery (Skill search and discovery)
Projects: none (Status: Backlog)
Milestone: no milestone
Relationships: none yet
Development: no branches or pull requests