
fix: improve skill trigger routing accuracy across models #1178

Merged
stack72 merged 2 commits into main from worktree-80
Apr 14, 2026

stack72 commented Apr 14, 2026

Summary

  • Sharpen skill descriptions for swamp-report, swamp-extension-model, swamp-extension-driver, and swamp-model to reduce cross-model misrouting identified by multi-model eval runs
  • Add differentiating keywords (dataRepository, UnifiedDataRepository, dataHandles, MethodReportContext) to swamp-report so report API queries stop routing to troubleshooting
  • Narrow extension-model and extension-driver descriptions to creation-only scope with explicit exclusions for adjacent skills (swamp-model, swamp-workflow, swamp-troubleshooting, swamp-extension-publish)
  • Add workflow-orchestration exclusion to swamp-model
  • Strengthen eval system prompt against text-only responses from Opus and Gemini

Eval Results

Validated against multi-model eval suite (run 1, run 2):

| Model | Before | After |
| --- | --- | --- |
| Sonnet | 99.0% (200/202) | 97.5% (197/202) |
| GPT-5.4 | 98.0% (198/202) | 98.0% (198/202) |
| Opus | 94.1% (190/202) | 96.5% (195/202) |
| Gemini | 91.6% (185/202) | 93.8% (120/128*) |

*Gemini was rate-limited in the After run; 128 of 202 tests completed.
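
The percentages in the table follow directly from the pass/total counts. As a minimal sketch (the `EvalRun` type and helper names are hypothetical, not taken from the eval suite), the After column and the 90% threshold from the test plan can be recomputed like this:

```typescript
// Hypothetical record shape; the counts are copied from the After column above.
interface EvalRun {
  model: string;
  passed: number;
  total: number;
}

// The test plan requires every model to score above 90%.
const THRESHOLD_PERCENT = 90;

// Pass rate as a percentage of the tests that actually ran.
function passRate(run: EvalRun): number {
  return (run.passed / run.total) * 100;
}

function meetsThreshold(
  run: EvalRun,
  thresholdPercent: number = THRESHOLD_PERCENT,
): boolean {
  return passRate(run) > thresholdPercent;
}

const afterRuns: EvalRun[] = [
  { model: "Sonnet", passed: 197, total: 202 },
  { model: "GPT-5.4", passed: 198, total: 202 },
  { model: "Opus", passed: 195, total: 202 },
  // Partial run: only 128 of 202 tests completed because of rate limiting.
  { model: "Gemini", passed: 120, total: 128 },
];

for (const run of afterRuns) {
  const rate = passRate(run).toFixed(1);
  console.log(`${run.model}: ${rate}% above threshold: ${meetsThreshold(run)}`);
}
```

Because the Gemini rate is computed over 128 tests rather than 202, it measures a smaller sample than the other three rows.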

Original cross-model failures from issue:

| Failure | Before | After | Status |
| --- | --- | --- | --- |
| report SHOULD trigger "UnifiedDataRepository" | 4/4 fail | 0/4 | FIXED |
| model NOT for "chain into workflow" | 2/4 fail | 0/4 | FIXED |
| extension-driver NOT for "Kubernetes cluster" | 3/4 fail | 0/4 | FIXED |
| extension-model NOT for "custom model in workflow" | 3/4 fail | 2/4 | improved |
| workflow NOT for "erroring on second step" | 2/4 fail | 2/4 | same |
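
Each row above pairs a skill with a query and counts how many of the four models routed it incorrectly. A rough sketch of that scoring follows; the `TriggerCase` shape, the function names, and the routing data are illustrative assumptions, not the eval suite's actual schema or output:

```typescript
// Hypothetical case shape: a skill either SHOULD or should NOT trigger for a query.
type Expectation = "should_trigger" | "should_not_trigger";

interface TriggerCase {
  skill: string;
  query: string;
  expectation: Expectation;
}

// routedSkill is the skill a model selected, or null for a text-only
// response with no tool call.
function casePasses(testCase: TriggerCase, routedSkill: string | null): boolean {
  const triggered = routedSkill === testCase.skill;
  return testCase.expectation === "should_trigger" ? triggered : !triggered;
}

// Number of models whose routing decision fails the case.
function failureCount(
  testCase: TriggerCase,
  routedByModel: Record<string, string | null>,
): number {
  return Object.values(routedByModel)
    .filter((routed) => !casePasses(testCase, routed)).length;
}

// Illustrative data only: a report query that every model sends to troubleshooting.
const reportCase: TriggerCase = {
  skill: "swamp-report",
  query: "UnifiedDataRepository",
  expectation: "should_trigger",
};

const illustrativeRouting: Record<string, string | null> = {
  modelA: "swamp-troubleshooting",
  modelB: "swamp-troubleshooting",
  modelC: "swamp-troubleshooting",
  modelD: "swamp-troubleshooting",
};

console.log(`${failureCount(reportCase, illustrativeRouting)}/4 fail`);
```

Under this scoring a `null` routing fails a SHOULD case but passes a NOT case, so a harness that treats every text-only response as wrong would need an extra check.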

Closes #80

Test plan

  • deno check passes
  • deno lint passes
  • deno fmt passes
  • All 4346 tests pass
  • deno run compile succeeds
  • Multi-model eval suite passes (all models above 90% threshold)

🤖 Generated with Claude Code

stack72 and others added 2 commits April 14, 2026 22:50
Sharpen skill descriptions to reduce cross-model misrouting identified by
multi-model eval runs. Adds explicit exclusions to extension-model,
extension-driver, and model skills. Enriches report description with
differentiating keywords. Strengthens eval system prompt against text-only
responses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Extension-model was incorrectly triggering on publishing queries like
"Prepare my extension for publishing" and "Publish my model to the registry"
after the initial description narrowing. Add explicit exclusion directing
these to swamp-extension-publish.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot left a comment

Code Review

Blocking Issues

None.

Suggestions

  1. Pre-existing body line limit: swamp-extension-model/SKILL.md is ~509 body lines (529 total minus ~20 lines of frontmatter), slightly exceeding the 500-line body limit in CLAUDE.md. This is pre-existing and not introduced by this PR, but worth noting for a future cleanup pass to split content into references/ files.
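
The body-line arithmetic above (total lines minus frontmatter lines) can be checked mechanically. The following is a minimal sketch, assuming the frontmatter is delimited by `---` lines at the top of the file; `countBodyLines` and `exceedsBodyLimit` are hypothetical helpers, not part of the repository's tooling:

```typescript
// The 500-line body limit cited from CLAUDE.md in the suggestion above.
const BODY_LINE_LIMIT = 500;

// Count lines after the closing `---` of a leading frontmatter block.
// If there is no frontmatter, every line counts as body.
function countBodyLines(markdown: string): number {
  // Drop a single trailing newline so it is not counted as an extra line.
  const lines = markdown.replace(/\n$/, "").split("\n");
  if (lines[0]?.trim() !== "---") return lines.length;
  const close = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (close === -1) return lines.length; // unterminated frontmatter
  return lines.length - (close + 1);
}

function exceedsBodyLimit(
  markdown: string,
  limit: number = BODY_LINE_LIMIT,
): boolean {
  return countBodyLines(markdown) > limit;
}

// Synthetic file matching the review's estimate: 20 frontmatter lines
// (including both delimiters) followed by 509 body lines.
const frontmatter = [
  "---",
  ...Array.from({ length: 18 }, (_, i) => `key${i}: value`),
  "---",
];
const body = Array.from({ length: 509 }, (_, i) => `body line ${i + 1}`);
const synthetic = [...frontmatter, ...body].join("\n") + "\n";

console.log(countBodyLines(synthetic), exceedsBodyLimit(synthetic));
```

To run it against a real file under Deno, the contents could be loaded with `await Deno.readTextFile(path)` and passed to `exceedsBodyLimit`.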

Summary

Clean, focused PR that sharpens skill frontmatter descriptions to reduce cross-model misrouting. The changes are well-structured:

  • Explicit "Do NOT use for X — that is Y" exclusions on swamp-extension-driver, swamp-extension-model, swamp-model, and swamp-report provide clear disambiguation guidance
  • Differentiating API keywords (dataRepository, UnifiedDataRepository, dataHandles, MethodReportContext) added to swamp-report to fix report routing failures
  • "remote execution" trigger correctly removed from swamp-extension-driver to prevent workflow misrouting
  • Eval system prompt reinforcement ("A text-only response with no tool call is ALWAYS wrong") is a good nudge for models that default to text responses
  • All YAML frontmatter uses valid > block scalars with correct name/description-only fields
  • CI passes: lint, test, format, skill review, and skill trigger eval all green
  • Eval results show clear improvement (Opus 94.1% → 96.5%, targeted failures fixed)

stack72 merged commit ebbbc00 into main Apr 14, 2026
15 checks passed
stack72 deleted the worktree-80 branch April 14, 2026 21:33