/crew:do currently classifies with a Haiku call against a hand-crafted prompt that encodes the Coase/Hayek-topology paper's decision table. No validation.
Proposed
Build a labeled dataset of ~100–200 tasks tagged with their "correct" topology:
- solo: refactor with cross-file invariants, long-horizon feature, schema migration.
- hub-spoke: code review, security audit, multi-file linting.
- market: LeetCode-medium with tests, subtle regex, dense math with cheap oracle.
- hybrid: large feature with mix of reasoning + review sub-steps.
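If the dataset is externalized, one low-friction shape is a JSONL fixtures file — one task per line with its labeled topology. This is a sketch, not an existing format; the filename and field names are assumptions:

```json
{"task": "Refactor the session module; callers in 12 files must stay consistent", "label": "solo"}
{"task": "Review this PR for injection vulnerabilities", "label": "hub-spoke"}
{"task": "Implement two-sum against the provided unit tests", "label": "market"}
{"task": "Ship the billing feature: design the schema, then review each migration", "label": "hybrid"}
```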
Run the classifier against the dataset, measure accuracy per category, and iterate on the prompt. Open question for check-in: should the labeled dataset live directly in the repo, or be externalized to a fixtures file?
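The measurement loop could be as small as this — a minimal sketch where `classify` is a stand-in for the real Haiku-backed call and the fixture rows are illustrative:

```python
from collections import defaultdict

def classify(task: str) -> str:
    """Placeholder for the real Haiku-backed classifier call."""
    return "solo"  # stub: always guesses solo

def evaluate(fixtures):
    """Return overall accuracy and per-category accuracy for labeled tasks."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in fixtures:
        total[row["label"]] += 1
        if classify(row["task"]) == row["label"]:
            correct[row["label"]] += 1
    per_cat = {cat: correct[cat] / total[cat] for cat in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_cat

fixtures = [
    {"task": "Refactor auth handling across modules", "label": "solo"},
    {"task": "Security audit of this PR", "label": "hub-spoke"},
    {"task": "LeetCode-medium with a test suite", "label": "market"},
]
overall, per_cat = evaluate(fixtures)
```

Per-category numbers matter more than the overall figure here: a classifier that funnels everything into hub-spoke can look acceptable overall while being useless for market-shaped tasks.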
Why this matters
Without validation, /crew:do mispicks at some unknown rate, and users who don't know to second-guess it get the wrong topology for their task. If accuracy is, say, 60%, the classifier beats the 25% random baseline on a 4-way choice but still mispicks two tasks in five — hard to trust. If it's 90%, the skill is a clear win.
Classifier cost is ~\$0.001/call, so an ensemble vote or a more expensive model (Sonnet) is worth considering if calibration shows Haiku is underperforming.
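At that price an ensemble is cheap to try: sample the classifier several times and take the majority label. A minimal sketch, where `classify_once` stands in for a single (temperature > 0) Haiku call:

```python
from collections import Counter

def classify_once(task: str) -> str:
    """Stand-in for one Haiku classification call."""
    ...

def classify_ensemble(task: str, k: int = 5, call=None) -> str:
    """Majority vote over k independent classifier samples."""
    call = call or classify_once
    votes = Counter(call(task) for _ in range(k))
    return votes.most_common(1)[0][0]
```

For example, with five samples returning solo, market, solo, solo, hub-spoke, the vote resolves to solo. Whether to spend 5x on votes versus 1x on Sonnet is exactly the kind of question the labeled dataset would answer.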
Originally flagged in PR #61's follow-ups list and in skills/do/SKILL.md's "Known limitations".