crew: empirical calibration for /crew:do's topology classifier #64

Description

@jdidion

/crew:do currently classifies tasks with a single Haiku call against a hand-crafted prompt that encodes the Coase/Hayek-topology paper's decision table. There is no validation of how often it picks the right topology.

Proposed

Build a labeled dataset of ~100–200 tasks, each tagged with its "correct" topology (example entries are sketched after the list):

  • solo: refactor with cross-file invariants, long-horizon feature, schema migration.
  • hub-spoke: code review, security audit, multi-file linting.
  • market: LeetCode-medium with tests, subtle regex, dense math with cheap oracle.
  • hybrid: large feature with mix of reasoning + review sub-steps.
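
A minimal sketch of what labeled entries could look like if kept as a Python fixtures module; the field names and example tasks are illustrative, not something that exists in the repo today:

```python
# Hypothetical labeled calibration examples; field names are illustrative.
LABELED_TASKS = [
    {"task": "Refactor the config loader; invariants span six files", "topology": "solo"},
    {"task": "Security audit of the auth middleware across the service", "topology": "hub-spoke"},
    {"task": "Implement interval merging; unit tests are provided", "topology": "market"},
    {"task": "Ship the billing feature: design, implement, then review", "topology": "hybrid"},
]
```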

Run the classifier against the dataset, measure accuracy per category, and iterate on the prompt. Open question: should the labeled dataset live inside the repo, or be externalized to a fixtures file?
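
A rough harness for that loop could look like the following; `classify_topology` stands in for the Haiku-backed classifier and is a hypothetical interface, not an existing function:

```python
from collections import Counter, defaultdict

def evaluate(classify_topology, labeled_tasks):
    """Report per-category and overall accuracy, plus a confusion tally."""
    correct, total = Counter(), Counter()
    confusions = defaultdict(Counter)
    for example in labeled_tasks:
        expected = example["topology"]
        predicted = classify_topology(example["task"])
        total[expected] += 1
        if predicted == expected:
            correct[expected] += 1
        else:
            confusions[expected][predicted] += 1
    for topology in sorted(total):
        accuracy = correct[topology] / total[topology]
        print(f"{topology:10s} {correct[topology]}/{total[topology]} = {accuracy:.0%}")
    print(f"{'overall':10s} {sum(correct.values()) / sum(total.values()):.0%}")
    return confusions
```

The confusion tally is the part worth keeping: prompt iteration is much easier when you can see which pairs of topologies the classifier mixes up.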

Why this matters

Without validation, /crew:do mispicks at some unknown rate, and users who don't know to second-guess it get the wrong topology for their task. If accuracy is, say, 60%, the classifier is wrong on two tasks in five and can't be trusted without manual review. If it's 90%, it's a clear win.

Classifier cost is ~$0.001/call, so an ensemble vote or a more expensive model (Sonnet) is worth considering only if calibration shows Haiku is under-performing.
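
If Haiku does under-perform, a self-consistency vote over a handful of calls is the cheapest thing to try before switching models. A sketch, again assuming the hypothetical `classify_topology` callable:

```python
from collections import Counter

def ensemble_classify(task, classify_topology, n_votes=3):
    """Majority vote over repeated calls; ties fall back to the first vote.

    At ~$0.001 per call, three votes still cost well under a cent per task.
    """
    votes = [classify_topology(task) for _ in range(n_votes)]
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count > 1 else votes[0]
```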

Originally flagged in PR #61's follow-ups list and in the "Known limitations" section of skills/do/SKILL.md.
