Skip to content

crew: fallback chains per roster slot on upstream transient errors #63

@jdidion

Description

@jdidion

Today if gemini-3.1-pro hits Cursor's "Provider Error" (transient backend outage), that reviewer slot just fails — the human has to notice and manually swap in another model.

Proposed shape

Let each roster slot carry an optional fallback list. Either in SKILL.md's default roster:

```md
Default roster when no models are specified:

  1. Claude — via the code-reviewer agent. (no fallback; core requirement)
  2. gpt-5.2 — via Cursor. Fallbacks: gpt-5.4-high, gpt-5.5-high
  3. gemini-3.1-pro — via Cursor. Fallbacks: gemini-3-flash, grok-4-20-thinking, composer-2
    ```

Or in ~/.config/crew/config.toml:

```toml
[[roster]]
name = "gemini-3.1-pro"
fallbacks = ["gemini-3-flash", "grok-4-20-thinking", "composer-2"]
```

When a slot's primary errors, walk its fallback list until one succeeds or the list is exhausted. Prefer different families (preserve the "three different families" property — don't fall back to a second Anthropic reviewer since the Claude leg already covers that).

Failure-classification prerequisite

The wrapper needs to distinguish:

  • exit 2: MODEL_NOT_AVAILABLE (permanent — walk immediately)
  • exit 3: PROVIDER_ERROR (transient — retry once with backoff, then walk)
  • exit 1: generic (walk)

Today each backend script just returns exit 1 for both flavors.

Applies to

  • /crew:review roster (per-slot fallback chain)
  • /crew:market roster (if a candidate's backend fails, spawn a fallback in its slot rather than losing the diversity)

Originally flagged in PR #61's follow-ups list.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions