Research: evaluate Claude Fable 5 / Mythos 5 for wizard model recommendation #385

@BaseInfinity

Context

Claude Fable 5 and Mythos 5 launched June 9, 2026 — the first public Mythos-class model. This is a research/evaluation issue, not an adoption issue. Same discipline as the Opus 4.8 evaluation (#365): field signal first, recommendation second.

What we know so far (launch day)

Benchmarks (positive):

  • SWE-bench Verified: 95.0% vs Opus 4.8's 88.6%
  • SWE-bench Pro: 80.0% vs Opus 4.8's 69.2%
  • FrontierCode Diamond: 29.3% vs 13.4% (about 2.2× Opus 4.8)
  • "Even at medium effort, Fable 5 outperforms every other model at any effort level"
  • Stripe: compressed months of engineering into days on a 50M-line Ruby codebase
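
To put the headline numbers on one scale, here is a minimal sketch that computes the point gain and the ratio for each benchmark. The scores are copied from the list above; the `SCORES` table and `compare` helper are illustrative names, and treating the SWE-bench Pro baseline as Opus 4.8 is an assumption based on the neighboring bullets.

```python
# Scores in percent, copied from the launch-day list: (Fable 5, baseline).
# Assumption: the baseline for every row is Opus 4.8.
SCORES = {
    "SWE-bench Verified": (95.0, 88.6),
    "SWE-bench Pro": (80.0, 69.2),
    "FrontierCode Diamond": (29.3, 13.4),
}


def compare(fable: float, baseline: float) -> dict:
    """Return the absolute gain in points and the relative ratio."""
    return {
        "point_gain": round(fable - baseline, 1),
        "ratio": round(fable / baseline, 2),
    }


summary = {name: compare(*pair) for name, pair in SCORES.items()}

for name, row in summary.items():
    print(f"{name}: +{row['point_gain']} pts, {row['ratio']}x baseline")
```

The ratio only looks dramatic on FrontierCode Diamond, where the baseline is low; on the two SWE-bench rows the gain is better read in points.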

Token efficiency (potentially positive):

  • HN early testers: "better results with about half the tokens, making it cost ~same as Opus 4.8 price-wise"
  • $10/$50 per Mtok (2× Opus), but fewer turns → similar effective cost per task
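
The "2× price, half the tokens" claim can be sanity-checked with a few lines of arithmetic. This is a sketch under stated assumptions: the $10/$50 pair is read as input/output price per Mtok, the Opus price of $5/$25 is inferred from "2× Opus", and the token counts are made-up placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Input/output split is an assumption."""
    input_per_mtok: float
    output_per_mtok: float


def task_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one task at the given token usage."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


FABLE_5 = Pricing(10.0, 50.0)  # from the issue
OPUS = Pricing(5.0, 25.0)      # inferred from "2x Opus"

# Hypothetical usage for the same task: Fable 5 uses half the tokens.
opus_cost = task_cost(OPUS, input_tokens=400_000, output_tokens=40_000)
fable_cost = task_cost(FABLE_5, input_tokens=200_000, output_tokens=20_000)

print(f"Opus: ${opus_cost:.2f}  Fable 5: ${fable_cost:.2f}")
```

With exactly half the tokens the two costs are equal, so the break-even is a token ratio of 0.5. Any task where Fable 5 saves less than half the tokens costs more than on Opus, which is what Gate 2 would need to measure.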

Concerns (need monitoring):

  • Safety classifiers reported as "super aggressive and sensitive" on benign coding tasks; sensitive queries fall back to Opus 4.8 (not 4.6)
  • "Didn't really notice a difference vs 4.8" for standard conversation/assistant tasks
  • Free on Max through June 22 only; credits may be required from June 23
  • No community field data yet (launched TODAY)
  • No Andon Labs / Vending-Bench data
  • No effort-level regression data (does max overthink like 4.8?)

What the wizard should evaluate before recommending

Same gates as #365, but this time actually run them:

  1. Gate 1: proof of life — does claude --model claude-fable-5 resolve? Does /model show it in the picker?
  2. Gate 2: A/B coder quality — run the same task on Fable 5 vs 4.6 max on a real PR. Compare token spend, quality, context exhaustion
  3. Gate 3: dogfood for 24h — maintainer runs Fable 5 as daily driver for a full day
  4. Gate 4: effort-level behavior — does max overthink on Fable 5 like it does on 4.7/4.8? Or does it behave more like 4.6?
  5. Gate 5: community signal (7-day wait) — monitor HN, Reddit, GitHub issues for field reports after launch-day hype settles
  6. Gate 6: Andon Labs / independent benchmarks — wait for Vending-Bench arena results
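
One way to keep the gates honest is to track them as data and refuse a recommendation until every gate has passed. The sketch below is hypothetical: `Gate`, `Status`, `first_blocker`, and `can_recommend` are illustrative names and not part of the wizard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "pending"
    PASSED = "passed"
    FAILED = "failed"


@dataclass
class Gate:
    number: int
    name: str
    status: Status = Status.PENDING


GATE_NAMES = [
    "proof of life",
    "A/B coder quality",
    "dogfood for 24h",
    "effort-level behavior",
    "community signal (7-day wait)",
    "Andon Labs / independent benchmarks",
]


def make_gates() -> List[Gate]:
    """Build the six gates from the list above, all pending."""
    return [Gate(i, name) for i, name in enumerate(GATE_NAMES, start=1)]


def first_blocker(gates: List[Gate]) -> Optional[Gate]:
    """Return the lowest-numbered gate that has not passed, if any."""
    for gate in gates:
        if gate.status is not Status.PASSED:
            return gate
    return None


def can_recommend(gates: List[Gate]) -> bool:
    """A recommendation is only considered once every gate has passed."""
    return first_blocker(gates) is None


gates = make_gates()
gates[0].status = Status.PASSED  # e.g. Gate 1 checked by hand
blocker = first_blocker(gates)
print(f"Recommend: {can_recommend(gates)}; blocked on Gate {blocker.number}: {blocker.name}")
```

Pending and failed gates both block, so a gate nobody ran counts the same as one that failed.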

Current wizard state

v1.80.0 recommends Opus 4.6 max as flagship default. That was the right call based on 12 days of 4.8 post-launch data. Fable 5 is a new ceiling — but it's also $10/$50 (2× the cost) and day-zero. No rush.

Pricing tier question

If Fable 5 validates, it would be a new tier above flagship (maybe "Frontier" or "Premium+"), not a replacement for 4.6 max. The $10/$50 pricing makes it a conscious choice, not a default.

Timeline

  • June 9-22: free trial window on Max plans (13 days to test at no cost)
  • June 23+: may require credits — pricing becomes load-bearing
  • June 16+ (7 days post-launch): earliest to evaluate community signal
  • July: earliest to consider a wizard recommendation if all gates pass
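
The dates above follow from the launch date with simple date arithmetic. A minimal sketch, assuming "July" means July 1, 2026 at the earliest; the constant and function names are illustrative.

```python
from datetime import date, timedelta

LAUNCH = date(2026, 6, 9)
LAST_FREE_DAY = date(2026, 6, 22)            # "free through June 22"
COMMUNITY_WAIT = timedelta(days=7)           # Gate 5 wait
EARLIEST_RECOMMENDATION = date(2026, 7, 1)   # assumption: "July" read as July 1

community_signal_from = LAUNCH + COMMUNITY_WAIT
credits_from = LAST_FREE_DAY + timedelta(days=1)
# 13 days between launch day and the last free day
# (14 calendar days if both endpoints are counted).
free_days_after_launch = (LAST_FREE_DAY - LAUNCH).days


def milestones_reached(today: date) -> list:
    """List the timeline milestones that have been reached by `today`."""
    checks = [
        (LAUNCH, "launched"),
        (community_signal_from, "community signal can be evaluated"),
        (credits_from, "credits may be required"),
        (EARLIEST_RECOMMENDATION, "recommendation can be considered"),
    ]
    return [label for start, label in checks if today >= start]


print(community_signal_from, credits_from, free_days_after_launch)
print(milestones_reached(date(2026, 6, 17)))
```

The community-signal date falls inside the free window, which leaves about a week to run Gates 2 through 4 with field reports in hand before pricing applies.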

Related
