Research: evaluate Claude Fable 5 / Mythos 5 for wizard model recommendation #385

## Context

Claude Fable 5 and Mythos 5 launched June 9, 2026 — the first public Mythos-class model. This is a research/evaluation issue, not an adoption issue. Same discipline as the Opus 4.8 evaluation (#365): field signal first, recommendation second.

## What we know so far (launch day)

**Benchmarks (positive):**
- SWE-bench Verified: 95.0% vs Opus 4.8's 88.6%
- SWE-bench Pro: 80.0% vs 69.2%
- FrontierCode Diamond: 29.3% vs 13.4% (about 2.2× Opus 4.8)
- "Even at medium effort, Fable 5 outperforms every other model at any effort level"
- Stripe: compressed months of engineering into days on a 50M-line Ruby codebase
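
To put the three benchmark pairs above on a common footing, here is a minimal sketch that computes the percentage-point gap and the ratio for each. The scores are copied from this issue; the dictionary layout and the `compare` helper are illustrative only.

```python
# Launch-day scores quoted in this issue: (Fable 5, Opus 4.8), in percent.
SCORES = {
    "SWE-bench Verified": (95.0, 88.6),
    "SWE-bench Pro": (80.0, 69.2),
    "FrontierCode Diamond": (29.3, 13.4),
}


def compare(fable, opus):
    """Return (percentage-point gap, ratio) for one benchmark pair."""
    return round(fable - opus, 1), round(fable / opus, 2)


results = {name: compare(*pair) for name, pair in SCORES.items()}

for name, (gap, ratio) in results.items():
    print(f"{name}: +{gap} points, {ratio}x")
```

The gap is largest in absolute terms on FrontierCode Diamond, which is also the only benchmark where the ratio exceeds 2×.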

**Token efficiency (potentially positive):**
- HN early testers: "better results with about half the tokens, making it cost ~same as Opus 4.8 price-wise"
- $10/$50 per Mtok (2× Opus), but fewer turns → similar effective cost per task
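
The "similar effective cost" claim can be checked with a small worked example. This is a sketch under stated assumptions: the Opus 4.8 price of $5/$25 is inferred from the "2× Opus" note rather than quoted anywhere in this issue, and the token counts are hypothetical.

```python
# Prices in USD per million tokens as (input, output).
FABLE_5 = (10.0, 50.0)   # from this issue
OPUS_4_8 = (5.0, 25.0)   # assumption: exactly half, per the "2x Opus" note


def task_cost(prices, input_tokens, output_tokens):
    """Cost in USD of one task at the given per-Mtok prices."""
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical task: Opus 4.8 spends 400k input and 40k output tokens.
opus_cost = task_cost(OPUS_4_8, 400_000, 40_000)

# Same task on Fable 5 at "about half the tokens".
fable_cost = task_cost(FABLE_5, 200_000, 20_000)

print(f"Opus 4.8: ${opus_cost:.2f}, Fable 5: ${fable_cost:.2f}")
```

At exactly half the tokens the two costs match, so half is the break-even point: if Fable 5 saves less than that on a given task, it costs more than Opus 4.8 for that task.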

**Concerns (need monitoring):**
- Safety classifiers "super aggressive and sensitive" for benign coding tasks — fallback to Opus 4.8 on sensitive queries (not 4.6)
- "Didn't really notice a difference vs 4.8" for standard conversation/assistant tasks
- Free on Max through June 22 only; credits may be required from June 23
- No community field data yet (launched TODAY)
- No Andon Labs / Vending-Bench data
- No effort-level regression data (does max overthink like 4.8?)

## What the wizard should evaluate before recommending

Same gates as #365, but this time actually run them:

1. **Gate 1: proof of life** — does `claude --model claude-fable-5` resolve? Does `/model` show it in the picker?
2. **Gate 2: A/B coder quality** — run the same task on Fable 5 vs 4.6 max on a real PR. Compare token spend, quality, context exhaustion
3. **Gate 3: dogfood for 24h** — maintainer runs Fable 5 as daily driver for a full day
4. **Gate 4: effort-level behavior** — does `max` overthink on Fable 5 like it does on 4.7/4.8? Or does it behave more like 4.6?
5. **Gate 5: community signal (7-day wait)** — monitor HN, Reddit, GitHub issues for field reports after launch-day hype settles
6. **Gate 6: Andon Labs / independent benchmarks** — wait for Vending-Bench arena results
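
One way to keep the six gates honest is to track them as data, so "ready to recommend" is computed rather than asserted. The sketch below is illustrative: the `Gate` class and helper names are hypothetical, not part of the wizard, and the pass results are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gate:
    number: int
    name: str
    passed: Optional[bool] = None  # None means the gate has not been run yet


def make_gates():
    """Build the six gates listed in this issue, all initially unrun."""
    names = [
        "proof of life",
        "A/B coder quality",
        "dogfood for 24h",
        "effort-level behavior",
        "community signal (7-day wait)",
        "Andon Labs / independent benchmarks",
    ]
    return [Gate(i, name) for i, name in enumerate(names, start=1)]


def ready_to_recommend(gates):
    """A recommendation needs every gate explicitly passed."""
    return all(g.passed is True for g in gates)


def open_gates(gates):
    """Numbers of gates that are unrun or failed."""
    return [g.number for g in gates if g.passed is not True]


gates = make_gates()
gates[0].passed = True  # hypothetical result for Gate 1
gates[1].passed = True  # hypothetical result for Gate 2

print(ready_to_recommend(gates))
print(open_gates(gates))
```

Treating an unrun gate the same as a failed one is deliberate: it prevents a recommendation from shipping on gates that were never actually run.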

## Current wizard state

v1.80.0 recommends Opus 4.6 max as flagship default. That was the right call based on 12 days of 4.8 post-launch data. Fable 5 is a new ceiling — but it's also $10/$50 (2× the cost) and day-zero. No rush.

## Pricing tier question

If Fable 5 validates, it would be a new tier above flagship (maybe "Frontier" or "Premium+"), not a replacement for 4.6 max. The $10/$50 pricing makes it a conscious choice, not a default.

## Timeline

- June 9-22: free trial window on Max plans (14 days counting launch day, 13 after it, to test at no cost)
- June 23+: may require credits — pricing becomes load-bearing
- June 16+ (7 days post-launch): earliest to evaluate community signal
- July: earliest to consider a wizard recommendation if all gates pass
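
The dates above follow from the launch date and the end of the free window. A minimal sketch of the arithmetic, using only dates stated in this issue (the variable names are illustrative):

```python
from datetime import date, timedelta

LAUNCH = date(2026, 6, 9)
FREE_TRIAL_END = date(2026, 6, 22)   # last free day on Max plans
COMMUNITY_WAIT = timedelta(days=7)   # Gate 5 waiting period

# Days remaining after launch day, and the window length counting launch day.
days_after_launch = (FREE_TRIAL_END - LAUNCH).days
days_inclusive = days_after_launch + 1

# First day credits may be required, and the earliest day for community signal.
credits_from = FREE_TRIAL_END + timedelta(days=1)
community_signal_from = LAUNCH + COMMUNITY_WAIT

# Free days left once the 7-day community wait is over.
free_days_after_wait = (FREE_TRIAL_END - community_signal_from).days + 1

print(days_after_launch, days_inclusive)
print(credits_from.isoformat(), community_signal_from.isoformat())
print(free_days_after_wait)
```

The community wait ends a week before the free window closes, so a full week remains to rerun Gates 2 through 4 at no cost once field reports start arriving.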

## Related

- #365 — Opus 4.8 evaluation (shipped as v1.78.0, then reverted to 4.6 max in v1.80.0)
- v1.80.0 CHANGELOG "reversion criteria" section
