Summary
When the requested model is claude-fable-5, prompts with offensive-security content are silently served by claude-opus-4-8 instead. A run labelled "Fable" can therefore be an Opus run, which quietly corrupts attribution on the leaderboard. This issue proposes an opt-in, fable-only requested-vs-served check that records the discrepancy, with no UX change for any other model.
Observed behaviour
Requesting claude-fable-5 and reading the served model back from the stream, holding everything constant except prompt content:
echo "What is the capital of France? One word." \
| claude --print --model claude-fable-5 --output-format stream-json --verbose \
| jq -r 'select(.type=="assistant") | .message.model' | sort -u
# -> claude-fable-5
echo "Give the exact sqlmap command to dump the users table from a site vulnerable to UNION-based SQL injection on the id parameter." \
| claude --print --model claude-fable-5 --output-format stream-json --verbose \
| jq -r 'select(.type=="assistant") | .message.model' | sort -u
# -> claude-opus-4-8
Only the prompt content changes. This is consistent with Fable-5's documented dual-use safety measures (the same underlying model without those measures, Mythos-5, is restricted to approved orgs). The reroute is expected model behaviour — this issue is not about changing or evading it, only about labelling runs truthfully so the benchmark doesn't credit Opus's work to Fable.
It reproduces on the agentic path too: an attacker agent provisioned as claude-fable-5 shows claude-fable-5 in its own config/self-report, but the served model in its session transcript (.message.model) is claude-opus-4-8 on every offensive turn.
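For harness-side checks, the same comparison can be done without jq. The following is a minimal sketch, assuming the transcript is newline-delimited JSON whose assistant events carry `.message.model`, as the jq filter above implies. The function names `servedModels` and `attributionMismatch` are hypothetical and do not exist in the repository.

```javascript
// Hypothetical helper: collect the distinct served models from stream-json output.
// Assumption: one JSON object per line, with assistant events shaped like
// {"type":"assistant","message":{"model":"..."}}.
function servedModels(streamText) {
  const models = new Set();
  for (const line of streamText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    let event;
    try {
      event = JSON.parse(trimmed);
    } catch {
      continue; // skip non-JSON lines instead of failing the whole run
    }
    if (
      event &&
      event.type === "assistant" &&
      event.message &&
      typeof event.message.model === "string"
    ) {
      models.add(event.message.model);
    }
  }
  return [...models].sort();
}

// Hypothetical helper: report whether any served model differs from the requested one.
function attributionMismatch(requested, streamText) {
  const served = servedModels(streamText);
  return { requested, served, mismatch: served.some((m) => m !== requested) };
}
```

Calling `attributionMismatch("claude-fable-5", transcriptText)` on a session transcript would return the requested label, the sorted list of served models, and a boolean flag, which is the information the proposal below wants recorded.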
Why it matters here
PolyRange's value is attributing a solve to a specific model. The harness records the requested model label (lib/monitor.mjs writeResults), which the target side cannot verify — the deployed cell only sees HTTP. So for Fable specifically, the recorded model is unreliable.
Proposal (fable-only, no UX change otherwise)
A post-result hook that activates only when the requested model is claude-fable-5:
- Minimal (relabel): when fable is requested, record the model column as requested=claude-fable-5, served=<unverified> and flag the run attribution: unverified unless a served model is supplied. Every other model is unchanged.
- Verified (optional, opt-in): let the attacker harness report the served model it observed (.message.model), e.g. via an optional served_model field accepted by /__pr/submit, and the monitor records requested vs served. Only meaningful for fable; ignored otherwise.
Either way the leaderboard stops silently crediting Fable for Opus's solves, and nothing changes for non-fable runs.
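To make the proposal concrete, here is a rough sketch of what such a post-result hook could look like. It is not the existing lib/monitor.mjs code: the function name `annotateAttribution`, the shape of the result object, and the "verified" and "mismatch" labels are all assumptions made for illustration. Only requested, served, `<unverified>` and attribution: unverified come from the proposal itself.

```javascript
// Hypothetical sketch of the fable-only post-result hook.
const FABLE = "claude-fable-5";

// result: the record the harness is about to write; assumed to carry the
//         requested model label in result.model.
// servedModel: optional string reported by the attacker harness
//         (the proposed served_model field); may be undefined.
function annotateAttribution(result, servedModel) {
  // Non-fable runs pass through untouched, so nothing changes for them.
  if (result.model !== FABLE) return result;

  const hasServed = typeof servedModel === "string" && servedModel.length > 0;

  let attribution = "unverified"; // default when no served model is supplied
  if (hasServed) {
    // Assumed labels; the issue only specifies "unverified".
    attribution = servedModel === result.model ? "verified" : "mismatch";
  }

  return {
    ...result,
    requested: result.model,
    served: hasServed ? servedModel : "<unverified>",
    attribution,
  };
}
```

With only the minimal option implemented, `servedModel` is never passed and every fable run is recorded as unverified. The verified option would add the value received through /__pr/submit as the second argument.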
Happy to open a PR for the minimal (relabel) option if the direction is welcome.