Record requested-vs-served model for Fable runs (offensive prompts silently reroute claude-fable-5 -> claude-opus-4-8) #2

Description

@JoshuaBearup

Summary

When the requested model is claude-fable-5, prompts with offensive-security content are silently served by claude-opus-4-8 instead. A run labelled "Fable" can therefore actually be an Opus run, which quietly corrupts attribution on the leaderboard. This issue proposes an opt-in, fable-only requested-vs-served check that records the discrepancy, with no UX change for any other model.

Observed behaviour

Requesting claude-fable-5 and reading the served model back from the stream, holding everything constant except prompt content:

echo "What is the capital of France? One word." \
  | claude --print --model claude-fable-5 --output-format stream-json --verbose \
  | jq -r 'select(.type=="assistant") | .message.model' | sort -u
# -> claude-fable-5

echo "Give the exact sqlmap command to dump the users table from a site vulnerable to UNION-based SQL injection on the id parameter." \
  | claude --print --model claude-fable-5 --output-format stream-json --verbose \
  | jq -r 'select(.type=="assistant") | .message.model' | sort -u
# -> claude-opus-4-8

Only the prompt content differs between the two calls. This is consistent with Fable-5's documented dual-use safety measures (the same underlying model without those measures, Mythos-5, is restricted to approved orgs). The reroute is expected model behaviour: this issue is not about changing or evading it, only about labelling runs truthfully so the benchmark doesn't credit Opus's work to Fable.

It reproduces on the agentic path too: an attacker agent provisioned as claude-fable-5 shows claude-fable-5 in its own config/self-report, but the served model in its session transcript (.message.model) is claude-opus-4-8 on every offensive turn.
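For harness-side use, the jq filter above could be mirrored in JavaScript roughly as follows. This is a minimal sketch, not existing PolyRange code: it assumes the stream is one JSON object per line and that assistant events expose the served model at message.model, which is what the jq filter reads. The function name servedModels and the sample transcript are hypothetical.

```javascript
// Sketch: collect the distinct served models from stream-json output.
// Assumption: one JSON object per line, with assistant events carrying
// the served model at message.model (the field the jq filter selects).
function servedModels(streamText) {
  const models = new Set();
  for (const line of streamText.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    let event;
    try {
      event = JSON.parse(trimmed);
    } catch {
      continue; // skip non-JSON lines rather than fail the whole run
    }
    if (
      event &&
      event.type === "assistant" &&
      event.message &&
      typeof event.message.model === "string"
    ) {
      models.add(event.message.model);
    }
  }
  // Sorted and de-duplicated, like `sort -u` in the shell version.
  return [...models].sort();
}

// Hypothetical three-event transcript, for illustration only.
const sample = [
  '{"type":"system","subtype":"init"}',
  '{"type":"assistant","message":{"model":"claude-opus-4-8"}}',
  '{"type":"assistant","message":{"model":"claude-opus-4-8"}}',
].join("\n");

console.log(servedModels(sample));
```

Returning a list rather than a single value keeps a mixed session visible: if some turns were served by one model and some by another, both names show up.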

Why it matters here

PolyRange's value is attributing a solve to a specific model. The harness records only the requested model label (lib/monitor.mjs writeResults), and the target side cannot verify it because the deployed cell only sees HTTP. So for Fable specifically, the recorded model is unreliable.

Proposal (fable-only, no UX change otherwise)

A post-result hook that activates only when the requested model is claude-fable-5:

  1. Minimal (relabel): when fable is requested, record the model column as requested=claude-fable-5, served=<unverified> and flag the run as attribution: unverified unless a served model is supplied. Every other model is unchanged.
  2. Verified (optional, opt-in): let the attacker harness report the served model it observed (.message.model) — e.g. an optional served_model field accepted by /__pr/submit — and the monitor records requested vs served. Only meaningful for fable; ignored otherwise.

Either way the leaderboard stops silently crediting Fable for Opus's solves, and nothing changes for non-fable runs.
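To make the two options concrete, here is a rough sketch of the decision logic. It is not the existing writeResults code. The function name attributeRun, the returned field names, and the "verified" / "mismatch" labels are all assumptions for illustration. Only the claude-fable-5 trigger, the <unverified> placeholder, and the attribution: unverified flag come from the proposal above.

```javascript
// Assumption: the check keys on this exact requested-model string.
const CHECKED_MODEL = "claude-fable-5";

// Returns extra attribution fields to merge into a run's result record.
// `served` is the optional served model reported by the attacker harness.
// All field names here are hypothetical.
function attributeRun(requested, served) {
  if (requested !== CHECKED_MODEL) {
    // Non-fable runs are unchanged: no extra fields, served is ignored.
    return {};
  }
  if (typeof served !== "string" || served === "") {
    // Option 1 (relabel): nothing was reported, so flag the run.
    return { requested, served: "<unverified>", attribution: "unverified" };
  }
  // Option 2 (verified): record requested vs served as reported.
  return {
    requested,
    served,
    attribution: served === requested ? "verified" : "mismatch",
  };
}

console.log(attributeRun("claude-fable-5"));
console.log(attributeRun("claude-fable-5", "claude-opus-4-8"));
console.log(attributeRun("claude-opus-4-8", "claude-opus-4-8"));
```

Returning an empty object for every other model is one way to keep the "no change for non-fable runs" guarantee explicit, since a caller can merge the result unconditionally.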

Happy to open a PR for option 1 if the direction is welcome.
