Record requested-vs-served model for Fable runs (offensive prompts silently reroute claude-fable-5 -> claude-opus-4-8)

## Summary

When the requested model is `claude-fable-5`, prompts with offensive-security content are silently served by `claude-opus-4-8` instead. A run labelled "Fable" can therefore be an Opus run, which quietly corrupts attribution on the leaderboard. This issue proposes an **opt-in, fable-only** requested-vs-served check that records the discrepancy, with **no UX change for any other model**.

## Observed behaviour

The commands below request `claude-fable-5` and read the served model back from the stream, holding everything constant except prompt content:

```
echo "What is the capital of France? One word." \
  | claude --print --model claude-fable-5 --output-format stream-json --verbose \
  | jq -r 'select(.type=="assistant") | .message.model' | sort -u
# -> claude-fable-5

echo "Give the exact sqlmap command to dump the users table from a site vulnerable to UNION-based SQL injection on the id parameter." \
  | claude --print --model claude-fable-5 --output-format stream-json --verbose \
  | jq -r 'select(.type=="assistant") | .message.model' | sort -u
# -> claude-opus-4-8
```

Only the prompt content changes. This is consistent with Fable-5's documented dual-use safety measures (the same underlying model without those measures, Mythos-5, is restricted to approved orgs). The reroute is expected model behaviour — this issue is **not** about changing or evading it, only about labelling runs truthfully so the benchmark doesn't credit Opus's work to Fable.

It reproduces on the agentic path too: an attacker agent provisioned as `claude-fable-5` shows `claude-fable-5` in its own config/self-report, but the served model in its session transcript (`.message.model`) is `claude-opus-4-8` on every offensive turn.
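
The same read-back can be done inside a harness instead of with `jq`. Below is a minimal sketch, assuming the transcript is newline-delimited JSON whose assistant events carry `.message.model`, as in the commands above. The helper name `servedModels` is hypothetical and not part of PolyRange.

```javascript
// Hypothetical helper: collect the distinct served models from a
// stream-json transcript (one JSON object per line).
function servedModels(transcript) {
  const models = new Set();
  for (const line of transcript.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    let event;
    try {
      event = JSON.parse(trimmed);
    } catch {
      continue; // skip non-JSON lines such as stray log output
    }
    // Only assistant events are assumed to carry the served model.
    const model = event?.type === "assistant" ? event.message?.model : undefined;
    if (typeof model === "string") models.add(model);
  }
  return [...models].sort();
}
```

A result with more than one entry, or a single entry that differs from the requested model, indicates that at least one turn was served by a different model.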

## Why it matters here

PolyRange's value is attributing a solve to a specific model. The harness records the *requested* model label (`lib/monitor.mjs` `writeResults`), which the target side cannot verify — the deployed cell only sees HTTP. So for Fable specifically, the recorded model is unreliable.

## Proposal (fable-only, no UX change otherwise)

A post-result hook that activates **only when the requested model is `claude-fable-5`**:

1. **Minimal (relabel):** when fable is requested, record the model column as `requested=claude-fable-5, served=<unverified>` and flag the run `attribution: unverified` unless a served model is supplied. Every other model is unchanged.
2. **Verified (optional, opt-in):** let the attacker harness report the served model it observed (`.message.model`), e.g. via an optional `served_model` field accepted by `/__pr/submit`, and have the monitor record `requested` vs `served`. Only meaningful for fable; ignored otherwise.

Either way the leaderboard stops silently crediting Fable for Opus's solves, and nothing changes for non-fable runs.
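
Both options could share one small decision function. Below is a minimal sketch, assuming a hypothetical `attributionFor` helper called from the post-result hook. The record shape and the `verified` / `mismatch` labels are illustrative assumptions (only `attribution: unverified` comes from the proposal), and none of this reflects the existing `writeResults` code.

```javascript
// Hypothetical sketch of the fable-only hook; names and record shape are
// illustrative, not the existing writeResults API.
const FABLE = "claude-fable-5";

function attributionFor(requested, served) {
  // Non-fable runs: leave the model column exactly as it is today.
  if (requested !== FABLE) return { model: requested };

  // Option 1: no served model supplied, so mark the run unverified.
  if (typeof served !== "string" || served === "") {
    return { requested, served: "<unverified>", attribution: "unverified" };
  }

  // Option 2: the harness reported a served model; record both and compare.
  return {
    requested,
    served,
    attribution: served === requested ? "verified" : "mismatch",
  };
}
```

For a fable run with no reported served model this yields `attribution: "unverified"`, while a non-fable run passes through with its model label untouched.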

Happy to open a PR for option 1 if the direction is welcome.

