v0.7.5: reasoning_replay knob (default none) + replay eval grid + Anthropic thinking-on baseline by antoinezambelli · Pull Request #105 · antoinezambelli/forge

antoinezambelli · 2026-06-12T01:42:31Z

What

Reasoning replay becomes a measured, bounded policy. Reasoning-capable backends return hidden reasoning alongside tool calls, and forge previously re-serialized all of it into backend-facing history on every later turn, with no way to turn it off. This PR adds a reasoning_replay {full, keep-last, none} knob to WorkflowRunner and the proxy — and, backed by a full re-sweep of the published eval grid, makes none the default.

Why `none`

The v0.7.5 grid re-swept the full 8–14B lineup across all three policies × both ablations × native/prompt (~170k runs, scenario as the sampling unit, paired against the v0.7.0 legacy baseline):

full reproduces the pre-knob baseline everywhere — the knob is a clean superset of legacy behavior.
none is statistically indistinguishable from replay-all in aggregate (+0.49pp, p=0.17; reforged-only −0.35pp, p=0.45) while saving the replayed tokens every turn.
No per-config regression survives multiple-comparison correction (closest: a mild raw drop on Ministral-3 14B Reasoning Q4, where none ≈ keep-last).
Wire-level validation: none → exactly 0 replayed reasoning across every run; keep-last ∈ {0, 1}.

Full rationale and alternatives: ADR-017. Per-config tables: reasoning-replay view.

Changes

reasoning_replay knob — WorkflowRunner(reasoning_replay=…) / proxy --reasoning-replay. Serialization-only: reasoning is still captured and still surfaces in on_message and internal history. In OpenAI-compatible proxy responses, keep-last exposes current reasoning as reasoning_content rather than assistant content; under the default none, responses omit captured reasoning.
Default behavior change — pre-0.7.5 ≡ full; now none. Inert for models that emit no reasoning. Migration: --reasoning-replay full or WorkflowRunner(reasoning_replay="full") restores the historical behavior.
Eval/report/dashboard — the policy is part of the eval resume key and a first-class report/dashboard dimension: row labels tagged :keep-last / :full (untagged = none; pre-knob rows count as full), a Reasoning Replay dashboard filter, a --reasoning-replay report filter, and a dedicated policy-comparison view. Design note for review: the policy is in ConfigKey (display identity) but deliberately not in _config_tuple (dedup identity), so latest-generation-wins still supersedes pre-knob rows whole-config instead of keeping them as stale :full duplicates.
Anthropic extended thinking — AnthropicClient(thinking=…) (e.g. {"type": "adaptive"}); forced tool_choice suppressed when thinking is on (API requires auto). The Claude eval baseline now runs Sonnet 4.6 / Opus 4.8 with adaptive thinking (all prior Claude rows had thinking off); Haiku 4.5 has no adaptive support and stays non-thinking.
Anthropic prompt caching — AnthropicClient(prompt_caching=True) marks a static ephemeral cache breakpoint over tools + system; TokenUsage gains generic cache_creation_input_tokens / cache_read_input_tokens; eval cost accounting prices cache writes (1.25×) and reads (0.1×).
Dataset — eval_results_v0.7.5.jsonl (170,300 rows, new eval generation). Haiku rows re-stamped keep-last → none (label hygiene: they ran before the default decision; the knob is request-inert for Claude rows). Validated: 0 bad JSON, 0 duplicate run keys.
Docs — README / User Guide / Backend Setup document the knob and default; ADR-017; Model Registry updates (Claude thinking-on re-baseline footnote; Qwen3 8B Q8_0 flagged for future retirement on compute-cost vs signal-value grounds — it was ~23% of the full sweep for a small Q4/Q8 delta).
Release prep — pyproject.toml → 0.7.5, CHANGELOG section (dated 2026-06-11 — touch up if the release slips).

Validation

Unit suite: 1157 passed (new default-behavior tests; fold/exposure mechanics re-pinned under explicit keep-last).
Reports + dashboard regenerated from all four dataset files; dashboard TypeScript builds clean.
Dataset integrity re-verified after the Haiku re-stamp (line count, JSON validity, duplicate run keys).

🤖 Generated with Claude Code

Bound reasoning accumulation in the forge->backend direction. Adds core/reasoning.py policy module and threads reasoning_replay through inference, runner, and proxy convert/handler paths. keep-last emits reasoning via reasoning_content for round-trip re-capture and trims older reasoning; none strips it; full preserves prior behavior. The Anthropic path drops reasoning under keep-last (no signable channel). Includes docs (README, BACKEND_SETUP, USER_GUIDE) and unit tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make reasoning_replay a first-class, resumable, recorded eval axis so the re-sweep can run none-on-all (regression) and keep-last/full on reasoning models without collisions. - EvalConfig gains reasoning_replay; run_scenario passes it to WorkflowRunner (the backend pipeline already consumes it). run_eval propagates it. - batch_eval: run-wide --reasoning-replay choice (mirrors --ablation), threaded into run_batch, every EvalConfig, and the JSONL row. - Centralize the 6 inline resume keys into _run_key(); reasoning_replay is now part of the key so distinct policies for the same model+scenario are independent runs. _count_completed_runs defaults pre-knob rows (no field) to keep-last, so old dumps resume cleanly under the default. - Both CLIs expose --reasoning-replay {full,keep-last,none}; banners print it. - Add test_batch_eval_resume.py covering key distinctness, row recording, and per-policy resume counting incl. the legacy-default fold. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add count_wire_reasoning() (eval-only, no src change): serialize the recorded transcript through the real fold_and_serialize choke point and count which reasoning blocks survive onto the backend wire. Emit reasoning_wire (survived) and reasoning_wire_total (non-empty blocks) per batch_eval row, so the sweep records an actual replay rate. Validated on a reasoning model (N=10, all 3 policies, 26 scenarios): none -> 0 on the wire across all 260 rows; keep-last in {0,1}; full in [0, total]. Surfaces that legacy/full replay is itself lossy (~29% of generated reasoning reaches the wire) due to consecutive-block collapse and empty-reasoning omission in fold_and_serialize. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cache the re-sent tool defs + system prompt on the Anthropic eval path so the repeated input prefix bills at 0.1x (read) instead of full price every turn. Billing-only: identical model behavior, accuracy, and iteration counts (safe for cross-run comparability). - AnthropicClient gains opt-in `prompt_caching` (default off, so the proxy verbatim path and existing request shape are untouched). When on, a static ephemeral breakpoint marks the tools + system prefix in the rebuild path. - Static-only on purpose: a rolling conversation breakpoint is NOT placed. The default reasoning_replay="keep-last" re-serializes earlier tool-call messages each turn, which busts a rolling prefix cache (1.25x writes, no reads). The conversation prefix is only stable under none/full, and reasoning_replay is a measured variable we won't pin, so caching is confined to the always-stable tools+system region. - TokenUsage carries cache_creation/cache_read counts (additive, defaults 0); captured in send() and send_stream(); accumulated through CountingClientWrapper and RunResult into the JSONL row. - _compute_cost is cache-aware (write 1.25x, read 0.1x of input rate); applied at the row and both eval_runner cost summaries. - Enabled by default for batch_eval sweeps; eval_runner gains --no-anthropic-cache for a cache-free cost-floor comparison. - Bump claude-opus-4-6 -> claude-opus-4-8 (configs + pricing, $5/$25 verified). Validated: 1148 unit tests pass (incl. new cache tests) + a live one-run smoke on compaction_chain_baseline (20,523 cache reads, behavior unchanged). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…k runs) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds the remaining single-GPU sweep partition (33.8k runs, 26 config cells) to the existing dual-GPU results, completing the full reasoning_replay grid across all 14 models x {none,keep-last,full} x {bare,reforged} x {native,prompt}. All 78 cells verified complete (26 scenarios x 50 runs each), zero duplicate run-keys. Rows stamped gen=3 (v0.6.0=1, v0.7.0=2) so cross-generation report dedup keeps this suite over older generations of the same config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge the catch-up reasoning-replay eval rows into the canonical v0.7.5 dataset and remove the temporary GPU-A shard. Add FORGE_EVAL_PORT so concurrent local eval workers can use separate llama-server ports.

The v0.7.5 eval grid showed dropping replayed reasoning is statistically indistinguishable from replay-all on score while saving the replayed tokens every turn, so the bounded policy becomes the default. Help strings, the resume-fold docstring, and the anthropic prompt-caching rationale updated to match; default-behavior tests now assert omission, with fold/exposure mechanics re-pinned under explicit keep-last. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…none The Haiku baseline ran before the default-policy decision and recorded keep-last; Sonnet/Opus recorded none. The knob is request-inert for Claude rows (no captured reasoning is replayed), so the field is a label, not a behavioral difference - re-stamped for a consistent board. Targeted byte-level edit of the 3,900 Haiku rows; all other lines byte-identical. Post-edit validation: 170,300 rows, 0 bad JSON, 0 duplicate run keys. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ConfigKey (display identity) gains the policy so none/keep-last/full render as separate rows, tagged :keep-last/:full (untagged = the none default; pre-knob rows count as full - that is what they ran). The dedup identity (_config_tuple) deliberately excludes it so latest-gen-wins still supersedes pre-knob rows whole-config instead of keeping them as stale :full duplicates. Adds the reasoning-replay.md policy-comparison view, a --reasoning-replay report filter, a Reasoning Replay dashboard filter dimension with canonical ordering, and the gen-3 legend entry (tag ref v0.7.5; the squash SHA does not exist pre-merge). Reports and dashboard regenerated from all four dataset files. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Document the knob and the new none default across README, User Guide, and Backend Setup, with links to the eval evidence. ADR-017 records the policy design, the grid results behind the default, and the alternatives considered. Model Registry: Claude footnote updated for the v0.7.5 thinking-on re-baseline (Sonnet 4.6 / Opus 4.8; Opus 4.6 and the deep-ablation rows stay carried forward), and Qwen3 8B Q8_0 is flagged for future retirement on compute-cost vs signal-value grounds (~23% of the full sweep for a small Q4/Q8 delta). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

antoinezambelli and others added 14 commits June 3, 2026 01:49

Add v0.7.5 reasoning-replay eval results (rig-02 dual-GPU sweep, 67.6…

b1217c7

…k runs) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add Anthropic v0.7.5 eval rows

694110b

Add GPU-A catch-up replay eval shard

2268c99

eval: merge reasoning replay catch-up results

ab6f5e5

Merge the catch-up reasoning-replay eval rows into the canonical v0.7.5 dataset and remove the temporary GPU-A shard. Add FORGE_EVAL_PORT so concurrent local eval workers can use separate llama-server ports.

release: v0.7.5 version bump + changelog

8ba0f1e

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

antoinezambelli merged commit 5d01dfd into main Jun 12, 2026
2 checks passed

antoinezambelli deleted the az/reasoning-canonical branch June 12, 2026 01:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.7.5: reasoning_replay knob (default none) + replay eval grid + Anthropic thinking-on baseline#105

v0.7.5: reasoning_replay knob (default none) + replay eval grid + Anthropic thinking-on baseline#105
antoinezambelli merged 14 commits into
mainfrom
az/reasoning-canonical

antoinezambelli commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

antoinezambelli commented Jun 12, 2026

What

Why none

Changes

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Why `none`