v0.7.5: reasoning_replay knob (default none) + replay eval grid + Anthropic thinking-on baseline #105
Merged
Bound reasoning accumulation in the forge->backend direction. Adds core/reasoning.py policy module and threads reasoning_replay through inference, runner, and proxy convert/handler paths. keep-last emits reasoning via reasoning_content for round-trip re-capture and trims older reasoning; none strips it; full preserves prior behavior. The Anthropic path drops reasoning under keep-last (no signable channel). Includes docs (README, BACKEND_SETUP, USER_GUIDE) and unit tests.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
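The three policies described in this commit can be illustrated with a minimal sketch. It is not the contents of core/reasoning.py, which is not shown here: the function name `apply_reasoning_replay`, the `reasoning` message field, and the dict-based message shape are all assumptions made for illustration.

```python
from typing import Any

POLICIES = ("full", "keep-last", "none")


def apply_reasoning_replay(
    messages: list[dict[str, Any]], policy: str = "none"
) -> list[dict[str, Any]]:
    """Return a copy of `messages` with reasoning trimmed per `policy`.

    Hypothetical helper: the "reasoning" key is an assumed field name.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown reasoning_replay policy: {policy!r}")

    if policy == "full":
        # Replay everything, as before the knob existed.
        return [dict(m) for m in messages]

    # Under keep-last, find the most recent assistant message that
    # carries non-empty reasoning; under none, keep nothing.
    keep_index = None
    if policy == "keep-last":
        for i, msg in enumerate(messages):
            if msg.get("role") == "assistant" and msg.get("reasoning"):
                keep_index = i

    trimmed = []
    for i, msg in enumerate(messages):
        copy = dict(msg)  # never mutate the caller's history
        if i != keep_index:
            copy.pop("reasoning", None)
        trimmed.append(copy)
    return trimmed
```

Because the sketch copies each message, the internal history keeps its reasoning and only the outgoing view is trimmed.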
Make reasoning_replay a first-class, resumable, recorded eval axis so the
re-sweep can run none-on-all (regression) and keep-last/full on reasoning
models without collisions.
- EvalConfig gains reasoning_replay; run_scenario passes it to WorkflowRunner
(the backend pipeline already consumes it). run_eval propagates it.
- batch_eval: run-wide --reasoning-replay choice (mirrors --ablation),
threaded into run_batch, every EvalConfig, and the JSONL row.
- Centralize the 6 inline resume keys into _run_key(); reasoning_replay is
now part of the key so distinct policies for the same model+scenario are
independent runs. _count_completed_runs defaults pre-knob rows (no field)
to keep-last, so old dumps resume cleanly under the default.
- Both CLIs expose --reasoning-replay {full,keep-last,none}; banners print it.
- Add test_batch_eval_resume.py covering key distinctness, row recording,
and per-policy resume counting incl. the legacy-default fold.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
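A minimal sketch of the resume-key idea in this commit, assuming each JSONL row is a dict with `model` and `scenario` fields. The real `_run_key()` centralizes more fields than shown here; `run_key`, `count_completed`, and `LEGACY_POLICY` are hypothetical names, and the legacy fold target follows this commit's text (a later commit in this PR changes the default).

```python
from collections import Counter
from typing import Any

# Assumption taken from this commit message: rows written before the
# knob existed carry no reasoning_replay field and fold into keep-last.
LEGACY_POLICY = "keep-last"


def run_key(row: dict[str, Any]) -> tuple[str, str, str]:
    """Build the resume key; the policy is part of the key."""
    policy = row.get("reasoning_replay", LEGACY_POLICY)
    return (row["model"], row["scenario"], policy)


def count_completed(rows: list[dict[str, Any]]) -> Counter:
    """Count finished runs per key so a resumed sweep can skip them."""
    return Counter(run_key(row) for row in rows)
```

With the policy in the key, a `none` run and a `keep-last` run of the same model and scenario are counted separately, so one cannot satisfy the other's quota on resume.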
Add count_wire_reasoning() (eval-only, no src change): serialize the
recorded transcript through the real fold_and_serialize choke point and
count which reasoning blocks survive onto the backend wire. Emit
reasoning_wire (survived) and reasoning_wire_total (non-empty blocks)
per batch_eval row, so the sweep records an actual replay rate.
Validated on a reasoning model (N=10, all 3 policies, 26 scenarios):
none -> 0 on the wire across all 260 rows; keep-last in {0,1};
full in [0, total]. Surfaces that legacy/full replay is itself lossy
(~29% of generated reasoning reaches the wire) due to consecutive-block
collapse and empty-reasoning omission in fold_and_serialize.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cache the re-sent tool defs + system prompt on the Anthropic eval path so the repeated input prefix bills at 0.1x (read) instead of full price every turn. Billing-only: identical model behavior, accuracy, and iteration counts (safe for cross-run comparability).
- AnthropicClient gains opt-in `prompt_caching` (default off, so the proxy verbatim path and existing request shape are untouched). When on, a static ephemeral breakpoint marks the tools + system prefix in the rebuild path.
- Static-only on purpose: a rolling conversation breakpoint is NOT placed. The default reasoning_replay="keep-last" re-serializes earlier tool-call messages each turn, which busts a rolling prefix cache (1.25x writes, no reads). The conversation prefix is only stable under none/full, and reasoning_replay is a measured variable we won't pin, so caching is confined to the always-stable tools+system region.
- TokenUsage carries cache_creation/cache_read counts (additive, defaults 0); captured in send() and send_stream(); accumulated through CountingClientWrapper and RunResult into the JSONL row.
- _compute_cost is cache-aware (write 1.25x, read 0.1x of input rate); applied at the row and both eval_runner cost summaries.
- Enabled by default for batch_eval sweeps; eval_runner gains --no-anthropic-cache for a cache-free cost-floor comparison.
- Bump claude-opus-4-6 -> claude-opus-4-8 (configs + pricing, $5/$25 verified).
Validated: 1148 unit tests pass (incl. new cache tests) + a live one-run smoke on compaction_chain_baseline (20,523 cache reads, behavior unchanged).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
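The cache-aware pricing rule (writes at 1.25x and reads at 0.1x of the input rate) works out as in the sketch below. This is not the project's `_compute_cost`: the function name, the parameter names, and the per-million-token rate units are assumptions, as is the premise that `input_tokens` excludes tokens written to or read from the cache.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # cache creation, relative to the input rate
CACHE_READ_MULTIPLIER = 0.10   # cache read, relative to the input rate


def compute_cost(
    input_tokens: int,
    output_tokens: int,
    cache_creation_tokens: int = 0,
    cache_read_tokens: int = 0,
    *,
    input_rate_per_mtok: float,
    output_rate_per_mtok: float,
) -> float:
    """Return the cost in the same currency unit as the rates.

    Assumes rates are quoted per million tokens and that the three
    input-side counters do not overlap.
    """
    billed_input = (
        input_tokens
        + CACHE_WRITE_MULTIPLIER * cache_creation_tokens
        + CACHE_READ_MULTIPLIER * cache_read_tokens
    )
    return (
        billed_input * input_rate_per_mtok
        + output_tokens * output_rate_per_mtok
    ) / 1_000_000
```

Since both cache counters default to 0, rows recorded without cache fields price exactly as they did before, which matches the additive design described above.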
…k runs) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the remaining single-GPU sweep partition (33.8k runs, 26 config
cells) to the existing dual-GPU results, completing the full
reasoning_replay grid across all 14 models x {none,keep-last,full} x
{bare,reforged} x {native,prompt}. All 78 cells verified complete
(26 scenarios x 50 runs each), zero duplicate run-keys.
Rows stamped gen=3 (v0.6.0=1, v0.7.0=2) so cross-generation report
dedup keeps this suite over older generations of the same config.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merge the catch-up reasoning-replay eval rows into the canonical v0.7.5 dataset and remove the temporary GPU-A shard. Add FORGE_EVAL_PORT so concurrent local eval workers can use separate llama-server ports.
The v0.7.5 eval grid showed dropping replayed reasoning is statistically indistinguishable from replay-all on score while saving the replayed tokens every turn, so the bounded policy becomes the default. Help strings, the resume-fold docstring, and the anthropic prompt-caching rationale updated to match; default-behavior tests now assert omission, with fold/exposure mechanics re-pinned under explicit keep-last. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…none The Haiku baseline ran before the default-policy decision and recorded keep-last; Sonnet/Opus recorded none. The knob is request-inert for Claude rows (no captured reasoning is replayed), so the field is a label, not a behavioral difference - re-stamped for a consistent board. Targeted byte-level edit of the 3,900 Haiku rows; all other lines byte-identical. Post-edit validation: 170,300 rows, 0 bad JSON, 0 duplicate run keys. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ConfigKey (display identity) gains the policy so none/keep-last/full render as separate rows, tagged :keep-last/:full (untagged = the none default; pre-knob rows count as full - that is what they ran). The dedup identity (_config_tuple) deliberately excludes it so latest-gen-wins still supersedes pre-knob rows whole-config instead of keeping them as stale :full duplicates. Adds the reasoning-replay.md policy-comparison view, a --reasoning-replay report filter, a Reasoning Replay dashboard filter dimension with canonical ordering, and the gen-3 legend entry (tag ref v0.7.5; the squash SHA does not exist pre-merge). Reports and dashboard regenerated from all four dataset files. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
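The split between display identity and dedup identity can be sketched as follows. The row fields (`model`, `ablation`, `mode`, `gen`), the label format, and all three function names are assumptions for illustration; the real `ConfigKey` and `_config_tuple` are not shown in this PR text.

```python
from typing import Any

Row = dict[str, Any]


def config_tuple(row: Row) -> tuple[str, str, str]:
    """Dedup identity: deliberately excludes the replay policy."""
    return (row["model"], row["ablation"], row["mode"])


def display_label(row: Row) -> str:
    """Display identity: includes the policy as a tag.

    Untagged means none; rows without the field count as full.
    """
    policy = row.get("reasoning_replay", "full")
    base = f'{row["model"]}/{row["ablation"]}/{row["mode"]}'
    return base if policy == "none" else f"{base}:{policy}"


def latest_generation_rows(rows: list[Row]) -> list[Row]:
    """Keep only rows from the newest generation of each config."""
    newest: dict[tuple[str, str, str], int] = {}
    for row in rows:
        key = config_tuple(row)
        newest[key] = max(newest.get(key, 0), row["gen"])
    return [row for row in rows if row["gen"] == newest[config_tuple(row)]]
```

If the policy were part of the dedup key, a gen-2 row with no field would form its own `:full` group and survive next to the gen-3 rows. Excluding it lets the gen-3 suite replace the older generation as a whole.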
Document the knob and the new none default across README, User Guide, and Backend Setup, with links to the eval evidence. ADR-017 records the policy design, the grid results behind the default, and the alternatives considered. Model Registry: Claude footnote updated for the v0.7.5 thinking-on re-baseline (Sonnet 4.6 / Opus 4.8; Opus 4.6 and the deep-ablation rows stay carried forward), and Qwen3 8B Q8_0 is flagged for future retirement on compute-cost vs signal-value grounds (~23% of the full sweep for a small Q4/Q8 delta). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
Reasoning replay becomes a measured, bounded policy. Reasoning-capable backends return hidden reasoning alongside tool calls, and forge previously re-serialized all of it into backend-facing history on every later turn, with no way to turn it off. This PR adds a `reasoning_replay {full, keep-last, none}` knob to `WorkflowRunner` and the proxy and, backed by a full re-sweep of the published eval grid, makes `none` the default.

Why `none`

The v0.7.5 grid re-swept the full 8–14B lineup across all three policies × both ablations × native/prompt (~170k runs, scenario as the sampling unit, paired against the v0.7.0 legacy baseline):

- `full` reproduces the pre-knob baseline everywhere: the knob is a clean superset of legacy behavior.
- `none` is statistically indistinguishable from replay-all in aggregate (+0.49pp, p=0.17; reforged-only −0.35pp, p=0.45) while saving the replayed tokens every turn.
- …`none` ≈ `keep-last`).
- `none` → exactly 0 replayed reasoning across every run; `keep-last` ∈ {0, 1}.

Full rationale and alternatives: ADR-017. Per-config tables: reasoning-replay view.
Changes
- `reasoning_replay` knob: `WorkflowRunner(reasoning_replay=…)` / proxy `--reasoning-replay`. Serialization-only: reasoning is still captured and still surfaces in `on_message` and internal history. In OpenAI-compatible proxy responses, `keep-last` exposes current reasoning as `reasoning_content` rather than assistant `content`; under the default `none`, responses omit captured reasoning.
- …`full`; now `none`. Inert for models that emit no reasoning. Migration: `--reasoning-replay full` or `WorkflowRunner(reasoning_replay="full")` restores the historical behavior.
- …`:keep-last` / `:full` (untagged = `none`; pre-knob rows count as `full`), a Reasoning Replay dashboard filter, a `--reasoning-replay` report filter, and a dedicated policy-comparison view. Design note for review: the policy is in `ConfigKey` (display identity) but deliberately not in `_config_tuple` (dedup identity), so latest-generation-wins still supersedes pre-knob rows whole-config instead of keeping them as stale `:full` duplicates.
- `AnthropicClient(thinking=…)` (e.g. `{"type": "adaptive"}`); forced `tool_choice` suppressed when thinking is on (API requires `auto`). The Claude eval baseline now runs Sonnet 4.6 / Opus 4.8 with adaptive thinking (all prior Claude rows had thinking off); Haiku 4.5 has no adaptive support and stays non-thinking.
- `AnthropicClient(prompt_caching=True)` marks a static ephemeral cache breakpoint over tools + system; `TokenUsage` gains generic `cache_creation_input_tokens` / `cache_read_input_tokens`; eval cost accounting prices cache writes (1.25×) and reads (0.1×).
- `eval_results_v0.7.5.jsonl` (170,300 rows, new eval generation). Haiku rows re-stamped `keep-last` → `none` (label hygiene: they ran before the default decision; the knob is request-inert for Claude rows). Validated: 0 bad JSON, 0 duplicate run keys.
- `pyproject.toml` → 0.7.5, CHANGELOG section (dated 2026-06-11; touch up if the release slips).

Validation

…`keep-last`).

🤖 Generated with Claude Code