v0.7.4: malformed-args → tool-error channel + 32GB eval tier #102
Merged
Models occasionally emit a structurally valid tool call with malformed
args content (e.g. arguments="" instead of arguments="{}"). Pydantic
rejected at ToolCall construction, crashing the stage with
ValidationError. Observed at 86% of error rows on Qwen3-Next prompt
mode (77/89), same family on Qwen3.6 (rig-02).
This is conceptually "tool called with bad args" — the call exists,
the inputs are wrong — same as FileNotFoundError at runtime. Should
ride the tool-error channel with max_tool_errors=2 budget, not crash.
- ToolCall / TextResponse: BaseModel → @dataclass. args is no longer
validated at construction; ResponseValidator enforces dict-shape.
Audit: no .model_* API on ToolCall anywhere in forge.
- ResponseValidator: new args-shape branch after unknown-tool check.
Unknown-tool runs first (cheap; no point validating args on a
hallucinated tool name).
- nudges.tool_arg_validation_nudge: schema-derived message naming the
tool, the received args type, and the required JSON-object shape.
- inference: parse-error nudges drain max_tool_errors (record_result)
not max_retries (record_retry). Message prefix
[ToolArgValidationError] vs [UnknownTool].
- Exhaustion message simplified: includes which budget and nudge kind.
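The validation order described in the list above can be sketched roughly as follows. This is a minimal illustration, not forge's actual code: the field names on `ToolCall`, the function `check_tool_call`, and the `"unknown_tool"` kind string are assumptions made for the example (only the `tool_arg_validation` kind name appears in this PR).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCall:
    # Plain dataclass: args is NOT validated at construction time,
    # so ToolCall(name="x", args="") no longer raises.
    name: str
    args: Any
    call_id: Optional[str] = None  # hypothetical field name


def check_tool_call(call: ToolCall, known_tools: set[str]) -> Optional[str]:
    """Return a nudge kind if the call is faulty, else None (hypothetical helper)."""
    # Unknown-tool check runs first: it is cheap, and there is no point
    # validating args for a hallucinated tool name.
    if call.name not in known_tools:
        return "unknown_tool"  # assumed kind name
    # Args-shape branch: anything that is not a JSON object (dict) is malformed.
    if not isinstance(call.args, dict):
        return "tool_arg_validation"
    return None
```

With this shape, `arguments=""` survives construction and is reported by the validator instead of crashing the stage.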
Smoke (Ministral-3-14B-Reasoning, 26 scenarios × 25 runs prompt-mode):
score 78.77% (vs v0.7.0 baseline 79.5% at n=50). Delta within
±1.6% noise band, so no regression. The patch never tripped on this
model, so this is a regression check only; a full bake on a model that
hits the path is to follow.
884 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…UD, Nemotron-3-Nano GGUF + server-flag + sampling-default entries for the three rig-02 32GB-tier models added for the v0.7.1 run. Launch config of record for eval_results_rig-02_v0.7.1.jsonl (31,200 rows).
Unify all OpenAI-shape clients on one decode_tool_args helper (clients/base.py): JSON-string args are parsed; malformed or non-dict payloads ride through on the ToolCall as raw (non-dict) args instead of collapsing to a TextResponse (openai_compat, vllm, llamafile) or raising (anthropic streaming). ResponseValidator's args-shape check then routes them to the tool-error channel + max_tool_errors budget (the same lane as a runtime tool error) rather than a retry nudge.

Completes the client normalization #86 started (one decoder, all clients) and keeps fail-loud (never coerced to {}). Also closes an unguarded json.loads crash in the anthropic streaming finalize.

Behavior change: structural malformed-args now drains max_tool_errors (2), not max_retries (3), in proxy mode. Wire-invisible; no public signature changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
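As a rough sketch of the decoding contract described in this commit (parse JSON strings, pass everything else through untouched, never coerce to `{}`), something like the following would fit. The body is an assumption written for illustration; only the helper's name and its fail-loud behavior come from the commit message.

```python
import json
from typing import Any


def decode_tool_args(raw: Any) -> Any:
    """Illustrative decoder: returns parsed JSON, or the raw payload if unparseable."""
    if isinstance(raw, str):
        try:
            # May legitimately return a non-dict (list, number, ...); the
            # validator, not the decoder, decides whether the shape is OK.
            return json.loads(raw)
        except json.JSONDecodeError:
            # Malformed JSON rides through as the raw string so the
            # validator can route it to the tool-error channel.
            return raw
    # Already-decoded payloads (dict or otherwise) pass through unchanged.
    return raw
```

Keeping shape checks out of the decoder means every client produces the same `ToolCall` for the same bad payload, and the routing decision lives in one place.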
…le malformed test

From the helper-vs-inline structural review (verdict: keep the shared decode_tool_args helper). Two correctness/honesty caveats actioned:
- ToolCall.args annotated dict[str, Any] while the runtime contract now intentionally allows non-dicts (the docstring already says so). Widen to Any so the type stops lying.
- Stale comments in openai_compat/vllm streaming finalize still claimed malformed args yield a retry-driving TextResponse; corrected to the raw-args → tool-error-channel routing (they invited the exact drift the helper prevents).
- Add an explicit llamafile malformed-args test (non-stream native path).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… expose proxy max_tool_errors
From the H1/H2 design review. Two consistency gaps closed:
1. The Guardrails middleware facade recorded EVERY validator failure as a
retry (max_retries) and returned action='retry' — diverging from
run_inference/proxy, where malformed args drain the tool-error budget
and tool-call faults ride the tool channel. The facade now:
- routes malformed args (tool_arg_validation) to max_tool_errors,
- returns a new action='tool_error' for tool-call faults (unknown
tool name OR malformed args), with nudge.role='tool' so callers
emit the correction on the tool-result channel.
Channel vs budget are now two explicit kind-sets in nudge.py
(TOOL_CHANNEL_KINDS ⊃ TOOL_ERROR_KINDS); unknown-tool rides the tool
channel but still drains the retry budget, matching run_inference.
_TOOL_ERROR_KINDS moved from inference.py to the shared nudge module.
2. Proxy exposed max_retries but not max_tool_errors, hiding the budget
exactly where malformed-arg recovery now matters. Added --max-tool-errors
(default 2) threaded ProxyServer → HTTPServer → handler → ErrorTracker.
Nobody depends on the middleware facade yet, so the CheckResult.action
addition is free. Channel parity in run_inference unchanged (it emits
role=tool for list-branch corrections regardless of nudge.role).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
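The channel-versus-budget split in this commit can be pictured with two kind-sets and a small routing function. This is a sketch under assumptions: the set names match the commit message, but their exact contents, the `"unknown_tool"` kind string, and the `route` function are hypothetical.

```python
# Kinds that drain the tool-error budget (max_tool_errors).
TOOL_ERROR_KINDS = frozenset({"tool_arg_validation"})

# Kinds corrected on the tool-result channel (role="tool").
# Strict superset: unknown-tool rides the channel but not the budget.
TOOL_CHANNEL_KINDS = TOOL_ERROR_KINDS | frozenset({"unknown_tool"})


def route(kind: str) -> tuple[str, str]:
    """Return (action, budget) for a validator failure kind (hypothetical)."""
    action = "tool_error" if kind in TOOL_CHANNEL_KINDS else "retry"
    budget = "max_tool_errors" if kind in TOOL_ERROR_KINDS else "max_retries"
    return action, budget
```

Separating the two sets keeps "which channel carries the correction" independent from "which budget pays for it", which is what lets unknown-tool faults use the tool channel while still counting as retries.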
Closes the last crash vector from the design review (hole 4): StepTracker.check_prerequisites did args.get(match_arg), which raises on a non-dict args. ResponseValidator fences this before dispatch in the runner/proxy, but a granular caller that bypasses check() could reach it directly. Treat non-dict args as unsatisfied (block, don't crash).

ADR-016 records the malformed-args → tool-error-channel decision in its honest framing: a native-mode conditioning bet, not an ontology claim; prompt mode degrades to the prior retry shape; the tool-error budget coupling is deliberate but revisitable.

CHANGELOG held for release time.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
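The guard described above amounts to a type check before the `.get` call. A minimal sketch follows, assuming a prerequisite is satisfied when `args[match_arg]` equals some expected value; the function name `prerequisite_satisfied` and the equality comparison are illustrative assumptions, not the actual `StepTracker` logic.

```python
from typing import Any


def prerequisite_satisfied(args: Any, match_arg: str, expected: Any) -> bool:
    """Hypothetical stand-in for the args lookup inside check_prerequisites."""
    # Non-dict args (raw string, list, None, ...) have no usable .get(),
    # so treat the prerequisite as unsatisfied: block, don't crash.
    if not isinstance(args, dict):
        return False
    return args.get(match_arg) == expected
```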
Inject a per-row `gen` field so one dashboard can fold eval waves run against different code states. gen is a comparability epoch, not a release version: v0.6.0 -> gen 1 (carries the Anthropic ablation + Retired-tier models, neither re-run since), and v0.7.0 plus the new 32GB tier -> gen 2. Rename the 32GB wave to eval_results_v0.7.4.jsonl (its landing release) and keep it as a separate file beside v0.7.0 — same gen, distinct wave, so each wave keeps its own landing commit for reproducibility. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
report.py now accepts multiple result files and keeps the newest gen per config (dedup_latest_gen), so the board folds all generations into one view. Lagging rows (gen < newest) get a superscript badge backed by a commit/date legend; Retired-tier models are carried forward but hidden by default (--include-retired, or a sidebar checkbox in the HTML). Adds MODEL_FAMILIES entries for the 6 32GB models so they render clean family names and cross-backend keys instead of raw GGUF stems. React dashboard: Show-retired checkbox (dimmed rows + a "retired" pill), superscript gen badges with provenance tooltips from the data blob. Regenerated docs/results/ from the three gen-tagged datasets. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
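One plausible shape for the newest-gen-per-config fold is sketched below. It assumes each result row is a dict carrying a `"config"` key and an integer `"gen"`; those key names and the function body are assumptions for illustration, and the real `dedup_latest_gen` in report.py may key configs differently.

```python
from typing import Any


def dedup_latest_gen(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only rows whose gen is the newest seen for their config (sketch)."""
    newest: dict[str, int] = {}
    # First pass: find the highest gen per config across all input files.
    for row in rows:
        key = row["config"]
        newest[key] = max(newest.get(key, row["gen"]), row["gen"])
    # Second pass: drop rows superseded by a newer gen of the same config.
    return [row for row in rows if row["gen"] == newest[row["config"]]]
```

A config that was never re-run keeps its older gen in the output, which is what makes a "lagging" badge necessary on the board.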
Bump 0.7.3 -> 0.7.4 and add the 0.7.4 CHANGELOG entry (malformed args -> tool-error channel; 32GB eval tier + dashboard eval-generations). Move the 6 32GB models (Mistral-Small-3.2, Qwen3.5/3.6 27-35B, Nemotron-3 Nano) from Unpublished to Current now that they're in the published eval, and reword the tier definitions for the dashboard's eval generations. Also scrub a stale bring-up note from the Qwen3.5-122B footnote (it leaked operator smoke-probe process into a public doc) and exclude the built dashboard dist/ from the hatchling sdist sweep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ToolCall/TextResponse pydantic->dataclass move only breaks callers who serialized these via the pydantic .model_* API or relied on construction-time validation; attribute reads and keyword construction are unchanged. Reserving BREAKING for forced-migration changes (cf. 0.7.3 --mode rename) keeps the badge meaningful.
v0.7.4 — malformed-args → tool-error channel + 32GB eval tier
Two self-scoped changes, plus the dashboard infra to surface the larger models.
Malformed tool-call arguments ride the tool-error channel
A structurally valid tool call whose `arguments` are unparseable or not an object is now corrected via a tool-error result (`role="tool"`, anchored to its `tool_call_id`, draining `max_tool_errors`), uniformly across all OpenAI-shape clients and all three integration modes (`WorkflowRunner`, proxy, `Guardrails` facade). A single `decode_tool_args` helper (`clients/base.py`) is the one place args are decoded; shape validation moved to `ResponseValidator`. Supersedes 0.7.3's "malformed args drive a retry nudge."

Framed as a native-mode conditioning bet, not an ontology claim: a small model plausibly self-corrects better on the channel it was pretrained on. In prompt mode the tool role downgrades to a user message, so behavior there is unchanged. See docs/decisions/016.

`ToolCall`/`TextResponse` are now plain dataclasses (`args: Any`); attribute access + keyword construction unchanged, pydantic `.model_*` API gone. (Not labeled BREAKING: blast radius is callers who serialized these via pydantic; reserving the badge for forced-migration changes like 0.7.3's `--mode` rename.)

`CheckResult.action` gains `"tool_error"`; proxy exposes `--max-tool-errors` (default 2); `StepTracker.check_prerequisites` guards non-dict `args`.

32GB eval tier + eval-generation dashboard infra

`gen` comparability epoch (decoupled from version + filename); `report.py` dedups latest-gen-per-config, badges lagging rows, hides Retired behind a toggle. The tool-arg change was a no-op on these 32GB models (error type didn't surface) → same gen as the 8–14B lineup.

Validation
🤖 Generated with Claude Code