fix: exclude stream_options from proxy passthrough by alexandergunnarson · Pull Request #94 · antoinezambelli/forge

alexandergunnarson · 2026-05-30T13:53:46Z

Problem

When a client sends stream_options in the request body (e.g. {"stream_options": {"include_usage": true}}), the proxy's _extract_passthrough lets it through to the backend. But Forge makes non-streaming calls internally for guardrail validation (stream=False), so the backend receives stream_options without stream=True.

Strict OpenAI-compatible backends reject this combination:

ValidationError: 1 validation error for ChatCompletionRequest
  Value error, Stream options can only be defined when `stream=True`.

This triggers on every request from clients like Zed editor, which always send stream_options.

Fix

Add stream_options to _FORGE_OWNED in handler.py, alongside the existing stream entry. Forge's own clients already inject stream_options when needed in their send_stream() paths (e.g. VLLMClient, LlamafileClient), so stripping it from passthrough is safe.

Changes

src/forge/proxy/handler.py: Add stream_options to _FORGE_OWNED frozenset
tests/unit/test_proxy_handler.py: Add regression test

Testing

All 27 proxy handler tests pass, including the new one.

Forge controls streaming independently — it uses non-streaming calls internally for guardrail validation. When an inbound request carries stream_options (e.g. from Zed, OpenCode, or any client requesting streaming token usage), that field leaked through _extract_passthrough into the outbound body sent to the backend. On strict OpenAI-compatible backends like vLLM, this causes a validation error: 'Stream options can only be defined when stream=True'. The backend rightfully rejects the request because Forge set stream=False but left stream_options in the body. Fix: add stream_options to _FORGE_OWNED so it is stripped alongside stream. Forge's own clients (LlamafileClient, VLLMClient) inject stream_options when they need it in their send_stream() paths. Includes a regression test.

antoinezambelli

This is a clean, correct fix — thank you! Merging it.

You've got the reasoning exactly right: forge owns the streaming decision (stream is already in _FORGE_OWNED), and it makes non-streaming client.send(...) calls to the backend regardless of the inbound stream value. So a leaked stream_options rides into a non-streaming request, and strict OpenAI-compatible backends (vLLM especially) reject stream_options when stream isn't True. Putting stream_options right next to stream in the owned set is the principled spot for it — they're two halves of the same decision forge is already reasoning about.

The regression test is nicely targeted, too — asserting both the exclusion and the exact resulting passthrough shape so a future refactor can't quietly let it slip back in.

Verified locally: your new test passes and the full unit suite is green.

Thanks again for catching this and for the tidy patch!

* Proxy: native-only + transparent OpenAI passthrough Make the OpenAI-compatible proxy native-tool-call-only and forward the client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip that dropped schema detail and leaked empty tool names. - Remove the proxy's --mode surface; the proxy always drives the backend client native. LlamafileClient's prompt-injection machinery is retained for non-proxy WorkflowRunner / direct-client use (it still wins for some models in full-guardrail workflow evals). - Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native path sends it verbatim. Other clients accept-and-ignore (vLLM also gains the previously-missing passthrough/inbound_anthropic_body kwargs). - run_inference forwards raw OpenAI messages/tools only on the clean first attempt (use_raw_messages gate); any mutation falls back to fold+serialize. - respond tool is now opt-in (--inject-respond-tool, default off). - No instrumentation (proxy_trace/guardrail_stats deliberately not ported). - Tests: drop removed mode-guard tests; respond tests opt in explicitly; add native-passthrough, detachment, respond-default, and first-attempt-gate coverage. Docs: ADR-012 revision + BACKEND_SETUP proxy note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Add prompt-injection as opt-in proxy capability (--backend-capability) The proxy serves tool-call-capable backends natively (verbatim tool/message passthrough). This adds prompt-injection back as an explicit opt-in for non-function-calling backends (llama.cpp / llamafile without a tool template). - New --backend-capability {native,prompt} (default native), declared once at construction and frozen — no runtime probing or mid-request mode mutation. - prompt capability reuses LlamafileClient's existing prompt path (build the tool prompt, downgrade tool/assistant-tool_call history to text, parse the JSON tool call back into native tool_calls). No client changes. - Handler suppresses verbatim raw passthrough when in prompt mode so inference folds normally and the client injects the tool prompt. - Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol). - Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Proxy: log effective backend_timeout at startup The configurable backend_timeout (#91) was validated, stored, and threaded into every client request, but never surfaced at launch. Extend the "Proxy ready" line to report the effective value so the operative timeout is visible/diagnosable from the startup log. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * vLLM: single source of truth for model identity (#75) VLLMClient kept two identity fields with distinct roles — model_path (the verbatim wire "model" field, which vLLM validates against its --served-model-name) and model (the derived registry-lookup key). The proxy's external-mode served-name adoption set both by hand (model_path = served; model = served), duplicating the derivation logic and storing the full served name where the constructor's rule stores the stem. Extract the path->key derivation into _derive_model_field and wrap both assignments in _set_model_identity, then call it from __init__ and from the proxy. External adoption now upholds the same (model_path, model) invariant as construction: an HF-repo-id served name reaches the wire verbatim while the registry key is the derived stem. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Clients: consistent malformed-tool-call + response-shape handling Audited malformed-tool-call and unexpected-payload handling across the OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and LlamafileClient. Standardize on one principle, applied uniformly: - Malformed argument JSON (a model mistake) -> TextResponse, routing the raw output back through the inference loop so the rescue/retry path can recover. - A broken provider envelope (missing choices/message) or unexpected args type (a contract violation, not the model's fault) -> BackendError: fail loud and consistent, never a stray KeyError/IndexError. Changes: - vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on malformed args, unlike llamafile's retry-driving TextResponse) with a _parse_tool_calls mirroring the reference. Route both send() and send_stream() through it so streaming and non-streaming agree: a fully accumulated but unparseable arguments string finalizes as a TextResponse, not an exception. - llamafile / openai_compat: guard the bare data["choices"][0]["message"] subscripts -> BackendError on a broken envelope (matching what vLLM already did for choices). llamafile also hardens function/name access. - ollama: defensive .get on function/name (both paths); document that Ollama emits dict args by contract, so no json.loads is needed there. Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing- function/reasoning) + streaming malformed-fragment parity; envelope-guard tests for llamafile and openai_compat. 1092 unit tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * LlamafileClient: remove runtime auto mode; native-first, frozen capability Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try native, fall back to prompt on HTTP error, recording resolved_mode). This was the last vestige of the mid-request capability mutation the proxy rewrite excised everywhere else; the proxy already declares its capability up front via --backend-capability. With auto gone, resolved_mode is always == self.mode, so the whole tri-state indirection collapses to a direct dispatch on self.mode. The default is now native. This is both hardening and a deliberate posture shift: local-model function-calling support has matured into the more reliable path, so native-first is the right default. Prompt-injection is preserved as an explicit opt-in (mode="prompt") and is the theoretically correct fallback for non-FC backends — but it is honestly flagged, in the docstring and docs, that models tend to struggle to drive the prompt-injected protocol reliably on more complex, multi-step interactions. Capability is declared-and-frozen: an invalid mode (including the old "auto") now raises ValueError rather than silently degrading. - llamafile.py: validate mode in __init__; default native; delete _resolve_and_send and the resolved_mode attribute/branches; dispatch send / send_stream on self.mode; rewrite the class docstring (native-first rationale + prompt caveat). - eval_runner.py: --llamafile-mode choices [native, prompt], default native. - docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat. - tests: drop the auto-mode suite; assert native default + ValueError on "auto". Consumers verified unaffected: the proxy (both sites), batch_eval, and the integration script all pass mode explicitly. 1086 unit tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(eval): honor manual context budget in batch_eval via start_with_budget batch_eval brought servers up with a bare server.start() (no ctx_override) and resolved the budget separately via server.resolve_budget(), so --budget-mode manual --num-ctx N was a no-op for llama-server: the server booted at the model's full native context (no -c), and resolve_budget(MANUAL) just read that full value back from /props. (Ollama was unaffected — its context is per-request via set_num_ctx.) Route both the initial bring-up and _recover_server through the prod start_with_budget() path, which threads manual_tokens -> ctx_override -> -c at launch and returns the resolved budget. _recover_server gains budget_mode/manual_tokens params so a restarted server reuses the same budget. Drops the now-redundant standalone resolve_budget() on the happy path (still used on the recovery branch to read back the resolved value). This also fixes FORGE_FAST mode, which the old bare-start() path never supported. Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000): server boots with -c, rows record budget_tokens=20224 (server-clamped) instead of the previous 262144 full-native read-back. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * chore(release): 0.7.3 — native-first proxy Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient #89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94). Headline: native-first proxy. BREAKING — the proxy --mode flag is renamed to --backend-capability (no alias; --mode was only introduced in 0.7.1). Native is the default and only auto-selected protocol; prompt-injection is an explicit opt-in for non-FC llama.cpp/llamafile backends. USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode tends to degrade on more complex multi-step interactions. BACKEND_SETUP, EVAL_GUIDE, and ADR-012 were already updated earlier on this branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

alexandergunnarson and others added 2 commits May 30, 2026 13:53

Merge branch 'main' into fix/strip-stream-options-from-passthrough

69a6329

antoinezambelli self-requested a review May 31, 2026 20:39

antoinezambelli approved these changes May 31, 2026

View reviewed changes

antoinezambelli merged commit 96b0458 into antoinezambelli:main May 31, 2026
2 checks passed

antoinezambelli mentioned this pull request Jun 1, 2026

Native-first proxy (0.7.3) #97

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: exclude stream_options from proxy passthrough#94

fix: exclude stream_options from proxy passthrough#94
antoinezambelli merged 2 commits into
antoinezambelli:mainfrom
alexandergunnarson:fix/strip-stream-options-from-passthrough

alexandergunnarson commented May 30, 2026 •

edited

Loading

Uh oh!

antoinezambelli left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alexandergunnarson commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Changes

Testing

Uh oh!

antoinezambelli left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

alexandergunnarson commented May 30, 2026 •

edited

Loading