Skip to content

fix: exclude stream_options from proxy passthrough#94

Merged
antoinezambelli merged 2 commits into
antoinezambelli:mainfrom
alexandergunnarson:fix/strip-stream-options-from-passthrough
May 31, 2026
Merged

fix: exclude stream_options from proxy passthrough#94
antoinezambelli merged 2 commits into
antoinezambelli:mainfrom
alexandergunnarson:fix/strip-stream-options-from-passthrough

Conversation

@alexandergunnarson

@alexandergunnarson alexandergunnarson commented May 30, 2026

Copy link
Copy Markdown
Contributor

Problem

When a client sends stream_options in the request body (e.g. {"stream_options": {"include_usage": true}}), the proxy's _extract_passthrough lets it through to the backend. But Forge makes non-streaming calls internally for guardrail validation (stream=False), so the backend receives stream_options without stream=True.

Strict OpenAI-compatible backends reject this combination:

ValidationError: 1 validation error for ChatCompletionRequest
  Value error, Stream options can only be defined when `stream=True`.

This triggers on every request from clients like Zed editor, which always send stream_options.

Fix

Add stream_options to _FORGE_OWNED in handler.py, alongside the existing stream entry. Forge's own clients already inject stream_options when needed in their send_stream() paths (e.g. VLLMClient, LlamafileClient), so stripping it from passthrough is safe.

Changes

  • src/forge/proxy/handler.py: Add stream_options to _FORGE_OWNED frozenset
  • tests/unit/test_proxy_handler.py: Add regression test

Testing

All 27 proxy handler tests pass, including the new one.

alexandergunnarson and others added 2 commits May 30, 2026 13:53
Forge controls streaming independently — it uses non-streaming calls
internally for guardrail validation. When an inbound request carries
stream_options (e.g. from Zed, OpenCode, or any client requesting
streaming token usage), that field leaked through _extract_passthrough
into the outbound body sent to the backend.

On strict OpenAI-compatible backends like vLLM, this causes a
validation error: 'Stream options can only be defined when stream=True'.
The backend rightfully rejects the request because Forge set stream=False
but left stream_options in the body.

Fix: add stream_options to _FORGE_OWNED so it is stripped alongside
stream. Forge's own clients (LlamafileClient, VLLMClient) inject
stream_options when they need it in their send_stream() paths.

Includes a regression test.
@antoinezambelli antoinezambelli self-requested a review May 31, 2026 20:39

@antoinezambelli antoinezambelli left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a clean, correct fix — thank you! Merging it.

You've got the reasoning exactly right: forge owns the streaming decision (stream is already in _FORGE_OWNED), and it makes non-streaming client.send(...) calls to the backend regardless of the inbound stream value. So a leaked stream_options rides into a non-streaming request, and strict OpenAI-compatible backends (vLLM especially) reject stream_options when stream isn't True. Putting stream_options right next to stream in the owned set is the principled spot for it — they're two halves of the same decision forge is already reasoning about.

The regression test is nicely targeted, too — asserting both the exclusion and the exact resulting passthrough shape so a future refactor can't quietly let it slip back in.

Verified locally: your new test passes and the full unit suite is green.

Thanks again for catching this and for the tidy patch!

@antoinezambelli antoinezambelli merged commit 96b0458 into antoinezambelli:main May 31, 2026
2 checks passed
antoinezambelli added a commit that referenced this pull request Jun 1, 2026
* Proxy: native-only + transparent OpenAI passthrough

Make the OpenAI-compatible proxy native-tool-call-only and forward the
client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip
that dropped schema detail and leaked empty tool names.

- Remove the proxy's --mode surface; the proxy always drives the backend
  client native. LlamafileClient's prompt-injection machinery is retained
  for non-proxy WorkflowRunner / direct-client use (it still wins for some
  models in full-guardrail workflow evals).
- Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native
  path sends it verbatim. Other clients accept-and-ignore (vLLM also gains
  the previously-missing passthrough/inbound_anthropic_body kwargs).
- run_inference forwards raw OpenAI messages/tools only on the clean first
  attempt (use_raw_messages gate); any mutation falls back to fold+serialize.
- respond tool is now opt-in (--inject-respond-tool, default off).
- No instrumentation (proxy_trace/guardrail_stats deliberately not ported).
- Tests: drop removed mode-guard tests; respond tests opt in explicitly; add
  native-passthrough, detachment, respond-default, and first-attempt-gate
  coverage. Docs: ADR-012 revision + BACKEND_SETUP proxy note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Add prompt-injection as opt-in proxy capability (--backend-capability)

The proxy serves tool-call-capable backends natively (verbatim tool/message
passthrough). This adds prompt-injection back as an explicit opt-in for
non-function-calling backends (llama.cpp / llamafile without a tool template).

- New --backend-capability {native,prompt} (default native), declared once at
  construction and frozen — no runtime probing or mid-request mode mutation.
- prompt capability reuses LlamafileClient's existing prompt path (build the
  tool prompt, downgrade tool/assistant-tool_call history to text, parse the
  JSON tool call back into native tool_calls). No client changes.
- Handler suppresses verbatim raw passthrough when in prompt mode so inference
  folds normally and the client injects the tool prompt.
- Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol).
- Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Proxy: log effective backend_timeout at startup

The configurable backend_timeout (#91) was validated, stored, and threaded
into every client request, but never surfaced at launch. Extend the
"Proxy ready" line to report the effective value so the operative timeout
is visible/diagnosable from the startup log.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* vLLM: single source of truth for model identity (#75)

VLLMClient kept two identity fields with distinct roles — model_path (the
verbatim wire "model" field, which vLLM validates against its
--served-model-name) and model (the derived registry-lookup key). The proxy's
external-mode served-name adoption set both by hand (model_path = served;
model = served), duplicating the derivation logic and storing the full served
name where the constructor's rule stores the stem.

Extract the path->key derivation into _derive_model_field and wrap both
assignments in _set_model_identity, then call it from __init__ and from the
proxy. External adoption now upholds the same (model_path, model) invariant as
construction: an HF-repo-id served name reaches the wire verbatim while the
registry key is the derived stem.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Clients: consistent malformed-tool-call + response-shape handling

Audited malformed-tool-call and unexpected-payload handling across the
OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and
LlamafileClient. Standardize on one principle, applied uniformly:

- Malformed argument JSON (a model mistake) -> TextResponse, routing the raw
  output back through the inference loop so the rescue/retry path can recover.
- A broken provider envelope (missing choices/message) or unexpected args type
  (a contract violation, not the model's fault) -> BackendError: fail loud and
  consistent, never a stray KeyError/IndexError.

Changes:
- vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on
  malformed args, unlike llamafile's retry-driving TextResponse) with a
  _parse_tool_calls mirroring the reference. Route both send() and send_stream()
  through it so streaming and non-streaming agree: a fully accumulated but
  unparseable arguments string finalizes as a TextResponse, not an exception.
- llamafile / openai_compat: guard the bare data["choices"][0]["message"]
  subscripts -> BackendError on a broken envelope (matching what vLLM already
  did for choices). llamafile also hardens function/name access.
- ollama: defensive .get on function/name (both paths); document that Ollama
  emits dict args by contract, so no json.loads is needed there.

Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing-
function/reasoning) + streaming malformed-fragment parity; envelope-guard tests
for llamafile and openai_compat. 1092 unit tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* LlamafileClient: remove runtime auto mode; native-first, frozen capability

Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try
native, fall back to prompt on HTTP error, recording resolved_mode). This was
the last vestige of the mid-request capability mutation the proxy rewrite
excised everywhere else; the proxy already declares its capability up front via
--backend-capability. With auto gone, resolved_mode is always == self.mode, so
the whole tri-state indirection collapses to a direct dispatch on self.mode.

The default is now native. This is both hardening and a deliberate posture
shift: local-model function-calling support has matured into the more reliable
path, so native-first is the right default. Prompt-injection is preserved as an
explicit opt-in (mode="prompt") and is the theoretically correct fallback for
non-FC backends — but it is honestly flagged, in the docstring and docs, that
models tend to struggle to drive the prompt-injected protocol reliably on more
complex, multi-step interactions. Capability is declared-and-frozen: an invalid
mode (including the old "auto") now raises ValueError rather than silently
degrading.

- llamafile.py: validate mode in __init__; default native; delete
  _resolve_and_send and the resolved_mode attribute/branches; dispatch send /
  send_stream on self.mode; rewrite the class docstring (native-first rationale
  + prompt caveat).
- eval_runner.py: --llamafile-mode choices [native, prompt], default native.
- docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat.
- tests: drop the auto-mode suite; assert native default + ValueError on "auto".

Consumers verified unaffected: the proxy (both sites), batch_eval, and the
integration script all pass mode explicitly. 1086 unit tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(eval): honor manual context budget in batch_eval via start_with_budget

batch_eval brought servers up with a bare server.start() (no ctx_override)
and resolved the budget separately via server.resolve_budget(), so
--budget-mode manual --num-ctx N was a no-op for llama-server: the server
booted at the model's full native context (no -c), and resolve_budget(MANUAL)
just read that full value back from /props. (Ollama was unaffected — its
context is per-request via set_num_ctx.)

Route both the initial bring-up and _recover_server through the prod
start_with_budget() path, which threads manual_tokens -> ctx_override -> -c
at launch and returns the resolved budget. _recover_server gains
budget_mode/manual_tokens params so a restarted server reuses the same
budget. Drops the now-redundant standalone resolve_budget() on the happy
path (still used on the recovery branch to read back the resolved value).

This also fixes FORGE_FAST mode, which the old bare-start() path never
supported.

Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000):
server boots with -c, rows record budget_tokens=20224 (server-clamped)
instead of the previous 262144 full-native read-back.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(release): 0.7.3 — native-first proxy

Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this
branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient
#89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94).

Headline: native-first proxy. BREAKING — the proxy --mode flag is renamed
to --backend-capability (no alias; --mode was only introduced in 0.7.1).
Native is the default and only auto-selected protocol; prompt-injection is
an explicit opt-in for non-FC llama.cpp/llamafile backends.

USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode
tends to degrade on more complex multi-step interactions. BACKEND_SETUP,
EVAL_GUIDE, and ADR-012 were already updated earlier on this branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants