fix: exclude stream_options from proxy passthrough#94
Conversation
Forge controls streaming independently — it uses non-streaming calls internally for guardrail validation. When an inbound request carries stream_options (e.g. from Zed, OpenCode, or any client requesting streaming token usage), that field leaked through _extract_passthrough into the outbound body sent to the backend. On strict OpenAI-compatible backends like vLLM, this causes a validation error: 'Stream options can only be defined when stream=True'. The backend rightfully rejects the request because Forge set stream=False but left stream_options in the body. Fix: add stream_options to _FORGE_OWNED so it is stripped alongside stream. Forge's own clients (LlamafileClient, VLLMClient) inject stream_options when they need it in their send_stream() paths. Includes a regression test.
antoinezambelli
left a comment
There was a problem hiding this comment.
This is a clean, correct fix — thank you! Merging it.
You've got the reasoning exactly right: forge owns the streaming decision (stream is already in _FORGE_OWNED), and it makes non-streaming client.send(...) calls to the backend regardless of the inbound stream value. So a leaked stream_options rides into a non-streaming request, and strict OpenAI-compatible backends (vLLM especially) reject stream_options when stream isn't True. Putting stream_options right next to stream in the owned set is the principled spot for it — they're two halves of the same decision forge is already reasoning about.
The regression test is nicely targeted, too — asserting both the exclusion and the exact resulting passthrough shape so a future refactor can't quietly let it slip back in.
Verified locally: your new test passes and the full unit suite is green.
Thanks again for catching this and for the tidy patch!
* Proxy: native-only + transparent OpenAI passthrough
Make the OpenAI-compatible proxy native-tool-call-only and forward the
client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip
that dropped schema detail and leaked empty tool names.
- Remove the proxy's --mode surface; the proxy always drives the backend
client native. LlamafileClient's prompt-injection machinery is retained
for non-proxy WorkflowRunner / direct-client use (it still wins for some
models in full-guardrail workflow evals).
- Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native
path sends it verbatim. Other clients accept-and-ignore (vLLM also gains
the previously-missing passthrough/inbound_anthropic_body kwargs).
- run_inference forwards raw OpenAI messages/tools only on the clean first
attempt (use_raw_messages gate); any mutation falls back to fold+serialize.
- respond tool is now opt-in (--inject-respond-tool, default off).
- No instrumentation (proxy_trace/guardrail_stats deliberately not ported).
- Tests: drop removed mode-guard tests; respond tests opt in explicitly; add
native-passthrough, detachment, respond-default, and first-attempt-gate
coverage. Docs: ADR-012 revision + BACKEND_SETUP proxy note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Add prompt-injection as opt-in proxy capability (--backend-capability)
The proxy serves tool-call-capable backends natively (verbatim tool/message
passthrough). This adds prompt-injection back as an explicit opt-in for
non-function-calling backends (llama.cpp / llamafile without a tool template).
- New --backend-capability {native,prompt} (default native), declared once at
construction and frozen — no runtime probing or mid-request mode mutation.
- prompt capability reuses LlamafileClient's existing prompt path (build the
tool prompt, downgrade tool/assistant-tool_call history to text, parse the
JSON tool call back into native tool_calls). No client changes.
- Handler suppresses verbatim raw passthrough when in prompt mode so inference
folds normally and the client injects the tool prompt.
- Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol).
- Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Proxy: log effective backend_timeout at startup
The configurable backend_timeout (#91) was validated, stored, and threaded
into every client request, but never surfaced at launch. Extend the
"Proxy ready" line to report the effective value so the operative timeout
is visible/diagnosable from the startup log.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* vLLM: single source of truth for model identity (#75)
VLLMClient kept two identity fields with distinct roles — model_path (the
verbatim wire "model" field, which vLLM validates against its
--served-model-name) and model (the derived registry-lookup key). The proxy's
external-mode served-name adoption set both by hand (model_path = served;
model = served), duplicating the derivation logic and storing the full served
name where the constructor's rule stores the stem.
Extract the path->key derivation into _derive_model_field and wrap both
assignments in _set_model_identity, then call it from __init__ and from the
proxy. External adoption now upholds the same (model_path, model) invariant as
construction: an HF-repo-id served name reaches the wire verbatim while the
registry key is the derived stem.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Clients: consistent malformed-tool-call + response-shape handling
Audited malformed-tool-call and unexpected-payload handling across the
OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and
LlamafileClient. Standardize on one principle, applied uniformly:
- Malformed argument JSON (a model mistake) -> TextResponse, routing the raw
output back through the inference loop so the rescue/retry path can recover.
- A broken provider envelope (missing choices/message) or unexpected args type
(a contract violation, not the model's fault) -> BackendError: fail loud and
consistent, never a stray KeyError/IndexError.
Changes:
- vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on
malformed args, unlike llamafile's retry-driving TextResponse) with a
_parse_tool_calls mirroring the reference. Route both send() and send_stream()
through it so streaming and non-streaming agree: a fully accumulated but
unparseable arguments string finalizes as a TextResponse, not an exception.
- llamafile / openai_compat: guard the bare data["choices"][0]["message"]
subscripts -> BackendError on a broken envelope (matching what vLLM already
did for choices). llamafile also hardens function/name access.
- ollama: defensive .get on function/name (both paths); document that Ollama
emits dict args by contract, so no json.loads is needed there.
Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing-
function/reasoning) + streaming malformed-fragment parity; envelope-guard tests
for llamafile and openai_compat. 1092 unit tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* LlamafileClient: remove runtime auto mode; native-first, frozen capability
Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try
native, fall back to prompt on HTTP error, recording resolved_mode). This was
the last vestige of the mid-request capability mutation the proxy rewrite
excised everywhere else; the proxy already declares its capability up front via
--backend-capability. With auto gone, resolved_mode is always == self.mode, so
the whole tri-state indirection collapses to a direct dispatch on self.mode.
The default is now native. This is both hardening and a deliberate posture
shift: local-model function-calling support has matured into the more reliable
path, so native-first is the right default. Prompt-injection is preserved as an
explicit opt-in (mode="prompt") and is the theoretically correct fallback for
non-FC backends — but it is honestly flagged, in the docstring and docs, that
models tend to struggle to drive the prompt-injected protocol reliably on more
complex, multi-step interactions. Capability is declared-and-frozen: an invalid
mode (including the old "auto") now raises ValueError rather than silently
degrading.
- llamafile.py: validate mode in __init__; default native; delete
_resolve_and_send and the resolved_mode attribute/branches; dispatch send /
send_stream on self.mode; rewrite the class docstring (native-first rationale
+ prompt caveat).
- eval_runner.py: --llamafile-mode choices [native, prompt], default native.
- docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat.
- tests: drop the auto-mode suite; assert native default + ValueError on "auto".
Consumers verified unaffected: the proxy (both sites), batch_eval, and the
integration script all pass mode explicitly. 1086 unit tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eval): honor manual context budget in batch_eval via start_with_budget
batch_eval brought servers up with a bare server.start() (no ctx_override)
and resolved the budget separately via server.resolve_budget(), so
--budget-mode manual --num-ctx N was a no-op for llama-server: the server
booted at the model's full native context (no -c), and resolve_budget(MANUAL)
just read that full value back from /props. (Ollama was unaffected — its
context is per-request via set_num_ctx.)
Route both the initial bring-up and _recover_server through the prod
start_with_budget() path, which threads manual_tokens -> ctx_override -> -c
at launch and returns the resolved budget. _recover_server gains
budget_mode/manual_tokens params so a restarted server reuses the same
budget. Drops the now-redundant standalone resolve_budget() on the happy
path (still used on the recovery branch to read back the resolved value).
This also fixes FORGE_FAST mode, which the old bare-start() path never
supported.
Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000):
server boots with -c, rows record budget_tokens=20224 (server-clamped)
instead of the previous 262144 full-native read-back.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(release): 0.7.3 — native-first proxy
Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this
branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient
#89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94).
Headline: native-first proxy. BREAKING — the proxy --mode flag is renamed
to --backend-capability (no alias; --mode was only introduced in 0.7.1).
Native is the default and only auto-selected protocol; prompt-injection is
an explicit opt-in for non-FC llama.cpp/llamafile backends.
USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode
tends to degrade on more complex multi-step interactions. BACKEND_SETUP,
EVAL_GUIDE, and ADR-012 were already updated earlier on this branch.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Problem
When a client sends
stream_optionsin the request body (e.g.{"stream_options": {"include_usage": true}}), the proxy's_extract_passthroughlets it through to the backend. But Forge makes non-streaming calls internally for guardrail validation (stream=False), so the backend receivesstream_optionswithoutstream=True.Strict OpenAI-compatible backends reject this combination:
This triggers on every request from clients like Zed editor, which always send
stream_options.Fix
Add
stream_optionsto_FORGE_OWNEDinhandler.py, alongside the existingstreamentry. Forge's own clients already injectstream_optionswhen needed in theirsend_stream()paths (e.g.VLLMClient,LlamafileClient), so stripping it from passthrough is safe.Changes
src/forge/proxy/handler.py: Addstream_optionsto_FORGE_OWNEDfrozensettests/unit/test_proxy_handler.py: Add regression testTesting
All 27 proxy handler tests pass, including the new one.