Native-first proxy (0.7.3) by antoinezambelli · Pull Request #97 · antoinezambelli/forge

antoinezambelli · 2026-06-01T06:03:09Z

Native-first proxy (0.7.3)

With native function calling now well-supported across modern local models, this release makes the OpenAI-compatible proxy native-first: by default it forwards the client's OpenAI tools / messages to the backend verbatim and is optimized for that path. Prompt-injection is preserved as an explicit opt-in for llama.cpp / llamafile backends that lack a function-calling template — but it is no longer the default.

This branch also folds in the commits that landed on main since 0.7.2 so they ship under one version (the OpenAICompatClient, the configurable backend timeout, and several proxy/eval fixes).

⚠️ Breaking change

The proxy --mode {native,prompt} flag is renamed to --backend-capability {native,prompt} (and ProxyServer(mode=…) → ProxyServer(backend_capability=…)).

--mode collided with the proxy's existing managed / external deployment mode. The new name states what it actually controls — the backend's tool-calling protocol — and reflects that the choice is declared once at startup and frozen, never probed or switched mid-request.
No deprecation alias. --mode was only introduced in 0.7.1, so it is removed cleanly rather than carried as a confusingly-named compatibility shim.
Migration: --mode native → drop it (native is the default) or --backend-capability native; --mode prompt → --backend-capability prompt.
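The migration rule above is mechanical enough to script for launch wrappers that still pass the old flag. The following is a minimal sketch, not part of forge: `migrate_proxy_args` is a hypothetical helper name, and it assumes the old flag only ever carried the values `native` or `prompt`.

```python
def migrate_proxy_args(argv: list[str]) -> list[str]:
    """Rewrite pre-0.7.3 `--mode` proxy arguments to `--backend-capability`.

    Hypothetical helper for illustration; forge itself ships no alias.
    """
    out: list[str] = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--mode" and i + 1 < len(argv):
            value = argv[i + 1]          # form: --mode VALUE
            i += 2
        elif arg.startswith("--mode="):
            value = arg.split("=", 1)[1]  # form: --mode=VALUE
            i += 1
        else:
            out.append(arg)               # unrelated argument, keep as-is
            i += 1
            continue
        if value == "prompt":
            out += ["--backend-capability", "prompt"]
        elif value != "native":
            raise ValueError(f"unknown --mode value: {value!r}")
        # `--mode native` is simply dropped: native is the default.
    return out
```

For example, `migrate_proxy_args(["--mode", "native", "--backend-timeout", "120"])` returns `["--backend-timeout", "120"]`, and `migrate_proxy_args(["--mode=prompt"])` returns `["--backend-capability", "prompt"]`.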

A note on the prompt path

Prompt-injection remains fully supported as --backend-capability prompt, and it's the right tool for a backend with no function-calling template. But whether a model stays coherent across multi-turn tool results in prompt mode varies by model, and it tends to degrade on more complex, multi-step agentic interactions. Treat it as a fallback for non-FC backends, not the recommended path — reach for native whenever the backend supports it.

What's included

Added

OpenAICompatClient for arbitrary OpenAI-compatible endpoints. #89 (thanks @lucasgerads).
--backend-timeout proxy option: configurable backend response timeout (default 300s). #91.
--backend-capability {native,prompt} proxy flag — native (default) forwards tools/messages verbatim to a function-calling-capable backend; prompt opts into prompt-injection for non-FC llama.cpp/llamafile backends. Declared once at startup and frozen.
Effective backend timeout logged at proxy startup.
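To make the two new options concrete, here is a hedged `argparse` sketch of how flags with these names, choices, and defaults could be declared. It is not forge's actual parser: `build_parser` and the `proxy-sketch` program name are invented for illustration, and treating the timeout as a float number of seconds is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser only; mirrors the flag names and defaults described above."""
    parser = argparse.ArgumentParser(prog="proxy-sketch")
    parser.add_argument(
        "--backend-capability",
        choices=["native", "prompt"],
        default="native",
        help="tool-calling protocol of the backend, declared once at startup",
    )
    parser.add_argument(
        "--backend-timeout",
        type=float,      # assumption: seconds as a float
        default=300.0,
        help="backend response timeout in seconds",
    )
    return parser
```

With no arguments, `build_parser().parse_args([])` yields `backend_capability == "native"` and `backend_timeout == 300.0`; any value other than `native` or `prompt` is rejected by `argparse` at parse time.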

Changed

BREAKING — --mode → --backend-capability (see above).
Native function calling is now transparent passthrough — the proxy forwards the client's OpenAI tool/message payloads verbatim instead of round-tripping them through forge's internal ToolSpec representation, which dropped schema detail.
vLLM model identity consolidated to a single source of truth (the wire model_path and the registry model key are set together). #75.
The prompt capability is now rejected loudly for ollama / vllm / anthropic backends; previously it was silently ignored for ollama.
stream_options is excluded from proxy passthrough. #94 (thanks @alexandergunnarson).
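Two of the changes above are simple guards, sketched here under stated assumptions. The function names, the backend identifier strings, and the idea that passthrough is a filtered copy of the request body are all illustrative, not taken from forge's code.

```python
from typing import Any

# Assumption: backend identifiers are lowercase strings like these.
NATIVE_ONLY_BACKENDS = frozenset({"ollama", "vllm", "anthropic"})
# Keys that are never forwarded to the backend as passthrough fields.
EXCLUDED_PASSTHROUGH_KEYS = frozenset({"stream_options"})


def check_capability(backend: str, capability: str) -> None:
    """Fail loudly at startup instead of silently ignoring an unsupported choice."""
    if capability not in ("native", "prompt"):
        raise ValueError(f"unknown backend capability: {capability!r}")
    if capability == "prompt" and backend in NATIVE_ONLY_BACKENDS:
        raise ValueError(
            f"backend {backend!r} is native-only; prompt capability is not supported"
        )


def passthrough_fields(body: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the request body without the excluded keys."""
    return {k: v for k, v in body.items() if k not in EXCLUDED_PASSTHROUGH_KEYS}
```

`check_capability("ollama", "prompt")` raises `ValueError`, while `check_capability("llamafile", "prompt")` returns normally.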

Fixed

Consistent malformed-tool-call / unexpected-response handling across the OpenAI-shape clients — malformed model tool args drive a retry (TextResponse) instead of degrading silently or raising inconsistently, and non-streaming responses are guarded so a broken provider envelope fails loud.
Guardrails.record() no longer drops tool args for prerequisite tracking. #72 (thanks @hobostay).
Deprecated asyncio API replaced; proxy server input validation added. #71 (thanks @hobostay).
Proxy input hardening, non-blocking Ollama stop, client shutdown, and loud arg decode (post-review fixes). #86.
Dead code and a fragile variable reference cleaned up in LlamafileClient. #73 (thanks @hobostay).

Removed

Runtime auto function-calling mode in LlamafileClient — the proxy never used it, and its mid-request probe-and-switch is replaced by the declared-and-frozen --backend-capability.

Validation

Unit suite: 1086 passed.
Regression eval against the 0.7.0 baseline (single reference model, llama.cpp server, reforged, N=50 across 8 scenarios spanning the changed surfaces): both native and prompt track the baseline within sampling noise — no behavioral regression from the rewrite or the malformed-response hardening.
The native-first / verbatim-passthrough design is documented in ADR-012; backend setup and the user guide are updated for --backend-capability and the prompt-path caveat.

Make the OpenAI-compatible proxy native-tool-call-only and forward the client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip that dropped schema detail and leaked empty tool names.

- Remove the proxy's --mode surface; the proxy always drives the backend client native. LlamafileClient's prompt-injection machinery is retained for non-proxy WorkflowRunner / direct-client use (it still wins for some models in full-guardrail workflow evals).
- Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native path sends it verbatim. Other clients accept-and-ignore (vLLM also gains the previously-missing passthrough/inbound_anthropic_body kwargs).
- run_inference forwards raw OpenAI messages/tools only on the clean first attempt (use_raw_messages gate); any mutation falls back to fold+serialize.
- respond tool is now opt-in (--inject-respond-tool, default off).
- No instrumentation (proxy_trace/guardrail_stats deliberately not ported).
- Tests: drop removed mode-guard tests; respond tests opt in explicitly; add native-passthrough, detachment, respond-default, and first-attempt-gate coverage.

Docs: ADR-012 revision + BACKEND_SETUP proxy note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The proxy serves tool-call-capable backends natively (verbatim tool/message passthrough). This adds prompt-injection back as an explicit opt-in for non-function-calling backends (llama.cpp / llamafile without a tool template).

- New --backend-capability {native,prompt} (default native), declared once at construction and frozen: no runtime probing or mid-request mode mutation.
- prompt capability reuses LlamafileClient's existing prompt path (build the tool prompt, downgrade tool/assistant-tool_call history to text, parse the JSON tool call back into native tool_calls). No client changes.
- Handler suppresses verbatim raw passthrough when in prompt mode so inference folds normally and the client injects the tool prompt.
- Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol).
- Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
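The two gates described in these commits (raw payloads only on the clean first attempt, and never in prompt mode) reduce to one small decision. The sketch below is an assumption-laden illustration: `choose_payload` and its parameters are hypothetical names, and "folded" stands in for whatever the fold+serialize path produces.

```python
from typing import Any


def choose_payload(
    capability: str,
    attempt: int,
    history_mutated: bool,
    raw: dict[str, Any],
    folded: dict[str, Any],
) -> dict[str, Any]:
    """Pick the request payload to send to the backend.

    The client's raw OpenAI messages/tools are forwarded verbatim only when the
    backend is native, this is the first attempt (attempt == 0), and nothing has
    mutated the conversation. Every other case uses the folded payload.
    """
    use_raw_messages = (
        capability == "native" and attempt == 0 and not history_mutated
    )
    return raw if use_raw_messages else folded
```

Under this rule a retry, a mutated history, or `--backend-capability prompt` each fall back to the folded payload, so the prompt path can still inject its tool prompt.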

The configurable backend_timeout (#91) was validated, stored, and threaded into every client request, but never surfaced at launch. Extend the "Proxy ready" line to report the effective value so the operative timeout is visible/diagnosable from the startup log.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

VLLMClient kept two identity fields with distinct roles: model_path (the verbatim wire "model" field, which vLLM validates against its --served-model-name) and model (the derived registry-lookup key). The proxy's external-mode served-name adoption set both by hand (model_path = served; model = served), duplicating the derivation logic and storing the full served name where the constructor's rule stores the stem.

Extract the path->key derivation into _derive_model_field and wrap both assignments in _set_model_identity, then call it from __init__ and from the proxy. External adoption now upholds the same (model_path, model) invariant as construction: an HF-repo-id served name reaches the wire verbatim while the registry key is the derived stem.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
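The invariant can be illustrated with a small stand-in class. This is a sketch, not the real VLLMClient: the class name is invented, and the derivation rule (take the final path segment as the "stem") is an assumption, since the commit message does not spell out the exact rule.

```python
from pathlib import PurePosixPath


def derive_model_field(model_path: str) -> str:
    """Derive the registry key from the wire model path.

    Assumption: the 'stem' is the final path segment, so an HF repo id such as
    'example-org/example-model' maps to 'example-model'.
    """
    return PurePosixPath(model_path).name


class VLLMClientSketch:
    """Stand-in showing a single code path that sets both identity fields."""

    def __init__(self, model_path: str) -> None:
        self._set_model_identity(model_path)

    def _set_model_identity(self, model_path: str) -> None:
        # Wire field: sent verbatim as "model" in requests.
        self.model_path = model_path
        # Registry key: always derived, never assigned by hand.
        self.model = derive_model_field(model_path)
```

Because construction and later served-name adoption both go through `_set_model_identity`, the pair `(model_path, model)` cannot drift apart.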

Audited malformed-tool-call and unexpected-payload handling across the OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and LlamafileClient. Standardize on one principle, applied uniformly:

- Malformed argument JSON (a model mistake) -> TextResponse, routing the raw output back through the inference loop so the rescue/retry path can recover.
- A broken provider envelope (missing choices/message) or unexpected args type (a contract violation, not the model's fault) -> BackendError: fail loud and consistent, never a stray KeyError/IndexError.

Changes:

- vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on malformed args, unlike llamafile's retry-driving TextResponse) with a _parse_tool_calls mirroring the reference. Route both send() and send_stream() through it so streaming and non-streaming agree: a fully accumulated but unparseable arguments string finalizes as a TextResponse, not an exception.
- llamafile / openai_compat: guard the bare data["choices"][0]["message"] subscripts -> BackendError on a broken envelope (matching what vLLM already did for choices). llamafile also hardens function/name access.
- ollama: defensive .get on function/name (both paths); document that Ollama emits dict args by contract, so no json.loads is needed there.

Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing-function/reasoning) + streaming malformed-fragment parity; envelope-guard tests for llamafile and openai_compat. 1092 unit tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
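The principle in this commit (model mistakes drive a retry, contract violations fail loud) can be sketched as follows. The types here are simplified stand-ins that only borrow the names TextResponse and BackendError; the function names and two details are assumptions: a missing function name is treated as a contract violation, and arguments that parse to a non-object are treated as a model mistake.

```python
import json
from dataclasses import dataclass
from typing import Any, Union


class BackendError(RuntimeError):
    """Stand-in: the provider broke the response contract."""


@dataclass
class TextResponse:
    """Stand-in: raw model output routed back into the retry loop."""
    text: str


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


def extract_message(data: Any) -> dict[str, Any]:
    """Guarded equivalent of data["choices"][0]["message"]."""
    choices = data.get("choices") if isinstance(data, dict) else None
    if not isinstance(choices, list) or not choices:
        raise BackendError("response envelope has no choices")
    first = choices[0]
    message = first.get("message") if isinstance(first, dict) else None
    if not isinstance(message, dict):
        raise BackendError("response envelope has no message")
    return message


def parse_tool_calls(message: dict[str, Any]) -> Union[list[ToolCall], TextResponse]:
    calls: list[ToolCall] = []
    for raw in message.get("tool_calls") or []:
        function = raw.get("function") if isinstance(raw, dict) else None
        name = function.get("name") if isinstance(function, dict) else None
        if not isinstance(name, str) or not name:
            # Assumption: a nameless call is a contract violation.
            raise BackendError("tool call is missing function.name")
        arguments = function.get("arguments")
        if arguments is None or arguments == "":
            args: Any = {}
        elif isinstance(arguments, dict):
            args = arguments
        elif isinstance(arguments, str):
            try:
                args = json.loads(arguments)
            except json.JSONDecodeError:
                return TextResponse(text=arguments)  # model mistake: retry
            if not isinstance(args, dict):
                return TextResponse(text=arguments)  # assumption: also a retry
        else:
            raise BackendError(f"unexpected arguments type: {type(arguments).__name__}")
        calls.append(ToolCall(name=name, args=args))
    return calls
```

Routing both streaming and non-streaming paths through one function like `parse_tool_calls` is what keeps their behavior identical on an unparseable arguments string.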

…ility

Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try native, fall back to prompt on HTTP error, recording resolved_mode). This was the last vestige of the mid-request capability mutation the proxy rewrite excised everywhere else; the proxy already declares its capability up front via --backend-capability. With auto gone, resolved_mode is always == self.mode, so the whole tri-state indirection collapses to a direct dispatch on self.mode.

The default is now native. This is both hardening and a deliberate posture shift: local-model function-calling support has matured into the more reliable path, so native-first is the right default. Prompt-injection is preserved as an explicit opt-in (mode="prompt") and is the theoretically correct fallback for non-FC backends, but it is honestly flagged, in the docstring and docs, that models tend to struggle to drive the prompt-injected protocol reliably on more complex, multi-step interactions. Capability is declared-and-frozen: an invalid mode (including the old "auto") now raises ValueError rather than silently degrading.

- llamafile.py: validate mode in __init__; default native; delete _resolve_and_send and the resolved_mode attribute/branches; dispatch send / send_stream on self.mode; rewrite the class docstring (native-first rationale + prompt caveat).
- eval_runner.py: --llamafile-mode choices [native, prompt], default native.
- docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat.
- tests: drop the auto-mode suite; assert native default + ValueError on "auto".

Consumers verified unaffected: the proxy (both sites), batch_eval, and the integration script all pass mode explicitly. 1086 unit tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
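The declared-and-frozen shape this commit describes looks roughly like the sketch below. It is a stand-in, not the real LlamafileClient: the class name is invented and the two `_send_*` methods are placeholders that only report which path was taken.

```python
from typing import Any, Optional


class LlamafileClientSketch:
    """Stand-in client: mode is validated once and dispatched on directly."""

    VALID_MODES = ("native", "prompt")

    def __init__(self, mode: str = "native") -> None:
        if mode not in self.VALID_MODES:
            # The removed "auto" value lands here too, instead of degrading silently.
            raise ValueError(
                f"invalid mode {mode!r}; expected one of {self.VALID_MODES}"
            )
        self.mode = mode

    def send(self, messages: list[dict[str, Any]],
             tools: Optional[list[dict[str, Any]]] = None) -> str:
        # Direct dispatch: no probing, no resolved_mode indirection.
        if self.mode == "native":
            return self._send_native(messages, tools)
        return self._send_prompt(messages, tools)

    def _send_native(self, messages: Any, tools: Any) -> str:
        return "native"  # placeholder for the real native request

    def _send_prompt(self, messages: Any, tools: Any) -> str:
        return "prompt"  # placeholder for the real prompt-injection request
```

`LlamafileClientSketch()` defaults to native, and `LlamafileClientSketch(mode="auto")` raises `ValueError` at construction rather than at the first request.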

…udget

batch_eval brought servers up with a bare server.start() (no ctx_override) and resolved the budget separately via server.resolve_budget(), so --budget-mode manual --num-ctx N was a no-op for llama-server: the server booted at the model's full native context (no -c), and resolve_budget(MANUAL) just read that full value back from /props. (Ollama was unaffected: its context is per-request via set_num_ctx.)

Route both the initial bring-up and _recover_server through the prod start_with_budget() path, which threads manual_tokens -> ctx_override -> -c at launch and returns the resolved budget. _recover_server gains budget_mode/manual_tokens params so a restarted server reuses the same budget. Drops the now-redundant standalone resolve_budget() on the happy path (still used on the recovery branch to read back the resolved value). This also fixes FORGE_FAST mode, which the old bare-start() path never supported.

Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000): server boots with -c, rows record budget_tokens=20224 (server-clamped) instead of the previous 262144 full-native read-back.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
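The threading this fix restores (manual_tokens -> ctx_override -> -c) can be shown with a small command-building sketch. The function names are hypothetical and the sketch only builds the argument list; the real start_with_budget() also launches the server and returns the resolved budget. The `-m` and `-c` flags are llama-server's model and context-size options.

```python
from typing import Optional


def build_llama_server_args(model_path: str, ctx_override: Optional[int]) -> list[str]:
    """Build a llama-server command line, adding -c only when a context is forced."""
    args = ["llama-server", "-m", model_path]
    if ctx_override is not None:
        args += ["-c", str(ctx_override)]
    return args


def launch_args_for_budget(
    model_path: str, budget_mode: str, manual_tokens: Optional[int]
) -> list[str]:
    """Thread a manual token budget through to the launch command.

    Assumption: budget_mode is a plain string and only "manual" forces a context.
    """
    if budget_mode == "manual":
        if manual_tokens is None:
            raise ValueError("manual budget mode requires manual_tokens")
        return build_llama_server_args(model_path, ctx_override=manual_tokens)
    return build_llama_server_args(model_path, ctx_override=None)
```

With the old bare start, the equivalent of `ctx_override=None` was always used, which is why the server came up at the model's full native context regardless of --num-ctx.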

Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient #89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94).

Headline: native-first proxy. BREAKING: the proxy --mode flag is renamed to --backend-capability (no alias; --mode was only introduced in 0.7.1). Native is the default and only auto-selected protocol; prompt-injection is an explicit opt-in for non-FC llama.cpp/llamafile backends.

USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode tends to degrade on more complex multi-step interactions. BACKEND_SETUP, EVAL_GUIDE, and ADR-012 were already updated earlier on this branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

antoinezambelli and others added 9 commits May 31, 2026 17:15

Merge remote-tracking branch 'origin/main' into az/proxy-native-rewrite

17186bc

antoinezambelli merged commit 51c0fe6 into main Jun 1, 2026
2 checks passed

antoinezambelli deleted the az/proxy-native-rewrite branch June 1, 2026 06:06

