Native-first proxy (0.7.3) #97
Merged
Make the OpenAI-compatible proxy native-tool-call-only and forward the client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip that dropped schema detail and leaked empty tool names.
- Remove the proxy's --mode surface; the proxy always drives the backend client native. LlamafileClient's prompt-injection machinery is retained for non-proxy WorkflowRunner / direct-client use (it still wins for some models in full-guardrail workflow evals).
- Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native path sends it verbatim. Other clients accept-and-ignore (vLLM also gains the previously-missing passthrough/inbound_anthropic_body kwargs).
- run_inference forwards raw OpenAI messages/tools only on the clean first attempt (use_raw_messages gate); any mutation falls back to fold+serialize.
- respond tool is now opt-in (--inject-respond-tool, default off).
- No instrumentation (proxy_trace/guardrail_stats deliberately not ported).
- Tests: drop removed mode-guard tests; respond tests opt in explicitly; add native-passthrough, detachment, respond-default, and first-attempt-gate coverage.
- Docs: ADR-012 revision + BACKEND_SETUP proxy note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
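The first-attempt gate described in this commit can be pictured roughly as follows. This is only a sketch under stated assumptions: `InferenceRequest`, its `attempt`/`mutated` fields, `build_payload`, and the stand-in `fold_and_serialize` are hypothetical names for illustration; the real `run_inference` signature is not shown in this PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class InferenceRequest:
    """Hypothetical request container (not the project's real type)."""
    raw_messages: List[Dict[str, Any]]
    raw_tools: List[Dict[str, Any]]
    attempt: int = 0        # 0 means the first attempt
    mutated: bool = False   # assumed flag: set once a retry step edits the conversation


def use_raw_messages(req: InferenceRequest) -> bool:
    # Raw passthrough is allowed only on a clean first attempt.
    return req.attempt == 0 and not req.mutated


def fold_and_serialize(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Stand-in for the real fold+serialize step: keeps only role and content.
    return [{"role": m.get("role"), "content": m.get("content") or ""} for m in messages]


def build_payload(req: InferenceRequest) -> Dict[str, Any]:
    if use_raw_messages(req):
        # Verbatim: the very same objects the client sent.
        return {"messages": req.raw_messages, "tools": req.raw_tools}
    # Any retry or mutation falls back to the folded representation.
    return {"messages": fold_and_serialize(req.raw_messages), "tools": req.raw_tools}
```

On the first attempt the payload carries the client's message list untouched; on any later attempt the folded copy is sent instead.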
The proxy serves tool-call-capable backends natively (verbatim tool/message
passthrough). This adds prompt-injection back as an explicit opt-in for
non-function-calling backends (llama.cpp / llamafile without a tool template).
- New --backend-capability {native,prompt} (default native), declared once at
construction and frozen — no runtime probing or mid-request mode mutation.
- prompt capability reuses LlamafileClient's existing prompt path (build the
tool prompt, downgrade tool/assistant-tool_call history to text, parse the
JSON tool call back into native tool_calls). No client changes.
- Handler suppresses verbatim raw passthrough when in prompt mode so inference
folds normally and the client injects the tool prompt.
- prompt capability is rejected for backends that are native-only (vLLM, Ollama, anthropic protocol).
- Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
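A minimal sketch of "declared once at construction and frozen", assuming a hypothetical `ProxyConfig` dataclass and a hypothetical `--backend` flag; only `--backend-capability {native,prompt}` and the native-only backend list come from the commit message.

```python
import argparse
from dataclasses import dataclass
from typing import List

# Backends named as native-only in the commit message; the string keys are assumptions.
NATIVE_ONLY_BACKENDS = {"vllm", "ollama", "anthropic"}


@dataclass(frozen=True)  # frozen: the capability cannot be reassigned after construction
class ProxyConfig:
    backend: str
    backend_capability: str = "native"

    def __post_init__(self) -> None:
        if self.backend_capability not in ("native", "prompt"):
            raise ValueError(f"unknown backend capability: {self.backend_capability!r}")
        if self.backend_capability == "prompt" and self.backend in NATIVE_ONLY_BACKENDS:
            raise ValueError(f"prompt capability is not supported for backend {self.backend!r}")


def parse_args(argv: List[str]) -> ProxyConfig:
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", default="llamafile")  # hypothetical flag
    parser.add_argument("--backend-capability", choices=["native", "prompt"], default="native")
    ns = parser.parse_args(argv)
    return ProxyConfig(backend=ns.backend, backend_capability=ns.backend_capability)
```

Validating in the constructor means a bad combination fails at startup rather than on the first request.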
The configurable backend_timeout (#91) was validated, stored, and threaded into every client request, but never surfaced at launch. Extend the "Proxy ready" line to report the effective value so the operative timeout is visible/diagnosable from the startup log.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
VLLMClient kept two identity fields with distinct roles: model_path (the verbatim wire "model" field, which vLLM validates against its --served-model-name) and model (the derived registry-lookup key). The proxy's external-mode served-name adoption set both by hand (model_path = served; model = served), duplicating the derivation logic and storing the full served name where the constructor's rule stores the stem.
Extract the path->key derivation into _derive_model_field and wrap both assignments in _set_model_identity, then call it from __init__ and from the proxy. External adoption now upholds the same (model_path, model) invariant as construction: an HF-repo-id served name reaches the wire verbatim while the registry key is the derived stem.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Audited malformed-tool-call and unexpected-payload handling across the OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and LlamafileClient. Standardize on one principle, applied uniformly:
- Malformed argument JSON (a model mistake) -> TextResponse, routing the raw output back through the inference loop so the rescue/retry path can recover.
- A broken provider envelope (missing choices/message) or unexpected args type (a contract violation, not the model's fault) -> BackendError: fail loud and consistent, never a stray KeyError/IndexError.
Changes:
- vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on malformed args, unlike llamafile's retry-driving TextResponse) with a _parse_tool_calls mirroring the reference. Route both send() and send_stream() through it so streaming and non-streaming agree: a fully accumulated but unparseable arguments string finalizes as a TextResponse, not an exception.
- llamafile / openai_compat: guard the bare data["choices"][0]["message"] subscripts -> BackendError on a broken envelope (matching what vLLM already did for choices). llamafile also hardens function/name access.
- ollama: defensive .get on function/name (both paths); document that Ollama emits dict args by contract, so no json.loads is needed there.
Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing-function/reasoning) + streaming malformed-fragment parity; envelope-guard tests for llamafile and openai_compat.
1092 unit tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
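The principle above might look roughly like this in code. It is a sketch, not the project's implementation: `BackendError`, `TextResponse`, and `ToolCall` are redefined locally as stand-ins, and treating a missing function/name or non-object JSON arguments the way shown here is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Union


class BackendError(Exception):
    """Stand-in: raised for provider contract violations."""


@dataclass
class TextResponse:
    text: str


@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]


def extract_message(data: Any) -> Dict[str, Any]:
    # Guarded replacement for a bare data["choices"][0]["message"] subscript.
    choices = data.get("choices") if isinstance(data, dict) else None
    if not isinstance(choices, list) or not choices:
        raise BackendError("response envelope has no choices")
    first = choices[0]
    message = first.get("message") if isinstance(first, dict) else None
    if not isinstance(message, dict):
        raise BackendError("response choice has no message")
    return message


def parse_tool_calls(message: Dict[str, Any]) -> Union[List[ToolCall], TextResponse]:
    calls: List[ToolCall] = []
    for tc in message.get("tool_calls") or []:
        fn = tc.get("function") if isinstance(tc, dict) else None
        if not isinstance(fn, dict) or not fn.get("name"):
            raise BackendError("tool call is missing function/name")  # assumed handling
        args = fn.get("arguments")
        if args is None or args == "":
            parsed: Any = {}
        elif isinstance(args, dict):
            parsed = args
        elif isinstance(args, str):
            try:
                parsed = json.loads(args)
            except json.JSONDecodeError:
                # Model mistake: hand the raw text back so a retry path can recover.
                return TextResponse(text=args)
            if not isinstance(parsed, dict):
                return TextResponse(text=args)  # assumed: non-object JSON is also malformed
        else:
            raise BackendError(f"unexpected arguments type: {type(args).__name__}")
        calls.append(ToolCall(name=fn["name"], args=parsed))
    return calls
```

Running both streaming and non-streaming paths through one parser of this shape is what keeps their behaviour on malformed arguments identical.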
…ility
Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try native, fall back to prompt on HTTP error, recording resolved_mode). This was the last vestige of the mid-request capability mutation the proxy rewrite excised everywhere else; the proxy already declares its capability up front via --backend-capability. With auto gone, resolved_mode is always == self.mode, so the whole tri-state indirection collapses to a direct dispatch on self.mode.
The default is now native. This is both hardening and a deliberate posture shift: local-model function-calling support has matured into the more reliable path, so native-first is the right default. Prompt-injection is preserved as an explicit opt-in (mode="prompt") and is the theoretically correct fallback for non-FC backends, but it is honestly flagged, in the docstring and docs, that models tend to struggle to drive the prompt-injected protocol reliably on more complex, multi-step interactions.
Capability is declared-and-frozen: an invalid mode (including the old "auto") now raises ValueError rather than silently degrading.
- llamafile.py: validate mode in __init__; default native; delete _resolve_and_send and the resolved_mode attribute/branches; dispatch send / send_stream on self.mode; rewrite the class docstring (native-first rationale + prompt caveat).
- eval_runner.py: --llamafile-mode choices [native, prompt], default native.
- docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat.
- tests: drop the auto-mode suite; assert native default + ValueError on "auto".
Consumers verified unaffected: the proxy (both sites), batch_eval, and the integration script all pass mode explicitly.
1086 unit tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
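A rough illustration of the declared-and-frozen dispatch this commit describes. `LlamafileClientSketch` and its `_send_native` / `_send_prompt` stubs are hypothetical; only the mode values, the native default, and the ValueError on "auto" are taken from the commit message.

```python
from typing import Any, Dict, List


class LlamafileClientSketch:
    """Hypothetical stand-in; the real LlamafileClient does far more."""

    VALID_MODES = ("native", "prompt")

    def __init__(self, mode: str = "native") -> None:
        # Validate up front: an invalid mode (including the old "auto") fails loudly.
        if mode not in self.VALID_MODES:
            raise ValueError(f"mode must be one of {self.VALID_MODES}, got {mode!r}")
        self.mode = mode

    def send(self, messages: List[Dict[str, Any]], tools: List[Dict[str, Any]]) -> str:
        # Direct dispatch on self.mode; there is no probing or fallback between modes.
        if self.mode == "native":
            return self._send_native(messages, tools)
        return self._send_prompt(messages, tools)

    def _send_native(self, messages: List[Dict[str, Any]], tools: List[Dict[str, Any]]) -> str:
        return "native"  # stub: the real path would call the backend with tools attached

    def _send_prompt(self, messages: List[Dict[str, Any]], tools: List[Dict[str, Any]]) -> str:
        return "prompt"  # stub: the real path would inject a tool prompt and parse JSON back
```

With only two valid values checked in the constructor, `send` needs no separate "resolved" state.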
…udget
batch_eval brought servers up with a bare server.start() (no ctx_override) and resolved the budget separately via server.resolve_budget(), so --budget-mode manual --num-ctx N was a no-op for llama-server: the server booted at the model's full native context (no -c), and resolve_budget(MANUAL) just read that full value back from /props. (Ollama was unaffected: its context is per-request via set_num_ctx.)
Route both the initial bring-up and _recover_server through the prod start_with_budget() path, which threads manual_tokens -> ctx_override -> -c at launch and returns the resolved budget. _recover_server gains budget_mode/manual_tokens params so a restarted server reuses the same budget. Drops the now-redundant standalone resolve_budget() on the happy path (still used on the recovery branch to read back the resolved value). This also fixes FORGE_FAST mode, which the old bare-start() path never supported.
Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000): server boots with -c, rows record budget_tokens=20224 (server-clamped) instead of the previous 262144 full-native read-back.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
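The manual_tokens -> ctx_override -> -c threading can be sketched as follows. The function names and the budget-mode strings are hypothetical; `-m` and `-c` are llama-server's model and context-size flags. The real start_with_budget() also launches the server and reads the resolved budget back, which is omitted here.

```python
from typing import List, Optional


def ctx_override_for(budget_mode: str, manual_tokens: Optional[int]) -> Optional[int]:
    """Map the budget settings to a launch-time context override (assumed mode names)."""
    if budget_mode == "manual":
        if manual_tokens is None:
            raise ValueError("manual budget mode requires a token count")
        return manual_tokens
    return None  # other modes: let the server pick its own context size


def build_server_command(binary: str, model_path: str,
                         ctx_override: Optional[int] = None) -> List[str]:
    cmd = [binary, "-m", model_path]
    if ctx_override is not None:
        # Without -c the server boots at the model's full native context.
        cmd += ["-c", str(ctx_override)]
    return cmd


cmd = build_server_command("llama-server", "model.gguf", ctx_override_for("manual", 20000))
```

Because the override is computed from the same two settings every time, a recovery restart that receives budget_mode and manual_tokens rebuilds the identical command.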
Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient #89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94).
Headline: native-first proxy. BREAKING: the proxy --mode flag is renamed to --backend-capability (no alias; --mode was only introduced in 0.7.1). Native is the default and only auto-selected protocol; prompt-injection is an explicit opt-in for non-FC llama.cpp/llamafile backends.
USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode tends to degrade on more complex multi-step interactions. BACKEND_SETUP, EVAL_GUIDE, and ADR-012 were already updated earlier on this branch.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Native-first proxy (0.7.3)
With native function calling now well-supported across modern local models, this release makes the OpenAI-compatible proxy native-first: by default it forwards the client's OpenAI tools/messages to the backend verbatim and is optimized for that path. Prompt-injection is preserved as an explicit opt-in for llama.cpp / llamafile backends that lack a function-calling template, but it is no longer the default.

This branch also folds in the commits that landed on main since 0.7.2 so they ship under one version (the OpenAICompatClient, the configurable backend timeout, and several proxy/eval fixes).

The proxy --mode {native,prompt} flag is renamed to --backend-capability {native,prompt} (and ProxyServer(mode=…) → ProxyServer(backend_capability=…)).

--mode collided with the proxy's existing managed / external deployment mode. The new name states what it actually controls (the backend's tool-calling protocol) and reflects that the choice is declared once at startup and frozen, never probed or switched mid-request. --mode was only introduced in 0.7.1, so it is removed cleanly rather than carried as a confusingly-named compatibility shim.

- --mode native → drop it (native is the default) or --backend-capability native
- --mode prompt → --backend-capability prompt

A note on the prompt path
Prompt-injection remains fully supported as --backend-capability prompt, and it's the right tool for a backend with no function-calling template. But whether a model stays coherent across multi-turn tool results in prompt mode varies by model, and it tends to degrade on more complex, multi-step agentic interactions. Treat it as a fallback for non-FC backends, not the recommended path; reach for native whenever the backend supports it.

What's included
Added
- OpenAICompatClient for arbitrary OpenAI-compatible endpoints. Add OpenAICompatClient for OpenAI-compatible endpoints #89 (thanks @lucasgerads).
- --backend-timeout proxy option: configurable backend response timeout (default 300s). Add proxy backend timeout option #91.
- --backend-capability {native,prompt} proxy flag: native (default) forwards tools/messages verbatim to a function-calling-capable backend; prompt opts into prompt-injection for non-FC llama.cpp/llamafile backends. Declared once at startup and frozen.

Changed
- --mode → --backend-capability (see above).
- … ToolSpec representation, which dropped schema detail.
- … model_path and the registry model key are set together). Extract VLLMClient model-field derivation into a helper to avoid post-construction mutation #75.
- prompt capability is now rejected loudly for ollama / vllm / anthropic backends; previously it was silently ignored for ollama.
- stream_options is excluded from proxy passthrough. fix: exclude stream_options from proxy passthrough #94 (thanks @alexandergunnarson).

Fixed
- … TextResponse) instead of degrading silently or raising inconsistently, and non-streaming responses are guarded so a broken provider envelope fails loud.
- Guardrails.record() no longer drops tool args for prerequisite tracking. Fix Guardrails.record() dropping tool args for prerequisite tracking #72 (thanks @hobostay).
- … LlamafileClient. Clean up dead code and fragile variable reference in LlamafileClient #73 (thanks @hobostay).

Removed
- auto function-calling mode in LlamafileClient: the proxy never used it, and its mid-request probe-and-switch is replaced by the declared-and-frozen --backend-capability.

Validation
- … --backend-capability and the prompt-path caveat.