Add proxy backend timeout option by antoinezambelli · Pull Request #91 · antoinezambelli/forge

antoinezambelli · 2026-05-29T16:41:36Z

Summary

expose a configurable backend response timeout on ProxyServer and forge-proxy
thread the timeout through proxy-created llama.cpp, vLLM, Ollama, and Anthropic clients
reject non-positive/non-finite timeout values and cover the wiring in unit tests

Closes #90 .

Tests

.\.venv\Scripts\python.exe -m pytest -q
.\.venv\Scripts\python.exe -m pytest tests\unit\test_proxy_proxy.py tests\unit\test_proxy_path1.py -q

Expose a configurable backend response timeout for forge-proxy and ProxyServer, defaulting to the existing 300 seconds. Thread the value through proxy-created LlamafileClient, VLLMClient, OllamaClient, and AnthropicClient instances. Reject non-positive and non-finite timeout values at construction, and cover the external, managed, and Anthropic proxy wiring in unit tests.

* Proxy: native-only + transparent OpenAI passthrough Make the OpenAI-compatible proxy native-tool-call-only and forward the client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip that dropped schema detail and leaked empty tool names. - Remove the proxy's --mode surface; the proxy always drives the backend client native. LlamafileClient's prompt-injection machinery is retained for non-proxy WorkflowRunner / direct-client use (it still wins for some models in full-guardrail workflow evals). - Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native path sends it verbatim. Other clients accept-and-ignore (vLLM also gains the previously-missing passthrough/inbound_anthropic_body kwargs). - run_inference forwards raw OpenAI messages/tools only on the clean first attempt (use_raw_messages gate); any mutation falls back to fold+serialize. - respond tool is now opt-in (--inject-respond-tool, default off). - No instrumentation (proxy_trace/guardrail_stats deliberately not ported). - Tests: drop removed mode-guard tests; respond tests opt in explicitly; add native-passthrough, detachment, respond-default, and first-attempt-gate coverage. Docs: ADR-012 revision + BACKEND_SETUP proxy note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Add prompt-injection as opt-in proxy capability (--backend-capability) The proxy serves tool-call-capable backends natively (verbatim tool/message passthrough). This adds prompt-injection back as an explicit opt-in for non-function-calling backends (llama.cpp / llamafile without a tool template). - New --backend-capability {native,prompt} (default native), declared once at construction and frozen — no runtime probing or mid-request mode mutation. - prompt capability reuses LlamafileClient's existing prompt path (build the tool prompt, downgrade tool/assistant-tool_call history to text, parse the JSON tool call back into native tool_calls). No client changes. - Handler suppresses verbatim raw passthrough when in prompt mode so inference folds normally and the client injects the tool prompt. - Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol). - Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Proxy: log effective backend_timeout at startup The configurable backend_timeout (#91) was validated, stored, and threaded into every client request, but never surfaced at launch. Extend the "Proxy ready" line to report the effective value so the operative timeout is visible/diagnosable from the startup log. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * vLLM: single source of truth for model identity (#75) VLLMClient kept two identity fields with distinct roles — model_path (the verbatim wire "model" field, which vLLM validates against its --served-model-name) and model (the derived registry-lookup key). The proxy's external-mode served-name adoption set both by hand (model_path = served; model = served), duplicating the derivation logic and storing the full served name where the constructor's rule stores the stem. Extract the path->key derivation into _derive_model_field and wrap both assignments in _set_model_identity, then call it from __init__ and from the proxy. External adoption now upholds the same (model_path, model) invariant as construction: an HF-repo-id served name reaches the wire verbatim while the registry key is the derived stem. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Clients: consistent malformed-tool-call + response-shape handling Audited malformed-tool-call and unexpected-payload handling across the OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and LlamafileClient. Standardize on one principle, applied uniformly: - Malformed argument JSON (a model mistake) -> TextResponse, routing the raw output back through the inference loop so the rescue/retry path can recover. - A broken provider envelope (missing choices/message) or unexpected args type (a contract violation, not the model's fault) -> BackendError: fail loud and consistent, never a stray KeyError/IndexError. Changes: - vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on malformed args, unlike llamafile's retry-driving TextResponse) with a _parse_tool_calls mirroring the reference. Route both send() and send_stream() through it so streaming and non-streaming agree: a fully accumulated but unparseable arguments string finalizes as a TextResponse, not an exception. - llamafile / openai_compat: guard the bare data["choices"][0]["message"] subscripts -> BackendError on a broken envelope (matching what vLLM already did for choices). llamafile also hardens function/name access. - ollama: defensive .get on function/name (both paths); document that Ollama emits dict args by contract, so no json.loads is needed there. Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing- function/reasoning) + streaming malformed-fragment parity; envelope-guard tests for llamafile and openai_compat. 1092 unit tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * LlamafileClient: remove runtime auto mode; native-first, frozen capability Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try native, fall back to prompt on HTTP error, recording resolved_mode). This was the last vestige of the mid-request capability mutation the proxy rewrite excised everywhere else; the proxy already declares its capability up front via --backend-capability. With auto gone, resolved_mode is always == self.mode, so the whole tri-state indirection collapses to a direct dispatch on self.mode. The default is now native. This is both hardening and a deliberate posture shift: local-model function-calling support has matured into the more reliable path, so native-first is the right default. Prompt-injection is preserved as an explicit opt-in (mode="prompt") and is the theoretically correct fallback for non-FC backends — but it is honestly flagged, in the docstring and docs, that models tend to struggle to drive the prompt-injected protocol reliably on more complex, multi-step interactions. Capability is declared-and-frozen: an invalid mode (including the old "auto") now raises ValueError rather than silently degrading. - llamafile.py: validate mode in __init__; default native; delete _resolve_and_send and the resolved_mode attribute/branches; dispatch send / send_stream on self.mode; rewrite the class docstring (native-first rationale + prompt caveat). - eval_runner.py: --llamafile-mode choices [native, prompt], default native. - docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat. - tests: drop the auto-mode suite; assert native default + ValueError on "auto". Consumers verified unaffected: the proxy (both sites), batch_eval, and the integration script all pass mode explicitly. 1086 unit tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(eval): honor manual context budget in batch_eval via start_with_budget batch_eval brought servers up with a bare server.start() (no ctx_override) and resolved the budget separately via server.resolve_budget(), so --budget-mode manual --num-ctx N was a no-op for llama-server: the server booted at the model's full native context (no -c), and resolve_budget(MANUAL) just read that full value back from /props. (Ollama was unaffected — its context is per-request via set_num_ctx.) Route both the initial bring-up and _recover_server through the prod start_with_budget() path, which threads manual_tokens -> ctx_override -> -c at launch and returns the resolved budget. _recover_server gains budget_mode/manual_tokens params so a restarted server reuses the same budget. Drops the now-redundant standalone resolve_budget() on the happy path (still used on the recovery branch to read back the resolved value). This also fixes FORGE_FAST mode, which the old bare-start() path never supported. Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000): server boots with -c, rows record budget_tokens=20224 (server-clamped) instead of the previous 262144 full-native read-back. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * chore(release): 0.7.3 — native-first proxy Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient #89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94). Headline: native-first proxy. BREAKING — the proxy --mode flag is renamed to --backend-capability (no alias; --mode was only introduced in 0.7.1). Native is the default and only auto-selected protocol; prompt-injection is an explicit opt-in for non-FC llama.cpp/llamafile backends. USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode tends to degrade on more complex multi-step interactions. BACKEND_SETUP, EVAL_GUIDE, and ADR-012 were already updated earlier on this branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

antoinezambelli merged commit fa3b913 into main May 29, 2026
2 checks passed

antoinezambelli deleted the az/proxy-timeout branch May 29, 2026 16:43

This was referenced May 31, 2026

Investigate: integration paths with Hermes Agent #51

Open

Native-first proxy (0.7.3) #97

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add proxy backend timeout option#91

Add proxy backend timeout option#91
antoinezambelli merged 1 commit into
mainfrom
az/proxy-timeout

antoinezambelli commented May 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

antoinezambelli commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

antoinezambelli commented May 29, 2026 •

edited

Loading