Add proxy backend timeout option#91
Merged
Merged
Conversation
Expose a configurable backend response timeout for forge-proxy and ProxyServer, defaulting to the existing 300 seconds. Thread the value through proxy-created LlamafileClient, VLLMClient, OllamaClient, and AnthropicClient instances. Reject non-positive and non-finite timeout values at construction, and cover the external, managed, and Anthropic proxy wiring in unit tests.
andreskwan
pushed a commit
to andreskwan/forge
that referenced
this pull request
May 29, 2026
Expose a configurable backend response timeout for forge-proxy and ProxyServer, defaulting to the existing 300 seconds. Thread the value through proxy-created LlamafileClient, VLLMClient, OllamaClient, and AnthropicClient instances. Reject non-positive and non-finite timeout values at construction, and cover the external, managed, and Anthropic proxy wiring in unit tests.
This was referenced May 31, 2026
antoinezambelli
added a commit
that referenced
this pull request
Jun 1, 2026
* Proxy: native-only + transparent OpenAI passthrough
Make the OpenAI-compatible proxy native-tool-call-only and forward the
client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip
that dropped schema detail and leaked empty tool names.
- Remove the proxy's --mode surface; the proxy always drives the backend
client native. LlamafileClient's prompt-injection machinery is retained
for non-proxy WorkflowRunner / direct-client use (it still wins for some
models in full-guardrail workflow evals).
- Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native
path sends it verbatim. Other clients accept-and-ignore (vLLM also gains
the previously-missing passthrough/inbound_anthropic_body kwargs).
- run_inference forwards raw OpenAI messages/tools only on the clean first
attempt (use_raw_messages gate); any mutation falls back to fold+serialize.
- respond tool is now opt-in (--inject-respond-tool, default off).
- No instrumentation (proxy_trace/guardrail_stats deliberately not ported).
- Tests: drop removed mode-guard tests; respond tests opt in explicitly; add
native-passthrough, detachment, respond-default, and first-attempt-gate
coverage. Docs: ADR-012 revision + BACKEND_SETUP proxy note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Add prompt-injection as opt-in proxy capability (--backend-capability)
The proxy serves tool-call-capable backends natively (verbatim tool/message
passthrough). This adds prompt-injection back as an explicit opt-in for
non-function-calling backends (llama.cpp / llamafile without a tool template).
- New --backend-capability {native,prompt} (default native), declared once at
construction and frozen — no runtime probing or mid-request mode mutation.
- prompt capability reuses LlamafileClient's existing prompt path (build the
tool prompt, downgrade tool/assistant-tool_call history to text, parse the
JSON tool call back into native tool_calls). No client changes.
- Handler suppresses verbatim raw passthrough when in prompt mode so inference
folds normally and the client injects the tool prompt.
- Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol).
- Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Proxy: log effective backend_timeout at startup
The configurable backend_timeout (#91) was validated, stored, and threaded
into every client request, but never surfaced at launch. Extend the
"Proxy ready" line to report the effective value so the operative timeout
is visible/diagnosable from the startup log.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* vLLM: single source of truth for model identity (#75)
VLLMClient kept two identity fields with distinct roles — model_path (the
verbatim wire "model" field, which vLLM validates against its
--served-model-name) and model (the derived registry-lookup key). The proxy's
external-mode served-name adoption set both by hand (model_path = served;
model = served), duplicating the derivation logic and storing the full served
name where the constructor's rule stores the stem.
Extract the path->key derivation into _derive_model_field and wrap both
assignments in _set_model_identity, then call it from __init__ and from the
proxy. External adoption now upholds the same (model_path, model) invariant as
construction: an HF-repo-id served name reaches the wire verbatim while the
registry key is the derived stem.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Clients: consistent malformed-tool-call + response-shape handling
Audited malformed-tool-call and unexpected-payload handling across the
OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and
LlamafileClient. Standardize on one principle, applied uniformly:
- Malformed argument JSON (a model mistake) -> TextResponse, routing the raw
output back through the inference loop so the rescue/retry path can recover.
- A broken provider envelope (missing choices/message) or unexpected args type
(a contract violation, not the model's fault) -> BackendError: fail loud and
consistent, never a stray KeyError/IndexError.
Changes:
- vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on
malformed args, unlike llamafile's retry-driving TextResponse) with a
_parse_tool_calls mirroring the reference. Route both send() and send_stream()
through it so streaming and non-streaming agree: a fully accumulated but
unparseable arguments string finalizes as a TextResponse, not an exception.
- llamafile / openai_compat: guard the bare data["choices"][0]["message"]
subscripts -> BackendError on a broken envelope (matching what vLLM already
did for choices). llamafile also hardens function/name access.
- ollama: defensive .get on function/name (both paths); document that Ollama
emits dict args by contract, so no json.loads is needed there.
Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing-
function/reasoning) + streaming malformed-fragment parity; envelope-guard tests
for llamafile and openai_compat. 1092 unit tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* LlamafileClient: remove runtime auto mode; native-first, frozen capability
Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try
native, fall back to prompt on HTTP error, recording resolved_mode). This was
the last vestige of the mid-request capability mutation the proxy rewrite
excised everywhere else; the proxy already declares its capability up front via
--backend-capability. With auto gone, resolved_mode is always == self.mode, so
the whole tri-state indirection collapses to a direct dispatch on self.mode.
The default is now native. This is both hardening and a deliberate posture
shift: local-model function-calling support has matured into the more reliable
path, so native-first is the right default. Prompt-injection is preserved as an
explicit opt-in (mode="prompt") and is the theoretically correct fallback for
non-FC backends — but it is honestly flagged, in the docstring and docs, that
models tend to struggle to drive the prompt-injected protocol reliably on more
complex, multi-step interactions. Capability is declared-and-frozen: an invalid
mode (including the old "auto") now raises ValueError rather than silently
degrading.
- llamafile.py: validate mode in __init__; default native; delete
_resolve_and_send and the resolved_mode attribute/branches; dispatch send /
send_stream on self.mode; rewrite the class docstring (native-first rationale
+ prompt caveat).
- eval_runner.py: --llamafile-mode choices [native, prompt], default native.
- docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat.
- tests: drop the auto-mode suite; assert native default + ValueError on "auto".
Consumers verified unaffected: the proxy (both sites), batch_eval, and the
integration script all pass mode explicitly. 1086 unit tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eval): honor manual context budget in batch_eval via start_with_budget
batch_eval brought servers up with a bare server.start() (no ctx_override)
and resolved the budget separately via server.resolve_budget(), so
--budget-mode manual --num-ctx N was a no-op for llama-server: the server
booted at the model's full native context (no -c), and resolve_budget(MANUAL)
just read that full value back from /props. (Ollama was unaffected — its
context is per-request via set_num_ctx.)
Route both the initial bring-up and _recover_server through the prod
start_with_budget() path, which threads manual_tokens -> ctx_override -> -c
at launch and returns the resolved budget. _recover_server gains
budget_mode/manual_tokens params so a restarted server reuses the same
budget. Drops the now-redundant standalone resolve_budget() on the happy
path (still used on the recovery branch to read back the resolved value).
This also fixes FORGE_FAST mode, which the old bare-start() path never
supported.
Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000):
server boots with -c, rows record budget_tokens=20224 (server-clamped)
instead of the previous 262144 full-native read-back.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(release): 0.7.3 — native-first proxy
Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this
branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient
#89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94).
Headline: native-first proxy. BREAKING — the proxy --mode flag is renamed
to --backend-capability (no alias; --mode was only introduced in 0.7.1).
Native is the default and only auto-selected protocol; prompt-injection is
an explicit opt-in for non-FC llama.cpp/llamafile backends.
USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode
tends to degrade on more complex multi-step interactions. BACKEND_SETUP,
EVAL_GUIDE, and ADR-012 were already updated earlier on this branch.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #90 .
Tests