Add OpenAICompatClient for OpenAI-compatible endpoints#89
Conversation
antoinezambelli
left a comment
There was a problem hiding this comment.
Thanks for this — the overall shape is spot on and it matches the direction from #88: standalone sibling client, provider-agnostic OpenAI-compatible endpoint, bearer auth, extra_headers, full LLMClient surface, and good focused tests/docs.
One change before merge, around Forge's fail-fast/fail-loud preference: malformed tool-call argument JSON shouldn't become {}. Please route it through the retry-driving TextResponse path, matching LlamafileClient — both in _parse_tool_calls (non-streaming) and in the streaming assembly. The two inline comments point at both spots; they're the same fix in two places.
Note: VLLMClient currently doesn't fully match the desired malformed-args behavior either. That's existing debt on my side, not precedent to copy here — LlamafileClient is the reference.
Provider-agnostic client for any backend exposing the OpenAI /v1/chat/completions shape with optional bearer auth — Cloudflare Workers AI, Fireworks, OpenRouter, OpenAI itself, etc. Native function calling, SSE streaming, full LLMClient protocol surface. Addresses the review feedback in antoinezambelli#88: - Mirrors VLLMClient sampling kwargs (top_k, min_p, repeat_penalty, presence_penalty, chat_template_kwargs) with recommended_sampling defaulting to False (hosted model IDs aren't in forge's registry). - Accepts passthrough and inbound_anthropic_body to satisfy the LLMClient protocol; inbound_anthropic_body is a no-op here. - extra_headers kwarg for provider quirks (e.g. OpenRouter's HTTP-Referer) without a per-provider registry. - aclose() for httpx pool cleanup. - Docstring sets the contributor expectation: file an issue rather than adding per-provider if/else branches. Includes BACKEND_SETUP.md section and CHANGELOG entry.
Addresses Antoine's review comments on PR antoinezambelli#89 (CHANGES_REQUESTED). Per Forge's fail-fast/fail-loud preference, malformed tool-call argument JSON must not be coerced to {} and executed as fn(**{}) — that's a quiet false success. Route it through the retry-driving TextResponse path instead, matching LlamafileClient (the reference; not VLLMClient). One fix in the two spots the inline comments pointed at: - _parse_tool_calls (non-streaming): on JSONDecodeError, return a TextResponse instead of {}, so the validator's rescue-parse + retry/ nudge loop can drive a correction. Surfaces assistant text when present, else the raw malformed args. Matches llamafile.py:523-528. - send_stream: final assembly now routes the same way (matches llamafile.py:400-414), and a non-string arg fragment is serialized into the buffer rather than silently dropped (dropping left a gap that could parse into wrong-but-valid args). Tests: flip the mis-asserting malformed-args test to expect TextResponse; add assistant-text-surfaced, dict-args-accepted, and two streaming cases.
|
Addressed both comments with the single fix you described: malformed tool-call args now return a Heads-up: I rebased on |
antoinezambelli
left a comment
There was a problem hiding this comment.
This is great — thank you! Merging it.
The fail-loud fix is exactly right, and I really like that you mirrored LlamafileClient — that's the correct reference here (not VLLMClient), and it keeps the two clients behaving consistently:
- Non-streaming (
_parse_tool_calls): returning aTextResponseonJSONDecodeErrorinstead of coercing tofn(**{})is precisely the behavior we want — it lets the validator's rescue-parse + retry/nudge loop drive a correction rather than quietly succeeding with empty args. Surfacing assistant text when present and falling back to the raw args otherwise is a nice touch. Matchesllamafile.py:524-528. - Streaming (
send_stream): nice catch routing final assembly through the same path, and serializing a non-string arg fragment instead of dropping it is the right instinct — a dropped fragment can leave a gap that parses into wrong-but-valid args (a quiet false success), whereas folding it in means the parse at stream end either recovers the object or fails loud into the retry path. Matchesllamafile.py:411-414.
The "one malformed among several → whole response becomes TextResponse" behavior (no partial execution of valid sibling calls) lines up with llamafile's bail too, and the tests around it are lovely. The extra_headers-over-bearer design keeping providers out of a per-quirk registry is exactly the boundary I'd have wanted.
I pulled it down and ran everything locally: your 38 new tests pass, and the full unit suite (1066 tests) is green against current main.
Really appreciate how thoroughly you turned around the review feedback — the malformed/streaming/dict-args edge-case tests especially. Squash-merging now. Thanks again for the contribution!
Remove the changelog entry that landed with #89 — this project does not keep an [Unreleased] section; entries are added under the version that ships them. Fix the backend table in BACKEND_SETUP.md, which said "four" and was missing two clients: vLLM (added in 0.7.2) and the new OpenAI-compatible client. Now lists all six. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for merging and the thorough review! |
* Proxy: native-only + transparent OpenAI passthrough
Make the OpenAI-compatible proxy native-tool-call-only and forward the
client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip
that dropped schema detail and leaked empty tool names.
- Remove the proxy's --mode surface; the proxy always drives the backend
client native. LlamafileClient's prompt-injection machinery is retained
for non-proxy WorkflowRunner / direct-client use (it still wins for some
models in full-guardrail workflow evals).
- Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native
path sends it verbatim. Other clients accept-and-ignore (vLLM also gains
the previously-missing passthrough/inbound_anthropic_body kwargs).
- run_inference forwards raw OpenAI messages/tools only on the clean first
attempt (use_raw_messages gate); any mutation falls back to fold+serialize.
- respond tool is now opt-in (--inject-respond-tool, default off).
- No instrumentation (proxy_trace/guardrail_stats deliberately not ported).
- Tests: drop removed mode-guard tests; respond tests opt in explicitly; add
native-passthrough, detachment, respond-default, and first-attempt-gate
coverage. Docs: ADR-012 revision + BACKEND_SETUP proxy note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Add prompt-injection as opt-in proxy capability (--backend-capability)
The proxy serves tool-call-capable backends natively (verbatim tool/message
passthrough). This adds prompt-injection back as an explicit opt-in for
non-function-calling backends (llama.cpp / llamafile without a tool template).
- New --backend-capability {native,prompt} (default native), declared once at
construction and frozen — no runtime probing or mid-request mode mutation.
- prompt capability reuses LlamafileClient's existing prompt path (build the
tool prompt, downgrade tool/assistant-tool_call history to text, parse the
JSON tool call back into native tool_calls). No client changes.
- Handler suppresses verbatim raw passthrough when in prompt mode so inference
folds normally and the client injects the tool prompt.
- Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol).
- Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Proxy: log effective backend_timeout at startup
The configurable backend_timeout (#91) was validated, stored, and threaded
into every client request, but never surfaced at launch. Extend the
"Proxy ready" line to report the effective value so the operative timeout
is visible/diagnosable from the startup log.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* vLLM: single source of truth for model identity (#75)
VLLMClient kept two identity fields with distinct roles — model_path (the
verbatim wire "model" field, which vLLM validates against its
--served-model-name) and model (the derived registry-lookup key). The proxy's
external-mode served-name adoption set both by hand (model_path = served;
model = served), duplicating the derivation logic and storing the full served
name where the constructor's rule stores the stem.
Extract the path->key derivation into _derive_model_field and wrap both
assignments in _set_model_identity, then call it from __init__ and from the
proxy. External adoption now upholds the same (model_path, model) invariant as
construction: an HF-repo-id served name reaches the wire verbatim while the
registry key is the derived stem.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Clients: consistent malformed-tool-call + response-shape handling
Audited malformed-tool-call and unexpected-payload handling across the
OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and
LlamafileClient. Standardize on one principle, applied uniformly:
- Malformed argument JSON (a model mistake) -> TextResponse, routing the raw
output back through the inference loop so the rescue/retry path can recover.
- A broken provider envelope (missing choices/message) or unexpected args type
(a contract violation, not the model's fault) -> BackendError: fail loud and
consistent, never a stray KeyError/IndexError.
Changes:
- vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on
malformed args, unlike llamafile's retry-driving TextResponse) with a
_parse_tool_calls mirroring the reference. Route both send() and send_stream()
through it so streaming and non-streaming agree: a fully accumulated but
unparseable arguments string finalizes as a TextResponse, not an exception.
- llamafile / openai_compat: guard the bare data["choices"][0]["message"]
subscripts -> BackendError on a broken envelope (matching what vLLM already
did for choices). llamafile also hardens function/name access.
- ollama: defensive .get on function/name (both paths); document that Ollama
emits dict args by contract, so no json.loads is needed there.
Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing-
function/reasoning) + streaming malformed-fragment parity; envelope-guard tests
for llamafile and openai_compat. 1092 unit tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* LlamafileClient: remove runtime auto mode; native-first, frozen capability
Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try
native, fall back to prompt on HTTP error, recording resolved_mode). This was
the last vestige of the mid-request capability mutation the proxy rewrite
excised everywhere else; the proxy already declares its capability up front via
--backend-capability. With auto gone, resolved_mode is always == self.mode, so
the whole tri-state indirection collapses to a direct dispatch on self.mode.
The default is now native. This is both hardening and a deliberate posture
shift: local-model function-calling support has matured into the more reliable
path, so native-first is the right default. Prompt-injection is preserved as an
explicit opt-in (mode="prompt") and is the theoretically correct fallback for
non-FC backends — but it is honestly flagged, in the docstring and docs, that
models tend to struggle to drive the prompt-injected protocol reliably on more
complex, multi-step interactions. Capability is declared-and-frozen: an invalid
mode (including the old "auto") now raises ValueError rather than silently
degrading.
- llamafile.py: validate mode in __init__; default native; delete
_resolve_and_send and the resolved_mode attribute/branches; dispatch send /
send_stream on self.mode; rewrite the class docstring (native-first rationale
+ prompt caveat).
- eval_runner.py: --llamafile-mode choices [native, prompt], default native.
- docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat.
- tests: drop the auto-mode suite; assert native default + ValueError on "auto".
Consumers verified unaffected: the proxy (both sites), batch_eval, and the
integration script all pass mode explicitly. 1086 unit tests green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* fix(eval): honor manual context budget in batch_eval via start_with_budget
batch_eval brought servers up with a bare server.start() (no ctx_override)
and resolved the budget separately via server.resolve_budget(), so
--budget-mode manual --num-ctx N was a no-op for llama-server: the server
booted at the model's full native context (no -c), and resolve_budget(MANUAL)
just read that full value back from /props. (Ollama was unaffected — its
context is per-request via set_num_ctx.)
Route both the initial bring-up and _recover_server through the prod
start_with_budget() path, which threads manual_tokens -> ctx_override -> -c
at launch and returns the resolved budget. _recover_server gains
budget_mode/manual_tokens params so a restarted server reuses the same
budget. Drops the now-redundant standalone resolve_budget() on the happy
path (still used on the recovery branch to read back the resolved value).
This also fixes FORGE_FAST mode, which the old bare-start() path never
supported.
Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000):
server boots with -c, rows record budget_tokens=20224 (server-clamped)
instead of the previous 262144 full-native read-back.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* chore(release): 0.7.3 — native-first proxy
Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this
branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient
#89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94).
Headline: native-first proxy. BREAKING — the proxy --mode flag is renamed
to --backend-capability (no alias; --mode was only introduced in 0.7.1).
Native is the default and only auto-selected protocol; prompt-injection is
an explicit opt-in for non-FC llama.cpp/llamafile backends.
USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode
tends to degrade on more complex multi-step interactions. BACKEND_SETUP,
EVAL_GUIDE, and ADR-012 were already updated earlier on this branch.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Tested end-to-end against Cloudflare Workers AI (Mistral-Small-3.1-24B) via my application (teasel local). Multi-turn function calling, streaming, bearer auth all confirmed working.
Provider-agnostic client for any backend exposing the OpenAI /v1/chat/completions shape with optional bearer auth — Cloudflare Workers AI, Fireworks, OpenRouter, OpenAI itself, etc. Native function calling, SSE streaming, full LLMClient protocol surface.
Addresses the review feedback in #88:
Includes BACKEND_SETUP.md section and CHANGELOG entry.