Add OpenAICompatClient for OpenAI-compatible endpoints by lucasgerads · Pull Request #89 · antoinezambelli/forge

lucasgerads · 2026-05-28T23:07:23Z

Tested end-to-end against Cloudflare Workers AI (Mistral-Small-3.1-24B) via my application (teasel local). Multi-turn function calling, streaming, bearer auth all confirmed working.

Provider-agnostic client for any backend exposing the OpenAI /v1/chat/completions shape with optional bearer auth — Cloudflare Workers AI, Fireworks, OpenRouter, OpenAI itself, etc. Native function calling, SSE streaming, full LLMClient protocol surface.

Addresses the review feedback in #88:

Mirrors VLLMClient sampling kwargs (top_k, min_p, repeat_penalty, presence_penalty, chat_template_kwargs) with recommended_sampling defaulting to False (hosted model IDs aren't in forge's registry).
Accepts passthrough and inbound_anthropic_body to satisfy the LLMClient protocol; inbound_anthropic_body is a no-op here.
extra_headers kwarg for provider quirks (e.g. OpenRouter's HTTP-Referer) without a per-provider registry.
aclose() for httpx pool cleanup.
Docstring sets the contributor expectation: file an issue rather than adding per-provider if/else branches.

Includes BACKEND_SETUP.md section and CHANGELOG entry.

antoinezambelli

Thanks for this — the overall shape is spot on and it matches the direction from #88: standalone sibling client, provider-agnostic OpenAI-compatible endpoint, bearer auth, extra_headers, full LLMClient surface, and good focused tests/docs.

One change before merge, around Forge's fail-fast/fail-loud preference: malformed tool-call argument JSON shouldn't become {}. Please route it through the retry-driving TextResponse path, matching LlamafileClient — both in _parse_tool_calls (non-streaming) and in the streaming assembly. The two inline comments point at both spots; they're the same fix in two places.

Note: VLLMClient currently doesn't fully match the desired malformed-args behavior either. That's existing debt on my side, not precedent to copy here — LlamafileClient is the reference.

Provider-agnostic client for any backend exposing the OpenAI /v1/chat/completions shape with optional bearer auth — Cloudflare Workers AI, Fireworks, OpenRouter, OpenAI itself, etc. Native function calling, SSE streaming, full LLMClient protocol surface. Addresses the review feedback in antoinezambelli#88: - Mirrors VLLMClient sampling kwargs (top_k, min_p, repeat_penalty, presence_penalty, chat_template_kwargs) with recommended_sampling defaulting to False (hosted model IDs aren't in forge's registry). - Accepts passthrough and inbound_anthropic_body to satisfy the LLMClient protocol; inbound_anthropic_body is a no-op here. - extra_headers kwarg for provider quirks (e.g. OpenRouter's HTTP-Referer) without a per-provider registry. - aclose() for httpx pool cleanup. - Docstring sets the contributor expectation: file an issue rather than adding per-provider if/else branches. Includes BACKEND_SETUP.md section and CHANGELOG entry.

Addresses Antoine's review comments on PR antoinezambelli#89 (CHANGES_REQUESTED). Per Forge's fail-fast/fail-loud preference, malformed tool-call argument JSON must not be coerced to {} and executed as fn(**{}) — that's a quiet false success. Route it through the retry-driving TextResponse path instead, matching LlamafileClient (the reference; not VLLMClient). One fix in the two spots the inline comments pointed at: - _parse_tool_calls (non-streaming): on JSONDecodeError, return a TextResponse instead of {}, so the validator's rescue-parse + retry/ nudge loop can drive a correction. Surfaces assistant text when present, else the raw malformed args. Matches llamafile.py:523-528. - send_stream: final assembly now routes the same way (matches llamafile.py:400-414), and a non-string arg fragment is serialized into the buffer rather than silently dropped (dropping left a gap that could parse into wrong-but-valid args). Tests: flip the mis-asserting malformed-args test to expect TextResponse; add assistant-text-surfaced, dict-args-accepted, and two streaming cases.

lucasgerads · 2026-05-30T20:54:38Z

Addressed both comments with the single fix you described: malformed tool-call args now return a TextResponse (via _parse_tool_calls, shared by the streaming + non-streaming paths) instead of coercing to {}, matching LlamafileClient. Also stopped the streaming path from silently dropping a non-string arg fragment, and added test coverage. Left VLLMClient alone per your note.

Heads-up: I rebased on main and force-pushed, so your inline comments show as outdated — the fix is in the latest commit (660a53c).

antoinezambelli

This is great — thank you! Merging it.

The fail-loud fix is exactly right, and I really like that you mirrored LlamafileClient — that's the correct reference here (not VLLMClient), and it keeps the two clients behaving consistently:

Non-streaming (_parse_tool_calls): returning a TextResponse on JSONDecodeError instead of coercing to fn(**{}) is precisely the behavior we want — it lets the validator's rescue-parse + retry/nudge loop drive a correction rather than quietly succeeding with empty args. Surfacing assistant text when present and falling back to the raw args otherwise is a nice touch. Matches llamafile.py:524-528.
Streaming (send_stream): nice catch routing final assembly through the same path, and serializing a non-string arg fragment instead of dropping it is the right instinct — a dropped fragment can leave a gap that parses into wrong-but-valid args (a quiet false success), whereas folding it in means the parse at stream end either recovers the object or fails loud into the retry path. Matches llamafile.py:411-414.

The "one malformed among several → whole response becomes TextResponse" behavior (no partial execution of valid sibling calls) lines up with llamafile's bail too, and the tests around it are lovely. The extra_headers-over-bearer design keeping providers out of a per-quirk registry is exactly the boundary I'd have wanted.

I pulled it down and ran everything locally: your 38 new tests pass, and the full unit suite (1066 tests) is green against current main.

Really appreciate how thoroughly you turned around the review feedback — the malformed/streaming/dict-args edge-case tests especially. Squash-merging now. Thanks again for the contribution!

Remove the changelog entry that landed with #89 — this project does not keep an [Unreleased] section; entries are added under the version that ships them. Fix the backend table in BACKEND_SETUP.md, which said "four" and was missing two clients: vLLM (added in 0.7.2) and the new OpenAI-compatible client. Now lists all six. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lucasgerads · 2026-06-01T05:51:49Z

Thanks for merging and the thorough review!

* Proxy: native-only + transparent OpenAI passthrough Make the OpenAI-compatible proxy native-tool-call-only and forward the client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip that dropped schema detail and leaked empty tool names. - Remove the proxy's --mode surface; the proxy always drives the backend client native. LlamafileClient's prompt-injection machinery is retained for non-proxy WorkflowRunner / direct-client use (it still wins for some models in full-guardrail workflow evals). - Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native path sends it verbatim. Other clients accept-and-ignore (vLLM also gains the previously-missing passthrough/inbound_anthropic_body kwargs). - run_inference forwards raw OpenAI messages/tools only on the clean first attempt (use_raw_messages gate); any mutation falls back to fold+serialize. - respond tool is now opt-in (--inject-respond-tool, default off). - No instrumentation (proxy_trace/guardrail_stats deliberately not ported). - Tests: drop removed mode-guard tests; respond tests opt in explicitly; add native-passthrough, detachment, respond-default, and first-attempt-gate coverage. Docs: ADR-012 revision + BACKEND_SETUP proxy note. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Add prompt-injection as opt-in proxy capability (--backend-capability) The proxy serves tool-call-capable backends natively (verbatim tool/message passthrough). This adds prompt-injection back as an explicit opt-in for non-function-calling backends (llama.cpp / llamafile without a tool template). - New --backend-capability {native,prompt} (default native), declared once at construction and frozen — no runtime probing or mid-request mode mutation. - prompt capability reuses LlamafileClient's existing prompt path (build the tool prompt, downgrade tool/assistant-tool_call history to text, parse the JSON tool call back into native tool_calls). No client changes. - Handler suppresses verbatim raw passthrough when in prompt mode so inference folds normally and the client injects the tool prompt. - Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol). - Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Proxy: log effective backend_timeout at startup The configurable backend_timeout (#91) was validated, stored, and threaded into every client request, but never surfaced at launch. Extend the "Proxy ready" line to report the effective value so the operative timeout is visible/diagnosable from the startup log. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * vLLM: single source of truth for model identity (#75) VLLMClient kept two identity fields with distinct roles — model_path (the verbatim wire "model" field, which vLLM validates against its --served-model-name) and model (the derived registry-lookup key). The proxy's external-mode served-name adoption set both by hand (model_path = served; model = served), duplicating the derivation logic and storing the full served name where the constructor's rule stores the stem. Extract the path->key derivation into _derive_model_field and wrap both assignments in _set_model_identity, then call it from __init__ and from the proxy. External adoption now upholds the same (model_path, model) invariant as construction: an HF-repo-id served name reaches the wire verbatim while the registry key is the derived stem. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Clients: consistent malformed-tool-call + response-shape handling Audited malformed-tool-call and unexpected-payload handling across the OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and LlamafileClient. Standardize on one principle, applied uniformly: - Malformed argument JSON (a model mistake) -> TextResponse, routing the raw output back through the inference loop so the rescue/retry path can recover. - A broken provider envelope (missing choices/message) or unexpected args type (a contract violation, not the model's fault) -> BackendError: fail loud and consistent, never a stray KeyError/IndexError. Changes: - vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on malformed args, unlike llamafile's retry-driving TextResponse) with a _parse_tool_calls mirroring the reference. Route both send() and send_stream() through it so streaming and non-streaming agree: a fully accumulated but unparseable arguments string finalizes as a TextResponse, not an exception. - llamafile / openai_compat: guard the bare data["choices"][0]["message"] subscripts -> BackendError on a broken envelope (matching what vLLM already did for choices). llamafile also hardens function/name access. - ollama: defensive .get on function/name (both paths); document that Ollama emits dict args by contract, so no json.loads is needed there. Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing- function/reasoning) + streaming malformed-fragment parity; envelope-guard tests for llamafile and openai_compat. 1092 unit tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * LlamafileClient: remove runtime auto mode; native-first, frozen capability Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try native, fall back to prompt on HTTP error, recording resolved_mode). This was the last vestige of the mid-request capability mutation the proxy rewrite excised everywhere else; the proxy already declares its capability up front via --backend-capability. With auto gone, resolved_mode is always == self.mode, so the whole tri-state indirection collapses to a direct dispatch on self.mode. The default is now native. This is both hardening and a deliberate posture shift: local-model function-calling support has matured into the more reliable path, so native-first is the right default. Prompt-injection is preserved as an explicit opt-in (mode="prompt") and is the theoretically correct fallback for non-FC backends — but it is honestly flagged, in the docstring and docs, that models tend to struggle to drive the prompt-injected protocol reliably on more complex, multi-step interactions. Capability is declared-and-frozen: an invalid mode (including the old "auto") now raises ValueError rather than silently degrading. - llamafile.py: validate mode in __init__; default native; delete _resolve_and_send and the resolved_mode attribute/branches; dispatch send / send_stream on self.mode; rewrite the class docstring (native-first rationale + prompt caveat). - eval_runner.py: --llamafile-mode choices [native, prompt], default native. - docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat. - tests: drop the auto-mode suite; assert native default + ValueError on "auto". Consumers verified unaffected: the proxy (both sites), batch_eval, and the integration script all pass mode explicitly. 1086 unit tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * fix(eval): honor manual context budget in batch_eval via start_with_budget batch_eval brought servers up with a bare server.start() (no ctx_override) and resolved the budget separately via server.resolve_budget(), so --budget-mode manual --num-ctx N was a no-op for llama-server: the server booted at the model's full native context (no -c), and resolve_budget(MANUAL) just read that full value back from /props. (Ollama was unaffected — its context is per-request via set_num_ctx.) Route both the initial bring-up and _recover_server through the prod start_with_budget() path, which threads manual_tokens -> ctx_override -> -c at launch and returns the resolved budget. _recover_server gains budget_mode/manual_tokens params so a restarted server reuses the same budget. Drops the now-redundant standalone resolve_budget() on the happy path (still used on the recovery branch to read back the resolved value). This also fixes FORGE_FAST mode, which the old bare-start() path never supported. Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000): server boots with -c, rows record budget_tokens=20224 (server-clamped) instead of the previous 262144 full-native read-back. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * chore(release): 0.7.3 — native-first proxy Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient #89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94). Headline: native-first proxy. BREAKING — the proxy --mode flag is renamed to --backend-capability (no alias; --mode was only introduced in 0.7.1). Native is the default and only auto-selected protocol; prompt-injection is an explicit opt-in for non-FC llama.cpp/llamafile backends. USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode tends to degrade on more complex multi-step interactions. BACKEND_SETUP, EVAL_GUIDE, and ADR-012 were already updated earlier on this branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

lucasgerads marked this pull request as ready for review May 28, 2026 23:11

antoinezambelli self-requested a review May 29, 2026 06:16

antoinezambelli reviewed May 29, 2026

View reviewed changes

Comment thread src/forge/clients/openai_compat.py Outdated

antoinezambelli reviewed May 29, 2026

View reviewed changes

Comment thread src/forge/clients/openai_compat.py Outdated

antoinezambelli requested changes May 29, 2026

View reviewed changes

lucasgerads added 2 commits May 30, 2026 09:39

lucasgerads force-pushed the lucas branch from 0796197 to 660a53c Compare May 30, 2026 20:52

antoinezambelli approved these changes May 31, 2026

View reviewed changes

antoinezambelli merged commit ba25f3c into antoinezambelli:main May 31, 2026
2 checks passed

antoinezambelli mentioned this pull request May 31, 2026

Post-merge cleanup for OpenAICompatClient (#89) #95

Merged

antoinezambelli mentioned this pull request May 31, 2026

Bearer-auth support for hosted OpenAI-compatible providers (e.g Cloudflare AI workers) #88

Closed

antoinezambelli mentioned this pull request Jun 1, 2026

Native-first proxy (0.7.3) #97

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add OpenAICompatClient for OpenAI-compatible endpoints#89

Add OpenAICompatClient for OpenAI-compatible endpoints#89
antoinezambelli merged 2 commits into
antoinezambelli:mainfrom
teasel-tools:lucas

lucasgerads commented May 28, 2026

Uh oh!

Uh oh!

Uh oh!

antoinezambelli left a comment

Uh oh!

lucasgerads commented May 30, 2026

Uh oh!

antoinezambelli left a comment

Uh oh!

Uh oh!

lucasgerads commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lucasgerads commented May 28, 2026

Uh oh!

Uh oh!

Uh oh!

antoinezambelli left a comment

Choose a reason for hiding this comment

Uh oh!

lucasgerads commented May 30, 2026

Uh oh!

antoinezambelli left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lucasgerads commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants