Skip to content

Add OpenAICompatClient for OpenAI-compatible endpoints#89

Merged
antoinezambelli merged 2 commits into
antoinezambelli:mainfrom
teasel-tools:lucas
May 31, 2026
Merged

Add OpenAICompatClient for OpenAI-compatible endpoints#89
antoinezambelli merged 2 commits into
antoinezambelli:mainfrom
teasel-tools:lucas

Conversation

@lucasgerads

Copy link
Copy Markdown
Contributor

Tested end-to-end against Cloudflare Workers AI (Mistral-Small-3.1-24B) via my application (teasel local). Multi-turn function calling, streaming, bearer auth all confirmed working.

Provider-agnostic client for any backend exposing the OpenAI /v1/chat/completions shape with optional bearer auth — Cloudflare Workers AI, Fireworks, OpenRouter, OpenAI itself, etc. Native function calling, SSE streaming, full LLMClient protocol surface.

Addresses the review feedback in #88:

  • Mirrors VLLMClient sampling kwargs (top_k, min_p, repeat_penalty, presence_penalty, chat_template_kwargs) with recommended_sampling defaulting to False (hosted model IDs aren't in forge's registry).
  • Accepts passthrough and inbound_anthropic_body to satisfy the LLMClient protocol; inbound_anthropic_body is a no-op here.
  • extra_headers kwarg for provider quirks (e.g. OpenRouter's HTTP-Referer) without a per-provider registry.
  • aclose() for httpx pool cleanup.
  • Docstring sets the contributor expectation: file an issue rather than adding per-provider if/else branches.

Includes BACKEND_SETUP.md section and CHANGELOG entry.

@lucasgerads lucasgerads marked this pull request as ready for review May 28, 2026 23:11
@antoinezambelli antoinezambelli self-requested a review May 29, 2026 06:16
Comment thread src/forge/clients/openai_compat.py Outdated
Comment thread src/forge/clients/openai_compat.py Outdated

@antoinezambelli antoinezambelli left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — the overall shape is spot on and it matches the direction from #88: standalone sibling client, provider-agnostic OpenAI-compatible endpoint, bearer auth, extra_headers, full LLMClient surface, and good focused tests/docs.

One change before merge, around Forge's fail-fast/fail-loud preference: malformed tool-call argument JSON shouldn't become {}. Please route it through the retry-driving TextResponse path, matching LlamafileClient — both in _parse_tool_calls (non-streaming) and in the streaming assembly. The two inline comments point at both spots; they're the same fix in two places.

Note: VLLMClient currently doesn't fully match the desired malformed-args behavior either. That's existing debt on my side, not precedent to copy here — LlamafileClient is the reference.

Provider-agnostic client for any backend exposing the OpenAI
/v1/chat/completions shape with optional bearer auth — Cloudflare
Workers AI, Fireworks, OpenRouter, OpenAI itself, etc. Native function
calling, SSE streaming, full LLMClient protocol surface.

Addresses the review feedback in antoinezambelli#88:
- Mirrors VLLMClient sampling kwargs (top_k, min_p, repeat_penalty,
  presence_penalty, chat_template_kwargs) with recommended_sampling
  defaulting to False (hosted model IDs aren't in forge's registry).
- Accepts passthrough and inbound_anthropic_body to satisfy the
  LLMClient protocol; inbound_anthropic_body is a no-op here.
- extra_headers kwarg for provider quirks (e.g. OpenRouter's
  HTTP-Referer) without a per-provider registry.
- aclose() for httpx pool cleanup.
- Docstring sets the contributor expectation: file an issue rather
  than adding per-provider if/else branches.

Includes BACKEND_SETUP.md section and CHANGELOG entry.
Addresses Antoine's review comments on PR antoinezambelli#89 (CHANGES_REQUESTED).

Per Forge's fail-fast/fail-loud preference, malformed tool-call argument
JSON must not be coerced to {} and executed as fn(**{}) — that's a quiet
false success. Route it through the retry-driving TextResponse path
instead, matching LlamafileClient (the reference; not VLLMClient).

One fix in the two spots the inline comments pointed at:
- _parse_tool_calls (non-streaming): on JSONDecodeError, return a
  TextResponse instead of {}, so the validator's rescue-parse + retry/
  nudge loop can drive a correction. Surfaces assistant text when present,
  else the raw malformed args. Matches llamafile.py:523-528.
- send_stream: final assembly now routes the same way (matches
  llamafile.py:400-414), and a non-string arg fragment is serialized into
  the buffer rather than silently dropped (dropping left a gap that could
  parse into wrong-but-valid args).

Tests: flip the mis-asserting malformed-args test to expect TextResponse;
add assistant-text-surfaced, dict-args-accepted, and two streaming cases.
@lucasgerads

Copy link
Copy Markdown
Contributor Author

Addressed both comments with the single fix you described: malformed tool-call args now return a TextResponse (via _parse_tool_calls, shared by the streaming + non-streaming paths) instead of coercing to {}, matching LlamafileClient. Also stopped the streaming path from silently dropping a non-string arg fragment, and added test coverage. Left VLLMClient alone per your note.

Heads-up: I rebased on main and force-pushed, so your inline comments show as outdated — the fix is in the latest commit (660a53c).

@antoinezambelli antoinezambelli left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great — thank you! Merging it.

The fail-loud fix is exactly right, and I really like that you mirrored LlamafileClient — that's the correct reference here (not VLLMClient), and it keeps the two clients behaving consistently:

  • Non-streaming (_parse_tool_calls): returning a TextResponse on JSONDecodeError instead of coercing to fn(**{}) is precisely the behavior we want — it lets the validator's rescue-parse + retry/nudge loop drive a correction rather than quietly succeeding with empty args. Surfacing assistant text when present and falling back to the raw args otherwise is a nice touch. Matches llamafile.py:524-528.
  • Streaming (send_stream): nice catch routing final assembly through the same path, and serializing a non-string arg fragment instead of dropping it is the right instinct — a dropped fragment can leave a gap that parses into wrong-but-valid args (a quiet false success), whereas folding it in means the parse at stream end either recovers the object or fails loud into the retry path. Matches llamafile.py:411-414.

The "one malformed among several → whole response becomes TextResponse" behavior (no partial execution of valid sibling calls) lines up with llamafile's bail too, and the tests around it are lovely. The extra_headers-over-bearer design keeping providers out of a per-quirk registry is exactly the boundary I'd have wanted.

I pulled it down and ran everything locally: your 38 new tests pass, and the full unit suite (1066 tests) is green against current main.

Really appreciate how thoroughly you turned around the review feedback — the malformed/streaming/dict-args edge-case tests especially. Squash-merging now. Thanks again for the contribution!

@antoinezambelli antoinezambelli merged commit ba25f3c into antoinezambelli:main May 31, 2026
2 checks passed
antoinezambelli added a commit that referenced this pull request May 31, 2026
Remove the changelog entry that landed with #89 — this project does not
keep an [Unreleased] section; entries are added under the version that
ships them.

Fix the backend table in BACKEND_SETUP.md, which said "four" and was
missing two clients: vLLM (added in 0.7.2) and the new OpenAI-compatible
client. Now lists all six.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lucasgerads

Copy link
Copy Markdown
Contributor Author

Thanks for merging and the thorough review!

antoinezambelli added a commit that referenced this pull request Jun 1, 2026
* Proxy: native-only + transparent OpenAI passthrough

Make the OpenAI-compatible proxy native-tool-call-only and forward the
client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip
that dropped schema detail and leaked empty tool names.

- Remove the proxy's --mode surface; the proxy always drives the backend
  client native. LlamafileClient's prompt-injection machinery is retained
  for non-proxy WorkflowRunner / direct-client use (it still wins for some
  models in full-guardrail workflow evals).
- Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native
  path sends it verbatim. Other clients accept-and-ignore (vLLM also gains
  the previously-missing passthrough/inbound_anthropic_body kwargs).
- run_inference forwards raw OpenAI messages/tools only on the clean first
  attempt (use_raw_messages gate); any mutation falls back to fold+serialize.
- respond tool is now opt-in (--inject-respond-tool, default off).
- No instrumentation (proxy_trace/guardrail_stats deliberately not ported).
- Tests: drop removed mode-guard tests; respond tests opt in explicitly; add
  native-passthrough, detachment, respond-default, and first-attempt-gate
  coverage. Docs: ADR-012 revision + BACKEND_SETUP proxy note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Add prompt-injection as opt-in proxy capability (--backend-capability)

The proxy serves tool-call-capable backends natively (verbatim tool/message
passthrough). This adds prompt-injection back as an explicit opt-in for
non-function-calling backends (llama.cpp / llamafile without a tool template).

- New --backend-capability {native,prompt} (default native), declared once at
  construction and frozen — no runtime probing or mid-request mode mutation.
- prompt capability reuses LlamafileClient's existing prompt path (build the
  tool prompt, downgrade tool/assistant-tool_call history to text, parse the
  JSON tool call back into native tool_calls). No client changes.
- Handler suppresses verbatim raw passthrough when in prompt mode so inference
  folds normally and the client injects the tool prompt.
- Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol).
- Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Proxy: log effective backend_timeout at startup

The configurable backend_timeout (#91) was validated, stored, and threaded
into every client request, but never surfaced at launch. Extend the
"Proxy ready" line to report the effective value so the operative timeout
is visible/diagnosable from the startup log.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* vLLM: single source of truth for model identity (#75)

VLLMClient kept two identity fields with distinct roles — model_path (the
verbatim wire "model" field, which vLLM validates against its
--served-model-name) and model (the derived registry-lookup key). The proxy's
external-mode served-name adoption set both by hand (model_path = served;
model = served), duplicating the derivation logic and storing the full served
name where the constructor's rule stores the stem.

Extract the path->key derivation into _derive_model_field and wrap both
assignments in _set_model_identity, then call it from __init__ and from the
proxy. External adoption now upholds the same (model_path, model) invariant as
construction: an HF-repo-id served name reaches the wire verbatim while the
registry key is the derived stem.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Clients: consistent malformed-tool-call + response-shape handling

Audited malformed-tool-call and unexpected-payload handling across the
OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and
LlamafileClient. Standardize on one principle, applied uniformly:

- Malformed argument JSON (a model mistake) -> TextResponse, routing the raw
  output back through the inference loop so the rescue/retry path can recover.
- A broken provider envelope (missing choices/message) or unexpected args type
  (a contract violation, not the model's fault) -> BackendError: fail loud and
  consistent, never a stray KeyError/IndexError.

Changes:
- vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on
  malformed args, unlike llamafile's retry-driving TextResponse) with a
  _parse_tool_calls mirroring the reference. Route both send() and send_stream()
  through it so streaming and non-streaming agree: a fully accumulated but
  unparseable arguments string finalizes as a TextResponse, not an exception.
- llamafile / openai_compat: guard the bare data["choices"][0]["message"]
  subscripts -> BackendError on a broken envelope (matching what vLLM already
  did for choices). llamafile also hardens function/name access.
- ollama: defensive .get on function/name (both paths); document that Ollama
  emits dict args by contract, so no json.loads is needed there.

Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing-
function/reasoning) + streaming malformed-fragment parity; envelope-guard tests
for llamafile and openai_compat. 1092 unit tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* LlamafileClient: remove runtime auto mode; native-first, frozen capability

Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try
native, fall back to prompt on HTTP error, recording resolved_mode). This was
the last vestige of the mid-request capability mutation the proxy rewrite
excised everywhere else; the proxy already declares its capability up front via
--backend-capability. With auto gone, resolved_mode is always == self.mode, so
the whole tri-state indirection collapses to a direct dispatch on self.mode.

The default is now native. This is both hardening and a deliberate posture
shift: local-model function-calling support has matured into the more reliable
path, so native-first is the right default. Prompt-injection is preserved as an
explicit opt-in (mode="prompt") and is the theoretically correct fallback for
non-FC backends — but it is honestly flagged, in the docstring and docs, that
models tend to struggle to drive the prompt-injected protocol reliably on more
complex, multi-step interactions. Capability is declared-and-frozen: an invalid
mode (including the old "auto") now raises ValueError rather than silently
degrading.

- llamafile.py: validate mode in __init__; default native; delete
  _resolve_and_send and the resolved_mode attribute/branches; dispatch send /
  send_stream on self.mode; rewrite the class docstring (native-first rationale
  + prompt caveat).
- eval_runner.py: --llamafile-mode choices [native, prompt], default native.
- docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat.
- tests: drop the auto-mode suite; assert native default + ValueError on "auto".

Consumers verified unaffected: the proxy (both sites), batch_eval, and the
integration script all pass mode explicitly. 1086 unit tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(eval): honor manual context budget in batch_eval via start_with_budget

batch_eval brought servers up with a bare server.start() (no ctx_override)
and resolved the budget separately via server.resolve_budget(), so
--budget-mode manual --num-ctx N was a no-op for llama-server: the server
booted at the model's full native context (no -c), and resolve_budget(MANUAL)
just read that full value back from /props. (Ollama was unaffected — its
context is per-request via set_num_ctx.)

Route both the initial bring-up and _recover_server through the prod
start_with_budget() path, which threads manual_tokens -> ctx_override -> -c
at launch and returns the resolved budget. _recover_server gains
budget_mode/manual_tokens params so a restarted server reuses the same
budget. Drops the now-redundant standalone resolve_budget() on the happy
path (still used on the recovery branch to read back the resolved value).

This also fixes FORGE_FAST mode, which the old bare-start() path never
supported.

Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000):
server boots with -c, rows record budget_tokens=20224 (server-clamped)
instead of the previous 262144 full-native read-back.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(release): 0.7.3 — native-first proxy

Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this
branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient
#89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94).

Headline: native-first proxy. BREAKING — the proxy --mode flag is renamed
to --backend-capability (no alias; --mode was only introduced in 0.7.1).
Native is the default and only auto-selected protocol; prompt-injection is
an explicit opt-in for non-FC llama.cpp/llamafile backends.

USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode
tends to degrade on more complex multi-step interactions. BACKEND_SETUP,
EVAL_GUIDE, and ADR-012 were already updated earlier on this branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants