From f21f2de05f1a53fa227447ff8d9f65bbf5acf73d Mon Sep 17 00:00:00 2001
From: Antoine Zambelli <antoine.zambelli@gmail.com>
Date: Sat, 30 May 2026 02:32:30 -0500
Subject: [PATCH 1/8] Proxy: native-only + transparent OpenAI passthrough

Make the OpenAI-compatible proxy native-tool-call-only and forward the
client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip
that dropped schema detail and leaked empty tool names.

- Remove the proxy's --mode surface; the proxy always drives the backend
  client in native mode. LlamafileClient's prompt-injection machinery is
  retained for non-proxy WorkflowRunner / direct-client use (it still wins
  for some models in full-guardrail workflow evals).
- Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native
  path sends it verbatim. Other clients accept and ignore it (vLLM also gains
  the previously missing passthrough/inbound_anthropic_body kwargs).
- run_inference forwards raw OpenAI messages only on the clean first attempt
  (use_raw_messages gate: no retry, compaction, or context warning) and raw
  tools only on the first attempt; otherwise it falls back to
  fold+serialize / format_tool.
- respond tool is now opt-in (--inject-respond-tool, default off).
- No instrumentation (proxy_trace/guardrail_stats deliberately not ported).
- Tests: drop removed mode-guard tests; respond tests opt in explicitly; add
  native-passthrough, detachment, respond-default, and first-attempt-gate
  coverage. Docs: ADR-012 revision + BACKEND_SETUP proxy note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/BACKEND_SETUP.md                    |   2 +
 docs/decisions/012-openai-proxy.md       |  33 ++++++++
 src/forge/clients/anthropic.py           |   5 ++
 src/forge/clients/base.py                |  14 ++++
 src/forge/clients/llamafile.py           |  51 +++++++++---
 src/forge/clients/ollama.py              |  10 ++-
 src/forge/clients/vllm.py                |  19 ++++-
 src/forge/core/inference.py              |  44 +++++++++-
 src/forge/proxy/__main__.py              |  15 ++--
 src/forge/proxy/handler.py               |  61 ++++++++++++--
 src/forge/proxy/proxy.py                 |  41 ++++------
 src/forge/proxy/server.py                |   3 +
 tests/unit/test_inference_passthrough.py | 100 +++++++++++++++++++++++
 tests/unit/test_proxy_handler.py         |  90 +++++++++++++++++++-
 tests/unit/test_proxy_path1.py           |   8 --
 tests/unit/test_proxy_proxy.py           |  21 ++---
 16 files changed, 428 insertions(+), 89 deletions(-)
 create mode 100644 tests/unit/test_inference_passthrough.py
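
Reviewer note (not part of the commit): the first-attempt gate described in
the commit message can be sketched in isolation as below. This is a minimal
illustration, not the code in run_inference. The helper names
`select_messages` and `select_tools`, the `Payload` alias, and the boolean
parameters are hypothetical stand-ins for the `use_raw_messages` condition and
the `raw_tools_kwarg` check that appear in the diff.

```python
from typing import Any, Optional

# Hypothetical alias for an OpenAI-shape list of dicts (messages or tools).
Payload = list[dict[str, Any]]


def select_messages(
    raw_messages: Optional[Payload],
    folded_messages: Payload,
    attempt: int,
    was_compacted: bool,
    has_context_warning: bool,
) -> Payload:
    """Return the raw transcript only on a clean first attempt.

    Any mutation (retry, compaction, context warning) selects the
    folded/serialized form instead.
    """
    use_raw = (
        raw_messages is not None
        and attempt == 0
        and not was_compacted
        and not has_context_warning
    )
    return raw_messages if use_raw else folded_messages


def select_tools(
    raw_tools: Optional[Payload],
    formatted_tools: Payload,
    attempt: int,
) -> Payload:
    """Return the raw tools array only on the first attempt."""
    if raw_tools is not None and attempt == 0:
        return raw_tools
    return formatted_tools
```

In this sketch the message gate is stricter than the tool gate: compaction or
a context warning changes the transcript but not the tool schema, so only a
retry switches the tools back to the re-serialized form.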

diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md
index 8d5cdb2..0bb0297 100644
--- a/docs/BACKEND_SETUP.md
+++ b/docs/BACKEND_SETUP.md
@@ -75,6 +75,8 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999
 
 llamafile does **not** support native function calling — forge's `LlamafileClient` falls back to prompt-injected mode automatically (`mode="auto"`), or you can force it with `mode="prompt"`.
 
+> **Proxy note:** prompt-injection mode is a **direct-client / WorkflowRunner** feature. The OpenAI-compatible **proxy is native-only** — it forwards the client's tools verbatim and does not prompt-inject (see ADR-012). Put an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) behind the proxy; a non-FC backend like llamafile will degrade to passing text through.
+
 Smoke-test:
 
 ```bash
diff --git a/docs/decisions/012-openai-proxy.md b/docs/decisions/012-openai-proxy.md
index 644e5c0..90bc822 100644
--- a/docs/decisions/012-openai-proxy.md
+++ b/docs/decisions/012-openai-proxy.md
@@ -104,6 +104,39 @@ The proxy fully buffers each response from the backend before deciding what to d
 4. **Client disconnect handling** -- detect TCP drop, cancel in-flight backend request, release inference lock.
 5. **Testing** -- unit tests for extraction, integration tests with mock backend, smoke test with real llama-server.
 
+### Revision: native-only + transparent passthrough
+
+The proxy is **native-tool-call-only**. It targets backends that speak the
+native OpenAI tools API (llama.cpp with a tool-calling chat template / `--jinja`,
+vLLM, Ollama, Anthropic). There is no `--mode` flag and no prompt-injection
+fallback in the proxy — prompt-injection mode (`build_tool_prompt`,
+`_downgrade_messages`, the `mode="auto"` HTTP-error fallback) is a non-proxy
+**WorkflowRunner / direct-client** feature only, retained because it still wins
+for some models in full-guardrail workflow evals.
+
+Rationale: the proxy is a transparent layer for an external agent that already
+speaks native FC to a native-FC backend. A traced capture showed the native
+path forwards the client's request byte-for-byte. The earlier eval regression
+(prompt-mode proxy underperforming) was a prompt-injection artifact on an
+FC-capable backend, not proxy overhead.
+
+To preserve that transparency, the proxy forwards the client's **verbatim
+OpenAI `tools` and `messages`** to the backend on the clean first attempt
+(`raw_openai_tools` / `raw_openai_messages`), bypassing the lossy
+`ToolSpec.from_json_schema` → `format_tool` round-trip that dropped schema
+detail and leaked empty tool names. The parsed `ToolSpec` list is kept only as
+forge's validation sidecar. On any forge mutation (retry / compaction / context
+warning) the proxy falls back to the folded/serialized form — see the
+`use_raw_messages` gate in `run_inference`, which mirrors the ADR-015
+`inbound_anthropic_body` drop-on-mutation logic.
+
+The synthetic `respond` tool is **opt-in** (`--inject-respond-tool`, default
+off): the proxy forwards the client's tools untouched unless asked to inject it.
+
+If a backend lacking native FC is placed behind the proxy, it degrades to
+passing the model's text through (no auto-downgrade) — **bring an FC-capable
+backend.**
+
 ### What this is NOT
 
 - **Not a model server.** Forge sits in front of one.
diff --git a/src/forge/clients/anthropic.py b/src/forge/clients/anthropic.py
index 5f9e062..ad9954c 100644
--- a/src/forge/clients/anthropic.py
+++ b/src/forge/clients/anthropic.py
@@ -288,6 +288,7 @@ async def send(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: list[dict[str, Any]] | None = None,
     ) -> LLMResponse:
         """Send messages via the Anthropic Messages API.
 
@@ -296,6 +297,8 @@ async def send(
         forge. ``passthrough`` merges inbound-body extras into the SDK call.
         ``inbound_anthropic_body`` (path 1) triggers verbatim emit — see
         ADR-015 for the cache_control preservation rationale.
+        ``raw_openai_tools`` accepted for protocol symmetry, ignored
+        (Anthropic uses its own tool conversion).
         """
         if sampling:
             log.debug(
@@ -327,12 +330,14 @@ async def send_stream(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: list[dict[str, Any]] | None = None,
     ) -> AsyncIterator[StreamChunk]:
         """Stream via the Anthropic Messages API.
 
         ``sampling`` is accepted for protocol symmetry but ignored.
         ``passthrough`` merges inbound-body extras into the SDK call.
         ``inbound_anthropic_body`` (path 1) triggers verbatim emit; see ADR-015.
+        ``raw_openai_tools`` accepted for protocol symmetry, ignored.
         """
         if sampling:
             log.debug(
diff --git a/src/forge/clients/base.py b/src/forge/clients/base.py
index ec63b22..2a500ca 100644
--- a/src/forge/clients/base.py
+++ b/src/forge/clients/base.py
@@ -9,6 +9,12 @@
 
 from forge.core.workflow import LLMResponse, ToolCall, TextResponse, ToolSpec
 
+# Verbatim OpenAI-shape payloads forwarded by the proxy. The proxy hands the
+# client the user's original ``tools`` array so the backend sees the exact
+# schema the client authored, instead of forge's reconstructed ToolSpec.
+RawOpenAITools = list[dict[str, Any]]
+RawOpenAIMessages = list[dict[str, Any]]
+
 
 @dataclass(frozen=True)
 class TokenUsage:
@@ -86,6 +92,7 @@ async def send(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: RawOpenAITools | None = None,
     ) -> LLMResponse:
         """Send messages and return a parsed response.
 
@@ -116,6 +123,11 @@ async def send(
                 forge-mutation (retry / compaction / context warning) so
                 only the clean first-attempt call rides verbatim. Other
                 clients accept and ignore. See ADR-015.
+            raw_openai_tools: Proxy-only — the client's verbatim OpenAI
+                ``tools`` array. When set, LlamafileClient's native path sends
+                it as-is instead of re-emitting ``format_tool(spec)``, so the
+                backend sees the original schema (no name/schema drift). Other
+                clients accept and ignore.
         """
         ...
 
@@ -126,6 +138,7 @@ async def send_stream(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: RawOpenAITools | None = None,
     ) -> AsyncIterator[StreamChunk]:
         """Send messages and yield streaming chunks.
 
@@ -143,6 +156,7 @@ async def send_stream(
                 Per-call values win over instance state without mutating self.
             passthrough: Optional inbound-body extras dict (see ``send``).
             inbound_anthropic_body: Optional path-1 verbatim body (see ``send``).
+            raw_openai_tools: Optional verbatim OpenAI tools array (see ``send``).
         """
         ...
 
diff --git a/src/forge/clients/llamafile.py b/src/forge/clients/llamafile.py
index 2529a62..7140794 100644
--- a/src/forge/clients/llamafile.py
+++ b/src/forge/clients/llamafile.py
@@ -10,7 +10,7 @@
 
 import httpx
 
-from forge.clients.base import ChunkType, StreamChunk, TokenUsage, format_tool
+from forge.clients.base import ChunkType, RawOpenAITools, StreamChunk, TokenUsage, format_tool
 from forge.clients.sampling_defaults import apply_sampling_defaults
 from forge.core.workflow import LLMResponse, TextResponse, ToolCall, ToolSpec
 from forge.errors import BackendError, ContextDiscoveryError
@@ -270,16 +270,25 @@ async def send(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: RawOpenAITools | None = None,
     ) -> LLMResponse:
         """Resolve mode on first call with tools, then dispatch.
 
         ``inbound_anthropic_body`` is accepted for protocol symmetry and
         silently ignored — LlamafileClient only speaks OpenAI shape.
+
+        ``raw_openai_tools`` (proxy use) is forwarded verbatim as the
+        backend's ``tools`` array on the native path; the prompt path
+        accepts and ignores it (it keeps forge's prompt-injection format).
         """
         if self.resolved_mode is None:
-            return await self._resolve_and_send(messages, tools, sampling, passthrough)
+            return await self._resolve_and_send(
+                messages, tools, sampling, passthrough, raw_openai_tools,
+            )
         elif self.resolved_mode == "native":
-            return await self._send_native(messages, tools, sampling, passthrough)
+            return await self._send_native(
+                messages, tools, sampling, passthrough, raw_openai_tools,
+            )
         else:
             return await self._send_prompt(messages, tools, sampling, passthrough)
 
@@ -290,15 +299,20 @@ async def send_stream(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: RawOpenAITools | None = None,
     ) -> AsyncIterator[StreamChunk]:
         """Stream via SSE, handling both native FC and prompt-injected paths.
 
         ``inbound_anthropic_body`` accepted for protocol symmetry, ignored.
+        ``raw_openai_tools`` (proxy use) is forwarded verbatim on the native
+        path; ignored on the prompt path.
         """
         if self.resolved_mode is None:
             # Probe with a non-streaming call to resolve native vs prompt.
             # Result is discarded — the runner will use the streamed response.
-            await self._resolve_and_send(messages, tools, sampling, passthrough)
+            await self._resolve_and_send(
+                messages, tools, sampling, passthrough, raw_openai_tools,
+            )
         mode = self.resolved_mode
 
         body: dict[str, Any] = dict(passthrough or {})
@@ -315,8 +329,12 @@ async def send_stream(
             prepared = _merge_consecutive(messages)
         else:
             prepared = _merge_consecutive(_downgrade_messages(messages))
-        if mode == "native" and tools:
-            body["tools"] = [format_tool(t) for t in tools]
+        if mode == "native" and (raw_openai_tools is not None or tools):
+            body["tools"] = (
+                raw_openai_tools
+                if raw_openai_tools is not None
+                else [format_tool(t) for t in tools]
+            )
             body["messages"] = prepared
         elif mode == "prompt" and tools:
             tool_prompt = build_tool_prompt(tools)
@@ -457,6 +475,7 @@ async def _resolve_and_send(
         tools: list[ToolSpec] | None,
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
+        raw_openai_tools: RawOpenAITools | None = None,
     ) -> LLMResponse:
         """Auto-resolve mode on first send with tools.
 
@@ -469,10 +488,14 @@ async def _resolve_and_send(
         if not tools:
             # No tools to test with — send without tools, defer resolution
             self.resolved_mode = "native"
-            return await self._send_native(messages, tools, sampling, passthrough)
+            return await self._send_native(
+                messages, tools, sampling, passthrough, raw_openai_tools,
+            )
 
         try:
-            result = await self._send_native(messages, tools, sampling, passthrough)
+            result = await self._send_native(
+                messages, tools, sampling, passthrough, raw_openai_tools,
+            )
             self.resolved_mode = "native"
             return result
         except (httpx.HTTPStatusError, BackendError):
@@ -485,8 +508,14 @@ async def _send_native(
         tools: list[ToolSpec] | None,
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
+        raw_openai_tools: RawOpenAITools | None = None,
     ) -> LLMResponse:
-        """Send using native function calling (OpenAI tools parameter)."""
+        """Send using native function calling (OpenAI tools parameter).
+
+        When ``raw_openai_tools`` is supplied (proxy native passthrough), it is
+        sent as the ``tools`` array verbatim so the backend sees the client's
+        original schema instead of forge's re-emitted ``format_tool(spec)``.
+        """
         merged = _merge_consecutive(messages)
         body: dict[str, Any] = dict(passthrough or {})
         body.update({
@@ -496,7 +525,9 @@ async def _send_native(
         body.setdefault("model", self.model)
         self._apply_slot_id(body)
         self._apply_sampling(body, sampling)
-        if tools:
+        if raw_openai_tools is not None:
+            body["tools"] = raw_openai_tools
+        elif tools:
             body["tools"] = [format_tool(t) for t in tools]
 
         resp = await self._http.post(
diff --git a/src/forge/clients/ollama.py b/src/forge/clients/ollama.py
index 275e7d4..9f93a6c 100644
--- a/src/forge/clients/ollama.py
+++ b/src/forge/clients/ollama.py
@@ -145,6 +145,7 @@ async def send(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: list[dict[str, Any]] | None = None,
     ) -> LLMResponse:
         """Send messages via /api/chat and parse the response.
 
@@ -153,8 +154,8 @@ async def send(
         (forge proxy uses LlamafileClient for external mode). Adding
         Ollama passthrough is a follow-up.
 
-        ``inbound_anthropic_body`` accepted for protocol symmetry, ignored
-        (Ollama is OpenAI-shape only).
+        ``inbound_anthropic_body`` / ``raw_openai_tools`` accepted for protocol
+        symmetry, ignored (Ollama is OpenAI-shape only).
         """
         body: dict[str, Any] = {
             "model": self.model,
@@ -215,11 +216,12 @@ async def send_stream(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: list[dict[str, Any]] | None = None,
     ) -> AsyncIterator[StreamChunk]:
         """Stream via NDJSON from /api/chat.
 
-        ``passthrough`` / ``inbound_anthropic_body`` accepted for protocol
-        symmetry; see ``send`` notes.
+        ``passthrough`` / ``inbound_anthropic_body`` / ``raw_openai_tools``
+        accepted for protocol symmetry; see ``send`` notes.
         """
         body: dict[str, Any] = {
             "model": self.model,
diff --git a/src/forge/clients/vllm.py b/src/forge/clients/vllm.py
index 5983423..ec0b574 100644
--- a/src/forge/clients/vllm.py
+++ b/src/forge/clients/vllm.py
@@ -165,8 +165,16 @@ async def send(
         messages: list[dict[str, str]],
         tools: list[ToolSpec] | None = None,
         sampling: dict[str, Any] | None = None,
+        passthrough: dict[str, Any] | None = None,
+        inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: list[dict[str, Any]] | None = None,
     ) -> LLMResponse:
-        """Send messages via /v1/chat/completions and parse the response."""
+        """Send messages via /v1/chat/completions and parse the response.
+
+        ``passthrough`` / ``inbound_anthropic_body`` / ``raw_openai_tools`` are
+        accepted for protocol symmetry and ignored — vLLM parses tools and
+        reasoning server-side and is native-only.
+        """
         body: dict[str, Any] = {
             "model": self.model_path,
             "messages": messages,
@@ -213,8 +221,15 @@ async def send_stream(
         messages: list[dict[str, str]],
         tools: list[ToolSpec] | None = None,
         sampling: dict[str, Any] | None = None,
+        passthrough: dict[str, Any] | None = None,
+        inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: list[dict[str, Any]] | None = None,
     ) -> AsyncIterator[StreamChunk]:
-        """Stream via SSE from /v1/chat/completions."""
+        """Stream via SSE from /v1/chat/completions.
+
+        ``passthrough`` / ``inbound_anthropic_body`` / ``raw_openai_tools``
+        accepted for protocol symmetry and ignored (see ``send``).
+        """
         body: dict[str, Any] = {
             "model": self.model_path,
             "messages": messages,
diff --git a/src/forge/core/inference.py b/src/forge/core/inference.py
index f22c528..421eaae 100644
--- a/src/forge/core/inference.py
+++ b/src/forge/core/inference.py
@@ -13,7 +13,14 @@
 from dataclasses import dataclass, field
 from typing import Any
 
-from forge.clients.base import ChunkType, LLMClient, StreamChunk, TokenUsage
+from forge.clients.base import (
+    ChunkType,
+    LLMClient,
+    RawOpenAIMessages,
+    RawOpenAITools,
+    StreamChunk,
+    TokenUsage,
+)
 from forge.context.manager import ContextManager
 from forge.core.messages import Message, MessageMeta, MessageRole, MessageType, ToolCallInfo
 from forge.core.workflow import LLMResponse, TextResponse, ToolCall, ToolSpec
@@ -127,6 +134,8 @@ async def run_inference(
     sampling: dict[str, Any] | None = None,
     passthrough: dict[str, Any] | None = None,
     inbound_anthropic_body: dict[str, Any] | None = None,
+    raw_openai_messages: RawOpenAIMessages | None = None,
+    raw_openai_tools: RawOpenAITools | None = None,
 ) -> InferenceResult | None:
     """Send messages to the LLM with compaction, folding, validation, and retry.
 
@@ -197,8 +206,21 @@ async def run_inference(
         if context_warning:
             verbatim_body = None  # mutation
 
-        # Fold and serialize
-        api_messages = fold_and_serialize(messages, api_format)
+        # Fold and serialize. Proxy callers may supply the client's raw OpenAI
+        # transcript; on the clean first attempt (no compaction, no warning) we
+        # forward it verbatim so the backend sees the client-authored shape
+        # instead of forge's parsed/re-emitted form. Any forge mutation
+        # (compaction / context warning / retry) falls back to folding.
+        use_raw_messages = (
+            raw_openai_messages is not None
+            and _attempt == 0
+            and compacted is messages
+            and not context_warning
+        )
+        if use_raw_messages:
+            api_messages = raw_openai_messages
+        else:
+            api_messages = fold_and_serialize(messages, api_format)
 
         # Inject context warning as transient user message (not persisted
         # in conversation history). Uses "user" role because mid-conversation
@@ -213,16 +235,27 @@ async def run_inference(
                 MessageMeta(MessageType.CONTEXT_WARNING, step_index=step_index),
             ))
 
+        # Forward raw tools only on the clean first attempt — on retries forge
+        # has appended nudge/tool-error messages, so the parsed tool_specs path
+        # (format_tool) is the correct serialization. Pass the kwarg only when
+        # set so non-proxy callers (and their client doubles) keep the original
+        # call signature.
+        raw_tools_kwarg: dict[str, Any] = {}
+        if raw_openai_tools is not None and _attempt == 0:
+            raw_tools_kwarg["raw_openai_tools"] = raw_openai_tools
+
         # Send
         if stream:
             response = await _send_streaming(
                 client, api_messages, tool_specs, on_chunk, sampling, passthrough,
                 inbound_anthropic_body=verbatim_body,
+                **raw_tools_kwarg,
             )
         else:
             response = await client.send(
                 api_messages, tools=tool_specs, sampling=sampling, passthrough=passthrough,
                 inbound_anthropic_body=verbatim_body,
+                **raw_tools_kwarg,
             )
         # Subsequent attempts (retries) are mutations regardless of outcome.
         verbatim_body = None
@@ -329,12 +362,17 @@ async def _send_streaming(
     sampling: dict[str, Any] | None = None,
     passthrough: dict[str, Any] | None = None,
     inbound_anthropic_body: dict[str, Any] | None = None,
+    raw_openai_tools: RawOpenAITools | None = None,
 ) -> LLMResponse:
     """Send via streaming, forwarding chunks to on_chunk callback."""
     response = None
+    raw_tools_kwarg: dict[str, Any] = {}
+    if raw_openai_tools is not None:
+        raw_tools_kwarg["raw_openai_tools"] = raw_openai_tools
     async for chunk in client.send_stream(
         api_messages, tools=tool_specs, sampling=sampling, passthrough=passthrough,
         inbound_anthropic_body=inbound_anthropic_body,
+        **raw_tools_kwarg,
     ):
         if on_chunk is not None:
             await on_chunk(chunk)
diff --git a/src/forge/proxy/__main__.py b/src/forge/proxy/__main__.py
index 4ef0688..16adf19 100644
--- a/src/forge/proxy/__main__.py
+++ b/src/forge/proxy/__main__.py
@@ -45,13 +45,6 @@ def main() -> None:
     )
     parser.add_argument("--budget-tokens", type=int, help="Manual token budget")
     parser.add_argument("--extra-flags", nargs="*", help="Additional backend CLI flags")
-    parser.add_argument(
-        "--mode",
-        choices=["native", "prompt"],
-        default="native",
-        help="Function-calling mode (default: native). Use 'prompt' for "
-             "OpenAI-compatible backends without a function-calling template.",
-    )
     parser.add_argument(
         "--backend-protocol",
         choices=["openai", "anthropic"],
@@ -74,6 +67,12 @@ def main() -> None:
         help="Backend response timeout in seconds (default: 300)",
     )
     parser.add_argument("--no-rescue", action="store_true", help="Disable rescue parsing")
+    parser.add_argument(
+        "--inject-respond-tool",
+        action="store_true",
+        help="Inject forge's synthetic respond() tool when the client sends "
+             "tools (keeps small models in tool-calling mode). Default off.",
+    )
     parser.add_argument("--verbose", "-v", action="store_true", help="Verbose logging")
 
     args = parser.parse_args()
@@ -108,7 +107,7 @@ def main() -> None:
         serialize=serialize,
         max_retries=args.max_retries,
         rescue_enabled=not args.no_rescue,
-        mode=args.mode,
+        inject_respond_tool=args.inject_respond_tool,
         backend_protocol=args.backend_protocol,
         backend_timeout=args.backend_timeout,
     )
diff --git a/src/forge/proxy/handler.py b/src/forge/proxy/handler.py
index 1a6bcba..c95f30a 100644
--- a/src/forge/proxy/handler.py
+++ b/src/forge/proxy/handler.py
@@ -3,9 +3,10 @@
 from __future__ import annotations
 
 import logging
+from copy import deepcopy
 from typing import Any, Literal
 
-from forge.clients.base import LLMClient
+from forge.clients.base import LLMClient, format_tool
 from forge.context.manager import ContextManager
 from forge.core.inference import _get_usage, fold_and_serialize, run_inference
 from forge.core.workflow import ToolCall, ToolSpec, TextResponse
@@ -100,12 +101,27 @@ def _extract_tool_names(tool_specs: list[ToolSpec]) -> list[str]:
     return [s.name for s in tool_specs]
 
 
+def _raw_openai_tools(request_tools: Any) -> list[dict[str, Any]] | None:
+    """Return a detached deep copy of the inbound OpenAI tools array."""
+    if not isinstance(request_tools, list) or not request_tools:
+        return None
+    return [deepcopy(tool) for tool in request_tools if isinstance(tool, dict)]
+
+
+def _raw_openai_messages(request_messages: Any) -> list[dict[str, Any]] | None:
+    """Return a detached deep copy of the inbound OpenAI messages array."""
+    if not isinstance(request_messages, list) or not request_messages:
+        return None
+    return [deepcopy(msg) for msg in request_messages if isinstance(msg, dict)]
+
+
 async def handle_chat_completions(
     body: dict[str, Any],
     client: LLMClient,
     context_manager: ContextManager,
     max_retries: int = 3,
     rescue_enabled: bool = True,
+    inject_respond_tool: bool = False,
     protocol: Literal["openai", "anthropic"] = "openai",
 ) -> dict[str, Any] | list[dict[str, Any]]:
     """Handle an inbound completions request.
@@ -120,6 +136,11 @@ async def handle_chat_completions(
         context_manager: For context compaction.
         max_retries: Max consecutive retries for bad responses.
         rescue_enabled: Whether to attempt rescue parsing.
+        inject_respond_tool: When True and the client request supplies tools,
+            inject forge's synthetic respond() tool so the model stays in
+            tool-calling mode (the call is stripped from the outbound
+            response). Default False — the proxy forwards the client's tools
+            untouched unless explicitly opted in.
         protocol: Inbound wire format. ``openai`` for
             ``/v1/chat/completions``; ``anthropic`` for ``/v1/messages``.
 
@@ -148,18 +169,38 @@ async def handle_chat_completions(
         # ADR-015.
         inbound_anthropic_body = body
     else:
-        messages = openai_to_messages(body.get("messages", []))
-        tool_specs = _extract_tool_specs(body.get("tools"))
+        request_messages = body.get("messages", [])
+        request_tools = body.get("tools")
+        messages = openai_to_messages(request_messages)
+        tool_specs = _extract_tool_specs(request_tools)
         sampling = _extract_sampling(body)
         passthrough = _extract_passthrough(body)
         inbound_anthropic_body = None
+        # Detached verbatim copies of the client's OpenAI tools/messages.
+        # Forwarded to the native backend on the clean first attempt so it
+        # sees the exact schema/transcript the client authored, bypassing the
+        # lossy ToolSpec round-trip. tool_specs stays as forge's validation
+        # sidecar. (Anthropic protocol converts shapes itself → None.)
+        raw_tools_for_backend = _raw_openai_tools(request_tools)
+        raw_messages_for_backend = _raw_openai_messages(request_messages)
+
+    if protocol == "anthropic":
+        raw_tools_for_backend = None
+        raw_messages_for_backend = None
 
-    # Inject respond tool when tools are present.  The model calls
-    # respond(message="...") instead of producing bare text, keeping it
-    # in tool-calling mode where guardrails apply.  The respond call is
+    # Optionally inject the respond tool (default off). When on, the model
+    # calls respond(message="...") instead of producing bare text, keeping it
+    # in tool-calling mode where guardrails apply. The respond call is
     # stripped from the outbound response — the client never sees it.
-    if tool_specs and not any(s.name == RESPOND_TOOL_NAME for s in tool_specs):
-        tool_specs.append(respond_spec())
+    if (
+        inject_respond_tool
+        and tool_specs
+        and not any(s.name == RESPOND_TOOL_NAME for s in tool_specs)
+    ):
+        respond = respond_spec()
+        tool_specs.append(respond)
+        if raw_tools_for_backend is not None:
+            raw_tools_for_backend.append(format_tool(respond))
 
     tool_names = _extract_tool_names(tool_specs)
 
@@ -168,7 +209,7 @@ async def handle_chat_completions(
     if not tool_specs:
         logger.info("No tools in request, passing through to backend")
         api_format = getattr(client, "api_format", "ollama")
-        api_messages = fold_and_serialize(messages, api_format)
+        api_messages = raw_messages_for_backend or fold_and_serialize(messages, api_format)
         response = await client.send(
             api_messages, tools=None, sampling=sampling, passthrough=passthrough,
             inbound_anthropic_body=inbound_anthropic_body,
@@ -193,6 +234,8 @@ async def handle_chat_completions(
             sampling=sampling,
             passthrough=passthrough,
             inbound_anthropic_body=inbound_anthropic_body,
+            raw_openai_messages=raw_messages_for_backend,
+            raw_openai_tools=raw_tools_for_backend,
         )
     except ToolCallError as exc:
         # Retries exhausted — the model kept returning text instead of tool
diff --git a/src/forge/proxy/proxy.py b/src/forge/proxy/proxy.py
index a4d3c8d..67cfd55 100644
--- a/src/forge/proxy/proxy.py
+++ b/src/forge/proxy/proxy.py
@@ -67,7 +67,7 @@ def __init__(
         serialize: bool | None = None,
         max_retries: int = 3,
         rescue_enabled: bool = True,
-        mode: Literal["native", "prompt"] = "native",
+        inject_respond_tool: bool = False,
         backend_protocol: Literal["openai", "anthropic"] = "openai",
         backend_timeout: float = 300.0,
     ) -> None:
@@ -93,11 +93,11 @@ def __init__(
                 managed, False for external).
             max_retries: Max consecutive retries for bad LLM responses.
             rescue_enabled: Attempt rescue parsing of text responses.
-            mode: Function-calling mode for OpenAI-compatible backends —
-                "native" uses the backend's native tools API, "prompt"
-                uses forge's prompt-injection fallback for backends
-                without a function-calling template. Not applicable to vLLM
-                (parses tool calls server-side) or the Anthropic protocol.
+            inject_respond_tool: When True, inject forge's synthetic respond()
+                tool into requests that already carry tools (keeps the model in
+                tool-calling mode). Default False. The proxy is native-only and
+                forwards the client's tools verbatim; prompt-injection mode is a
+                non-proxy WorkflowRunner feature.
             backend_protocol: Wire format of the external backend.
                 ``openai`` (default) for llama.cpp, vLLM, Ollama. ``anthropic``
                 for Anthropic-shape downstreams (the official Anthropic API,
@@ -108,13 +108,6 @@ def __init__(
         """
         if backend_url is None and backend is None:
             raise ValueError("Provide either backend_url (external) or backend (managed)")
-        if backend_protocol == "anthropic" and mode == "prompt":
-            raise ValueError(
-                "mode='prompt' is not supported with backend_protocol='anthropic' — "
-                "Anthropic protocol has native tool calling; the prompt-injection "
-                "fallback only applies to OpenAI-shape backends without a function-"
-                "calling template."
-            )
         if backend_protocol == "anthropic" and backend_url is None:
             raise ValueError(
                 "backend_protocol='anthropic' requires external mode (backend_url=...). "
@@ -125,11 +118,6 @@ def __init__(
                 "backend='vllm' speaks the OpenAI protocol; backend_protocol='anthropic' "
                 "is not applicable."
             )
-        if backend == "vllm" and mode == "prompt":
-            raise ValueError(
-                "backend='vllm' parses tool calls server-side (native only); "
-                "mode='prompt' is not applicable."
-            )
         if not math.isfinite(backend_timeout) or backend_timeout <= 0:
             raise ValueError("backend_timeout must be a finite value greater than 0")
         # Managed mode: each backend requires its own identity field. Fail
@@ -155,7 +143,7 @@ def __init__(
         self._port = port
         self._max_retries = max_retries
         self._rescue_enabled = rescue_enabled
-        self._mode = mode
+        self._inject_respond_tool = inject_respond_tool
         self._backend_protocol = backend_protocol
         self._backend_timeout = backend_timeout
 
@@ -236,6 +224,7 @@ async def _async_start(self, ready: threading.Event) -> None:
             serialize_requests=self._serialize,
             max_retries=self._max_retries,
             rescue_enabled=self._rescue_enabled,
+            inject_respond_tool=self._inject_respond_tool,
         )
         await self._http_server.start()
         self._started = True
@@ -310,7 +299,7 @@ async def _setup_external(self) -> tuple[LLMClient, ContextManager]:
             client = LlamafileClient(
                 gguf_path=self._model or "default",
                 base_url=base,
-                mode=self._mode,
+                mode="native",
                 timeout=self._backend_timeout,
             )
 
@@ -336,11 +325,11 @@ async def _setup_managed(self) -> tuple[LLMClient, ContextManager]:
         assert self._backend is not None
         client = self._build_managed_client()
 
-        # The backend process is always launched in native mode (--jinja is
-        # harmless and enables the native tools API where available); prompt
-        # mode is a client-side injection concern carried by the client.
-        # Pass each backend only its own identity field — setup_backend
-        # enforces mutual exclusivity.
+        # The backend process is launched in native mode (--jinja enables the
+        # native tools API). The proxy is native-only — it forwards the
+        # client's tools verbatim and never prompt-injects (prompt mode is a
+        # non-proxy WorkflowRunner feature). Pass each backend only its own
+        # identity field — setup_backend enforces mutual exclusivity.
         server, context_manager = await setup_backend(
             backend=self._backend,
             model=self._model if self._backend == "ollama" else None,
@@ -369,7 +358,7 @@ def _build_managed_client(self) -> LLMClient:
             return LlamafileClient(
                 gguf_path=self._gguf or "default",
                 base_url=base_url,
-                mode=self._mode,
+                mode="native",
                 timeout=self._backend_timeout,
             )
         if self._backend == "vllm":
diff --git a/src/forge/proxy/server.py b/src/forge/proxy/server.py
index 56174da..619af54 100644
--- a/src/forge/proxy/server.py
+++ b/src/forge/proxy/server.py
@@ -49,6 +49,7 @@ def __init__(
         serialize_requests: bool = True,
         max_retries: int = 3,
         rescue_enabled: bool = True,
+        inject_respond_tool: bool = False,
     ) -> None:
         self._client = client
         self._context_manager = context_manager
@@ -56,6 +57,7 @@ def __init__(
         self._port = port
         self._max_retries = max_retries
         self._rescue_enabled = rescue_enabled
+        self._inject_respond_tool = inject_respond_tool
         self._server: asyncio.Server | None = None
         self._serialize = serialize_requests
         self._queue: asyncio.Queue[_QueueItem] = asyncio.Queue()
@@ -306,6 +308,7 @@ async def _run_handler(
                 context_manager=self._context_manager,
                 max_retries=self._max_retries,
                 rescue_enabled=self._rescue_enabled,
+                inject_respond_tool=self._inject_respond_tool,
                 protocol=protocol,
             )
         except Exception as exc:
diff --git a/tests/unit/test_inference_passthrough.py b/tests/unit/test_inference_passthrough.py
new file mode 100644
index 0000000..78aa851
--- /dev/null
+++ b/tests/unit/test_inference_passthrough.py
@@ -0,0 +1,100 @@
+"""Tests for run_inference's raw-OpenAI passthrough first-attempt gate.
+
+The proxy hands run_inference the client's verbatim OpenAI transcript/tools.
+They must be forwarded ONLY on the clean first attempt; any forge mutation
+(retry here) falls back to fold_and_serialize + the parsed tool_specs.
+"""
+
+from unittest.mock import AsyncMock
+
+import pytest
+
+from forge.context.manager import ContextManager
+from forge.context.strategies import NoCompact
+from forge.core.inference import run_inference
+from forge.core.messages import Message, MessageMeta, MessageRole, MessageType
+from forge.core.workflow import TextResponse, ToolCall, ToolSpec
+from forge.guardrails import ErrorTracker, ResponseValidator
+
+
+def _client(*responses):
+    client = AsyncMock()
+    client.api_format = "ollama"
+    client.send = AsyncMock(side_effect=list(responses))
+    client.last_usage = {}
+    client._slot_id = 0
+    return client
+
+
+def _ctx():
+    return ContextManager(strategy=NoCompact(), budget_tokens=8192)
+
+
+def _search_spec():
+    return ToolSpec.from_json_schema(
+        name="search", description="", schema={"type": "object", "properties": {}},
+    )
+
+
+@pytest.mark.asyncio
+async def test_raw_used_on_first_attempt_folded_on_retry():
+    # Attempt 0: text (invalid → retry). Attempt 1: valid tool call.
+    client = _client(
+        TextResponse(content="just narrating, no tool"),
+        [ToolCall(tool="search", args={})],
+    )
+    messages = [Message(
+        MessageRole.USER, "folded-form",
+        MessageMeta(MessageType.USER_INPUT),
+    )]
+    raw_messages = [{"role": "user", "content": "VERBATIM", "name": "u1"}]
+    raw_tools = [{"type": "function", "function": {"name": "search", "parameters": {}}}]
+
+    result = await run_inference(
+        messages=messages,
+        client=client,
+        context_manager=_ctx(),
+        validator=ResponseValidator(["search"], rescue_enabled=True),
+        error_tracker=ErrorTracker(max_retries=2),
+        tool_specs=[_search_spec()],
+        raw_openai_messages=raw_messages,
+        raw_openai_tools=raw_tools,
+    )
+
+    assert result is not None
+    assert client.send.await_count == 2
+
+    # Attempt 0 (clean): forwarded the verbatim raw messages + raw tools.
+    first = client.send.call_args_list[0]
+    assert first.args[0] == raw_messages
+    assert first.kwargs["raw_openai_tools"] == raw_tools
+
+    # Attempt 1 (post-retry mutation): folded messages, no raw tools kwarg.
+    second = client.send.call_args_list[1]
+    assert second.args[0] != raw_messages
+    assert second.args[0][0]["content"] == "folded-form"
+    assert "raw_openai_tools" not in second.kwargs
+
+
+@pytest.mark.asyncio
+async def test_no_raw_falls_back_to_fold():
+    """Without raw_openai_* (the non-proxy runner path), folding is used and
+    no raw_openai_tools kwarg is passed to the client."""
+    client = _client([ToolCall(tool="search", args={})])
+    messages = [Message(
+        MessageRole.USER, "hello",
+        MessageMeta(MessageType.USER_INPUT),
+    )]
+
+    await run_inference(
+        messages=messages,
+        client=client,
+        context_manager=_ctx(),
+        validator=ResponseValidator(["search"], rescue_enabled=True),
+        error_tracker=ErrorTracker(max_retries=1),
+        tool_specs=[_search_spec()],
+    )
+
+    call = client.send.call_args
+    assert call.args[0][0]["content"] == "hello"
+    assert "raw_openai_tools" not in call.kwargs
diff --git a/tests/unit/test_proxy_handler.py b/tests/unit/test_proxy_handler.py
index 1a4203a..d417317 100644
--- a/tests/unit/test_proxy_handler.py
+++ b/tests/unit/test_proxy_handler.py
@@ -148,12 +148,13 @@ async def test_tool_call_stream(self):
 
     @pytest.mark.asyncio
     async def test_respond_tool_auto_injected(self):
-        """Respond tool is injected — model calling respond returns text."""
+        """With inject_respond_tool=True, a respond() call is stripped to text."""
         client = _mock_client([ToolCall(tool="respond", args={"message": "Hi!"})])
         client.last_usage = {0: TokenUsage(prompt_tokens=10, completion_tokens=5, total_tokens=15)}
-        
+
         result = await handle_chat_completions(
             _body(tools=[_tool_def("search")]), client, _context_manager(),
+            inject_respond_tool=True,
         )
         # respond is stripped — client sees text, not a tool call
         assert result["choices"][0]["message"]["content"] == "Hi!"
@@ -183,9 +184,10 @@ async def test_mixed_respond_and_tool_calls(self):
             ToolCall(tool="respond", args={"message": "also this"}),
         ])
         client.last_usage = {0: TokenUsage(prompt_tokens=10, completion_tokens=5, total_tokens=15)}
-        
+
         result = await handle_chat_completions(
             _body(tools=[_tool_def("search")]), client, _context_manager(),
+            inject_respond_tool=True,
         )
         tc = result["choices"][0]["message"]["tool_calls"]
         assert len(tc) == 1
@@ -201,8 +203,9 @@ async def test_respond_not_double_injected(self):
         tools = [_tool_def("search"), _tool_def("respond")]
         result = await handle_chat_completions(
             _body(tools=tools), client, _context_manager(),
+            inject_respond_tool=True,
         )
-        # Should still work — respond stripped to text
+        # Should still work — respond stripped to text (not double-injected)
         assert result["choices"][0]["message"]["content"] == "Hi!"
         assert result["usage"] == {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}
 
@@ -467,3 +470,82 @@ async def test_system_top_level_flows_into_messages(self):
         api_messages = client.send.call_args.args[0]
         assert api_messages[0]["role"] == "system"
         assert api_messages[0]["content"] == "You are helpful."
+
+
+# ── Native transparent passthrough ──────────────────────────
+
+
+class TestNativePassthrough:
+    """The proxy forwards the client's OpenAI tools/messages verbatim on the
+    clean first attempt, bypassing the lossy ToolSpec round-trip."""
+
+    @pytest.mark.asyncio
+    async def test_raw_tools_forwarded_verbatim(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        params = {
+            "type": "object",
+            "properties": {"q": {"type": "string", "description": "the query"}},
+            "required": ["q"],
+            "additionalProperties": False,
+        }
+        tools = [_tool_def("search", parameters=params)]
+        await handle_chat_completions(
+            _body(tools=tools), client, _context_manager(),
+        )
+        # The backend sees the client's exact tools array (full schema, no
+        # name/schema drift), not forge's reconstructed format_tool output.
+        sent = client.send.call_args.kwargs["raw_openai_tools"]
+        assert sent == tools
+        # Respond is NOT appended by default.
+        assert [t["function"]["name"] for t in sent] == ["search"]
+        # tool_specs (validation sidecar) still passed separately.
+        assert client.send.call_args.kwargs["tools"][0].name == "search"
+
+    @pytest.mark.asyncio
+    async def test_raw_messages_forwarded_verbatim(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        # An extra non-standard key proves no normalization/folding happened.
+        messages = [{"role": "user", "content": "hi", "name": "u1"}]
+        await handle_chat_completions(
+            _body(messages=messages, tools=[_tool_def("search")]),
+            client, _context_manager(),
+        )
+        sent_messages = client.send.call_args.args[0]
+        assert sent_messages == messages
+
+    @pytest.mark.asyncio
+    async def test_inbound_body_mutation_does_not_affect_sent(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        tools = [_tool_def("search")]
+        body = _body(tools=tools)
+        await handle_chat_completions(body, client, _context_manager())
+        # Mutate the caller's body after the call — detached copy is unaffected.
+        body["tools"][0]["function"]["name"] = "MUTATED"
+        body["messages"][0]["content"] = "MUTATED"
+        sent_tools = client.send.call_args.kwargs["raw_openai_tools"]
+        sent_messages = client.send.call_args.args[0]
+        assert sent_tools[0]["function"]["name"] == "search"
+        assert sent_messages[0]["content"] == "hi"
+
+    @pytest.mark.asyncio
+    async def test_respond_not_injected_by_default(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        await handle_chat_completions(
+            _body(tools=[_tool_def("search")]), client, _context_manager(),
+        )
+        sent = client.send.call_args.kwargs["raw_openai_tools"]
+        names = [t["function"]["name"] for t in sent]
+        assert "respond" not in names
+        spec_names = [s.name for s in client.send.call_args.kwargs["tools"]]
+        assert "respond" not in spec_names
+
+    @pytest.mark.asyncio
+    async def test_respond_injected_into_raw_tools_when_opted_in(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        await handle_chat_completions(
+            _body(tools=[_tool_def("search")]), client, _context_manager(),
+            inject_respond_tool=True,
+        )
+        sent = client.send.call_args.kwargs["raw_openai_tools"]
+        names = [t["function"]["name"] for t in sent]
+        assert names == ["search", "respond"]
diff --git a/tests/unit/test_proxy_path1.py b/tests/unit/test_proxy_path1.py
index 6a1d591..4cc8862 100644
--- a/tests/unit/test_proxy_path1.py
+++ b/tests/unit/test_proxy_path1.py
@@ -25,14 +25,6 @@
 
 
 class TestProxyServerValidation:
-    def test_anthropic_with_prompt_mode_rejected(self):
-        with pytest.raises(ValueError, match="mode='prompt'"):
-            ProxyServer(
-                backend_url="http://localhost:8080",
-                backend_protocol="anthropic",
-                mode="prompt",
-            )
-
     def test_anthropic_in_managed_mode_rejected(self):
         with pytest.raises(ValueError, match="external mode"):
             ProxyServer(
diff --git a/tests/unit/test_proxy_proxy.py b/tests/unit/test_proxy_proxy.py
index 56f8c34..ca8792f 100644
--- a/tests/unit/test_proxy_proxy.py
+++ b/tests/unit/test_proxy_proxy.py
@@ -21,7 +21,7 @@
 
 
 class TestConstructorValidation:
-    """__init__ validation: mode/protocol guards and managed identity rules."""
+    """__init__ validation: protocol guards and managed identity rules."""
 
     def test_neither_url_nor_backend_rejected(self) -> None:
         with pytest.raises(ValueError, match="Provide either backend_url"):
@@ -31,20 +31,10 @@ def test_anthropic_requires_external(self) -> None:
         with pytest.raises(ValueError, match="requires external mode"):
             ProxyServer(backend="llamaserver", gguf="m.gguf", backend_protocol="anthropic")
 
-    def test_anthropic_rejects_prompt_mode(self) -> None:
-        with pytest.raises(ValueError, match="mode='prompt' is not supported"):
-            ProxyServer(
-                backend_url="http://x", backend_protocol="anthropic", mode="prompt",
-            )
-
     def test_vllm_rejects_anthropic_protocol(self) -> None:
         with pytest.raises(ValueError, match="speaks the OpenAI protocol"):
             ProxyServer(backend_url="http://x:8000", backend="vllm", backend_protocol="anthropic")
 
-    def test_vllm_rejects_prompt_mode(self) -> None:
-        with pytest.raises(ValueError, match="parses tool calls server-side"):
-            ProxyServer(backend="vllm", model_path="/m", mode="prompt")
-
     # Managed identity rules
     def test_managed_ollama_requires_model(self) -> None:
         with pytest.raises(ValueError, match="backend='ollama' requires model"):
@@ -288,9 +278,10 @@ async def test_ollama_wiring(self) -> None:
         assert kwargs["client"] is client
 
     @pytest.mark.asyncio
-    async def test_managed_llamafile_carries_client_mode(self) -> None:
-        # prompt mode is a client-side concern; the server still starts native.
-        proxy = ProxyServer(backend="llamafile", gguf="/m/x.gguf", mode="prompt")
+    async def test_managed_llamafile_client_is_native(self) -> None:
+        # The proxy is native-only: the managed LlamafileClient is built in
+        # native mode and the backend process is launched native too.
+        proxy = ProxyServer(backend="llamafile", gguf="/m/x.gguf")
         mock_ctx = ContextManager.__new__(ContextManager)
         mock_ctx.budget_tokens = 8192
         with patch(
@@ -299,7 +290,7 @@ async def test_managed_llamafile_carries_client_mode(self) -> None:
         ) as mock_setup:
             client, _ = await proxy._setup_managed()
         assert isinstance(client, LlamafileClient)
-        assert client.mode == "prompt"
+        assert client.mode == "native"
         assert mock_setup.await_args.kwargs["mode"] == "native"
 
 

From fa9be50f34784f88f1f0652c454b0c8feefe31e6 Mon Sep 17 00:00:00 2001
From: Antoine Zambelli <antoine.zambelli@gmail.com>
Date: Sun, 31 May 2026 17:08:02 -0500
Subject: [PATCH 2/8] Add prompt-injection as opt-in proxy capability
 (--backend-capability)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The proxy serves tool-call-capable backends natively (verbatim tool/message
passthrough). This adds prompt-injection back as an explicit opt-in for
non-function-calling backends (llama.cpp / llamafile without a tool template).

- New --backend-capability {native,prompt} (default native), declared once at
  construction and frozen — no runtime probing or mid-request mode mutation.
- prompt capability reuses LlamafileClient's existing prompt path (build the
  tool prompt, downgrade tool/assistant-tool_call history to text, parse the
  JSON tool call back into native tool_calls). No client changes.
- Handler suppresses verbatim raw passthrough when in prompt mode so inference
  folds normally and the client injects the tool prompt.
- Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol).
- Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/BACKEND_SETUP.md              |  2 +-
 docs/decisions/012-openai-proxy.md | 31 ++++++++----
 src/forge/proxy/__main__.py        | 12 +++++
 src/forge/proxy/handler.py         | 22 ++++++++-
 src/forge/proxy/proxy.py           | 50 +++++++++++++++----
 src/forge/proxy/server.py          |  3 ++
 tests/unit/test_proxy_handler.py   | 46 +++++++++++++++++
 tests/unit/test_proxy_proxy.py     | 79 ++++++++++++++++++++++++++++++
 8 files changed, 223 insertions(+), 22 deletions(-)
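
Reviewer note: the capability rules from the commit message can be sketched in isolation as below. This is a minimal illustration, not the ProxyServer code from this patch. The class name `CapabilityConfig` and its layout are hypothetical; only the parameter names (`backend`, `backend_protocol`, `backend_capability`, `native_passthrough`) and the rejection rules come from the diff. It assumes that freezing the value at construction can be modelled with a frozen dataclass.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class CapabilityConfig:
    """Hypothetical stand-in for the proxy's constructor arguments."""

    backend: Optional[str] = None  # None models an external backend
    backend_protocol: Literal["openai", "anthropic"] = "openai"
    backend_capability: Literal["native", "prompt"] = "native"

    def __post_init__(self) -> None:
        if self.backend_capability not in ("native", "prompt"):
            raise ValueError(
                f"unknown backend_capability: {self.backend_capability!r}"
            )
        # Prompt-injection is only meaningful for llama.cpp/llamafile.
        if self.backend_capability == "prompt":
            if self.backend_protocol == "anthropic":
                raise ValueError(
                    "backend_capability='prompt' is not supported with the "
                    "anthropic protocol (native tool calling only)."
                )
            if self.backend in ("vllm", "ollama"):
                raise ValueError(
                    "backend_capability='prompt' is only supported for "
                    f"llama.cpp/llamafile backends, not backend={self.backend!r}."
                )

    @property
    def native_passthrough(self) -> bool:
        # Raw OpenAI tools/messages are forwarded only in native capability.
        return self.backend_capability == "native"
```

Because the dataclass is frozen, the capability cannot be reassigned after construction, which mirrors the "declared once and frozen" rule. The handler-facing `native_passthrough` flag is derived from it rather than stored separately.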

diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md
index 0bb0297..024e762 100644
--- a/docs/BACKEND_SETUP.md
+++ b/docs/BACKEND_SETUP.md
@@ -75,7 +75,7 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999
 
 llamafile does **not** support native function calling — forge's `LlamafileClient` falls back to prompt-injected mode automatically (`mode="auto"`), or you can force it with `mode="prompt"`.
 
-> **Proxy note:** prompt-injection mode is a **direct-client / WorkflowRunner** feature. The OpenAI-compatible **proxy is native-only** — it forwards the client's tools verbatim and does not prompt-inject (see ADR-012). Put an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) behind the proxy; a non-FC backend like llamafile will degrade to passing text through.
+> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to a function-calling (FC) capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic), which is the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (serializes the tools into the prompt and parses the model's JSON tool call back out; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup; there is no runtime auto-detect in the proxy. See ADR-012.
 
 Smoke-test:
 
diff --git a/docs/decisions/012-openai-proxy.md b/docs/decisions/012-openai-proxy.md
index 90bc822..8a5d8d0 100644
--- a/docs/decisions/012-openai-proxy.md
+++ b/docs/decisions/012-openai-proxy.md
@@ -104,15 +104,28 @@ The proxy fully buffers each response from the backend before deciding what to d
 4. **Client disconnect handling** -- detect TCP drop, cancel in-flight backend request, release inference lock.
 5. **Testing** -- unit tests for extraction, integration tests with mock backend, smoke test with real llama-server.
 
-### Revision: native-only + transparent passthrough
-
-The proxy is **native-tool-call-only**. It targets backends that speak the
-native OpenAI tools API (llama.cpp with a tool-calling chat template / `--jinja`,
-vLLM, Ollama, Anthropic). There is no `--mode` flag and no prompt-injection
-fallback in the proxy — prompt-injection mode (`build_tool_prompt`,
-`_downgrade_messages`, the `mode="auto"` HTTP-error fallback) is a non-proxy
-**WorkflowRunner / direct-client** feature only, retained because it still wins
-for some models in full-guardrail workflow evals.
+### Revision: native-first, with opt-in prompt capability
+
+The proxy is **native-first**. By default (`--backend-capability native`) it
+targets backends that speak the native OpenAI tools API (llama.cpp with a
+tool-calling chat template / `--jinja`, vLLM, Ollama, Anthropic) and forwards
+the client's request verbatim (below).
+
+Prompt-injection is available as an **explicit opt-in**
+(`--backend-capability prompt`, llama.cpp/llamafile only) for non-FC backends —
+it reuses the WorkflowRunner's prompt path (`build_tool_prompt`,
+`_downgrade_messages`, `extract_tool_call`) so there is **one** prompt
+implementation, not a proxy-specific fork. The capability is **declared once at
+construction and frozen** — there is deliberately **no `mode="auto"` runtime
+probe** (the old auto/HTTP-error fallback that mutated state mid-request was the
+root of the original tangle; it is not reintroduced). In prompt capability the
+verbatim passthrough is suppressed (`native_passthrough=False`): tools are
+serialized into the prompt, so a raw native transcript would be meaningless.
+
+History: this revision originally cut prompt mode from the proxy entirely
+("native-only"). Prompt was then re-added as the opt-in capability above —
+native-first is a cleaner story than a backwards-incompatible drop, and non-FC
+backends (e.g. llamafile) stay usable through the proxy.
 
 Rationale: the proxy is a transparent layer for an external agent that already
 speaks native FC to a native-FC backend. A traced capture showed the native
diff --git a/src/forge/proxy/__main__.py b/src/forge/proxy/__main__.py
index 16adf19..45b7424 100644
--- a/src/forge/proxy/__main__.py
+++ b/src/forge/proxy/__main__.py
@@ -67,6 +67,17 @@ def main() -> None:
         help="Backend response timeout in seconds (default: 300)",
     )
     parser.add_argument("--no-rescue", action="store_true", help="Disable rescue parsing")
+    parser.add_argument(
+        "--backend-capability",
+        choices=["native", "prompt"],
+        default="native",
+        help="Tool-calling protocol for the backend (default: native). "
+             "'native' forwards the client's tools verbatim to a "
+             "function-calling-capable backend. 'prompt' opts into "
+             "prompt-injection for non-FC llama.cpp/llamafile backends "
+             "(strips tools into the prompt, parses the JSON call back). "
+             "Frozen at startup — never probed or switched mid-stream.",
+    )
     parser.add_argument(
         "--inject-respond-tool",
         action="store_true",
@@ -107,6 +118,7 @@ def main() -> None:
         serialize=serialize,
         max_retries=args.max_retries,
         rescue_enabled=not args.no_rescue,
+        backend_capability=args.backend_capability,
         inject_respond_tool=args.inject_respond_tool,
         backend_protocol=args.backend_protocol,
         backend_timeout=args.backend_timeout,
diff --git a/src/forge/proxy/handler.py b/src/forge/proxy/handler.py
index c95f30a..b0ecffc 100644
--- a/src/forge/proxy/handler.py
+++ b/src/forge/proxy/handler.py
@@ -121,6 +121,7 @@ async def handle_chat_completions(
     context_manager: ContextManager,
     max_retries: int = 3,
     rescue_enabled: bool = True,
+    native_passthrough: bool = True,
     inject_respond_tool: bool = False,
     protocol: Literal["openai", "anthropic"] = "openai",
 ) -> dict[str, Any] | list[dict[str, Any]]:
@@ -136,6 +137,13 @@ async def handle_chat_completions(
         context_manager: For context compaction.
         max_retries: Max consecutive retries for bad responses.
         rescue_enabled: Whether to attempt rescue parsing.
+        native_passthrough: When True (default, native capability), forward the
+            client's verbatim OpenAI tools/messages to the backend on the clean
+            first attempt (transparent passthrough). When False (prompt
+            capability), suppress the raw passthrough so the request folds
+            normally and the client's prompt path injects the tool prompt and
+            downgrades tool history — the raw passthrough is meaningless when
+            tools are serialized into the prompt text.
         inject_respond_tool: When True and the client request supplies tools,
             inject forge's synthetic respond() tool so the model stays in
             tool-calling mode (the call is stripped from the outbound
@@ -181,8 +189,18 @@ async def handle_chat_completions(
         # sees the exact schema/transcript the client authored, bypassing the
         # lossy ToolSpec round-trip. tool_specs stays as forge's validation
         # sidecar. (Anthropic protocol converts shapes itself → None.)
-        raw_tools_for_backend = _raw_openai_tools(request_tools)
-        raw_messages_for_backend = _raw_openai_messages(request_messages)
+        #
+        # In prompt capability (native_passthrough=False) we suppress the raw
+        # passthrough: the request folds normally and the client's prompt path
+        # (LlamafileClient._send_prompt) strips the tools into the prompt and
+        # downgrades tool history. A verbatim native transcript is meaningless
+        # once tools are injected as prompt text.
+        if native_passthrough:
+            raw_tools_for_backend = _raw_openai_tools(request_tools)
+            raw_messages_for_backend = _raw_openai_messages(request_messages)
+        else:
+            raw_tools_for_backend = None
+            raw_messages_for_backend = None
 
     if protocol == "anthropic":
         raw_tools_for_backend = None
diff --git a/src/forge/proxy/proxy.py b/src/forge/proxy/proxy.py
index 67cfd55..bc8b577 100644
--- a/src/forge/proxy/proxy.py
+++ b/src/forge/proxy/proxy.py
@@ -67,6 +67,7 @@ def __init__(
         serialize: bool | None = None,
         max_retries: int = 3,
         rescue_enabled: bool = True,
+        backend_capability: Literal["native", "prompt"] = "native",
         inject_respond_tool: bool = False,
         backend_protocol: Literal["openai", "anthropic"] = "openai",
         backend_timeout: float = 300.0,
@@ -93,11 +94,20 @@ def __init__(
                 managed, False for external).
             max_retries: Max consecutive retries for bad LLM responses.
             rescue_enabled: Attempt rescue parsing of text responses.
+            backend_capability: Tool-calling protocol for the backend.
+                ``native`` (default) forwards the client's OpenAI tools/messages
+                verbatim to a function-calling-capable backend (transparent
+                passthrough). ``prompt`` opts into prompt-injection for a non-FC
+                llama.cpp/llamafile backend — tools are stripped into the prompt
+                and the JSON tool call is parsed back out (the same path the
+                WorkflowRunner uses). Only valid for llama.cpp/llamafile
+                backends; rejected for vllm/ollama and the anthropic protocol.
+                Selected once at construction and frozen — never probed or
+                switched mid-stream.
             inject_respond_tool: When True, inject forge's synthetic respond()
                 tool into requests that already carry tools (keeps the model in
-                tool-calling mode). Default False. The proxy is native-only and
-                forwards the client's tools verbatim; prompt-injection mode is a
-                non-proxy WorkflowRunner feature.
+                tool-calling mode). Default False. Orthogonal to
+                backend_capability — works in both native and prompt modes.
             backend_protocol: Wire format of the external backend.
                 ``openai`` (default) for llama.cpp, vLLM, Ollama. ``anthropic``
                 for Anthropic-shape downstreams (the official Anthropic API,
@@ -118,6 +128,22 @@ def __init__(
                 "backend='vllm' speaks the OpenAI protocol; backend_protocol='anthropic' "
                 "is not applicable."
             )
+        # Prompt-injection is a llama.cpp/llamafile capability only. vLLM and
+        # Ollama clients are native-only (they accept-and-ignore raw tools and have
+        # no prompt path); the anthropic protocol does its own tool conversion.
+        # backend=None (external) defaults to the llama.cpp adapter, which
+        # supports prompt — so only vllm/ollama and anthropic are rejected.
+        if backend_capability == "prompt":
+            if backend_protocol == "anthropic":
+                raise ValueError(
+                    "backend_capability='prompt' is not supported with the "
+                    "anthropic protocol (native tool calling only)."
+                )
+            if backend in ("vllm", "ollama"):
+                raise ValueError(
+                    f"backend_capability='prompt' is only supported for "
+                    f"llama.cpp/llamafile backends, not backend={backend!r}."
+                )
         if not math.isfinite(backend_timeout) or backend_timeout <= 0:
             raise ValueError("backend_timeout must be a finite value greater than 0")
         # Managed mode: each backend requires its own identity field. Fail
@@ -143,6 +169,7 @@ def __init__(
         self._port = port
         self._max_retries = max_retries
         self._rescue_enabled = rescue_enabled
+        self._backend_capability = backend_capability
         self._inject_respond_tool = inject_respond_tool
         self._backend_protocol = backend_protocol
         self._backend_timeout = backend_timeout
@@ -224,6 +251,7 @@ async def _async_start(self, ready: threading.Event) -> None:
             serialize_requests=self._serialize,
             max_retries=self._max_retries,
             rescue_enabled=self._rescue_enabled,
+            native_passthrough=self._backend_capability == "native",
             inject_respond_tool=self._inject_respond_tool,
         )
         await self._http_server.start()
@@ -299,7 +327,7 @@ async def _setup_external(self) -> tuple[LLMClient, ContextManager]:
             client = LlamafileClient(
                 gguf_path=self._model or "default",
                 base_url=base,
-                mode="native",
+                mode=self._backend_capability,
                 timeout=self._backend_timeout,
             )
 
@@ -325,11 +353,13 @@ async def _setup_managed(self) -> tuple[LLMClient, ContextManager]:
         assert self._backend is not None
         client = self._build_managed_client()
 
-        # The backend process is launched in native mode (--jinja enables the
-        # native tools API). The proxy is native-only — it forwards the
-        # client's tools verbatim and never prompt-injects (prompt mode is a
-        # non-proxy WorkflowRunner feature). Pass each backend only its own
-        # identity field — setup_backend enforces mutual exclusivity.
+        # The backend process is always launched in native mode (--jinja enables
+        # the native tools API). This is independent of backend_capability: in
+        # prompt capability the proxy simply doesn't send native tools, so a
+        # native-launched backend (jinja template present but unused) serves the
+        # prompt-injected request fine. Keeping launch native avoids changing
+        # backend startup flags for the opt-in path. Pass each backend only its
+        # own identity field — setup_backend enforces mutual exclusivity.
         server, context_manager = await setup_backend(
             backend=self._backend,
             model=self._model if self._backend == "ollama" else None,
@@ -358,7 +388,7 @@ def _build_managed_client(self) -> LLMClient:
             return LlamafileClient(
                 gguf_path=self._gguf or "default",
                 base_url=base_url,
-                mode="native",
+                mode=self._backend_capability,
                 timeout=self._backend_timeout,
             )
         if self._backend == "vllm":
diff --git a/src/forge/proxy/server.py b/src/forge/proxy/server.py
index 619af54..3ca3149 100644
--- a/src/forge/proxy/server.py
+++ b/src/forge/proxy/server.py
@@ -49,6 +49,7 @@ def __init__(
         serialize_requests: bool = True,
         max_retries: int = 3,
         rescue_enabled: bool = True,
+        native_passthrough: bool = True,
         inject_respond_tool: bool = False,
     ) -> None:
         self._client = client
@@ -57,6 +58,7 @@ def __init__(
         self._port = port
         self._max_retries = max_retries
         self._rescue_enabled = rescue_enabled
+        self._native_passthrough = native_passthrough
         self._inject_respond_tool = inject_respond_tool
         self._server: asyncio.Server | None = None
         self._serialize = serialize_requests
@@ -308,6 +310,7 @@ async def _run_handler(
                 context_manager=self._context_manager,
                 max_retries=self._max_retries,
                 rescue_enabled=self._rescue_enabled,
+                native_passthrough=self._native_passthrough,
                 inject_respond_tool=self._inject_respond_tool,
                 protocol=protocol,
             )
diff --git a/tests/unit/test_proxy_handler.py b/tests/unit/test_proxy_handler.py
index d417317..e106667 100644
--- a/tests/unit/test_proxy_handler.py
+++ b/tests/unit/test_proxy_handler.py
@@ -549,3 +549,49 @@ async def test_respond_injected_into_raw_tools_when_opted_in(self):
         sent = client.send.call_args.kwargs["raw_openai_tools"]
         names = [t["function"]["name"] for t in sent]
         assert names == ["search", "respond"]
+
+
+# ── Prompt capability handoff ───────────────────────────────
+
+
+class TestPromptCapabilityHandoff:
+    """In prompt capability (native_passthrough=False) the handler suppresses
+    the verbatim passthrough so the request folds normally and the client's
+    prompt path injects the tools. (The injection itself is covered by the
+    LlamafileClient prompt-mode tests.)"""
+
+    @pytest.mark.asyncio
+    async def test_prompt_mode_suppresses_raw_tools(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        await handle_chat_completions(
+            _body(tools=[_tool_def("search")]), client, _context_manager(),
+            native_passthrough=False,
+        )
+        # No verbatim tools forwarded — the client's prompt path injects them.
+        assert "raw_openai_tools" not in client.send.call_args.kwargs
+        # tool_specs (the source for build_tool_prompt) are still passed.
+        assert client.send.call_args.kwargs["tools"][0].name == "search"
+
+    @pytest.mark.asyncio
+    async def test_prompt_mode_folds_messages_not_verbatim(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        # An extra key (here ``name``) would survive verbatim passthrough but is
+        # dropped by fold_and_serialize: the raw transcript was NOT forwarded.
+        messages = [{"role": "user", "content": "hi", "name": "u1"}]
+        await handle_chat_completions(
+            _body(messages=messages, tools=[_tool_def("search")]),
+            client, _context_manager(), native_passthrough=False,
+        )
+        sent_messages = client.send.call_args.args[0]
+        assert sent_messages != messages
+        assert "name" not in sent_messages[0]
+
+    @pytest.mark.asyncio
+    async def test_native_default_still_forwards_raw(self):
+        # Sanity: default (native) path is unaffected by the new param.
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        tools = [_tool_def("search")]
+        await handle_chat_completions(
+            _body(tools=tools), client, _context_manager(),
+        )
+        assert client.send.call_args.kwargs["raw_openai_tools"] == tools
diff --git a/tests/unit/test_proxy_proxy.py b/tests/unit/test_proxy_proxy.py
index ca8792f..181c2d1 100644
--- a/tests/unit/test_proxy_proxy.py
+++ b/tests/unit/test_proxy_proxy.py
@@ -294,6 +294,85 @@ async def test_managed_llamafile_client_is_native(self) -> None:
         assert mock_setup.await_args.kwargs["mode"] == "native"
 
 
+class TestBackendCapability:
+    """backend_capability selects the tool-calling protocol, declared once at
+    construction and frozen. native (default) = verbatim passthrough; prompt =
+    opt-in prompt-injection for non-FC llama.cpp/llamafile backends."""
+
+    def test_default_is_native(self) -> None:
+        assert ProxyServer(backend_url="http://x:8080")._backend_capability == "native"
+
+    def test_prompt_stored(self) -> None:
+        proxy = ProxyServer(backend_url="http://x:8080", backend_capability="prompt")
+        assert proxy._backend_capability == "prompt"
+
+    # Guards: prompt is a llama.cpp/llamafile capability only.
+    def test_prompt_rejects_vllm(self) -> None:
+        with pytest.raises(ValueError, match="only supported for"):
+            ProxyServer(backend_url="http://x:8000", backend="vllm", backend_capability="prompt")
+
+    def test_prompt_rejects_ollama(self) -> None:
+        with pytest.raises(ValueError, match="only supported for"):
+            ProxyServer(backend="ollama", model="m", backend_capability="prompt")
+
+    def test_prompt_rejects_anthropic_protocol(self) -> None:
+        with pytest.raises(ValueError, match="not supported with the anthropic"):
+            ProxyServer(
+                backend_url="http://x:8080",
+                backend_protocol="anthropic",
+                backend_capability="prompt",
+            )
+
+    def test_prompt_allowed_for_external_llamacpp(self) -> None:
+        # backend=None (external) defaults to the llama.cpp adapter → prompt ok.
+        ProxyServer(backend_url="http://x:8080", backend_capability="prompt")
+        ProxyServer(backend="llamafile", gguf="m.gguf", backend_capability="prompt")
+
+    @pytest.mark.asyncio
+    async def test_external_default_builds_native_client(self) -> None:
+        proxy = ProxyServer(backend_url="http://localhost:8080", budget_tokens=8192)
+        client, _ = await proxy._setup_external()
+        assert isinstance(client, LlamafileClient)
+        assert client.mode == "native"
+
+    @pytest.mark.asyncio
+    async def test_external_prompt_builds_prompt_client(self) -> None:
+        proxy = ProxyServer(
+            backend_url="http://localhost:8080",
+            backend_capability="prompt",
+            budget_tokens=8192,
+        )
+        client, _ = await proxy._setup_external()
+        assert isinstance(client, LlamafileClient)
+        assert client.mode == "prompt"
+
+    @pytest.mark.asyncio
+    async def test_managed_prompt_client_is_prompt_but_launch_native(self) -> None:
+        # The managed LlamafileClient runs in prompt mode, but the backend
+        # process is still launched native (--jinja present, just unused).
+        proxy = ProxyServer(
+            backend="llamafile", gguf="/m/x.gguf", backend_capability="prompt",
+        )
+        mock_ctx = ContextManager.__new__(ContextManager)
+        mock_ctx.budget_tokens = 8192
+        with patch(
+            "forge.proxy.proxy.setup_backend",
+            new_callable=AsyncMock, return_value=(MagicMock(), mock_ctx),
+        ) as mock_setup:
+            client, _ = await proxy._setup_managed()
+        assert isinstance(client, LlamafileClient)
+        assert client.mode == "prompt"
+        assert mock_setup.await_args.kwargs["mode"] == "native"
+
+    def test_native_passthrough_forwarded_to_http_server(self) -> None:
+        # Checks the stored capability that _async_start maps to native_passthrough.
+        assert (ProxyServer(backend_url="http://x")._backend_capability == "native")
+        assert (
+            ProxyServer(backend_url="http://x", backend_capability="prompt")
+            ._backend_capability == "prompt"
+        )
+
+
 class TestLifecycle:
     """start()/stop() thread + state management."""
 

From de936109d71a46ad275798318ec25264a17a6f3a Mon Sep 17 00:00:00 2001
From: Antoine Zambelli <antoine.zambelli@gmail.com>
Date: Sun, 31 May 2026 17:59:24 -0500
Subject: [PATCH 3/8] Proxy: log effective backend_timeout at startup

The configurable backend_timeout (#91) was validated, stored, and threaded
into every client request, but never surfaced at launch. Extend the
"Proxy ready" line to report the effective value so the operative timeout
is visible/diagnosable from the startup log.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 src/forge/proxy/proxy.py | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/forge/proxy/proxy.py b/src/forge/proxy/proxy.py
index bc8b577..c30cfca 100644
--- a/src/forge/proxy/proxy.py
+++ b/src/forge/proxy/proxy.py
@@ -211,7 +211,11 @@ def start(self) -> None:
         if not self._started:
             raise RuntimeError("Proxy failed to start")
 
-        logger.info("Proxy ready at %s", self.url)
+        logger.info(
+            "Proxy ready at %s (backend_timeout=%.1fs)",
+            self.url,
+            self._backend_timeout,
+        )
 
     def stop(self) -> None:
         """Stop the proxy (and managed backend if applicable)."""

From 49645da613fdb6d4ea7f339f1b554f736bd71608 Mon Sep 17 00:00:00 2001
From: Antoine Zambelli <antoine.zambelli@gmail.com>
Date: Sun, 31 May 2026 18:05:19 -0500
Subject: [PATCH 4/8] vLLM: single source of truth for model identity (#75)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

VLLMClient kept two identity fields with distinct roles — model_path (the
verbatim wire "model" field, which vLLM validates against its
--served-model-name) and model (the derived registry-lookup key). The proxy's
external-mode served-name adoption set both by hand (model_path = served;
model = served), bypassing the constructor's derivation and storing the full
served name where the constructor's rule stores the stem.

Extract the path->key derivation into _derive_model_field and wrap both
assignments in _set_model_identity, then call it from __init__ and from the
proxy. External adoption now upholds the same (model_path, model) invariant as
construction: an HF-repo-id served name reaches the wire verbatim while the
registry key is the derived stem.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
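The identity rule described in the message above can be sketched in isolation. This is a minimal illustration using only the standard library: `derive_registry_key` and `ModelIdentity` are hypothetical stand-ins for `VLLMClient._derive_model_field` and `VLLMClient._set_model_identity`, mirroring the logic in the diff below rather than reproducing the forge classes.

```python
from __future__ import annotations

from pathlib import Path


def derive_registry_key(model_path: str) -> str:
    """Hypothetical stand-in for VLLMClient._derive_model_field."""
    path_obj = Path(model_path)
    # Filesystem path (absolute, or relative and present on disk): use the
    # final path component, i.e. the model directory name.
    if path_obj.is_absolute() or path_obj.exists():
        return path_obj.name
    # HF repo id such as "org/name": use the trailing segment.
    if "/" in model_path:
        return model_path.split("/")[-1]
    # Anything else is already a bare name.
    return model_path


class ModelIdentity:
    """Hypothetical holder for the (model_path, model) pair."""

    def __init__(self, model_path: str | Path) -> None:
        self.set_identity(model_path)

    def set_identity(self, model_path: str | Path) -> None:
        # Both fields are always written together, so the wire name and the
        # registry key cannot drift apart after a served-name discovery.
        self.model_path = str(model_path)
        self.model = derive_registry_key(self.model_path)
```

Calling `set_identity("google/gemma-4-26B-A4B-it")` on an instance built from a placeholder keeps the full repo id in `model_path` and stores only the trailing segment in `model`, which is the behaviour the new unit test in this patch asserts.
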
 src/forge/clients/vllm.py      | 52 +++++++++++++++++++++++-----------
 src/forge/proxy/proxy.py       |  3 +-
 tests/unit/test_proxy_proxy.py | 16 +++++++++++
 3 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/src/forge/clients/vllm.py b/src/forge/clients/vllm.py
index ec0b574..5bdcb96 100644
--- a/src/forge/clients/vllm.py
+++ b/src/forge/clients/vllm.py
@@ -60,23 +60,12 @@ def __init__(
         recommended_sampling: bool = False,
     ) -> None:
         self.base_url = base_url
-        # model_path is the canonical identity. vLLM accepts either a local
-        # directory containing safetensors + config or a HuggingFace repo id
-        # (e.g. "google/gemma-4-26B-A4B-it"). We pass it through as-is in
-        # the wire-format "model" field and as the sampling-defaults lookup
-        # key (using the path stem for directory paths so registry lookups
-        # match the existing GGUF-stem convention).
-        self.model_path = str(model_path)
-        path_obj = Path(self.model_path)
-        # If model_path is a filesystem path, use the directory name as the
-        # registry lookup key. If it's an HF repo id (no leading slash, has
-        # a "/"), use the trailing segment. Otherwise the full string.
-        if path_obj.is_absolute() or path_obj.exists():
-            self.model = path_obj.name
-        elif "/" in self.model_path:
-            self.model = self.model_path.split("/")[-1]
-        else:
-            self.model = self.model_path
+        # model_path is the canonical identity, sent verbatim in the wire
+        # "model" field. self.model is the derived registry-lookup key. Both
+        # are set together so the (model_path, model) invariant holds — see
+        # _set_model_identity. Must run before apply_sampling_defaults below,
+        # which reads self.model.
+        self._set_model_identity(model_path)
 
         # Apply per-model recommended sampling defaults. Caller's explicit
         # (non-None) kwargs win over the map field-by-field.
@@ -103,6 +92,35 @@ async def aclose(self) -> None:
         """Close the underlying httpx connection pool."""
         await self._http.aclose()
 
+    @staticmethod
+    def _derive_model_field(model_path: str) -> str:
+        """Derive the sampling-registry lookup key from the canonical path.
+
+        vLLM accepts either a local directory (safetensors + config) or an HF
+        repo id (e.g. "google/gemma-4-26B-A4B-it"). The lookup key uses the
+        path stem so registry lookups match the existing GGUF-stem convention:
+        a filesystem path → its directory name; an HF repo id (has "/") → its
+        trailing segment; anything else → the string unchanged.
+        """
+        path_obj = Path(model_path)
+        if path_obj.is_absolute() or path_obj.exists():
+            return path_obj.name
+        if "/" in model_path:
+            return model_path.split("/")[-1]
+        return model_path
+
+    def _set_model_identity(self, model_path: str | Path) -> None:
+        """Set both identity fields atomically from one canonical path.
+
+        ``model_path`` is the wire "model" field (sent verbatim); ``model`` is
+        the derived registry key. Used by ``__init__`` and by the proxy's
+        external-mode served-name adoption, so the ``(model_path, model)``
+        invariant holds the same way in both — instead of mutating the two
+        fields separately after served-name discovery.
+        """
+        self.model_path = str(model_path)
+        self.model = self._derive_model_field(self.model_path)
+
     # Sampling fields recognized in per-call overrides. ``seed`` is
     # accepted only as a per-call override (not an instance field).
     # ``chat_template_kwargs`` is a nested dict of Jinja template variables
diff --git a/src/forge/proxy/proxy.py b/src/forge/proxy/proxy.py
index c30cfca..6a471cd 100644
--- a/src/forge/proxy/proxy.py
+++ b/src/forge/proxy/proxy.py
@@ -314,8 +314,7 @@ async def _setup_external(self) -> tuple[LLMClient, ContextManager]:
             served = await client.get_served_model_name()
             if served:
                 logger.info("Discovered vLLM served model name: %s", served)
-                client.model_path = served
-                client.model = served
+                client._set_model_identity(served)
             else:
                 logger.warning(
                     "Could not discover a served model name from %s/models; "
diff --git a/tests/unit/test_proxy_proxy.py b/tests/unit/test_proxy_proxy.py
index 181c2d1..871ac45 100644
--- a/tests/unit/test_proxy_proxy.py
+++ b/tests/unit/test_proxy_proxy.py
@@ -145,6 +145,22 @@ async def test_vllm_adopts_served_model_name(self) -> None:
         assert client.model_path == "my-awq-model"
         assert client.model == "my-awq-model"
 
+    @pytest.mark.asyncio
+    async def test_vllm_served_repo_id_keeps_wire_path_derives_registry_key(self) -> None:
+        # An HF-repo-id served name must reach the wire verbatim (vLLM validates
+        # it), while the registry key is the derived stem — the (model_path,
+        # model) invariant, applied to served-name adoption.
+        proxy = ProxyServer(
+            backend_url="http://localhost:8000", backend="vllm", budget_tokens=8192,
+        )
+        with patch.object(
+            VLLMClient, "get_served_model_name",
+            new_callable=AsyncMock, return_value="google/gemma-4-26B-A4B-it",
+        ):
+            client, _ = await proxy._setup_external()
+        assert client.model_path == "google/gemma-4-26B-A4B-it"
+        assert client.model == "gemma-4-26B-A4B-it"
+
     @pytest.mark.asyncio
     async def test_vllm_keeps_placeholder_when_discovery_fails(self) -> None:
         proxy = ProxyServer(

From f18242cf3c8ff711068d7c3f0597113621c4df64 Mon Sep 17 00:00:00 2001
From: Antoine Zambelli <antoine.zambelli@gmail.com>
Date: Sun, 31 May 2026 18:16:03 -0500
Subject: [PATCH 5/8] Clients: consistent malformed-tool-call + response-shape
 handling

Audited malformed-tool-call and unexpected-payload handling across the
OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and
LlamafileClient. Standardize on one principle, applied uniformly:

- Malformed argument JSON (a model mistake) -> TextResponse, routing the raw
  output back through the inference loop so the rescue/retry path can recover.
- A broken provider envelope (missing choices/message) or unexpected args type
  (a contract violation, not the model's fault) -> BackendError: fail loud and
  consistent, never a stray KeyError/IndexError.

Changes:
- vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on
  malformed args, unlike llamafile's retry-driving TextResponse) with a
  _parse_tool_calls mirroring the reference. Route both send() and send_stream()
  through it so streaming and non-streaming agree: a fully accumulated but
  unparseable arguments string finalizes as a TextResponse, not an exception.
- llamafile / openai_compat: guard the bare data["choices"][0]["message"]
  subscripts: missing or empty choices -> BackendError (matching what vLLM
  already did for choices); a missing message degrades to {} via .get.
  llamafile also hardens function/name access.
- ollama: defensive .get on function/name (both paths); document that Ollama
  emits dict args by contract, so no json.loads is needed there.

Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing-
function/reasoning) + streaming malformed-fragment parity; envelope-guard tests
for llamafile and openai_compat. 1092 unit tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
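The principle stated in the message above (model mistakes become a retry-driving text response, provider contract violations raise) can be sketched as one standalone function. This is an illustration only: `ToolCall`, `TextResponse`, `BackendError`, and `parse_envelope` are simplified hypothetical stand-ins defined here, not the forge types, and reasoning handling is left out.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:  # simplified stand-in for forge's ToolCall
    tool: str
    args: dict[str, Any]


@dataclass
class TextResponse:  # simplified stand-in for forge's TextResponse
    content: str


class BackendError(Exception):  # simplified stand-in for forge's BackendError
    def __init__(self, status: int, message: str) -> None:
        super().__init__(f"{status}: {message}")
        self.status = status


def parse_envelope(data: dict[str, Any]) -> list[ToolCall] | TextResponse:
    """Parse an OpenAI-shape response body (hypothetical helper)."""
    choices = data.get("choices") or []
    if not choices:
        # Broken provider envelope: a contract violation, so fail loud.
        raise BackendError(500, f"response has no choices: {data}")
    message = choices[0].get("message", {})
    content = message.get("content") or ""

    parsed: list[ToolCall] = []
    for tc in message.get("tool_calls") or []:
        fn = tc.get("function", {})
        raw = fn.get("arguments", {})
        if isinstance(raw, str):
            try:
                # An empty arguments string is treated as "no arguments".
                args = json.loads(raw) if raw else {}
            except json.JSONDecodeError:
                # Model mistake: hand the raw output back as text so a
                # rescue/retry loop can recover.
                return TextResponse(content=content or raw)
        elif isinstance(raw, dict):
            args = raw
        else:
            # Neither str nor dict: the provider broke the contract.
            raise BackendError(
                500, f"unexpected tool args shape: {type(raw).__name__}",
            )
        parsed.append(ToolCall(tool=fn.get("name", ""), args=args))

    # With no tool calls at all, fall through to a plain text response.
    return parsed or TextResponse(content=content)
```

Returning early on the first malformed entry discards any tool calls parsed before it, matching the all-or-nothing behaviour of `_parse_tool_calls` in the diff below.
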
 src/forge/clients/llamafile.py          | 10 ++-
 src/forge/clients/ollama.py             | 12 ++-
 src/forge/clients/openai_compat.py      |  5 +-
 src/forge/clients/vllm.py               | 98 +++++++++++++++++--------
 tests/unit/test_llamafile_client.py     | 10 +++
 tests/unit/test_openai_compat_client.py | 10 +++
 tests/unit/test_vllm_client.py          | 76 +++++++++++++++++--
 7 files changed, 175 insertions(+), 46 deletions(-)

diff --git a/src/forge/clients/llamafile.py b/src/forge/clients/llamafile.py
index 7140794..c04bcec 100644
--- a/src/forge/clients/llamafile.py
+++ b/src/forge/clients/llamafile.py
@@ -540,8 +540,10 @@ async def _send_native(
         data = resp.json()
         self._record_usage(data)
 
-        top_choice = data["choices"][0]
-        choice = top_choice["message"]
+        choices = data.get("choices") or []
+        if not choices:
+            raise BackendError(500, f"response has no choices: {data}")
+        choice = choices[0].get("message", {})
         raw_tool_calls = choice.get("tool_calls")
         if raw_tool_calls:
             reasoning = self._resolve_reasoning(
@@ -550,7 +552,7 @@ async def _send_native(
             )
             result_calls: list[ToolCall] = []
             for i, tc_entry in enumerate(raw_tool_calls):
-                tc_func = tc_entry["function"]
+                tc_func = tc_entry.get("function", {})
                 args = tc_func.get("arguments", "{}")
                 if isinstance(args, str):
                     try:
@@ -558,7 +560,7 @@ async def _send_native(
                     except json.JSONDecodeError:
                         return TextResponse(content=choice.get("content", args))
                 result_calls.append(ToolCall(
-                    tool=tc_func["name"],
+                    tool=tc_func.get("name", ""),
                     args=args,
                     reasoning=reasoning if i == 0 else None,
                 ))
diff --git a/src/forge/clients/ollama.py b/src/forge/clients/ollama.py
index 9f93a6c..5a9cce8 100644
--- a/src/forge/clients/ollama.py
+++ b/src/forge/clients/ollama.py
@@ -198,10 +198,14 @@ async def send(
             reasoning = self._resolve_reasoning(
                 msg.get("thinking", ""), msg.get("content", ""),
             )
+            # Ollama returns tool-call arguments already decoded as a dict
+            # (unlike vLLM/llama.cpp, which send a JSON string) — no json.loads
+            # needed. Defensive .get on function/name so a broken tool-call
+            # entry degrades to empty rather than raising KeyError.
             return [
                 ToolCall(
-                    tool=tc["function"]["name"],
-                    args=tc["function"].get("arguments", {}),
+                    tool=tc.get("function", {}).get("name", ""),
+                    args=tc.get("function", {}).get("arguments", {}),
                     reasoning=reasoning if i == 0 else None,
                 )
                 for i, tc in enumerate(tool_calls)
@@ -304,8 +308,8 @@ async def _iter_stream(
                         )
                         final: LLMResponse = [
                             ToolCall(
-                                tool=tc["function"]["name"],
-                                args=tc["function"].get("arguments", {}),
+                                tool=tc.get("function", {}).get("name", ""),
+                                args=tc.get("function", {}).get("arguments", {}),
                                 reasoning=reasoning if i == 0 else None,
                             )
                             for i, tc in enumerate(tool_calls)
diff --git a/src/forge/clients/openai_compat.py b/src/forge/clients/openai_compat.py
index 96ead28..e6b0319 100644
--- a/src/forge/clients/openai_compat.py
+++ b/src/forge/clients/openai_compat.py
@@ -208,7 +208,10 @@ async def send(
         data = resp.json()
         self._record_usage(data)
 
-        msg = data["choices"][0]["message"]
+        choices = data.get("choices") or []
+        if not choices:
+            raise BackendError(500, f"response has no choices: {data}")
+        msg = choices[0].get("message", {})
         tool_calls = msg.get("tool_calls")
         if tool_calls:
             return self._parse_tool_calls(tool_calls, fallback_content=msg.get("content") or "")
diff --git a/src/forge/clients/vllm.py b/src/forge/clients/vllm.py
index 5bdcb96..ff50e12 100644
--- a/src/forge/clients/vllm.py
+++ b/src/forge/clients/vllm.py
@@ -222,15 +222,11 @@ async def send(
 
         tool_calls = message.get("tool_calls") or []
         if tool_calls:
-            reasoning = self._resolve_reasoning(message)
-            return [
-                ToolCall(
-                    tool=tc["function"]["name"],
-                    args=self._parse_tool_args(tc["function"].get("arguments", {})),
-                    reasoning=reasoning if i == 0 else None,
-                )
-                for i, tc in enumerate(tool_calls)
-            ]
+            return self._parse_tool_calls(
+                tool_calls,
+                reasoning=self._resolve_reasoning(message),
+                fallback_content=message.get("content") or "",
+            )
 
         return TextResponse(content=message.get("content") or "")
 
@@ -318,37 +314,75 @@ async def send_stream(
                         type=ChunkType.TEXT_DELTA, content=content,
                     )
 
-        # Build the final response
+        # Build the final response. Reassemble the accumulated deltas into the
+        # OpenAI tool-call shape and route through the same parser as send(), so
+        # streaming and non-streaming agree on malformed-args handling: a fully
+        # accumulated but unparseable arguments string yields a retry-driving
+        # TextResponse, not an exception.
         if tool_call_parts:
-            reasoning = self._resolve_reasoning(
-                accumulated_reasoning, accumulated_content,
-            )
-            final: LLMResponse = [
-                ToolCall(
-                    tool=part["name"],
-                    args=self._parse_tool_args(part["args"]),
-                    reasoning=reasoning if i == 0 else None,
-                )
-                for i, part in enumerate(
-                    tool_call_parts[k] for k in sorted(tool_call_parts)
-                )
+            reassembled = [
+                {"function": {"name": part["name"], "arguments": part["args"]}}
+                for part in (tool_call_parts[k] for k in sorted(tool_call_parts))
             ]
+            final: LLMResponse = self._parse_tool_calls(
+                reassembled,
+                reasoning=self._resolve_reasoning(
+                    accumulated_reasoning, accumulated_content,
+                ),
+                fallback_content=accumulated_content,
+            )
         else:
             final = TextResponse(content=accumulated_content)
         yield StreamChunk(type=ChunkType.FINAL, response=final)
 
     @staticmethod
-    def _parse_tool_args(raw: Any) -> dict[str, Any]:
-        """Tool args from vLLM arrive as JSON-encoded string in the
-        OpenAI native format. Decode to dict.
+    def _parse_tool_calls(
+        tool_calls: list[dict[str, Any]],
+        reasoning: str | None,
+        fallback_content: str,
+    ) -> LLMResponse:
+        """Parse vLLM ``tool_calls`` into ``ToolCall`` objects (or TextResponse).
+
+        Mirrors ``OpenAICompatClient`` / ``LlamafileClient`` so every
+        OpenAI-shape client behaves the same. Tool-call ``arguments`` arrive as
+        a JSON string (vLLM's native format) or an already-decoded dict.
+        Failures are split by who is at fault:
+
+        - **Malformed argument JSON** is NOT coerced into empty args (that would
+          let a model silently proceed with wrong arguments). We return a
+          ``TextResponse``, routing the raw output back through the inference
+          loop so the rescue/retry path can recover — matching llamafile.
+        - An **unexpected args type** (neither str nor dict) is a provider
+          contract violation, not a model mistake → ``BackendError``.
+
+        Defensive ``.get`` on ``function`` / ``name`` keeps a broken tool-call
+        entry from raising ``KeyError``. Used by both send() and send_stream()
+        for parity (the stream path reassembles deltas into this shape first).
         """
-        if isinstance(raw, dict):
-            return raw
-        if isinstance(raw, str):
-            if not raw:
-                return {}
-            return json.loads(raw)
-        raise BackendError(500, f"unexpected tool args shape: {type(raw).__name__}")
+        parsed: list[ToolCall] = []
+        for i, tc in enumerate(tool_calls):
+            fn = tc.get("function", {})
+            raw_args = fn.get("arguments", {})
+            if isinstance(raw_args, str):
+                if not raw_args:
+                    args: dict[str, Any] = {}
+                else:
+                    try:
+                        args = json.loads(raw_args)
+                    except json.JSONDecodeError:
+                        return TextResponse(content=fallback_content or raw_args)
+            elif isinstance(raw_args, dict):
+                args = raw_args
+            else:
+                raise BackendError(
+                    500, f"unexpected tool args shape: {type(raw_args).__name__}",
+                )
+            parsed.append(ToolCall(
+                tool=fn.get("name", ""),
+                args=args,
+                reasoning=reasoning if i == 0 else None,
+            ))
+        return parsed
 
     async def get_context_length(self) -> int | None:
         """Query the vLLM /v1/models endpoint for max_model_len.
diff --git a/tests/unit/test_llamafile_client.py b/tests/unit/test_llamafile_client.py
index 68abe34..3785ee7 100644
--- a/tests/unit/test_llamafile_client.py
+++ b/tests/unit/test_llamafile_client.py
@@ -9,6 +9,7 @@
 
 from forge.clients.llamafile import LlamafileClient, _extract_think_tags, _merge_consecutive
 from forge.core.workflow import TextResponse, ToolCall, ToolSpec
+from forge.errors import BackendError
 from pydantic import BaseModel, Field
 from forge.clients.base import ChunkType
 
@@ -116,6 +117,15 @@ async def test_returns_text_response(self) -> None:
         assert isinstance(result, TextResponse)
         assert result.content == "I need more info"
 
+    @pytest.mark.asyncio
+    async def test_missing_choices_raises_backend_error(self) -> None:
+        # Broken provider envelope (200, no choices) → fail loud and consistent
+        # rather than KeyError/IndexError on data["choices"][0].
+        client = _make_client("native")
+        client._http.post.return_value = _mock_response({"object": "error"})
+        with pytest.raises(BackendError, match="response has no choices"):
+            await client.send([{"role": "user", "content": "test"}])
+
     @pytest.mark.asyncio
     async def test_arguments_parsed_from_string(self) -> None:
         """OpenAI format sends arguments as JSON string, not dict."""
diff --git a/tests/unit/test_openai_compat_client.py b/tests/unit/test_openai_compat_client.py
index 9aaff77..55fddd0 100644
--- a/tests/unit/test_openai_compat_client.py
+++ b/tests/unit/test_openai_compat_client.py
@@ -97,6 +97,16 @@ async def test_returns_text_response(self) -> None:
         assert isinstance(result, TextResponse)
         assert result.content == "I need more info"
 
+    @pytest.mark.asyncio
+    async def test_missing_choices_raises_backend_error(self) -> None:
+        # A broken provider envelope (200 with no choices) is a contract
+        # violation, not a model mistake — fail loud and consistent rather
+        # than KeyError/IndexError on data["choices"][0].
+        client = _make_client()
+        client._http.post.return_value = _mock_response({"object": "error"})
+        with pytest.raises(BackendError, match="response has no choices"):
+            await client.send([{"role": "user", "content": "test"}])
+
     @pytest.mark.asyncio
     async def test_null_content_returns_empty_text(self) -> None:
         client = _make_client()
diff --git a/tests/unit/test_vllm_client.py b/tests/unit/test_vllm_client.py
index 0576040..0de394b 100644
--- a/tests/unit/test_vllm_client.py
+++ b/tests/unit/test_vllm_client.py
@@ -326,6 +326,31 @@ async def test_yields_tool_call_delta_then_final(self) -> None:
         assert result[0].tool == "get_weather"
         assert result[0].args == {"city": "Paris"}
 
+    @pytest.mark.asyncio
+    async def test_malformed_accumulated_args_finalize_as_text_response(self) -> None:
+        # Streaming/non-streaming parity: once all fragments are accumulated,
+        # an unparseable arguments string must yield a retry-driving
+        # TextResponse (same as send()), not raise out of the stream.
+        client = _make_client()
+        client._http.stream.return_value = _MockStreamResponse([
+            _sse({"choices": [{"delta": {
+                "tool_calls": [{
+                    "index": 0,
+                    "function": {"name": "get_weather", "arguments": '{"city": '}
+                }],
+            }}]}),
+            _sse({"choices": [{"delta": {"content": "let me try again"}}]}),
+            "data: [DONE]",
+        ])
+        chunks = []
+        async for chunk in client.send_stream(
+            [{"role": "user", "content": "x"}], tools=[_make_spec()],
+        ):
+            chunks.append(chunk)
+        finals = [c for c in chunks if c.type == ChunkType.FINAL]
+        assert len(finals) == 1
+        assert isinstance(finals[0].response, TextResponse)
+
     @pytest.mark.asyncio
     async def test_accumulates_reasoning_across_deltas(self) -> None:
         client = _make_client(think=True)
@@ -473,16 +498,57 @@ async def test_usage_only_chunk_records_usage_and_continues(self) -> None:
         assert usage.completion_tokens == 3
 
 
-class TestParseToolArgs:
+class TestParseToolCalls:
+    @staticmethod
+    def _call(arguments: object) -> object:
+        return VLLMClient._parse_tool_calls(
+            [{"function": {"name": "lookup", "arguments": arguments}}],
+            reasoning=None,
+            fallback_content="raw model text",
+        )
+
+    def test_string_args_decoded(self) -> None:
+        """vLLM's native format — arguments arrive as a JSON string."""
+        assert self._call('{"city": "Paris"}') == [
+            ToolCall(tool="lookup", args={"city": "Paris"}),
+        ]
+
     def test_dict_passed_through(self) -> None:
         """Some downstream wrappers send dict args directly — pass through."""
-        assert VLLMClient._parse_tool_args({"city": "Paris"}) == {"city": "Paris"}
+        assert self._call({"city": "Paris"}) == [
+            ToolCall(tool="lookup", args={"city": "Paris"}),
+        ]
 
     def test_empty_string_returns_empty_dict(self) -> None:
         """No-arg tool calls — empty string args is valid."""
-        assert VLLMClient._parse_tool_args("") == {}
+        assert self._call("") == [ToolCall(tool="lookup", args={})]
+
+    def test_malformed_json_returns_textresponse(self) -> None:
+        """Matches llamafile/openai_compat: malformed args drive a retry via
+        TextResponse, never silent {} or an exception."""
+        assert self._call('{"city": ') == TextResponse(content="raw model text")
 
     def test_unexpected_type_raises(self) -> None:
-        """Unknown shape (list, int, etc.) — fail loud."""
+        """Unknown shape (list, int, etc.) — provider contract violation."""
         with pytest.raises(BackendError, match="unexpected tool args shape"):
-            VLLMClient._parse_tool_args(123)  # type: ignore[arg-type]
+            self._call(123)
+
+    def test_missing_function_is_defensive(self) -> None:
+        """A broken tool-call entry (no "function") must not KeyError."""
+        assert VLLMClient._parse_tool_calls(
+            [{}], reasoning=None, fallback_content="",
+        ) == [ToolCall(tool="", args={})]
+
+    def test_reasoning_attached_to_first_call_only(self) -> None:
+        result = VLLMClient._parse_tool_calls(
+            [
+                {"function": {"name": "a", "arguments": "{}"}},
+                {"function": {"name": "b", "arguments": "{}"}},
+            ],
+            reasoning="because",
+            fallback_content="",
+        )
+        assert result == [
+            ToolCall(tool="a", args={}, reasoning="because"),
+            ToolCall(tool="b", args={}, reasoning=None),
+        ]

From 91f660debfe90a97d31ae9647317c06be67396e5 Mon Sep 17 00:00:00 2001
From: Antoine Zambelli <antoine.zambelli@gmail.com>
Date: Sun, 31 May 2026 18:24:04 -0500
Subject: [PATCH 6/8] LlamafileClient: remove runtime auto mode; native-first,
 frozen capability
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try
native, fall back to prompt on HTTP error, recording resolved_mode). This was
the last vestige of the mid-request capability mutation the proxy rewrite
excised everywhere else; the proxy already declares its capability up front via
--backend-capability. With auto gone, resolved_mode would always equal
self.mode, so the tri-state indirection collapses to a direct dispatch on
self.mode.

The default is now native. This is both hardening and a deliberate posture
shift: local-model function-calling support has matured into the more reliable
path, so native-first is the right default. Prompt-injection is preserved as an
explicit opt-in (mode="prompt") and remains the theoretically correct fallback
for non-FC backends, but the docstring and docs now flag that models tend to
struggle to drive the prompt-injected protocol reliably on more complex,
multi-step interactions. Capability is declared-and-frozen: an invalid mode
(including the old "auto") now raises ValueError at construction rather than
silently degrading.

- llamafile.py: validate mode in __init__; default native; delete
  _resolve_and_send and the resolved_mode attribute/branches; dispatch send /
  send_stream on self.mode; rewrite the class docstring (native-first rationale
  + prompt caveat).
- eval_runner.py: --llamafile-mode choices [native, prompt], default native.
- docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat.
- tests: drop the auto-mode suite; assert native default + ValueError on "auto".

Consumers verified unaffected: the proxy (both sites), batch_eval, and the
integration script all pass mode explicitly. 1086 unit tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/BACKEND_SETUP.md               |   4 +-
 docs/EVAL_GUIDE.md                  |   2 +-
 src/forge/clients/llamafile.py      |  91 ++++++-----------
 tests/eval/eval_runner.py           |   4 +-
 tests/unit/test_llamafile_client.py | 148 +++-------------------------
 5 files changed, 46 insertions(+), 203 deletions(-)
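
Note (outside the commit message): a minimal, standalone sketch of the
declared-and-frozen dispatch described above. FrozenModeClient and its
_send_native / _send_prompt stubs are hypothetical stand-ins, not the real
LlamafileClient; only the validation rule and the direct dispatch on
self.mode are meant to mirror the diff below.

```python
from __future__ import annotations

import asyncio

VALID_MODES = ("native", "prompt")


class FrozenModeClient:
    """Hypothetical stand-in: capability is declared once and never mutated."""

    def __init__(self, mode: str = "native") -> None:
        # Reject anything else (including the removed "auto") up front,
        # instead of probing the backend later.
        if mode not in VALID_MODES:
            raise ValueError(f"mode must be 'native' or 'prompt', got {mode!r}")
        self.mode = mode

    async def _send_native(self, prompt: str) -> str:
        return f"native:{prompt}"  # stub for the tools-parameter path

    async def _send_prompt(self, prompt: str) -> str:
        return f"prompt:{prompt}"  # stub for the prompt-injected path

    async def send(self, prompt: str) -> str:
        # Direct dispatch on the declared mode: no probe, no fallback.
        if self.mode == "native":
            return await self._send_native(prompt)
        return await self._send_prompt(prompt)


def demo() -> list[str]:
    """Run both declared modes once and return the stub replies."""

    async def _run() -> list[str]:
        return [
            await FrozenModeClient().send("hi"),
            await FrozenModeClient(mode="prompt").send("hi"),
        ]

    return asyncio.run(_run())
```

Because the mode never changes after construction, send() and send_stream()
can branch on the same attribute without a shared "resolved" state.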

diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md
index 024e762..75e667d 100644
--- a/docs/BACKEND_SETUP.md
+++ b/docs/BACKEND_SETUP.md
@@ -73,7 +73,7 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999
 | `-ngl 999` | Offload all layers to GPU |
 | `-m <path>` | Path to GGUF |
 
-llamafile does **not** support native function calling — forge's `LlamafileClient` falls back to prompt-injected mode automatically (`mode="auto"`), or you can force it with `mode="prompt"`.
+`LlamafileClient` is **native-first**: `mode="native"` (the default) forwards tools via the backend's `tools` parameter and requires native function calling (llama.cpp with `--jinja`). For a backend without native FC, declare `mode="prompt"` to inject tool descriptions into the prompt and parse the JSON call back out. The capability is declared at construction and frozen — there is no runtime auto-detection. Native-first is the default because local-model FC support has matured into the more reliable path; prompt-injection stays fully supported as an explicit opt-in, but note that on more complex, multi-step interactions models tend to struggle to drive the prompt-injected protocol reliably, so reach for it only when the backend leaves no alternative.
 
 > **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. See ADR-012.
 
@@ -90,7 +90,7 @@ from forge.clients import LlamafileClient
 
 client = LlamafileClient(
     gguf_path="path/to/model.gguf",
-    mode="prompt",  # or "auto" to try native first
+    mode="prompt",  # default is "native"; use "prompt" only for non-FC backends
     recommended_sampling=True,
 )
 ```
diff --git a/docs/EVAL_GUIDE.md b/docs/EVAL_GUIDE.md
index 07b4757..6ac9a69 100644
--- a/docs/EVAL_GUIDE.md
+++ b/docs/EVAL_GUIDE.md
@@ -30,7 +30,7 @@ python -m tests.eval.eval_runner --backend anthropic --model claude-haiku-4-5-20
 | `--verbose`, `-v` | flag | off | Print live per-message trace |
 | `--tags` | `plumbing`, `model_quality`, `advanced_reasoning`, `compaction`, `stateful`, `reasoning`, `error_recovery` | all | Filter scenarios by tag |
 | `--scenario` | name(s) | all | Run specific scenario(s) by name |
-| `--llamafile-mode` | `native`, `prompt`, `auto` | `auto` | FC mode for llamafile/llama-server backend |
+| `--llamafile-mode` | `native`, `prompt` | `native` | FC mode for llamafile/llama-server backend (native-first; `prompt` for non-FC backends) |
 | `--think` | `true`, `false`, `auto` | `auto` | Thinking mode. Ollama: controls `think` param. Llamafile: captures `[THINK]` tags and `reasoning_content` |
 | `--budget-mode` | `backend`, `manual`, `forge-full`, `forge-fast` | `forge-full` | Context budget strategy. Compaction scenarios always override with their own budget |
 | `--num-ctx` | int | none | Exact token budget (requires `--budget-mode manual`) |
diff --git a/src/forge/clients/llamafile.py b/src/forge/clients/llamafile.py
index c04bcec..9fd4ab7 100644
--- a/src/forge/clients/llamafile.py
+++ b/src/forge/clients/llamafile.py
@@ -121,12 +121,23 @@ def _downgrade_messages(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
 
 
 class LlamafileClient:
-    """OpenAI-compatible client for Llamafile.
-
-    mode="native" uses the tools parameter (requires Llamafile with FC support).
-    mode="prompt" injects tool descriptions into the prompt and extracts JSON.
-    mode="auto" tries native first, falls back to prompt on failure — with
-        an explicit warning log and resolved_mode set for caller inspection.
+    """OpenAI-compatible client for Llamafile / llama.cpp.
+
+    The capability is declared once at construction and frozen — there is no
+    runtime auto-detection. ``mode`` is one of:
+
+    - ``"native"`` (default): forwards tools via the ``tools`` parameter
+      (requires a backend with native function calling — llama.cpp ``--jinja``).
+    - ``"prompt"``: injects tool descriptions into the prompt and parses the
+      JSON tool call back out; for backends without native FC.
+
+    Native-first is the default because function-calling support across local
+    models has matured to the point where it is the more reliable path.
+    Prompt-injection remains fully supported as an explicit opt-in: it is the
+    theoretically correct fallback when a backend can't do native FC, but be
+    aware that on more complex, multi-step interactions models tend to struggle
+    to drive the prompt-injected protocol reliably. Choose ``"prompt"`` only
+    when the backend leaves no alternative.
     """
 
     api_format: str = "openai"
@@ -142,13 +153,20 @@ def __init__(
         repeat_penalty: float | None = None,
         presence_penalty: float | None = None,
         chat_template_kwargs: dict[str, Any] | None = None,
-        mode: str = "auto",
+        mode: str = "native",
         timeout: float = 300.0,
         think: bool | None = None,
         cache_prompt: bool = True,
         slot_id: int | None = None,
         recommended_sampling: bool = False,
     ) -> None:
+        if mode not in ("native", "prompt"):
+            raise ValueError(
+                f"mode must be 'native' or 'prompt', got {mode!r}. "
+                "Runtime auto-detection was removed — declare the backend "
+                "capability explicitly (native-first; 'prompt' for non-FC "
+                "backends)."
+            )
         self.base_url = base_url
         # gguf_path is the canonical identity. self.model is the stem (no
         # .gguf / .llamafile suffix) — used for the wire-format model field
@@ -175,17 +193,12 @@ def __init__(
         )
         self.mode = mode
         self._http = httpx.AsyncClient(timeout=timeout)
-        self._think: bool = think if think is not None else True  # auto = capture
+        self._think: bool = think if think is not None else True  # think=None → capture
         self._cache_prompt = cache_prompt
         self._slot_id = slot_id
 
         self.last_usage: dict[int, TokenUsage] = {}
 
-        if mode in ("native", "prompt"):
-            self.resolved_mode: str | None = mode
-        else:
-            self.resolved_mode = None
-
     async def aclose(self) -> None:
         """Close the underlying httpx connection pool."""
         await self._http.aclose()
@@ -272,7 +285,7 @@ async def send(
         inbound_anthropic_body: dict[str, Any] | None = None,
         raw_openai_tools: RawOpenAITools | None = None,
     ) -> LLMResponse:
-        """Resolve mode on first call with tools, then dispatch.
+        """Dispatch to the native or prompt-injected path per the declared mode.
 
         ``inbound_anthropic_body`` is accepted for protocol symmetry and
         silently ignored — LlamafileClient only speaks OpenAI shape.
@@ -281,16 +294,11 @@ async def send(
         backend's ``tools`` array on the native path; the prompt path
         accepts and ignores it (it keeps forge's prompt-injection format).
         """
-        if self.resolved_mode is None:
-            return await self._resolve_and_send(
-                messages, tools, sampling, passthrough, raw_openai_tools,
-            )
-        elif self.resolved_mode == "native":
+        if self.mode == "native":
             return await self._send_native(
                 messages, tools, sampling, passthrough, raw_openai_tools,
             )
-        else:
-            return await self._send_prompt(messages, tools, sampling, passthrough)
+        return await self._send_prompt(messages, tools, sampling, passthrough)
 
     async def send_stream(
         self,
@@ -307,13 +315,7 @@ async def send_stream(
         ``raw_openai_tools`` (proxy use) is forwarded verbatim on the native
         path; ignored on the prompt path.
         """
-        if self.resolved_mode is None:
-            # Probe with a non-streaming call to resolve native vs prompt.
-            # Result is discarded — the runner will use the streamed response.
-            await self._resolve_and_send(
-                messages, tools, sampling, passthrough, raw_openai_tools,
-            )
-        mode = self.resolved_mode
+        mode = self.mode
 
         body: dict[str, Any] = dict(passthrough or {})
         body.update({
@@ -469,39 +471,6 @@ async def get_context_length(self) -> int | None:
         except (ValueError, KeyError, TypeError) as exc:
             raise ContextDiscoveryError(exc) from exc
 
-    async def _resolve_and_send(
-        self,
-        messages: list[dict[str, str]],
-        tools: list[ToolSpec] | None,
-        sampling: dict[str, Any] | None = None,
-        passthrough: dict[str, Any] | None = None,
-        raw_openai_tools: RawOpenAITools | None = None,
-    ) -> LLMResponse:
-        """Auto-resolve mode on first send with tools.
-
-        Only falls back to prompt-injected mode on an HTTP error (backend
-        doesn't support the tools parameter). A TextResponse with tools
-        provided is not a fallback signal — it means native FC is supported
-        but the model chose not to call a tool. The runner's retry logic
-        handles that case.
-        """
-        if not tools:
-            # No tools to test with — send without tools, defer resolution
-            self.resolved_mode = "native"
-            return await self._send_native(
-                messages, tools, sampling, passthrough, raw_openai_tools,
-            )
-
-        try:
-            result = await self._send_native(
-                messages, tools, sampling, passthrough, raw_openai_tools,
-            )
-            self.resolved_mode = "native"
-            return result
-        except (httpx.HTTPStatusError, BackendError):
-            self.resolved_mode = "prompt"
-            return await self._send_prompt(messages, tools, sampling, passthrough)
-
     async def _send_native(
         self,
         messages: list[dict[str, str]],
diff --git a/tests/eval/eval_runner.py b/tests/eval/eval_runner.py
index 231e56e..b503594 100644
--- a/tests/eval/eval_runner.py
+++ b/tests/eval/eval_runner.py
@@ -507,8 +507,8 @@ async def main() -> None:
     parser.add_argument("--scenario", nargs="*", help="Run specific scenario(s) by name")
     parser.add_argument(
         "--llamafile-mode",
-        choices=["native", "prompt", "auto"],
-        default="auto",
+        choices=["native", "prompt"],
+        default="native",
     )
     parser.add_argument(
         "--budget-mode",
diff --git a/tests/unit/test_llamafile_client.py b/tests/unit/test_llamafile_client.py
index 3785ee7..c8ebc63 100644
--- a/tests/unit/test_llamafile_client.py
+++ b/tests/unit/test_llamafile_client.py
@@ -369,133 +369,6 @@ async def test_tool_role_downgraded_in_prompt_mode(self) -> None:
         assert sent_messages[3]["content"] == "fetch_result"
 
 
-# ── auto mode ────────────────────────────────────────────────────
-
-
-class TestLlamafileAutoMode:
-    @pytest.mark.asyncio
-    async def test_auto_resolves_native_on_tool_call(self) -> None:
-        client = _make_client("auto")
-        assert client.resolved_mode is None
-        client._http.post.return_value = _mock_response(
-            _openai_tool_call_response()
-        )
-        result = await client.send(
-            [{"role": "user", "content": "test"}], tools=[_make_spec()]
-        )
-        assert isinstance(result, list)
-        assert client.resolved_mode == "native"
-
-    @pytest.mark.asyncio
-    async def test_auto_stays_native_on_text_response(self) -> None:
-        """TextResponse with tools provided means FC is supported but the
-        model chose not to call a tool. Should resolve to native, not
-        fall back to prompt mode."""
-        client = _make_client("auto")
-
-        client._http.post.return_value = _mock_response(
-            _openai_text_response("Let me think about this...")
-        )
-
-        result = await client.send(
-            [{"role": "system", "content": "sys"}, {"role": "user", "content": "test"}],
-            tools=[_make_spec()],
-        )
-        assert client.resolved_mode == "native"
-        assert isinstance(result, TextResponse)
-        assert result.content == "Let me think about this..."
-
-    @pytest.mark.asyncio
-    async def test_auto_falls_back_on_http_error(self) -> None:
-        client = _make_client("auto")
-
-        # First call (native attempt) raises HTTP error
-        error_resp = _mock_response({}, status_code=400)
-        prompt_resp = _mock_response(
-            _openai_text_response('{"tool": "get_pricing", "args": {}}')
-        )
-        client._http.post.side_effect = [error_resp, prompt_resp]
-
-        result = await client.send(
-            [{"role": "system", "content": "sys"}, {"role": "user", "content": "test"}],
-            tools=[_make_spec()],
-        )
-        assert client.resolved_mode == "prompt"
-        assert isinstance(result, list)
-
-    @pytest.mark.asyncio
-    async def test_auto_without_tools_defaults_native(self) -> None:
-        client = _make_client("auto")
-        client._http.post.return_value = _mock_response(
-            _openai_text_response("hello")
-        )
-        result = await client.send([{"role": "user", "content": "hi"}])
-        assert isinstance(result, TextResponse)
-        assert client.resolved_mode == "native"
-
-    @pytest.mark.asyncio
-    async def test_send_stream_auto_resolves_native(self) -> None:
-        """send_stream() probes mode before streaming when resolved_mode is None."""
-        client = _make_client("auto")
-        assert client.resolved_mode is None
-
-        # Probe call resolves to native
-        client._http.post.return_value = _mock_response(
-            _openai_tool_call_response()
-        )
-        # Streaming call returns a tool call
-        sse_lines = [
-            f'data: {json.dumps({"choices": [{"delta": {"tool_calls": [{"index": 0, "function": {"name": "get_pricing", "arguments": ""}}]}}]})}',
-            f'data: {json.dumps({"choices": [{"delta": {"tool_calls": [{"index": 0, "function": {"arguments": "{\"part\": \"X\"}"}}]}}]})}',
-            "data: [DONE]",
-        ]
-        client._http.stream.return_value = _MockSSEStreamResponse(sse_lines)
-
-        chunks = []
-        async for chunk in client.send_stream(
-            [{"role": "system", "content": "sys"}, {"role": "user", "content": "test"}],
-            tools=[_make_spec()],
-        ):
-            chunks.append(chunk)
-
-        assert client.resolved_mode == "native"
-        finals = [c for c in chunks if c.type == ChunkType.FINAL]
-        assert len(finals) == 1
-        assert isinstance(finals[0].response, list)
-
-    @pytest.mark.asyncio
-    async def test_send_stream_auto_falls_back_to_prompt(self) -> None:
-        """send_stream() falls back to prompt mode when native probe fails."""
-        client = _make_client("auto")
-        assert client.resolved_mode is None
-
-        # Probe: native fails with HTTP error, prompt fallback succeeds
-        error_resp = _mock_response({}, status_code=400)
-        prompt_resp = _mock_response(
-            _openai_text_response('{"tool": "get_pricing", "args": {"part": "X"}}')
-        )
-        client._http.post.side_effect = [error_resp, prompt_resp]
-
-        # Streaming call (now in prompt mode) returns extracted tool call
-        sse_lines = [
-            f'data: {json.dumps({"choices": [{"delta": {"content": "{\"tool\": \"get_pricing\", \"args\": {\"part\": \"Y\"}}"}, "finish_reason": "stop"}]})}',
-        ]
-        client._http.stream.return_value = _MockSSEStreamResponse(sse_lines)
-
-        chunks = []
-        async for chunk in client.send_stream(
-            [{"role": "system", "content": "sys"}, {"role": "user", "content": "test"}],
-            tools=[_make_spec()],
-        ):
-            chunks.append(chunk)
-
-        assert client.resolved_mode == "prompt"
-        finals = [c for c in chunks if c.type == ChunkType.FINAL]
-        assert len(finals) == 1
-        assert isinstance(finals[0].response, list)
-        assert finals[0].response[0].tool == "get_pricing"
-
-
 # ── get_context_length ───────────────────────────────────────────
 
 
@@ -678,21 +551,22 @@ async def test_streaming_no_reasoning_when_no_content(self) -> None:
         assert final.response[0].reasoning is None
 
 
-# ── resolved_mode ────────────────────────────────────────────────
+# ── mode ─────────────────────────────────────────────────────────
 
 
-class TestResolvedMode:
-    def test_native_mode_set_immediately(self) -> None:
-        client = LlamafileClient(gguf_path="test", mode="native")
-        assert client.resolved_mode == "native"
+class TestMode:
+    def test_native_is_default(self) -> None:
+        client = LlamafileClient(gguf_path="test")
+        assert client.mode == "native"
 
-    def test_prompt_mode_set_immediately(self) -> None:
+    def test_prompt_mode(self) -> None:
         client = LlamafileClient(gguf_path="test", mode="prompt")
-        assert client.resolved_mode == "prompt"
+        assert client.mode == "prompt"
 
-    def test_auto_mode_unset(self) -> None:
-        client = LlamafileClient(gguf_path="test", mode="auto")
-        assert client.resolved_mode is None
+    def test_auto_mode_rejected(self) -> None:
+        # Runtime auto-detection was removed — capability is declared-and-frozen.
+        with pytest.raises(ValueError, match="mode must be 'native' or 'prompt'"):
+            LlamafileClient(gguf_path="test", mode="auto")
 
 
 # ── _apply_sampling ──────────────────────────────────────────────

From 4d7debc8228646dd2779ecf6b54b6954db10da46 Mon Sep 17 00:00:00 2001
From: Antoine Zambelli <antoine.zambelli@gmail.com>
Date: Mon, 1 Jun 2026 00:42:17 -0500
Subject: [PATCH 7/8] fix(eval): honor manual context budget in batch_eval via
 start_with_budget
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

batch_eval brought servers up with a bare server.start() (no ctx_override)
and resolved the budget separately via server.resolve_budget(), so
--budget-mode manual --num-ctx N was a no-op for llama-server: the server
booted at the model's full native context (no -c), and resolve_budget(MANUAL)
just read that full value back from /props. (Ollama was unaffected — its
context is per-request via set_num_ctx.)

Route both the initial bring-up and _recover_server through the prod
start_with_budget() path, which threads manual_tokens -> ctx_override -> -c
at launch and returns the resolved budget. _recover_server gains
budget_mode/manual_tokens params so a restarted server reuses the same
budget. Drops the now-redundant standalone resolve_budget() on the happy
path; it is still called on the recovery branch, where _recover_server
returns only a success flag, to read back the resolved value.

This also fixes FORGE_FAST mode, which the old bare-start() path never
supported.

Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000):
server boots with -c, rows record budget_tokens=20224 (server-clamped)
instead of the previous 262144 full-native read-back.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 tests/eval/batch_eval.py | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)
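
Note (outside the commit message): a rough sketch of the
manual_tokens -> ctx_override -> -c threading described above. The function
names, the two-member BudgetMode enum, and the ValueError on a missing token
count are assumptions for illustration; the real start_with_budget() is not
shown in this patch and also handles the forge-full / forge-fast modes, which
this sketch leaves out.

```python
from __future__ import annotations

from enum import Enum


class BudgetMode(Enum):
    """Hypothetical subset of the budget modes named in the eval docs."""

    BACKEND = "backend"
    MANUAL = "manual"


def ctx_override_for(budget_mode: BudgetMode, manual_tokens: int | None) -> int | None:
    """Return the context size to force at launch, or None for the model default."""
    if budget_mode is BudgetMode.MANUAL:
        if manual_tokens is None:
            # Assumed behaviour: MANUAL without a token count is an error.
            raise ValueError("MANUAL budget mode requires manual_tokens")
        return manual_tokens
    return None


def build_launch_args(gguf_path: str, port: int, ctx_override: int | None) -> list[str]:
    """Assemble a llama-server style argv; -c is added only when overridden."""
    args = ["llama-server", "-m", gguf_path, "--port", str(port)]
    if ctx_override is not None:
        args += ["-c", str(ctx_override)]
    return args
```

The point of the fix is that the override has to reach the launch command:
resolving the budget after a bare start can only read back whatever context
the server already booted with.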

diff --git a/tests/eval/batch_eval.py b/tests/eval/batch_eval.py
index 9f04e82..e919437 100644
--- a/tests/eval/batch_eval.py
+++ b/tests/eval/batch_eval.py
@@ -420,9 +420,15 @@ async def _recover_server(
     gguf_path: str,
     extra_flags: list[str] | None,
     crash_count: int,
+    budget_mode: BudgetMode,
+    manual_tokens: int | None,
 ) -> bool:
     """Attempt to restart the server after a crash.
 
+    Restarts through the prod ``start_with_budget`` path so the recovered
+    server is launched with the same budget (e.g. ``-c manual_tokens`` for
+    MANUAL mode) as the original.
+
     Returns True if recovery succeeded, False if circuit breaker tripped.
     """
     if crash_count > len(_RECOVERY_BACKOFFS):
@@ -447,10 +453,12 @@ async def _recover_server(
     # GGUF path for non-Ollama (matches run_batch and setup_backend).
     cache_identity = config.model if config.backend == "ollama" else gguf_path
     try:
-        await server.start(
+        await server.start_with_budget(
             model=cache_identity,
             gguf_path=gguf_path,
             mode=config.mode,
+            budget_mode=budget_mode,
+            manual_tokens=manual_tokens,
             extra_flags=extra_flags,
         )
         print("  [!] Server restarted successfully.", flush=True)
@@ -751,10 +759,15 @@ async def run_batch(
             extra_flags = _get_server_flags(config.model, config.mode)
             cache_identity = config.model if config.backend == "ollama" else gguf_path
             try:
-                await server.start(
+                # Prod path: launches with the budget-appropriate context
+                # (e.g. -c manual_tokens for MANUAL) and returns the resolved
+                # budget, instead of starting raw and reading back full ctx.
+                resolved_budget = await server.start_with_budget(
                     model=cache_identity,
                     gguf_path=gguf_path,
                     mode=config.mode,
+                    budget_mode=budget_mode,
+                    manual_tokens=manual_tokens,
                     extra_flags=extra_flags if extra_flags else None,
                 )
             except RuntimeError:
@@ -763,20 +776,19 @@ async def run_batch(
                     server, config, gguf_path,
                     extra_flags if extra_flags else None,
                     crash_count=1,
+                    budget_mode=budget_mode, manual_tokens=manual_tokens,
                 )
                 if not recovered:
                     print(f"  SKIP (server failed to start)", flush=True)
                     total_skipped += total_scenarios
                     continue
+                resolved_budget = await server.resolve_budget(budget_mode, manual_tokens)
 
             prev_backend = config.backend
             prev_server = server
 
             # Build client
             client = _build_client(config, models_dir)
-
-            # Resolve budget through prod ServerManager path
-            resolved_budget = await server.resolve_budget(budget_mode, manual_tokens)
             if hasattr(client, "set_num_ctx"):
                 client.set_num_ctx(resolved_budget)
 
@@ -844,6 +856,7 @@ async def run_batch(
                             server, config, gguf_path,
                             extra_flags if extra_flags else None,
                             crash_count,
+                            budget_mode=budget_mode, manual_tokens=manual_tokens,
                         )
                         if not recovered:
                             print(

From eb7e7754e2c63cba06d87caa68ffff4654c0b7f6 Mon Sep 17 00:00:00 2001
From: Antoine Zambelli <antoine.zambelli@gmail.com>
Date: Mon, 1 Jun 2026 00:56:51 -0500
Subject: [PATCH 8/8] =?UTF-8?q?chore(release):=200.7.3=20=E2=80=94=20nativ?=
 =?UTF-8?q?e-first=20proxy?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this
branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient
#89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94).

Headline: native-first proxy. BREAKING — the proxy --mode flag is renamed
to --backend-capability (no alias; --mode was only introduced in 0.7.1).
Native is the default and only auto-selected protocol; prompt-injection is
an explicit opt-in for non-FC llama.cpp/llamafile backends.

USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode
tends to degrade on more complex multi-step interactions. BACKEND_SETUP,
EVAL_GUIDE, and ADR-012 were already updated earlier on this branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 CHANGELOG.md       | 27 +++++++++++++++++++++++++++
 docs/USER_GUIDE.md |  2 +-
 pyproject.toml     |  2 +-
 3 files changed, 29 insertions(+), 2 deletions(-)
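
Reviewer note (not part of the commit; it sits in the notes area before the diff): the migration described in the CHANGELOG entry can be sketched as an argv rewrite for launch scripts that still pass the old flag. This is a minimal sketch, not forge code. The helper name `migrate_proxy_args` is hypothetical, and support for the `--mode=value` spelling is an assumption about how the old flag was parsed.

```python
# Hypothetical helper, not part of forge: rewrites pre-0.7.3 proxy arguments.
# Only the two tool-calling values named in the CHANGELOG are migrated.
TOOL_CALLING_VALUES = {"native", "prompt"}


def migrate_proxy_args(argv: list[str]) -> list[str]:
    """Translate `--mode {native,prompt}` into `--backend-capability`."""
    migrated: list[str] = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        value = None
        consumed = 1
        if arg == "--mode" and i + 1 < len(argv):
            # Space-separated form: --mode prompt
            value, consumed = argv[i + 1], 2
        elif arg.startswith("--mode="):
            # Assumed form: --mode=prompt
            value = arg.split("=", 1)[1]
        if value in TOOL_CALLING_VALUES:
            if value == "prompt":
                migrated += ["--backend-capability", "prompt"]
            # `native` is the default in 0.7.3, so the flag is simply dropped.
            i += consumed
        else:
            # Anything else, including `--mode` with an unrelated value,
            # is passed through unchanged.
            migrated.append(arg)
            i += 1
    return migrated
```

For example, `migrate_proxy_args(["--mode", "prompt", "--backend-timeout", "300"])` returns `["--backend-capability", "prompt", "--backend-timeout", "300"]`. Values other than `native` and `prompt` are deliberately left alone, because the CHANGELOG notes that `--mode` collided with the managed / external deployment mode.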

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 29aa581..f2e0d0b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,33 @@
 
 All notable changes to forge are documented here.
 
+## [0.7.3] — 2026-06-01
+
+Native-first proxy. With native function calling now well-supported across modern local models, the proxy defaults to — and is optimized for — native tool calling, forwarding the client's OpenAI `tools` / `messages` to the backend verbatim. Prompt-injection remains available as an explicit opt-in for llama.cpp / llamafile backends that lack a function-calling template, but it is no longer the default path. This release also folds in the OpenAI-compatible client and several proxy / eval fixes that landed on `main` since 0.7.2.
+
+### Added
+- **`OpenAICompatClient`** for arbitrary OpenAI-compatible endpoints. #89 (thanks @lucasgerads).
+- **`--backend-timeout` proxy option** — configurable backend response timeout (default 300s). #91.
+- **`--backend-capability {native,prompt}` proxy flag** — `native` (default) forwards the client's tools / messages verbatim to a function-calling-capable backend; `prompt` opts into prompt-injection for non-FC llama.cpp / llamafile backends. Declared once at startup and frozen — never probed or switched mid-stream.
+- Effective `backend_timeout` logged at proxy startup.
+
+### Changed
+- **BREAKING — `--mode {native,prompt}` renamed to `--backend-capability {native,prompt}`** (and `ProxyServer(mode=…)` → `ProxyServer(backend_capability=…)`). `--mode` collided with the proxy's managed / external deployment mode; the new name states what it controls — the backend's tool-calling protocol — and reflects that the choice is declared once and frozen, never probed at runtime. There is **no deprecation alias** (`--mode` was introduced in 0.7.1). Migration: `--mode native` → drop it (native is the default) or `--backend-capability native`; `--mode prompt` → `--backend-capability prompt`.
+- **Native function calling is now transparent passthrough** — the proxy forwards the client's OpenAI tool / message payloads to the backend verbatim instead of round-tripping them through forge's internal `ToolSpec` representation, which dropped schema detail.
+- **vLLM model identity** consolidated to a single source of truth (the wire `model_path` and the registry `model` key are now set together). #75.
+- The `prompt` capability is now **rejected loudly** for ollama / vllm / anthropic backends — previously it was silently ignored for ollama.
+- `stream_options` is excluded from proxy passthrough. #94 (thanks @alexandergunnarson).
+
+### Fixed
+- **Consistent malformed-tool-call / unexpected-response handling** across the OpenAI-shape clients: malformed tool-call arguments from the model now trigger a retry (`TextResponse`) instead of degrading silently or raising inconsistently, and non-streaming responses are guarded so that a broken provider envelope fails loudly.
+- `Guardrails.record()` no longer drops tool args for prerequisite tracking. #72 (thanks @hobostay).
+- Deprecated asyncio API replaced; proxy server input validation added. #71 (thanks @hobostay).
+- Proxy input hardening, non-blocking Ollama stop, client shutdown, and loud arg decode. #86.
+- Dead code and a fragile variable reference cleaned up in `LlamafileClient`. #73 (thanks @hobostay).
+
+### Removed
+- Runtime `auto` function-calling mode in `LlamafileClient` — the proxy never used it, and its mid-request probe-and-switch behavior is replaced by the declared-and-frozen `--backend-capability`.
+
 ## [0.7.2] — 2026-05-24
 
 vLLM backend support — serve AWQ/GPTQ and other vLLM-hosted models behind forge's guardrails, in both proxy modes and via `WorkflowRunner`.
diff --git a/docs/USER_GUIDE.md b/docs/USER_GUIDE.md
index 019792c..ac04a1f 100644
--- a/docs/USER_GUIDE.md
+++ b/docs/USER_GUIDE.md
@@ -83,7 +83,7 @@ claude
 
 `ANTHROPIC_AUTH_TOKEN` can be any non-empty string — forge ignores it. The model name Claude Code sends is also ignored; forge serves whatever backend the proxy was started with.
 
-**Function-calling mode.** `--mode native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--mode prompt` injects the tool surface into the prompt for backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model, so prefer native when the backend supports it.
+**Function-calling capability.** `--backend-capability native` (default) uses the backend's chat-template tool-calling and is the better fit for Claude Code's heavy multi-turn tool use. `--backend-capability prompt` injects the tool surface into the prompt for llama.cpp/llamafile backends without a tool-calling template. How well a model stays coherent across multi-turn tool results in prompt mode varies by model and tends to degrade on more complex, multi-step interactions, so prefer native whenever the backend supports it. The capability is declared once at startup and frozen; it is never probed or switched at runtime.
 
 **Downstream protocol.**
 
diff --git a/pyproject.toml b/pyproject.toml
index 37dcf08..d7a0f2e 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "forge-guardrails"
-version = "0.7.2"
+version = "0.7.3"
 description = "A reliability layer for self-hosted LLM tool-calling. Guardrails, context management, and backend adapters for multi-step agentic workflows."
 requires-python = ">=3.12"
 license = "MIT"