antoinezambelli · antoinezambelli · Jun 1, 2026 · May 30, 2026 · May 31, 2026 · May 31, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,33 @@
 
 All notable changes to forge are documented here.
 
+## [0.7.3] — 2026-06-01
+
+Native-first proxy. With native function calling now well-supported across modern local models, the proxy defaults to — and is optimized for — native tool calling, forwarding the client's OpenAI `tools` / `messages` to the backend verbatim. Prompt-injection remains available as an explicit opt-in for llama.cpp / llamafile backends that lack a function-calling template, but it is no longer the default path. This release also folds in the OpenAI-compatible client and several proxy / eval fixes that landed on `main` since 0.7.2.
+
+### Added
+- **`OpenAICompatClient`** for arbitrary OpenAI-compatible endpoints. #89 (thanks @lucasgerads).
+- **`--backend-timeout` proxy option** — configurable backend response timeout (default 300s). #91.
+- **`--backend-capability {native,prompt}` proxy flag** — `native` (default) forwards the client's tools / messages verbatim to a function-calling-capable backend; `prompt` opts into prompt-injection for non-FC llama.cpp / llamafile backends. Declared once at startup and frozen — never probed or switched mid-stream.
+- Effective `backend_timeout` logged at proxy startup.
+
+### Changed
+- **BREAKING — `--mode {native,prompt}` renamed to `--backend-capability {native,prompt}`** (and `ProxyServer(mode=…)` → `ProxyServer(backend_capability=…)`). `--mode` collided with the proxy's managed / external deployment mode; the new name states what it controls — the backend's tool-calling protocol — and reflects that the choice is declared once and frozen, never probed at runtime. There is **no deprecation alias** (`--mode` was introduced in 0.7.1). Migration: `--mode native` → drop it (native is the default) or `--backend-capability native`; `--mode prompt` → `--backend-capability prompt`.
+- **Native function calling is now transparent passthrough** — the proxy forwards the client's OpenAI tool / message payloads to the backend verbatim instead of round-tripping them through forge's internal `ToolSpec` representation, which dropped schema detail.
+- **vLLM model identity** consolidated to a single source of truth (the wire `model_path` and the registry `model` key are now set together). #75.
+- The `prompt` capability is now **rejected loudly** for ollama / vllm / anthropic backends — previously it was silently ignored for ollama.
+- `stream_options` is excluded from proxy passthrough. #94 (thanks @alexandergunnarson).
+
+### Fixed
+- **Consistent malformed-tool-call / unexpected-response handling** across the OpenAI-shape clients — malformed model tool args drive a retry (`TextResponse`) instead of degrading silently or raising inconsistently, and non-streaming responses are guarded so a broken provider envelope fails loud.
+- `Guardrails.record()` no longer drops tool args for prerequisite tracking. #72 (thanks @hobostay).
+- Deprecated asyncio API replaced; proxy server input validation added. #71 (thanks @hobostay).
+- Proxy input hardening, non-blocking Ollama stop, client shutdown, and loud arg decode. #86.
+- Dead code and a fragile variable reference cleaned up in `LlamafileClient`. #73 (thanks @hobostay).
+
+### Removed
+- Runtime `auto` function-calling mode in `LlamafileClient` — the proxy never used it, and its mid-request probe-and-switch behavior is replaced by the declared-and-frozen `--backend-capability`.
+
 ## [0.7.2] — 2026-05-24
 
 vLLM backend support — serve AWQ/GPTQ and other vLLM-hosted models behind forge's guardrails, in both proxy modes and via `WorkflowRunner`.
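Reviewer sketch: the flag behavior described in the 0.7.3 entry can be illustrated with `argparse`. This is a minimal illustration under stated assumptions, not forge's actual CLI. The `--backend` flag, its value names, `PROMPT_CAPABLE_BACKENDS`, and `parse_proxy_args` are hypothetical; only `--backend-capability {native,prompt}`, its `native` default, and the 300s `--backend-timeout` default come from the entry.

```python
import argparse

# Hypothetical backend names; the changelog only says prompt-injection is
# limited to llama.cpp / llamafile and rejected for ollama / vllm / anthropic.
PROMPT_CAPABLE_BACKENDS = {"llamacpp", "llamafile"}
ALL_BACKENDS = ["llamacpp", "llamafile", "ollama", "vllm", "anthropic"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="proxy-sketch")
    parser.add_argument("--backend", required=True, choices=ALL_BACKENDS)
    # Declared once at startup; nothing in this sketch ever changes it later.
    parser.add_argument(
        "--backend-capability", choices=["native", "prompt"], default="native"
    )
    parser.add_argument("--backend-timeout", type=float, default=300.0)
    return parser


def parse_proxy_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Reject loudly instead of silently ignoring an unsupported combination.
    if (
        args.backend_capability == "prompt"
        and args.backend not in PROMPT_CAPABLE_BACKENDS
    ):
        parser.error(
            f"--backend-capability prompt is not supported for {args.backend}"
        )
    return args
```

Because no `--mode` alias is registered, passing `--mode prompt` to this parser fails with an unrecognized-argument error, which mirrors the no-deprecation-alias behavior described in the entry.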

diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md
@@ -73,7 +73,9 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999
 | `-ngl 999` | Offload all layers to GPU |
 | `-m <path>` | Path to GGUF |
 
-llamafile does **not** support native function calling — forge's `LlamafileClient` falls back to prompt-injected mode automatically (`mode="auto"`), or you can force it with `mode="prompt"`.
+`LlamafileClient` is **native-first**: `mode="native"` (the default) forwards tools via the backend's `tools` parameter and requires native function calling (llama.cpp with `--jinja`). For a backend without native FC, declare `mode="prompt"` to inject tool descriptions into the prompt and parse the JSON call back out. The capability is declared at construction and frozen; there is no runtime auto-detection. Native is the default because function-calling support in local models has matured into the more reliable path. Prompt-injection remains fully supported as an explicit opt-in, but models tend to struggle to drive the prompt-injected protocol on complex, multi-step interactions, so use it only when the backend offers no native FC.
+
+> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. See ADR-012.
 
 Smoke-test:
 
@@ -88,7 +90,7 @@ from forge.clients import LlamafileClient
 
 client = LlamafileClient(
     gguf_path="path/to/model.gguf",
-    mode="prompt",  # or "auto" to try native first
+    mode="prompt",  # default is "native"; use "prompt" only for non-FC backends
     recommended_sampling=True,
 )
 ```

diff --git a/docs/EVAL_GUIDE.md b/docs/EVAL_GUIDE.md
@@ -30,7 +30,7 @@ python -m tests.eval.eval_runner --backend anthropic --model claude-haiku-4-5-20
 | `--verbose`, `-v` | flag | off | Print live per-message trace |
 | `--tags` | `plumbing`, `model_quality`, `advanced_reasoning`, `compaction`, `stateful`, `reasoning`, `error_recovery` | all | Filter scenarios by tag |
 | `--scenario` | name(s) | all | Run specific scenario(s) by name |
-| `--llamafile-mode` | `native`, `prompt`, `auto` | `auto` | FC mode for llamafile/llama-server backend |
+| `--llamafile-mode` | `native`, `prompt` | `native` | FC mode for llamafile/llama-server backend (native-first; `prompt` for non-FC backends) |
 | `--think` | `true`, `false`, `auto` | `auto` | Thinking mode. Ollama: controls `think` param. Llamafile: captures `[THINK]` tags and `reasoning_content` |
 | `--budget-mode` | `backend`, `manual`, `forge-full`, `forge-fast` | `forge-full` | Context budget strategy. Compaction scenarios always override with their own budget |
 | `--num-ctx` | int | none | Exact token budget (requires `--budget-mode manual`) |

diff --git a/docs/USER_GUIDE.md b/docs/USER_GUIDE.md
@@ -83,7 +83,7 @@ claude
 
 `ANTHROPIC_AUTH_TOKEN` can be any non-empty string — forge ignores it. The model name Claude Code sends is also ignored; forge serves whatever backend the proxy was started with.
 
-**Function-calling mode.** `--mode native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--mode prompt` injects the tool surface into the prompt for backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model, so prefer native when the backend supports it.
+**Function-calling capability.** `--backend-capability native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--backend-capability prompt` injects the tool surface into the prompt for llama.cpp/llamafile backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model — and tends to degrade on more complex, multi-step interactions — so prefer native whenever the backend supports it. The capability is declared at startup and frozen.
 
 **Downstream protocol.**
 

diff --git a/docs/decisions/012-openai-proxy.md b/docs/decisions/012-openai-proxy.md
@@ -104,6 +104,52 @@ The proxy fully buffers each response from the backend before deciding what to d
 4. **Client disconnect handling** -- detect TCP drop, cancel in-flight backend request, release inference lock.
 5. **Testing** -- unit tests for extraction, integration tests with mock backend, smoke test with real llama-server.
 
+### Revision: native-first, with opt-in prompt capability
+
+The proxy is **native-first**. By default (`--backend-capability native`) it
+targets backends that speak the native OpenAI tools API (llama.cpp with a
+tool-calling chat template / `--jinja`, vLLM, Ollama, Anthropic) and forwards
+the client's request verbatim (below).
+
+Prompt-injection is available as an **explicit opt-in**
+(`--backend-capability prompt`, llama.cpp/llamafile only) for non-FC backends —
+it reuses the WorkflowRunner's prompt path (`build_tool_prompt`,
+`_downgrade_messages`, `extract_tool_call`) so there is **one** prompt
+implementation, not a proxy-specific fork. The capability is **declared once at
+construction and frozen** — there is deliberately **no `mode="auto"` runtime
+probe** (the old auto/HTTP-error fallback that mutated state mid-request was the
+root of the original tangle; it is not reintroduced). In prompt capability the
+verbatim passthrough is suppressed (`native_passthrough=False`): tools are
+serialized into the prompt, so a raw native transcript would be meaningless.
+
+History: this revision originally cut prompt mode from the proxy entirely
+("native-only"). Prompt was then re-added as the opt-in capability above —
+native-first is a cleaner story than a backwards-incompatible drop, and non-FC
+backends (e.g. llamafile) stay usable through the proxy.
+
+Rationale: the proxy is a transparent layer for an external agent that already
+speaks native FC to a native-FC backend. A traced capture showed the native
+path forwards the client's request byte-for-byte. The earlier eval regression
+(prompt-mode proxy underperforming) was a prompt-injection artifact on an
+FC-capable backend, not proxy overhead.
+
+To preserve that transparency, the proxy forwards the client's **verbatim
+OpenAI `tools` and `messages`** to the backend on the clean first attempt
+(`raw_openai_tools` / `raw_openai_messages`), bypassing the lossy
+`ToolSpec.from_json_schema` → `format_tool` round-trip that dropped schema
+detail and leaked empty tool names. The parsed `ToolSpec` list is kept only as
+forge's validation sidecar. On any forge mutation (retry / compaction / context
+warning) the proxy falls back to the folded/serialized form — see the
+`use_raw_messages` gate in `run_inference`, which mirrors the ADR-015
+`inbound_anthropic_body` drop-on-mutation logic.
+
+The synthetic `respond` tool is **opt-in** (`--inject-respond-tool`, default
+off): the proxy forwards the client's tools untouched unless asked to inject it.
+
+If a backend lacking native FC is placed behind the proxy under the default
+`native` capability, the proxy passes the model's text through (no auto-downgrade).
+**Bring an FC-capable backend**, or declare `--backend-capability prompt`.
+
 ### What this is NOT
 
 - **Not a model server.** Forge sits in front of one.

diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "forge-guardrails"
-version = "0.7.2"
+version = "0.7.3"
 description = "A reliability layer for self-hosted LLM tool-calling. Guardrails, context management, and backend adapters for multi-step agentic workflows."
 requires-python = ">=3.12"
 license = "MIT"

diff --git a/src/forge/clients/anthropic.py b/src/forge/clients/anthropic.py
@@ -288,6 +288,7 @@ async def send(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: list[dict[str, Any]] | None = None,
     ) -> LLMResponse:
         """Send messages via the Anthropic Messages API.
 
@@ -296,6 +297,8 @@ async def send(
         forge. ``passthrough`` merges inbound-body extras into the SDK call.
         ``inbound_anthropic_body`` (path 1) triggers verbatim emit — see
         ADR-015 for the cache_control preservation rationale.
+        ``raw_openai_tools`` is accepted for protocol symmetry but ignored
+        (Anthropic uses its own tool conversion).
         """
         if sampling:
             log.debug(
@@ -327,12 +330,14 @@ async def send_stream(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: list[dict[str, Any]] | None = None,
     ) -> AsyncIterator[StreamChunk]:
         """Stream via the Anthropic Messages API.
 
         ``sampling`` is accepted for protocol symmetry but ignored.
         ``passthrough`` merges inbound-body extras into the SDK call.
         ``inbound_anthropic_body`` (path 1) triggers verbatim emit; see ADR-015.
+        ``raw_openai_tools`` is accepted for protocol symmetry but ignored.
         """
         if sampling:
             log.debug(

diff --git a/src/forge/clients/base.py b/src/forge/clients/base.py
@@ -9,6 +9,12 @@
 
 from forge.core.workflow import LLMResponse, ToolCall, TextResponse, ToolSpec
 
+# Verbatim OpenAI-shape payloads forwarded by the proxy. The proxy passes the
+# caller's original ``tools`` array to the backend client so the backend sees
+# the exact schema the caller authored, not forge's reconstructed ToolSpec.
+RawOpenAITools = list[dict[str, Any]]
+RawOpenAIMessages = list[dict[str, Any]]
+
 
 @dataclass(frozen=True)
 class TokenUsage:
@@ -86,6 +92,7 @@ async def send(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: RawOpenAITools | None = None,
     ) -> LLMResponse:
         """Send messages and return a parsed response.
 
@@ -116,6 +123,11 @@ async def send(
                 forge-mutation (retry / compaction / context warning) so
                 only the clean first-attempt call rides verbatim. Other
                 clients accept and ignore. See ADR-015.
+            raw_openai_tools: Proxy-only — the client's verbatim OpenAI
+                ``tools`` array. When set, LlamafileClient's native path sends
+                it as-is instead of re-emitting ``format_tool(spec)``, so the
+                backend sees the original schema (no name/schema drift). Other
+                clients accept and ignore.
         """
         ...
 
@@ -126,6 +138,7 @@ async def send_stream(
         sampling: dict[str, Any] | None = None,
         passthrough: dict[str, Any] | None = None,
         inbound_anthropic_body: dict[str, Any] | None = None,
+        raw_openai_tools: RawOpenAITools | None = None,
     ) -> AsyncIterator[StreamChunk]:
         """Send messages and yield streaming chunks.
 
@@ -143,6 +156,7 @@ async def send_stream(
                 Per-call values win over instance state without mutating self.
             passthrough: Optional inbound-body extras dict (see ``send``).
             inbound_anthropic_body: Optional path-1 verbatim body (see ``send``).
+            raw_openai_tools: Optional verbatim OpenAI tools array (see ``send``).
         """
         ...