From f21f2de05f1a53fa227447ff8d9f65bbf5acf73d Mon Sep 17 00:00:00 2001
From: Antoine Zambelli
Date: Sat, 30 May 2026 02:32:30 -0500
Subject: [PATCH 1/8] Proxy: native-only + transparent OpenAI passthrough

Make the OpenAI-compatible proxy native-tool-call-only and forward the
client's tools/messages verbatim, bypassing the lossy ToolSpec round-trip
that dropped schema detail and leaked empty tool names.

- Remove the proxy's --mode surface; the proxy always drives the backend
  client native. LlamafileClient's prompt-injection machinery is retained
  for non-proxy WorkflowRunner / direct-client use (it still wins for some
  models in full-guardrail workflow evals).
- Add raw_openai_tools to the LLMClient protocol; LlamafileClient's native
  path sends it verbatim. Other clients accept-and-ignore (vLLM also gains
  the previously-missing passthrough/inbound_anthropic_body kwargs).
- run_inference forwards raw OpenAI messages/tools only on the clean first
  attempt (use_raw_messages gate); any mutation falls back to
  fold+serialize.
- respond tool is now opt-in (--inject-respond-tool, default off).
- No instrumentation (proxy_trace/guardrail_stats deliberately not ported).
- Tests: drop removed mode-guard tests; respond tests opt in explicitly;
  add native-passthrough, detachment, respond-default, and
  first-attempt-gate coverage. Docs: ADR-012 revision + BACKEND_SETUP
  proxy note.

Co-Authored-By: Claude Opus 4.8 --- docs/BACKEND_SETUP.md | 2 + docs/decisions/012-openai-proxy.md | 33 ++++++++ src/forge/clients/anthropic.py | 5 ++ src/forge/clients/base.py | 14 ++++ src/forge/clients/llamafile.py | 51 +++++++++--- src/forge/clients/ollama.py | 10 ++- src/forge/clients/vllm.py | 19 ++++- src/forge/core/inference.py | 44 +++++++++- src/forge/proxy/__main__.py | 15 ++-- src/forge/proxy/handler.py | 61 ++++++++++++-- src/forge/proxy/proxy.py | 41 ++++------ src/forge/proxy/server.py | 3 + tests/unit/test_inference_passthrough.py | 100 +++++++++++++++++++++++ tests/unit/test_proxy_handler.py | 90 +++++++++++++++++++- tests/unit/test_proxy_path1.py | 8 -- tests/unit/test_proxy_proxy.py | 21 ++--- 16 files changed, 428 insertions(+), 89 deletions(-) create mode 100644 tests/unit/test_inference_passthrough.py diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md index 8d5cdb2..0bb0297 100644 --- a/docs/BACKEND_SETUP.md +++ b/docs/BACKEND_SETUP.md @@ -75,6 +75,8 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999 llamafile does **not** support native function calling — forge's `LlamafileClient` falls back to prompt-injected mode automatically (`mode="auto"`), or you can force it with `mode="prompt"`. +> **Proxy note:** prompt-injection mode is a **direct-client / WorkflowRunner** feature. The OpenAI-compatible **proxy is native-only** — it forwards the client's tools verbatim and does not prompt-inject (see ADR-012). Put an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) behind the proxy; a non-FC backend like llamafile will degrade to passing text through. + Smoke-test: ```bash diff --git a/docs/decisions/012-openai-proxy.md b/docs/decisions/012-openai-proxy.md index 644e5c0..90bc822 100644 --- a/docs/decisions/012-openai-proxy.md +++ b/docs/decisions/012-openai-proxy.md @@ -104,6 +104,39 @@ The proxy fully buffers each response from the backend before deciding what to d 4. 
**Client disconnect handling** -- detect TCP drop, cancel in-flight backend request, release inference lock.
 5. **Testing** -- unit tests for extraction, integration tests with mock backend, smoke test with real llama-server.
 
+### Revision: native-only + transparent passthrough
+
+The proxy is **native-tool-call-only**. It targets backends that speak the
+native OpenAI tools API (llama.cpp with a tool-calling chat template / `--jinja`,
+vLLM, Ollama, Anthropic). There is no `--mode` flag and no prompt-injection
+fallback in the proxy — prompt-injection mode (`build_tool_prompt`,
+`_downgrade_messages`, the `mode="auto"` HTTP-error fallback) is a non-proxy
+**WorkflowRunner / direct-client** feature only, retained because it still wins
+for some models in full-guardrail workflow evals.
+
+Rationale: the proxy is a transparent layer for an external agent that already
+speaks native FC to a native-FC backend. A traced capture showed the native
+path forwards the client's request byte-for-byte. The earlier eval regression
+(prompt-mode proxy underperforming) was a prompt-injection artifact on an
+FC-capable backend, not proxy overhead.
+
+To preserve that transparency, the proxy forwards the client's **verbatim
+OpenAI `tools` and `messages`** to the backend on the clean first attempt
+(`raw_openai_tools` / `raw_openai_messages`), bypassing the lossy
+`ToolSpec.from_json_schema` → `format_tool` round-trip that dropped schema
+detail and leaked empty tool names. The parsed `ToolSpec` list is kept only as
+forge's validation sidecar. On any forge mutation (retry / compaction / context
+warning) the proxy falls back to the folded/serialized form — see the
+`use_raw_messages` gate in `run_inference`, which mirrors the ADR-015
+`inbound_anthropic_body` drop-on-mutation logic.
+
+The synthetic `respond` tool is **opt-in** (`--inject-respond-tool`, default
+off): the proxy forwards the client's tools untouched unless asked to inject it.
+ +If a backend lacking native FC is placed behind the proxy, it degrades to +passing the model's text through (no auto-downgrade) — **bring an FC-capable +backend.** + ### What this is NOT - **Not a model server.** Forge sits in front of one. diff --git a/src/forge/clients/anthropic.py b/src/forge/clients/anthropic.py index 5f9e062..ad9954c 100644 --- a/src/forge/clients/anthropic.py +++ b/src/forge/clients/anthropic.py @@ -288,6 +288,7 @@ async def send( sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: list[dict[str, Any]] | None = None, ) -> LLMResponse: """Send messages via the Anthropic Messages API. @@ -296,6 +297,8 @@ async def send( forge. ``passthrough`` merges inbound-body extras into the SDK call. ``inbound_anthropic_body`` (path 1) triggers verbatim emit — see ADR-015 for the cache_control preservation rationale. + ``raw_openai_tools`` accepted for protocol symmetry, ignored + (Anthropic uses its own tool conversion). """ if sampling: log.debug( @@ -327,12 +330,14 @@ async def send_stream( sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: list[dict[str, Any]] | None = None, ) -> AsyncIterator[StreamChunk]: """Stream via the Anthropic Messages API. ``sampling`` is accepted for protocol symmetry but ignored. ``passthrough`` merges inbound-body extras into the SDK call. ``inbound_anthropic_body`` (path 1) triggers verbatim emit; see ADR-015. + ``raw_openai_tools`` accepted for protocol symmetry, ignored. """ if sampling: log.debug( diff --git a/src/forge/clients/base.py b/src/forge/clients/base.py index ec63b22..2a500ca 100644 --- a/src/forge/clients/base.py +++ b/src/forge/clients/base.py @@ -9,6 +9,12 @@ from forge.core.workflow import LLMResponse, ToolCall, TextResponse, ToolSpec +# Verbatim OpenAI-shape payloads forwarded by the proxy. 
The proxy hands the +# client the user's original ``tools`` array so the backend sees the exact +# schema the client authored, instead of forge's reconstructed ToolSpec. +RawOpenAITools = list[dict[str, Any]] +RawOpenAIMessages = list[dict[str, Any]] + @dataclass(frozen=True) class TokenUsage: @@ -86,6 +92,7 @@ async def send( sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: RawOpenAITools | None = None, ) -> LLMResponse: """Send messages and return a parsed response. @@ -116,6 +123,11 @@ async def send( forge-mutation (retry / compaction / context warning) so only the clean first-attempt call rides verbatim. Other clients accept and ignore. See ADR-015. + raw_openai_tools: Proxy-only — the client's verbatim OpenAI + ``tools`` array. When set, LlamafileClient's native path sends + it as-is instead of re-emitting ``format_tool(spec)``, so the + backend sees the original schema (no name/schema drift). Other + clients accept and ignore. """ ... @@ -126,6 +138,7 @@ async def send_stream( sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: RawOpenAITools | None = None, ) -> AsyncIterator[StreamChunk]: """Send messages and yield streaming chunks. @@ -143,6 +156,7 @@ async def send_stream( Per-call values win over instance state without mutating self. passthrough: Optional inbound-body extras dict (see ``send``). inbound_anthropic_body: Optional path-1 verbatim body (see ``send``). + raw_openai_tools: Optional verbatim OpenAI tools array (see ``send``). """ ... 
diff --git a/src/forge/clients/llamafile.py b/src/forge/clients/llamafile.py index 2529a62..7140794 100644 --- a/src/forge/clients/llamafile.py +++ b/src/forge/clients/llamafile.py @@ -10,7 +10,7 @@ import httpx -from forge.clients.base import ChunkType, StreamChunk, TokenUsage, format_tool +from forge.clients.base import ChunkType, RawOpenAITools, StreamChunk, TokenUsage, format_tool from forge.clients.sampling_defaults import apply_sampling_defaults from forge.core.workflow import LLMResponse, TextResponse, ToolCall, ToolSpec from forge.errors import BackendError, ContextDiscoveryError @@ -270,16 +270,25 @@ async def send( sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: RawOpenAITools | None = None, ) -> LLMResponse: """Resolve mode on first call with tools, then dispatch. ``inbound_anthropic_body`` is accepted for protocol symmetry and silently ignored — LlamafileClient only speaks OpenAI shape. + + ``raw_openai_tools`` (proxy use) is forwarded verbatim as the + backend's ``tools`` array on the native path; the prompt path + accepts and ignores it (it keeps forge's prompt-injection format). 
""" if self.resolved_mode is None: - return await self._resolve_and_send(messages, tools, sampling, passthrough) + return await self._resolve_and_send( + messages, tools, sampling, passthrough, raw_openai_tools, + ) elif self.resolved_mode == "native": - return await self._send_native(messages, tools, sampling, passthrough) + return await self._send_native( + messages, tools, sampling, passthrough, raw_openai_tools, + ) else: return await self._send_prompt(messages, tools, sampling, passthrough) @@ -290,15 +299,20 @@ async def send_stream( sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: RawOpenAITools | None = None, ) -> AsyncIterator[StreamChunk]: """Stream via SSE, handling both native FC and prompt-injected paths. ``inbound_anthropic_body`` accepted for protocol symmetry, ignored. + ``raw_openai_tools`` (proxy use) is forwarded verbatim on the native + path; ignored on the prompt path. """ if self.resolved_mode is None: # Probe with a non-streaming call to resolve native vs prompt. # Result is discarded — the runner will use the streamed response. 
- await self._resolve_and_send(messages, tools, sampling, passthrough) + await self._resolve_and_send( + messages, tools, sampling, passthrough, raw_openai_tools, + ) mode = self.resolved_mode body: dict[str, Any] = dict(passthrough or {}) @@ -315,8 +329,12 @@ async def send_stream( prepared = _merge_consecutive(messages) else: prepared = _merge_consecutive(_downgrade_messages(messages)) - if mode == "native" and tools: - body["tools"] = [format_tool(t) for t in tools] + if mode == "native" and (raw_openai_tools is not None or tools): + body["tools"] = ( + raw_openai_tools + if raw_openai_tools is not None + else [format_tool(t) for t in tools] + ) body["messages"] = prepared elif mode == "prompt" and tools: tool_prompt = build_tool_prompt(tools) @@ -457,6 +475,7 @@ async def _resolve_and_send( tools: list[ToolSpec] | None, sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, + raw_openai_tools: RawOpenAITools | None = None, ) -> LLMResponse: """Auto-resolve mode on first send with tools. @@ -469,10 +488,14 @@ async def _resolve_and_send( if not tools: # No tools to test with — send without tools, defer resolution self.resolved_mode = "native" - return await self._send_native(messages, tools, sampling, passthrough) + return await self._send_native( + messages, tools, sampling, passthrough, raw_openai_tools, + ) try: - result = await self._send_native(messages, tools, sampling, passthrough) + result = await self._send_native( + messages, tools, sampling, passthrough, raw_openai_tools, + ) self.resolved_mode = "native" return result except (httpx.HTTPStatusError, BackendError): @@ -485,8 +508,14 @@ async def _send_native( tools: list[ToolSpec] | None, sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, + raw_openai_tools: RawOpenAITools | None = None, ) -> LLMResponse: - """Send using native function calling (OpenAI tools parameter).""" + """Send using native function calling (OpenAI tools parameter). 
+ + When ``raw_openai_tools`` is supplied (proxy native passthrough), it is + sent as the ``tools`` array verbatim so the backend sees the client's + original schema instead of forge's re-emitted ``format_tool(spec)``. + """ merged = _merge_consecutive(messages) body: dict[str, Any] = dict(passthrough or {}) body.update({ @@ -496,7 +525,9 @@ async def _send_native( body.setdefault("model", self.model) self._apply_slot_id(body) self._apply_sampling(body, sampling) - if tools: + if raw_openai_tools is not None: + body["tools"] = raw_openai_tools + elif tools: body["tools"] = [format_tool(t) for t in tools] resp = await self._http.post( diff --git a/src/forge/clients/ollama.py b/src/forge/clients/ollama.py index 275e7d4..9f93a6c 100644 --- a/src/forge/clients/ollama.py +++ b/src/forge/clients/ollama.py @@ -145,6 +145,7 @@ async def send( sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: list[dict[str, Any]] | None = None, ) -> LLMResponse: """Send messages via /api/chat and parse the response. @@ -153,8 +154,8 @@ async def send( (forge proxy uses LlamafileClient for external mode). Adding Ollama passthrough is a follow-up. - ``inbound_anthropic_body`` accepted for protocol symmetry, ignored - (Ollama is OpenAI-shape only). + ``inbound_anthropic_body`` / ``raw_openai_tools`` accepted for protocol + symmetry, ignored (Ollama is OpenAI-shape only). """ body: dict[str, Any] = { "model": self.model, @@ -215,11 +216,12 @@ async def send_stream( sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: list[dict[str, Any]] | None = None, ) -> AsyncIterator[StreamChunk]: """Stream via NDJSON from /api/chat. - ``passthrough`` / ``inbound_anthropic_body`` accepted for protocol - symmetry; see ``send`` notes. 
+ ``passthrough`` / ``inbound_anthropic_body`` / ``raw_openai_tools`` + accepted for protocol symmetry; see ``send`` notes. """ body: dict[str, Any] = { "model": self.model, diff --git a/src/forge/clients/vllm.py b/src/forge/clients/vllm.py index 5983423..ec0b574 100644 --- a/src/forge/clients/vllm.py +++ b/src/forge/clients/vllm.py @@ -165,8 +165,16 @@ async def send( messages: list[dict[str, str]], tools: list[ToolSpec] | None = None, sampling: dict[str, Any] | None = None, + passthrough: dict[str, Any] | None = None, + inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: list[dict[str, Any]] | None = None, ) -> LLMResponse: - """Send messages via /v1/chat/completions and parse the response.""" + """Send messages via /v1/chat/completions and parse the response. + + ``passthrough`` / ``inbound_anthropic_body`` / ``raw_openai_tools`` are + accepted for protocol symmetry and ignored — vLLM parses tools and + reasoning server-side and is native-only. + """ body: dict[str, Any] = { "model": self.model_path, "messages": messages, @@ -213,8 +221,15 @@ async def send_stream( messages: list[dict[str, str]], tools: list[ToolSpec] | None = None, sampling: dict[str, Any] | None = None, + passthrough: dict[str, Any] | None = None, + inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: list[dict[str, Any]] | None = None, ) -> AsyncIterator[StreamChunk]: - """Stream via SSE from /v1/chat/completions.""" + """Stream via SSE from /v1/chat/completions. + + ``passthrough`` / ``inbound_anthropic_body`` / ``raw_openai_tools`` + accepted for protocol symmetry and ignored (see ``send``). 
+ """ body: dict[str, Any] = { "model": self.model_path, "messages": messages, diff --git a/src/forge/core/inference.py b/src/forge/core/inference.py index f22c528..421eaae 100644 --- a/src/forge/core/inference.py +++ b/src/forge/core/inference.py @@ -13,7 +13,14 @@ from dataclasses import dataclass, field from typing import Any -from forge.clients.base import ChunkType, LLMClient, StreamChunk, TokenUsage +from forge.clients.base import ( + ChunkType, + LLMClient, + RawOpenAIMessages, + RawOpenAITools, + StreamChunk, + TokenUsage, +) from forge.context.manager import ContextManager from forge.core.messages import Message, MessageMeta, MessageRole, MessageType, ToolCallInfo from forge.core.workflow import LLMResponse, TextResponse, ToolCall, ToolSpec @@ -127,6 +134,8 @@ async def run_inference( sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_messages: RawOpenAIMessages | None = None, + raw_openai_tools: RawOpenAITools | None = None, ) -> InferenceResult | None: """Send messages to the LLM with compaction, folding, validation, and retry. @@ -197,8 +206,21 @@ async def run_inference( if context_warning: verbatim_body = None # mutation - # Fold and serialize - api_messages = fold_and_serialize(messages, api_format) + # Fold and serialize. Proxy callers may supply the client's raw OpenAI + # transcript; on the clean first attempt (no compaction, no warning) we + # forward it verbatim so the backend sees the client-authored shape + # instead of forge's parsed/re-emitted form. Any forge mutation + # (compaction / context warning / retry) falls back to folding. 
+ use_raw_messages = ( + raw_openai_messages is not None + and _attempt == 0 + and compacted is messages + and not context_warning + ) + if use_raw_messages: + api_messages = raw_openai_messages + else: + api_messages = fold_and_serialize(messages, api_format) # Inject context warning as transient user message (not persisted # in conversation history). Uses "user" role because mid-conversation @@ -213,16 +235,27 @@ async def run_inference( MessageMeta(MessageType.CONTEXT_WARNING, step_index=step_index), )) + # Forward raw tools only on the clean first attempt — on retries forge + # has appended nudge/tool-error messages, so the parsed tool_specs path + # (format_tool) is the correct serialization. Pass the kwarg only when + # set so non-proxy callers (and their client doubles) keep the original + # call signature. + raw_tools_kwarg: dict[str, Any] = {} + if raw_openai_tools is not None and _attempt == 0: + raw_tools_kwarg["raw_openai_tools"] = raw_openai_tools + # Send if stream: response = await _send_streaming( client, api_messages, tool_specs, on_chunk, sampling, passthrough, inbound_anthropic_body=verbatim_body, + **raw_tools_kwarg, ) else: response = await client.send( api_messages, tools=tool_specs, sampling=sampling, passthrough=passthrough, inbound_anthropic_body=verbatim_body, + **raw_tools_kwarg, ) # Subsequent attempts (retries) are mutations regardless of outcome. 
verbatim_body = None @@ -329,12 +362,17 @@ async def _send_streaming( sampling: dict[str, Any] | None = None, passthrough: dict[str, Any] | None = None, inbound_anthropic_body: dict[str, Any] | None = None, + raw_openai_tools: RawOpenAITools | None = None, ) -> LLMResponse: """Send via streaming, forwarding chunks to on_chunk callback.""" response = None + raw_tools_kwarg: dict[str, Any] = {} + if raw_openai_tools is not None: + raw_tools_kwarg["raw_openai_tools"] = raw_openai_tools async for chunk in client.send_stream( api_messages, tools=tool_specs, sampling=sampling, passthrough=passthrough, inbound_anthropic_body=inbound_anthropic_body, + **raw_tools_kwarg, ): if on_chunk is not None: await on_chunk(chunk) diff --git a/src/forge/proxy/__main__.py b/src/forge/proxy/__main__.py index 4ef0688..16adf19 100644 --- a/src/forge/proxy/__main__.py +++ b/src/forge/proxy/__main__.py @@ -45,13 +45,6 @@ def main() -> None: ) parser.add_argument("--budget-tokens", type=int, help="Manual token budget") parser.add_argument("--extra-flags", nargs="*", help="Additional backend CLI flags") - parser.add_argument( - "--mode", - choices=["native", "prompt"], - default="native", - help="Function-calling mode (default: native). Use 'prompt' for " - "OpenAI-compatible backends without a function-calling template.", - ) parser.add_argument( "--backend-protocol", choices=["openai", "anthropic"], @@ -74,6 +67,12 @@ def main() -> None: help="Backend response timeout in seconds (default: 300)", ) parser.add_argument("--no-rescue", action="store_true", help="Disable rescue parsing") + parser.add_argument( + "--inject-respond-tool", + action="store_true", + help="Inject forge's synthetic respond() tool when the client sends " + "tools (keeps small models in tool-calling mode). 
Default off.", + ) parser.add_argument("--verbose", "-v", action="store_true", help="Verbose logging") args = parser.parse_args() @@ -108,7 +107,7 @@ def main() -> None: serialize=serialize, max_retries=args.max_retries, rescue_enabled=not args.no_rescue, - mode=args.mode, + inject_respond_tool=args.inject_respond_tool, backend_protocol=args.backend_protocol, backend_timeout=args.backend_timeout, ) diff --git a/src/forge/proxy/handler.py b/src/forge/proxy/handler.py index 1a6bcba..c95f30a 100644 --- a/src/forge/proxy/handler.py +++ b/src/forge/proxy/handler.py @@ -3,9 +3,10 @@ from __future__ import annotations import logging +from copy import deepcopy from typing import Any, Literal -from forge.clients.base import LLMClient +from forge.clients.base import LLMClient, format_tool from forge.context.manager import ContextManager from forge.core.inference import _get_usage, fold_and_serialize, run_inference from forge.core.workflow import ToolCall, ToolSpec, TextResponse @@ -100,12 +101,27 @@ def _extract_tool_names(tool_specs: list[ToolSpec]) -> list[str]: return [s.name for s in tool_specs] +def _raw_openai_tools(request_tools: Any) -> list[dict[str, Any]] | None: + """Return a detached deep copy of the inbound OpenAI tools array.""" + if not isinstance(request_tools, list) or not request_tools: + return None + return [deepcopy(tool) for tool in request_tools if isinstance(tool, dict)] + + +def _raw_openai_messages(request_messages: Any) -> list[dict[str, Any]] | None: + """Return a detached deep copy of the inbound OpenAI messages array.""" + if not isinstance(request_messages, list) or not request_messages: + return None + return [deepcopy(msg) for msg in request_messages if isinstance(msg, dict)] + + async def handle_chat_completions( body: dict[str, Any], client: LLMClient, context_manager: ContextManager, max_retries: int = 3, rescue_enabled: bool = True, + inject_respond_tool: bool = False, protocol: Literal["openai", "anthropic"] = "openai", ) -> dict[str, 
Any] | list[dict[str, Any]]: """Handle an inbound completions request. @@ -120,6 +136,11 @@ async def handle_chat_completions( context_manager: For context compaction. max_retries: Max consecutive retries for bad responses. rescue_enabled: Whether to attempt rescue parsing. + inject_respond_tool: When True and the client request supplies tools, + inject forge's synthetic respond() tool so the model stays in + tool-calling mode (the call is stripped from the outbound + response). Default False — the proxy forwards the client's tools + untouched unless explicitly opted in. protocol: Inbound wire format. ``openai`` for ``/v1/chat/completions``; ``anthropic`` for ``/v1/messages``. @@ -148,18 +169,38 @@ async def handle_chat_completions( # ADR-015. inbound_anthropic_body = body else: - messages = openai_to_messages(body.get("messages", [])) - tool_specs = _extract_tool_specs(body.get("tools")) + request_messages = body.get("messages", []) + request_tools = body.get("tools") + messages = openai_to_messages(request_messages) + tool_specs = _extract_tool_specs(request_tools) sampling = _extract_sampling(body) passthrough = _extract_passthrough(body) inbound_anthropic_body = None + # Detached verbatim copies of the client's OpenAI tools/messages. + # Forwarded to the native backend on the clean first attempt so it + # sees the exact schema/transcript the client authored, bypassing the + # lossy ToolSpec round-trip. tool_specs stays as forge's validation + # sidecar. (Anthropic protocol converts shapes itself → None.) + raw_tools_for_backend = _raw_openai_tools(request_tools) + raw_messages_for_backend = _raw_openai_messages(request_messages) + + if protocol == "anthropic": + raw_tools_for_backend = None + raw_messages_for_backend = None - # Inject respond tool when tools are present. The model calls - # respond(message="...") instead of producing bare text, keeping it - # in tool-calling mode where guardrails apply. 
The respond call is + # Optionally inject the respond tool (default off). When on, the model + # calls respond(message="...") instead of producing bare text, keeping it + # in tool-calling mode where guardrails apply. The respond call is # stripped from the outbound response — the client never sees it. - if tool_specs and not any(s.name == RESPOND_TOOL_NAME for s in tool_specs): - tool_specs.append(respond_spec()) + if ( + inject_respond_tool + and tool_specs + and not any(s.name == RESPOND_TOOL_NAME for s in tool_specs) + ): + respond = respond_spec() + tool_specs.append(respond) + if raw_tools_for_backend is not None: + raw_tools_for_backend.append(format_tool(respond)) tool_names = _extract_tool_names(tool_specs) @@ -168,7 +209,7 @@ async def handle_chat_completions( if not tool_specs: logger.info("No tools in request, passing through to backend") api_format = getattr(client, "api_format", "ollama") - api_messages = fold_and_serialize(messages, api_format) + api_messages = raw_messages_for_backend or fold_and_serialize(messages, api_format) response = await client.send( api_messages, tools=None, sampling=sampling, passthrough=passthrough, inbound_anthropic_body=inbound_anthropic_body, @@ -193,6 +234,8 @@ async def handle_chat_completions( sampling=sampling, passthrough=passthrough, inbound_anthropic_body=inbound_anthropic_body, + raw_openai_messages=raw_messages_for_backend, + raw_openai_tools=raw_tools_for_backend, ) except ToolCallError as exc: # Retries exhausted — the model kept returning text instead of tool diff --git a/src/forge/proxy/proxy.py b/src/forge/proxy/proxy.py index a4d3c8d..67cfd55 100644 --- a/src/forge/proxy/proxy.py +++ b/src/forge/proxy/proxy.py @@ -67,7 +67,7 @@ def __init__( serialize: bool | None = None, max_retries: int = 3, rescue_enabled: bool = True, - mode: Literal["native", "prompt"] = "native", + inject_respond_tool: bool = False, backend_protocol: Literal["openai", "anthropic"] = "openai", backend_timeout: float = 300.0, ) -> 
None: @@ -93,11 +93,11 @@ def __init__( managed, False for external). max_retries: Max consecutive retries for bad LLM responses. rescue_enabled: Attempt rescue parsing of text responses. - mode: Function-calling mode for OpenAI-compatible backends — - "native" uses the backend's native tools API, "prompt" - uses forge's prompt-injection fallback for backends - without a function-calling template. Not applicable to vLLM - (parses tool calls server-side) or the Anthropic protocol. + inject_respond_tool: When True, inject forge's synthetic respond() + tool into requests that already carry tools (keeps the model in + tool-calling mode). Default False. The proxy is native-only and + forwards the client's tools verbatim; prompt-injection mode is a + non-proxy WorkflowRunner feature. backend_protocol: Wire format of the external backend. ``openai`` (default) for llama.cpp, vLLM, Ollama. ``anthropic`` for Anthropic-shape downstreams (the official Anthropic API, @@ -108,13 +108,6 @@ def __init__( """ if backend_url is None and backend is None: raise ValueError("Provide either backend_url (external) or backend (managed)") - if backend_protocol == "anthropic" and mode == "prompt": - raise ValueError( - "mode='prompt' is not supported with backend_protocol='anthropic' — " - "Anthropic protocol has native tool calling; the prompt-injection " - "fallback only applies to OpenAI-shape backends without a function-" - "calling template." - ) if backend_protocol == "anthropic" and backend_url is None: raise ValueError( "backend_protocol='anthropic' requires external mode (backend_url=...). " @@ -125,11 +118,6 @@ def __init__( "backend='vllm' speaks the OpenAI protocol; backend_protocol='anthropic' " "is not applicable." ) - if backend == "vllm" and mode == "prompt": - raise ValueError( - "backend='vllm' parses tool calls server-side (native only); " - "mode='prompt' is not applicable." 
- ) if not math.isfinite(backend_timeout) or backend_timeout <= 0: raise ValueError("backend_timeout must be a finite value greater than 0") # Managed mode: each backend requires its own identity field. Fail @@ -155,7 +143,7 @@ def __init__( self._port = port self._max_retries = max_retries self._rescue_enabled = rescue_enabled - self._mode = mode + self._inject_respond_tool = inject_respond_tool self._backend_protocol = backend_protocol self._backend_timeout = backend_timeout @@ -236,6 +224,7 @@ async def _async_start(self, ready: threading.Event) -> None: serialize_requests=self._serialize, max_retries=self._max_retries, rescue_enabled=self._rescue_enabled, + inject_respond_tool=self._inject_respond_tool, ) await self._http_server.start() self._started = True @@ -310,7 +299,7 @@ async def _setup_external(self) -> tuple[LLMClient, ContextManager]: client = LlamafileClient( gguf_path=self._model or "default", base_url=base, - mode=self._mode, + mode="native", timeout=self._backend_timeout, ) @@ -336,11 +325,11 @@ async def _setup_managed(self) -> tuple[LLMClient, ContextManager]: assert self._backend is not None client = self._build_managed_client() - # The backend process is always launched in native mode (--jinja is - # harmless and enables the native tools API where available); prompt - # mode is a client-side injection concern carried by the client. - # Pass each backend only its own identity field — setup_backend - # enforces mutual exclusivity. + # The backend process is launched in native mode (--jinja enables the + # native tools API). The proxy is native-only — it forwards the + # client's tools verbatim and never prompt-injects (prompt mode is a + # non-proxy WorkflowRunner feature). Pass each backend only its own + # identity field — setup_backend enforces mutual exclusivity. 
server, context_manager = await setup_backend( backend=self._backend, model=self._model if self._backend == "ollama" else None, @@ -369,7 +358,7 @@ def _build_managed_client(self) -> LLMClient: return LlamafileClient( gguf_path=self._gguf or "default", base_url=base_url, - mode=self._mode, + mode="native", timeout=self._backend_timeout, ) if self._backend == "vllm": diff --git a/src/forge/proxy/server.py b/src/forge/proxy/server.py index 56174da..619af54 100644 --- a/src/forge/proxy/server.py +++ b/src/forge/proxy/server.py @@ -49,6 +49,7 @@ def __init__( serialize_requests: bool = True, max_retries: int = 3, rescue_enabled: bool = True, + inject_respond_tool: bool = False, ) -> None: self._client = client self._context_manager = context_manager @@ -56,6 +57,7 @@ def __init__( self._port = port self._max_retries = max_retries self._rescue_enabled = rescue_enabled + self._inject_respond_tool = inject_respond_tool self._server: asyncio.Server | None = None self._serialize = serialize_requests self._queue: asyncio.Queue[_QueueItem] = asyncio.Queue() @@ -306,6 +308,7 @@ async def _run_handler( context_manager=self._context_manager, max_retries=self._max_retries, rescue_enabled=self._rescue_enabled, + inject_respond_tool=self._inject_respond_tool, protocol=protocol, ) except Exception as exc: diff --git a/tests/unit/test_inference_passthrough.py b/tests/unit/test_inference_passthrough.py new file mode 100644 index 0000000..78aa851 --- /dev/null +++ b/tests/unit/test_inference_passthrough.py @@ -0,0 +1,100 @@ +"""Tests for run_inference's raw-OpenAI passthrough first-attempt gate. + +The proxy hands run_inference the client's verbatim OpenAI transcript/tools. +They must be forwarded ONLY on the clean first attempt; any forge mutation +(retry here) falls back to fold_and_serialize + the parsed tool_specs. 
+"""
+
+from unittest.mock import AsyncMock
+
+import pytest
+
+from forge.context.manager import ContextManager
+from forge.context.strategies import NoCompact
+from forge.core.inference import run_inference
+from forge.core.messages import Message, MessageMeta, MessageRole, MessageType
+from forge.core.workflow import TextResponse, ToolCall, ToolSpec
+from forge.guardrails import ErrorTracker, ResponseValidator
+
+
+def _client(*responses):
+    client = AsyncMock()
+    client.api_format = "ollama"
+    client.send = AsyncMock(side_effect=list(responses))
+    client.last_usage = {}
+    client._slot_id = 0
+    return client
+
+
+def _ctx():
+    return ContextManager(strategy=NoCompact(), budget_tokens=8192)
+
+
+def _search_spec():
+    return ToolSpec.from_json_schema(
+        name="search", description="", schema={"type": "object", "properties": {}},
+    )
+
+
+@pytest.mark.asyncio
+async def test_raw_used_on_first_attempt_folded_on_retry():
+    # Attempt 0: text (invalid → retry). Attempt 1: valid tool call.
+    client = _client(
+        TextResponse(content="just narrating, no tool"),
+        [ToolCall(tool="search", args={})],
+    )
+    messages = [Message(
+        MessageRole.USER, "folded-form",
+        MessageMeta(MessageType.USER_INPUT),
+    )]
+    raw_messages = [{"role": "user", "content": "VERBATIM", "name": "u1"}]
+    raw_tools = [{"type": "function", "function": {"name": "search", "parameters": {}}}]
+
+    result = await run_inference(
+        messages=messages,
+        client=client,
+        context_manager=_ctx(),
+        validator=ResponseValidator(["search"], rescue_enabled=True),
+        error_tracker=ErrorTracker(max_retries=2),
+        tool_specs=[_search_spec()],
+        raw_openai_messages=raw_messages,
+        raw_openai_tools=raw_tools,
+    )
+
+    assert result is not None
+    assert client.send.await_count == 2
+
+    # Attempt 0 (clean): forwarded the verbatim raw messages + raw tools.
+    first = client.send.call_args_list[0]
+    assert first.args[0] == raw_messages
+    assert first.kwargs["raw_openai_tools"] == raw_tools
+
+    # Attempt 1 (post-retry mutation): folded messages, no raw tools kwarg.
+    second = client.send.call_args_list[1]
+    assert second.args[0] != raw_messages
+    assert second.args[0][0]["content"] == "folded-form"
+    assert "raw_openai_tools" not in second.kwargs
+
+
+@pytest.mark.asyncio
+async def test_no_raw_falls_back_to_fold():
+    """Without raw_openai_* (the non-proxy runner path), folding is used and
+    no raw_openai_tools kwarg is passed to the client."""
+    client = _client([ToolCall(tool="search", args={})])
+    messages = [Message(
+        MessageRole.USER, "hello",
+        MessageMeta(MessageType.USER_INPUT),
+    )]
+
+    await run_inference(
+        messages=messages,
+        client=client,
+        context_manager=_ctx(),
+        validator=ResponseValidator(["search"], rescue_enabled=True),
+        error_tracker=ErrorTracker(max_retries=1),
+        tool_specs=[_search_spec()],
+    )
+
+    call = client.send.call_args
+    assert call.args[0][0]["content"] == "hello"
+    assert "raw_openai_tools" not in call.kwargs
diff --git a/tests/unit/test_proxy_handler.py b/tests/unit/test_proxy_handler.py
index 1a4203a..d417317 100644
--- a/tests/unit/test_proxy_handler.py
+++ b/tests/unit/test_proxy_handler.py
@@ -148,12 +148,13 @@ async def test_tool_call_stream(self):
 
     @pytest.mark.asyncio
     async def test_respond_tool_auto_injected(self):
-        """Respond tool is injected — model calling respond returns text."""
+        """With inject_respond_tool=True, a respond() call is stripped to text."""
         client = _mock_client([ToolCall(tool="respond", args={"message": "Hi!"})])
         client.last_usage = {0: TokenUsage(prompt_tokens=10, completion_tokens=5, total_tokens=15)}
-        
+
         result = await handle_chat_completions(
             _body(tools=[_tool_def("search")]), client, _context_manager(),
+            inject_respond_tool=True,
         )
         # respond is stripped — client sees text, not a tool call
         assert result["choices"][0]["message"]["content"] == "Hi!"
@@ -183,9 +184,10 @@ async def test_mixed_respond_and_tool_calls(self):
             ToolCall(tool="respond", args={"message": "also this"}),
         ])
         client.last_usage = {0: TokenUsage(prompt_tokens=10, completion_tokens=5, total_tokens=15)}
-        
+
         result = await handle_chat_completions(
             _body(tools=[_tool_def("search")]), client, _context_manager(),
+            inject_respond_tool=True,
         )
         tc = result["choices"][0]["message"]["tool_calls"]
         assert len(tc) == 1
@@ -201,8 +203,9 @@ async def test_respond_not_double_injected(self):
         tools = [_tool_def("search"), _tool_def("respond")]
         result = await handle_chat_completions(
             _body(tools=tools), client, _context_manager(),
+            inject_respond_tool=True,
         )
-        # Should still work — respond stripped to text
+        # Should still work — respond stripped to text (not double-injected)
         assert result["choices"][0]["message"]["content"] == "Hi!"
         assert result["usage"] == {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}
 
@@ -467,3 +470,82 @@ async def test_system_top_level_flows_into_messages(self):
         api_messages = client.send.call_args.args[0]
         assert api_messages[0]["role"] == "system"
         assert api_messages[0]["content"] == "You are helpful."
+
+
+# ── Native transparent passthrough ──────────────────────────
+
+
+class TestNativePassthrough:
+    """The proxy forwards the client's OpenAI tools/messages verbatim on the
+    clean first attempt, bypassing the lossy ToolSpec round-trip."""
+
+    @pytest.mark.asyncio
+    async def test_raw_tools_forwarded_verbatim(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        params = {
+            "type": "object",
+            "properties": {"q": {"type": "string", "description": "the query"}},
+            "required": ["q"],
+            "additionalProperties": False,
+        }
+        tools = [_tool_def("search", parameters=params)]
+        await handle_chat_completions(
+            _body(tools=tools), client, _context_manager(),
+        )
+        # The backend sees the client's exact tools array (full schema, no
+        # name/schema drift), not forge's reconstructed format_tool output.
+        sent = client.send.call_args.kwargs["raw_openai_tools"]
+        assert sent == tools
+        # Respond is NOT appended by default.
+        assert [t["function"]["name"] for t in sent] == ["search"]
+        # tool_specs (validation sidecar) still passed separately.
+        assert client.send.call_args.kwargs["tools"][0].name == "search"
+
+    @pytest.mark.asyncio
+    async def test_raw_messages_forwarded_verbatim(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        # An extra non-standard key proves no normalization/folding happened.
+        messages = [{"role": "user", "content": "hi", "name": "u1"}]
+        await handle_chat_completions(
+            _body(messages=messages, tools=[_tool_def("search")]),
+            client, _context_manager(),
+        )
+        sent_messages = client.send.call_args.args[0]
+        assert sent_messages == messages
+
+    @pytest.mark.asyncio
+    async def test_inbound_body_mutation_does_not_affect_sent(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        tools = [_tool_def("search")]
+        body = _body(tools=tools)
+        await handle_chat_completions(body, client, _context_manager())
+        # Mutate the caller's body after the call — detached copy is unaffected.
+        body["tools"][0]["function"]["name"] = "MUTATED"
+        body["messages"][0]["content"] = "MUTATED"
+        sent_tools = client.send.call_args.kwargs["raw_openai_tools"]
+        sent_messages = client.send.call_args.args[0]
+        assert sent_tools[0]["function"]["name"] == "search"
+        assert sent_messages[0]["content"] == "hi"
+
+    @pytest.mark.asyncio
+    async def test_respond_not_injected_by_default(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        await handle_chat_completions(
+            _body(tools=[_tool_def("search")]), client, _context_manager(),
+        )
+        sent = client.send.call_args.kwargs["raw_openai_tools"]
+        names = [t["function"]["name"] for t in sent]
+        assert "respond" not in names
+        spec_names = [s.name for s in client.send.call_args.kwargs["tools"]]
+        assert "respond" not in spec_names
+
+    @pytest.mark.asyncio
+    async def test_respond_injected_into_raw_tools_when_opted_in(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        await handle_chat_completions(
+            _body(tools=[_tool_def("search")]), client, _context_manager(),
+            inject_respond_tool=True,
+        )
+        sent = client.send.call_args.kwargs["raw_openai_tools"]
+        names = [t["function"]["name"] for t in sent]
+        assert names == ["search", "respond"]
diff --git a/tests/unit/test_proxy_path1.py b/tests/unit/test_proxy_path1.py
index 6a1d591..4cc8862 100644
--- a/tests/unit/test_proxy_path1.py
+++ b/tests/unit/test_proxy_path1.py
@@ -25,14 +25,6 @@
 
 
 class TestProxyServerValidation:
-    def test_anthropic_with_prompt_mode_rejected(self):
-        with pytest.raises(ValueError, match="mode='prompt'"):
-            ProxyServer(
-                backend_url="http://localhost:8080",
-                backend_protocol="anthropic",
-                mode="prompt",
-            )
-
     def test_anthropic_in_managed_mode_rejected(self):
         with pytest.raises(ValueError, match="external mode"):
             ProxyServer(
diff --git a/tests/unit/test_proxy_proxy.py b/tests/unit/test_proxy_proxy.py
index 56f8c34..ca8792f 100644
--- a/tests/unit/test_proxy_proxy.py
+++ b/tests/unit/test_proxy_proxy.py
@@ -21,7 +21,7 @@
 
 
 class TestConstructorValidation:
-    """__init__ validation: mode/protocol guards and managed identity rules."""
+    """__init__ validation: protocol guards and managed identity rules."""
 
     def test_neither_url_nor_backend_rejected(self) -> None:
         with pytest.raises(ValueError, match="Provide either backend_url"):
@@ -31,20 +31,10 @@ def test_anthropic_requires_external(self) -> None:
         with pytest.raises(ValueError, match="requires external mode"):
             ProxyServer(backend="llamaserver", gguf="m.gguf", backend_protocol="anthropic")
 
-    def test_anthropic_rejects_prompt_mode(self) -> None:
-        with pytest.raises(ValueError, match="mode='prompt' is not supported"):
-            ProxyServer(
-                backend_url="http://x", backend_protocol="anthropic", mode="prompt",
-            )
-
     def test_vllm_rejects_anthropic_protocol(self) -> None:
         with pytest.raises(ValueError, match="speaks the OpenAI protocol"):
             ProxyServer(backend_url="http://x:8000", backend="vllm", backend_protocol="anthropic")
 
-    def test_vllm_rejects_prompt_mode(self) -> None:
-        with pytest.raises(ValueError, match="parses tool calls server-side"):
-            ProxyServer(backend="vllm", model_path="/m", mode="prompt")
-
     # Managed identity rules
     def test_managed_ollama_requires_model(self) -> None:
         with pytest.raises(ValueError, match="backend='ollama' requires model"):
@@ -288,9 +278,10 @@ async def test_ollama_wiring(self) -> None:
         assert kwargs["client"] is client
 
     @pytest.mark.asyncio
-    async def test_managed_llamafile_carries_client_mode(self) -> None:
-        # prompt mode is a client-side concern; the server still starts native.
-        proxy = ProxyServer(backend="llamafile", gguf="/m/x.gguf", mode="prompt")
+    async def test_managed_llamafile_client_is_native(self) -> None:
+        # The proxy is native-only: the managed LlamafileClient is built in
+        # native mode and the backend process is launched native too.
+        proxy = ProxyServer(backend="llamafile", gguf="/m/x.gguf")
         mock_ctx = ContextManager.__new__(ContextManager)
         mock_ctx.budget_tokens = 8192
         with patch(
@@ -299,7 +290,7 @@ async def test_managed_llamafile_carries_client_mode(self) -> None:
         ) as mock_setup:
             client, _ = await proxy._setup_managed()
         assert isinstance(client, LlamafileClient)
-        assert client.mode == "prompt"
+        assert client.mode == "native"
         assert mock_setup.await_args.kwargs["mode"] == "native"
 
 
From fa9be50f34784f88f1f0652c454b0c8feefe31e6 Mon Sep 17 00:00:00 2001
From: Antoine Zambelli
Date: Sun, 31 May 2026 17:08:02 -0500
Subject: [PATCH 2/8] Add prompt-injection as opt-in proxy capability (--backend-capability)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The proxy serves tool-call-capable backends natively (verbatim tool/message
passthrough). This adds prompt-injection back as an explicit opt-in for
non-function-calling backends (llama.cpp / llamafile without a tool template).

- New --backend-capability {native,prompt} (default native), declared once at
  construction and frozen — no runtime probing or mid-request mode mutation.
- prompt capability reuses LlamafileClient's existing prompt path (build the
  tool prompt, downgrade tool/assistant-tool_call history to text, parse the
  JSON tool call back into native tool_calls). No client changes.
- Handler suppresses verbatim raw passthrough when in prompt mode so inference
  folds normally and the client injects the tool prompt.
- Rejected for backends that are native-only (vLLM, Ollama, anthropic protocol).
- Docs: BACKEND_SETUP + ADR-012 updated to native-first + prompt opt-in.
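
As an illustration of the "declared once at construction and frozen" rule in the
bullets above, the guard can be sketched as a standalone function. This is a
minimal sketch, not the patch's code: the names `check_backend_capability` and
`native_passthrough_for` are hypothetical (in this patch the checks sit inline
in `ProxyServer.__init__`, and the flag is computed where the HTTP server is
built), and the check for an unknown capability value is an added assumption.

```python
from typing import Optional


def check_backend_capability(
    capability: str,
    backend: Optional[str] = None,
    backend_protocol: str = "openai",
) -> str:
    """Hypothetical standalone form of the constructor-time guard."""
    # Assumption for this sketch: unknown values are rejected up front.
    if capability not in ("native", "prompt"):
        raise ValueError(f"unknown backend_capability: {capability!r}")
    if capability == "prompt":
        # The anthropic protocol does its own tool conversion.
        if backend_protocol == "anthropic":
            raise ValueError(
                "backend_capability='prompt' is not supported with the "
                "anthropic protocol (native tool calling only)."
            )
        # vLLM and Ollama clients have no prompt path.
        if backend in ("vllm", "ollama"):
            raise ValueError(
                "backend_capability='prompt' is only supported for "
                f"llama.cpp/llamafile backends, not backend={backend!r}."
            )
    return capability


def native_passthrough_for(capability: str) -> bool:
    """Verbatim passthrough applies to the native capability only."""
    return capability == "native"
```

Because the value is validated once and then only read, the handler never has to
probe the backend or switch modes in the middle of a request.
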
Co-Authored-By: Claude Opus 4.8
---
 docs/BACKEND_SETUP.md              |  2 +-
 docs/decisions/012-openai-proxy.md | 31 ++++++++----
 src/forge/proxy/__main__.py        | 12 +++++
 src/forge/proxy/handler.py         | 22 ++++++++-
 src/forge/proxy/proxy.py           | 50 +++++++++++++++----
 src/forge/proxy/server.py          |  3 ++
 tests/unit/test_proxy_handler.py   | 46 +++++++++++++++++
 tests/unit/test_proxy_proxy.py     | 79 ++++++++++++++++++++++++++++++
 8 files changed, 223 insertions(+), 22 deletions(-)

diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md
index 0bb0297..024e762 100644
--- a/docs/BACKEND_SETUP.md
+++ b/docs/BACKEND_SETUP.md
@@ -75,7 +75,7 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999
 
 llamafile does **not** support native function calling — forge's `LlamafileClient` falls back to prompt-injected mode automatically (`mode="auto"`), or you can force it with `mode="prompt"`.
 
-> **Proxy note:** prompt-injection mode is a **direct-client / WorkflowRunner** feature. The OpenAI-compatible **proxy is native-only** — it forwards the client's tools verbatim and does not prompt-inject (see ADR-012). Put an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) behind the proxy; a non-FC backend like llamafile will degrade to passing text through.
+> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. See ADR-012.
 
 Smoke-test:
 
diff --git a/docs/decisions/012-openai-proxy.md b/docs/decisions/012-openai-proxy.md
index 90bc822..8a5d8d0 100644
--- a/docs/decisions/012-openai-proxy.md
+++ b/docs/decisions/012-openai-proxy.md
@@ -104,15 +104,28 @@ The proxy fully buffers each response from the backend before deciding what to d
 4. **Client disconnect handling** -- detect TCP drop, cancel in-flight backend request, release inference lock.
 5. **Testing** -- unit tests for extraction, integration tests with mock backend, smoke test with real llama-server.
 
-### Revision: native-only + transparent passthrough
-
-The proxy is **native-tool-call-only**. It targets backends that speak the
-native OpenAI tools API (llama.cpp with a tool-calling chat template / `--jinja`,
-vLLM, Ollama, Anthropic). There is no `--mode` flag and no prompt-injection
-fallback in the proxy — prompt-injection mode (`build_tool_prompt`,
-`_downgrade_messages`, the `mode="auto"` HTTP-error fallback) is a non-proxy
-**WorkflowRunner / direct-client** feature only, retained because it still wins
-for some models in full-guardrail workflow evals.
+### Revision: native-first, with opt-in prompt capability
+
+The proxy is **native-first**. By default (`--backend-capability native`) it
+targets backends that speak the native OpenAI tools API (llama.cpp with a
+tool-calling chat template / `--jinja`, vLLM, Ollama, Anthropic) and forwards
+the client's request verbatim (below).
+
+Prompt-injection is available as an **explicit opt-in**
+(`--backend-capability prompt`, llama.cpp/llamafile only) for non-FC backends —
+it reuses the WorkflowRunner's prompt path (`build_tool_prompt`,
+`_downgrade_messages`, `extract_tool_call`) so there is **one** prompt
+implementation, not a proxy-specific fork. The capability is **declared once at
+construction and frozen** — there is deliberately **no `mode="auto"` runtime
+probe** (the old auto/HTTP-error fallback that mutated state mid-request was the
+root of the original tangle; it is not reintroduced). In prompt capability the
+verbatim passthrough is suppressed (`native_passthrough=False`): tools are
+serialized into the prompt, so a raw native transcript would be meaningless.
+
+History: this revision originally cut prompt mode from the proxy entirely
+("native-only"). Prompt was then re-added as the opt-in capability above —
+native-first is a cleaner story than a backwards-incompatible drop, and non-FC
+backends (e.g. llamafile) stay usable through the proxy.
 
 Rationale: the proxy is a transparent layer for an external agent that already
 speaks native FC to a native-FC backend. A traced capture showed the native
diff --git a/src/forge/proxy/__main__.py b/src/forge/proxy/__main__.py
index 16adf19..45b7424 100644
--- a/src/forge/proxy/__main__.py
+++ b/src/forge/proxy/__main__.py
@@ -67,6 +67,17 @@ def main() -> None:
         help="Backend response timeout in seconds (default: 300)",
     )
     parser.add_argument("--no-rescue", action="store_true", help="Disable rescue parsing")
+    parser.add_argument(
+        "--backend-capability",
+        choices=["native", "prompt"],
+        default="native",
+        help="Tool-calling protocol for the backend (default: native). "
+             "'native' forwards the client's tools verbatim to a "
+             "function-calling-capable backend. 'prompt' opts into "
+             "prompt-injection for non-FC llama.cpp/llamafile backends "
+             "(strips tools into the prompt, parses the JSON call back). "
+             "Frozen at startup — never probed or switched mid-stream.",
+    )
     parser.add_argument(
         "--inject-respond-tool",
         action="store_true",
@@ -107,6 +118,7 @@ def main() -> None:
         serialize=serialize,
         max_retries=args.max_retries,
         rescue_enabled=not args.no_rescue,
+        backend_capability=args.backend_capability,
         inject_respond_tool=args.inject_respond_tool,
         backend_protocol=args.backend_protocol,
         backend_timeout=args.backend_timeout,
diff --git a/src/forge/proxy/handler.py b/src/forge/proxy/handler.py
index c95f30a..b0ecffc 100644
--- a/src/forge/proxy/handler.py
+++ b/src/forge/proxy/handler.py
@@ -121,6 +121,7 @@ async def handle_chat_completions(
     context_manager: ContextManager,
     max_retries: int = 3,
     rescue_enabled: bool = True,
+    native_passthrough: bool = True,
     inject_respond_tool: bool = False,
     protocol: Literal["openai", "anthropic"] = "openai",
 ) -> dict[str, Any] | list[dict[str, Any]]:
@@ -136,6 +137,13 @@ async def handle_chat_completions(
         context_manager: For context compaction.
         max_retries: Max consecutive retries for bad responses.
         rescue_enabled: Whether to attempt rescue parsing.
+        native_passthrough: When True (default, native capability), forward the
+            client's verbatim OpenAI tools/messages to the backend on the clean
+            first attempt (transparent passthrough). When False (prompt
+            capability), suppress the raw passthrough so the request folds
+            normally and the client's prompt path injects the tool prompt and
+            downgrades tool history — the raw passthrough is meaningless when
+            tools are serialized into the prompt text.
         inject_respond_tool: When True and the client request supplies tools,
             inject forge's synthetic respond() tool so the model stays in
             tool-calling mode (the call is stripped from the outbound
@@ -181,8 +189,18 @@ async def handle_chat_completions(
     # sees the exact schema/transcript the client authored, bypassing the
     # lossy ToolSpec round-trip. tool_specs stays as forge's validation
     # sidecar. (Anthropic protocol converts shapes itself → None.)
-    raw_tools_for_backend = _raw_openai_tools(request_tools)
-    raw_messages_for_backend = _raw_openai_messages(request_messages)
+    #
+    # In prompt capability (native_passthrough=False) we suppress the raw
+    # passthrough: the request folds normally and the client's prompt path
+    # (LlamafileClient._send_prompt) strips the tools into the prompt and
+    # downgrades tool history. A verbatim native transcript is meaningless
+    # once tools are injected as prompt text.
+    if native_passthrough:
+        raw_tools_for_backend = _raw_openai_tools(request_tools)
+        raw_messages_for_backend = _raw_openai_messages(request_messages)
+    else:
+        raw_tools_for_backend = None
+        raw_messages_for_backend = None
     if protocol == "anthropic":
         raw_tools_for_backend = None
 
diff --git a/src/forge/proxy/proxy.py b/src/forge/proxy/proxy.py
index 67cfd55..bc8b577 100644
--- a/src/forge/proxy/proxy.py
+++ b/src/forge/proxy/proxy.py
@@ -67,6 +67,7 @@ def __init__(
         serialize: bool | None = None,
         max_retries: int = 3,
         rescue_enabled: bool = True,
+        backend_capability: Literal["native", "prompt"] = "native",
         inject_respond_tool: bool = False,
         backend_protocol: Literal["openai", "anthropic"] = "openai",
         backend_timeout: float = 300.0,
@@ -93,11 +94,20 @@ def __init__(
                 managed, False for external).
             max_retries: Max consecutive retries for bad LLM responses.
             rescue_enabled: Attempt rescue parsing of text responses.
+            backend_capability: Tool-calling protocol for the backend.
+                ``native`` (default) forwards the client's OpenAI tools/messages
+                verbatim to a function-calling-capable backend (transparent
+                passthrough). ``prompt`` opts into prompt-injection for a non-FC
+                llama.cpp/llamafile backend — tools are stripped into the prompt
+                and the JSON tool call is parsed back out (the same path the
+                WorkflowRunner uses). Only valid for llama.cpp/llamafile
+                backends; rejected for vllm/ollama and the anthropic protocol.
+                Selected once at construction and frozen — never probed or
+                switched mid-stream.
             inject_respond_tool: When True, inject forge's synthetic respond()
                 tool into requests that already carry tools (keeps the model in
-                tool-calling mode). Default False. The proxy is native-only and
-                forwards the client's tools verbatim; prompt-injection mode is a
-                non-proxy WorkflowRunner feature.
+                tool-calling mode). Default False. Orthogonal to
+                backend_capability — works in both native and prompt modes.
             backend_protocol: Wire format of the external backend.
                 ``openai`` (default) for llama.cpp, vLLM, Ollama.
                 ``anthropic`` for Anthropic-shape downstreams (the official Anthropic API,
@@ -118,6 +128,22 @@ def __init__(
                 "backend='vllm' speaks the OpenAI protocol; backend_protocol='anthropic' "
                 "is not applicable."
             )
+        # Prompt-injection is a llama.cpp/llamafile capability only. vLLM and
+        # Ollama clients are native-only (they accept-ignore raw tools and have
+        # no prompt path); the anthropic protocol does its own tool conversion.
+        # backend=None (external) defaults to the llama.cpp adapter, which
+        # supports prompt — so only vllm/ollama and anthropic are rejected.
+        if backend_capability == "prompt":
+            if backend_protocol == "anthropic":
+                raise ValueError(
+                    "backend_capability='prompt' is not supported with the "
+                    "anthropic protocol (native tool calling only)."
+                )
+            if backend in ("vllm", "ollama"):
+                raise ValueError(
+                    f"backend_capability='prompt' is only supported for "
+                    f"llama.cpp/llamafile backends, not backend={backend!r}."
+                )
         if not math.isfinite(backend_timeout) or backend_timeout <= 0:
             raise ValueError("backend_timeout must be a finite value greater than 0")
         # Managed mode: each backend requires its own identity field. Fail
@@ -143,6 +169,7 @@ def __init__(
         self._port = port
         self._max_retries = max_retries
         self._rescue_enabled = rescue_enabled
+        self._backend_capability = backend_capability
         self._inject_respond_tool = inject_respond_tool
         self._backend_protocol = backend_protocol
         self._backend_timeout = backend_timeout
@@ -224,6 +251,7 @@ async def _async_start(self, ready: threading.Event) -> None:
             serialize_requests=self._serialize,
             max_retries=self._max_retries,
             rescue_enabled=self._rescue_enabled,
+            native_passthrough=self._backend_capability == "native",
             inject_respond_tool=self._inject_respond_tool,
         )
         await self._http_server.start()
@@ -299,7 +327,7 @@ async def _setup_external(self) -> tuple[LLMClient, ContextManager]:
             client = LlamafileClient(
                 gguf_path=self._model or "default",
                 base_url=base,
-                mode="native",
+                mode=self._backend_capability,
                 timeout=self._backend_timeout,
             )
 
@@ -325,11 +353,13 @@ async def _setup_managed(self) -> tuple[LLMClient, ContextManager]:
         assert self._backend is not None
         client = self._build_managed_client()
 
-        # The backend process is launched in native mode (--jinja enables the
-        # native tools API). The proxy is native-only — it forwards the
-        # client's tools verbatim and never prompt-injects (prompt mode is a
-        # non-proxy WorkflowRunner feature). Pass each backend only its own
-        # identity field — setup_backend enforces mutual exclusivity.
+        # The backend process is always launched in native mode (--jinja enables
+        # the native tools API). This is independent of backend_capability: in
+        # prompt capability the proxy simply doesn't send native tools, so a
+        # native-launched backend (jinja template present but unused) serves the
+        # prompt-injected request fine. Keeping launch native avoids changing
+        # backend startup flags for the opt-in path. Pass each backend only its
+        # own identity field — setup_backend enforces mutual exclusivity.
         server, context_manager = await setup_backend(
             backend=self._backend,
             model=self._model if self._backend == "ollama" else None,
@@ -358,7 +388,7 @@ def _build_managed_client(self) -> LLMClient:
             return LlamafileClient(
                 gguf_path=self._gguf or "default",
                 base_url=base_url,
-                mode="native",
+                mode=self._backend_capability,
                 timeout=self._backend_timeout,
             )
         if self._backend == "vllm":
diff --git a/src/forge/proxy/server.py b/src/forge/proxy/server.py
index 619af54..3ca3149 100644
--- a/src/forge/proxy/server.py
+++ b/src/forge/proxy/server.py
@@ -49,6 +49,7 @@ def __init__(
         serialize_requests: bool = True,
         max_retries: int = 3,
         rescue_enabled: bool = True,
+        native_passthrough: bool = True,
         inject_respond_tool: bool = False,
     ) -> None:
         self._client = client
@@ -57,6 +58,7 @@ def __init__(
         self._port = port
         self._max_retries = max_retries
         self._rescue_enabled = rescue_enabled
+        self._native_passthrough = native_passthrough
         self._inject_respond_tool = inject_respond_tool
         self._server: asyncio.Server | None = None
         self._serialize = serialize_requests
@@ -308,6 +310,7 @@ async def _run_handler(
                 context_manager=self._context_manager,
                 max_retries=self._max_retries,
                 rescue_enabled=self._rescue_enabled,
+                native_passthrough=self._native_passthrough,
                 inject_respond_tool=self._inject_respond_tool,
                 protocol=protocol,
             )
diff --git a/tests/unit/test_proxy_handler.py b/tests/unit/test_proxy_handler.py
index d417317..e106667 100644
--- a/tests/unit/test_proxy_handler.py
+++ b/tests/unit/test_proxy_handler.py
@@ -549,3 +549,49 @@ async def test_respond_injected_into_raw_tools_when_opted_in(self):
         sent = client.send.call_args.kwargs["raw_openai_tools"]
         names = [t["function"]["name"] for t in sent]
         assert names == ["search", "respond"]
+
+
+# ── Prompt capability handoff ───────────────────────────────
+
+
+class TestPromptCapabilityHandoff:
+    """In prompt capability (native_passthrough=False) the handler suppresses
+    the verbatim passthrough so the request folds normally and the client's
+    prompt path injects the tools. (The injection itself is covered by the
+    LlamafileClient prompt-mode tests.)"""
+
+    @pytest.mark.asyncio
+    async def test_prompt_mode_suppresses_raw_tools(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        await handle_chat_completions(
+            _body(tools=[_tool_def("search")]), client, _context_manager(),
+            native_passthrough=False,
+        )
+        # No verbatim tools forwarded — the client's prompt path injects them.
+        assert "raw_openai_tools" not in client.send.call_args.kwargs
+        # tool_specs (the source for build_tool_prompt) are still passed.
+        assert client.send.call_args.kwargs["tools"][0].name == "search"
+
+    @pytest.mark.asyncio
+    async def test_prompt_mode_folds_messages_not_verbatim(self):
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        # A non-standard key would survive verbatim passthrough but is dropped
+        # by fold_and_serialize — proving the raw transcript was NOT forwarded.
+        messages = [{"role": "user", "content": "hi", "name": "u1"}]
+        await handle_chat_completions(
+            _body(messages=messages, tools=[_tool_def("search")]),
+            client, _context_manager(), native_passthrough=False,
+        )
+        sent_messages = client.send.call_args.args[0]
+        assert sent_messages != messages
+        assert "name" not in sent_messages[0]
+
+    @pytest.mark.asyncio
+    async def test_native_default_still_forwards_raw(self):
+        # Sanity: default (native) path is unaffected by the new param.
+        client = _mock_client([ToolCall(tool="search", args={"q": "x"})])
+        tools = [_tool_def("search")]
+        await handle_chat_completions(
+            _body(tools=tools), client, _context_manager(),
+        )
+        assert client.send.call_args.kwargs["raw_openai_tools"] == tools
diff --git a/tests/unit/test_proxy_proxy.py b/tests/unit/test_proxy_proxy.py
index ca8792f..181c2d1 100644
--- a/tests/unit/test_proxy_proxy.py
+++ b/tests/unit/test_proxy_proxy.py
@@ -294,6 +294,85 @@ async def test_managed_llamafile_client_is_native(self) -> None:
         assert mock_setup.await_args.kwargs["mode"] == "native"
 
 
+class TestBackendCapability:
+    """backend_capability selects the tool-calling protocol, declared once at
+    construction and frozen. native (default) = verbatim passthrough; prompt =
+    opt-in prompt-injection for non-FC llama.cpp/llamafile backends."""
+
+    def test_default_is_native(self) -> None:
+        assert ProxyServer(backend_url="http://x:8080")._backend_capability == "native"
+
+    def test_prompt_stored(self) -> None:
+        proxy = ProxyServer(backend_url="http://x:8080", backend_capability="prompt")
+        assert proxy._backend_capability == "prompt"
+
+    # Guards: prompt is a llama.cpp/llamafile capability only.
+    def test_prompt_rejects_vllm(self) -> None:
+        with pytest.raises(ValueError, match="only supported for"):
+            ProxyServer(backend_url="http://x:8000", backend="vllm", backend_capability="prompt")
+
+    def test_prompt_rejects_ollama(self) -> None:
+        with pytest.raises(ValueError, match="only supported for"):
+            ProxyServer(backend="ollama", model="m", backend_capability="prompt")
+
+    def test_prompt_rejects_anthropic_protocol(self) -> None:
+        with pytest.raises(ValueError, match="not supported with the anthropic"):
+            ProxyServer(
+                backend_url="http://x:8080",
+                backend_protocol="anthropic",
+                backend_capability="prompt",
+            )
+
+    def test_prompt_allowed_for_external_llamacpp(self) -> None:
+        # backend=None (external) defaults to the llama.cpp adapter → prompt ok.
+        ProxyServer(backend_url="http://x:8080", backend_capability="prompt")
+        ProxyServer(backend="llamafile", gguf="m.gguf", backend_capability="prompt")
+
+    @pytest.mark.asyncio
+    async def test_external_default_builds_native_client(self) -> None:
+        proxy = ProxyServer(backend_url="http://localhost:8080", budget_tokens=8192)
+        client, _ = await proxy._setup_external()
+        assert isinstance(client, LlamafileClient)
+        assert client.mode == "native"
+
+    @pytest.mark.asyncio
+    async def test_external_prompt_builds_prompt_client(self) -> None:
+        proxy = ProxyServer(
+            backend_url="http://localhost:8080",
+            backend_capability="prompt",
+            budget_tokens=8192,
+        )
+        client, _ = await proxy._setup_external()
+        assert isinstance(client, LlamafileClient)
+        assert client.mode == "prompt"
+
+    @pytest.mark.asyncio
+    async def test_managed_prompt_client_is_prompt_but_launch_native(self) -> None:
+        # The managed LlamafileClient runs in prompt mode, but the backend
+        # process is still launched native (--jinja present, just unused).
+        proxy = ProxyServer(
+            backend="llamafile", gguf="/m/x.gguf", backend_capability="prompt",
+        )
+        mock_ctx = ContextManager.__new__(ContextManager)
+        mock_ctx.budget_tokens = 8192
+        with patch(
+            "forge.proxy.proxy.setup_backend",
+            new_callable=AsyncMock, return_value=(MagicMock(), mock_ctx),
+        ) as mock_setup:
+            client, _ = await proxy._setup_managed()
+        assert isinstance(client, LlamafileClient)
+        assert client.mode == "prompt"
+        assert mock_setup.await_args.kwargs["mode"] == "native"
+
+    def test_native_passthrough_forwarded_to_http_server(self) -> None:
+        # native → native_passthrough True; prompt → False.
+ assert (ProxyServer(backend_url="http://x")._backend_capability == "native") + assert ( + ProxyServer(backend_url="http://x", backend_capability="prompt") + ._backend_capability == "prompt" + ) + + class TestLifecycle: """start()/stop() thread + state management.""" From de936109d71a46ad275798318ec25264a17a6f3a Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Sun, 31 May 2026 17:59:24 -0500 Subject: [PATCH 3/8] Proxy: log effective backend_timeout at startup The configurable backend_timeout (#91) was validated, stored, and threaded into every client request, but never surfaced at launch. Extend the "Proxy ready" line to report the effective value so the operative timeout is visible/diagnosable from the startup log. Co-Authored-By: Claude Opus 4.8 --- src/forge/proxy/proxy.py | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/src/forge/proxy/proxy.py b/src/forge/proxy/proxy.py index bc8b577..c30cfca 100644 --- a/src/forge/proxy/proxy.py +++ b/src/forge/proxy/proxy.py @@ -211,7 +211,11 @@ def start(self) -> None: if not self._started: raise RuntimeError("Proxy failed to start") - logger.info("Proxy ready at %s", self.url) + logger.info( + "Proxy ready at %s (backend_timeout=%.1fs)", + self.url, + self._backend_timeout, + ) def stop(self) -> None: """Stop the proxy (and managed backend if applicable).""" From 49645da613fdb6d4ea7f339f1b554f736bd71608 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Sun, 31 May 2026 18:05:19 -0500 Subject: [PATCH 4/8] vLLM: single source of truth for model identity (#75) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit VLLMClient kept two identity fields with distinct roles — model_path (the verbatim wire "model" field, which vLLM validates against its --served-model-name) and model (the derived registry-lookup key). 
The proxy's external-mode served-name adoption set both by hand (model_path = served; model = served), duplicating the derivation logic and storing the full served name where the constructor's rule stores the stem. Extract the path->key derivation into _derive_model_field and wrap both assignments in _set_model_identity, then call it from __init__ and from the proxy. External adoption now upholds the same (model_path, model) invariant as construction: an HF-repo-id served name reaches the wire verbatim while the registry key is the derived stem. Co-Authored-By: Claude Opus 4.8 --- src/forge/clients/vllm.py | 52 +++++++++++++++++++++++----------- src/forge/proxy/proxy.py | 3 +- tests/unit/test_proxy_proxy.py | 16 +++++++++++ 3 files changed, 52 insertions(+), 19 deletions(-) diff --git a/src/forge/clients/vllm.py b/src/forge/clients/vllm.py index ec0b574..5bdcb96 100644 --- a/src/forge/clients/vllm.py +++ b/src/forge/clients/vllm.py @@ -60,23 +60,12 @@ def __init__( recommended_sampling: bool = False, ) -> None: self.base_url = base_url - # model_path is the canonical identity. vLLM accepts either a local - # directory containing safetensors + config or a HuggingFace repo id - # (e.g. "google/gemma-4-26B-A4B-it"). We pass it through as-is in - # the wire-format "model" field and as the sampling-defaults lookup - # key (using the path stem for directory paths so registry lookups - # match the existing GGUF-stem convention). - self.model_path = str(model_path) - path_obj = Path(self.model_path) - # If model_path is a filesystem path, use the directory name as the - # registry lookup key. If it's an HF repo id (no leading slash, has - # a "/"), use the trailing segment. Otherwise the full string. 
- if path_obj.is_absolute() or path_obj.exists(): - self.model = path_obj.name - elif "/" in self.model_path: - self.model = self.model_path.split("/")[-1] - else: - self.model = self.model_path + # model_path is the canonical identity, sent verbatim in the wire + # "model" field. self.model is the derived registry-lookup key. Both + # are set together so the (model_path, model) invariant holds — see + # _set_model_identity. Must run before apply_sampling_defaults below, + # which reads self.model. + self._set_model_identity(model_path) # Apply per-model recommended sampling defaults. Caller's explicit # (non-None) kwargs win over the map field-by-field. @@ -103,6 +92,35 @@ async def aclose(self) -> None: """Close the underlying httpx connection pool.""" await self._http.aclose() + @staticmethod + def _derive_model_field(model_path: str) -> str: + """Derive the sampling-registry lookup key from the canonical path. + + vLLM accepts either a local directory (safetensors + config) or an HF + repo id (e.g. "google/gemma-4-26B-A4B-it"). The lookup key uses the + path stem so registry lookups match the existing GGUF-stem convention: + a filesystem path → its directory name; an HF repo id (has "/") → its + trailing segment; anything else → the string unchanged. + """ + path_obj = Path(model_path) + if path_obj.is_absolute() or path_obj.exists(): + return path_obj.name + if "/" in model_path: + return model_path.split("/")[-1] + return model_path + + def _set_model_identity(self, model_path: str | Path) -> None: + """Set both identity fields atomically from one canonical path. + + ``model_path`` is the wire "model" field (sent verbatim); ``model`` is + the derived registry key. Used by ``__init__`` and by the proxy's + external-mode served-name adoption, so the ``(model_path, model)`` + invariant holds the same way in both — instead of mutating the two + fields separately after served-name discovery. 
+ """ + self.model_path = str(model_path) + self.model = self._derive_model_field(self.model_path) + # Sampling fields recognized in per-call overrides. ``seed`` is # accepted only as a per-call override (not an instance field). # ``chat_template_kwargs`` is a nested dict of Jinja template variables diff --git a/src/forge/proxy/proxy.py b/src/forge/proxy/proxy.py index c30cfca..6a471cd 100644 --- a/src/forge/proxy/proxy.py +++ b/src/forge/proxy/proxy.py @@ -314,8 +314,7 @@ async def _setup_external(self) -> tuple[LLMClient, ContextManager]: served = await client.get_served_model_name() if served: logger.info("Discovered vLLM served model name: %s", served) - client.model_path = served - client.model = served + client._set_model_identity(served) else: logger.warning( "Could not discover a served model name from %s/models; " diff --git a/tests/unit/test_proxy_proxy.py b/tests/unit/test_proxy_proxy.py index 181c2d1..871ac45 100644 --- a/tests/unit/test_proxy_proxy.py +++ b/tests/unit/test_proxy_proxy.py @@ -145,6 +145,22 @@ async def test_vllm_adopts_served_model_name(self) -> None: assert client.model_path == "my-awq-model" assert client.model == "my-awq-model" + @pytest.mark.asyncio + async def test_vllm_served_repo_id_keeps_wire_path_derives_registry_key(self) -> None: + # An HF-repo-id served name must reach the wire verbatim (vLLM validates + # it), while the registry key is the derived stem — the (model_path, + # model) invariant, applied to served-name adoption. 
+ proxy = ProxyServer( + backend_url="http://localhost:8000", backend="vllm", budget_tokens=8192, + ) + with patch.object( + VLLMClient, "get_served_model_name", + new_callable=AsyncMock, return_value="google/gemma-4-26B-A4B-it", + ): + client, _ = await proxy._setup_external() + assert client.model_path == "google/gemma-4-26B-A4B-it" + assert client.model == "gemma-4-26B-A4B-it" + @pytest.mark.asyncio async def test_vllm_keeps_placeholder_when_discovery_fails(self) -> None: proxy = ProxyServer( From f18242cf3c8ff711068d7c3f0597113621c4df64 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Sun, 31 May 2026 18:16:03 -0500 Subject: [PATCH 5/8] Clients: consistent malformed-tool-call + response-shape handling Audited malformed-tool-call and unexpected-payload handling across the OpenAI-shape clients against the reference set by OpenAICompatClient (#89) and LlamafileClient. Standardize on one principle, applied uniformly: - Malformed argument JSON (a model mistake) -> TextResponse, routing the raw output back through the inference loop so the rescue/retry path can recover. - A broken provider envelope (missing choices/message) or unexpected args type (a contract violation, not the model's fault) -> BackendError: fail loud and consistent, never a stray KeyError/IndexError. Changes: - vLLM: replace the bare-json.loads _parse_tool_args (which *raised* on malformed args, unlike llamafile's retry-driving TextResponse) with a _parse_tool_calls mirroring the reference. Route both send() and send_stream() through it so streaming and non-streaming agree: a fully accumulated but unparseable arguments string finalizes as a TextResponse, not an exception. - llamafile / openai_compat: guard the bare data["choices"][0]["message"] subscripts -> BackendError on a broken envelope (matching what vLLM already did for choices). llamafile also hardens function/name access. 
- ollama: defensive .get on function/name (both paths); document that Ollama emits dict args by contract, so no json.loads is needed there. Tests: vLLM _parse_tool_calls (string/dict/empty/malformed/unexpected/missing- function/reasoning) + streaming malformed-fragment parity; envelope-guard tests for llamafile and openai_compat. 1092 unit tests green. Co-Authored-By: Claude Opus 4.8 --- src/forge/clients/llamafile.py | 10 ++- src/forge/clients/ollama.py | 12 ++- src/forge/clients/openai_compat.py | 5 +- src/forge/clients/vllm.py | 98 +++++++++++++++++-------- tests/unit/test_llamafile_client.py | 10 +++ tests/unit/test_openai_compat_client.py | 10 +++ tests/unit/test_vllm_client.py | 76 +++++++++++++++++-- 7 files changed, 175 insertions(+), 46 deletions(-) diff --git a/src/forge/clients/llamafile.py b/src/forge/clients/llamafile.py index 7140794..c04bcec 100644 --- a/src/forge/clients/llamafile.py +++ b/src/forge/clients/llamafile.py @@ -540,8 +540,10 @@ async def _send_native( data = resp.json() self._record_usage(data) - top_choice = data["choices"][0] - choice = top_choice["message"] + choices = data.get("choices") or [] + if not choices: + raise BackendError(500, f"response has no choices: {data}") + choice = choices[0].get("message", {}) raw_tool_calls = choice.get("tool_calls") if raw_tool_calls: reasoning = self._resolve_reasoning( @@ -550,7 +552,7 @@ async def _send_native( ) result_calls: list[ToolCall] = [] for i, tc_entry in enumerate(raw_tool_calls): - tc_func = tc_entry["function"] + tc_func = tc_entry.get("function", {}) args = tc_func.get("arguments", "{}") if isinstance(args, str): try: @@ -558,7 +560,7 @@ async def _send_native( except json.JSONDecodeError: return TextResponse(content=choice.get("content", args)) result_calls.append(ToolCall( - tool=tc_func["name"], + tool=tc_func.get("name", ""), args=args, reasoning=reasoning if i == 0 else None, )) diff --git a/src/forge/clients/ollama.py b/src/forge/clients/ollama.py index 9f93a6c..5a9cce8 
100644 --- a/src/forge/clients/ollama.py +++ b/src/forge/clients/ollama.py @@ -198,10 +198,14 @@ async def send( reasoning = self._resolve_reasoning( msg.get("thinking", ""), msg.get("content", ""), ) + # Ollama returns tool-call arguments already decoded as a dict + # (unlike vLLM/llama.cpp, which send a JSON string) — no json.loads + # needed. Defensive .get on function/name so a broken tool-call + # entry degrades to empty rather than raising KeyError. return [ ToolCall( - tool=tc["function"]["name"], - args=tc["function"].get("arguments", {}), + tool=tc.get("function", {}).get("name", ""), + args=tc.get("function", {}).get("arguments", {}), reasoning=reasoning if i == 0 else None, ) for i, tc in enumerate(tool_calls) @@ -304,8 +308,8 @@ async def _iter_stream( ) final: LLMResponse = [ ToolCall( - tool=tc["function"]["name"], - args=tc["function"].get("arguments", {}), + tool=tc.get("function", {}).get("name", ""), + args=tc.get("function", {}).get("arguments", {}), reasoning=reasoning if i == 0 else None, ) for i, tc in enumerate(tool_calls) diff --git a/src/forge/clients/openai_compat.py b/src/forge/clients/openai_compat.py index 96ead28..e6b0319 100644 --- a/src/forge/clients/openai_compat.py +++ b/src/forge/clients/openai_compat.py @@ -208,7 +208,10 @@ async def send( data = resp.json() self._record_usage(data) - msg = data["choices"][0]["message"] + choices = data.get("choices") or [] + if not choices: + raise BackendError(500, f"response has no choices: {data}") + msg = choices[0].get("message", {}) tool_calls = msg.get("tool_calls") if tool_calls: return self._parse_tool_calls(tool_calls, fallback_content=msg.get("content") or "") diff --git a/src/forge/clients/vllm.py b/src/forge/clients/vllm.py index 5bdcb96..ff50e12 100644 --- a/src/forge/clients/vllm.py +++ b/src/forge/clients/vllm.py @@ -222,15 +222,11 @@ async def send( tool_calls = message.get("tool_calls") or [] if tool_calls: - reasoning = self._resolve_reasoning(message) - return [ - ToolCall( - 
tool=tc["function"]["name"], - args=self._parse_tool_args(tc["function"].get("arguments", {})), - reasoning=reasoning if i == 0 else None, - ) - for i, tc in enumerate(tool_calls) - ] + return self._parse_tool_calls( + tool_calls, + reasoning=self._resolve_reasoning(message), + fallback_content=message.get("content") or "", + ) return TextResponse(content=message.get("content") or "") @@ -318,37 +314,75 @@ async def send_stream( type=ChunkType.TEXT_DELTA, content=content, ) - # Build the final response + # Build the final response. Reassemble the accumulated deltas into the + # OpenAI tool-call shape and route through the same parser as send(), so + # streaming and non-streaming agree on malformed-args handling: a fully + # accumulated but unparseable arguments string yields a retry-driving + # TextResponse, not an exception. if tool_call_parts: - reasoning = self._resolve_reasoning( - accumulated_reasoning, accumulated_content, - ) - final: LLMResponse = [ - ToolCall( - tool=part["name"], - args=self._parse_tool_args(part["args"]), - reasoning=reasoning if i == 0 else None, - ) - for i, part in enumerate( - tool_call_parts[k] for k in sorted(tool_call_parts) - ) + reassembled = [ + {"function": {"name": part["name"], "arguments": part["args"]}} + for part in (tool_call_parts[k] for k in sorted(tool_call_parts)) ] + final: LLMResponse = self._parse_tool_calls( + reassembled, + reasoning=self._resolve_reasoning( + accumulated_reasoning, accumulated_content, + ), + fallback_content=accumulated_content, + ) else: final = TextResponse(content=accumulated_content) yield StreamChunk(type=ChunkType.FINAL, response=final) @staticmethod - def _parse_tool_args(raw: Any) -> dict[str, Any]: - """Tool args from vLLM arrive as JSON-encoded string in the - OpenAI native format. Decode to dict. 
+ def _parse_tool_calls( + tool_calls: list[dict[str, Any]], + reasoning: str | None, + fallback_content: str, + ) -> LLMResponse: + """Parse vLLM ``tool_calls`` into ``ToolCall`` objects (or TextResponse). + + Mirrors ``OpenAICompatClient`` / ``LlamafileClient`` so every + OpenAI-shape client behaves the same. Tool-call ``arguments`` arrive as + a JSON string (vLLM's native format) or an already-decoded dict. + Forge is fail-loud on the right axis: + + - **Malformed argument JSON** is NOT coerced into empty args (that would + let a model silently proceed with wrong arguments). We return a + ``TextResponse``, routing the raw output back through the inference + loop so the rescue/retry path can recover — matching llamafile. + - An **unexpected args type** (neither str nor dict) is a provider + contract violation, not a model mistake → ``BackendError``. + + Defensive ``.get`` on ``function`` / ``name`` keeps a broken tool-call + entry from raising ``KeyError``. Used by both send() and send_stream() + for parity (the stream path reassembles deltas into this shape first). 
""" - if isinstance(raw, dict): - return raw - if isinstance(raw, str): - if not raw: - return {} - return json.loads(raw) - raise BackendError(500, f"unexpected tool args shape: {type(raw).__name__}") + parsed: list[ToolCall] = [] + for i, tc in enumerate(tool_calls): + fn = tc.get("function", {}) + raw_args = fn.get("arguments", {}) + if isinstance(raw_args, str): + if not raw_args: + args: dict[str, Any] = {} + else: + try: + args = json.loads(raw_args) + except json.JSONDecodeError: + return TextResponse(content=fallback_content or raw_args) + elif isinstance(raw_args, dict): + args = raw_args + else: + raise BackendError( + 500, f"unexpected tool args shape: {type(raw_args).__name__}", + ) + parsed.append(ToolCall( + tool=fn.get("name", ""), + args=args, + reasoning=reasoning if i == 0 else None, + )) + return parsed async def get_context_length(self) -> int | None: """Query the vLLM /v1/models endpoint for max_model_len. diff --git a/tests/unit/test_llamafile_client.py b/tests/unit/test_llamafile_client.py index 68abe34..3785ee7 100644 --- a/tests/unit/test_llamafile_client.py +++ b/tests/unit/test_llamafile_client.py @@ -9,6 +9,7 @@ from forge.clients.llamafile import LlamafileClient, _extract_think_tags, _merge_consecutive from forge.core.workflow import TextResponse, ToolCall, ToolSpec +from forge.errors import BackendError from pydantic import BaseModel, Field from forge.clients.base import ChunkType @@ -116,6 +117,15 @@ async def test_returns_text_response(self) -> None: assert isinstance(result, TextResponse) assert result.content == "I need more info" + @pytest.mark.asyncio + async def test_missing_choices_raises_backend_error(self) -> None: + # Broken provider envelope (200, no choices) → fail loud and consistent + # rather than KeyError/IndexError on data["choices"][0]. 
+ client = _make_client("native") + client._http.post.return_value = _mock_response({"object": "error"}) + with pytest.raises(BackendError, match="response has no choices"): + await client.send([{"role": "user", "content": "test"}]) + @pytest.mark.asyncio async def test_arguments_parsed_from_string(self) -> None: """OpenAI format sends arguments as JSON string, not dict.""" diff --git a/tests/unit/test_openai_compat_client.py b/tests/unit/test_openai_compat_client.py index 9aaff77..55fddd0 100644 --- a/tests/unit/test_openai_compat_client.py +++ b/tests/unit/test_openai_compat_client.py @@ -97,6 +97,16 @@ async def test_returns_text_response(self) -> None: assert isinstance(result, TextResponse) assert result.content == "I need more info" + @pytest.mark.asyncio + async def test_missing_choices_raises_backend_error(self) -> None: + # A broken provider envelope (200 with no choices) is a contract + # violation, not a model mistake — fail loud and consistent rather + # than KeyError/IndexError on data["choices"][0]. 
+ client = _make_client() + client._http.post.return_value = _mock_response({"object": "error"}) + with pytest.raises(BackendError, match="response has no choices"): + await client.send([{"role": "user", "content": "test"}]) + @pytest.mark.asyncio async def test_null_content_returns_empty_text(self) -> None: client = _make_client() diff --git a/tests/unit/test_vllm_client.py b/tests/unit/test_vllm_client.py index 0576040..0de394b 100644 --- a/tests/unit/test_vllm_client.py +++ b/tests/unit/test_vllm_client.py @@ -326,6 +326,31 @@ async def test_yields_tool_call_delta_then_final(self) -> None: assert result[0].tool == "get_weather" assert result[0].args == {"city": "Paris"} + @pytest.mark.asyncio + async def test_malformed_accumulated_args_finalize_as_text_response(self) -> None: + # Streaming/non-streaming parity: once all fragments are accumulated, + # an unparseable arguments string must yield a retry-driving + # TextResponse (same as send()), not raise out of the stream. + client = _make_client() + client._http.stream.return_value = _MockStreamResponse([ + _sse({"choices": [{"delta": { + "tool_calls": [{ + "index": 0, + "function": {"name": "get_weather", "arguments": '{"city": '} + }], + }}]}), + _sse({"choices": [{"delta": {"content": "let me try again"}}]}), + "data: [DONE]", + ]) + chunks = [] + async for chunk in client.send_stream( + [{"role": "user", "content": "x"}], tools=[_make_spec()], + ): + chunks.append(chunk) + finals = [c for c in chunks if c.type == ChunkType.FINAL] + assert len(finals) == 1 + assert isinstance(finals[0].response, TextResponse) + @pytest.mark.asyncio async def test_accumulates_reasoning_across_deltas(self) -> None: client = _make_client(think=True) @@ -473,16 +498,57 @@ async def test_usage_only_chunk_records_usage_and_continues(self) -> None: assert usage.completion_tokens == 3 -class TestParseToolArgs: +class TestParseToolCalls: + @staticmethod + def _call(arguments: object) -> object: + return VLLMClient._parse_tool_calls( + 
[{"function": {"name": "lookup", "arguments": arguments}}], + reasoning=None, + fallback_content="raw model text", + ) + + def test_string_args_decoded(self) -> None: + """vLLM's native format — arguments arrive as a JSON string.""" + assert self._call('{"city": "Paris"}') == [ + ToolCall(tool="lookup", args={"city": "Paris"}), + ] + def test_dict_passed_through(self) -> None: """Some downstream wrappers send dict args directly — pass through.""" - assert VLLMClient._parse_tool_args({"city": "Paris"}) == {"city": "Paris"} + assert self._call({"city": "Paris"}) == [ + ToolCall(tool="lookup", args={"city": "Paris"}), + ] def test_empty_string_returns_empty_dict(self) -> None: """No-arg tool calls — empty string args is valid.""" - assert VLLMClient._parse_tool_args("") == {} + assert self._call("") == [ToolCall(tool="lookup", args={})] + + def test_malformed_json_returns_textresponse(self) -> None: + """Matches llamafile/openai_compat: malformed args drive a retry via + TextResponse, never silent {} or an exception.""" + assert self._call('{"city": ') == TextResponse(content="raw model text") def test_unexpected_type_raises(self) -> None: - """Unknown shape (list, int, etc.) — fail loud.""" + """Unknown shape (list, int, etc.) 
— provider contract violation.""" with pytest.raises(BackendError, match="unexpected tool args shape"): - VLLMClient._parse_tool_args(123) # type: ignore[arg-type] + self._call(123) + + def test_missing_function_is_defensive(self) -> None: + """A broken tool-call entry (no "function") must not KeyError.""" + assert VLLMClient._parse_tool_calls( + [{}], reasoning=None, fallback_content="", + ) == [ToolCall(tool="", args={})] + + def test_reasoning_attached_to_first_call_only(self) -> None: + result = VLLMClient._parse_tool_calls( + [ + {"function": {"name": "a", "arguments": "{}"}}, + {"function": {"name": "b", "arguments": "{}"}}, + ], + reasoning="because", + fallback_content="", + ) + assert result == [ + ToolCall(tool="a", args={}, reasoning="because"), + ToolCall(tool="b", args={}, reasoning=None), + ] From 91f660debfe90a97d31ae9647317c06be67396e5 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Sun, 31 May 2026 18:24:04 -0500 Subject: [PATCH 6/8] LlamafileClient: remove runtime auto mode; native-first, frozen capability MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Drop mode="auto" and its runtime probe-and-mutate (_resolve_and_send: try native, fall back to prompt on HTTP error, recording resolved_mode). This was the last vestige of the mid-request capability mutation the proxy rewrite excised everywhere else; the proxy already declares its capability up front via --backend-capability. With auto gone, resolved_mode is always == self.mode, so the whole tri-state indirection collapses to a direct dispatch on self.mode. The default is now native. This is both hardening and a deliberate posture shift: local-model function-calling support has matured into the more reliable path, so native-first is the right default. 
Prompt-injection is preserved as an explicit opt-in (mode="prompt") and is the theoretically correct fallback for non-FC backends — but it is honestly flagged, in the docstring and docs, that models tend to struggle to drive the prompt-injected protocol reliably on more complex, multi-step interactions. Capability is declared-and-frozen: an invalid mode (including the old "auto") now raises ValueError rather than silently degrading. - llamafile.py: validate mode in __init__; default native; delete _resolve_and_send and the resolved_mode attribute/branches; dispatch send / send_stream on self.mode; rewrite the class docstring (native-first rationale + prompt caveat). - eval_runner.py: --llamafile-mode choices [native, prompt], default native. - docs (BACKEND_SETUP, EVAL_GUIDE): native-first wording + the prompt caveat. - tests: drop the auto-mode suite; assert native default + ValueError on "auto". Consumers verified unaffected: the proxy (both sites), batch_eval, and the integration script all pass mode explicitly. 1086 unit tests green. Co-Authored-By: Claude Opus 4.8 --- docs/BACKEND_SETUP.md | 4 +- docs/EVAL_GUIDE.md | 2 +- src/forge/clients/llamafile.py | 91 ++++++----------- tests/eval/eval_runner.py | 4 +- tests/unit/test_llamafile_client.py | 148 +++------------------------- 5 files changed, 46 insertions(+), 203 deletions(-) diff --git a/docs/BACKEND_SETUP.md b/docs/BACKEND_SETUP.md index 024e762..75e667d 100644 --- a/docs/BACKEND_SETUP.md +++ b/docs/BACKEND_SETUP.md @@ -73,7 +73,7 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999 | `-ngl 999` | Offload all layers to GPU | | `-m ` | Path to GGUF | -llamafile does **not** support native function calling — forge's `LlamafileClient` falls back to prompt-injected mode automatically (`mode="auto"`), or you can force it with `mode="prompt"`. 
+`LlamafileClient` is **native-first**: `mode="native"` (the default) forwards tools via the backend's `tools` parameter and requires native function calling (llama.cpp with `--jinja`). For a backend without native FC, declare `mode="prompt"` to inject tool descriptions into the prompt and parse the JSON call back out. The capability is declared at construction and frozen — there is no runtime auto-detection. Native-first is the default because local-model FC support has matured into the more reliable path; prompt-injection stays fully supported as an explicit opt-in, but note that on more complex, multi-step interactions models tend to struggle to drive the prompt-injected protocol reliably, so reach for it only when the backend leaves no alternative. > **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. See ADR-012. 
@@ -90,7 +90,7 @@ from forge.clients import LlamafileClient client = LlamafileClient( gguf_path="path/to/model.gguf", - mode="prompt", # or "auto" to try native first + mode="prompt", # default is "native"; use "prompt" only for non-FC backends recommended_sampling=True, ) ``` diff --git a/docs/EVAL_GUIDE.md b/docs/EVAL_GUIDE.md index 07b4757..6ac9a69 100644 --- a/docs/EVAL_GUIDE.md +++ b/docs/EVAL_GUIDE.md @@ -30,7 +30,7 @@ python -m tests.eval.eval_runner --backend anthropic --model claude-haiku-4-5-20 | `--verbose`, `-v` | flag | off | Print live per-message trace | | `--tags` | `plumbing`, `model_quality`, `advanced_reasoning`, `compaction`, `stateful`, `reasoning`, `error_recovery` | all | Filter scenarios by tag | | `--scenario` | name(s) | all | Run specific scenario(s) by name | -| `--llamafile-mode` | `native`, `prompt`, `auto` | `auto` | FC mode for llamafile/llama-server backend | +| `--llamafile-mode` | `native`, `prompt` | `native` | FC mode for llamafile/llama-server backend (native-first; `prompt` for non-FC backends) | | `--think` | `true`, `false`, `auto` | `auto` | Thinking mode. Ollama: controls `think` param. Llamafile: captures `[THINK]` tags and `reasoning_content` | | `--budget-mode` | `backend`, `manual`, `forge-full`, `forge-fast` | `forge-full` | Context budget strategy. Compaction scenarios always override with their own budget | | `--num-ctx` | int | none | Exact token budget (requires `--budget-mode manual`) | diff --git a/src/forge/clients/llamafile.py b/src/forge/clients/llamafile.py index c04bcec..9fd4ab7 100644 --- a/src/forge/clients/llamafile.py +++ b/src/forge/clients/llamafile.py @@ -121,12 +121,23 @@ def _downgrade_messages(messages: list[dict[str, Any]]) -> list[dict[str, Any]]: class LlamafileClient: - """OpenAI-compatible client for Llamafile. - - mode="native" uses the tools parameter (requires Llamafile with FC support). - mode="prompt" injects tool descriptions into the prompt and extracts JSON. 
- mode="auto" tries native first, falls back to prompt on failure — with - an explicit warning log and resolved_mode set for caller inspection. + """OpenAI-compatible client for Llamafile / llama.cpp. + + The capability is declared once at construction and frozen — there is no + runtime auto-detection. ``mode`` is one of: + + - ``"native"`` (default): forwards tools via the ``tools`` parameter + (requires a backend with native function calling — llama.cpp ``--jinja``). + - ``"prompt"``: injects tool descriptions into the prompt and parses the + JSON tool call back out; for backends without native FC. + + Native-first is the default because function-calling support across local + models has matured to the point where it is the more reliable path. + Prompt-injection remains fully supported as an explicit opt-in: it is the + theoretically correct fallback when a backend can't do native FC, but be + aware that on more complex, multi-step interactions models tend to struggle + to drive the prompt-injected protocol reliably. Choose ``"prompt"`` only + when the backend leaves no alternative. """ api_format: str = "openai" @@ -142,13 +153,20 @@ def __init__( repeat_penalty: float | None = None, presence_penalty: float | None = None, chat_template_kwargs: dict[str, Any] | None = None, - mode: str = "auto", + mode: str = "native", timeout: float = 300.0, think: bool | None = None, cache_prompt: bool = True, slot_id: int | None = None, recommended_sampling: bool = False, ) -> None: + if mode not in ("native", "prompt"): + raise ValueError( + f"mode must be 'native' or 'prompt', got {mode!r}. " + "Runtime auto-detection was removed — declare the backend " + "capability explicitly (native-first; 'prompt' for non-FC " + "backends)." + ) self.base_url = base_url # gguf_path is the canonical identity. 
self.model is the stem (no # .gguf / .llamafile suffix) — used for the wire-format model field @@ -175,17 +193,12 @@ def __init__( ) self.mode = mode self._http = httpx.AsyncClient(timeout=timeout) - self._think: bool = think if think is not None else True # auto = capture + self._think: bool = think if think is not None else True # think=None → capture self._cache_prompt = cache_prompt self._slot_id = slot_id self.last_usage: dict[int, TokenUsage] = {} - if mode in ("native", "prompt"): - self.resolved_mode: str | None = mode - else: - self.resolved_mode = None - async def aclose(self) -> None: """Close the underlying httpx connection pool.""" await self._http.aclose() @@ -272,7 +285,7 @@ async def send( inbound_anthropic_body: dict[str, Any] | None = None, raw_openai_tools: RawOpenAITools | None = None, ) -> LLMResponse: - """Resolve mode on first call with tools, then dispatch. + """Dispatch to the native or prompt-injected path per the declared mode. ``inbound_anthropic_body`` is accepted for protocol symmetry and silently ignored — LlamafileClient only speaks OpenAI shape. @@ -281,16 +294,11 @@ async def send( backend's ``tools`` array on the native path; the prompt path accepts and ignores it (it keeps forge's prompt-injection format). """ - if self.resolved_mode is None: - return await self._resolve_and_send( - messages, tools, sampling, passthrough, raw_openai_tools, - ) - elif self.resolved_mode == "native": + if self.mode == "native": return await self._send_native( messages, tools, sampling, passthrough, raw_openai_tools, ) - else: - return await self._send_prompt(messages, tools, sampling, passthrough) + return await self._send_prompt(messages, tools, sampling, passthrough) async def send_stream( self, @@ -307,13 +315,7 @@ async def send_stream( ``raw_openai_tools`` (proxy use) is forwarded verbatim on the native path; ignored on the prompt path. """ - if self.resolved_mode is None: - # Probe with a non-streaming call to resolve native vs prompt. 
- # Result is discarded — the runner will use the streamed response. - await self._resolve_and_send( - messages, tools, sampling, passthrough, raw_openai_tools, - ) - mode = self.resolved_mode + mode = self.mode body: dict[str, Any] = dict(passthrough or {}) body.update({ @@ -469,39 +471,6 @@ async def get_context_length(self) -> int | None: except (ValueError, KeyError, TypeError) as exc: raise ContextDiscoveryError(exc) from exc - async def _resolve_and_send( - self, - messages: list[dict[str, str]], - tools: list[ToolSpec] | None, - sampling: dict[str, Any] | None = None, - passthrough: dict[str, Any] | None = None, - raw_openai_tools: RawOpenAITools | None = None, - ) -> LLMResponse: - """Auto-resolve mode on first send with tools. - - Only falls back to prompt-injected mode on an HTTP error (backend - doesn't support the tools parameter). A TextResponse with tools - provided is not a fallback signal — it means native FC is supported - but the model chose not to call a tool. The runner's retry logic - handles that case. 
- """ - if not tools: - # No tools to test with — send without tools, defer resolution - self.resolved_mode = "native" - return await self._send_native( - messages, tools, sampling, passthrough, raw_openai_tools, - ) - - try: - result = await self._send_native( - messages, tools, sampling, passthrough, raw_openai_tools, - ) - self.resolved_mode = "native" - return result - except (httpx.HTTPStatusError, BackendError): - self.resolved_mode = "prompt" - return await self._send_prompt(messages, tools, sampling, passthrough) - async def _send_native( self, messages: list[dict[str, str]], diff --git a/tests/eval/eval_runner.py b/tests/eval/eval_runner.py index 231e56e..b503594 100644 --- a/tests/eval/eval_runner.py +++ b/tests/eval/eval_runner.py @@ -507,8 +507,8 @@ async def main() -> None: parser.add_argument("--scenario", nargs="*", help="Run specific scenario(s) by name") parser.add_argument( "--llamafile-mode", - choices=["native", "prompt", "auto"], - default="auto", + choices=["native", "prompt"], + default="native", ) parser.add_argument( "--budget-mode", diff --git a/tests/unit/test_llamafile_client.py b/tests/unit/test_llamafile_client.py index 3785ee7..c8ebc63 100644 --- a/tests/unit/test_llamafile_client.py +++ b/tests/unit/test_llamafile_client.py @@ -369,133 +369,6 @@ async def test_tool_role_downgraded_in_prompt_mode(self) -> None: assert sent_messages[3]["content"] == "fetch_result" -# ── auto mode ──────────────────────────────────────────────────── - - -class TestLlamafileAutoMode: - @pytest.mark.asyncio - async def test_auto_resolves_native_on_tool_call(self) -> None: - client = _make_client("auto") - assert client.resolved_mode is None - client._http.post.return_value = _mock_response( - _openai_tool_call_response() - ) - result = await client.send( - [{"role": "user", "content": "test"}], tools=[_make_spec()] - ) - assert isinstance(result, list) - assert client.resolved_mode == "native" - - @pytest.mark.asyncio - async def 
test_auto_stays_native_on_text_response(self) -> None: - """TextResponse with tools provided means FC is supported but the - model chose not to call a tool. Should resolve to native, not - fall back to prompt mode.""" - client = _make_client("auto") - - client._http.post.return_value = _mock_response( - _openai_text_response("Let me think about this...") - ) - - result = await client.send( - [{"role": "system", "content": "sys"}, {"role": "user", "content": "test"}], - tools=[_make_spec()], - ) - assert client.resolved_mode == "native" - assert isinstance(result, TextResponse) - assert result.content == "Let me think about this..." - - @pytest.mark.asyncio - async def test_auto_falls_back_on_http_error(self) -> None: - client = _make_client("auto") - - # First call (native attempt) raises HTTP error - error_resp = _mock_response({}, status_code=400) - prompt_resp = _mock_response( - _openai_text_response('{"tool": "get_pricing", "args": {}}') - ) - client._http.post.side_effect = [error_resp, prompt_resp] - - result = await client.send( - [{"role": "system", "content": "sys"}, {"role": "user", "content": "test"}], - tools=[_make_spec()], - ) - assert client.resolved_mode == "prompt" - assert isinstance(result, list) - - @pytest.mark.asyncio - async def test_auto_without_tools_defaults_native(self) -> None: - client = _make_client("auto") - client._http.post.return_value = _mock_response( - _openai_text_response("hello") - ) - result = await client.send([{"role": "user", "content": "hi"}]) - assert isinstance(result, TextResponse) - assert client.resolved_mode == "native" - - @pytest.mark.asyncio - async def test_send_stream_auto_resolves_native(self) -> None: - """send_stream() probes mode before streaming when resolved_mode is None.""" - client = _make_client("auto") - assert client.resolved_mode is None - - # Probe call resolves to native - client._http.post.return_value = _mock_response( - _openai_tool_call_response() - ) - # Streaming call returns a tool call - 
sse_lines = [ - f'data: {json.dumps({"choices": [{"delta": {"tool_calls": [{"index": 0, "function": {"name": "get_pricing", "arguments": ""}}]}}]})}', - f'data: {json.dumps({"choices": [{"delta": {"tool_calls": [{"index": 0, "function": {"arguments": "{\"part\": \"X\"}"}}]}}]})}', - "data: [DONE]", - ] - client._http.stream.return_value = _MockSSEStreamResponse(sse_lines) - - chunks = [] - async for chunk in client.send_stream( - [{"role": "system", "content": "sys"}, {"role": "user", "content": "test"}], - tools=[_make_spec()], - ): - chunks.append(chunk) - - assert client.resolved_mode == "native" - finals = [c for c in chunks if c.type == ChunkType.FINAL] - assert len(finals) == 1 - assert isinstance(finals[0].response, list) - - @pytest.mark.asyncio - async def test_send_stream_auto_falls_back_to_prompt(self) -> None: - """send_stream() falls back to prompt mode when native probe fails.""" - client = _make_client("auto") - assert client.resolved_mode is None - - # Probe: native fails with HTTP error, prompt fallback succeeds - error_resp = _mock_response({}, status_code=400) - prompt_resp = _mock_response( - _openai_text_response('{"tool": "get_pricing", "args": {"part": "X"}}') - ) - client._http.post.side_effect = [error_resp, prompt_resp] - - # Streaming call (now in prompt mode) returns extracted tool call - sse_lines = [ - f'data: {json.dumps({"choices": [{"delta": {"content": "{\"tool\": \"get_pricing\", \"args\": {\"part\": \"Y\"}}"}, "finish_reason": "stop"}]})}', - ] - client._http.stream.return_value = _MockSSEStreamResponse(sse_lines) - - chunks = [] - async for chunk in client.send_stream( - [{"role": "system", "content": "sys"}, {"role": "user", "content": "test"}], - tools=[_make_spec()], - ): - chunks.append(chunk) - - assert client.resolved_mode == "prompt" - finals = [c for c in chunks if c.type == ChunkType.FINAL] - assert len(finals) == 1 - assert isinstance(finals[0].response, list) - assert finals[0].response[0].tool == "get_pricing" - - # 
── get_context_length ─────────────────────────────────────────── @@ -678,21 +551,22 @@ async def test_streaming_no_reasoning_when_no_content(self) -> None: assert final.response[0].reasoning is None -# ── resolved_mode ──────────────────────────────────────────────── +# ── mode ───────────────────────────────────────────────────────── -class TestResolvedMode: - def test_native_mode_set_immediately(self) -> None: - client = LlamafileClient(gguf_path="test", mode="native") - assert client.resolved_mode == "native" +class TestMode: + def test_native_is_default(self) -> None: + client = LlamafileClient(gguf_path="test") + assert client.mode == "native" - def test_prompt_mode_set_immediately(self) -> None: + def test_prompt_mode(self) -> None: client = LlamafileClient(gguf_path="test", mode="prompt") - assert client.resolved_mode == "prompt" + assert client.mode == "prompt" - def test_auto_mode_unset(self) -> None: - client = LlamafileClient(gguf_path="test", mode="auto") - assert client.resolved_mode is None + def test_auto_mode_rejected(self) -> None: + # Runtime auto-detection was removed — capability is declared-and-frozen. + with pytest.raises(ValueError, match="mode must be 'native' or 'prompt'"): + LlamafileClient(gguf_path="test", mode="auto") # ── _apply_sampling ────────────────────────────────────────────── From 4d7debc8228646dd2779ecf6b54b6954db10da46 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Mon, 1 Jun 2026 00:42:17 -0500 Subject: [PATCH 7/8] fix(eval): honor manual context budget in batch_eval via start_with_budget MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit batch_eval brought servers up with a bare server.start() (no ctx_override) and resolved the budget separately via server.resolve_budget(), so --budget-mode manual --num-ctx N was a no-op for llama-server: the server booted at the model's full native context (no -c), and resolve_budget(MANUAL) just read that full value back from /props. 
(Ollama was unaffected — its context is per-request via set_num_ctx.) Route both the initial bring-up and _recover_server through the prod start_with_budget() path, which threads manual_tokens -> ctx_override -> -c at launch and returns the resolved budget. _recover_server gains budget_mode/manual_tokens params so a restarted server reuses the same budget. Drops the now-redundant standalone resolve_budget() on the happy path (still used on the recovery branch to read back the resolved value). This also fixes FORGE_FAST mode, which the old bare-start() path never supported. Smoke-tested live (Ministral-3 14B-Reasoning, native, --num-ctx 20000): server boots with -c, rows record budget_tokens=20224 (server-clamped) instead of the previous 262144 full-native read-back. Co-Authored-By: Claude Opus 4.8 --- tests/eval/batch_eval.py | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/tests/eval/batch_eval.py b/tests/eval/batch_eval.py index 9f04e82..e919437 100644 --- a/tests/eval/batch_eval.py +++ b/tests/eval/batch_eval.py @@ -420,9 +420,15 @@ async def _recover_server( gguf_path: str, extra_flags: list[str] | None, crash_count: int, + budget_mode: BudgetMode, + manual_tokens: int | None, ) -> bool: """Attempt to restart the server after a crash. + Restarts through the prod ``start_with_budget`` path so the recovered + server is launched with the same budget (e.g. ``-c manual_tokens`` for + MANUAL mode) as the original. + Returns True if recovery succeeded, False if circuit breaker tripped. """ if crash_count > len(_RECOVERY_BACKOFFS): @@ -447,10 +453,12 @@ async def _recover_server( # GGUF path for non-Ollama (matches run_batch and setup_backend). 
cache_identity = config.model if config.backend == "ollama" else gguf_path try: - await server.start( + await server.start_with_budget( model=cache_identity, gguf_path=gguf_path, mode=config.mode, + budget_mode=budget_mode, + manual_tokens=manual_tokens, extra_flags=extra_flags, ) print(" [!] Server restarted successfully.", flush=True) @@ -751,10 +759,15 @@ async def run_batch( extra_flags = _get_server_flags(config.model, config.mode) cache_identity = config.model if config.backend == "ollama" else gguf_path try: - await server.start( + # Prod path: launches with the budget-appropriate context + # (e.g. -c manual_tokens for MANUAL) and returns the resolved + # budget, instead of starting raw and reading back full ctx. + resolved_budget = await server.start_with_budget( model=cache_identity, gguf_path=gguf_path, mode=config.mode, + budget_mode=budget_mode, + manual_tokens=manual_tokens, extra_flags=extra_flags if extra_flags else None, ) except RuntimeError: @@ -763,20 +776,19 @@ async def run_batch( server, config, gguf_path, extra_flags if extra_flags else None, crash_count=1, + budget_mode=budget_mode, manual_tokens=manual_tokens, ) if not recovered: print(f" SKIP (server failed to start)", flush=True) total_skipped += total_scenarios continue + resolved_budget = await server.resolve_budget(budget_mode, manual_tokens) prev_backend = config.backend prev_server = server # Build client client = _build_client(config, models_dir) - - # Resolve budget through prod ServerManager path - resolved_budget = await server.resolve_budget(budget_mode, manual_tokens) if hasattr(client, "set_num_ctx"): client.set_num_ctx(resolved_budget) @@ -844,6 +856,7 @@ async def run_batch( server, config, gguf_path, extra_flags if extra_flags else None, crash_count, + budget_mode=budget_mode, manual_tokens=manual_tokens, ) if not recovered: print( From eb7e7754e2c63cba06d87caa68ffff4654c0b7f6 Mon Sep 17 00:00:00 2001 From: Antoine Zambelli Date: Mon, 1 Jun 2026 00:56:51 -0500 Subject: 
[PATCH 8/8] =?UTF-8?q?chore(release):=200.7.3=20=E2=80=94=20nativ?= =?UTF-8?q?e-first=20proxy?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bump version 0.7.2 -> 0.7.3 and add the CHANGELOG entry covering this branch plus the commits that landed on main since 0.7.2 (OpenAICompatClient #89, --backend-timeout #91, and fixes #71/#72/#73/#86/#94). Headline: native-first proxy. BREAKING — the proxy --mode flag is renamed to --backend-capability (no alias; --mode was only introduced in 0.7.1). Native is the default and only auto-selected protocol; prompt-injection is an explicit opt-in for non-FC llama.cpp/llamafile backends. USER_GUIDE: --mode -> --backend-capability, with the caveat that prompt mode tends to degrade on more complex multi-step interactions. BACKEND_SETUP, EVAL_GUIDE, and ADR-012 were already updated earlier on this branch. Co-Authored-By: Claude Opus 4.8 --- CHANGELOG.md | 27 +++++++++++++++++++++++++++ docs/USER_GUIDE.md | 2 +- pyproject.toml | 2 +- 3 files changed, 29 insertions(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 29aa581..f2e0d0b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,33 @@ All notable changes to forge are documented here. +## [0.7.3] — 2026-06-01 + +Native-first proxy. With native function calling now well-supported across modern local models, the proxy defaults to — and is optimized for — native tool calling, forwarding the client's OpenAI `tools` / `messages` to the backend verbatim. Prompt-injection remains available as an explicit opt-in for llama.cpp / llamafile backends that lack a function-calling template, but it is no longer the default path. This release also folds in the OpenAI-compatible client and several proxy / eval fixes that landed on `main` since 0.7.2. + +### Added +- **`OpenAICompatClient`** for arbitrary OpenAI-compatible endpoints. #89 (thanks @lucasgerads). 
+- **`--backend-timeout` proxy option** — configurable backend response timeout (default 300s). #91. +- **`--backend-capability {native,prompt}` proxy flag** — `native` (default) forwards the client's tools / messages verbatim to a function-calling-capable backend; `prompt` opts into prompt-injection for non-FC llama.cpp / llamafile backends. Declared once at startup and frozen — never probed or switched mid-stream. +- Effective `backend_timeout` logged at proxy startup. + +### Changed +- **BREAKING — `--mode {native,prompt}` renamed to `--backend-capability {native,prompt}`** (and `ProxyServer(mode=…)` → `ProxyServer(backend_capability=…)`). `--mode` collided with the proxy's managed / external deployment mode; the new name states what it controls — the backend's tool-calling protocol — and reflects that the choice is declared once and frozen, never probed at runtime. There is **no deprecation alias** (`--mode` was introduced in 0.7.1). Migration: `--mode native` → drop it (native is the default) or `--backend-capability native`; `--mode prompt` → `--backend-capability prompt`. +- **Native function calling is now transparent passthrough** — the proxy forwards the client's OpenAI tool / message payloads to the backend verbatim instead of round-tripping them through forge's internal `ToolSpec` representation, which dropped schema detail. +- **vLLM model identity** consolidated to a single source of truth (the wire `model_path` and the registry `model` key are now set together). #75. +- The `prompt` capability is now **rejected loudly** for ollama / vllm / anthropic backends — previously it was silently ignored for ollama. +- `stream_options` is excluded from proxy passthrough. #94 (thanks @alexandergunnarson). 
+ +### Fixed +- **Consistent malformed-tool-call / unexpected-response handling** across the OpenAI-shape clients — malformed model tool args drive a retry (`TextResponse`) instead of degrading silently or raising inconsistently, and non-streaming responses are guarded so a broken provider envelope fails loud. +- `Guardrails.record()` no longer drops tool args for prerequisite tracking. #72 (thanks @hobostay). +- Deprecated asyncio API replaced; proxy server input validation added. #71 (thanks @hobostay). +- Proxy input hardening, non-blocking Ollama stop, client shutdown, and loud arg decode. #86. +- Dead code and a fragile variable reference cleaned up in `LlamafileClient`. #73 (thanks @hobostay). + +### Removed +- Runtime `auto` function-calling mode in `LlamafileClient` — the proxy never used it, and its mid-request probe-and-switch behavior is replaced by the declared-and-frozen `--backend-capability`. + ## [0.7.2] — 2026-05-24 vLLM backend support — serve AWQ/GPTQ and other vLLM-hosted models behind forge's guardrails, in both proxy modes and via `WorkflowRunner`. diff --git a/docs/USER_GUIDE.md b/docs/USER_GUIDE.md index 019792c..ac04a1f 100644 --- a/docs/USER_GUIDE.md +++ b/docs/USER_GUIDE.md @@ -83,7 +83,7 @@ claude `ANTHROPIC_AUTH_TOKEN` can be any non-empty string — forge ignores it. The model name Claude Code sends is also ignored; forge serves whatever backend the proxy was started with. -**Function-calling mode.** `--mode native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--mode prompt` injects the tool surface into the prompt for backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model, so prefer native when the backend supports it. 
+**Function-calling capability.** `--backend-capability native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--backend-capability prompt` injects the tool surface into the prompt for llama.cpp/llamafile backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model — and tends to degrade on more complex, multi-step interactions — so prefer native whenever the backend supports it. The capability is declared at startup and frozen. **Downstream protocol.** diff --git a/pyproject.toml b/pyproject.toml index 37dcf08..d7a0f2e 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "hatchling.build" [project] name = "forge-guardrails" -version = "0.7.2" +version = "0.7.3" description = "A reliability layer for self-hosted LLM tool-calling. Guardrails, context management, and backend adapters for multi-step agentic workflows." requires-python = ">=3.12" license = "MIT"
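
Reviewer note: the declared-and-frozen capability that this series introduces (validate `mode` at construction, then dispatch on it with no runtime probe or fallback) can be sketched in isolation as below. This is a minimal illustration, not forge's `LlamafileClient`: the class name `FrozenModeClient`, its synchronous `send`, and the dict return values are hypothetical stand-ins chosen so the snippet runs without a backend.

```python
# Hypothetical sketch; names and return shapes are illustrative assumptions,
# not forge's actual API.
VALID_MODES = ("native", "prompt")


class FrozenModeClient:
    """Toy client whose tool-calling mode is declared once and never probed."""

    def __init__(self, mode: str = "native") -> None:
        # Reject anything other than the two declared capabilities at
        # construction time, including the removed "auto" value.
        if mode not in VALID_MODES:
            raise ValueError(
                f"mode must be 'native' or 'prompt', got {mode!r}."
            )
        self.mode = mode

    def send(self, messages, tools=None):
        # Dispatch purely on the declared mode; a failure on the native
        # path is not converted into a prompt-mode retry.
        if self.mode == "native":
            return self._send_native(messages, tools)
        return self._send_prompt(messages, tools)

    def _send_native(self, messages, tools):
        # Stand-in for forwarding tools via the backend's `tools` parameter.
        return {"path": "native", "tools": list(tools or [])}

    def _send_prompt(self, messages, tools):
        # Stand-in for injecting tool descriptions into the prompt text.
        return {"path": "prompt", "tools": []}
```

Under these assumptions, `FrozenModeClient()` takes the native path, `FrozenModeClient(mode="prompt")` takes the prompt path, and `FrozenModeClient(mode="auto")` raises `ValueError` at construction rather than deferring the decision to the first request.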