27 changes: 27 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,33 @@

All notable changes to forge are documented here.

## [0.7.3] — 2026-06-01

Native-first proxy. With native function calling now well-supported across modern local models, the proxy defaults to — and is optimized for — native tool calling, forwarding the client's OpenAI `tools` / `messages` to the backend verbatim. Prompt-injection remains available as an explicit opt-in for llama.cpp / llamafile backends that lack a function-calling template, but it is no longer the default path. This release also folds in the OpenAI-compatible client and several proxy / eval fixes that landed on `main` since 0.7.2.

### Added
- **`OpenAICompatClient`** for arbitrary OpenAI-compatible endpoints. #89 (thanks @lucasgerads).
- **`--backend-timeout` proxy option** — configurable backend response timeout (default 300s). #91.
- **`--backend-capability {native,prompt}` proxy flag** — `native` (default) forwards the client's tools / messages verbatim to a function-calling-capable backend; `prompt` opts into prompt-injection for non-FC llama.cpp / llamafile backends. Declared once at startup and frozen — never probed or switched mid-stream.
- Effective `backend_timeout` logged at proxy startup.

### Changed
- **BREAKING — `--mode {native,prompt}` renamed to `--backend-capability {native,prompt}`** (and `ProxyServer(mode=…)` → `ProxyServer(backend_capability=…)`). `--mode` collided with the proxy's managed / external deployment mode; the new name states what it controls — the backend's tool-calling protocol — and reflects that the choice is declared once and frozen, never probed at runtime. There is **no deprecation alias** (`--mode` was introduced in 0.7.1). Migration: `--mode native` → drop it (native is the default) or `--backend-capability native`; `--mode prompt` → `--backend-capability prompt`.
- **Native function calling is now transparent passthrough** — the proxy forwards the client's OpenAI tool / message payloads to the backend verbatim instead of round-tripping them through forge's internal `ToolSpec` representation, which dropped schema detail.
- **vLLM model identity** consolidated to a single source of truth (the wire `model_path` and the registry `model` key are now set together). #75.
- The `prompt` capability is now **rejected loudly** for ollama / vllm / anthropic backends — previously it was silently ignored for ollama.
- `stream_options` is excluded from proxy passthrough. #94 (thanks @alexandergunnarson).
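
The `--mode` to `--backend-capability` migration above can be sketched as a small argv rewriter for launch scripts. This is only an illustration under stated assumptions: `migrate_proxy_args` is a hypothetical helper, not part of forge, and it relies solely on the two flag spellings and the native default named in this entry.

```python
# Hypothetical helper (not part of forge): rewrites pre-0.7.3 proxy arguments.
OLD_MODE_VALUES = {"native", "prompt"}


def migrate_proxy_args(argv: list[str]) -> list[str]:
    """Translate `--mode {native,prompt}` into `--backend-capability`."""
    out: list[str] = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        value = None
        consumed = 1
        if arg == "--mode" and i + 1 < len(argv):
            value, consumed = argv[i + 1], 2
        elif arg.startswith("--mode="):
            value = arg.split("=", 1)[1]
        if value in OLD_MODE_VALUES:
            # native is the default, so the flag is simply dropped;
            # prompt must be declared under the new name.
            if value == "prompt":
                out += ["--backend-capability", "prompt"]
        else:
            # Not an old function-calling mode: pass through untouched.
            out.extend(argv[i:i + consumed])
        i += consumed
    return out
```

Dropping the flag for `native` mirrors the "drop it" migration advice; any `--mode` value other than `native` or `prompt` is left alone rather than guessed at.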

### Fixed
- **Consistent malformed-tool-call / unexpected-response handling** across the OpenAI-shape clients — malformed model tool args drive a retry (`TextResponse`) instead of degrading silently or raising inconsistently, and non-streaming responses are guarded so a broken provider envelope fails loud.
- `Guardrails.record()` no longer drops tool args for prerequisite tracking. #72 (thanks @hobostay).
- Deprecated asyncio API replaced; proxy server input validation added. #71 (thanks @hobostay).
- Proxy input hardening, non-blocking Ollama stop, client shutdown, and loud arg decode. #86.
- Dead code and a fragile variable reference cleaned up in `LlamafileClient`. #73 (thanks @hobostay).

### Removed
- Runtime `auto` function-calling mode in `LlamafileClient` — the proxy never used it, and its mid-request probe-and-switch behavior is replaced by the declared-and-frozen `--backend-capability`.

## [0.7.2] — 2026-05-24

vLLM backend support — serve AWQ/GPTQ and other vLLM-hosted models behind forge's guardrails, in both proxy modes and via `WorkflowRunner`.
6 changes: 4 additions & 2 deletions docs/BACKEND_SETUP.md
@@ -73,7 +73,9 @@ llamafile --server --nobrowser -m path/to/model.gguf --port 8080 -ngl 999
| `-ngl 999` | Offload all layers to GPU |
| `-m <path>` | Path to GGUF |

-llamafile does **not** support native function calling; forge's `LlamafileClient` falls back to prompt-injected mode automatically (`mode="auto"`), or you can force it with `mode="prompt"`.
`LlamafileClient` is **native-first**: `mode="native"` (the default) forwards tools via the backend's `tools` parameter and requires native function calling (llama.cpp with `--jinja`). For a backend without native FC, declare `mode="prompt"` to inject tool descriptions into the prompt and parse the JSON call back out. The capability is declared at construction and frozen — there is no runtime auto-detection. Native-first is the default because local-model FC support has matured into the more reliable path; prompt-injection stays fully supported as an explicit opt-in, but note that on more complex, multi-step interactions models tend to struggle to drive the prompt-injected protocol reliably, so reach for it only when the backend leaves no alternative.

> **Proxy note:** the OpenAI-compatible proxy is **native-first**. By default (`--backend-capability native`) it forwards the client's tools verbatim to an FC-capable backend (llama.cpp with `--jinja`, vLLM, Ollama, Anthropic) — the recommended setup. For a non-FC llama.cpp/llamafile backend, opt into prompt-injection with `--backend-capability prompt` (strips tools into the prompt, parses the JSON call back; reuses the same prompt path as the WorkflowRunner). The choice is frozen at startup — there is no runtime auto-detect in the proxy. See ADR-012.

Smoke-test:

@@ -88,7 +90,7 @@ from forge.clients import LlamafileClient

client = LlamafileClient(
gguf_path="path/to/model.gguf",
-mode="prompt", # or "auto" to try native first
mode="prompt", # default is "native"; use "prompt" only for non-FC backends
recommended_sampling=True,
)
```
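
The "declared once and frozen" capability, together with the loud rejection of `prompt` on backends that cannot use it, might be modeled roughly like this. It is a sketch, not forge's implementation: `ProxyConfig` and the backend label strings are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed backend labels; only llama.cpp / llamafile backends accept "prompt".
PROMPT_CAPABLE_BACKENDS = frozenset({"llamaserver", "llamafile"})
CAPABILITIES = ("native", "prompt")


@dataclass(frozen=True)  # frozen: the capability cannot change after startup
class ProxyConfig:
    backend: str
    backend_capability: str = "native"  # native-first default

    def __post_init__(self) -> None:
        if self.backend_capability not in CAPABILITIES:
            raise ValueError(f"unknown capability: {self.backend_capability!r}")
        if (
            self.backend_capability == "prompt"
            and self.backend not in PROMPT_CAPABLE_BACKENDS
        ):
            # Fail at construction instead of silently ignoring the flag.
            raise ValueError(
                f"'prompt' capability is not supported for backend {self.backend!r}"
            )
```

Validating in `__post_init__` means a bad combination fails before any request is served, and the frozen dataclass rules out a mid-request switch.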
2 changes: 1 addition & 1 deletion docs/EVAL_GUIDE.md
@@ -30,7 +30,7 @@ python -m tests.eval.eval_runner --backend anthropic --model claude-haiku-4-5-20
| `--verbose`, `-v` | flag | off | Print live per-message trace |
| `--tags` | `plumbing`, `model_quality`, `advanced_reasoning`, `compaction`, `stateful`, `reasoning`, `error_recovery` | all | Filter scenarios by tag |
| `--scenario` | name(s) | all | Run specific scenario(s) by name |
-| `--llamafile-mode` | `native`, `prompt`, `auto` | `auto` | FC mode for llamafile/llama-server backend |
| `--llamafile-mode` | `native`, `prompt` | `native` | FC mode for llamafile/llama-server backend (native-first; `prompt` for non-FC backends) |
| `--think` | `true`, `false`, `auto` | `auto` | Thinking mode. Ollama: controls `think` param. Llamafile: captures `[THINK]` tags and `reasoning_content` |
| `--budget-mode` | `backend`, `manual`, `forge-full`, `forge-fast` | `forge-full` | Context budget strategy. Compaction scenarios always override with their own budget |
| `--num-ctx` | int | none | Exact token budget (requires `--budget-mode manual`) |
2 changes: 1 addition & 1 deletion docs/USER_GUIDE.md
@@ -83,7 +83,7 @@ claude

`ANTHROPIC_AUTH_TOKEN` can be any non-empty string — forge ignores it. The model name Claude Code sends is also ignored; forge serves whatever backend the proxy was started with.

-**Function-calling mode.** `--mode native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--mode prompt` injects the tool surface into the prompt for backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model, so prefer native when the backend supports it.
**Function-calling capability.** `--backend-capability native` (default) uses the backend's chat-template tool-calling and is the smoother default for Claude Code's heavy multi-turn tool use. `--backend-capability prompt` injects the tool surface into the prompt for llama.cpp/llamafile backends without a tool-calling template; whether a model stays coherent across multi-turn tool results in prompt mode varies by model — and tends to degrade on more complex, multi-step interactions — so prefer native whenever the backend supports it. The capability is declared at startup and frozen.

**Downstream protocol.**

46 changes: 46 additions & 0 deletions docs/decisions/012-openai-proxy.md
@@ -104,6 +104,52 @@ The proxy fully buffers each response from the backend before deciding what to d
4. **Client disconnect handling** -- detect TCP drop, cancel in-flight backend request, release inference lock.
5. **Testing** -- unit tests for extraction, integration tests with mock backend, smoke test with real llama-server.

### Revision: native-first, with opt-in prompt capability

The proxy is **native-first**. By default (`--backend-capability native`) it
targets backends that speak the native OpenAI tools API (llama.cpp with a
tool-calling chat template / `--jinja`, vLLM, Ollama, Anthropic) and forwards
the client's request verbatim (below).

Prompt-injection is available as an **explicit opt-in**
(`--backend-capability prompt`, llama.cpp/llamafile only) for non-FC backends —
it reuses the WorkflowRunner's prompt path (`build_tool_prompt`,
`_downgrade_messages`, `extract_tool_call`) so there is **one** prompt
implementation, not a proxy-specific fork. The capability is **declared once at
construction and frozen** — there is deliberately **no `mode="auto"` runtime
probe** (the old auto/HTTP-error fallback that mutated state mid-request was the
root of the original tangle; it is not reintroduced). In prompt capability the
verbatim passthrough is suppressed (`native_passthrough=False`): tools are
serialized into the prompt, so a raw native transcript would be meaningless.

History: this revision originally cut prompt mode from the proxy entirely
("native-only"). Prompt was then re-added as the opt-in capability above —
native-first is a cleaner story than a backwards-incompatible drop, and non-FC
backends (e.g. llamafile) stay usable through the proxy.

Rationale: the proxy is a transparent layer for an external agent that already
speaks native FC to a native-FC backend. A traced capture showed the native
path forwards the client's request byte-for-byte. The earlier eval regression
(prompt-mode proxy underperforming) was a prompt-injection artifact on an
FC-capable backend, not proxy overhead.

To preserve that transparency, the proxy forwards the client's **verbatim
OpenAI `tools` and `messages`** to the backend on the clean first attempt
(`raw_openai_tools` / `raw_openai_messages`), bypassing the lossy
`ToolSpec.from_json_schema` → `format_tool` round-trip that dropped schema
detail and leaked empty tool names. The parsed `ToolSpec` list is kept only as
forge's validation sidecar. On any forge mutation (retry / compaction / context
warning) the proxy falls back to the folded/serialized form — see the
`use_raw_messages` gate in `run_inference`, which mirrors the ADR-015
`inbound_anthropic_body` drop-on-mutation logic.

The synthetic `respond` tool is **opt-in** (`--inject-respond-tool`, default
off): the proxy forwards the client's tools untouched unless asked to inject it.

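If a backend lacking native FC is placed behind the proxy under the default
`native` capability, the proxy degrades to passing the model's text through;
there is no auto-downgrade to prompt-injection. **Bring an FC-capable
backend**, or opt into `--backend-capability prompt` on llama.cpp/llamafile.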

### What this is NOT

- **Not a model server.** Forge sits in front of one.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "forge-guardrails"
-version = "0.7.2"
version = "0.7.3"
description = "A reliability layer for self-hosted LLM tool-calling. Guardrails, context management, and backend adapters for multi-step agentic workflows."
requires-python = ">=3.12"
license = "MIT"
5 changes: 5 additions & 0 deletions src/forge/clients/anthropic.py
@@ -288,6 +288,7 @@ async def send(
sampling: dict[str, Any] | None = None,
passthrough: dict[str, Any] | None = None,
inbound_anthropic_body: dict[str, Any] | None = None,
raw_openai_tools: list[dict[str, Any]] | None = None,
) -> LLMResponse:
"""Send messages via the Anthropic Messages API.

@@ -296,6 +297,8 @@ async def send(
forge. ``passthrough`` merges inbound-body extras into the SDK call.
``inbound_anthropic_body`` (path 1) triggers verbatim emit — see
ADR-015 for the cache_control preservation rationale.
``raw_openai_tools`` is accepted for protocol symmetry but ignored
(Anthropic uses its own tool conversion).
"""
if sampling:
log.debug(
@@ -327,12 +330,14 @@ async def send_stream(
sampling: dict[str, Any] | None = None,
passthrough: dict[str, Any] | None = None,
inbound_anthropic_body: dict[str, Any] | None = None,
raw_openai_tools: list[dict[str, Any]] | None = None,
) -> AsyncIterator[StreamChunk]:
"""Stream via the Anthropic Messages API.

``sampling`` is accepted for protocol symmetry but ignored.
``passthrough`` merges inbound-body extras into the SDK call.
``inbound_anthropic_body`` (path 1) triggers verbatim emit; see ADR-015.
``raw_openai_tools`` is accepted for protocol symmetry but ignored.
"""
if sampling:
log.debug(
14 changes: 14 additions & 0 deletions src/forge/clients/base.py
@@ -9,6 +9,12 @@

from forge.core.workflow import LLMResponse, ToolCall, TextResponse, ToolSpec

# Verbatim OpenAI-shape payloads forwarded by the proxy. The proxy hands the
# backend client the caller's original ``tools`` array so the backend sees the
# exact schema the caller authored, instead of forge's reconstructed ToolSpec.
RawOpenAITools = list[dict[str, Any]]
RawOpenAIMessages = list[dict[str, Any]]


@dataclass(frozen=True)
class TokenUsage:
@@ -86,6 +92,7 @@ async def send(
sampling: dict[str, Any] | None = None,
passthrough: dict[str, Any] | None = None,
inbound_anthropic_body: dict[str, Any] | None = None,
raw_openai_tools: RawOpenAITools | None = None,
) -> LLMResponse:
"""Send messages and return a parsed response.

@@ -116,6 +123,11 @@ async def send(
forge-mutation (retry / compaction / context warning) so
only the clean first-attempt call rides verbatim. Other
clients accept and ignore. See ADR-015.
raw_openai_tools: Proxy-only — the client's verbatim OpenAI
``tools`` array. When set, LlamafileClient's native path sends
it as-is instead of re-emitting ``format_tool(spec)``, so the
backend sees the original schema (no name/schema drift). Other
clients accept and ignore.
"""
...

@@ -126,6 +138,7 @@ async def send_stream(
sampling: dict[str, Any] | None = None,
passthrough: dict[str, Any] | None = None,
inbound_anthropic_body: dict[str, Any] | None = None,
raw_openai_tools: RawOpenAITools | None = None,
) -> AsyncIterator[StreamChunk]:
"""Send messages and yield streaming chunks.

@@ -143,6 +156,7 @@ async def send_stream(
Per-call values win over instance state without mutating self.
passthrough: Optional inbound-body extras dict (see ``send``).
inbound_anthropic_body: Optional path-1 verbatim body (see ``send``).
raw_openai_tools: Optional verbatim OpenAI tools array (see ``send``).
"""
...
