feat(backend): surface self-hosted reasoning via @ai-sdk/openai-compatible #262

Open
MesoX wants to merge 2 commits into willdady:main from MesoX:feature/vllm-reasoning

MesoX commented Jun 15, 2026

Contributor

Summary

Chat-completions mode (apiMode=chat) on an OpenAI provider is, in practice, used only against self-hosted or company-run OpenAI-compatible servers (vLLM, SGLang, llama.cpp, TGI, LM Studio); real OpenAI is driven via the Responses API. Those servers expose the model's thinking in a reasoning_content field, but @ai-sdk/openai's chat model parses only role/content/tool_calls and silently drops that field, so the existing reasoning UI never received anything from them.

This routes chat-mode language models through @ai-sdk/openai-compatible, which reads reasoning_content (and reasoning) natively and emits reasoning stream parts — rendered by the existing collapsible "Thinking…" block. No frontend changes.
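For readers unfamiliar with the field, here is a minimal sketch of what "reading reasoning_content and emitting reasoning parts" amounts to. It is not the @ai-sdk/openai-compatible implementation: the `ChatDelta` and `StreamPart` types and the `deltaToParts` helper are hypothetical names invented for illustration, and the sample deltas are made up.

```typescript
// Illustrative only: NOT the @ai-sdk/openai-compatible source.
// Shape of one streamed chat-completions delta from a self-hosted server.
// `reasoning_content` and `reasoning` are the field names mentioned above.
interface ChatDelta {
  role?: string;
  content?: string | null;
  reasoning_content?: string | null;
  reasoning?: string | null;
}

// Hypothetical part type standing in for the SDK's stream parts.
type StreamPart =
  | { type: "reasoning"; text: string }
  | { type: "text"; text: string };

// Convert one delta into zero or more parts, reasoning first.
function deltaToParts(delta: ChatDelta): StreamPart[] {
  const parts: StreamPart[] = [];
  const reasoning = delta.reasoning_content ?? delta.reasoning;
  if (reasoning) parts.push({ type: "reasoning", text: reasoning });
  if (delta.content) parts.push({ type: "text", text: delta.content });
  return parts;
}

// Made-up sample stream. A parser that reads only role/content/tool_calls
// would produce nothing for the first two deltas.
const sampleDeltas: ChatDelta[] = [
  { role: "assistant", reasoning_content: "The user asks for 2+2. " },
  { reasoning_content: "That is 4." },
  { content: "4" },
];

const sampleParts: StreamPart[] = [];
for (const delta of sampleDeltas) {
  sampleParts.push(...deltaToParts(delta));
}
```

In this sketch the first two deltas become reasoning parts and the last becomes a text part, which is the split the collapsible "Thinking…" block relies on.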

Changes

  • Add @ai-sdk/openai-compatible dependency (depends on the same @ai-sdk/provider@3 as the other adapters).
  • provider.ts: for OpenAI providers with apiMode==="chat", build the language model via createOpenAICompatible(...). Embeddings and native search tools stay on the OpenAI SDK. Responses mode is unchanged.
  • Set includeUsage: true so streamed turns carry stream_options.include_usage — without it self-hosted servers return no token usage on the streaming path (shows as In:0 / Out:0).
  • Tests covering chat→compatible routing and responses→OpenAI dispatch.
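The routing and usage changes above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in provider.ts: `pickLanguageModelBackend`, `buildChatRequestBody`, and the `ChatRequestBody` type are hypothetical names, and the model id in the usage lines is a placeholder. Only `apiMode`, `includeUsage`, and `stream_options.include_usage` come from the description.

```typescript
// Hypothetical names throughout; only apiMode, includeUsage and
// stream_options.include_usage are taken from the PR description.
type ApiMode = "chat" | "responses";
type Backend = "openai-compatible" | "openai";

// Chat mode goes to the compatible adapter; Responses mode stays on the OpenAI SDK.
function pickLanguageModelBackend(apiMode: ApiMode): Backend {
  return apiMode === "chat" ? "openai-compatible" : "openai";
}

interface ChatRequestBody {
  model: string;
  stream: boolean;
  stream_options?: { include_usage: boolean };
}

// Sketch of the effect of includeUsage on an outgoing chat-completions body:
// the usage option is attached only to streaming requests.
function buildChatRequestBody(
  model: string,
  stream: boolean,
  includeUsage: boolean,
): ChatRequestBody {
  const body: ChatRequestBody = { model, stream };
  if (stream && includeUsage) {
    body.stream_options = { include_usage: true };
  }
  return body;
}

// Usage with a placeholder model id.
const withUsage = buildChatRequestBody("placeholder-model", true, true);
const withoutUsage = buildChatRequestBody("placeholder-model", true, false);
```

Here `withUsage` carries `stream_options.include_usage` and `withoutUsage` does not, which corresponds to the In:0 / Out:0 symptom described above when the option is missing on the streaming path.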

Behavior notes

  • Real OpenAI in chat mode is unaffected — it returns no reasoning_content, so no reasoning parts are emitted.
  • supportsStructuredOutputs is left at its default (false): a requested JSON schema downgrades to json_object mode (the AI SDK still validates client-side), which is safe across servers that don't support json_schema response_format. This only affects the title/tag generation call.
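As a rough picture of the structured-output downgrade described above, the sketch below chooses a `response_format` based on a `supportsStructuredOutputs` flag and then validates the reply client-side. The helper names (`chooseResponseFormat`, `hasStringTitle`) and the `title` field are hypothetical; the real validation is done by the AI SDK and is not reproduced here.

```typescript
type JsonSchema = Record<string, unknown>;

// The two response_format shapes relevant to the downgrade.
type ResponseFormat =
  | { type: "json_schema"; json_schema: { name: string; schema: JsonSchema } }
  | { type: "json_object" };

// Hypothetical helper: with supportsStructuredOutputs=false the schema is not
// sent to the server and the request falls back to plain JSON mode.
function chooseResponseFormat(
  schema: JsonSchema,
  supportsStructuredOutputs: boolean,
  name: string = "response",
): ResponseFormat {
  return supportsStructuredOutputs
    ? { type: "json_schema", json_schema: { name, schema } }
    : { type: "json_object" };
}

// Stand-in for client-side validation: checks that the reply is a JSON object
// with a string `title` (an assumed field for a title-generation call).
function hasStringTitle(raw: string): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      typeof (parsed as { title?: unknown }).title === "string"
    );
  } catch {
    return false;
  }
}

const titleSchema: JsonSchema = {
  type: "object",
  properties: { title: { type: "string" } },
  required: ["title"],
};
const downgraded = chooseResponseFormat(titleSchema, false);
```

The trade-off in this sketch is that the server is no longer constrained by the schema, so a malformed reply is caught only after the fact by the client-side check.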

Testing

  • typecheck, lint, and the provider unit tests pass.
  • Verified end-to-end against a vLLM server (Qwen3, reasoning parser enabled): reasoning streams into the collapsible block and token usage populates.

🤖 Generated with Claude Code

frantisek.spacek@morosystems.cz and others added 2 commits June 15, 2026 21:18
Chat-completions mode (apiMode=chat) is used in practice only against
self-hosted/company OpenAI-compatible servers (vLLM, SGLang, llama.cpp,
TGI). Those expose model thinking in a `reasoning_content` field that
`@ai-sdk/openai`'s chat model silently drops, so the existing reasoning
UI never received it.

Route chat-mode language models through `@ai-sdk/openai-compatible`,
which reads `reasoning_content` natively and emits reasoning stream
parts. Embeddings and native search tools stay on the OpenAI SDK.
Responses mode is unchanged. Real OpenAI in chat mode is unaffected
(it returns no reasoning_content).

Leaves supportsStructuredOutputs at default false: a requested JSON
schema downgrades to json_object mode (AI SDK still validates
client-side) to stay safe across servers lacking json_schema support.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`@ai-sdk/openai-compatible` only sends `stream_options.include_usage`
when `includeUsage` is set, so self-hosted servers (vLLM, SGLang, …)
returned no token usage on the streaming path — surfacing as In:0/Out:0
in the UI. Non-streaming (e.g. compaction summarize) was unaffected.

Set `includeUsage: true` so streamed turns carry token counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>