WIP: Stream reasoning tokens for OpenAI reasoning models #2778
onmete wants to merge 2 commits into openshift:main
Description
Draft/WIP reference implementation for streaming reasoning tokens (chain-of-thought) from OpenAI reasoning models (GPT-5, o-series). This enables the UI to show "thinking" progress and preserves reasoning context between tool-calling rounds.
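To make the round structure concrete, here is a minimal sketch of a tool-calling loop that keeps reasoning in context between rounds and terminates explicitly when a round produces no tool calls. All names are hypothetical illustrations, not the actual OLS code.

```python
# Hypothetical sketch of a tool-calling loop: each round's full response
# (reasoning + text + tool calls) is appended back into the context, and the
# loop breaks explicitly when no tool calls remain.
def run_tool_calling_loop(invoke_model, execute_tool, messages, max_rounds=5):
    """Run model rounds, executing tool calls until none remain."""
    for _ in range(max_rounds):
        response = invoke_model(messages)      # dict with "content"/"tool_calls"
        messages.append(response)              # keep reasoning in context
        tool_calls = response.get("tool_calls") or []
        if not tool_calls:                     # explicit termination check,
            break                              # not finish_reason-based
        for call in tool_calls:
            messages.append({"role": "tool", "content": execute_tool(call)})
    return messages
```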
Key architectural decisions
- **Responses API required** — The default Chat Completions API does not expose reasoning tokens for GPT-5 (they are processed server-side and invisible). Switching to the Responses API (`output_version="responses/v1"`) with `reasoning={"effort": "low", "summary": "auto"}` makes reasoning summaries available as content blocks.
- **Reasoning must be passed between tool-calling rounds** — Per OpenAI's guidance, reasoning items should be kept in context between rounds within a single request. Without this, the model re-reasons from scratch each round, producing repetitive, verbose output. The fix accumulates all `AIMessageChunk`s per round and builds the inter-round `AIMessage` with full content (reasoning + text + tool calls) instead of `content=""`.
- **Reasoning is ephemeral — not cached across requests** — Reasoning context is only relevant within a single request's tool-calling loop. It is NOT stored in the conversation cache between separate question/answer pairs (the cache only stores the final text response).
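The per-round accumulation can be sketched with a toy stand-in for `AIMessageChunk` (langchain chunks merge with `+` in a similar spirit; this is a simplified, hypothetical model, not the PR's code):

```python
# Toy stand-in for streamed message chunks: merging preserves reasoning
# blocks, text blocks, and tool calls, so the inter-round message carries
# full content instead of content="".
from dataclasses import dataclass, field

@dataclass
class Chunk:
    content: list = field(default_factory=list)     # reasoning/text blocks
    tool_calls: list = field(default_factory=list)

    def __add__(self, other):
        return Chunk(self.content + other.content,
                     self.tool_calls + other.tool_calls)

def accumulate(chunks):
    """Merge streamed chunks into one message for the next round's context."""
    merged = None
    for c in chunks:
        merged = c if merged is None else merged + c
    return merged
```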
- **New `StreamedChunk(type="reasoning")` and SSE `event: reasoning`** — Reasoning summaries are yielded as a new chunk type, streamed as `event: reasoning` in JSON mode. In `text/plain` mode, reasoning text is output directly.
- **Loop termination fix for Responses API** — The Chat Completions API uses `finish_reason="stop"` for text-only completions and `finish_reason="tool_calls"` for tool calls. The Responses API uses `chunk_position="last"` for ALL completions indiscriminately, so it cannot be used for early stop detection. Instead, the tool-calling loop now explicitly breaks when no tool calls are present after a round.
- **Token counter resilience** — `GenericTokenCounter.on_llm_new_token` now handles non-string tokens (the Responses API can send structured content objects).

Open items / not yet done
- Tune `reasoning.effort` and `verbosity` levels — `"low"` may be too terse, `"medium"` too verbose
- Decide whether `summary: "auto"`, `"concise"`, or `"detailed"` is optimal
- Evaluate `use_previous_response_id=True` as an alternative to manual message accumulation

Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
Manually tested with GPT-5 via `curl` against a local OLS instance:

- `text/plain` mode outputs reasoning text directly
- `application/json` mode produces `event: reasoning` SSE events

Made with Cursor
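For illustration, a minimal client-side parser for such an SSE stream (a hypothetical helper, not part of the PR; only the `event: reasoning` name comes from this change, the `token` event name below is an assumption):

```python
# Hypothetical parser for SSE frames of the form
# "event: reasoning\ndata: {...}\n\n", yielding (event, data) pairs.
import json

def parse_sse(raw: str):
    """Yield (event, data) tuples from a raw SSE payload."""
    for frame in raw.strip().split("\n\n"):
        event, data = "message", None
        for line in frame.splitlines():
            if line.startswith("event: "):
                event = line[len("event: "):]
            elif line.startswith("data: "):
                data = json.loads(line[len("data: "):])
        yield event, data
```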