
WIP: Stream reasoning tokens for OpenAI reasoning models#2778

Draft
onmete wants to merge 2 commits into openshift:main from onmete:wip/reasoning-token-streaming

Conversation


@onmete onmete commented Feb 26, 2026

Description

Draft/WIP reference implementation for streaming reasoning tokens (chain-of-thought) from OpenAI reasoning models (GPT-5, o-series). This enables the UI to show "thinking" progress and preserves reasoning context between tool-calling rounds.

Key architectural decisions

  1. Responses API required — The default Chat Completions API does not expose reasoning tokens for GPT-5 (they are processed server-side and invisible). Switching to the Responses API (output_version="responses/v1") with reasoning={"effort": "low", "summary": "auto"} makes reasoning summaries available as content blocks.

  2. Reasoning must be passed between tool-calling rounds — Per OpenAI's guidance, reasoning items should be kept in context between rounds within a single request. Without this, the model re-reasons from scratch each round, producing repetitive verbose output. The fix accumulates all AIMessageChunks per round and builds the inter-round AIMessage with full content (reasoning + text + tool calls) instead of content="".

  3. Reasoning is ephemeral — not cached across requests — Reasoning context is only relevant within a single request's tool-calling loop. It is NOT stored in the conversation cache between separate question/answer pairs (the cache only stores the final text response).

  4. New StreamedChunk(type="reasoning") and SSE event: reasoning — Reasoning summaries are yielded as a new chunk type, streamed as event: reasoning in JSON mode. In text/plain mode, reasoning text is output directly.

  5. Loop termination fix for Responses API — The Chat Completions API signals completion type via finish_reason ("stop" for text-only completions, "tool_calls" for tool calls). The Responses API instead emits chunk_position="last" for every completion, regardless of whether tool calls are present, so it cannot be used for early stop detection. Instead, the tool-calling loop now explicitly breaks when no tool calls are present after a round.

  6. Token counter resilience — GenericTokenCounter.on_llm_new_token now handles non-string tokens (the Responses API can send structured content objects).
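Decisions 2, 5, and 6 can be sketched together as a small self-contained loop. This is a rough illustration only: Chunk, AIMessage, run_tool_loop, and safe_count are simplified stand-ins for LangChain's AIMessageChunk/AIMessage and the PR's actual code, not its real identifiers.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Stand-in for an AIMessageChunk from the Responses API stream."""
    content: list                      # content blocks: reasoning / text dicts
    tool_calls: list = field(default_factory=list)

@dataclass
class AIMessage:
    content: list
    tool_calls: list

def run_tool_loop(stream_round, execute_tool, max_rounds=5):
    """Accumulate full chunk content each round and feed it back as the
    inter-round AIMessage; break as soon as a round makes no tool calls."""
    history = []
    for _ in range(max_rounds):
        content, tool_calls = [], []
        for chunk in stream_round(history):
            content.extend(chunk.content)      # reasoning + text blocks
            tool_calls.extend(chunk.tool_calls)
        # Keep reasoning in context between rounds (full content, not content="")
        history.append(AIMessage(content=content, tool_calls=tool_calls))
        if not tool_calls:
            # Responses API gives no finish_reason="stop" signal,
            # so stop explicitly when nothing is left to call.
            break
        for call in tool_calls:
            history.append(execute_tool(call))
    return history

def safe_count(token) -> int:
    """Tolerate structured (non-string) tokens from the Responses API."""
    if isinstance(token, str):
        return len(token)
    return 0
```

The key detail is that the inter-round AIMessage carries the accumulated reasoning blocks forward, so the model does not re-reason from scratch on the next round.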

Open items / not yet done

  • Tuning reasoning.effort and verbosity levels — "low" may be too terse, "medium" too verbose
  • Unit tests for reasoning extraction and streaming
  • Integration tests with reasoning models
  • Evaluate whether summary: "auto" vs "concise" vs "detailed" is optimal
  • Config-driven reasoning parameters (per-model or per-provider) instead of hardcoded defaults
  • Consider use_previous_response_id=True as an alternative to manual message accumulation
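For reference, the Responses API switch from decision 1 (with the hardcoded defaults the last bullets would make configurable) might look roughly like this with langchain-openai's ChatOpenAI. The parameter names come from recent langchain-openai releases; the model name and effort/summary values are assumptions, not the PR's exact settings.

```python
from langchain_openai import ChatOpenAI

# Sketch only: Chat Completions hides GPT-5 reasoning tokens, so the
# Responses API is enabled and reasoning summaries are requested.
llm = ChatOpenAI(
    model="gpt-5",                     # assumed model name
    use_responses_api=True,            # switch off Chat Completions
    output_version="responses/v1",     # reasoning arrives as content blocks
    reasoning={"effort": "low", "summary": "auto"},  # hardcoded in this WIP
)
```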

Type of change

  • New feature

Related Tickets & Documents

  • Reference implementation for reasoning token extraction design

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

Manually tested with GPT-5 via curl against a local OLS instance:

  • Verified tool-calling flow completes (model calls tool, gets result, produces concise answer, stops)
  • Verified reasoning is not cached across requests
  • Verified reasoning is passed between tool-calling rounds (no repetitive looping)
  • Verified text/plain mode outputs reasoning text directly
  • Verified application/json mode produces event: reasoning SSE events
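The two output modes verified above can be sketched as a small serializer. StreamedChunk and format_chunk here are illustrative stand-ins for the PR's actual schema, and the non-reasoning event name "token" is an assumption.

```python
import json
from dataclasses import dataclass

@dataclass
class StreamedChunk:
    type: str          # e.g. "reasoning" or "text"
    text: str

def format_chunk(chunk: StreamedChunk, media_type: str) -> str:
    """Render one streamed chunk for the response body.

    application/json -> SSE frame with a named event for reasoning chunks;
    text/plain       -> raw text, reasoning output directly inline."""
    if media_type == "text/plain":
        return chunk.text
    event = "reasoning" if chunk.type == "reasoning" else "token"
    payload = json.dumps({"token": chunk.text})
    return f"event: {event}\ndata: {payload}\n\n"
```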

Made with Cursor

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2026

openshift-ci bot commented Feb 26, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Feb 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign onmete for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Feb 27, 2026

@onmete: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/ols-evaluation
Commit: 2d14808
Required: true
Rerun command: /test ols-evaluation

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@onmete onmete force-pushed the wip/reasoning-token-streaming branch from b8de481 to 782b4ac Compare March 12, 2026 14:51
@onmete onmete force-pushed the wip/reasoning-token-streaming branch from 5385aa1 to 31c8af5 Compare March 13, 2026 11:05