Skip to content

feat(server): card-driven thinking control + reasoning_content channel + /props schema-4#341

Open
easel wants to merge 3 commits into
Luce-Org:mainfrom
easel:feat/server-thinking-control
Open

feat(server): card-driven thinking control + reasoning_content channel + /props schema-4#341
easel wants to merge 3 commits into
Luce-Org:mainfrom
easel:feat/server-thinking-control

Conversation

@easel
Copy link
Copy Markdown
Collaborator

@easel easel commented Jun 3, 2026

Summary

Card-driven thinking control for the server, split across three load-bearing surfaces:

  • chat_template extends the hardcoded renderer to return a PromptRenderResult with a started_in_thinking provenance flag (Qwen3.6 / Laguna enable_thinking paths pre-open <think> from the prompt suffix), and adds the same provenance sniff to the Jinja path.
  • sse_emitter routes Qwen3.6 / Laguna think-mode output to the reasoning_content SSE channel for both OpenAI-chat and Anthropic-messages formats — reasoning never leaks into content. Includes Pattern-B call:<verb>{ streaming hooks that ride on top of feat(server): plain-text call:<verb>{} tool parsing (Gemma4) #340's plain-text tool parser.
  • /props is bumped to schema 4 with new top-level build (git_sha / image_tag / build_time), model.target / model.draft (GGUF identity: size_bytes + sha256 + header fields), and host (verbatim pass-through of /opt/lucebox-hub/HOST_INFO) blocks. http_server.cpp is the source of truth for the new payload; server_main.cpp only adds the IMAGE_INFO / HOST_INFO file loaders and a GgufMetadata → JSON helper.

Matching model_card and spec updates land alongside.

Files

  • server/src/server/chat_template.{h,cpp}PromptRenderResult + started_in_thinking provenance (hardcoded path + Jinja suffix sniff)
  • server/src/server/sse_emitter.{h,cpp} — reasoning_content channel routing for think-mode + Pattern-B call:<verb>{ streaming hooks
  • server/src/server/http_server.{h,cpp}/props schema-4 payload, ParsedRequest.started_in_thinking routing into SseEmitter initial mode, finish_details for thinking
  • server/src/server/server_main.cppIMAGE_INFO / HOST_INFO file loaders and gguf_metadata_to_json() helper (no new CLI flags — soft-close / debug-thinking-logits flags live in feat(server): soft-close thinking termination (qwen35 + gemma4) #339)
  • server/test/test_server_unit.cpp — coverage spanning chat_template provenance, /props schema 4, and reasoning_content routing
  • docs/specs/openapi-props.yaml, docs/specs/props-endpoint.md, docs/specs/thinking-budget.md — spec updates
  • share/model_cards/_schema.json, share/model_cards/qwen3.6-27b.json, share/model_cards/laguna-xs.2.json — card schema + Qwen3.6 / Laguna think-mode opt-ins

Dependencies

This PR carries hard semantic dependencies on three sibling split PRs. The chain should land in order.

Also worth noting: #339 (server-soft-close) earlier merged the BudgetHook::close_token_ids / hard_limit_remaining fields this PR's http_server.cpp consumes at the close-token boundary. That is not a hard build dep here (the fields are already on origin/main), but the --think-soft-close-* and --debug-thinking-logits CLI flags live in #339, not in this PR. close_kind in this PR reports natural / hard only; the soft branch lands with #339's soft_forced_close field.

Risk

Medium — touches request handling and SSE streaming on the hot path. Pattern-B streaming logic interleaves with the existing StreamMode::REASONING state machine and the tool-buffer accumulator; regressions here surface as either reasoning leaking into content or tool calls being dropped mid-stream.

Test plan

  • test_server_unit builds and passes on this branch against the gguf-inspect base
  • /props against a running server returns schema 4 with build, model.target, model.draft, host keys
  • Qwen3.6 / Laguna thinking output streams on the reasoning_content SSE channel, never on content
  • Anthropic-messages format places think-mode output in thinking blocks (not text)
  • Thinking-off (card-driven gate) skips the <think> preamble and emits user-visible content immediately
  • Pattern-B call:<verb>{ streaming detects the plain-text tool open and buffers correctly (requires feat(server): plain-text call:<verb>{} tool parsing (Gemma4) #340 on this branch's base)

Note: server-split PRs can't be validated as a fully linked binary standalone because CMake cross-references shared headers (common/model_backend.h, etc.) that evolve in sibling PRs. The unit tests in this PR are self-contained; full integration requires the gguf-inspect PR + #336 + #340 to land for the matching runtime behavior.

🤖 Generated with Claude Code

@easel easel force-pushed the feat/server-thinking-control branch 2 times, most recently from ef5e9b3 to 268301f Compare June 4, 2026 03:37
@easel easel force-pushed the feat/server-thinking-control branch from 268301f to 3f4a7d2 Compare June 4, 2026 04:54
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
## What

Backend mechanism for soft-close thinking termination via a logit-ratio
peek with split probe/inject ids, a min_tokens floor, and max_tokens
treated as a response-only budget while thinking is active.

- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the
  probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under
  docs/experiments/.

server_main CLI flags for this mechanism ship in the thinking-control
API PR (Luce-Org#341).

## Why

Lets a model finish its reasoning gracefully rather than being hard-cut
when a thinking budget is hit, which preserves answer quality on long
reasoning traces.

## Dependencies

- Luce-Org#336 (server-layer-split): #includes qwen35/c2_gate.h and uses BudgetHook
  fa_window_override + c2_spec_decode_permitted from the shared backend
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
## What

Extends server/src/server/tool_parser.{cpp,h} to parse Gemma's
plain-text call:<verb>{} emissions (also accepts the \`\`_call:\`\`
tokenizer-artifact prefix) and render them as Anthropic tool_use +
tool_result blocks. Isolated to tool_parser; the streaming detection
hook in sse_emitter ships with Luce-Org#341. Adds 364 lines of C++ unit
coverage in test_server_unit.cpp plus the call-verb parser plan and
Gemma4-26B parser-fix writeup.

## Why

Gemma4 emits tool calls as plain-text call:<verb>{...} rather than
structured JSON, which breaks the existing Anthropic tool_use pipeline
on agentic workloads. This parser closes that gap so Gemma4 can drive
coding-agent loops end-to-end.

## Dependencies

None - this PR is independent.
@easel easel force-pushed the feat/server-thinking-control branch from 3f4a7d2 to 0bc9580 Compare June 4, 2026 05:03
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
## What

Backend mechanism for soft-close thinking termination via a logit-ratio
peek with split probe/inject ids, a min_tokens floor, and max_tokens
treated as a response-only budget while thinking is active.

- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the
  probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under
  docs/experiments/.

server_main CLI flags for this mechanism ship in the thinking-control
API PR (Luce-Org#341).

## Why

Lets a model finish its reasoning gracefully rather than being hard-cut
when a thinking budget is hit, which preserves answer quality on long
reasoning traces.

## Attribution

server/src/qwen35/c2_gate.h is included here to keep this branch
self-compiling. It was originally authored by @dusterbloom in PR Luce-Org#274
(commit 52fea3b) — a 31-line pure predicate extracted from
qwen35_backend.cpp for testability. Co-author trailer below.

Co-Authored-By: dusterbloom <32869278+dusterbloom@users.noreply.github.com>
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
## What

Extends server/src/server/tool_parser.{cpp,h} to parse Gemma's
plain-text call:<verb>{} emissions (also accepts the \`\`_call:\`\`
tokenizer-artifact prefix) and render them as Anthropic tool_use +
tool_result blocks. Isolated to tool_parser; the streaming detection
hook in sse_emitter ships with Luce-Org#341. Adds 364 lines of C++ unit
coverage in test_server_unit.cpp plus the call-verb parser plan and
Gemma4-26B parser-fix writeup.

## Why

Gemma4 emits tool calls as plain-text call:<verb>{...} rather than
structured JSON, which breaks the existing Anthropic tool_use pipeline
on agentic workloads. This parser closes that gap so Gemma4 can drive
coding-agent loops end-to-end.

## Dependencies

None - this PR is independent.
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
…dmission)

PR Luce-Org#341 decomposition: keep only Erik's OWN work, drop pieces traced to
external/closed PRs in the provenance audit.

Excised (borrowed from dusterbloom):
- chat_template Jinja closed-think prefill block + arch_hint=ChatFormat::QWEN3
  parameter (closed PR Luce-Org#293).
- 5 test_jinja_render_* tests covering the prefill behavior
  (test_jinja_render_qwen3_closes_think_when_thinking_off,
   test_jinja_render_does_not_close_think_when_thinking_on,
   test_jinja_render_does_not_close_think_for_non_qwen3_arch,
   test_jinja_render_does_not_double_append_close_think,
   test_chat_format_for_arch_qwen35moe_returns_qwen3).
- check_admission() pure helper + keep-ratio guard in http_server (PR Luce-Org#274).
- gen_req.fa_window_override assignment (post-pflash widen window, PR Luce-Org#274).
- 8 test_admission_* tests.

Fixes:
- http_server.cpp close_kind: drop the "soft" branch that referenced the
  non-existent result.soft_forced_close field (soft-close lives on a
  sibling PR). Collapse to the natural/hard split that the actual
  GenerateResult.budget_forced_close field supports.
- Restore the original single-line pre-handle context check
  (prompt_tokens + max_output > max_ctx) in place of the removed
  check_admission pre-compression gate.

Erik's OWN surfaces preserved:
- started_in_thinking plumbing (chat_template, http_server, sse_emitter)
- /props schema-4 (IMAGE_INFO / HOST_INFO loaders, build / host / model blocks)
- call:verb SSE hooks (looks_like_plain_text_call, Pattern B, tool_open
  plain-text), close_kind soft/hard/natural refactor (now natural/hard
  only here), Anthropic tool_use+tool_result normalization
- docs/specs (props-endpoint, thinking-budget)
- model_cards _schema.json, qwen3.6-27b, laguna-xs.2 think-mode opt-ins
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
## What

Backend mechanism for soft-close thinking termination via a logit-ratio
peek with split probe/inject ids, a min_tokens floor, and max_tokens
treated as a response-only budget while thinking is active.

- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the
  probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under
  docs/experiments/.

server_main CLI flags for this mechanism ship in the thinking-control
API PR (Luce-Org#341).

## Why

Lets a model finish its reasoning gracefully rather than being hard-cut
when a thinking budget is hit, which preserves answer quality on long
reasoning traces.
@easel easel marked this pull request as ready for review June 4, 2026 19:18
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 issues found across 14 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/src/server/chat_template.cpp Outdated
Comment thread docs/specs/openapi-props.yaml Outdated
Comment thread server/src/server/chat_template.cpp Outdated
Comment thread docs/specs/openapi-props.yaml
Comment thread server/src/server/server_main.cpp Outdated
Comment thread server/src/server/sse_emitter.cpp
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026
- chat_template.cpp: drop `enable_thinking &&` gate on the Jinja-path
  suffix-sniff for `started_in_thinking`. The rendered prompt's tail
  is the source of truth: if the template hardcodes `<think>` despite
  enable_thinking=false (custom template, model-card mismatch) we
  still need to route the first generated tokens to the reasoning
  channel — otherwise reasoning leaks into content. Only WARN on the
  mismatch case (sniff=true, enable_thinking=false) so the normal
  success path stops spamming stderr. Adds two regression tests
  covering the override and add_generation_prompt requirement.

- openapi-props.yaml: bump `info.version` and `x-props-schema` to 4
  (matches C++ `kPropsSchema = 4` and props-endpoint.md). Add the
  schema-4 top-level `host` block to `PropsResponse` (required +
  property) and define the full `Host` component schema mirroring
  §4.17 of props-endpoint.md (os_pretty, kernel, wsl_version,
  docker_version, nvidia_driver, nvidia_ctk_version, cpu_model,
  nproc, ram_gb, gpus[], cuda_visible_devices, source, collector,
  collected_at). Update example payloads and props_schema examples
  in the Build / Server component schemas. Add a host block to the
  qwen36 example.

- server_main.cpp: factor a `json_str_or_none()` lambda that uses
  `json::find() + is_string()` instead of `json::value<std::string>()`
  for the image_info and host_info startup logging. `value<T>()`
  throws `type_error` when the key exists but isn't a string (e.g.
  `"kernel": null`), which crashes server startup. Both blocks come
  from the same operator-provided env path via entrypoint.sh, so the
  defensive lookup is shared.

- sse_emitter.cpp: in the TOOL_BUFFER success branch, also fold
  `parsed.cleaned_text` into `responses_streamed_text` for the
  Pattern A (XML envelope) case. The snapshot was taken before the
  parser ran, so any cleaned text emitted as a content delta after
  parsing was missing from the Responses-format .done/.completed
  payloads (client's accumulated buffer disagreed with the server's
  final text). Skip the append for Pattern B (plain-text `call:`)
  since its snapshot already contains the full raw buffer, a
  superset of cleaned_text — appending would double-count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026
…easoning_content channel + /props schema-4
…ps schema-4

User-facing thinking-control API across the HTTP server surface:

- chat_template prefills a closed <think> block when thinking is off
  (Qwen3-gated) so the model skips the reasoning preamble without
  losing the assistant turn.
- http_server bumps /props schema 2 -> 4, adding build / model.target /
  model.draft / host blocks for client introspection.
- server_main adds --debug-thinking-logits and --think-soft-close-*
  flags plus image/host-info loaders for card-driven boot.
- sse_emitter routes Qwen3.6/Laguna think-mode output to the
  reasoning_content channel so reasoning never leaks into the
  user-visible content stream (Pattern-B call-verb streaming hook).
- Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards,
  the /props OpenAPI doc, updated thinking-budget spec, and the
  thinking-control protocol/mechanism experiments.
- test_server_unit gets matching coverage (~1100 lines) for prefill,
  /props schema-4, and reasoning_content routing.

Gives clients a single, card-driven API to control thinking budgets,
soft-close behavior, and reasoning visibility - and an introspectable
/props surface to discover what the server supports.

- Luce-Org#336 (server-layer-split): CMake/build references
- Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio +
  pflash_on contracts
- Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks
  rely on tool_parser changes from Luce-Org#340
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026
Backend mechanism for soft-close thinking termination via a logit-ratio
peek with split probe/inject ids, a min_tokens floor, and max_tokens
treated as a response-only budget while thinking is active.

- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the
  probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under
  docs/experiments/.

server_main CLI flags for this mechanism ship in the thinking-control
API PR (Luce-Org#341).

Lets a model finish its reasoning gracefully rather than being hard-cut
when a thinking budget is hit, which preserves answer quality on long
reasoning traces.
easel and others added 2 commits June 5, 2026 16:05
…dmission)

PR Luce-Org#341 decomposition: keep only Erik's OWN work, drop pieces traced to
external/closed PRs in the provenance audit.

Excised (borrowed from dusterbloom):
- chat_template Jinja closed-think prefill block + arch_hint=ChatFormat::QWEN3
  parameter (closed PR Luce-Org#293).
- 5 test_jinja_render_* tests covering the prefill behavior
  (test_jinja_render_qwen3_closes_think_when_thinking_off,
   test_jinja_render_does_not_close_think_when_thinking_on,
   test_jinja_render_does_not_close_think_for_non_qwen3_arch,
   test_jinja_render_does_not_double_append_close_think,
   test_chat_format_for_arch_qwen35moe_returns_qwen3).
- check_admission() pure helper + keep-ratio guard in http_server (PR Luce-Org#274).
- gen_req.fa_window_override assignment (post-pflash widen window, PR Luce-Org#274).
- 8 test_admission_* tests.

Fixes:
- http_server.cpp close_kind: drop the "soft" branch that referenced the
  non-existent result.soft_forced_close field (soft-close lives on a
  sibling PR). Collapse to the natural/hard split that the actual
  GenerateResult.budget_forced_close field supports.
- Restore the original single-line pre-handle context check
  (prompt_tokens + max_output > max_ctx) in place of the removed
  check_admission pre-compression gate.

Erik's OWN surfaces preserved:
- started_in_thinking plumbing (chat_template, http_server, sse_emitter)
- /props schema-4 (IMAGE_INFO / HOST_INFO loaders, build / host / model blocks)
- call:verb SSE hooks (looks_like_plain_text_call, Pattern B, tool_open
  plain-text), close_kind soft/hard/natural refactor (now natural/hard
  only here), Anthropic tool_use+tool_result normalization
- docs/specs (props-endpoint, thinking-budget)
- model_cards _schema.json, qwen3.6-27b, laguna-xs.2 think-mode opt-ins
- chat_template.cpp: drop `enable_thinking &&` gate on the Jinja-path
  suffix-sniff for `started_in_thinking`. The rendered prompt's tail
  is the source of truth: if the template hardcodes `<think>` despite
  enable_thinking=false (custom template, model-card mismatch) we
  still need to route the first generated tokens to the reasoning
  channel — otherwise reasoning leaks into content. Only WARN on the
  mismatch case (sniff=true, enable_thinking=false) so the normal
  success path stops spamming stderr. Adds two regression tests
  covering the override and add_generation_prompt requirement.

- openapi-props.yaml: bump `info.version` and `x-props-schema` to 4
  (matches C++ `kPropsSchema = 4` and props-endpoint.md). Add the
  schema-4 top-level `host` block to `PropsResponse` (required +
  property) and define the full `Host` component schema mirroring
  §4.17 of props-endpoint.md (os_pretty, kernel, wsl_version,
  docker_version, nvidia_driver, nvidia_ctk_version, cpu_model,
  nproc, ram_gb, gpus[], cuda_visible_devices, source, collector,
  collected_at). Update example payloads and props_schema examples
  in the Build / Server component schemas. Add a host block to the
  qwen36 example.

- server_main.cpp: factor a `json_str_or_none()` lambda that uses
  `json::find() + is_string()` instead of `json::value<std::string>()`
  for the image_info and host_info startup logging. `value<T>()`
  throws `type_error` when the key exists but isn't a string (e.g.
  `"kernel": null`), which crashes server startup. Both blocks come
  from the same operator-provided env path via entrypoint.sh, so the
  defensive lookup is shared.

- sse_emitter.cpp: in the TOOL_BUFFER success branch, also fold
  `parsed.cleaned_text` into `responses_streamed_text` for the
  Pattern A (XML envelope) case. The snapshot was taken before the
  parser ran, so any cleaned text emitted as a content delta after
  parsing was missing from the Responses-format .done/.completed
  payloads (client's accumulated buffer disagreed with the server's
  final text). Skip the append for Pattern B (plain-text `call:`)
  since its snapshot already contains the full raw buffer, a
  superset of cleaned_text — appending would double-count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@easel easel force-pushed the feat/server-thinking-control branch from 372d6f3 to b3990d0 Compare June 5, 2026 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant