feat(server): card-driven thinking control + reasoning_content channel + /props schema-4 by easel · Pull Request #341 · Luce-Org/lucebox-hub

easel · 2026-06-03T21:51:56Z

Summary

Card-driven thinking control for the server, split across three load-bearing surfaces:

chat_template extends the hardcoded renderer to return a PromptRenderResult with a started_in_thinking provenance flag (Qwen3.6 / Laguna enable_thinking paths pre-open <think> from the prompt suffix), and adds the same provenance sniff to the Jinja path.
sse_emitter routes Qwen3.6 / Laguna think-mode output to the reasoning_content SSE channel for both OpenAI-chat and Anthropic-messages formats, so reasoning never leaks into content. It also includes Pattern-B call:<verb>{ streaming hooks that build on the plain-text tool parser from #340 (feat(server): plain-text call:<verb>{} tool parsing (Gemma4)).
/props is bumped to schema 4 with new top-level build (git_sha / image_tag / build_time), model.target / model.draft (GGUF identity: size_bytes + sha256 + header fields), and host (verbatim pass-through of /opt/lucebox-hub/HOST_INFO) blocks. http_server.cpp is the source of truth for the new payload; server_main.cpp only adds the IMAGE_INFO / HOST_INFO file loaders and a GgufMetadata → JSON helper.

Matching model_card and spec updates land alongside.
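
To make the routing rule concrete, here is a minimal C++ sketch of how a started_in_thinking flag could seed the emitter's initial mode. It is not the PR's SseEmitter: Channel, Delta, route_output, openai_field and anthropic_block_type are hypothetical names, the literal <think> / </think> tags are an assumption, and it works on a complete string rather than a token stream, so it ignores tags split across chunks.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical names throughout; this is not the PR's SseEmitter.
enum class Channel { Reasoning, Content };

struct Delta {
    Channel channel;
    std::string text;
};

// Splits a complete model output into reasoning and content deltas.
// started_in_thinking mirrors the provenance flag: when the prompt already
// ends with an open <think>, the first generated text is reasoning.
std::vector<Delta> route_output(std::string_view output, bool started_in_thinking) {
    constexpr std::string_view kOpen = "<think>";
    constexpr std::string_view kClose = "</think>";
    std::vector<Delta> deltas;
    bool thinking = started_in_thinking;
    std::size_t pos = 0;
    while (pos < output.size()) {
        // Look for the tag that would flip the current mode.
        const std::string_view tag = thinking ? kClose : kOpen;
        const std::size_t hit = output.find(tag, pos);
        const std::size_t end = (hit == std::string_view::npos) ? output.size() : hit;
        if (end > pos) {
            deltas.push_back({thinking ? Channel::Reasoning : Channel::Content,
                              std::string(output.substr(pos, end - pos))});
        }
        if (hit == std::string_view::npos) break;
        pos = hit + tag.size();  // the tag itself is never emitted
        thinking = !thinking;
    }
    return deltas;
}

// Field that would carry a delta in the OpenAI-chat format.
std::string_view openai_field(Channel c) {
    return c == Channel::Reasoning ? "reasoning_content" : "content";
}

// Block type that would carry a delta in the Anthropic-messages format.
std::string_view anthropic_block_type(Channel c) {
    return c == Channel::Reasoning ? "thinking" : "text";
}
```

With started_in_thinking set to true, the input plan</think>answer yields one Reasoning delta (plan) followed by one Content delta (answer). With the flag false and no tags in the output, everything is Content.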

Files

server/src/server/chat_template.{h,cpp} — PromptRenderResult + started_in_thinking provenance (hardcoded path + Jinja suffix sniff)
server/src/server/sse_emitter.{h,cpp} — reasoning_content channel routing for think-mode + Pattern-B call:<verb>{ streaming hooks
server/src/server/http_server.{h,cpp} — /props schema-4 payload, ParsedRequest.started_in_thinking routing into SseEmitter initial mode, finish_details for thinking
server/src/server/server_main.cpp — IMAGE_INFO / HOST_INFO file loaders and gguf_metadata_to_json() helper (no new CLI flags — soft-close / debug-thinking-logits flags live in feat(server): soft-close thinking termination (qwen35 + gemma4) #339)
server/test/test_server_unit.cpp — coverage spanning chat_template provenance, /props schema 4, and reasoning_content routing
docs/specs/openapi-props.yaml, docs/specs/props-endpoint.md, docs/specs/thinking-budget.md — spec updates
share/model_cards/_schema.json, share/model_cards/qwen3.6-27b.json, share/model_cards/laguna-xs.2.json — card schema + Qwen3.6 / Laguna think-mode opt-ins

Dependencies

This PR carries hard semantic dependencies on three sibling split PRs. The chain should land in order.

feat/server-gguf-inspect: server_main.cpp (~L956 / L975) calls read_gguf_metadata() and consumes the GgufMetadata struct introduced by that PR (target/draft sha256 + GGUF header fields surfaced under /props.model). The current server/src/common/gguf_inspect.h only exposes GgufModelInfo; the wider metadata reader is the gguf-inspect PR's surface. Must land first.
feat(server): plain-text call:<verb>{} tool parsing (Gemma4) #340 (server-call-verb-parser): sse_emitter.cpp adds Pattern-B call:<verb>{ streaming hooks (looks_like_plain_text_call(), the find_tool_start(…, bool & is_plain_text) signature, the tool_open_is_plain_text_ member, the responses_streamed_text snapshot, and the StreamMode::TOOL_BUFFER plain-text path). These hooks call into tool_parser changes introduced by #340, so #340 should land first.
refactor(server): shared layer-split backend + GGUF inspection + c2-gate plumbing #336 (server-layer-split): general server-build dependency (shared CMake refs in server/src/server/CMakeLists.txt and common/model_backend.h). /props schema 4 also reports layer-split / c2-gate state that originates in #336.

Also worth noting: #339 (server-soft-close) earlier merged the BudgetHook::close_token_ids / hard_limit_remaining fields this PR's http_server.cpp consumes at the close-token boundary. That is not a hard build dep here (the fields are already on origin/main), but the --think-soft-close-* and --debug-thinking-logits CLI flags live in #339, not in this PR. close_kind in this PR reports natural / hard only; the soft branch lands with #339's soft_forced_close field.
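
As a rough illustration of the natural / hard split, the following C++ sketch maps a single budget flag to a close_kind string. Only the budget_forced_close field name comes from this PR; GenerateResultSketch, the free function close_kind and the exact string values are assumptions made for the example.

```cpp
#include <string_view>

// Hypothetical stand-in for the result struct; only the
// budget_forced_close field is named in this PR.
struct GenerateResultSketch {
    bool budget_forced_close = false;
};

// Two-way split: "hard" when the thinking budget forced the close,
// "natural" when the model closed its reasoning on its own.
// There is deliberately no "soft" branch in this sketch.
std::string_view close_kind(const GenerateResultSketch& result) {
    return result.budget_forced_close ? "hard" : "natural";
}
```

A third value would only become reachable once a separate soft-close signal exists on the result, which is why the sketch keeps the mapping to a single boolean.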

Risk

Medium — touches request handling and SSE streaming on the hot path. Pattern-B streaming logic interleaves with the existing StreamMode::REASONING state machine and the tool-buffer accumulator; regressions here surface as either reasoning leaking into content or tool calls being dropped mid-stream.

Test plan

test_server_unit builds and passes on this branch against the gguf-inspect base
/props against a running server returns schema 4 with build, model.target, model.draft, host keys
Qwen3.6 / Laguna thinking output streams on the reasoning_content SSE channel, never on content
Anthropic-messages format places think-mode output in thinking blocks (not text)
Thinking-off (card-driven gate) skips the <think> preamble and emits user-visible content immediately
Pattern-B call:<verb>{ streaming detects the plain-text tool open and buffers correctly (requires #340 on this branch's base)

Note: server-split PRs can't be validated as a fully linked binary standalone because CMake cross-references shared headers (common/model_backend.h, etc.) that evolve in sibling PRs. The unit tests in this PR are self-contained; full integration requires the gguf-inspect PR + #336 + #340 to land for the matching runtime behavior.

🤖 Generated with Claude Code

## What

Backend mechanism for soft-close thinking termination via a logit-ratio peek with split probe/inject ids, a min_tokens floor, and max_tokens treated as a response-only budget while thinking is active.

- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under docs/experiments/.

server_main CLI flags for this mechanism ship in the thinking-control API PR (Luce-Org#341).

## Why

Lets a model finish its reasoning gracefully rather than being hard-cut when a thinking budget is hit, which preserves answer quality on long reasoning traces.

## Dependencies

- Luce-Org#336 (server-layer-split): #includes qwen35/c2_gate.h and uses BudgetHook fa_window_override + c2_spec_decode_permitted from the shared backend
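
The description above does not spell out how the logit-ratio peek is computed, so the following C++ sketch shows only one plausible reading of it. Everything here is an assumption: the SoftCloseConfig struct, the soft_close_decision function, the min_ratio threshold, and the choice to compare the probe token's probability against the most likely token's probability. It is not the qwen35 or gemma4 backend code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical configuration; field names are illustrative only.
struct SoftCloseConfig {
    int probe_id;     // token whose logit is peeked (for example a think-close id)
    int inject_id;    // token forced when closing; may differ from probe_id
    int min_tokens;   // floor: never close before this many thinking tokens
    float min_ratio;  // assumed threshold on p(probe) / p(top token)
};

// Returns inject_id when a soft close should be forced, otherwise -1.
int soft_close_decision(const std::vector<float>& logits,
                        int thinking_tokens,
                        const SoftCloseConfig& cfg) {
    if (thinking_tokens < cfg.min_tokens) return -1;
    if (cfg.probe_id < 0 || cfg.probe_id >= static_cast<int>(logits.size())) return -1;
    const float top = *std::max_element(logits.begin(), logits.end());
    // Under softmax, p(probe) / p(top) equals exp(logit_probe - logit_top).
    const float ratio = std::exp(logits[cfg.probe_id] - top);
    return ratio >= cfg.min_ratio ? cfg.inject_id : -1;
}
```

Keeping probe_id and inject_id separate lets the check watch one token while forcing a different one, which is how the sketch interprets "split probe/inject ids".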

## What

Extends server/src/server/tool_parser.{cpp,h} to parse Gemma's plain-text call:<verb>{} emissions (also accepts the `_call:` tokenizer-artifact prefix) and render them as Anthropic tool_use + tool_result blocks. Isolated to tool_parser; the streaming detection hook in sse_emitter ships with Luce-Org#341. Adds 364 lines of C++ unit coverage in test_server_unit.cpp plus the call-verb parser plan and Gemma4-26B parser-fix writeup.

## Why

Gemma4 emits tool calls as plain-text call:<verb>{...} rather than structured JSON, which breaks the existing Anthropic tool_use pipeline on agentic workloads. This parser closes that gap so Gemma4 can drive coding-agent loops end-to-end.

## Dependencies

None - this PR is independent.
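
For readers unfamiliar with the call:<verb>{ shape, here is a small C++ sketch of a prefix detector. It is a guess at the kind of check a function such as looks_like_plain_text_call() might perform, not the actual tool_parser code: starts_plain_text_call is a hypothetical name, and the allowed verb characters (letters, digits, underscore) are an assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical detector: true when text starts with call:<verb>{ ,
// optionally preceded by the "_" tokenizer artifact.
bool starts_plain_text_call(std::string_view text) {
    if (!text.empty() && text.front() == '_') text.remove_prefix(1);
    constexpr std::string_view kPrefix = "call:";
    if (text.substr(0, kPrefix.size()) != kPrefix) return false;
    text.remove_prefix(kPrefix.size());
    // Consume the verb; the accepted character set is an assumption.
    std::size_t n = 0;
    while (n < text.size() &&
           (std::isalnum(static_cast<unsigned char>(text[n])) || text[n] == '_')) {
        ++n;
    }
    // Require a non-empty verb immediately followed by the opening brace.
    return n > 0 && n < text.size() && text[n] == '{';
}
```

Note that an incomplete prefix such as call:read returns false here. A streaming caller would need to keep buffering until the brace arrives or the prefix is ruled out, which this sketch does not attempt.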

@dusterbloom

## What

Backend mechanism for soft-close thinking termination via a logit-ratio peek with split probe/inject ids, a min_tokens floor, and max_tokens treated as a response-only budget while thinking is active.

- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under docs/experiments/.

server_main CLI flags for this mechanism ship in the thinking-control API PR (Luce-Org#341).

## Why

Lets a model finish its reasoning gracefully rather than being hard-cut when a thinking budget is hit, which preserves answer quality on long reasoning traces.

## Attribution

server/src/qwen35/c2_gate.h is included here to keep this branch self-compiling. It was originally authored by @dusterbloom in PR Luce-Org#274 (commit 52fea3b) as a 31-line pure predicate extracted from qwen35_backend.cpp for testability. Co-author trailer below.

Co-Authored-By: dusterbloom <32869278+dusterbloom@users.noreply.github.com>

…dmission)

PR Luce-Org#341 decomposition: keep only Erik's OWN work, drop pieces traced to external/closed PRs in the provenance audit.

Excised (borrowed from dusterbloom):
- chat_template Jinja closed-think prefill block + arch_hint=ChatFormat::QWEN3 parameter (closed PR Luce-Org#293).
- 5 test_jinja_render_* tests covering the prefill behavior (test_jinja_render_qwen3_closes_think_when_thinking_off, test_jinja_render_does_not_close_think_when_thinking_on, test_jinja_render_does_not_close_think_for_non_qwen3_arch, test_jinja_render_does_not_double_append_close_think, test_chat_format_for_arch_qwen35moe_returns_qwen3).
- check_admission() pure helper + keep-ratio guard in http_server (PR Luce-Org#274).
- gen_req.fa_window_override assignment (post-pflash widen window, PR Luce-Org#274).
- 8 test_admission_* tests.

Fixes:
- http_server.cpp close_kind: drop the "soft" branch that referenced the non-existent result.soft_forced_close field (soft-close lives on a sibling PR). Collapse to the natural/hard split that the actual GenerateResult.budget_forced_close field supports.
- Restore the original single-line pre-handle context check (prompt_tokens + max_output > max_ctx) in place of the removed check_admission pre-compression gate.

Erik's OWN surfaces preserved:
- started_in_thinking plumbing (chat_template, http_server, sse_emitter)
- /props schema-4 (IMAGE_INFO / HOST_INFO loaders, build / host / model blocks)
- call:verb SSE hooks (looks_like_plain_text_call, Pattern B, tool_open plain-text), close_kind soft/hard/natural refactor (now natural/hard only here), Anthropic tool_use+tool_result normalization
- docs/specs (props-endpoint, thinking-budget)
- model_cards _schema.json, qwen3.6-27b, laguna-xs.2 think-mode opt-ins

cubic-dev-ai

6 issues found across 14 files

- chat_template.cpp: drop `enable_thinking &&` gate on the Jinja-path suffix-sniff for `started_in_thinking`. The rendered prompt's tail is the source of truth: if the template hardcodes `<think>` despite enable_thinking=false (custom template, model-card mismatch) we still need to route the first generated tokens to the reasoning channel; otherwise reasoning leaks into content. Only WARN on the mismatch case (sniff=true, enable_thinking=false) so the normal success path stops spamming stderr. Adds two regression tests covering the override and add_generation_prompt requirement.
- openapi-props.yaml: bump `info.version` and `x-props-schema` to 4 (matches C++ `kPropsSchema = 4` and props-endpoint.md). Add the schema-4 top-level `host` block to `PropsResponse` (required + property) and define the full `Host` component schema mirroring §4.17 of props-endpoint.md (os_pretty, kernel, wsl_version, docker_version, nvidia_driver, nvidia_ctk_version, cpu_model, nproc, ram_gb, gpus[], cuda_visible_devices, source, collector, collected_at). Update example payloads and props_schema examples in the Build / Server component schemas. Add a host block to the qwen36 example.
- server_main.cpp: factor a `json_str_or_none()` lambda that uses `json::find() + is_string()` instead of `json::value<std::string>()` for the image_info and host_info startup logging. `value<T>()` throws `type_error` when the key exists but isn't a string (e.g. `"kernel": null`), which crashes server startup. Both blocks come from the same operator-provided env path via entrypoint.sh, so the defensive lookup is shared.
- sse_emitter.cpp: in the TOOL_BUFFER success branch, also fold `parsed.cleaned_text` into `responses_streamed_text` for the Pattern A (XML envelope) case. The snapshot was taken before the parser ran, so any cleaned text emitted as a content delta after parsing was missing from the Responses-format .done/.completed payloads (client's accumulated buffer disagreed with the server's final text). Skip the append for Pattern B (plain-text `call:`) since its snapshot already contains the full raw buffer, a superset of cleaned_text, so appending would double-count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
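
The suffix-sniff rule in the first item can be sketched in a few lines of C++. This is an illustration under stated assumptions, not the chat_template.cpp code: SniffResult and sniff_prompt_tail are hypothetical names, and trimming trailing whitespace before comparing against the literal <think> tag is an assumption about how templates end the prompt.

```cpp
#include <string_view>

// Hypothetical result type for the sketch.
struct SniffResult {
    bool started_in_thinking;  // prompt tail leaves a <think> block open
    bool warn_mismatch;        // sniff=true while enable_thinking=false
};

// The rendered prompt's tail decides the initial mode; enable_thinking is
// consulted only to decide whether the mismatch is worth a warning.
SniffResult sniff_prompt_tail(std::string_view rendered_prompt, bool enable_thinking) {
    // Assumption: templates may emit whitespace after the opening tag.
    while (!rendered_prompt.empty()) {
        const char c = rendered_prompt.back();
        if (c != ' ' && c != '\n' && c != '\r' && c != '\t') break;
        rendered_prompt.remove_suffix(1);
    }
    constexpr std::string_view kOpen = "<think>";
    const bool open_tail =
        rendered_prompt.size() >= kOpen.size() &&
        rendered_prompt.substr(rendered_prompt.size() - kOpen.size()) == kOpen;
    return {open_tail, open_tail && !enable_thinking};
}
```

A prompt that ends in a closed block (…</think>) does not match, because the last seven characters are /think> rather than <think>, so prefilled closed-think prompts are not treated as starting in thinking mode.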

…easoning_content channel + /props schema-4

…ps schema-4

User-facing thinking-control API across the HTTP server surface:

- chat_template prefills a closed <think> block when thinking is off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn.
- http_server bumps /props schema 2 -> 4, adding build / model.target / model.draft / host blocks for client introspection.
- server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for card-driven boot.
- sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream (Pattern-B call-verb streaming hook).
- Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, updated thinking-budget spec, and the thinking-control protocol/mechanism experiments.
- test_server_unit gets matching coverage (~1100 lines) for prefill, /props schema-4, and reasoning_content routing.

Gives clients a single, card-driven API to control thinking budgets, soft-close behavior, and reasoning visibility - and an introspectable /props surface to discover what the server supports.

- Luce-Org#336 (server-layer-split): CMake/build references
- Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio + pflash_on contracts
- Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks rely on tool_parser changes from Luce-Org#340

easel force-pushed the feat/server-thinking-control branch 2 times, most recently from ef5e9b3 to 268301f Compare June 4, 2026 03:37

easel force-pushed the feat/server-thinking-control branch from 268301f to 3f4a7d2 Compare June 4, 2026 04:54

easel force-pushed the feat/server-thinking-control branch from 3f4a7d2 to 0bc9580 Compare June 4, 2026 05:03

easel marked this pull request as ready for review June 4, 2026 19:18

cubic-dev-ai Bot reviewed Jun 4, 2026

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026

Merge PR Luce-Org#341: feat(server): card-driven thinking control + r…

c973a9e

…easoning_content channel + /props schema-4

easel and others added 2 commits June 5, 2026 16:05

easel force-pushed the feat/server-thinking-control branch from 372d6f3 to b3990d0 Compare June 5, 2026 20:05

easel wants to merge 3 commits into Luce-Org:main from easel:feat/server-thinking-control
