feat(server): card-driven thinking control + reasoning_content channel + /props schema-4#341
Open
easel wants to merge 3 commits into
Conversation
ef5e9b3 to
268301f
Compare
This was referenced Jun 4, 2026
268301f to
3f4a7d2
Compare
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
## What
Backend mechanism for soft-close thinking termination via a logit-ratio
peek with split probe/inject ids, a min_tokens floor, and max_tokens
treated as a response-only budget while thinking is active.
- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the
probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under
docs/experiments/.
server_main CLI flags for this mechanism ship in the thinking-control
API PR (Luce-Org#341).
## Why
Lets a model finish its reasoning gracefully rather than being hard-cut
when a thinking budget is hit, which preserves answer quality on long
reasoning traces.
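
A minimal sketch of how a logit peek with a min_tokens floor and split probe/inject ids might decide when to soft-close. It is not taken from qwen35_backend.cpp: the names `SoftCloseConfig`, `soft_close_peek`, and `min_log_ratio` are hypothetical, and reading "logit ratio" as the close-token logit minus the best logit is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical config; field names are illustrative, not the PR's real flags.
struct SoftCloseConfig {
    int   probe_id;       // token id whose logit is inspected (assumed: the close tag)
    int   inject_id;      // token id emitted when the close is accepted
    int   min_tokens;     // never close before this many thinking tokens
    float min_log_ratio;  // accept when logit[probe] - max(logits) >= this value
};

// Returns the token id to inject, or -1 to let normal sampling continue.
inline int soft_close_peek(const std::vector<float>& logits,
                           int thinking_tokens,
                           const SoftCloseConfig& cfg) {
    if (thinking_tokens < cfg.min_tokens) return -1;  // floor not reached yet
    if (cfg.probe_id < 0 ||
        static_cast<std::size_t>(cfg.probe_id) >= logits.size()) return -1;
    const float best = *std::max_element(logits.begin(), logits.end());
    // A difference of logits equals the log of the probability ratio.
    const float log_ratio = logits[cfg.probe_id] - best;
    return log_ratio >= cfg.min_log_ratio ? cfg.inject_id : -1;
}
```

Keeping the probe and inject ids separate lets the peek watch one token while emitting another, which is how "split probe/inject ids" is read here.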
## Dependencies
- Luce-Org#336 (server-layer-split): #includes qwen35/c2_gate.h and uses BudgetHook
fa_window_override + c2_spec_decode_permitted from the shared backend
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
## What
Extends server/src/server/tool_parser.{cpp,h} to parse Gemma's
plain-text call:<verb>{} emissions (also accepts the ``_call:``
tokenizer-artifact prefix) and render them as Anthropic tool_use +
tool_result blocks. Isolated to tool_parser; the streaming detection
hook in sse_emitter ships with Luce-Org#341. Adds 364 lines of C++ unit
coverage in test_server_unit.cpp plus the call-verb parser plan and
Gemma4-26B parser-fix writeup.
## Why
Gemma4 emits tool calls as plain-text call:<verb>{...} rather than
structured JSON, which breaks the existing Anthropic tool_use pipeline
on agentic workloads. This parser closes that gap so Gemma4 can drive
coding-agent loops end-to-end.
## Dependencies
None - this PR is independent.
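
A rough sketch of what recognizing a plain-text call:<verb>{...} emission could look like, including the `_call:` artifact prefix. This is not the code in tool_parser.{cpp,h}: `PlainTextCall` and `parse_plain_text_call` are hypothetical names, and the brace matching is a simplification that ignores braces inside string literals.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type for illustration only.
struct PlainTextCall {
    std::string verb;  // text between "call:" and the opening brace
    std::string args;  // raw text between the outermost braces
};

inline std::optional<PlainTextCall> parse_plain_text_call(std::string_view text) {
    // Accept the "_call:" tokenizer-artifact prefix by dropping one leading '_'.
    if (!text.empty() && text.front() == '_') text.remove_prefix(1);
    constexpr std::string_view kPrefix = "call:";
    if (text.substr(0, kPrefix.size()) != kPrefix) return std::nullopt;
    text.remove_prefix(kPrefix.size());

    // The verb is assumed to be a run of [A-Za-z0-9_] followed directly by '{'.
    std::size_t i = 0;
    while (i < text.size() &&
           (std::isalnum(static_cast<unsigned char>(text[i])) || text[i] == '_')) {
        ++i;
    }
    if (i == 0 || i >= text.size() || text[i] != '{') return std::nullopt;

    // Find the matching close brace by depth counting.
    int depth = 0;
    for (std::size_t j = i; j < text.size(); ++j) {
        if (text[j] == '{') {
            ++depth;
        } else if (text[j] == '}' && --depth == 0) {
            return PlainTextCall{std::string(text.substr(0, i)),
                                 std::string(text.substr(i + 1, j - i - 1))};
        }
    }
    return std::nullopt;  // unbalanced: still streaming, or malformed
}
```

Returning `std::nullopt` on an unbalanced brace lets a streaming caller keep buffering until the call closes.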
3f4a7d2 to
0bc9580
Compare
This was referenced Jun 4, 2026
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
## What
Backend mechanism for soft-close thinking termination via a logit-ratio
peek with split probe/inject ids, a min_tokens floor, and max_tokens
treated as a response-only budget while thinking is active.
- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the
probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under
docs/experiments/.
server_main CLI flags for this mechanism ship in the thinking-control
API PR (Luce-Org#341).
## Why
Lets a model finish its reasoning gracefully rather than being hard-cut
when a thinking budget is hit, which preserves answer quality on long
reasoning traces.
## Attribution
server/src/qwen35/c2_gate.h is included here to keep this branch
self-compiling. It was originally authored by @dusterbloom in PR Luce-Org#274
(commit 52fea3b) — a 31-line pure predicate extracted from
qwen35_backend.cpp for testability. Co-author trailer below.
Co-Authored-By: dusterbloom <32869278+dusterbloom@users.noreply.github.com>
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 4, 2026
…dmission)

PR Luce-Org#341 decomposition: keep only Erik's OWN work, drop pieces traced to
external/closed PRs in the provenance audit.

Excised (borrowed from dusterbloom):
- chat_template Jinja closed-think prefill block + arch_hint=ChatFormat::QWEN3 parameter (closed PR Luce-Org#293).
- 5 test_jinja_render_* tests covering the prefill behavior (test_jinja_render_qwen3_closes_think_when_thinking_off, test_jinja_render_does_not_close_think_when_thinking_on, test_jinja_render_does_not_close_think_for_non_qwen3_arch, test_jinja_render_does_not_double_append_close_think, test_chat_format_for_arch_qwen35moe_returns_qwen3).
- check_admission() pure helper + keep-ratio guard in http_server (PR Luce-Org#274).
- gen_req.fa_window_override assignment (post-pflash widen window, PR Luce-Org#274).
- 8 test_admission_* tests.

Fixes:
- http_server.cpp close_kind: drop the "soft" branch that referenced the non-existent result.soft_forced_close field (soft-close lives on a sibling PR). Collapse to the natural/hard split that the actual GenerateResult.budget_forced_close field supports.
- Restore the original single-line pre-handle context check (prompt_tokens + max_output > max_ctx) in place of the removed check_admission pre-compression gate.

Erik's OWN surfaces preserved:
- started_in_thinking plumbing (chat_template, http_server, sse_emitter)
- /props schema-4 (IMAGE_INFO / HOST_INFO loaders, build / host / model blocks)
- call:verb SSE hooks (looks_like_plain_text_call, Pattern B, tool_open plain-text), close_kind soft/hard/natural refactor (now natural/hard only here), Anthropic tool_use+tool_result normalization
- docs/specs (props-endpoint, thinking-budget)
- model_cards _schema.json, qwen3.6-27b, laguna-xs.2 think-mode opt-ins
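
The two fixes in this commit message can be sketched as follows. This is an illustration under assumptions, not the http_server.cpp code: `GenerateResultSketch`, `close_kind_for`, and `exceeds_context` are hypothetical names, and only the `budget_forced_close` field and the `prompt_tokens + max_output > max_ctx` comparison come from the message itself.

```cpp
#include <string>

// Hypothetical stand-in for GenerateResult; only this field is named in the commit.
struct GenerateResultSketch {
    bool budget_forced_close = false;
};

// Natural/hard split: with no soft_forced_close field there is no "soft" branch.
inline std::string close_kind_for(const GenerateResultSketch& r) {
    return r.budget_forced_close ? "hard" : "natural";
}

// Single-line pre-handle context check: reject when the prompt plus the
// requested output cannot fit in the context window.
inline bool exceeds_context(int prompt_tokens, int max_output, int max_ctx) {
    // Widen before adding so large token counts cannot overflow int.
    return static_cast<long long>(prompt_tokens) + max_output > max_ctx;
}
```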
Contributor
6 issues found across 14 files
easel
added a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 5, 2026
- chat_template.cpp: drop `enable_thinking &&` gate on the Jinja-path suffix-sniff for `started_in_thinking`. The rendered prompt's tail is the source of truth: if the template hardcodes `<think>` despite enable_thinking=false (custom template, model-card mismatch) we still need to route the first generated tokens to the reasoning channel; otherwise reasoning leaks into content. Only WARN on the mismatch case (sniff=true, enable_thinking=false) so the normal success path stops spamming stderr. Adds two regression tests covering the override and add_generation_prompt requirement.
- openapi-props.yaml: bump `info.version` and `x-props-schema` to 4 (matches C++ `kPropsSchema = 4` and props-endpoint.md). Add the schema-4 top-level `host` block to `PropsResponse` (required + property) and define the full `Host` component schema mirroring §4.17 of props-endpoint.md (os_pretty, kernel, wsl_version, docker_version, nvidia_driver, nvidia_ctk_version, cpu_model, nproc, ram_gb, gpus[], cuda_visible_devices, source, collector, collected_at). Update example payloads and props_schema examples in the Build / Server component schemas. Add a host block to the qwen36 example.
- server_main.cpp: factor a `json_str_or_none()` lambda that uses `json::find() + is_string()` instead of `json::value<std::string>()` for the image_info and host_info startup logging. `value<T>()` throws `type_error` when the key exists but isn't a string (e.g. `"kernel": null`), which crashes server startup. Both blocks come from the same operator-provided env path via entrypoint.sh, so the defensive lookup is shared.
- sse_emitter.cpp: in the TOOL_BUFFER success branch, also fold `parsed.cleaned_text` into `responses_streamed_text` for the Pattern A (XML envelope) case. The snapshot was taken before the parser ran, so any cleaned text emitted as a content delta after parsing was missing from the Responses-format .done/.completed payloads (client's accumulated buffer disagreed with the server's final text). Skip the append for Pattern B (plain-text `call:`) since its snapshot already contains the full raw buffer, a superset of cleaned_text; appending would double-count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
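
The chat_template.cpp change above can be illustrated with a small sketch of a suffix sniff that no longer depends on `enable_thinking`. This is an assumption-laden illustration, not the real function: `SniffResult` and `sniff_started_in_thinking` are hypothetical names, and trimming trailing whitespace before the comparison is a guess about how the tail is matched.

```cpp
#include <string_view>

// Hypothetical result type for illustration only.
struct SniffResult {
    bool started_in_thinking;  // the rendered prompt ends with an open <think>
    bool warn_mismatch;        // sniff=true while enable_thinking=false
};

inline SniffResult sniff_started_in_thinking(std::string_view rendered_prompt,
                                             bool enable_thinking) {
    // Assumption: trailing whitespace after the tag does not change the outcome.
    while (!rendered_prompt.empty()) {
        const char c = rendered_prompt.back();
        if (c != ' ' && c != '\n' && c != '\r' && c != '\t') break;
        rendered_prompt.remove_suffix(1);
    }
    constexpr std::string_view kOpen = "<think>";
    const bool sniff =
        rendered_prompt.size() >= kOpen.size() &&
        rendered_prompt.substr(rendered_prompt.size() - kOpen.size()) == kOpen;
    // The prompt tail decides the routing; enable_thinking only controls the warning.
    return SniffResult{sniff, sniff && !enable_thinking};
}
```

A prompt that already contains a closed `<think>...</think>` pair ends with `</think>`, so it does not match and generation starts in the content channel.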
easel
pushed a commit
to easel/lucebox-hub
that referenced
this pull request
Jun 5, 2026
…easoning_content channel + /props schema-4
…ps schema-4

User-facing thinking-control API across the HTTP server surface:
- chat_template prefills a closed <think> block when thinking is off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn.
- http_server bumps /props schema 2 -> 4, adding build / model.target / model.draft / host blocks for client introspection.
- server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for card-driven boot.
- sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream (Pattern-B call-verb streaming hook).
- Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, updated thinking-budget spec, and the thinking-control protocol/mechanism experiments.
- test_server_unit gets matching coverage (~1100 lines) for prefill, /props schema-4, and reasoning_content routing.

Gives clients a single, card-driven API to control thinking budgets, soft-close behavior, and reasoning visibility - and an introspectable /props surface to discover what the server supports.

- Luce-Org#336 (server-layer-split): CMake/build references
- Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio + pflash_on contracts
- Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks rely on tool_parser changes from Luce-Org#340
372d6f3 to
b3990d0
Compare
## Summary
Card-driven thinking control for the server, split across three load-bearing surfaces:
- `chat_template` extends the hardcoded renderer to return a `PromptRenderResult` with a `started_in_thinking` provenance flag (Qwen3.6 / Laguna enable_thinking paths pre-open `<think>` from the prompt suffix), and adds the same provenance sniff to the Jinja path.
- `sse_emitter` routes Qwen3.6 / Laguna think-mode output to the `reasoning_content` SSE channel for both OpenAI-chat and Anthropic-messages formats, so reasoning never leaks into `content`. Includes Pattern-B `call:<verb>{` streaming hooks that ride on top of the plain-text tool parser from #340 (feat(server): plain-text call:<verb>{} tool parsing (Gemma4)).
- `/props` is bumped to schema 4 with new top-level `build` (git_sha / image_tag / build_time), `model.target` / `model.draft` (GGUF identity: size_bytes + sha256 + header fields), and `host` (verbatim pass-through of `/opt/lucebox-hub/HOST_INFO`) blocks. `http_server.cpp` is the source of truth for the new payload; `server_main.cpp` only adds the `IMAGE_INFO` / `HOST_INFO` file loaders and a `GgufMetadata → JSON` helper.

Matching model_card and spec updates land alongside.
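
The reasoning_content routing described in this summary can be sketched as a small stream splitter. This is a simplified illustration, not the SseEmitter state machine: `ThinkRouter` and its members are hypothetical, it handles only a single `</think>` boundary, and it ignores tool buffering entirely.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical router: text before "</think>" is reasoning, text after is content.
class ThinkRouter {
public:
    explicit ThinkRouter(bool started_in_thinking)
        : in_reasoning_(started_in_thinking) {}

    void feed(std::string_view chunk) {
        if (!in_reasoning_) { content.append(chunk); return; }
        pending_.append(chunk);
        const std::size_t pos = pending_.find(kClose);
        if (pos != std::string::npos) {
            reasoning.append(pending_, 0, pos);
            content.append(pending_, pos + kClose.size(), std::string::npos);
            pending_.clear();
            in_reasoning_ = false;
            return;
        }
        // Hold back the longest tail that could still grow into "</think>",
        // so a tag split across chunks is never emitted as reasoning text.
        std::size_t keep = std::min(pending_.size(), kClose.size() - 1);
        while (keep > 0 &&
               pending_.compare(pending_.size() - keep, keep, kClose, 0, keep) != 0) {
            --keep;
        }
        reasoning.append(pending_, 0, pending_.size() - keep);
        pending_.erase(0, pending_.size() - keep);
    }

    std::string reasoning;  // would be emitted as reasoning_content deltas
    std::string content;    // would be emitted as content deltas

private:
    static constexpr std::string_view kClose = "</think>";
    bool in_reasoning_;
    std::string pending_;
};
```

Seeding the router with `started_in_thinking` is what keeps the first generated tokens out of `content` when the prompt already opened `<think>`. A real emitter would also need to flush any held-back tail at end of stream, which this sketch omits.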
## Files
- `server/src/server/chat_template.{h,cpp}`: `PromptRenderResult` + `started_in_thinking` provenance (hardcoded path + Jinja suffix sniff)
- `server/src/server/sse_emitter.{h,cpp}`: reasoning_content channel routing for think-mode + Pattern-B `call:<verb>{` streaming hooks
- `server/src/server/http_server.{h,cpp}`: `/props` schema-4 payload, `ParsedRequest.started_in_thinking` routing into SseEmitter initial mode, finish_details for thinking
- `server/src/server/server_main.cpp`: `IMAGE_INFO` / `HOST_INFO` file loaders and `gguf_metadata_to_json()` helper (no new CLI flags; the soft-close / debug-thinking-logits flags live in #339, feat(server): soft-close thinking termination (qwen35 + gemma4))
- `server/test/test_server_unit.cpp`: coverage spanning chat_template provenance, `/props` schema 4, and reasoning_content routing
- `docs/specs/openapi-props.yaml`, `docs/specs/props-endpoint.md`, `docs/specs/thinking-budget.md`: spec updates
- `share/model_cards/_schema.json`, `share/model_cards/qwen3.6-27b.json`, `share/model_cards/laguna-xs.2.json`: card schema + Qwen3.6 / Laguna think-mode opt-ins

## Dependencies
This PR carries hard semantic dependencies on three sibling split PRs. The chain should land in order.
- … `server_main.cpp` (~L956 / L975) calls `read_gguf_metadata()` and consumes the `GgufMetadata` struct introduced by that PR (target/draft sha256 + GGUF header fields surfaced under `/props.model`). The current `server/src/common/gguf_inspect.h` only exposes `GgufModelInfo`; the wider metadata reader is the gguf-inspect PR's surface. Must land first.
- `sse_emitter.cpp` adds Pattern-B `call:<verb>{` streaming hooks (`looks_like_plain_text_call()`, the `find_tool_start(…, bool & is_plain_text)` signature, the `tool_open_is_plain_text_` member, the `responses_streamed_text` snapshot, and the `StreamMode::TOOL_BUFFER` plain-text path). These hooks call into `tool_parser` changes introduced by #340 (feat(server): plain-text call:<verb>{} tool parsing (Gemma4)). Recommend landing #340 first.
- … `server/src/server/CMakeLists.txt` and `common/model_backend.h`). `/props` schema 4 also reports layer-split / c2-gate state that originates in #336 (refactor(server): shared layer-split backend + GGUF inspection + c2-gate plumbing).

Also worth noting: #339 (server-soft-close) earlier merged the `BudgetHook::close_token_ids` / `hard_limit_remaining` fields this PR's `http_server.cpp` consumes at the close-token boundary. That is not a hard build dep here (the fields are already on `origin/main`), but the `--think-soft-close-*` and `--debug-thinking-logits` CLI flags live in #339, not in this PR. `close_kind` in this PR reports `natural` / `hard` only; the `soft` branch lands with #339's `soft_forced_close` field.

## Risk
Medium: touches request handling and SSE streaming on the hot path. Pattern-B streaming logic interleaves with the existing `StreamMode::REASONING` state machine and the tool-buffer accumulator; regressions here surface as either reasoning leaking into `content` or tool calls being dropped mid-stream.

## Test plan
- `test_server_unit` builds and passes on this branch against the gguf-inspect base
- `/props` against a running server returns schema 4 with `build`, `model.target`, `model.draft`, `host` keys
- … `reasoning_content` SSE channel, never on `content`
- … `thinking` blocks (not `text`)
- … `<think>` preamble and emits user-visible content immediately
- `call:<verb>{` streaming detects the plain-text tool open and buffers correctly (requires #340 on this branch's base)

Note: server-split PRs can't be validated as a fully linked binary standalone because CMake cross-references shared headers (`common/model_backend.h`, etc.) that evolve in sibling PRs. The unit tests in this PR are self-contained; full integration requires the gguf-inspect PR + #336 + #340 to land for the matching runtime behavior.

🤖 Generated with Claude Code