Skip to content

server : hint preserve_thinking when supported by chat template#25079

Draft
ggerganov wants to merge 3 commits into
masterfrom
gg/preserve-thinking-hint
Draft

server : hint preserve_thinking when supported by chat template#25079
ggerganov wants to merge 3 commits into
masterfrom
gg/preserve-thinking-hint

Conversation

@ggerganov

@ggerganov ggerganov commented Jun 27, 2026

Copy link
Copy Markdown
Member

Overview

ref #24093 (comment)

Print a hint to enable preserve_thinking kwarg when the chat template supports it.

# llama serve -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M
0.00.571.359 I cmn  common_param: common_params_print_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.573.173 I srv    load_model: loading model 'unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M'
0.08.345.590 I srv    load_model: initializing, n_slots = 4, n_ctx_slot = 131072, kv_unified = 'true'
0.08.378.594 W srv          init: chat template supports 'preserve_thinking' - consider using --chat-template-kwargs "{\"preserve_thinking\": true}" (ref: https://docs.z.ai/guides/capabilities/thinking-mode#preserved-thinking)
0.08.378.603 I srv  llama_server: model loaded
0.08.378.607 I srv  llama_server: listening on http://0.0.0.0:8013

Requirements

@ggerganov ggerganov force-pushed the gg/preserve-thinking-hint branch from aec9522 to eae7149 Compare June 27, 2026 14:41
@ngxson

ngxson commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator

IMO it's could better be one of the jinja caps (to make it more generic), although I'm not sure how other templates support this function (i.e. do they use the same preserve_thinking, or another mechanism?), cc @pwilkin @aldehir if you have any insights on this

@aldehir

aldehir commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

A cap seems more applicable for consistency but I think either way works for something this simple.

As far as other templates go, I believe only Qwen 3.6 supports this for now. There's no competing variables to enable this.

@ngxson

ngxson commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator

@aldehir I'm thinking about a more broader case where we can somehow make this compatible with other templates. It seems like most templates only preserve reasoning_content for last assistant message, not the whole history. We may need a hack to make it work (I'm investigating that)

I think such feature would still be quite useful. From time to time I've seen issues asking for such feature

Update: seems like GLM-4.7 has clear_thinking that is the opposite of preserve_thinking

@aldehir

aldehir commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

I see. We can generalize this feature with capabilities, i.e. storing the field name and normalizing the value (clear_thinking requires false while preserve_thinking needs true).

That said, not sure how easy it would be to generalize implementation for templates that don't natively support this. I also don't know the impact since they are likely not trained to retain thinking.

@ngxson

ngxson commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

I scanned through templates inside models/templates and realize that most models check for last_user_index or last_user_idx to remove the reasoning content, so I'm quite confident that the solution could be quite simple: always force last_user_index=0 and ignore value from set statement. (So, we need to add a notion of "read-only" key)

That being said, it's still technically a hack, but just not a (too) messy one. Will try to push a PoC tomorrow to see how it goes.

Detect if the chat template supports the 'preserve_thinking' kwarg
(by checking for its presence in the template source) and print a hint
suggesting users enable it via --chat-template-kwargs.

This is particularly useful for models like Qwen3.6 where preserve_thinking
is recommended but many users are unaware of the option.

ref: https://docs.z.ai/guides/capabilities/thinking-mode#preserved-thinking

Assisted-by: pi:llama.cpp/Qwen3.6-27B
Print a hint to enable preserve_thinking kwarg when the template supports it.

ref: https://docs.z.ai/guides/capabilities/thinking-mode#preserved-thinking
Assisted-by: pi:llama.cpp/Qwen3.6-27B
Print a hint to enable preserve_thinking kwarg when the template supports it.

ref: https://docs.z.ai/guides/capabilities/thinking-mode#preserved-thinking
Assisted-by: pi:llama.cpp/Qwen3.6-27B
@ngxson

ngxson commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

Hmm ok so hacking it is more complicated than I though, so I ended up abandon it. My idea was to record any if_statement and try to flip them to see which one control the reasoning output, then force them to true later.

In anyway, I added #25105 that simply translate a generic --reasoning-preserve flag into model-specific flag, I've found 3 of them:

  • preserve_thinking
  • clear_thinking (GLM-4.7)
  • truncate_history_thinking (NVIDIA-Nemotron-3-Nano-30B-A3B-BF16)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants