Commit 53386d2

unamedkr and claude committed
fix(server): revert Qwen3.5 think block injection (3/7 → 5/7)
The official enable_thinking=False method (injecting <think></think> in ChatML) made RLV Acme results WORSE:

  With injection: 3/7
  Without (logit suppression only): 5/7

The <think></think> block in the prompt confused the model's response pattern, causing it to output document sections instead of extracted answers.

quant.h's logit suppression (ba8a615) is the correct approach: it prevents thinking mode without altering the prompt structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
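To make the logit-suppression idea concrete, here is a minimal sketch of how a single token can be ruled out before sampling. It is not the code from quant.h or ba8a615: the helper names `suppress_token` and `argmax_logit`, the flat `float` logit array, and the greedy pick are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not from quant.h): force one token's logit to
 * -infinity so that neither greedy nor softmax sampling can pick it.
 * Out-of-range ids are ignored. */
static void suppress_token(float *logits, size_t vocab_size, int token_id)
{
    if (token_id >= 0 && (size_t)token_id < vocab_size)
        logits[token_id] = -INFINITY;
}

/* Hypothetical greedy pick: index of the largest logit. */
static size_t argmax_logit(const float *logits, size_t vocab_size)
{
    assert(vocab_size > 0);
    size_t best = 0;
    for (size_t i = 1; i < vocab_size; i++)
        if (logits[i] > logits[best])
            best = i;
    return best;
}
```

In a generation loop, a call such as `suppress_token(logits, vocab_size, think_id)` would run on each step's logits before the sampler, where `think_id` stands for whatever id the tokenizer assigns to `<think>`. The prompt text itself is left untouched, which is the property the commit message relies on.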
1 parent ba8a615 commit 53386d2

1 file changed: +6 -1 lines changed

tools/quant_server_unified.c

Lines changed: 6 additions & 1 deletion
@@ -105,8 +105,13 @@ static char* build_prompt(const char** roles, const char** contents,
         snprintf(w, rem, "<|assistant|>\n");
     else if (template_type == TMPL_GEMMA)
         snprintf(w, rem, "<|turn>model\n");
-    else
+    else {
+        /* ChatML assistant prompt. Qwen3.5 thinking mode is handled by
+         * suppressing the <think> token logit in tq_generate (quant.h).
+         * The official enable_thinking=False method (injecting <think></think>)
+         * was tested and made results WORSE (3/7 vs 5/7 on Acme). */
         snprintf(w, rem, "<|im_start|>assistant\n");
+    }
 
     return p;
 }
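
For comparison, the two prompt shapes discussed in this commit can be sketched side by side. This is an illustration only, not code from tools/quant_server_unified.c: the function `write_assistant_prefix`, the enum names, and the `inject_empty_think` flag are hypothetical, and the exact whitespace of the injected block is an assumption modeled on Qwen's published chat template.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical template ids for this sketch only. */
enum sketch_tmpl { SKETCH_TMPL_CHATML, SKETCH_TMPL_GEMMA };

/* Write the assistant-turn prefix into w (capacity rem).
 * inject_empty_think = 1 reproduces the reverted behavior: an empty
 * think block appended to the ChatML prefix.
 * inject_empty_think = 0 is the behavior after this commit: the prefix
 * alone, with thinking handled elsewhere by logit suppression.
 * Returns the snprintf result. */
static int write_assistant_prefix(char *w, size_t rem,
                                  enum sketch_tmpl t, int inject_empty_think)
{
    assert(w != NULL && rem > 0);
    if (t == SKETCH_TMPL_GEMMA)
        return snprintf(w, rem, "<|turn>model\n");
    if (inject_empty_think)
        return snprintf(w, rem,
                        "<|im_start|>assistant\n<think>\n\n</think>\n\n");
    return snprintf(w, rem, "<|im_start|>assistant\n");
}
```

With the flag off, the model sees the same assistant prefix it would see for any ChatML model, so the prompt structure does not change between thinking and non-thinking configurations.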
