fix(server): revert Qwen3.5 think block injection (3/7 → 5/7)

unamedkr · claude · unamedkr · commit 53386d2a16d6 · 2026-04-14T17:01:52.000+09:00
The official enable_thinking=False method (injecting <think></think> in
ChatML) made RLV Acme results WORSE:

  With injection:                    3/7
  Without (logit suppression only):  5/7

The <think></think> block in the prompt confused the model's response
pattern, causing it to output document sections instead of extracted
answers.

quant.h's logit suppression (ba8a615) is the correct approach: it
prevents thinking mode without altering the prompt structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/tools/quant_server_unified.c b/tools/quant_server_unified.c
@@ -105,8 +105,13 @@ static char* build_prompt(const char** roles, const char** contents,
         snprintf(w, rem, "<|assistant|>\n");
     else if (template_type == TMPL_GEMMA)
         snprintf(w, rem, "<|turn>model\n");
-    else
+    else {
+        /* ChatML assistant prompt. Qwen3.5 thinking mode is handled by
+         * suppressing the <think> token logit in tq_generate (quant.h).
+         * The official enable_thinking=False method (injecting <think></think>)
+         * was tested and made results WORSE (3/7 vs 5/7 on Acme). */
         snprintf(w, rem, "<|im_start|>assistant\n");
+    }
 
     return p;
 }
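
For readers unfamiliar with the logit suppression the commit message refers to, the idea can be sketched as below. This is a minimal illustration, not the code from quant.h or ba8a615: `suppress_token`, `argmax_logit`, and the token indices are hypothetical names and values chosen for the example. The sketch assumes the sampler sees a plain `float` array of logits, and that forcing one entry to negative infinity before sampling is enough to keep that token from ever being chosen.

```c
#include <math.h>    /* INFINITY */
#include <stddef.h>  /* size_t, NULL */

/* Hypothetical helper (not the quant.h API): mask a single token so that
 * no sampler can select it. Out-of-range ids are ignored. */
static void suppress_token(float *logits, size_t vocab_size, size_t token_id)
{
    if (logits != NULL && token_id < vocab_size)
        logits[token_id] = -INFINITY;
}

/* Greedy selection: return the index of the largest logit.
 * Assumes vocab_size >= 1. */
static size_t argmax_logit(const float *logits, size_t vocab_size)
{
    size_t best = 0;
    for (size_t i = 1; i < vocab_size; i++) {
        if (logits[i] > logits[best])
            best = i;
    }
    return best;
}
```

In a generation loop, a call such as `suppress_token(logits, vocab_size, think_id)` would run on each step before sampling, where `think_id` stands in for whatever id the tokenizer assigns to `<think>`. The prompt itself stays unchanged, which is the property the commit message credits for the better Acme score.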