Commit f140654

unamedkr authored and claude committed
feat(cli): --chat flag now correctly routes to per-family chat templates
The previous --chat fell through to the Llama 3 template for all non-Gemma models, causing Phi-3.5 and Qwen family models to produce garbage output. New detection order, by model config + filename:

1. Gemma 4   → <|turn>user\n...<turn|>\n<|turn>model\n (skip <|think|> — no logit suppression in CLI)
2. Gemma 2/3 → <start_of_turn>user\n...<end_of_turn>
3. Phi-3/4   → <|user|>...<|end|>\n<|assistant|>\n
4. Llama 3.x → <|start_header_id|>user<|end_header_id|>\n\n...<|eot_id|>
5. Default   → ChatML (Qwen/Qwen2/Qwen3/Qwen3.5)

Verified with --chat -p "What is 2+2?":

- Phi-3.5 Q8_0: "The answer to...4. The sum of two and two equals four..."
- Llama 3.1 8B: "The answer to 2 + 2 is: 4"
- Llama 3.2 3B: "4"
- Qwen2.5-0.5B: coherent English (0.5B model limit)
- Gemma 4 E2B: partial (thinking-mode interaction)
- Qwen3.5-4B: DeltaNet short-prompt issue persists (known)

All 35 unit tests + 7 regression tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1e8698b commit f140654

1 file changed

Lines changed: 32 additions & 8 deletions

tools/quant.c

@@ -1244,26 +1244,50 @@ int main(int argc, char** argv) {
         return 0;
     }
 
-    /* Auto-wrap prompt with chat template when --chat is used */
+    /* Auto-wrap prompt with chat template when --chat is used.
+     * Template detection order:
+     *   1. Gemma 4   → <|turn>...<turn|> + thinking mode
+     *   2. Gemma 2/3 → <start_of_turn>...<end_of_turn>
+     *   3. Phi-3/Phi-4 (by filename) → <|user|>...<|end|>
+     *   4. Llama 3.x (by filename)  → <|start_header_id|>...<|eot_id|>
+     *   5. Default → ChatML <|im_start|>...<|im_end|> (Qwen/Qwen3/Qwen3.5) */
     char chat_prompt[8192];
     if (chat_mode) {
         tq_model_config_t* mc = &model->config;
+        const char* mp = model_path ? model_path : "";
+        /* Basename for filename detection */
+        const char* bn = strrchr(mp, '/');
+        bn = bn ? bn + 1 : mp;
+
+        int is_phi = (strstr(bn, "phi-3") || strstr(bn, "phi3") ||
+                      strstr(bn, "Phi-3") || strstr(bn, "Phi3") ||
+                      strstr(bn, "phi-4") || strstr(bn, "phi4") ||
+                      strstr(bn, "Phi-4") || strstr(bn, "Phi4"));
+        int is_llama3 = (strstr(bn, "Llama-3") || strstr(bn, "llama-3") ||
+                         strstr(bn, "Llama3") || strstr(bn, "llama3") ||
+                         strstr(bn, "Meta-Llama-3"));
+
         if (mc->model_type == 1 && mc->is_gemma4) {
-            /* Gemma 4: uses <|turn> tokens + thinking mode.
-             * Reference: llama.cpp apply-template output for gemma4. */
+            /* Skip <|think|> in CLI — the server suppresses it via logit mask,
+             * but the CLI has no such suppression. Without it, the CLI uses
+             * plain Gemma 4 format without thinking mode. */
             snprintf(chat_prompt, sizeof(chat_prompt),
-                     "<|turn>system\n<|think|><turn|>\n<|turn>user\n%s<turn|>\n<|turn>model\n", prompt);
+                     "<|turn>user\n%s<turn|>\n<|turn>model\n", prompt);
         } else if (mc->model_type == 1) {
-            /* Gemma 2/3: <start_of_turn>user\n...\n<end_of_turn>\n<start_of_turn>model\n */
             snprintf(chat_prompt, sizeof(chat_prompt),
                      "<start_of_turn>user\n%s<end_of_turn>\n<start_of_turn>model\n", prompt);
-        } else if (strstr(prompt, "<|start_header_id|>") == NULL) {
-            /* Llama 3 / generic: wrap if not already wrapped */
+        } else if (is_phi) {
+            /* Phi-3/4: <|user|>...<|end|>\n<|assistant|>\n */
+            snprintf(chat_prompt, sizeof(chat_prompt),
+                     "<|user|>\n%s<|end|>\n<|assistant|>\n", prompt);
+        } else if (is_llama3) {
             snprintf(chat_prompt, sizeof(chat_prompt),
                      "<|start_header_id|>user<|end_header_id|>\n\n%s<|eot_id|>"
                      "<|start_header_id|>assistant<|end_header_id|>\n\n", prompt);
         } else {
-            snprintf(chat_prompt, sizeof(chat_prompt), "%s", prompt);
+            /* Default ChatML (Qwen/Qwen3/Qwen3.5) */
+            snprintf(chat_prompt, sizeof(chat_prompt),
+                     "<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n", prompt);
         }
         prompt = chat_prompt;
     }
