0.00.087.730 W Setting 'enable_thinking' via --chat-template-kwargs is deprecated. Use --reasoning on / --reasoning off instead.
0.00.087.736 W DEPRECATED: argument '-ngl' specified multiple times, use comma-separated values instead (only last value will be used)
0.00.087.908 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.087.909 I device_info:
0.00.088.033 I - Vulkan0 : AMD Ryzen 9 9950X 16-Core Processor (RADV RAPHAEL_MENDOCINO) (32986 MiB, 32925 MiB free)
0.00.088.131 I - Vulkan1 : AMD Radeon RX 7900 XTX (RADV NAVI31) (24560 MiB, 24533 MiB free)
0.00.088.227 I - Vulkan2 : AMD Radeon RX 7900 XTX (RADV NAVI31) (24560 MiB, 24533 MiB free)
0.00.088.329 I - Vulkan3 : AMD Radeon RX 7900 XTX (RADV NAVI31) (24560 MiB, 24533 MiB free)
0.00.088.332 I - CPU : AMD Ryzen 9 9950X 16-Core Processor (61876 MiB, 61876 MiB free)
0.00.088.344 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.088.679 I srv init: running without SSL
0.00.088.708 I srv init: using 31 threads for HTTP server
0.00.089.095 I srv start: binding port with default address family
0.00.090.194 I srv llama_server: loading model
0.00.090.446 I srv load_model: loading model '/home/user/models/coder/qwen35-122b.lmstudio.q8.gguf'
0.00.090.453 I common_init_result: fitting params to device memory ...
0.00.090.453 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.06.365.624 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance
2.37.264.962 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
2.37.507.465 I srv load_model: initializing slots, n_slots = 1
2.39.584.927 W srv load_model: speculative decoding will use checkpoints
2.39.584.936 W common_speculative_init: no implementations specified for speculative decoding
2.39.584.938 I slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
2.39.584.982 I srv load_model: prompt cache is enabled, size limit: no limit
2.39.584.983 I srv load_model: use `--cache-ram 0` to disable the prompt cache
2.39.584.983 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
2.39.584.983 I srv load_model: context checkpoints enabled, max = 32, min spacing = 8192
2.39.587.620 I srv init: idle slots will be saved to prompt cache upon starting a new task
2.39.611.924 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
2.39.620.410 I srv init: init: chat template, thinking = 1
2.39.620.820 I srv llama_server: model loaded
2.39.620.957 I srv llama_server: server is listening on http://0.0.0.0:5814
2.39.620.960 I srv update_slots: all slots are idle
3.00.776.689 I srv operator(): Chat format: peg-native
3.00.778.048 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
3.00.778.050 I srv get_availabl: updating prompt cache
3.00.778.241 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
3.00.778.387 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 0.000 MiB, 131072 tokens, 131072 est)
3.00.778.388 I srv get_availabl: prompt cache update took 0.34 ms
3.00.778.682 I reasoning-budget: activated, budget=10240 tokens
3.00.778.683 I reasoning-budget: deactivated (natural end)
3.00.778.699 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
4.31.399.737 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 2048, progress = 0.24, t = 90.62 s / 22.60 tokens per second
4.58.000.702 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 4096, progress = 0.48, t = 117.22 s / 34.94 tokens per second
5.22.538.873 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 6144, progress = 0.72, t = 141.76 s / 43.34 tokens per second
5.43.849.687 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 8045, progress = 0.94, t = 163.07 s / 49.33 tokens per second
5.43.921.465 I slot create_check: id 0 | task 0 | created context checkpoint 1 of 32 (pos_min = 8044, pos_max = 8044, n_tokens = 8045, size = 149.063 MiB)
5.49.584.166 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 8557, progress = 1.00, t = 168.81 s / 50.69 tokens per second
5.57.215.243 I slot print_timing: id 0 | task 0 | n_decoded = 100, tg = 13.44 t/s, tg_3s = 13.44 t/s
6.00.262.132 I slot print_timing: id 0 | task 0 | n_decoded = 142, tg = 13.54 t/s, tg_3s = 13.78 t/s
6.03.330.054 I slot print_timing: id 0 | task 0 | n_decoded = 184, tg = 13.58 t/s, tg_3s = 13.69 t/s
6.06.359.489 I slot print_timing: id 0 | task 0 | n_decoded = 225, tg = 13.57 t/s, tg_3s = 13.53 t/s
6.09.428.887 I slot print_timing: id 0 | task 0 | n_decoded = 267, tg = 13.59 t/s, tg_3s = 13.68 t/s
6.12.439.647 I slot print_timing: id 0 | task 0 | n_decoded = 308, tg = 13.59 t/s, tg_3s = 13.62 t/s
6.15.491.996 I slot print_timing: id 0 | task 0 | n_decoded = 350, tg = 13.61 t/s, tg_3s = 13.76 t/s
6.18.520.959 I slot print_timing: id 0 | task 0 | n_decoded = 392, tg = 13.64 t/s, tg_3s = 13.87 t/s
6.21.561.016 I slot print_timing: id 0 | task 0 | n_decoded = 434, tg = 13.65 t/s, tg_3s = 13.82 t/s
6.24.595.708 I slot print_timing: id 0 | task 0 | n_decoded = 476, tg = 13.67 t/s, tg_3s = 13.84 t/s
6.27.650.859 I slot print_timing: id 0 | task 0 | n_decoded = 518, tg = 13.68 t/s, tg_3s = 13.75 t/s
6.30.671.137 I slot print_timing: id 0 | task 0 | n_decoded = 560, tg = 13.69 t/s, tg_3s = 13.91 t/s
6.33.741.494 I slot print_timing: id 0 | task 0 | n_decoded = 602, tg = 13.69 t/s, tg_3s = 13.68 t/s
6.36.785.341 I slot print_timing: id 0 | task 0 | n_decoded = 644, tg = 13.70 t/s, tg_3s = 13.80 t/s
6.39.816.274 I slot print_timing: id 0 | task 0 | n_decoded = 686, tg = 13.71 t/s, tg_3s = 13.86 t/s
6.42.822.004 I slot print_timing: id 0 | task 0 | n_decoded = 728, tg = 13.72 t/s, tg_3s = 13.99 t/s
6.45.824.822 I slot print_timing: id 0 | task 0 | n_decoded = 770, tg = 13.74 t/s, tg_3s = 13.97 t/s
6.48.859.971 I slot print_timing: id 0 | task 0 | n_decoded = 812, tg = 13.74 t/s, tg_3s = 13.84 t/s
6.51.933.178 I slot print_timing: id 0 | task 0 | n_decoded = 854, tg = 13.74 t/s, tg_3s = 13.67 t/s
6.54.999.574 I slot print_timing: id 0 | task 0 | n_decoded = 896, tg = 13.74 t/s, tg_3s = 13.70 t/s
6.58.018.419 I slot print_timing: id 0 | task 0 | n_decoded = 938, tg = 13.75 t/s, tg_3s = 13.91 t/s
7.01.056.528 I slot print_timing: id 0 | task 0 | n_decoded = 980, tg = 13.75 t/s, tg_3s = 13.82 t/s
7.04.098.654 I slot print_timing: id 0 | task 0 | n_decoded = 1022, tg = 13.75 t/s, tg_3s = 13.81 t/s
7.07.102.953 I slot print_timing: id 0 | task 0 | n_decoded = 1063, tg = 13.75 t/s, tg_3s = 13.65 t/s
7.10.176.210 I slot print_timing: id 0 | task 0 | n_decoded = 1105, tg = 13.74 t/s, tg_3s = 13.67 t/s
7.13.243.442 I slot print_timing: id 0 | task 0 | n_decoded = 1147, tg = 13.74 t/s, tg_3s = 13.69 t/s
7.16.252.459 I slot print_timing: id 0 | task 0 | n_decoded = 1188, tg = 13.74 t/s, tg_3s = 13.63 t/s
7.19.271.067 I slot print_timing: id 0 | task 0 | n_decoded = 1229, tg = 13.73 t/s, tg_3s = 13.58 t/s
7.22.303.765 I slot print_timing: id 0 | task 0 | n_decoded = 1270, tg = 13.73 t/s, tg_3s = 13.52 t/s
7.25.349.031 I slot print_timing: id 0 | task 0 | n_decoded = 1312, tg = 13.73 t/s, tg_3s = 13.79 t/s
7.28.393.669 I slot print_timing: id 0 | task 0 | n_decoded = 1354, tg = 13.73 t/s, tg_3s = 13.81 t/s
7.31.415.895 I slot print_timing: id 0 | task 0 | n_decoded = 1396, tg = 13.73 t/s, tg_3s = 13.88 t/s
7.34.475.046 I slot print_timing: id 0 | task 0 | n_decoded = 1438, tg = 13.73 t/s, tg_3s = 13.73 t/s
7.37.502.884 I slot print_timing: id 0 | task 0 | n_decoded = 1479, tg = 13.73 t/s, tg_3s = 13.54 t/s
7.40.557.754 I slot print_timing: id 0 | task 0 | n_decoded = 1521, tg = 13.73 t/s, tg_3s = 13.75 t/s
7.43.581.644 I slot print_timing: id 0 | task 0 | n_decoded = 1563, tg = 13.73 t/s, tg_3s = 13.89 t/s
7.46.606.253 I slot print_timing: id 0 | task 0 | n_decoded = 1605, tg = 13.74 t/s, tg_3s = 13.89 t/s
7.49.663.819 I slot print_timing: id 0 | task 0 | n_decoded = 1647, tg = 13.74 t/s, tg_3s = 13.74 t/s
7.52.705.871 I slot print_timing: id 0 | task 0 | n_decoded = 1689, tg = 13.74 t/s, tg_3s = 13.81 t/s
7.55.711.757 I slot print_timing: id 0 | task 0 | n_decoded = 1731, tg = 13.75 t/s, tg_3s = 13.97 t/s
7.58.736.273 I slot print_timing: id 0 | task 0 | n_decoded = 1773, tg = 13.75 t/s, tg_3s = 13.89 t/s
8.01.740.919 I slot print_timing: id 0 | task 0 | n_decoded = 1815, tg = 13.75 t/s, tg_3s = 13.98 t/s
8.04.743.726 I slot print_timing: id 0 | task 0 | n_decoded = 1857, tg = 13.76 t/s, tg_3s = 13.99 t/s
8.07.763.219 I slot print_timing: id 0 | task 0 | n_decoded = 1899, tg = 13.76 t/s, tg_3s = 13.91 t/s
8.10.809.187 I slot print_timing: id 0 | task 0 | n_decoded = 1941, tg = 13.76 t/s, tg_3s = 13.79 t/s
8.13.825.043 I slot print_timing: id 0 | task 0 | n_decoded = 1982, tg = 13.76 t/s, tg_3s = 13.60 t/s
8.16.887.172 I slot print_timing: id 0 | task 0 | n_decoded = 2024, tg = 13.76 t/s, tg_3s = 13.71 t/s
8.19.910.052 I slot print_timing: id 0 | task 0 | n_decoded = 2066, tg = 13.76 t/s, tg_3s = 13.89 t/s
8.22.916.144 I slot print_timing: id 0 | task 0 | n_decoded = 2108, tg = 13.77 t/s, tg_3s = 13.97 t/s
8.25.939.502 I slot print_timing: id 0 | task 0 | n_decoded = 2150, tg = 13.77 t/s, tg_3s = 13.89 t/s
8.28.998.448 I slot print_timing: id 0 | task 0 | n_decoded = 2192, tg = 13.77 t/s, tg_3s = 13.73 t/s
8.32.031.307 I slot print_timing: id 0 | task 0 | n_decoded = 2234, tg = 13.77 t/s, tg_3s = 13.85 t/s
8.35.045.777 I slot print_timing: id 0 | task 0 | n_decoded = 2276, tg = 13.77 t/s, tg_3s = 13.93 t/s
8.38.075.042 I slot print_timing: id 0 | task 0 | n_decoded = 2317, tg = 13.77 t/s, tg_3s = 13.53 t/s
8.41.144.007 I slot print_timing: id 0 | task 0 | n_decoded = 2359, tg = 13.77 t/s, tg_3s = 13.69 t/s
8.44.200.202 I slot print_timing: id 0 | task 0 | n_decoded = 2401, tg = 13.77 t/s, tg_3s = 13.76 t/s
8.47.208.409 I slot print_timing: id 0 | task 0 | n_decoded = 2442, tg = 13.76 t/s, tg_3s = 13.61 t/s
8.50.237.635 I slot print_timing: id 0 | task 0 | n_decoded = 2484, tg = 13.76 t/s, tg_3s = 13.86 t/s
8.53.306.262 I slot print_timing: id 0 | task 0 | n_decoded = 2527, tg = 13.77 t/s, tg_3s = 14.01 t/s
8.56.364.667 I slot print_timing: id 0 | task 0 | n_decoded = 2570, tg = 13.77 t/s, tg_3s = 14.06 t/s
8.59.399.134 I slot print_timing: id 0 | task 0 | n_decoded = 2612, tg = 13.77 t/s, tg_3s = 13.84 t/s
9.02.440.460 I slot print_timing: id 0 | task 0 | n_decoded = 2655, tg = 13.78 t/s, tg_3s = 14.14 t/s
9.04.270.246 I slot print_timing: id 0 | task 0 | prompt eval time = 168997.32 ms / 8561 tokens ( 19.74 ms per token, 50.66 tokens per second)
9.04.270.248 I slot print_timing: id 0 | task 0 | eval time = 194490.52 ms / 2680 tokens ( 72.57 ms per token, 13.78 tokens per second)
9.04.270.249 I slot print_timing: id 0 | task 0 | total time = 363487.83 ms / 11241 tokens
9.04.270.250 I slot print_timing: id 0 | task 0 | graphs reused = 2668
9.04.317.726 I slot release: id 0 | task 0 | stop processing: n_tokens = 11240, truncated = 0
9.04.317.748 I srv update_slots: all slots are idle
Name and Version
version: 9802 (5f9b312)
built with GNU 16.1.1 for Linux x86_64
Operating systems
Linux
GGML backends
Vulkan
Hardware
3x RX 7900 XTX
Models
Qwen3.5-122B-A10B-Q8_0
Problem description & steps to reproduce
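Recently, reasoning is being deactivated prematurely, before prompt processing even starts. In the log, "reasoning-budget: activated, budget=10240 tokens" is followed one microsecond later by "reasoning-budget: deactivated (natural end)", and only then does the slot begin processing the task.
The model in question is Qwen 3.5 122B @ Q8, but I've also reproduced the issue with Gemma 4 31B @ BF16.
This happens regardless of how simple or complex the prompt is.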
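For anyone trying to confirm the same symptom in their own logs, here is a minimal sketch of a check. It assumes the log line wording shown in this report ("reasoning-budget: activated", "reasoning-budget: deactivated", "prompt processing, n_tokens = ..."), which may differ in other builds. The function name `deactivated_before_prompt` is hypothetical and only looks at the first request in the log.

```python
import re

# Substrings copied from the log in this report; other versions may word them differently.
ACTIVATED = "reasoning-budget: activated"
DEACTIVATED = "reasoning-budget: deactivated"
PROMPT_PROGRESS = re.compile(r"prompt processing, n_tokens = \d+")


def deactivated_before_prompt(log_lines):
    """Return True if the reasoning budget is activated and then deactivated
    before the first prompt-processing progress line appears."""
    activated = False
    for line in log_lines:
        if PROMPT_PROGRESS.search(line):
            return False  # prompt processing started first, so the order looks normal
        if ACTIVATED in line:
            activated = True
        elif DEACTIVATED in line and activated:
            return True
    return False


sample = [
    "3.00.778.682 I reasoning-budget: activated, budget=10240 tokens",
    "3.00.778.683 I reasoning-budget: deactivated (natural end)",
    "3.00.778.699 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0",
    "4.31.399.737 I slot print_timing: id 0 | task 0 | prompt processing, n_tokens = 2048, progress = 0.24",
]
print(deactivated_before_prompt(sample))
```

To run it against a saved log, pass the open file object instead of `sample`, since the function only needs an iterable of lines.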
llama-server command:
llama-server --host 0.0.0.0 --port 5814 -dev Vulkan1,Vulkan2,Vulkan3 --no-warmup -ngl all -fa on --sampling-seq k --top-k 1 --parallel 1 -fitt 0 -cram -1 -n 16384 --reasoning on --reasoning-budget 10240 --reasoning-budget-message ". Actually, let me stop here. I have been thinking about this for long enough, will just reply now." --chat_template_kwargs '{"enable_thinking":true, "reasoning":true, "thinking":true}' -ngl auto -fitc 131072 -m /home/user/models/coder/qwen35-122b.lmstudio.q8.gguf
First Bad Commit
I'm unsure, but somewhere in the last few weeks.
Relevant log output
Logs