Eval bug: Premature "reasoning-budget: deactivated (natural end)", even BEFORE prompt processing #25067

Description

@ross-rosario

Name and Version

version: 9802 (5f9b312)
built with GNU 16.1.1 for Linux x86_64

Operating systems

Linux

GGML backends

Vulkan

Hardware

3x RX 7900 XTX

Models

Qwen3.5-122B-A10B-Q8_0

Problem description & steps to reproduce

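Recently, reasoning appears to be deactivated prematurely: the server logs "reasoning-budget: deactivated (natural end)" immediately after "reasoning-budget: activated", before prompt processing has even started.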

The model in question is Qwen 3.5 122B @ Q8, but I've also reproduced the issue with Gemma 4 31B @ BF16.

This happens regardless of how simple or complex the prompt is.

llama-server command:
llama-server --host 0.0.0.0 --port 5814 -dev Vulkan1,Vulkan2,Vulkan3 --no-warmup -ngl all -fa on --sampling-seq k --top-k 1 --parallel 1 -fitt 0 -cram -1 -n 16384 --reasoning on --reasoning-budget 10240 --reasoning-budget-message ". Actually, let me stop here. I have been thinking about this for long enough, will just reply now." --chat_template_kwargs '{"enable_thinking":true, "reasoning":true, "thinking":true}' -ngl auto -fitc 131072 -m /home/user/models/coder/qwen35-122b.lmstudio.q8.gguf

First Bad Commit

I'm unsure, but somewhere in the last few weeks.

Relevant log output

Logs
0.00.087.730 W Setting 'enable_thinking' via --chat-template-kwargs is deprecated. Use --reasoning on / --reasoning off instead.
0.00.087.736 W DEPRECATED: argument '-ngl' specified multiple times, use comma-separated values instead (only last value will be used)
0.00.087.908 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.087.909 I device_info:
0.00.088.033 I   - Vulkan0 : AMD Ryzen 9 9950X 16-Core Processor (RADV RAPHAEL_MENDOCINO) (32986 MiB, 32925 MiB free)
0.00.088.131 I   - Vulkan1 : AMD Radeon RX 7900 XTX (RADV NAVI31) (24560 MiB, 24533 MiB free)
0.00.088.227 I   - Vulkan2 : AMD Radeon RX 7900 XTX (RADV NAVI31) (24560 MiB, 24533 MiB free)
0.00.088.329 I   - Vulkan3 : AMD Radeon RX 7900 XTX (RADV NAVI31) (24560 MiB, 24533 MiB free)
0.00.088.332 I   - CPU     : AMD Ryzen 9 9950X 16-Core Processor (61876 MiB, 61876 MiB free)
0.00.088.344 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.088.679 I srv          init: running without SSL
0.00.088.708 I srv          init: using 31 threads for HTTP server
0.00.089.095 I srv         start: binding port with default address family
0.00.090.194 I srv  llama_server: loading model
0.00.090.446 I srv    load_model: loading model '/home/user/models/coder/qwen35-122b.lmstudio.q8.gguf'
0.00.090.453 I common_init_result: fitting params to device memory ...
0.00.090.453 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.06.365.624 W llama_model_loader: tensor overrides to CPU are used with mmap enabled - consider using --no-mmap for better performance
2.37.264.962 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
2.37.507.465 I srv    load_model: initializing slots, n_slots = 1
2.39.584.927 W srv    load_model: speculative decoding will use checkpoints
2.39.584.936 W common_speculative_init: no implementations specified for speculative decoding
2.39.584.938 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 131072
2.39.584.982 I srv    load_model: prompt cache is enabled, size limit: no limit
2.39.584.983 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
2.39.584.983 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
2.39.584.983 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 8192
2.39.587.620 I srv          init: idle slots will be saved to prompt cache upon starting a new task
2.39.611.924 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
2.39.620.410 I srv          init: init: chat template, thinking = 1
2.39.620.820 I srv  llama_server: model loaded
2.39.620.957 I srv  llama_server: server is listening on http://0.0.0.0:5814
2.39.620.960 I srv  update_slots: all slots are idle
3.00.776.689 I srv    operator(): Chat format: peg-native
3.00.778.048 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
3.00.778.050 I srv  get_availabl: updating prompt cache
3.00.778.241 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
3.00.778.387 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 0.000 MiB, 131072 tokens, 131072 est)
3.00.778.388 I srv  get_availabl: prompt cache update took 0.34 ms
3.00.778.682 I reasoning-budget: activated, budget=10240 tokens
3.00.778.683 I reasoning-budget: deactivated (natural end)
3.00.778.699 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
4.31.399.737 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   2048, progress = 0.24, t =  90.62 s / 22.60 tokens per second
4.58.000.702 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   4096, progress = 0.48, t = 117.22 s / 34.94 tokens per second
5.22.538.873 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   6144, progress = 0.72, t = 141.76 s / 43.34 tokens per second
5.43.849.687 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   8045, progress = 0.94, t = 163.07 s / 49.33 tokens per second
5.43.921.465 I slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 8044, pos_max = 8044, n_tokens = 8045, size = 149.063 MiB)
5.49.584.166 I slot print_timing: id  0 | task 0 | prompt processing, n_tokens =   8557, progress = 1.00, t = 168.81 s / 50.69 tokens per second
5.57.215.243 I slot print_timing: id  0 | task 0 | n_decoded =    100, tg =  13.44 t/s, tg_3s =  13.44 t/s
6.00.262.132 I slot print_timing: id  0 | task 0 | n_decoded =    142, tg =  13.54 t/s, tg_3s =  13.78 t/s
6.03.330.054 I slot print_timing: id  0 | task 0 | n_decoded =    184, tg =  13.58 t/s, tg_3s =  13.69 t/s
6.06.359.489 I slot print_timing: id  0 | task 0 | n_decoded =    225, tg =  13.57 t/s, tg_3s =  13.53 t/s
6.09.428.887 I slot print_timing: id  0 | task 0 | n_decoded =    267, tg =  13.59 t/s, tg_3s =  13.68 t/s
6.12.439.647 I slot print_timing: id  0 | task 0 | n_decoded =    308, tg =  13.59 t/s, tg_3s =  13.62 t/s
6.15.491.996 I slot print_timing: id  0 | task 0 | n_decoded =    350, tg =  13.61 t/s, tg_3s =  13.76 t/s
6.18.520.959 I slot print_timing: id  0 | task 0 | n_decoded =    392, tg =  13.64 t/s, tg_3s =  13.87 t/s
6.21.561.016 I slot print_timing: id  0 | task 0 | n_decoded =    434, tg =  13.65 t/s, tg_3s =  13.82 t/s
6.24.595.708 I slot print_timing: id  0 | task 0 | n_decoded =    476, tg =  13.67 t/s, tg_3s =  13.84 t/s
6.27.650.859 I slot print_timing: id  0 | task 0 | n_decoded =    518, tg =  13.68 t/s, tg_3s =  13.75 t/s
6.30.671.137 I slot print_timing: id  0 | task 0 | n_decoded =    560, tg =  13.69 t/s, tg_3s =  13.91 t/s
6.33.741.494 I slot print_timing: id  0 | task 0 | n_decoded =    602, tg =  13.69 t/s, tg_3s =  13.68 t/s
6.36.785.341 I slot print_timing: id  0 | task 0 | n_decoded =    644, tg =  13.70 t/s, tg_3s =  13.80 t/s
6.39.816.274 I slot print_timing: id  0 | task 0 | n_decoded =    686, tg =  13.71 t/s, tg_3s =  13.86 t/s
6.42.822.004 I slot print_timing: id  0 | task 0 | n_decoded =    728, tg =  13.72 t/s, tg_3s =  13.99 t/s
6.45.824.822 I slot print_timing: id  0 | task 0 | n_decoded =    770, tg =  13.74 t/s, tg_3s =  13.97 t/s
6.48.859.971 I slot print_timing: id  0 | task 0 | n_decoded =    812, tg =  13.74 t/s, tg_3s =  13.84 t/s
6.51.933.178 I slot print_timing: id  0 | task 0 | n_decoded =    854, tg =  13.74 t/s, tg_3s =  13.67 t/s
6.54.999.574 I slot print_timing: id  0 | task 0 | n_decoded =    896, tg =  13.74 t/s, tg_3s =  13.70 t/s
6.58.018.419 I slot print_timing: id  0 | task 0 | n_decoded =    938, tg =  13.75 t/s, tg_3s =  13.91 t/s
7.01.056.528 I slot print_timing: id  0 | task 0 | n_decoded =    980, tg =  13.75 t/s, tg_3s =  13.82 t/s
7.04.098.654 I slot print_timing: id  0 | task 0 | n_decoded =   1022, tg =  13.75 t/s, tg_3s =  13.81 t/s
7.07.102.953 I slot print_timing: id  0 | task 0 | n_decoded =   1063, tg =  13.75 t/s, tg_3s =  13.65 t/s
7.10.176.210 I slot print_timing: id  0 | task 0 | n_decoded =   1105, tg =  13.74 t/s, tg_3s =  13.67 t/s
7.13.243.442 I slot print_timing: id  0 | task 0 | n_decoded =   1147, tg =  13.74 t/s, tg_3s =  13.69 t/s
7.16.252.459 I slot print_timing: id  0 | task 0 | n_decoded =   1188, tg =  13.74 t/s, tg_3s =  13.63 t/s
7.19.271.067 I slot print_timing: id  0 | task 0 | n_decoded =   1229, tg =  13.73 t/s, tg_3s =  13.58 t/s
7.22.303.765 I slot print_timing: id  0 | task 0 | n_decoded =   1270, tg =  13.73 t/s, tg_3s =  13.52 t/s
7.25.349.031 I slot print_timing: id  0 | task 0 | n_decoded =   1312, tg =  13.73 t/s, tg_3s =  13.79 t/s
7.28.393.669 I slot print_timing: id  0 | task 0 | n_decoded =   1354, tg =  13.73 t/s, tg_3s =  13.81 t/s
7.31.415.895 I slot print_timing: id  0 | task 0 | n_decoded =   1396, tg =  13.73 t/s, tg_3s =  13.88 t/s
7.34.475.046 I slot print_timing: id  0 | task 0 | n_decoded =   1438, tg =  13.73 t/s, tg_3s =  13.73 t/s
7.37.502.884 I slot print_timing: id  0 | task 0 | n_decoded =   1479, tg =  13.73 t/s, tg_3s =  13.54 t/s
7.40.557.754 I slot print_timing: id  0 | task 0 | n_decoded =   1521, tg =  13.73 t/s, tg_3s =  13.75 t/s
7.43.581.644 I slot print_timing: id  0 | task 0 | n_decoded =   1563, tg =  13.73 t/s, tg_3s =  13.89 t/s
7.46.606.253 I slot print_timing: id  0 | task 0 | n_decoded =   1605, tg =  13.74 t/s, tg_3s =  13.89 t/s
7.49.663.819 I slot print_timing: id  0 | task 0 | n_decoded =   1647, tg =  13.74 t/s, tg_3s =  13.74 t/s
7.52.705.871 I slot print_timing: id  0 | task 0 | n_decoded =   1689, tg =  13.74 t/s, tg_3s =  13.81 t/s
7.55.711.757 I slot print_timing: id  0 | task 0 | n_decoded =   1731, tg =  13.75 t/s, tg_3s =  13.97 t/s
7.58.736.273 I slot print_timing: id  0 | task 0 | n_decoded =   1773, tg =  13.75 t/s, tg_3s =  13.89 t/s
8.01.740.919 I slot print_timing: id  0 | task 0 | n_decoded =   1815, tg =  13.75 t/s, tg_3s =  13.98 t/s
8.04.743.726 I slot print_timing: id  0 | task 0 | n_decoded =   1857, tg =  13.76 t/s, tg_3s =  13.99 t/s
8.07.763.219 I slot print_timing: id  0 | task 0 | n_decoded =   1899, tg =  13.76 t/s, tg_3s =  13.91 t/s
8.10.809.187 I slot print_timing: id  0 | task 0 | n_decoded =   1941, tg =  13.76 t/s, tg_3s =  13.79 t/s
8.13.825.043 I slot print_timing: id  0 | task 0 | n_decoded =   1982, tg =  13.76 t/s, tg_3s =  13.60 t/s
8.16.887.172 I slot print_timing: id  0 | task 0 | n_decoded =   2024, tg =  13.76 t/s, tg_3s =  13.71 t/s
8.19.910.052 I slot print_timing: id  0 | task 0 | n_decoded =   2066, tg =  13.76 t/s, tg_3s =  13.89 t/s
8.22.916.144 I slot print_timing: id  0 | task 0 | n_decoded =   2108, tg =  13.77 t/s, tg_3s =  13.97 t/s
8.25.939.502 I slot print_timing: id  0 | task 0 | n_decoded =   2150, tg =  13.77 t/s, tg_3s =  13.89 t/s
8.28.998.448 I slot print_timing: id  0 | task 0 | n_decoded =   2192, tg =  13.77 t/s, tg_3s =  13.73 t/s
8.32.031.307 I slot print_timing: id  0 | task 0 | n_decoded =   2234, tg =  13.77 t/s, tg_3s =  13.85 t/s
8.35.045.777 I slot print_timing: id  0 | task 0 | n_decoded =   2276, tg =  13.77 t/s, tg_3s =  13.93 t/s
8.38.075.042 I slot print_timing: id  0 | task 0 | n_decoded =   2317, tg =  13.77 t/s, tg_3s =  13.53 t/s
8.41.144.007 I slot print_timing: id  0 | task 0 | n_decoded =   2359, tg =  13.77 t/s, tg_3s =  13.69 t/s
8.44.200.202 I slot print_timing: id  0 | task 0 | n_decoded =   2401, tg =  13.77 t/s, tg_3s =  13.76 t/s
8.47.208.409 I slot print_timing: id  0 | task 0 | n_decoded =   2442, tg =  13.76 t/s, tg_3s =  13.61 t/s
8.50.237.635 I slot print_timing: id  0 | task 0 | n_decoded =   2484, tg =  13.76 t/s, tg_3s =  13.86 t/s
8.53.306.262 I slot print_timing: id  0 | task 0 | n_decoded =   2527, tg =  13.77 t/s, tg_3s =  14.01 t/s
8.56.364.667 I slot print_timing: id  0 | task 0 | n_decoded =   2570, tg =  13.77 t/s, tg_3s =  14.06 t/s
8.59.399.134 I slot print_timing: id  0 | task 0 | n_decoded =   2612, tg =  13.77 t/s, tg_3s =  13.84 t/s
9.02.440.460 I slot print_timing: id  0 | task 0 | n_decoded =   2655, tg =  13.78 t/s, tg_3s =  14.14 t/s
9.04.270.246 I slot print_timing: id  0 | task 0 | prompt eval time =  168997.32 ms /  8561 tokens (   19.74 ms per token,    50.66 tokens per second)
9.04.270.248 I slot print_timing: id  0 | task 0 |        eval time =  194490.52 ms /  2680 tokens (   72.57 ms per token,    13.78 tokens per second)
9.04.270.249 I slot print_timing: id  0 | task 0 |       total time =  363487.83 ms / 11241 tokens
9.04.270.250 I slot print_timing: id  0 | task 0 |    graphs reused =       2668
9.04.317.726 I slot      release: id  0 | task 0 | stop processing: n_tokens = 11240, truncated = 0
9.04.317.748 I srv  update_slots: all slots are idle
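
For anyone comparing against their own logs, here is a minimal sketch of how the ordering could be checked automatically. It is not part of llama.cpp. The helper names are hypothetical, and it assumes the timestamp prefix is minutes.seconds.milliseconds.microseconds, which is how I read the log above. The sample lines are copied from that log.

```python
import re

# Assumed timestamp layout (inferred from the log, not documented here):
# M.SS.mmm.uuu = minutes, seconds, milliseconds, microseconds.
TS = re.compile(r"^(\d+)\.(\d{2})\.(\d{3})\.(\d{3}) ")


def to_us(line):
    """Return the line's timestamp in microseconds, or None if it has none."""
    m = TS.match(line)
    if not m:
        return None
    minutes, sec, ms, us = map(int, m.groups())
    return ((minutes * 60 + sec) * 1000 + ms) * 1000 + us


def budget_ended_before_prompt(lines):
    """True if the 'natural end' line appears before any prompt-processing progress line."""
    for line in lines:
        if "reasoning-budget: deactivated (natural end)" in line:
            return True
        if "prompt processing, n_tokens" in line:
            return False
    return False


def activation_gap_us(lines):
    """Microseconds between the 'activated' and 'deactivated' lines, or None if not found."""
    start = None
    for line in lines:
        if "reasoning-budget: activated" in line:
            start = to_us(line)
        elif "reasoning-budget: deactivated" in line and start is not None:
            return to_us(line) - start
    return None


sample = [
    "3.00.778.682 I reasoning-budget: activated, budget=10240 tokens",
    "3.00.778.683 I reasoning-budget: deactivated (natural end)",
    "4.31.399.737 I slot print_timing: id  0 | task 0 | prompt processing, "
    "n_tokens =   2048, progress = 0.24, t =  90.62 s / 22.60 tokens per second",
]

print(budget_ended_before_prompt(sample), activation_gap_us(sample))
```

On the three sample lines this reports that the budget ended before prompt processing, with a gap of one microsecond between activation and deactivation (under the timestamp assumption above).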
