Eval bug: llama-server produces intermittent garbage output with Talkie architecture, llama-cli works fine #23953

Description

@JoeEnderman

Name and Version

version: 9438 (d749821)
built with GNU 15.2.1 for Linux x86_64

Operating systems

Linux

GGML backends

Vulkan

Hardware

Ryzen 9 5900X + Radeon RX 7900 XTX

Models

Talkie 1930 13B IT HF Q8_0

https://huggingface.co/mradermacher/talkie-1930-13b-it-hf-GGUF

Talkie Q8_0

https://huggingface.co/niklassheth/talkie-1930-13b-it-GGUF/tree/main

Problem description & steps to reproduce

llama-cli -cnv produces coherent output every time with the Talkie 1930 13B architecture (merged in #22596), but llama-server produces repetitive garbage on both the /v1/chat/completions and /completion endpoints. The failure is intermittent: the same prompt sometimes yields coherent output and sometimes degenerate repetition.

CLI works when I do this:

llama-cli \
  -m talkie-1930-13b-it-hf.Q8_0.gguf \
  -ngl 999 \
  -c 2048 \
  --jinja \
  -cnv \
  -p "Hello, what year is it?"

It always produces a clean response like this: It is 1883.

With llama-server, however, I cannot get coherent output reliably. I start it like this:

llama-server \
  -m talkie-1930-13b-it-hf.Q8_0.gguf \
  -ngl 999 \
  -c 2048 \
  --jinja \
  --host 0.0.0.0 \
  --port 11434 \
  --parallel 1 \
  --metrics

I tried curl:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "talkie",
    "messages": [{"role": "user", "content": "Hello, what year is it?"}],
    "max_tokens": 100
  }'

I also tried OpenWebUI and the raw llama UI endpoints. With any of these clients I intermittently get output like this:

? 2? 4? 2? 2? 4? 2? 2? 2? 2? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4?

Or:

NRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNARNARNRARNRNRNRNERNRNERNERNRNRNERNNRNERNRNRNERNRNRNRNRNARNARARNERNRNRNERNRNERNRNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERN

Or:

an? a?? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a
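To put a number on "intermittent", the same request can be repeated and the degenerate replies counted. The following is a minimal sketch, not part of the original report: `looks_degenerate`, `ask`, and `count_degenerate` are hypothetical helper names, the 0.5 threshold is an arbitrary assumption, and the URL assumes the server was started with the --port shown above. The response is read from `choices[0].message.content`, the usual shape of an OpenAI-compatible chat completion.

```python
import json
import urllib.request
from collections import Counter

# Assumption: matches the --port 11434 used in this report.
URL = "http://localhost:11434/v1/chat/completions"


def looks_degenerate(text, max_share=0.5, min_tokens=8):
    """Heuristic only: True if a single token makes up more than max_share
    of the output. Tokens are whitespace-separated words, or character
    bigrams when the text has too few words (e.g. "NRNRNR...")."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        compact = "".join(tokens)
        tokens = [compact[i:i + 2] for i in range(0, len(compact) - 1, 2)]
    if len(tokens) < min_tokens:
        return False  # too short to judge
    top_count = Counter(tokens).most_common(1)[0][1]
    return top_count / len(tokens) > max_share


def ask(prompt, max_tokens=100):
    """Send one chat request with the same body as the curl command above."""
    body = json.dumps({
        "model": "talkie",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


def count_degenerate(prompt, runs=20, ask_fn=ask):
    """Repeat the same prompt and count replies flagged as degenerate."""
    return sum(looks_degenerate(ask_fn(prompt)) for _ in range(runs))
```

With the server running, `count_degenerate("Hello, what year is it?", runs=20)` gives a rough failure count that can be compared across builds or flags. The heuristic is deliberately crude and may need tuning for other prompts.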

I have tested both GGUFs listed above, and I also quantized my own with the current build to see whether that made a difference. All three behave identically.

I also tried disabling flash-attn, disabling prompt cache (--cache-ram 0), disabling warmup (--no-warmup), and passing an explicit --chat-template-file. None made a difference.

First Bad Commit

No response

Relevant log output

0.00.022.031 I device_info:
0.00.022.164 I - Vulkan0 : AMD Radeon RX 7900 XTX (RADV NAVI31) (24576 MiB, 22454 MiB free)
0.00.022.169 I - CPU : AMD Ryzen 9 5900X 12-Core Processor (32009 MiB, 32009 MiB free)
0.00.022.214 I system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.022.273 I srv init: using 23 threads for HTTP server
0.00.022.356 I srv start: binding port with default address family
0.00.023.710 I srv llama_server: loading model
0.00.023.714 I srv load_model: loading model '/home/joeenderman/AI_Models/talkie-Q8_0-self.gguf'
0.00.023.737 I common_init_result: fitting params to device memory ...
0.00.023.738 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.09.612.672 I srv load_model: initializing slots, n_slots = 1
0.09.660.235 W common_speculative_init: no implementations specified for speculative decoding
0.09.660.238 I slot load_model: id 0 | task -1 | new slot, n_ctx = 2048
0.09.660.298 I srv load_model: prompt cache is disabled - use --cache-ram N to enable it
0.09.660.300 I srv load_model: for more info see #16391
0.09.660.300 I srv load_model: context checkpoints enabled, max = 32, min spacing = 256
0.09.660.329 W srv init: --cache-idle-slots requires --kv-unified, disabling
0.09.661.716 I init: chat template, example_format: '<|system|>You are a helpful assistant<|end|><|user|>Hello<|end|><|assistant|>Hi there<|end|><|user|>How are you?<|end|><|assistant|>'
0.09.662.353 I srv init: init: chat template, thinking = 0
0.09.662.376 I srv llama_server: model loaded
0.09.662.382 I srv llama_server: server is listening on http://0.0.0.0:11434
0.09.662.387 I srv update_slots: all slots are idle
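One way to separate chat-template handling from inference itself is to render the prompt by hand and send it to /completion with sampling pinned down. The sketch below is an assumption-laden illustration, not something tried in this report: `render_talkie_prompt` and `completion_payload` are hypothetical helpers, the format is copied from the `example_format` log line above, and the payload fields (`prompt`, `n_predict`, `temperature`, `seed`, `cache_prompt`) are the /completion request fields as I understand them from the llama.cpp server documentation. Whether the server adds a BOS token or parses the `<|...|>` markers as special tokens in this path is not verified here.

```python
import json


def render_talkie_prompt(messages):
    """Mirror the example_format log line: each turn becomes
    <|role|>content<|end|>, followed by an open <|assistant|> tag."""
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    return "".join(parts) + "<|assistant|>"


def completion_payload(messages, n_predict=100, seed=42):
    """Build a /completion request body with greedy sampling and a fixed
    seed, so repeated runs should be comparable (assumed field names)."""
    return {
        "prompt": render_talkie_prompt(messages),
        "n_predict": n_predict,
        "temperature": 0.0,
        "seed": seed,
        "cache_prompt": False,
    }


payload = completion_payload(
    [{"role": "user", "content": "Hello, what year is it?"}]
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be posted to http://localhost:11434/completion. If output still varies between identical requests at temperature 0, that would point away from the template and sampler and toward the server's decode path, though that conclusion is only as strong as the assumptions above.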
