Eval bug: llama-server produces intermittent garbage output with Talkie architecture, llama-cli works fine #23953

### Name and Version

version: 9438 (d749821db)
built with GNU 15.2.1 for Linux x86_64

### Operating systems

Linux

### GGML backends

Vulkan

### Hardware

Ryzen 5900x + 7900 XTX

### Models

Talkie 1930 13b IT HF Q8

https://huggingface.co/mradermacher/talkie-1930-13b-it-hf-GGUF

Talkie Q8_0

https://huggingface.co/niklassheth/talkie-1930-13b-it-GGUF/tree/main

### Problem description & steps to reproduce

llama-cli -cnv produces coherent output every time with the Talkie 1930 13B architecture (merged in #22596), but llama-server produces repetitive garbage on both the /v1/chat/completions and /completion endpoints. The garbage is intermittent: the same prompt sometimes produces coherent output and sometimes degenerate repetition.

The CLI works when I run this:

llama-cli \
  -m talkie-1930-13b-it-hf.Q8_0.gguf \
  -ngl 999 \
  -c 2048 \
  --jinja \
  -cnv \
  -p "Hello, what year is it?"

It consistently produces clean responses such as:

It is 1883.

With llama-server, however, I cannot get coherent output reliably. I start the server like this:

llama-server \
  -m talkie-1930-13b-it-hf.Q8_0.gguf \
  -ngl 999 \
  -c 2048 \
  --jinja \
  --host 0.0.0.0 \
  --port 11434 \
  --parallel 1 \
  --metrics

I tried curl:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "talkie",
    "messages": [{"role": "user", "content": "Hello, what year is it?"}],
    "max_tokens": 100
  }'

I also tried OpenWebUI and the raw llama UI endpoints. All of them intermittently return output like this:

? 2? 4? 2? 2? 4? 2? 2? 2? 2? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4? 4?

Or:

NRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNRNARNARNRARNRNRNRNERNRNERNERNRNRNERNNRNERNRNRNERNRNRNRNRNARNARARNERNRNRNERNRNERNRNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERNERN

Or:

an? a?? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a? a

I have tested both GGUFs listed above, as well as one I quantized myself with the current build to see if that made a difference. All three behave identically.

I also tried disabling flash-attn, disabling prompt cache (--cache-ram 0), disabling warmup (--no-warmup), and passing an explicit --chat-template-file. None made a difference.
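
To put a number on how often the server output degenerates, the same request could be replayed in a loop and each reply checked for repetition. The following is only an illustrative sketch under stated assumptions: `looks_degenerate` and `sample_server` are hypothetical helper names, the distinct-trigram threshold of 0.3 is an arbitrary choice, and the code assumes the server replies in the OpenAI-style shape with the text at `choices[0].message.content`.

```python
import json
import urllib.request


def looks_degenerate(text, n=3, max_distinct_ratio=0.3, min_chars=20):
    """Heuristic (assumption): repetitive garbage has very few distinct n-grams."""
    stripped = "".join(text.split())  # drop all whitespace before counting
    if len(stripped) < min_chars:
        return False  # too short to judge, e.g. "It is 1883."
    grams = [stripped[i:i + n] for i in range(len(stripped) - n + 1)]
    return len(set(grams)) / len(grams) < max_distinct_ratio


def sample_server(url, prompt, runs=20, max_tokens=100):
    """Send the same chat request `runs` times; return (index, text) of flagged replies."""
    payload = json.dumps({
        "model": "talkie",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    flagged = []
    for i in range(runs):
        request = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request) as response:
            body = json.load(response)
        # Assumed response shape: OpenAI-compatible chat completion.
        text = body["choices"][0]["message"]["content"] or ""
        if looks_degenerate(text):
            flagged.append((i, text))
    return flagged
```

Calling `sample_server("http://localhost:11434/v1/chat/completions", "Hello, what year is it?")` against the server started above would return the flagged replies, so the failure rate is the length of the result divided by `runs`.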

### First Bad Commit

_No response_

### Relevant log output

0.00.022.031 I device_info:
0.00.022.164 I   - Vulkan0 : AMD Radeon RX 7900 XTX (RADV NAVI31) (24576 MiB, 22454 MiB free)
0.00.022.169 I   - CPU     : AMD Ryzen 9 5900X 12-Core Processor (32009 MiB, 32009 MiB free)
0.00.022.214 I system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.022.273 I srv          init: using 23 threads for HTTP server
0.00.022.356 I srv         start: binding port with default address family
0.00.023.710 I srv  llama_server: loading model
0.00.023.714 I srv    load_model: loading model '/home/joeenderman/AI_Models/talkie-Q8_0-self.gguf'
0.00.023.737 I common_init_result: fitting params to device memory ...
0.00.023.738 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.09.612.672 I srv    load_model: initializing slots, n_slots = 1
0.09.660.235 W common_speculative_init: no implementations specified for speculative decoding
0.09.660.238 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 2048
0.09.660.298 I srv    load_model: prompt cache is disabled - use `--cache-ram N` to enable it
0.09.660.300 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.09.660.300 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
0.09.660.329 W srv          init: --cache-idle-slots requires --kv-unified, disabling
0.09.661.716 I init: chat template, example_format: '<|system|>You are a helpful assistant<|end|><|user|>Hello<|end|><|assistant|>Hi there<|end|><|user|>How are you?<|end|><|assistant|>'
0.09.662.353 I srv          init: init: chat template, thinking = 0
0.09.662.376 I srv  llama_server: model loaded
0.09.662.382 I srv  llama_server: server is listening on http://0.0.0.0:11434
0.09.662.387 I srv  update_slots: all slots are idle

