Eval bug: Mistral Small 4 (mistral4 arch): repetitive/empty output on Metal #20668

@bunnynode

Name and Version

llama.cpp: b8390-b6c83aad5

Operating systems

Mac

GGML backends

Metal

Hardware

Environment:

  • llama.cpp: b8390-b6c83aad5
  • Device: Apple M3 Ultra 512GB
  • GGUF: unsloth/Mistral-Small-4-119B-2603-GGUF Q4_K_XL
  • FA on/off: no difference

cc @ngxson

Models

Mistral-Small-4

Problem description & steps to reproduce

Symptoms:

  1. Chat mode (via server with --jinja): the output is repetitive text
    ("汉书后汉书后汉书后..." repeated endlessly)
  2. Think/reasoning mode: no output is produced and generation hangs
  3. Pure completion (--no-conversation): intermittent;
    sometimes the output is correct, sometimes it falls into a repetition loop
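
Since symptom 3 is intermittent, it may help to classify captured outputs automatically when comparing runs or builds. The following is a minimal sketch, not part of llama.cpp: `find_repeating_suffix` is a hypothetical helper, and the `max_period` and `min_repeats` thresholds are arbitrary assumptions. It checks whether a generated string ends in a short unit repeated many times, as in the "汉书后" loop above.

```python
from typing import Optional


def find_repeating_suffix(
    text: str, max_period: int = 32, min_repeats: int = 4
) -> Optional[str]:
    """Return the shortest unit repeated at least `min_repeats` times at the
    end of `text`, or None if no such unit of length <= `max_period` exists.

    Hypothetical helper for triaging model output; thresholds are assumptions.
    """
    for period in range(1, max_period + 1):
        # Not enough text left to hold `min_repeats` copies of a unit this long.
        if len(text) < period * min_repeats:
            break
        unit = text[-period:]
        if text.endswith(unit * min_repeats):
            return unit
    return None


if __name__ == "__main__":
    looping = "汉书后" * 10
    normal = "The capital of France is Paris."
    print(find_repeating_suffix(looping))
    print(find_repeating_suffix(normal))
```

If the output was cut off mid-unit (for example by the -n limit), the returned unit may be a rotation of the visible pattern, which is still enough to flag the run as a repetition loop. An empty output (symptom 2) returns None and has to be checked separately.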

First Bad Commit

No response

Relevant log output

Logs
 ./build/bin/llama-cli \
  -m unsloth/Mistral-Small-4-119B-2603-GGUF/Mistral-Small-4-119B-2603-UD-Q4_K_XL-00001-of-00003.gguf \
  -ngl 99 \
  -c 4096 \
  -n 512 \
  --no-conversation \
  -p '[SYSTEM_PROMPT]You are a helpful assistant[/SYSTEM_PROMPT][MODEL_SETTINGS]{"reasoning_effort": "high"}[/MODEL_SETTINGS][INST]What is the capital of France?[/INST]'
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name:   MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4  (5002)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 498216.21 MB
--no-conversation is not supported by llama-cli
please use llama-completion instead

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8390-b6c83aad5
model      : Mistral-Small-4-119B-2603-UD-Q4_K_XL-00001-of-00003.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


> [SYSTEM_PROMPT]You are a helpful assistant[/SYSTEM_PROMPT][MODEL_SETTINGS]{"reasoning_effort": "high"}[/MODEL_SETTINGS][INST]What is the capital of France?[/INST]

 

[ Prompt: 206.2 t/s | Generation: 71.3 t/s ]

> who are you

 

[ Prompt: 128.1 t/s | Generation: 71.3 t/s ]

> 

Exiting...
llama_memory_breakdown_print: | memory breakdown [MiB]    |  total     free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - MTL0 (Apple M3 Ultra) | 475136 = 404407 + (70728 = 70374 +      90 +     264) +           0 |
llama_memory_breakdown_print: |   - Host                  |                      568 =   544 +       0 +      24                |
ggml_metal_free: deallocating
