./build/bin/llama-cli \
-m unsloth/Mistral-Small-4-119B-2603-GGUF/Mistral-Small-4-119B-2603-UD-Q4_K_XL-00001-of-00003.gguf \
-ngl 99 \
-c 4096 \
-n 512 \
--no-conversation \
-p '[SYSTEM_PROMPT]You are a helpful assistant[/SYSTEM_PROMPT][MODEL_SETTINGS]{"reasoning_effort": "high"}[/MODEL_SETTINGS][INST]What is the capital of France?[/INST]'
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 498216.21 MB
--no-conversation is not supported by llama-cli
please use llama-completion instead
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b8390-b6c83aad5
model : Mistral-Small-4-119B-2603-UD-Q4_K_XL-00001-of-00003.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
> [SYSTEM_PROMPT]You are a helpful assistant[/SYSTEM_PROMPT][MODEL_SETTINGS]{"reasoning_effort": "high"}[/MODEL_SETTINGS][INST]What is the capital of France?[/INST]
[ Prompt: 206.2 t/s | Generation: 71.3 t/s ]
> who are you
[ Prompt: 128.1 t/s | Generation: 71.3 t/s ]
>
Exiting...
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - MTL0 (Apple M3 Ultra) | 475136 = 404407 + (70728 = 70374 + 90 + 264) + 0 |
llama_memory_breakdown_print: | - Host | 568 = 544 + 0 + 24 |
ggml_metal_free: deallocating
Name and Version
llama.cpp: b8390-b6c83aad5
Operating systems
Mac
GGML backends
Metal
Hardware
Environment: Apple M3 Ultra (per the memory breakdown in the log)
cc @ngxson
Models
Mistral-Small-4
Problem description & steps to reproduce
Symptoms:
Output is sometimes correct and sometimes falls into a repetition loop (for example, "汉书后汉书后汉书后..." repeated endlessly).
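To put a number on "sometimes", a rough check like the one below could be run over captured outputs from repeated runs. This is only a sketch: `find_repetition_loop` is a hypothetical helper, not part of llama.cpp, and the thresholds (a unit of up to 20 characters repeated at least 5 times in a row) are arbitrary assumptions. The sample strings are stand-ins, not real model output.

```python
import re


def find_repetition_loop(text, min_repeats=5, max_unit=20):
    """Return the shortest unit found repeating back to back, or None.

    Hypothetical heuristic: a "loop" is any unit of 1..max_unit characters
    that occurs at least min_repeats times consecutively.
    """
    # (unit) followed by the same unit at least (min_repeats - 1) more times.
    pattern = re.compile(
        r"(.{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1),
        re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


# Stand-in outputs; in practice these would be captured from separate runs.
outputs = [
    "The capital of France is Paris.",
    "汉书后" * 10,
]
looped = [o for o in outputs if find_repetition_loop(o) is not None]
print(f"{len(looped)}/{len(outputs)} outputs looped")
```

The heuristic also flags benign runs such as long rows of dashes or spaces, so the thresholds may need tuning before the counts mean much.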
First Bad Commit
No response
Relevant log output