I'm using Open WebUI as my ds4 frontend. It can display information about the response, such as tokens per second, if the backend provides it in the usage block:
curl http://a-llama-cpp-server:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gemma-4-12B-it-qat",
"messages": [{"role": "user", "content": "Say hello"}],
"stream": false
}'
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "Hello!",
"reasoning_content": "The user said \"Say hello\".\nThe user wants me to say the word \"hello\" or a variation of it.\n\n * Standard: Hello!\n * Friendly: Hi there!\n * Enthusiastic: Hello, how can I help you today?"
}
}
],
"created": 1780761818,
"model": "gemma-4-12B-it-qat",
"system_fingerprint": "b9518-7c158fbb4",
"object": "chat.completion",
"usage": {
"completion_tokens": 67,
"prompt_tokens": 18,
"total_tokens": 85,
"prompt_tokens_details": {
"cached_tokens": 1
}
},
"id": "chatcmpl-0BYrADMFNyBokAwslPFLB3TZLIQUg3BF",
"timings": {
"cache_n": 1,
"prompt_n": 17,
"prompt_ms": 234.305,
"prompt_per_token_ms": 13.78264705882353,
"prompt_per_second": 72.55500309425747,
"predicted_n": 67,
"predicted_ms": 2435.025,
"predicted_per_token_ms": 36.34365671641791,
"predicted_per_second": 27.515117914600463
}
}
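As a quick sanity check on the response above, the per-second figures in `timings` look derivable from just the token counts and elapsed milliseconds. A minimal Python sketch, with the values copied from that response (the helper name `tokens_per_second` is mine, not from llama.cpp or Open WebUI):

```python
# Timing fields copied from the llama.cpp response above.
timings = {
    "prompt_n": 17,
    "prompt_ms": 234.305,
    "predicted_n": 67,
    "predicted_ms": 2435.025,
}


def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Return tokens per second, or 0.0 when no elapsed time was recorded."""
    if elapsed_ms <= 0:
        return 0.0
    return n_tokens / (elapsed_ms / 1000.0)


prompt_tps = tokens_per_second(timings["prompt_n"], timings["prompt_ms"])
predicted_tps = tokens_per_second(timings["predicted_n"], timings["predicted_ms"])

# These should agree with prompt_per_second and predicted_per_second above.
print(f"prompt: {prompt_tps:.2f} tok/s, generation: {predicted_tps:.2f} tok/s")
```

So a count and an elapsed time for each of the two phases would seem to be enough to reproduce the rest.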
Compare that with a current ds4 response, which has no timings block:
curl http://a-ds4-server:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ds4",
"messages": [{"role": "user", "content": "Say hello"}],
"stream": false
}'
{
"id": "chatcmpl-1",
"object": "chat.completion",
"created": 1780761748,
"model": "ds4",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?",
"reasoning_content": "We need to respond to the user's request. The user said \"Say hello\". That is a simple instruction. As an AI, I should comply and say hello. So I will respond with a greeting."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 6,
"completion_tokens": 52,
"total_tokens": 58,
"prompt_tokens_details": {
"cached_tokens": 0,
"cache_write_tokens": 6
}
}
}
If more usage stats are available in the code, it would be very nice if they were plumbed through to usage. I'd be happy to work on this if it's a good first issue, but I wanted to request the feature first before dumping a PR on you.
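To make the request concrete, here is a rough sketch of the shape I have in mind, written in Python purely for illustration. It is not ds4 code: `build_timings` and its parameters are hypothetical, and I am assuming the server can already count tokens and measure elapsed milliseconds for the prompt and generation phases. The field names mirror the llama.cpp `timings` block shown earlier.

```python
def build_timings(cache_n, prompt_n, prompt_ms, predicted_n, predicted_ms):
    """Build a llama.cpp-style timings dict from raw counters (hypothetical helper)."""

    def per_token_ms(n, ms):
        # Average milliseconds per token; 0.0 if no tokens were processed.
        return ms / n if n > 0 else 0.0

    def per_second(n, ms):
        # Tokens per second; 0.0 if no time was recorded.
        return n * 1000.0 / ms if ms > 0 else 0.0

    return {
        "cache_n": cache_n,
        "prompt_n": prompt_n,
        "prompt_ms": prompt_ms,
        "prompt_per_token_ms": per_token_ms(prompt_n, prompt_ms),
        "prompt_per_second": per_second(prompt_n, prompt_ms),
        "predicted_n": predicted_n,
        "predicted_ms": predicted_ms,
        "predicted_per_token_ms": per_token_ms(predicted_n, predicted_ms),
        "predicted_per_second": per_second(predicted_n, predicted_ms),
    }


# Example using the counters from the llama.cpp response above.
example = build_timings(1, 17, 234.305, 67, 2435.025)
print(example["predicted_per_second"])
```

Whether this belongs inside `usage` or in a separate top-level block is your call; I only used the llama.cpp layout because I had it to compare against.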