Troubleshooting

Symptom-first. Find the heading that matches your problem, then act.

Discovery

No servers found

  No AI servers found on the network.

In order of likelihood:

Wrong subnet. The scan probes your /24 only. If your model host is on 10.0.2.x and you're on 10.0.1.x, it's invisible. Check ip addr to confirm.
Firewall. Try curl http://<model-host>:11434/api/tags from your machine. If that fails, fix the firewall before debugging the scan.
Model server not listening on 0.0.0.0. Ollama binds to 127.0.0.1 by default. Set OLLAMA_HOST=0.0.0.0:11434 on the host.
Custom port. Add it to COMMON_PORTS in scanner.py.
Routing. Some VLANs / WireGuard configs let you ping but not connect. Test with nc -zv host port.
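
The subnet and connectivity checks above can be scripted. A minimal sketch, assuming IPv4 and the /24 scan range described above; same_slash24 and port_open are hypothetical helper names, not functions from scanner.py.

```python
import ipaddress
import socket


def same_slash24(my_ip: str, host_ip: str) -> bool:
    """True if both IPv4 addresses fall in the same /24 network."""
    my_net = ipaddress.ip_network(f"{my_ip}/24", strict=False)
    return ipaddress.ip_address(host_ip) in my_net


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection succeeds (similar in spirit to nc -zv)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If same_slash24 is False, the scan cannot see the host no matter what the firewall does. If it is True but port_open is False, look at the firewall, bind address, or routing next.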

Scan finds the host but no models

You'll see the host listed but the model count is 0. Causes:

Ollama with no models pulled — ollama list should show entries.
LM Studio with no model loaded — load one first, then re-scan.
vLLM started but model failed to load — check vllm logs.

Scan is slow (>30s)

A firewall that drops rather than rejects forces every probe to wait out its full timeout. Switch to RST replies, or accept the slowness.
Lots of live hosts on your /24 (~200+): the Semaphore(100) cap means probes beyond the first 100 wait in a queue. This is usually fine and costs about 10s.
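
To see why dropped packets hurt so much, here is a rough upper bound on scan time when every probe waits out its timeout. The port count and timeout values are assumptions for illustration; only the concurrency cap of 100 comes from the text above.

```python
import math


def worst_case_scan_seconds(hosts: int, ports_per_host: int,
                            concurrency: int = 100,
                            timeout_s: float = 1.0) -> float:
    """Rough upper bound: every probe times out, `concurrency` run at once."""
    probes = hosts * ports_per_host
    # With identical timeouts, probes effectively complete in batches.
    waves = math.ceil(probes / concurrency)
    return waves * timeout_s
```

With 254 hosts and an assumed 4 ports each, a 1s timeout gives 11 waves, about 11s. The same scan with a 3s timeout takes about 33s, which is where the >30s symptom comes from.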

Cache thinks a server is alive but it's not

rm ~/.model_chat_cache.json

Or wait — quick_validate_cache() will drop it on the next launch.

Chat

Streaming is "jerky" — text arrives in chunks

Server is buffering output. vLLM with certain chunked-prefill configs (--enable-chunked-prefill) does this. Try --enforce-eager.
Reverse proxy (Cloudflare, nginx with default buffering) holds output. Either disable buffering or hit the server directly.

TPS shows `0` or `--`

Server didn't include a usage block in the final SSE chunk and the response was empty.
Check whether the request landed at all by looking at the server logs. If it did, your server's OpenAI shim probably never emits usage; the UI then falls back to estimation, which displays as ~.
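
The fallback described above can be pictured like this. It is a sketch, not the client's actual code: completion_tokens is a hypothetical name, and the 4-characters-per-token ratio is an assumed heuristic.

```python
from typing import Optional, Tuple


def completion_tokens(usage: Optional[dict], text: str) -> Tuple[int, bool]:
    """Return (token_count, is_estimate) for one response."""
    if usage and usage.get("completion_tokens"):
        return int(usage["completion_tokens"]), False
    # No usage block: fall back to an assumed ~4 characters per token.
    return len(text) // 4, True
```

An empty response with no usage block yields (0, True), which is the case where there is nothing to compute a rate from.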

TPS looks ridiculously high

Server returned a cached response (prefix-cache hit) — no real generation happened. Decode duration is near-zero, so tok / 0.0001s explodes.
Workaround: change one token in your prompt to defeat the cache.
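
If you post-process results yourself, a guard on the decode duration keeps cache hits from skewing averages. A minimal sketch; tokens_per_second and the 0.05s threshold are assumptions, not something the client is known to do.

```python
from typing import Optional


def tokens_per_second(tokens: int, decode_seconds: float,
                      min_seconds: float = 0.05) -> Optional[float]:
    """Return tok/s, or None when the duration is too short to trust."""
    if decode_seconds < min_seconds:
        # Likely a prefix-cache hit: 200 tok / 0.0001s would read as 2,000,000 tok/s.
        return None
    return tokens / decode_seconds
```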

`/think` shows nothing

Model doesn't emit <think>…</think> tags, and the server doesn't send reasoning_content. Either:

Model isn't a reasoning model — use a Qwen 3 / 3.5 thinking variant.
Server strips the tags. Some Ollama Modelfiles do this with a stop-token rule.
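
To check which of the two channels your server actually uses, inspect a raw response message. A sketch assuming an OpenAI-style message dict; extract_reasoning is a hypothetical helper, not the client's parser.

```python
import re
from typing import Optional

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_reasoning(message: dict) -> Optional[str]:
    """Return reasoning text from reasoning_content or <think> tags, else None."""
    reasoning = message.get("reasoning_content")
    if reasoning:
        return reasoning.strip()
    match = THINK_RE.search(message.get("content") or "")
    return match.group(1).strip() if match else None
```

If this returns None for a raw server response, /think has nothing to show, and the fix is on the model or server side.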

First message hangs for 30+ seconds

Cold weight load. Subsequent requests are fast. Enable model preloading on the server, or just send a one-token warmup before measuring.

Response truncates mid-sentence

Server hit max_tokens or num_predict. Defaults vary:

Ollama: num_predict defaulted to 128 tokens in older releases. Set PARAMETER num_predict 4096 in the Modelfile, or pass num_predict in the request options.
vLLM default: 16 tokens(!) for completions. Pass max_tokens explicitly.

The client doesn't override defaults today — change at the server level.
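
To confirm that a token limit caused the cut-off, rather than a dropped connection, check finish_reason on the raw response. A sketch for OpenAI-compatible endpoints, where "length" means the limit was hit; was_truncated is a hypothetical name.

```python
def was_truncated(response: dict) -> bool:
    """True if the first choice stopped because it hit the token limit."""
    choices = response.get("choices") or []
    return bool(choices) and choices[0].get("finish_reason") == "length"
```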

Thinking text never closes

Server emitted <think> but no </think> because the response truncated mid-thought.
The client emits a synthetic </think> on stream-end as a safeguard, but the buffer may already have rendered weirdly.
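
The safeguard amounts to balancing the tags when the stream ends. An illustrative sketch, not the client's implementation; close_unterminated_think is a hypothetical name.

```python
def close_unterminated_think(text: str) -> str:
    """Append </think> if the text opens more think blocks than it closes."""
    if text.count("<think>") > text.count("</think>"):
        return text + "</think>"
    return text
```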

Stress Testing

Test exits immediately with 0 results

Selected model name has changed on the server (Ollama re-tag, vLLM redeploy).
/switch back to discovery, re-pick the model.

All requests fail with timeout

60s is the timeout on the httpx.AsyncClient in chat-stream. Slow first-token latency on a cold model plus high concurrency burns through it.
Reduce concurrency, or warm up first.

Sustained-load mode reports memory leak

"Drift" panel shows TTFT trending up over time. Could be real (server is leaking) or could be:
- Context length growing if you accidentally enabled multi-turn (Sustained should be single-turn — confirm in the mode header).
- KV cache eviction thrashing if you exceed the server's batched-token limit (--max-num-batched-tokens on vLLM).
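
To judge whether the upward trend is real, fit a line through the per-request TTFT values from the log. A sketch using plain least squares; how the Drift panel computes its trend is not documented here, and ttft_slope is a hypothetical helper.

```python
from typing import Sequence


def ttft_slope(ttft_seconds: Sequence[float]) -> float:
    """Least-squares slope of TTFT against request index (seconds per request)."""
    n = len(ttft_seconds)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ttft_seconds) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ttft_seconds))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

A slope that stays positive across separate single-turn runs points at the server. A slope that disappears once multi-turn is off points at context growth.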

Throughput maxes out at low concurrency

Hitting the server's --max-num-seqs cap, not your client's limit.
If you have --max-num-seqs 4, anything beyond 4 will queue. Check server config.
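
You can locate the cap from your own results by finding where throughput stops scaling. A sketch under the assumption that you have throughput per concurrency level; plateau_concurrency and the 10% gain threshold are illustrative choices.

```python
from typing import Dict


def plateau_concurrency(throughput: Dict[int, float], min_gain: float = 0.10) -> int:
    """Return the last concurrency level that still improved throughput by min_gain."""
    if not throughput:
        raise ValueError("need at least one measurement")
    levels = sorted(throughput)
    best = levels[0]
    for prev, cur in zip(levels, levels[1:]):
        if throughput[cur] >= throughput[prev] * (1 + min_gain):
            best = cur
        else:
            break  # scaling stopped; higher levels are just queueing
    return best
```

If the returned level matches the server's --max-num-seqs value, the plateau is server-side queueing, not a client limit.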

Tool Bench

All tasks fail with "malformed args"

Server's tool-call grammar is broken, not the model.
Confirm with a known-good tool client (curl with the exact payload your tool message would send).
For vLLM: try --enable-auto-tool-choice --tool-call-parser hermes (or mistral, etc.).
For Ollama: only certain models support tool calling; check ollama show <model> --modelfile.
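
When inspecting the raw reply from that known-good request, the first thing to check is whether the arguments field parses at all. A sketch assuming the OpenAI-style shape, where function.arguments is a JSON-encoded string; arguments_parse is a hypothetical name, and the harness's own definition of "malformed" may be stricter.

```python
import json


def arguments_parse(tool_call: dict) -> bool:
    """True if function.arguments is a JSON string that decodes to an object."""
    raw = tool_call.get("function", {}).get("arguments")
    if not isinstance(raw, str):
        return False
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False
```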

Model loops calling `eval_math` instead of `calculator`

That's a distractor. The point of the test is that capable models recover by switching to calculator after the error.
Score should still register the distractor calls in the diagnostics panel.

"Unknown tool name" stats are high

Model invented a tool that doesn't exist. After stripping functions. / tools. prefixes, the harness still couldn't match.
Common cause: model trained on a different tool catalog and is hallucinating names. Not fixable client-side.
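
The matching step described above looks roughly like this. A sketch only: match_tool is a hypothetical name, and the harness may normalize names differently.

```python
from typing import Iterable, Optional

PREFIXES = ("functions.", "tools.")


def match_tool(name: str, known: Iterable[str]) -> Optional[str]:
    """Strip a known prefix, then return the tool name if it exists, else None."""
    stripped = name.strip()
    for prefix in PREFIXES:
        if stripped.startswith(prefix):
            stripped = stripped[len(prefix):]
            break
    return stripped if stripped in set(known) else None
```

A name like functions.calculator resolves; an invented name like compute_sum does not, and counts toward the unknown-tool stat.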

Tasks pass on one run, fail on the next

Tasks are deterministic on the harness side. Sampling variance on the model side accounts for ~5–10% flap on borderline tasks.
The harness requests temperature=0; if you need fully repeatable results, confirm that your server honors it.

Logging

Where are my logs

./logs/stress_test_YYYYMMDD_HHMMSS.log in the project directory. One file per run.

Console is quiet but I want detail

Console handler is INFO+. The file handler has everything at DEBUG. tail -f logs/stress_test_*.log while a run is in progress.
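
The split between console and file works through per-handler levels in Python's logging module. A sketch of that arrangement, not the project's actual setup; build_logger is a hypothetical name and the file naming follows the pattern above.

```python
import logging
from datetime import datetime
from pathlib import Path


def build_logger(log_dir: str = "logs") -> logging.Logger:
    """Console at INFO, per-run file at DEBUG."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    logger = logging.getLogger(f"stress_test_{stamp}")
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    file_handler = logging.FileHandler(Path(log_dir) / f"stress_test_{stamp}.log")
    file_handler.setLevel(logging.DEBUG)

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```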

Logs are growing huge

They're written per-run, not rotated. Delete old ones:

find logs/ -name "stress_test_*.log" -mtime +30 -delete
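
On systems without find, a rough Python equivalent is below. It is a sketch: delete_old_logs is a hypothetical helper, it does not recurse into subdirectories, and its age cutoff is exact rather than rounded to whole days as find does.

```python
import time
from pathlib import Path
from typing import List


def delete_old_logs(log_dir: str = "logs", max_age_days: float = 30) -> List[Path]:
    """Delete stress_test_*.log files last modified more than max_age_days ago."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in Path(log_dir).glob("stress_test_*.log"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```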

Catch-all

"It worked yesterday, doesn't work today"

Server-side config changed (model unloaded, port changed, restart wiped Modelfile).
Cache stale: rm ~/.model_chat_cache.json and rescan.
Python env: pip install -r requirements.txt --upgrade.

Crash with no useful output

Run with python -X dev main.py to surface warnings that are normally swallowed.
Check logs/stress_test_*.log for the most recent run; the exception traceback is usually there even if it didn't print to console.
File an issue: https://github.com/mtecnic/model-chat-cli/issues

Model Chat CLI · MIT · repo · issues · No telemetry · No cloud calls · No surprises
