-
Notifications
You must be signed in to change notification settings - Fork 0
Troubleshooting
Symptom-first. Find the row that looks like your problem, then act.
No AI servers found on the network.
In order of likelihood:
-
Wrong subnet. The scan probes your
/24only. If your model host is on10.0.2.xand you're on10.0.1.x, it's invisible. Checkip addrto confirm. -
Firewall. Try
curl http://<model-host>:11434/api/tagsfrom your machine. If that fails, fix the firewall before debugging the scan. -
Model server not listening on
0.0.0.0. Ollama binds to127.0.0.1by default. SetOLLAMA_HOST=0.0.0.0:11434on the host. -
Custom port. Add it to
COMMON_PORTSinscanner.py. -
Routing. Some VLANs / WireGuard configs let you ping but not connect. Test with
nc -zv host port.
You'll see the host listed but the model count is 0. Causes:
- Ollama with no models pulled —
ollama listshould show entries. - LM Studio with no model loaded — load one first, then re-scan.
- vLLM started but model failed to load — check
vllmlogs.
- A firewall that drops rather than rejects forces every probe to wait for timeout. Switch to RST replies, or accept the slowness.
- Lots of live hosts on your
/24(~200+) — theSemaphore(100)cap means each is queued. This is usually fine; ~10s.
rm ~/.model_chat_cache.jsonOr wait — quick_validate_cache() will drop it on the next launch.
- Server is buffering output. vLLM with certain
--chunked-prefillconfigs does this. Try--enforce-eager. - Reverse proxy (Cloudflare, nginx with default buffering) holds output. Either disable buffering or hit the server directly.
- Server didn't include a
usageblock in the final SSE chunk and the response was empty. - Check the request landed at all: look at the server logs. If yes, your server's OpenAI shim probably never emits usage — the UI falls back to estimation, which displays as
~.
- Server returned a cached response (prefix-cache hit) — no real generation happened. Decode duration is near-zero, so
tok / 0.0001sexplodes. - Workaround: change one token in your prompt to defeat the cache.
Model doesn't emit <think>…</think> tags, and the server doesn't send reasoning_content. Either:
- Model isn't a reasoning model — use a Qwen 3 / 3.5 thinking variant.
- Server strips the tags. Some Ollama Modelfiles do this with a stop-token rule.
Cold weight load. Subsequent requests are fast. Enable model preloading on the server, or just send a one-token warmup before measuring.
Server hit max_tokens or num_predict. Defaults vary:
- Ollama default: 128 tokens. Set
OLLAMA_NUM_PREDICT=4096or pass in your request. - vLLM default: 16 tokens(!) for completion. Pass
max_tokensexplicitly.
The client doesn't override defaults today — change at the server level.
- Server emitted
<think>but no</think>because the response truncated mid-thought. - The client emits a synthetic
</think>on stream-end as a safeguard, but the buffer may already have rendered weirdly.
- Selected model name has changed on the server (Ollama re-tag, vLLM redeploy).
-
/switchback to discovery, re-pick the model.
- 60s timeout is the floor for
httpx.AsyncClientin chat-stream. Slow first-token on a cold model + high concurrency burns through it. - Reduce concurrency, or warm up first.
- "Drift" panel shows TTFT trending up over time. Could be real (server is leaking) or could be:
- Context length growing if you accidentally enabled multi-turn (Sustained should be single-turn — confirm in the mode header).
- KV cache eviction thrashing if you exceed the server's max-batched-tokens limit.
- Hitting the server's
--max-num-seqscap, not your client's limit. - If you have
--max-num-seqs 4, anything beyond 4 will queue. Check server config.
- Server's tool-call grammar is broken, not the model.
- Confirm with a known-good tool client (
curlwith the exact payload your tool message would send). - For vLLM: try
--enable-auto-tool-choice --tool-call-parser hermes(ormistral, etc.). - For Ollama: only certain models support tool calling; check
ollama show <model> --modelfile.
- That's a distractor. The point of the test is that capable models recover by switching to
calculatorafter the error. - Score should still register the distractor calls in the diagnostics panel.
- Model invented a tool that doesn't exist. After stripping
functions./tools.prefixes, the harness still couldn't match. - Common cause: model trained on a different tool catalog and is hallucinating names. Not fixable client-side.
- Tasks are deterministic on the harness side. Sampling variance on the model side accounts for ~5–10% flap on borderline tasks.
- Run with
temperature=0(the harness asks for it) if you need fully repeatable results.
./logs/stress_test_YYYYMMDD_HHMMSS.log in the project directory. One file per run.
Console handler is INFO+. The file handler has everything at DEBUG. tail -f logs/stress_test_*.log while a run is in progress.
They're written per-run, not rotated. Delete old ones:
find logs/ -name "stress_test_*.log" -mtime +30 -delete- Server-side config changed (model unloaded, port changed, restart wiped Modelfile).
- Cache stale:
rm ~/.model_chat_cache.jsonand rescan. - Python env:
pip install -r requirements.txt --upgrade.
- Run with
python -X dev main.pyto surface warnings that are normally swallowed. - Check
logs/stress_test_*.logfor the most recent run; the exception traceback is usually there even if it didn't print to console. - File an issue: https://github.com/mtecnic/model-chat-cli/issues
Getting started
Features
Internals
Operating