Skip to content

Troubleshooting

mtecnic edited this page May 28, 2026 · 1 revision

Troubleshooting

Symptom-first. Find the row that looks like your problem, then act.


Discovery

No servers found

  No AI servers found on the network.

In order of likelihood:

  1. Wrong subnet. The scan probes your /24 only. If your model host is on 10.0.2.x and you're on 10.0.1.x, it's invisible. Check ip addr to confirm.
  2. Firewall. Try curl http://<model-host>:11434/api/tags from your machine. If that fails, fix the firewall before debugging the scan.
  3. Model server not listening on 0.0.0.0. Ollama binds to 127.0.0.1 by default. Set OLLAMA_HOST=0.0.0.0:11434 on the host.
  4. Custom port. Add it to COMMON_PORTS in scanner.py.
  5. Routing. Some VLANs / WireGuard configs let you ping but not connect. Test with nc -zv host port.

Scan finds the host but no models

You'll see the host listed but the model count is 0. Causes:

  • Ollama with no models pulled — ollama list should show entries.
  • LM Studio with no model loaded — load one first, then re-scan.
  • vLLM started but model failed to load — check vllm logs.

Scan is slow (>30s)

  • A firewall that drops rather than rejects forces every probe to wait for timeout. Switch to RST replies, or accept the slowness.
  • Lots of live hosts on your /24 (~200+) — the Semaphore(100) cap means each is queued. This is usually fine; ~10s.

Cache thinks a server is alive but it's not

rm ~/.model_chat_cache.json

Or wait — quick_validate_cache() will drop it on the next launch.


Chat

Streaming is "jerky" — text arrives in chunks

  • Server is buffering output. vLLM with certain --chunked-prefill configs does this. Try --enforce-eager.
  • Reverse proxy (Cloudflare, nginx with default buffering) holds output. Either disable buffering or hit the server directly.

TPS shows 0 or --

  • Server didn't include a usage block in the final SSE chunk and the response was empty.
  • Check the request landed at all: look at the server logs. If yes, your server's OpenAI shim probably never emits usage — the UI falls back to estimation, which displays as ~.

TPS looks ridiculously high

  • Server returned a cached response (prefix-cache hit) — no real generation happened. Decode duration is near-zero, so tok / 0.0001s explodes.
  • Workaround: change one token in your prompt to defeat the cache.

/think shows nothing

Model doesn't emit <think>…</think> tags, and the server doesn't send reasoning_content. Either:

  • Model isn't a reasoning model — use a Qwen 3 / 3.5 thinking variant.
  • Server strips the tags. Some Ollama Modelfiles do this with a stop-token rule.

First message hangs for 30+ seconds

Cold weight load. Subsequent requests are fast. Enable model preloading on the server, or just send a one-token warmup before measuring.

Response truncates mid-sentence

Server hit max_tokens or num_predict. Defaults vary:

  • Ollama default: 128 tokens. Set OLLAMA_NUM_PREDICT=4096 or pass in your request.
  • vLLM default: 16 tokens(!) for completion. Pass max_tokens explicitly.

The client doesn't override defaults today — change at the server level.

Thinking text never closes

  • Server emitted <think> but no </think> because the response truncated mid-thought.
  • The client emits a synthetic </think> on stream-end as a safeguard, but the buffer may already have rendered weirdly.

Stress Testing

Test exits immediately with 0 results

  • Selected model name has changed on the server (Ollama re-tag, vLLM redeploy).
  • /switch back to discovery, re-pick the model.

All requests fail with timeout

  • 60s timeout is the floor for httpx.AsyncClient in chat-stream. Slow first-token on a cold model + high concurrency burns through it.
  • Reduce concurrency, or warm up first.

Sustained-load mode reports memory leak

  • "Drift" panel shows TTFT trending up over time. Could be real (server is leaking) or could be:
    • Context length growing if you accidentally enabled multi-turn (Sustained should be single-turn — confirm in the mode header).
    • KV cache eviction thrashing if you exceed the server's max-batched-tokens limit.

Throughput maxes out at low concurrency

  • Hitting the server's --max-num-seqs cap, not your client's limit.
  • If you have --max-num-seqs 4, anything beyond 4 will queue. Check server config.

Tool Bench

All tasks fail with "malformed args"

  • Server's tool-call grammar is broken, not the model.
  • Confirm with a known-good tool client (curl with the exact payload your tool message would send).
  • For vLLM: try --enable-auto-tool-choice --tool-call-parser hermes (or mistral, etc.).
  • For Ollama: only certain models support tool calling; check ollama show <model> --modelfile.

Model loops calling eval_math instead of calculator

  • That's a distractor. The point of the test is that capable models recover by switching to calculator after the error.
  • Score should still register the distractor calls in the diagnostics panel.

"Unknown tool name" stats are high

  • Model invented a tool that doesn't exist. After stripping functions. / tools. prefixes, the harness still couldn't match.
  • Common cause: model trained on a different tool catalog and is hallucinating names. Not fixable client-side.

Tasks pass on one run, fail on the next

  • Tasks are deterministic on the harness side. Sampling variance on the model side accounts for ~5–10% flap on borderline tasks.
  • Run with temperature=0 (the harness asks for it) if you need fully repeatable results.

Logging

Where are my logs

./logs/stress_test_YYYYMMDD_HHMMSS.log in the project directory. One file per run.

Console is quiet but I want detail

Console handler is INFO+. The file handler has everything at DEBUG. tail -f logs/stress_test_*.log while a run is in progress.

Logs are growing huge

They're written per-run, not rotated. Delete old ones:

find logs/ -name "stress_test_*.log" -mtime +30 -delete

Catch-all

"It worked yesterday, doesn't work today"

  • Server-side config changed (model unloaded, port changed, restart wiped Modelfile).
  • Cache stale: rm ~/.model_chat_cache.json and rescan.
  • Python env: pip install -r requirements.txt --upgrade.

Crash with no useful output

  • Run with python -X dev main.py to surface warnings that are normally swallowed.
  • Check logs/stress_test_*.log for the most recent run; the exception traceback is usually there even if it didn't print to console.
  • File an issue: https://github.com/mtecnic/model-chat-cli/issues

Clone this wiki locally