Discovery
The discovery layer is what turns "I have some boxes running models" into a working menu in under five seconds (after first run). It lives in scanner.py and ui/discovery.py.
flowchart TB
A[App start] --> B{Cache exists?}
B -- yes --> C[quick_validate_cache]
C --> D{Any healthy?}
D -- yes --> Z[Model picker]
D -- no --> E[Full subnet scan]
B -- no --> E
E --> F[probe_server × 255 × 5 ports]
F --> G[OpenAI /v1/models AND Ollama /api/tags in parallel]
G --> H[Fallback: /api/version for empty Ollama]
H --> I[save_cache]
I --> Z
style C fill:#0f3460,color:#eee
style E fill:#e94560,color:#fff
style G fill:#16213e,color:#eee
scanner.scan_network() figures out your local subnet by opening a UDP socket to 8.8.8.8:80, reading the OS-assigned local IP, and dropping the last octet:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("8.8.8.8", 80))
    local_ip = s.getsockname()[0]
    subnet = ".".join(local_ip.split(".")[:3])

This works without sending any actual UDP packet: connect() on a datagram socket just records the kernel's chosen source address. If that fails (no default route), it falls back to 192.168.1.x.
If you're on a /16 or /22 network, the scan still only sweeps your /24. To scan a different range, override subnet in scan_network() or open a PR; see Contributing.
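The detection and fallback described above can be sketched as one small function. This is an illustration, not the code in scanner.py: the names detect_subnet and hosts_in are hypothetical, and how scan_network() actually accepts an override is not shown on this page. The second helper shows one way to enumerate a wider range with the standard ipaddress module.

```python
import ipaddress
import socket


def detect_subnet(fallback: str = "192.168.1") -> str:
    """Return the first three octets of the local IPv4 address.

    Hypothetical helper: mirrors the excerpt above and adds the fallback.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            # No packet is sent; the kernel only selects a source address.
            s.connect(("8.8.8.8", 80))
            local_ip = s.getsockname()[0]
    except OSError:
        # Typically raised when there is no route to the target.
        return fallback
    return ".".join(local_ip.split(".")[:3])


def hosts_in(cidr: str) -> list[str]:
    """List the usable host addresses of a network such as '10.0.0.0/22'."""
    # strict=False accepts a host address with a prefix, e.g. '10.0.1.42/22'.
    network = ipaddress.ip_network(cidr, strict=False)
    return [str(host) for host in network.hosts()]
```

A /22 yields 1,022 usable hosts, so sweeping one with the same five ports means roughly four times as many probes as the default /24 sweep.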
    COMMON_PORTS = [11434, 1234, 5000, 8000, 8080]

| Port | Default for |
|---|---|
| 11434 | Ollama |
| 1234 | LM Studio |
| 5000 | Flask, custom OpenAI-compatible servers |
| 8000 | FastAPI, vLLM |
| 8080 | Alternative HTTP / proxies |
255 IPs × 5 ports = 1,275 probes per scan, throttled by an asyncio.Semaphore(100).
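A minimal sketch of how a semaphore can throttle that many probes. It assumes the 255 addresses are host octets 1 through 255 and takes the probe as a parameter so it runs without a network; scan_subnet and its signature are hypothetical, not the project's API.

```python
import asyncio

COMMON_PORTS = [11434, 1234, 5000, 8000, 8080]


async def scan_subnet(subnet, probe, max_concurrency=100):
    """Probe every host:port pair in a /24, at most max_concurrency at a time.

    `probe` is any coroutine function taking (ip, port) and returning a
    result, or None when nothing answered.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(ip, port):
        # Each probe waits here until one of the slots is free.
        async with semaphore:
            return await probe(ip, port)

    tasks = [
        guarded(f"{subnet}.{host}", port)
        for host in range(1, 256)   # assumption: octets 1..255
        for port in COMMON_PORTS
    ]
    results = await asyncio.gather(*tasks)
    return [r for r in results if r is not None]
```

All 1,275 coroutines are scheduled up front, but only 100 are ever past the semaphore, which is what keeps the sweep from opening a burst of sockets at once.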
Adding a port:
- Append it to COMMON_PORTS in scanner.py.
- If it speaks an existing protocol (OpenAI or Ollama), you're done.
- Otherwise extend probe_server() with a new endpoint check; see Contributing#adding-a-new-server-type.
For each ip:port, probe_server() fires two requests in parallel via asyncio.gather:
    openai_data, ollama_data = await asyncio.gather(
        check_endpoint(client, base_url, "/v1/models"),
        check_endpoint(client, base_url, "/api/tags"),
        return_exceptions=True,
    )

Decision logic:
| /v1/models | /api/tags | /api/version | Result |
|---|---|---|---|
| ✓ | — | — | OpenAI-compatible (preferred) |
| — | ✓ | — | Ollama with models |
| — | — | ✓ | Ollama empty (no models pulled) |
| — | — | — | Skipped |
If a server responds to both, OpenAI takes precedence — Ollama's OpenAI-compat shim is usually the better stream parser.
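The decision table can be read as a simple precedence chain. The sketch below is one possible reading, not probe_server() itself: classify_server and the returned labels are hypothetical, and it assumes a ✓ means the endpoint returned a non-empty parsed payload. Because gather is called with return_exceptions=True, a failed request arrives as an exception object, which is truthy, so it has to be filtered out explicitly.

```python
def usable(result):
    """True when a probe result is a real, non-empty payload."""
    # With return_exceptions=True, failures come back as exception instances.
    return not isinstance(result, BaseException) and bool(result)


def classify_server(openai_data, ollama_data, version_data=None):
    """Apply the table's precedence; labels are illustrative only."""
    if usable(openai_data):
        return "openai"        # preferred, even if /api/tags also answered
    if usable(ollama_data):
        return "ollama"        # Ollama with models
    if usable(version_data):
        return "ollama-empty"  # Ollama is up but no models are pulled
    return None                # skipped
```

For example, a host that times out on /v1/models but lists models on /api/tags is classified as Ollama, while a host answering both is treated as OpenAI-compatible.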
    Discovered Models
    ──────────────────────────────────────────────────────────────────────
    #  Model                       Server           Type  Status  Latency
    1  qwen2.5-32b-instruct        10.0.1.42:11434  OLL   ✓       12 ms
    2  qwen3-9b-thinking           10.0.1.42:11434  OLL   ✓       12 ms
    3  llama-3.3-70b-instruct-awq  10.0.1.55:8000   API   ✓       8 ms
    4  qwen3.6-27b                 10.0.1.55:8000   API   ✓       8 ms
    5  Meta-Llama-3.1-8B-Instruct  10.0.1.77:1234   API   ✓       21 ms
    ──────────────────────────────────────────────────────────────────────
Each row is one model, not one server — a single host running five Ollama models becomes five rows. This is intentional: most decisions you make ("which one to chat with", "which to stress test") are per-model.
Latency is the response time of the health-check probe (/v1/models or /api/tags), not first-token latency. For real TTFT see Chat#performance-metrics.
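To make the one-row-per-model idea concrete, here is a small sketch that flattens a server list into picker rows. The dictionary keys (host, port, type, latency_ms, models) are assumptions for illustration; the real structure used by ui/discovery.py is not shown on this page.

```python
def to_rows(servers):
    """Expand each server into one row per model it serves."""
    rows = []
    for server in servers:
        address = f'{server["host"]}:{server["port"]}'
        for model in server["models"]:
            rows.append({
                "model": model,
                "server": address,
                "type": server["type"],
                # Latency is measured once per server, so rows share it.
                "latency_ms": server["latency_ms"],
            })
    return rows
```

This is also why models on the same host show identical latency in the table: the figure comes from the single health-check probe against that server.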
save_cache() writes the entire discovered server list to ~/.model_chat_cache.json after a successful scan. On next launch, load_cache() reads it back and quick_validate_cache() re-validates every entry by calling check_server_health() — typically <100ms per server.
    async def quick_validate_cache(servers, progress_callback=None):
        validated = []
        for server in servers:
            healthy = await check_server_health(server)
            if healthy.get("status") == "healthy":
                validated.append(healthy)
        return validated

Servers that fail validation are dropped from the cache silently. If all cached servers are dead, the app falls through to a full rescan automatically.
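The startup path (cache, validate, fall through to a scan) could look roughly like the sketch below. It assumes the cache file is a JSON list of server dictionaries and takes the validator and scanner as parameters so it runs standalone; read_cache and discover are hypothetical names, not the project's load_cache() or scan_network().

```python
import json
from pathlib import Path


def read_cache(path: Path) -> list:
    """Return the cached server list, or [] if missing or unreadable."""
    try:
        data = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return []
    return data if isinstance(data, list) else []


async def discover(path: Path, validate, full_scan) -> list:
    """Prefer healthy cached servers; otherwise rescan and rewrite the cache."""
    cached = read_cache(path)
    if cached:
        healthy = await validate(cached)
        if healthy:
            return healthy
    # No cache, or every cached server failed validation.
    servers = await full_scan()
    path.write_text(json.dumps(servers, indent=2))
    return servers
```

Treating a corrupt cache the same as a missing one keeps a bad file from blocking startup; the cost is only one extra full scan.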
To force a rescan without deleting the cache, move it aside first:

    mv ~/.model_chat_cache.json ~/.model_chat_cache.json.bak
    python main.py

To wipe and start fresh:
    rm ~/.model_chat_cache.json

| Network | Cold scan time |
|---|---|
| Quiet /24 (few hosts) | ~3 – 5 s |
| Busy /24 (50+ hosts) | ~8 – 12 s |
| Hostile firewall (drops vs RST) | ~15 – 20 s |
The Semaphore(100) cap keeps things polite. Raising it buys little, because the httpx connection pool, configured with httpx.Limits(max_connections=200, max_keepalive_connections=50), becomes the real ceiling.
Validating a 10-server cache: typically <1 second.