Discovery

The discovery layer is what turns "I have some boxes running models" into a working menu in under five seconds (after first run). It lives in scanner.py and ui/discovery.py.

The scan, end to end

flowchart TB
    A[App start] --> B{Cache exists?}
    B -- yes --> C[quick_validate_cache]
    C --> D{Any healthy?}
    D -- yes --> Z[Model picker]
    D -- no --> E[Full subnet scan]
    B -- no --> E
    E --> F[probe_server × 255 × 5 ports]
    F --> G[OpenAI /v1/models AND Ollama /api/tags in parallel]
    G --> H[Fallback: /api/version for empty Ollama]
    H --> I[save_cache]
    I --> Z

    style C fill:#0f3460,color:#eee
    style E fill:#e94560,color:#fff
    style G fill:#16213e,color:#eee

Subnet detection

scanner.scan_network() figures out your local subnet by opening a UDP socket to 8.8.8.8:80, reading the OS-assigned local IP, and dropping the last octet:

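import socket

# connect() on a datagram socket sends nothing; it only selects a source address.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.connect(("8.8.8.8", 80))
    local_ip = s.getsockname()[0]
subnet = ".".join(local_ip.split(".")[:3])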

This works without sending any actual UDP packet — connect() on a datagram socket just records the kernel's chosen source address. If that fails (no default route), it falls back to 192.168.1.x.
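A minimal sketch of the detection together with its fallback, wrapped in one function. The name detect_subnet and the fallback parameter are illustrative assumptions and may not match how scanner.py is actually organized.

```python
import socket


def detect_subnet(fallback: str = "192.168.1") -> str:
    """Return the first three octets of the local IPv4 address."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            # No packet is sent: connect() only makes the kernel pick a route,
            # and therefore a source address, for this socket.
            s.connect(("8.8.8.8", 80))
            local_ip = s.getsockname()[0]
    except OSError:
        # No default route (offline or isolated network): use the fallback.
        return fallback
    return ".".join(local_ip.split(".")[:3])
```

On a machine without a default route the connect() call raises OSError, which is what triggers the 192.168.1 fallback here.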

If you're on a /16 or /22 network, the scan still only sweeps your /24. To scan a different range, override subnet in scan_network() or open a PR — see Contributing.
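If you do want to sweep a wider range, one possible approach is to enumerate hosts from CIDR notation with the standard-library ipaddress module. The helper name hosts_in is hypothetical, and wiring it into scan_network() is left as an assumption about how an override might look.

```python
import ipaddress


def hosts_in(cidr: str) -> list[str]:
    """List every usable host address in a CIDR range."""
    # strict=False accepts a host address such as 10.0.1.42/22 and
    # normalizes it to the enclosing network (10.0.0.0/22).
    network = ipaddress.ip_network(cidr, strict=False)
    # hosts() skips the network and broadcast addresses.
    return [str(ip) for ip in network.hosts()]
```

A /22 yields 1,022 hosts, so with five ports that is 5,110 probes, roughly four times the default sweep.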

Ports probed

COMMON_PORTS = [11434, 1234, 5000, 8000, 8080]

Port	Default for
11434	Ollama
1234	LM Studio
5000	Flask, custom OpenAI-compatible servers
8000	FastAPI, vLLM
8080	Alternative HTTP / proxies

255 IPs × 5 ports = 1,275 probes per scan, throttled by an asyncio.Semaphore(100).
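The throttling pattern can be sketched as follows. This is not the scanner's actual code: bounded_sweep, the probe callback, and the host range 1 to 255 are assumptions chosen to match the numbers above.

```python
import asyncio

COMMON_PORTS = [11434, 1234, 5000, 8000, 8080]


async def bounded_sweep(subnet, probe, limit=100):
    """Run probe(ip, port) for every target, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def guarded(ip, port):
        # Each task waits here until one of the `limit` slots is free.
        async with sem:
            return await probe(ip, port)

    tasks = [
        guarded(f"{subnet}.{host}", port)
        for host in range(1, 256)  # assumed host range: 255 addresses
        for port in COMMON_PORTS
    ]
    # gather() schedules all 1,275 tasks; the semaphore limits how many
    # are inside probe() at any moment.
    return await asyncio.gather(*tasks)
```

All tasks are created up front, which is cheap; only the network work inside probe() is rate-limited.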

Adding a port:

1. Append it to COMMON_PORTS in scanner.py.
2. If it speaks an existing protocol (OpenAI or Ollama), you're done.
3. Otherwise extend probe_server() with a new endpoint check; see Contributing#adding-a-new-server-type.

Protocol detection

For each ip:port, probe_server() fires two requests in parallel via asyncio.gather:

openai_data, ollama_data = await asyncio.gather(
    check_endpoint(client, base_url, "/v1/models"),
    check_endpoint(client, base_url, "/api/tags"),
    return_exceptions=True
)

Decision logic:

`/v1/models`	`/api/tags`	`/api/version`	Result
✓	—	—	OpenAI-compatible (preferred)
—	✓	—	Ollama with models
—	—	✓	Ollama empty (no models pulled)
—	—	—	Skipped

If a server responds to both, OpenAI takes precedence: the stream from Ollama's OpenAI-compatibility shim is usually the easier one to parse.
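The decision table can be expressed as a small pure function. This is a sketch only: the function name classify, the returned labels, and the assumption that a failed check yields None or an exception object are all illustrative and may differ from probe_server().

```python
def classify(openai_data, ollama_data, version_data=None):
    """Map probe results to a server type, or None when nothing answered.

    Each argument is a parsed response, None, or an Exception instance
    (asyncio.gather(..., return_exceptions=True) returns failures in place).
    """
    def ok(result):
        return result is not None and not isinstance(result, Exception)

    if ok(openai_data):
        return "openai"        # preferred, even if /api/tags also answered
    if ok(ollama_data):
        return "ollama"
    if ok(version_data):
        return "ollama-empty"  # server is up but has no models pulled
    return None                # nothing recognizable: skip this ip:port
```

Checking for Exception instances matters because return_exceptions=True turns a timeout or connection error into a return value rather than a raised error.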

The discovered-models table

  Discovered Models
  ──────────────────────────────────────────────────────────────────────
   #  Model                                Server          Type   Status   Latency
   1  qwen2.5-32b-instruct                 10.0.1.42:11434  OLL    ✓        12 ms
   2  qwen3-9b-thinking                    10.0.1.42:11434  OLL    ✓        12 ms
   3  llama-3.3-70b-instruct-awq           10.0.1.55:8000   API    ✓         8 ms
   4  qwen3.6-27b                          10.0.1.55:8000   API    ✓         8 ms
   5  Meta-Llama-3.1-8B-Instruct           10.0.1.77:1234   API    ✓        21 ms
  ──────────────────────────────────────────────────────────────────────

Each row is one model, not one server — a single host running five Ollama models becomes five rows. This is intentional: most decisions you make ("which one to chat with", "which to stress test") are per-model.
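Flattening servers into per-model rows might look like the sketch below. The dictionary keys (ip, port, type, latency_ms, models) are hypothetical; the real server records in scanner.py may be shaped differently.

```python
def to_rows(servers):
    """Expand a list of server records into one row per model."""
    rows = []
    for server in servers:
        for model in server["models"]:
            rows.append({
                "model": model,
                "server": f'{server["ip"]}:{server["port"]}',
                "type": server["type"],
                # Latency belongs to the server probe, so every model
                # on the same host shows the same value.
                "latency_ms": server["latency_ms"],
            })
    return rows
```

This also explains why rows from the same host share a latency figure in the table above.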

Latency is the response time of the health-check probe (/v1/models or /api/tags), not first-token latency. For real TTFT see Chat#performance-metrics.
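As an illustration of what that number measures, a probe can be timed with a monotonic clock around the whole request. The wrapper name timed is an assumption, and the scanner may record latency differently.

```python
import asyncio
import time


async def timed(probe):
    """Await probe() and return (result, elapsed milliseconds)."""
    start = time.perf_counter()  # monotonic, unaffected by clock changes
    result = await probe()
    latency_ms = (time.perf_counter() - start) * 1000
    return result, latency_ms
```

Because the timer wraps the full request and response, it covers connection setup and listing the models, but not any token generation.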

Caching

save_cache() writes the entire discovered server list to ~/.model_chat_cache.json after a successful scan. On the next launch, load_cache() reads it back and quick_validate_cache() re-checks every entry by calling check_server_health(), which typically takes under 100 ms per server.

async def quick_validate_cache(servers, progress_callback=None):
    validated = []
    for server in servers:
        healthy = await check_server_health(server)
        if healthy.get("status") == "healthy":
            validated.append(healthy)
    return validated

Servers that fail validation are dropped from the cache silently. If all cached servers are dead, the app falls through to a full rescan automatically.
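The startup path described above (cache, validate, fall through to a full scan) can be sketched like this. The names load_cache_safe and discover, and the injected validate and full_scan callables, are hypothetical stand-ins and not the app's real functions; the sketch also assumes the cache file holds a JSON list.

```python
import json
from pathlib import Path


def load_cache_safe(path):
    """Return the cached server list, or [] if the file is missing or corrupt."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        # OSError: file missing or unreadable.
        # ValueError: invalid JSON or undecodable bytes.
        return []
    return data if isinstance(data, list) else []


async def discover(path, validate, full_scan):
    """Prefer healthy cached servers; otherwise run a full scan."""
    cached = load_cache_safe(path)
    healthy = await validate(cached) if cached else []
    if healthy:
        return healthy
    # Cache absent, empty, or every cached server is dead.
    return await full_scan()
```

Treating a corrupt cache the same as a missing one keeps the first-run and recovery paths identical.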

To force a rescan without deleting the cache, move it aside:

mv ~/.model_chat_cache.json ~/.model_chat_cache.json.bak
python main.py

To wipe and start fresh:

rm ~/.model_chat_cache.json

Performance characteristics

Network	Cold scan time
Quiet `/24` (few hosts)	~3 – 5 s
Busy `/24` (50+ hosts)	~8 – 12 s
Hostile firewall (silently drops packets instead of replying with RST)	~15 – 20 s

The Semaphore(100) cap keeps things polite. Raising it buys little, because the HTTPX connection pool, configured as httpx.Limits(max_connections=200, max_keepalive_connections=50), then becomes the real ceiling.
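A rough back-of-the-envelope model shows why a firewall that drops packets is the slow case: a refused connection fails immediately, while a dropped one holds its Semaphore(100) slot until the timeout expires. The 1.5 s timeout below is a hypothetical value picked for illustration, not a setting taken from scanner.py.

```python
import math

PROBES = 255 * 5     # IPs x ports in one sweep
CONCURRENCY = 100    # the Semaphore cap
TIMEOUT_S = 1.5      # hypothetical per-probe timeout (assumption)

# If every probe times out, slots free up in whole "waves".
waves = math.ceil(PROBES / CONCURRENCY)
worst_case_s = waves * TIMEOUT_S

print(f"{waves} waves, about {worst_case_s:.1f} s worst case")
```

Under that assumed timeout the estimate is 13 waves and about 19.5 s, which is in the same range as the hostile-firewall row of the table.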

Validating a 10-server cache: typically <1 second.

Model Chat CLI · MIT · repo · issues · No telemetry · No cloud calls · No surprises
