
Discovery

mtecnic edited this page May 28, 2026 · 1 revision

The discovery layer is what turns "I have some boxes running models" into a working menu, in under five seconds on every launch after the first. It lives in scanner.py and ui/discovery.py.


The scan, end to end

flowchart TB
    A[App start] --> B{Cache exists?}
    B -- yes --> C[quick_validate_cache]
    C --> D{Any healthy?}
    D -- yes --> Z[Model picker]
    D -- no --> E[Full subnet scan]
    B -- no --> E
    E --> F[probe_server × 255 × 5 ports]
    F --> G[OpenAI /v1/models AND Ollama /api/tags in parallel]
    G --> H[Fallback: /api/version for empty Ollama]
    H --> I[save_cache]
    I --> Z

    style C fill:#0f3460,color:#eee
    style E fill:#e94560,color:#fff
    style G fill:#16213e,color:#eee

Subnet detection

scanner.scan_network() figures out your local subnet by opening a UDP socket to 8.8.8.8:80, reading the OS-assigned local IP, and dropping the last octet:

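import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(("8.8.8.8", 80))  # UDP connect: nothing is sent, the kernel just picks a route
local_ip = s.getsockname()[0]
s.close()
subnet = ".".join(local_ip.split(".")[:3])  # e.g. "10.0.1.42" -> "10.0.1"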

This works without sending any actual UDP packet — connect() on a datagram socket just records the kernel's chosen source address. If that fails (no default route), it falls back to 192.168.1.x.
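
Putting the detection and the fallback together, a minimal sketch might look like the following. The function name `detect_subnet` and its `fallback` parameter are hypothetical and chosen for illustration; scanner.py may structure this differently.

```python
import socket


def detect_subnet(fallback: str = "192.168.1") -> str:
    """Return the first three octets of the local IPv4 address.

    Hypothetical helper, not a function from scanner.py.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            # No packet is sent; the kernel only picks a source address.
            s.connect(("8.8.8.8", 80))
            local_ip = s.getsockname()[0]
    except OSError:
        # No default route (or no network at all): use the documented fallback.
        return fallback
    return ".".join(local_ip.split(".")[:3])


subnet = detect_subnet()
```

Catching `OSError` covers the "network unreachable" case that `connect()` raises when there is no route to the target address.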

If you're on a /16 or /22 network, the scan still only sweeps your /24. To scan a different range, override subnet in scan_network() or open a PR — see Contributing.
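
For a wider range, one way to build the target list is the standard-library `ipaddress` module. This is only a sketch: the `hosts_for` helper is hypothetical, and nothing on this page says scan_network() accepts a CIDR argument today.

```python
import ipaddress


def hosts_for(cidr: str) -> list[str]:
    """Expand a CIDR block into its usable host addresses (hypothetical helper)."""
    network = ipaddress.ip_network(cidr, strict=False)
    # hosts() skips the network and broadcast addresses.
    return [str(ip) for ip in network.hosts()]


targets_24 = hosts_for("10.0.1.0/24")   # 254 hosts
targets_22 = hosts_for("10.0.0.0/22")   # 1022 hosts
```

A /22 expands to 1,022 hosts, roughly four times the probes of a /24 sweep, so a wider scan costs correspondingly more time.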


Ports probed

COMMON_PORTS = [11434, 1234, 5000, 8000, 8080]
| Port  | Default for                             |
|-------|-----------------------------------------|
| 11434 | Ollama                                  |
| 1234  | LM Studio                               |
| 5000  | Flask, custom OpenAI-compatible servers |
| 8000  | FastAPI, vLLM                           |
| 8080  | Alternative HTTP / proxies              |

255 IPs × 5 ports = 1,275 probes per scan, throttled by an asyncio.Semaphore(100).
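
The throttling pattern can be sketched without touching the network. Here `fake_probe` stands in for probe_server(), and the `range(1, 256)` host range is an assumption chosen to match the 255 figure above; the real loop in scanner.py may differ.

```python
import asyncio

COMMON_PORTS = [11434, 1234, 5000, 8000, 8080]
MAX_IN_FLIGHT = 100  # mirrors the Semaphore(100) cap described above


async def fake_probe(ip: str, port: int) -> None:
    """Stand-in for probe_server(); yields to the loop instead of doing I/O."""
    await asyncio.sleep(0)


async def sweep(subnet: str) -> tuple[int, int]:
    """Run every ip:port probe, never more than MAX_IN_FLIGHT at once."""
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    in_flight = 0
    peak = 0

    async def bounded(ip: str, port: int) -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await fake_probe(ip, port)
            finally:
                in_flight -= 1

    targets = [(f"{subnet}.{host}", port)
               for host in range(1, 256)  # assumed host range: 255 addresses
               for port in COMMON_PORTS]
    await asyncio.gather(*(bounded(ip, port) for ip, port in targets))
    return len(targets), peak


total, peak = asyncio.run(sweep("192.168.1"))
```

All 1,275 coroutines are created up front; the semaphore only limits how many are inside the probe at any moment.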

Adding a port:

  1. Append it to COMMON_PORTS in scanner.py.
  2. If it speaks an existing protocol (OpenAI or Ollama), you're done.
  3. Otherwise extend probe_server() with a new endpoint check — see Contributing#adding-a-new-server-type.

Protocol detection

For each ip:port, probe_server() fires two requests in parallel via asyncio.gather:

openai_data, ollama_data = await asyncio.gather(
    check_endpoint(client, base_url, "/v1/models"),
    check_endpoint(client, base_url, "/api/tags"),
    return_exceptions=True
)

Decision logic:

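| /v1/models  | /api/tags    | /api/version | Result                          |
|-------------|--------------|--------------|---------------------------------|
| responds    | any          | not checked  | OpenAI-compatible (preferred)   |
| no response | lists models | not checked  | Ollama with models              |
| no response | no models    | responds     | Ollama empty (no models pulled) |
| no response | no models    | no response  | Skipped                         |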

If a server responds to both, OpenAI takes precedence — Ollama's OpenAI-compat shim is usually the better stream parser.
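
The precedence rule can be written as a small pure function. This is a sketch under assumptions: the name `classify`, its return labels, and the argument shapes (parsed JSON body, `None`, or an exception returned by `asyncio.gather(..., return_exceptions=True)`) are hypothetical rather than taken from probe_server().

```python
from typing import Any, Optional


def classify(openai_data: Any, ollama_data: Any,
             version_data: Any = None) -> Optional[str]:
    """Map probe results to a server type, in the precedence order above."""

    def ok(value: Any) -> bool:
        # gather(return_exceptions=True) hands back exceptions as values.
        return value is not None and not isinstance(value, Exception)

    if ok(openai_data):
        return "openai"        # wins even if /api/tags also answered
    if ok(ollama_data) and ollama_data.get("models"):
        return "ollama"        # Ollama with at least one model
    if ok(version_data):
        return "ollama-empty"  # Ollama is up but nothing is pulled
    return None                # skipped
```

For example, `classify(None, {"models": []}, {"version": "0.5.0"})` falls through the first two checks and is reported as an empty Ollama server.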


The discovered-models table

  Discovered Models
  ──────────────────────────────────────────────────────────────────────
   #  Model                                Server          Type   Status   Latency
   1  qwen2.5-32b-instruct                 10.0.1.42:11434  OLL    ✓        12 ms
   2  qwen3-9b-thinking                    10.0.1.42:11434  OLL    ✓        12 ms
   3  llama-3.3-70b-instruct-awq           10.0.1.55:8000   API    ✓         8 ms
   4  qwen3.6-27b                          10.0.1.55:8000   API    ✓         8 ms
   5  Meta-Llama-3.1-8B-Instruct           10.0.1.77:1234   API    ✓        21 ms
  ──────────────────────────────────────────────────────────────────────

Each row is one model, not one server — a single host running five Ollama models becomes five rows. This is intentional: most decisions you make ("which one to chat with", "which to stress test") are per-model.

Latency is the response time of the health-check probe (/v1/models or /api/tags), not first-token latency. For real TTFT see Chat#performance-metrics.
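
A probe latency of this kind is typically just wall-clock time around the request. The sketch below shows the idea with a hypothetical `timed` wrapper and a fake health check; it is not the timing code from scanner.py.

```python
import asyncio
import time


async def timed(coro):
    """Await coro and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = await coro
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


async def fake_health_check():
    """Stand-in for a /v1/models or /api/tags request."""
    await asyncio.sleep(0.01)
    return {"status": "healthy"}


result, latency_ms = asyncio.run(timed(fake_health_check()))
```

Because the clock stops when the full response has arrived, the number says how quickly the server answers a metadata request and nothing about generation speed.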


Caching

save_cache() writes the entire discovered server list to ~/.model_chat_cache.json after a successful scan. On next launch, load_cache() reads it back and quick_validate_cache() re-validates every entry by calling check_server_health() — typically <100ms per server.

async def quick_validate_cache(servers, progress_callback=None):
    validated = []
    for server in servers:
        healthy = await check_server_health(server)
        if healthy.get("status") == "healthy":
            validated.append(healthy)
    return validated

Servers that fail validation are dropped from the cache silently. If all cached servers are dead, the app falls through to a full rescan automatically.
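
The fall-through described above can be sketched as follows. The `discover` function is hypothetical, and the four callables are injected so the example stays self-contained; their real signatures in scanner.py may differ.

```python
import asyncio


async def discover(load_cache, quick_validate_cache, scan_network, save_cache):
    """Return healthy servers, rescanning only when the cache yields none."""
    cached = load_cache()
    if cached:
        healthy = await quick_validate_cache(cached)
        if healthy:
            return healthy
    # No cache, or every cached server failed validation: full scan.
    servers = await scan_network()
    save_cache(servers)
    return servers


async def _demo():
    saved = []

    async def validate_none(servers):
        return []  # every cached server failed its health check

    async def scan():
        return [{"url": "http://10.0.1.55:8000", "status": "healthy"}]

    found = await discover(lambda: [{"url": "http://10.0.1.42:11434"}],
                           validate_none, scan, saved.extend)
    return found, saved


servers, saved = asyncio.run(_demo())
```

In the demo the single cached server is dead, so the result comes from the stubbed scan and is written back through the stubbed `save_cache`.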

To force a rescan while keeping a backup of the old cache:

mv ~/.model_chat_cache.json ~/.model_chat_cache.json.bak
python main.py

To wipe and start fresh:

rm ~/.model_chat_cache.json
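
Both commands work because a missing cache file takes the same path as a first run (the "Cache exists?" branch in the flowchart). A sketch of a JSON round-trip with that behaviour is shown below; the on-disk layout (a plain JSON list) and the function bodies are assumptions, and only the file path and function names come from this page.

```python
import json
import tempfile
from pathlib import Path

CACHE_PATH = Path.home() / ".model_chat_cache.json"  # path named above


def save_cache(servers: list, path: Path = CACHE_PATH) -> None:
    """Write the server list as JSON (the real file layout is an assumption)."""
    path.write_text(json.dumps(servers, indent=2))


def load_cache(path: Path = CACHE_PATH) -> list:
    """Return the cached list, or [] if the file is missing or unreadable."""
    try:
        data = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return []
    return data if isinstance(data, list) else []


# Demo against a temporary file so the real cache is left alone.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "cache.json"
    missing = load_cache(demo_path)
    save_cache([{"url": "http://10.0.1.55:8000"}], demo_path)
    restored = load_cache(demo_path)
```

Treating a corrupt file the same as a missing one means a damaged cache costs one full scan rather than a crash at startup.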

Performance characteristics

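| Network                                               | Cold scan time |
|-------------------------------------------------------|----------------|
| Quiet /24 (few hosts)                                 | ~3 – 5 s       |
| Busy /24 (50+ hosts)                                  | ~8 – 12 s      |
| Hostile firewall (drops packets instead of sending RST) | ~15 – 20 s   |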

The Semaphore(100) cap keeps things polite. Raising it buys little, because the httpx connection pool, configured as httpx.Limits(max_connections=200, max_keepalive_connections=50), becomes the real ceiling.

Validating a 10-server cache: typically <1 second.
