Architecture
A walkthrough of how the pieces fit. For per-function detail, see API-Reference.
graph TB
subgraph "Entry"
M[main.py — state machine]
end
subgraph "Discovery"
S[scanner.py — subnet probe + cache]
end
subgraph "Communication"
C[client.py — OpenAI / Ollama streaming + tool calls]
end
subgraph "Engines"
ST[stress_tester.py — 6-mode load testing]
TB[tool_bench.py — agent loop + 45 tasks]
PA[prompt_arena.py — prompt comparison]
end
subgraph "UI Layer"
UI[ui/ — Rich-based dashboards & chat]
end
M --> S
M --> UI
UI --> C
UI --> ST
UI --> PA
ST --> TB
ST --> C
TB --> C
PA --> C
style M fill:#1a1a2e,color:#eee,stroke:#0f3460
style C fill:#16213e,color:#eee,stroke:#0f3460
style TB fill:#e94560,color:#fff,stroke:#16213e
style ST fill:#0f3460,color:#eee,stroke:#16213e
model-chat-cli/
├── main.py · State machine (discovery → chat → arena/stress)
├── scanner.py · Network discovery, caching, health checks
├── client.py · Model API client (OpenAI + Ollama, streaming + tools)
├── prompt_arena.py · System-prompt comparison engine
├── stress_tester.py · 6-mode load testing
├── tool_bench.py · Agentic tool-calling benchmark
├── think_parser.py · <think>…</think> stream parser
├── logger.py · Centralized logging
├── storage/
│   └── history.py · Chat history persistence (currently unused by UI)
└── ui/
    ├── theme.py · Semantic color theme
    ├── components.py · Shared renderables + token estimation
    ├── discovery.py · Server scan + model selection
    ├── chat.py · Streaming chat with TTFT / decode TPS
    ├── multi_arena.py · Multi-model arena
    ├── arena.py · Prompt comparison UI
    └── stress_test.py · Stress test dashboard + tool-bench summary
from enum import Enum

class AppState(Enum):
    DISCOVERY = "discovery"
    CHAT = "chat"
    STRESS_TEST = "stress_test"
    PROMPT_ARENA = "prompt_arena"
    ARENA = "arena"
    QUIT = "quit"

The main loop is a while self.state != AppState.QUIT dispatch. Each view's run() returns a control signal ("switch", "back", "stress_test", …) that maps to the next state. This means each view is self-contained: it owns its input loop, its Rich layout, and its exit conditions.
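The dispatch loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual main.py: the SIGNAL_TO_STATE table, the run_app function, the stub views, and the specific signal-to-state pairings are all assumptions made for the example. Only AppState and the idea of views returning control signals come from the text.

```python
import asyncio
from enum import Enum


class AppState(Enum):
    DISCOVERY = "discovery"
    CHAT = "chat"
    STRESS_TEST = "stress_test"
    PROMPT_ARENA = "prompt_arena"
    ARENA = "arena"
    QUIT = "quit"


# Hypothetical mapping from a view's return signal to the next state.
SIGNAL_TO_STATE = {
    "switch": AppState.DISCOVERY,
    "back": AppState.CHAT,
    "stress_test": AppState.STRESS_TEST,
    "quit": AppState.QUIT,
}


async def run_app(views, start=AppState.DISCOVERY):
    """Run views until one of them signals a transition to QUIT."""
    state = start
    visited = []
    while state != AppState.QUIT:
        visited.append(state)
        signal = await views[state]()  # each view owns its own input loop
        # Unknown signals end the session rather than looping forever.
        state = SIGNAL_TO_STATE.get(signal, AppState.QUIT)
    return visited


def make_stub_view(signal):
    """Build a stand-in view whose run() immediately returns `signal`."""
    async def run():
        return signal
    return run


# Stand-in views for the demo; real views would render a Rich layout.
STUB_VIEWS = {
    AppState.DISCOVERY: make_stub_view("back"),
    AppState.CHAT: make_stub_view("stress_test"),
    AppState.STRESS_TEST: make_stub_view("quit"),
}
```

Calling asyncio.run(run_app(STUB_VIEWS)) walks discovery, chat, and stress test in turn and then exits. Because the loop only sees signals, a view can be swapped out without touching the dispatcher.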
All I/O uses async/await with httpx.AsyncClient. There is no threading and no asyncio.to_thread. The one call in the hot path that would otherwise block, reading user input, is handled by prompt_toolkit in async mode.
Long-running operations accept async def progress_callback(current, total). The UI binds this to a Rich Progress bar:
servers = await scan_network(progress_callback=update_progress)

This lets the scanner / validator stay UI-agnostic: they don't know about Rich.
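To make the callback contract concrete, here is a small sketch of a long-running operation that reports progress without knowing who is listening. The scan_network body, its hosts parameter, and the success rule are stand-ins invented for the example; only the progress_callback(current, total) shape is taken from the text.

```python
import asyncio


async def scan_network(hosts, progress_callback=None):
    """Probe each host in turn, reporting progress after every probe.

    Stand-in for the real scanner: the probe itself is simulated.
    """
    found = []
    total = len(hosts)
    for current, host in enumerate(hosts, start=1):
        await asyncio.sleep(0)  # placeholder for an async HTTP probe
        if host.endswith(".10"):  # made-up success rule for the demo
            found.append(host)
        if progress_callback is not None:
            await progress_callback(current, total)
    return found


async def demo():
    seen = []

    async def update_progress(current, total):
        # A UI would advance a Rich Progress task here; the demo just records.
        seen.append((current, total))

    servers = await scan_network(
        ["192.168.1.9", "192.168.1.10", "192.168.1.11"],
        progress_callback=update_progress,
    )
    return servers, seen
```

The scanner never imports a UI library; the same function could be driven by a test that only records the calls, as demo() does.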
The server dict carries type: "openai" | "ollama". Code that crosses backends branches on it:
if server["type"] == "openai":
    # /v1/chat/completions, SSE, data: {…}
    ...
else:
    # /api/chat, NDJSON
    ...

The split is deep in the stack (mostly inside ModelClient), so most code above that layer doesn't care.
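As an illustration of what that branch has to handle, the sketch below picks the endpoint and pulls the text out of a single stream line for each backend. The helper names endpoint_for and extract_delta are hypothetical and are not taken from ModelClient; the payload shapes assumed here are the usual OpenAI chat-completion chunk and the Ollama /api/chat stream object.

```python
import json


def endpoint_for(server):
    """Return the chat endpoint path for a server dict."""
    if server["type"] == "openai":
        return "/v1/chat/completions"
    return "/api/chat"


def extract_delta(server, line):
    """Return the text carried by one stream line, or None if there is none."""
    line = line.strip()
    if not line:
        return None
    if server["type"] == "openai":
        # SSE framing: payload lines look like `data: {...}`.
        if not line.startswith("data:"):
            return None
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            return None
        choices = json.loads(payload).get("choices") or []
        if not choices:
            return None
        return choices[0].get("delta", {}).get("content")
    # NDJSON framing: every line is a complete JSON object.
    return json.loads(line).get("message", {}).get("content")
```

Keeping both framings behind one function is what lets the layers above treat a stream as a plain sequence of text fragments.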
Everything in the stack speaks the OpenAI chat message format:
[
    {"role": "system", "content": "…"},
    {"role": "user", "content": "…"},
    {"role": "assistant", "content": "…"},
    {"role": "tool", "content": "…", "tool_call_id": "…"},
]

For Ollama, ModelClient._chat_stream_ollama translates this to/from Ollama's format internally.
enable_thinking is a tri-state (None | True | False). When None, no flag is sent. When True/False, the client passes it through the backend-appropriate field:
| Server | Field path |
|---|---|
| OpenAI | body.chat_template_kwargs.enable_thinking |
| Ollama | body.options.enable_thinking |
Servers that don't recognize the field ignore it. See Chat#enable-thinking-flag.
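The tri-state and the two field paths in the table can be expressed in a few lines. The function name apply_thinking_flag and its signature are assumptions for this sketch and do not come from client.py; only the field paths and the None / True / False behavior come from the text.

```python
import copy


def apply_thinking_flag(body, server_type, enable_thinking=None):
    """Return a copy of `body` with the flag set where the backend expects it."""
    result = copy.deepcopy(body)
    if enable_thinking is None:
        return result  # tri-state "unset": send nothing at all
    # OpenAI-style servers read chat_template_kwargs; Ollama reads options.
    key = "chat_template_kwargs" if server_type == "openai" else "options"
    result.setdefault(key, {})["enable_thinking"] = enable_thinking
    return result
```

Distinguishing None from False matters here: False sends an explicit "off", while None leaves the server's own default in place.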
Both backends use httpx's client.stream() context manager and aiter_bytes(chunk_size=64):
async with client.stream("POST", url, json=payload) as response:
    buffer = b""
    async for chunk_bytes in response.aiter_bytes(chunk_size=64):
        buffer += chunk_bytes
        while b"\n" in buffer:
            line_bytes, buffer = buffer.split(b"\n", 1)
            line = line_bytes.decode("utf-8", errors="ignore").strip()
            …

Why 64-byte chunks: line-buffering disrupts the token cadence and makes streaming feel jerky. 64 bytes is small enough that each individual SSE/NDJSON line lands fast, and large enough that we're not bouncing into the asyncio scheduler on every byte.
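Why errors='ignore' on decode: a chunk boundary occasionally falls in the middle of a multi-byte UTF-8 codepoint. Because bytes accumulate in buffer and are decoded only once a full line has arrived, the character is normally complete by the time it is decoded; errors='ignore' covers any bytes that are still invalid, dropping them instead of raising mid-stream.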
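A synchronous sketch of the same buffering loop shows a codepoint that is split across two chunks coming out intact. The split_lines helper is a stand-in written for this example (the real code runs inside the async httpx loop above), and skipping empty lines is an assumption about what the elided body does.

```python
def split_lines(chunks):
    """Yield decoded, stripped, non-empty lines from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines leave the buffer; a partial line waits for more bytes.
        while b"\n" in buffer:
            line_bytes, buffer = buffer.split(b"\n", 1)
            line = line_bytes.decode("utf-8", errors="ignore").strip()
            if line:
                yield line


# "é" is two bytes in UTF-8 (0xC3 0xA9); cut the payload between them.
payload = "data: café\n".encode("utf-8")
cut = payload.index(b"\xc3") + 1
lines = list(split_lines([payload[:cut], payload[cut:]]))
```

Here lines holds the single string "data: café". A byte that is invalid on its own, such as a stray 0xFF, is silently dropped instead of raising UnicodeDecodeError.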
logger.setup_logger() configures two handlers per run:
- Console handler: INFO+, for high-signal events visible in the terminal.
- File handler: DEBUG+, covering every request, every error trace, and every metric, written to logs/stress_test_YYYYMMDD_HHMMSS.log.
The helpers log_request_error, log_vllm_error, and log_test_summary produce structured entries you can grep after the fact.
UI views don't log directly — they hand off to engines (scanner, stress_tester, tool_bench) which own logging.
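A two-handler setup of this kind can be sketched with the standard logging module. The signature, the formatter string, and the returned (logger, path) tuple below are assumptions for illustration and may differ from the real logger.setup_logger(); only the handler levels and the file-name pattern come from the text.

```python
import logging
from datetime import datetime
from pathlib import Path


def setup_logger(name="stress_test", log_dir="logs"):
    """Attach an INFO console handler and a DEBUG file handler to `name`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.handlers.clear()  # avoid duplicate handlers on repeat calls

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = Path(log_dir) / f"stress_test_{stamp}.log"
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger, log_path
```

With this arrangement a logger.debug(...) call reaches only the file, while logger.info(...) and above reach both the file and the terminal.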
ChatHistoryManager saves/loads/searches/deletes conversations as JSON in ~/.model_chat_history/. It's a finished module that is not currently wired into the UI — the chat view exports markdown directly via /export. If you want to wire it up, see Contributing#wiring-up-chat-history.