Architecture

mtecnic edited this page May 28, 2026 · 1 revision

A walkthrough of how the pieces fit. For per-function detail, see API-Reference.


High-level

graph TB
    subgraph "Entry"
        M[main.py — state machine]
    end
    subgraph "Discovery"
        S[scanner.py — subnet probe + cache]
    end
    subgraph "Communication"
        C[client.py — OpenAI / Ollama streaming + tool calls]
    end
    subgraph "Engines"
        ST[stress_tester.py — 6-mode load testing]
        TB[tool_bench.py — agent loop + 45 tasks]
        PA[prompt_arena.py — prompt comparison]
    end
    subgraph "UI Layer"
        UI[ui/ — Rich-based dashboards & chat]
    end

    M --> S
    M --> UI
    UI --> C
    UI --> ST
    UI --> PA
    ST --> TB
    ST --> C
    TB --> C
    PA --> C

    style M fill:#1a1a2e,color:#eee,stroke:#0f3460
    style C fill:#16213e,color:#eee,stroke:#0f3460
    style TB fill:#e94560,color:#fff,stroke:#16213e
    style ST fill:#0f3460,color:#eee,stroke:#16213e

File layout

model-chat-cli/
├── main.py              · State machine (discovery → chat → arena/stress)
├── scanner.py           · Network discovery, caching, health checks
├── client.py            · Model API client (OpenAI + Ollama, streaming + tools)
├── prompt_arena.py      · System-prompt comparison engine
├── stress_tester.py     · 6-mode load testing
├── tool_bench.py        · Agentic tool-calling benchmark
├── think_parser.py      · <think>…</think> stream parser
├── logger.py            · Centralized logging
├── storage/
│   └── history.py       · Chat history persistence (currently unused by UI)
└── ui/
    ├── theme.py         · Semantic color theme
    ├── components.py    · Shared renderables + token estimation
    ├── discovery.py     · Server scan + model selection
    ├── chat.py          · Streaming chat with TTFT / decode TPS
    ├── multi_arena.py   · Multi-model arena
    ├── arena.py         · Prompt comparison UI
    └── stress_test.py   · Stress test dashboard + tool-bench summary

Core patterns

1. State machine in main.py

class AppState(Enum):
    DISCOVERY    = "discovery"
    CHAT         = "chat"
    STRESS_TEST  = "stress_test"
    PROMPT_ARENA = "prompt_arena"
    ARENA        = "arena"
    QUIT         = "quit"

The main loop runs while self.state != AppState.QUIT and dispatches on the current state. Each view's run() returns a control signal ("switch", "back", "stress_test", …) that maps to the next state. As a result, each view is self-contained: it owns its input loop, its Rich layout, and its exit conditions.
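
A minimal sketch of this dispatch pattern, not the actual main.py. The TRANSITIONS table, the "selected" signal, and the run_app / scripted_view helpers are hypothetical names introduced for illustration; only the enum values and the "switch", "back", and "stress_test" signals come from the description above.

```python
import asyncio
from enum import Enum


class AppState(Enum):
    DISCOVERY = "discovery"
    CHAT = "chat"
    STRESS_TEST = "stress_test"
    QUIT = "quit"


# Hypothetical signal table: (current state, signal) -> next state.
TRANSITIONS = {
    (AppState.DISCOVERY, "selected"): AppState.CHAT,
    (AppState.CHAT, "switch"): AppState.DISCOVERY,
    (AppState.CHAT, "stress_test"): AppState.STRESS_TEST,
    (AppState.STRESS_TEST, "back"): AppState.CHAT,
}


async def run_app(views, start=AppState.DISCOVERY):
    """Run views until a signal has no transition (treated here as quit)."""
    state, visited = start, []
    while state != AppState.QUIT:
        visited.append(state)
        signal = await views[state]()  # each view owns its own input loop
        state = TRANSITIONS.get((state, signal), AppState.QUIT)
    return visited


def scripted_view(signals):
    """Build a stand-in view that returns the next scripted signal per call."""
    remaining = iter(signals)

    async def view():
        return next(remaining)

    return view
```

Because the loop only sees signals, a view can be swapped for a scripted stand-in (as scripted_view does) without touching the dispatcher.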

2. Async everywhere

All I/O uses async/await with httpx.AsyncClient. There is no threading and no asyncio.to_thread — the only blocking call in the hot path is reading user input (handled by prompt_toolkit in async mode).

3. Progress callbacks

Long-running operations accept async def progress_callback(current, total). The UI binds this to a Rich Progress bar:

servers = await scan_network(progress_callback=update_progress)

This lets the scanner / validator stay UI-agnostic — they don't know about Rich.
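
A sketch of what the receiving side of such a callback might look like. scan_hosts, fake_probe, and demo are hypothetical stand-ins, not the real scanner.py API; only the progress_callback(current, total) shape is taken from the text.

```python
import asyncio


async def scan_hosts(hosts, probe, progress_callback=None):
    """Probe each host in turn, reporting progress after every probe."""
    found = []
    total = len(hosts)
    for index, host in enumerate(hosts, start=1):
        if await probe(host):
            found.append(host)
        if progress_callback is not None:
            await progress_callback(index, total)
    return found


async def demo():
    seen = []

    async def update_progress(current, total):
        # A UI layer would advance a Rich Progress task here instead.
        seen.append((current, total))

    async def fake_probe(host):
        return host.endswith(".2")

    servers = await scan_hosts(["10.0.0.1", "10.0.0.2"], fake_probe, update_progress)
    return servers, seen
```

The scanning function never imports a UI library; the caller decides what "progress" means.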

4. Server-type dispatch

The server dict carries type: "openai" | "ollama". Code that crosses backends branches on it:

if server["type"] == "openai":
    endpoint = "/v1/chat/completions"  # SSE: lines of the form data: {…}
else:
    endpoint = "/api/chat"             # NDJSON: one JSON object per line

The split is deep in the stack (mostly inside ModelClient) — most code above that layer doesn't care.
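
To make the difference between the two wire formats concrete, here is a hedged sketch of a per-line parser. extract_delta is a hypothetical helper, not a ModelClient method; the field names follow the public OpenAI streaming format (choices[0].delta.content, terminated by data: [DONE]) and Ollama's /api/chat responses (message.content).

```python
import json


def extract_delta(server_type, line):
    """Return the text fragment carried by one streamed line, or None."""
    line = line.strip()
    if not line:
        return None
    if server_type == "openai":
        if not line.startswith("data:"):
            return None  # SSE comments, event names, and similar
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return None  # end-of-stream sentinel
        choices = json.loads(data).get("choices") or [{}]
        return choices[0].get("delta", {}).get("content")
    # Ollama: every line is a standalone JSON object.
    return json.loads(line).get("message", {}).get("content")
```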

5. Standard message format

Everything in the stack speaks the OpenAI chat message format:

[
    {"role": "system",    "content": "…"},
    {"role": "user",      "content": "…"},
    {"role": "assistant", "content": "…"},
    {"role": "tool",      "content": "…", "tool_call_id": "…"},
]

For Ollama, ModelClient._chat_stream_ollama translates this to/from Ollama's format internally.

6. Thinking-mode passthrough

enable_thinking is a tri-state (None | True | False). When None, no flag is sent. When True/False, the client passes it through the backend-appropriate field:

| Server | Field path |
|--------|------------|
| OpenAI | body.chat_template_kwargs.enable_thinking |
| Ollama | body.options.enable_thinking |

Servers that don't recognize the field ignore it. See Chat#enable-thinking-flag.
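
The tri-state behaviour can be sketched as a small payload helper. apply_thinking is a hypothetical name used for illustration; only the two field paths and the None / True / False semantics come from the description above.

```python
def apply_thinking(payload, server_type, enable_thinking=None):
    """Return a copy of payload with the thinking flag set, if requested."""
    body = dict(payload)
    if enable_thinking is None:
        return body  # tri-state "unset": send no flag at all
    if server_type == "openai":
        kwargs = dict(body.get("chat_template_kwargs", {}))
        kwargs["enable_thinking"] = enable_thinking
        body["chat_template_kwargs"] = kwargs
    else:
        options = dict(body.get("options", {}))
        options["enable_thinking"] = enable_thinking
        body["options"] = options
    return body
```

Keeping None distinct from False matters: False explicitly asks the server to disable thinking, while None leaves the server's default untouched.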


Streaming internals

Both backends use httpx's client.stream() context manager and aiter_bytes(chunk_size=64):

async with client.stream("POST", url, json=payload) as response:
    buffer = b""
    async for chunk_bytes in response.aiter_bytes(chunk_size=64):
        buffer += chunk_bytes
        while b"\n" in buffer:
            line_bytes, buffer = buffer.split(b"\n", 1)
            line = line_bytes.decode("utf-8", errors="ignore").strip()
            …

Why 64-byte chunks: reading in larger blocks bunches several tokens together before the line splitter sees them, which breaks the token cadence and makes streaming feel jerky. 64 bytes is small enough that each individual SSE/NDJSON line lands fast, and large enough that we're not bouncing into the asyncio scheduler on every byte.

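Why errors='ignore' on decode: a chunk boundary occasionally splits a multi-byte UTF-8 codepoint. Because the buffer accumulates raw bytes and decoding happens only once a complete line has arrived, the split character is reassembled when the rest of its bytes arrive on the next iteration (the newline byte never occurs inside a multi-byte sequence). errors='ignore' is the fallback for genuinely malformed bytes: they are dropped instead of raising UnicodeDecodeError mid-stream.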
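
A synchronous sketch of the same buffering loop, run against a payload deliberately sliced so that a two-byte character straddles a chunk boundary. iter_lines and the sample payload are illustrative only; the loop body mirrors the snippet above.

```python
def iter_lines(chunks):
    """Yield decoded lines from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line_bytes, buffer = buffer.split(b"\n", 1)
            yield line_bytes.decode("utf-8", errors="ignore").strip()


payload = "data: café\ndata: ok\n".encode("utf-8")
# 10-byte slices put the first byte of "é" at the end of chunk 0
# and its second byte at the start of chunk 1.
chunks = [payload[i:i + 10] for i in range(0, len(payload), 10)]

# Decoding only complete lines keeps the character intact.
lines = list(iter_lines(chunks))

# For contrast: decoding each chunk on its own silently drops both halves.
naive = "".join(c.decode("utf-8", errors="ignore") for c in chunks)
```

Here lines keeps "café" whole, while naive loses the "é", which is why decoding happens per line rather than per chunk.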


Logging architecture

logger.setup_logger() configures two handlers per run:

  • Console handler: INFO+ — high-signal events visible in the terminal.
  • File handler: DEBUG+ — every request, every error trace, every metric, written to logs/stress_test_YYYYMMDD_HHMMSS.log.

Helpers log_request_error, log_vllm_error, log_test_summary produce structured entries you can grep after the fact.

UI views don't log directly — they hand off to engines (scanner, stress_tester, tool_bench) which own logging.
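
The two-handler arrangement can be sketched with the standard logging module. This is an assumption-laden illustration, not the contents of logger.py: make_logger, its parameters, the logger name, and the format string are all hypothetical; only the INFO console / DEBUG file split and the stress_test_YYYYMMDD_HHMMSS.log naming come from the text.

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path


def make_logger(log_dir, name="model_chat"):
    """Attach an INFO console handler and a DEBUG file handler to one logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.handlers.clear()         # avoid duplicate output on repeated setup
    logger.propagate = False

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = Path(log_dir) / f"stress_test_{stamp}.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger, log_path
```

With this layout a logger.debug(...) call reaches only the file, while logger.info(...) and above reach both the file and the terminal.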


What storage/history.py is for

ChatHistoryManager saves/loads/searches/deletes conversations as JSON in ~/.model_chat_history/. It's a finished module that is not currently wired into the UI — the chat view exports markdown directly via /export. If you want to wire it up, see Contributing#wiring-up-chat-history.
