Architecture

A walkthrough of how the pieces fit. For per-function detail, see API-Reference.

High-level

graph TB
    subgraph "Entry"
        M[main.py — state machine]
    end
    subgraph "Discovery"
        S[scanner.py — subnet probe + cache]
    end
    subgraph "Communication"
        C[client.py — OpenAI / Ollama streaming + tool calls]
    end
    subgraph "Engines"
        ST[stress_tester.py — 6-mode load testing]
        TB[tool_bench.py — agent loop + 45 tasks]
        PA[prompt_arena.py — prompt comparison]
    end
    subgraph "UI Layer"
        UI[ui/ — Rich-based dashboards & chat]
    end

    M --> S
    M --> UI
    UI --> C
    UI --> ST
    UI --> PA
    ST --> TB
    ST --> C
    TB --> C
    PA --> C

    style M fill:#1a1a2e,color:#eee,stroke:#0f3460
    style C fill:#16213e,color:#eee,stroke:#0f3460
    style TB fill:#e94560,color:#fff,stroke:#16213e
    style ST fill:#0f3460,color:#eee,stroke:#16213e

File layout

model-chat-cli/
├── main.py              · State machine (discovery → chat → arena/stress)
├── scanner.py           · Network discovery, caching, health checks
├── client.py            · Model API client (OpenAI + Ollama, streaming + tools)
├── prompt_arena.py      · System-prompt comparison engine
├── stress_tester.py     · 6-mode load testing
├── tool_bench.py        · Agentic tool-calling benchmark
├── think_parser.py      · <think>…</think> stream parser
├── logger.py            · Centralized logging
├── storage/
│   └── history.py       · Chat history persistence (currently unused by UI)
└── ui/
    ├── theme.py         · Semantic color theme
    ├── components.py    · Shared renderables + token estimation
    ├── discovery.py     · Server scan + model selection
    ├── chat.py          · Streaming chat with TTFT / decode TPS
    ├── multi_arena.py   · Multi-model arena
    ├── arena.py         · Prompt comparison UI
    └── stress_test.py   · Stress test dashboard + tool-bench summary

Core patterns

1. State machine in `main.py`

from enum import Enum

class AppState(Enum):
    DISCOVERY    = "discovery"
    CHAT         = "chat"
    STRESS_TEST  = "stress_test"
    PROMPT_ARENA = "prompt_arena"
    ARENA        = "arena"
    QUIT         = "quit"

The main loop is a `while self.state != QUIT` dispatch. Each view's `run()` returns a control signal (`"switch"`, `"back"`, `"stress_test"`, …) that maps to the next state. As a result, each view is self-contained: it owns its input loop, its Rich layout, and its exit conditions.
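The dispatch loop can be sketched roughly as follows. This is an illustration, not the code in `main.py`: the `TRANSITIONS` table, the `next_state` and `run_app` helpers, the `"quit"` signal, and the specific signal-to-state pairs are all assumptions. Only the `AppState` members and the signal strings `"switch"`, `"back"`, and `"stress_test"` come from the text above.

```python
import asyncio
from enum import Enum


class AppState(Enum):
    DISCOVERY = "discovery"
    CHAT = "chat"
    STRESS_TEST = "stress_test"
    PROMPT_ARENA = "prompt_arena"
    ARENA = "arena"
    QUIT = "quit"


# Hypothetical signal -> state table; the real mapping lives in main.py.
TRANSITIONS = {
    "switch": AppState.DISCOVERY,
    "back": AppState.CHAT,
    "stress_test": AppState.STRESS_TEST,
    "quit": AppState.QUIT,
}


def next_state(current: AppState, signal: str) -> AppState:
    """Map a control signal to the next state; unknown signals keep the current one."""
    return TRANSITIONS.get(signal, current)


async def run_app(views, start=AppState.DISCOVERY):
    """Run views until QUIT. Each view is an async callable returning a signal."""
    state = start
    visited = [state]
    while state != AppState.QUIT:
        signal = await views[state]()
        state = next_state(state, signal)
        visited.append(state)
    return visited
```

Keeping the transition table in one place means a view only needs to know which signal to return, not which view comes next.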

2. Async everywhere

All I/O uses async/await with httpx.AsyncClient. There is no threading and no asyncio.to_thread — the only blocking call in the hot path is reading user input (handled by prompt_toolkit in async mode).

3. Progress callbacks

Long-running operations accept async def progress_callback(current, total). The UI binds this to a Rich Progress bar:

servers = await scan_network(progress_callback=update_progress)

This lets the scanner / validator stay UI-agnostic — they don't know about Rich.
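A minimal sketch of the callback contract, assuming a scanner that probes hosts one at a time. `scan_hosts` and `probe` are hypothetical stand-ins, not the real `scan_network`; only the `progress_callback(current, total)` shape comes from the text above.

```python
import asyncio
from typing import Awaitable, Callable, Optional

ProgressCallback = Callable[[int, int], Awaitable[None]]


async def probe(host: str) -> bool:
    """Stand-in for a real network probe (hypothetical); yields to the loop once."""
    await asyncio.sleep(0)
    return host.endswith(".1")


async def scan_hosts(hosts, progress_callback: Optional[ProgressCallback] = None):
    """Probe each host in turn and report (current, total) after each one."""
    found = []
    total = len(hosts)
    for current, host in enumerate(hosts, start=1):
        if await probe(host):
            found.append(host)
        if progress_callback is not None:
            await progress_callback(current, total)
    return found
```

A UI layer would pass a callback that advances its progress bar; a test or a script can pass one that appends to a list, or nothing at all.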

4. Server-type dispatch

The server dict carries type: "openai" | "ollama". Code that crosses backends branches on it:

if server["type"] == "openai":
    # /v1/chat/completions, SSE, data: {…}
else:
    # /api/chat, NDJSON

The split is deep in the stack (mostly inside ModelClient) — most code above that layer doesn't care.
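As a rough illustration of where the branch might sit, the helper below picks the chat endpoint from the server dict. The `chat_endpoint` name and the `base_url` key are assumptions; the two paths are the ones named in the snippet above. Unlike that snippet, this sketch raises on an unknown type instead of falling through to the Ollama branch.

```python
def chat_endpoint(server: dict) -> str:
    """Return the chat URL for a server dict with 'type' and 'base_url' keys."""
    base = server["base_url"].rstrip("/")
    if server["type"] == "openai":
        # SSE stream: each payload line is prefixed with "data: "
        return f"{base}/v1/chat/completions"
    if server["type"] == "ollama":
        # NDJSON stream: one JSON object per line
        return f"{base}/api/chat"
    raise ValueError(f"unknown server type: {server['type']!r}")
```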

5. Standard message format

Everything in the stack speaks the OpenAI chat message format:

[
    {"role": "system",    "content": "…"},
    {"role": "user",      "content": "…"},
    {"role": "assistant", "content": "…"},
    {"role": "tool",      "content": "…", "tool_call_id": "…"},
]

For Ollama, ModelClient._chat_stream_ollama translates this to/from Ollama's format internally.

6. Thinking-mode passthrough

enable_thinking is a tri-state (None | True | False). When None, no flag is sent. When True/False, the client passes it through the backend-appropriate field:

| Server | Field path |
| ------ | ---------- |
| OpenAI | `body.chat_template_kwargs.enable_thinking` |
| Ollama | `body.options.enable_thinking` |

Servers that don't recognize the field ignore it. See Chat#enable-thinking-flag.
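The tri-state handling can be sketched like this. `build_payload` and the base payload keys are assumptions for illustration; the two field paths are the ones listed above. The point of the sketch is that `False` is sent explicitly, while `None` leaves the request untouched.

```python
from typing import Optional


def build_payload(server_type: str, model: str, messages: list,
                  enable_thinking: Optional[bool] = None) -> dict:
    """Build a request body, adding the thinking flag only when it is set."""
    payload = {"model": model, "messages": messages, "stream": True}
    if enable_thinking is None:
        return payload  # tri-state "unset": send no flag at all
    if server_type == "openai":
        payload.setdefault("chat_template_kwargs", {})["enable_thinking"] = enable_thinking
    else:
        payload.setdefault("options", {})["enable_thinking"] = enable_thinking
    return payload
```

Checking `is None` instead of truthiness is what keeps an explicit `False` from being silently dropped.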

Streaming internals

Both backends use httpx's client.stream() context manager and aiter_bytes(chunk_size=64):

async with client.stream("POST", url, json=payload) as response:
    buffer = b""
    async for chunk_bytes in response.aiter_bytes(chunk_size=64):
        buffer += chunk_bytes
        while b"\n" in buffer:
            line_bytes, buffer = buffer.split(b"\n", 1)
            line = line_bytes.decode("utf-8", errors="ignore").strip()
            …

Why 64-byte chunks: reading in larger, line-buffered units delivers tokens in bursts, which makes streaming feel jerky. 64 bytes is small enough that each individual SSE/NDJSON line lands fast, and large enough that we're not bouncing into the asyncio scheduler on every byte.

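Why errors='ignore' on decode: a chunk boundary can split a multi-byte UTF-8 codepoint, but decoding only happens after a complete line has been cut from the buffer, so the rest of the character has already arrived by then. errors='ignore' covers the remaining case of genuinely invalid bytes in a line: they are dropped instead of raising UnicodeDecodeError and aborting the stream.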
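The buffering logic can be exercised without a network connection. The sketch below is a synchronous stand-in for the loop above: `iter_lines` and `parse_line` are hypothetical names, and the handling of the `data: [DONE]` sentinel reflects the usual OpenAI-style stream terminator, not code confirmed from `client.py`.

```python
import json


def iter_lines(chunks):
    """Yield decoded, non-empty lines from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line_bytes, buffer = buffer.split(b"\n", 1)
            line = line_bytes.decode("utf-8", errors="ignore").strip()
            if line:
                yield line


def parse_line(line: str, server_type: str):
    """Parse one stream line into a dict, or None if it carries no JSON."""
    if server_type == "openai":
        if not line.startswith("data:"):
            return None  # SSE comments, event names, and similar
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return None  # end-of-stream sentinel
        return json.loads(data)
    return json.loads(line)  # NDJSON: the whole line is one JSON object
```

Because a newline byte never appears inside a multi-byte UTF-8 sequence, splitting on `b"\n"` before decoding is safe even when a chunk ends mid-character. Note that, like the loop above, this sketch leaves a final line without a trailing newline in the buffer.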

Logging architecture

logger.setup_logger() configures two handlers per run:

Console handler: INFO+ — high-signal events visible in the terminal.
File handler: DEBUG+ — every request, every error trace, every metric, written to logs/stress_test_YYYYMMDD_HHMMSS.log.

Helpers log_request_error, log_vllm_error, log_test_summary produce structured entries you can grep after the fact.
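A minimal sketch of the two-handler setup using the standard `logging` module. The parameters, return value, logger name, and format string here are assumptions, not the signature of the real `logger.setup_logger()`; only the handler levels and the file name pattern come from the description above.

```python
import logging
from datetime import datetime
from pathlib import Path


def setup_logger(log_dir="logs", name="model_chat_cli"):
    """Attach an INFO console handler and a DEBUG file handler.

    Returns (logger, log_path). Both parameters are hypothetical.
    """
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = Path(log_dir) / f"stress_test_{stamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.handlers.clear()         # avoid duplicate handlers on repeated setup
    logger.propagate = False

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger, log_path
```

Setting the logger itself to DEBUG and filtering per handler is what lets the file capture everything while the terminal stays quiet.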

UI views don't log directly — they hand off to engines (scanner, stress_tester, tool_bench) which own logging.

What `storage/history.py` is for

ChatHistoryManager saves/loads/searches/deletes conversations as JSON in ~/.model_chat_history/. It's a finished module that is not currently wired into the UI — the chat view exports markdown directly via /export. If you want to wire it up, see Contributing#wiring-up-chat-history.

Model Chat CLI · MIT · repo · issues · No telemetry · No cloud calls · No surprises
