Chat

The chat view (ui/chat.py) sits on top of client.ModelClient and streams tokens to a rich.live.Live panel while measuring time to first token and decode throughput.

Streaming pipeline

flowchart LR
    A[User input] --> B[history.append user msg]
    B --> C[ModelClient.chat_stream]
    C --> D{server.type}
    D -- openai --> E[_chat_stream_openai<br/>data: SSE]
    D -- ollama --> F[_chat_stream_ollama<br/>NDJSON]
    E --> G[aiter_bytes 64B]
    F --> G
    G --> H[Parse delta / chunk]
    H --> I[think_parser]
    I --> J[Live update Rich panel]
    J --> K[history.append assistant msg]

    style C fill:#0f3460,color:#eee
    style I fill:#e94560,color:#fff

Both backends read response bytes with aiter_bytes(chunk_size=64) rather than aiter_lines(). Line buffering inside HTTPX holds output back until a full line has arrived, which breaks up the token-by-token cadence and makes streaming feel jerky on slow tokens. 64-byte chunks are small enough to keep the visible stream fast without burning CPU on lock contention.
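
Reading fixed-size chunks means the client has to reassemble lines itself, because a 64-byte chunk can end in the middle of a line. The following is a minimal sketch of that step, not the code in client.py: the helper names (`split_lines`, `parse_sse_line`, `parse_ndjson_line`) are hypothetical, and a plain list of byte chunks stands in for what `aiter_bytes(chunk_size=64)` would deliver so the example runs without a server.

```python
import json
from typing import Iterable, Iterator, Optional


def split_lines(chunks: Iterable[bytes]) -> Iterator[str]:
    """Reassemble complete lines from arbitrarily sized byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line; keep the unfinished tail in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode("utf-8").rstrip("\r")
    if buffer:
        yield buffer.decode("utf-8")


def parse_sse_line(line: str) -> Optional[dict]:
    """Return the JSON payload of an SSE `data:` line, or None to skip it."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    return json.loads(payload)


def parse_ndjson_line(line: str) -> Optional[dict]:
    """Return the JSON object on one NDJSON line, or None for blank lines."""
    return json.loads(line) if line.strip() else None


# Simulate an OpenAI-style SSE stream cut into 64-byte chunks.
raw = (
    b'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n'
    b'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n'
    b"data: [DONE]\n\n"
)
chunks = [raw[i:i + 64] for i in range(0, len(raw), 64)]
events = [e for e in map(parse_sse_line, split_lines(chunks)) if e is not None]
text = "".join(e["choices"][0]["delta"]["content"] for e in events)
print(text)  # Hello
```

Splitting on the newline byte before decoding keeps multi-byte UTF-8 characters intact even when a chunk boundary falls inside one.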

Performance metrics

A status line appears under every assistant reply:

  ↳ 32 tok · 0.7s · 45.7 t/s · 142ms ttft

Field	Meaning
`tok`	Output tokens (best-effort count if not reported by server)
`s`	Decode time — starts on first content token
`t/s`	`tok / s` → true decode TPS, excludes TTFT
`ttft`	Time-to-first-token in ms (request send → first content byte)
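
The fields in the table can be combined into the status line roughly as follows. This is a sketch only: `format_status` is a hypothetical name, and taking both times in seconds is an assumption made for the example.

```python
def format_status(tokens: int, decode_s: float, ttft_s: float) -> str:
    """Render a stats line from a token count, decode time, and TTFT.

    decode_s and ttft_s are in seconds (an assumption for this sketch).
    """
    # Decode-only throughput: tokens divided by decode time, TTFT excluded.
    tps = tokens / decode_s if decode_s > 0 else 0.0
    parts = [
        f"{tokens} tok",
        f"{decode_s:.1f}s",
        f"{tps:.1f} t/s",
        f"{ttft_s * 1000:.0f}ms ttft",
    ]
    return "↳ " + " · ".join(parts)


print(format_status(32, 0.7, 0.142))  # ↳ 32 tok · 0.7s · 45.7 t/s · 142ms ttft
```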

Why decode-only TPS matters

Many tools report tokens per second as total_tokens / wall_clock. That figure is misleading: a 5-second prefill on a 70B model makes a fast decode look slow. Separating TTFT from decode time shows prefill cost and decode throughput independently, which is critical when you're tuning batch sizes or quant levels.
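
A small worked example makes the gap concrete. The numbers are made up for illustration (200 output tokens, a 5-second prefill, a 4-second decode), the function names are hypothetical, and only output tokens are counted in both figures.

```python
def wall_clock_tps(tokens: int, ttft_s: float, decode_s: float) -> float:
    """Tokens divided by total elapsed time, prefill included."""
    return tokens / (ttft_s + decode_s)


def decode_tps(tokens: int, decode_s: float) -> float:
    """Tokens divided by decode time only, measured from the first token."""
    return tokens / decode_s


tokens, ttft_s, decode_s = 200, 5.0, 4.0  # illustrative values
print(f"wall clock:  {wall_clock_tps(tokens, ttft_s, decode_s):.1f} t/s")  # 22.2 t/s
print(f"decode only: {decode_tps(tokens, decode_s):.1f} t/s")              # 50.0 t/s
```

The same generation reads as 22.2 t/s or 50.0 t/s depending on whether the prefill is folded in, which is why the status line reports TTFT as its own field.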

Where the numbers come from

For OpenAI-compatible servers, usage in the final SSE chunk gives token counts. For Ollama, prompt_eval_count, eval_count, prompt_eval_duration, and eval_duration come back on the last NDJSON line (done: true). Both paths land in ChatMetrics:

@dataclass
class ChatMetrics:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    prompt_eval_duration_ns: int = 0   # Ollama prefill
    eval_duration_ns: int = 0          # Ollama decode
    ttft: float = 0.0                  # Time to first content token
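
As a sketch of how the two reporting paths might fill this dataclass: the field names read from the payloads (`usage.prompt_tokens`, `usage.completion_tokens`, and the four Ollama fields) are the ones named in the text, while the function names and the choice of seconds for `ttft` are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class ChatMetrics:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    prompt_eval_duration_ns: int = 0   # Ollama prefill
    eval_duration_ns: int = 0          # Ollama decode
    ttft: float = 0.0                  # unit assumed to be seconds here


def metrics_from_openai(final_chunk: dict, ttft: float) -> ChatMetrics:
    """Read token counts from the `usage` object of the final SSE chunk."""
    usage = final_chunk.get("usage") or {}
    return ChatMetrics(
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        ttft=ttft,
    )


def metrics_from_ollama(final_line: dict, ttft: float) -> ChatMetrics:
    """Read counts and nanosecond durations from the `done: true` line."""
    return ChatMetrics(
        prompt_tokens=final_line.get("prompt_eval_count", 0),
        completion_tokens=final_line.get("eval_count", 0),
        prompt_eval_duration_ns=final_line.get("prompt_eval_duration", 0),
        eval_duration_ns=final_line.get("eval_duration", 0),
        ttft=ttft,
    )


# Illustrative final NDJSON line from an Ollama-style server.
last = {
    "done": True,
    "prompt_eval_count": 12,
    "eval_count": 32,
    "prompt_eval_duration": 140_000_000,
    "eval_duration": 700_000_000,
}
m = metrics_from_ollama(last, ttft=0.142)
print(f"{m.completion_tokens / (m.eval_duration_ns / 1e9):.1f} t/s")  # 45.7 t/s
```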

When the server doesn't report usage (some custom OpenAI-compat shims), the UI estimates from len(text.split()) × 1.3 — explicitly an estimate, marked ~ in the stats line.
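
The fallback can be sketched in a few lines. The function names are hypothetical, and two details are assumptions rather than facts from the source: the estimate is rounded to the nearest integer, and the `~` is placed directly in front of the count.

```python
from typing import Optional


def estimate_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words times 1.3."""
    return round(len(text.split()) * 1.3)


def token_label(reported: Optional[int], text: str) -> str:
    """Prefer the server-reported count; otherwise mark the estimate with ~."""
    if reported is not None and reported > 0:
        return f"{reported} tok"
    return f"~{estimate_tokens(text)} tok"


reply = "the quick brown fox jumps over the lazy dog here"  # 10 words
print(token_label(None, reply))  # ~13 tok
print(token_label(32, reply))    # 32 tok
```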

Thinking mode

Models that emit reasoning chains (Qwen 3 / 3.5, DeepSeek-R1, GLM Reasoner) ship them in one of two ways:

1. As <think>…</think> tags inside the streamed content (most fine-tuned models).
2. As a separate reasoning_content field in the delta (vLLM with Qwen 3 / 3.5).

The client normalizes the second form into the first:

reasoning = delta.get("reasoning_content", "")
if reasoning:
    if not thinking_active:
        yield "<think>"
        thinking_active = True
    yield reasoning
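
The excerpt above shows only the opening tag. A fuller sketch of the same idea follows, under two assumptions that are not confirmed by the source: the closing tag is emitted when the first regular content arrives, and again at end of stream if reasoning was still open. `normalize_deltas` is a hypothetical name, and the deltas are plain dicts standing in for parsed SSE payloads.

```python
from typing import Iterable, Iterator


def normalize_deltas(deltas: Iterable[dict]) -> Iterator[str]:
    """Fold reasoning_content deltas into <think>...</think> tagged text."""
    thinking_active = False
    for delta in deltas:
        reasoning = delta.get("reasoning_content") or ""
        content = delta.get("content") or ""
        if reasoning:
            if not thinking_active:
                yield "<think>"
                thinking_active = True
            yield reasoning
        if content:
            # Assumption: regular content marks the end of the reasoning phase.
            if thinking_active:
                yield "</think>"
                thinking_active = False
            yield content
    # Assumption: close a block left open when the stream ends.
    if thinking_active:
        yield "</think>"


stream = [
    {"reasoning_content": "2+2"},
    {"reasoning_content": " is 4"},
    {"content": "4"},
]
print("".join(normalize_deltas(stream)))  # <think>2+2 is 4</think>4
```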

think_parser.py then splits the stream into thinking and content buffers for the UI. The thinking buffer renders in italics, the content buffer renders normally.
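
Splitting the stream is slightly trickier than a string search, because a tag can be cut across two chunks. The sketch below shows one way to handle that; it is not the contents of think_parser.py, and the class name `ThinkSplitter` and its methods are hypothetical.

```python
OPEN, CLOSE = "<think>", "</think>"


class ThinkSplitter:
    """Route streamed text into 'thinking' and 'content' buffers."""

    def __init__(self) -> None:
        self.thinking = ""
        self.content = ""
        self._in_think = False
        self._pending = ""  # held-back tail that may be the start of a tag

    def feed(self, chunk: str) -> None:
        data = self._pending + chunk
        self._pending = ""
        while data:
            tag = CLOSE if self._in_think else OPEN
            idx = data.find(tag)
            if idx != -1:
                # Full tag found: emit what precedes it, then switch mode.
                self._emit(data[:idx])
                data = data[idx + len(tag):]
                self._in_think = not self._in_think
                continue
            # No full tag: hold back a suffix that could be a partial tag.
            keep = 0
            for n in range(min(len(tag) - 1, len(data)), 0, -1):
                if data.endswith(tag[:n]):
                    keep = n
                    break
            self._emit(data[:len(data) - keep])
            self._pending = data[len(data) - keep:]
            break

    def close(self) -> None:
        """Flush held-back text at end of stream (for example on truncation)."""
        self._emit(self._pending)
        self._pending = ""

    def _emit(self, text: str) -> None:
        if self._in_think:
            self.thinking += text
        else:
            self.content += text


splitter = ThinkSplitter()
for piece in ["<thi", "nk>let me ", "see</th", "ink>The answer", " is 4"]:
    splitter.feed(piece)
splitter.close()
print(repr(splitter.thinking))  # 'let me see'
print(repr(splitter.content))   # 'The answer is 4'
```

If the closing tag never arrives, everything after the opening tag stays in the thinking buffer.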

Toggle with /think. When on, the stats line gains a thinking-token count:

  ↳ 234 tok · 412 think · 5.2s · 45.0 t/s · 320ms ttft

Enable-thinking flag

For models where the server controls reasoning (Qwen 3 / 3.5 on vLLM), the client passes enable_thinking through:

Server	Where it goes
OpenAI	`chat_template_kwargs.enable_thinking` in body
Ollama	`options.enable_thinking` in body

If your model doesn't support either, the flag is silently ignored — toggling /think then only affects how <think> tags in the stream are rendered.
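
The placement in the table can be illustrated with a small request-body builder. Only the location of `enable_thinking` comes from the table; the function name `build_body`, the `"openai"` / `"ollama"` type strings, and the other body keys are assumptions for the sketch.

```python
from typing import Dict, List


def build_body(server_type: str, model: str, messages: List[Dict[str, str]],
               enable_thinking: bool) -> dict:
    """Build a streaming chat request body with the thinking flag in place."""
    body: dict = {"model": model, "messages": messages, "stream": True}
    if server_type == "openai":
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    elif server_type == "ollama":
        body["options"] = {"enable_thinking": enable_thinking}
    else:
        raise ValueError(f"unknown server type: {server_type!r}")
    return body


msgs = [{"role": "user", "content": "hi"}]
print(build_body("openai", "some-model", msgs, True)["chat_template_kwargs"])
print(build_body("ollama", "some-model", msgs, False)["options"])
```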

Commands

Command	Action
`/quit`, `/q`	Exit the app
`/switch`	Back to model picker (Discovery)
`/clear`	Clear conversation history
`/export`	Export to markdown (writes `./chat__*.md`)
`/system`	View / edit system prompt
`/think`	Toggle reasoning mode
`/arena`	Multi-model arena (Arena)
`/promptarena`	System-prompt tournament (Prompt-Arena)
`/stress`	Stress testing (Stress-Testing)
`/help`	Show all commands

Keyboard

Key	Action
`Ctrl+D`	Back to model selection
`Ctrl+C`	Quit (with confirmation prompt)

History format

Messages are stored as a flat list of {role, content} dicts, the format that both OpenAI-compatible and Ollama chat endpoints expect:

history = [
    {"role": "system",    "content": "You are…"},
    {"role": "user",      "content": "what is…"},
    {"role": "assistant", "content": "…"},
]

ModelClient.chat_stream(message, history=…) appends the new user message to a copy of history before sending, so the caller's list isn't mutated until the response is fully captured. This matters if a stream errors out mid-response — the next retry sees a clean history.
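
The copy-then-commit behavior can be sketched as follows. The helper names (`build_request_messages`, `commit_turn`) are hypothetical and the failing stream is simulated; the point is only that the caller's list is untouched when a stream fails.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_request_messages(message: str, history: List[Message]) -> List[Message]:
    """Return a new list for the request; the caller's history is not touched."""
    return [*history, {"role": "user", "content": message}]


def commit_turn(history: List[Message], message: str, reply: str) -> None:
    """Record both sides of the exchange once the reply is fully captured."""
    history.append({"role": "user", "content": message})
    history.append({"role": "assistant", "content": reply})


history: List[Message] = [{"role": "system", "content": "You are helpful."}]

# A failed stream: the request list was built, but nothing is committed.
request = build_request_messages("what is 2+2?", history)
print(len(request), len(history))  # 2 1

# A successful stream: commit the turn afterwards.
commit_turn(history, "what is 2+2?", "4")
print([m["role"] for m in history])  # ['system', 'user', 'assistant']
```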

/clear resets the list. /system rewrites the first entry. /export dumps the whole thing to a markdown file with timestamps and metrics inline.

Common surprises

You see…	Cause
TPS looks ridiculously high	Server returned cached response, no real generation
TPS = 0 / `--`	Server didn't report `usage` and the response was empty
Thinking text never closes	Server emitted `<think>` but not `</think>`; happens on truncation
Long pause then full response (no streaming)	Server is buffering — vLLM with certain configs, or Cloudflare
First message takes 30s, rest are instant	Cold model load (Ollama, LM Studio with auto-unload)

Model Chat CLI · MIT · repo · issues · No telemetry · No cloud calls · No surprises
