mtecnic edited this page May 28, 2026 · 1 revision

Chat

The chat view (ui/chat.py) sits on top of client.ModelClient and streams tokens to a rich.live.Live panel while measuring time to first token and decode throughput.


Streaming pipeline

flowchart LR
    A[User input] --> B[history.append user msg]
    B --> C[ModelClient.chat_stream]
    C --> D{server.type}
    D -- openai --> E[_chat_stream_openai<br/>data: SSE]
    D -- ollama --> F[_chat_stream_ollama<br/>NDJSON]
    E --> G[aiter_bytes 64B]
    F --> G
    G --> H[Parse delta / chunk]
    H --> I[think_parser]
    I --> J[Live update Rich panel]
    J --> K[history.append assistant msg]

    style C fill:#0f3460,color:#eee
    style I fill:#e94560,color:#fff

Both backends read response bytes with aiter_bytes(chunk_size=64) rather than aiter_lines(). aiter_lines() holds data back until a complete line has arrived, so tokens would reach the UI in bursts and the stream would feel jerky when the model is generating slowly. Small 64-byte chunks keep the visible cadence close to token-by-token without the per-iteration overhead of reading one byte at a time.
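The chunk-then-reassemble approach can be sketched as follows. This is an illustration under stated assumptions, not the code in ui/chat.py: `iter_lines` and `parse_sse_line` are hypothetical helper names, and the chunks here come from a byte string rather than from an HTTPX response, where they would presumably be produced by `response.aiter_bytes(chunk_size=64)`.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_lines(chunks: Iterable[bytes]) -> Iterator[str]:
    """Reassemble complete lines from arbitrarily sized byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line; keep the unfinished tail in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode("utf-8").rstrip("\r")
    if buffer:
        # Flush whatever is left when the stream ends without a newline.
        yield buffer.decode("utf-8")


def parse_sse_line(line: str) -> Optional[dict]:
    """Return the JSON payload of an SSE `data:` line, or None to skip it."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    return json.loads(payload)


# Simulate a response body arriving in 64-byte chunks.
raw = b'data: {"choices":[{"delta":{"content":"Hello, world"}}]}\n\ndata: [DONE]\n\n'
chunks = [raw[i:i + 64] for i in range(0, len(raw), 64)]
events = [e for e in map(parse_sse_line, iter_lines(chunks)) if e is not None]
```

The same `iter_lines` helper would also serve the Ollama path, where each line is a standalone JSON object and can go straight to `json.loads`.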


Performance metrics

A status line appears under every assistant reply:

  ↳ 32 tok · 0.7s · 45.7 t/s · 142ms ttft
| Field | Meaning |
|---|---|
| tok | Output tokens (best-effort count if not reported by server) |
| s | Decode time, measured from the first content token |
| t/s | tok / s → decode-only TPS, excludes TTFT |
| ttft | Time to first token in ms (request send → first content byte) |

Why decode-only TPS matters

Many tools report tokens-per-second as total_tokens / wall_clock. That is misleading: prefill time gets counted against decode, so a 5-second prefill on a 70B model makes a fast decode look slow. Separating TTFT from decode time shows prefill cost and throughput cost independently, which is critical when you're tuning batch sizes or quantization levels.
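The difference between the two calculations is easy to see with numbers. The sketch below is illustrative only: `decode_tps`, `naive_tps`, and `status_line` are hypothetical names, and the formatting simply mirrors the example status line shown earlier.

```python
from typing import Optional


def decode_tps(tokens: int, decode_seconds: float) -> Optional[float]:
    """Decode-only throughput; None when there is nothing to measure."""
    if tokens <= 0 or decode_seconds <= 0:
        return None
    return tokens / decode_seconds


def naive_tps(tokens: int, ttft_seconds: float, decode_seconds: float) -> Optional[float]:
    """Wall-clock throughput, which folds prefill time into the result."""
    total = ttft_seconds + decode_seconds
    if tokens <= 0 or total <= 0:
        return None
    return tokens / total


def status_line(tokens: int, ttft_seconds: float, decode_seconds: float) -> str:
    """Format the metrics in the same shape as the status line above."""
    tps = decode_tps(tokens, decode_seconds)
    tps_text = f"{tps:.1f}" if tps is not None else "--"
    return (
        f"↳ {tokens} tok · {decode_seconds:.1f}s · "
        f"{tps_text} t/s · {ttft_seconds * 1000:.0f}ms ttft"
    )


line = status_line(tokens=32, ttft_seconds=0.142, decode_seconds=0.7)
slow_prefill_decode = decode_tps(500, 10.0)      # 500 tokens decoded in 10 s
slow_prefill_naive = naive_tps(500, 5.0, 10.0)   # same run, 5 s prefill included
```

With a 5-second prefill, 500 tokens decoded in 10 seconds is 50 t/s of decode throughput, but only about 33 t/s by the wall-clock formula.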

Where the numbers come from

For OpenAI-compatible servers, usage in the final SSE chunk gives token counts. For Ollama, prompt_eval_count, eval_count, prompt_eval_duration, and eval_duration come back on the last NDJSON line (done: true). Both paths land in ChatMetrics:

from dataclasses import dataclass

@dataclass
class ChatMetrics:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    prompt_eval_duration_ns: int = 0   # Ollama prefill
    eval_duration_ns: int = 0          # Ollama decode
    ttft: float = 0.0                  # Time to first content token

When the server doesn't report usage (as with some custom OpenAI-compatible shims), the UI estimates the token count as len(text.split()) × 1.3. The result is marked with ~ in the stats line to show that it is an estimate.
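A rough sketch of how both server payloads and the fallback could map onto these fields is shown below. The function names are hypothetical, the dataclass is repeated only so the block is self-contained, and rounding the estimate to an integer is an assumption; the payload keys (`usage.prompt_tokens`, `usage.completion_tokens`, and Ollama's `*_count` / `*_duration` fields, which are in nanoseconds) are the ones named above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatMetrics:  # repeated here so the sketch runs on its own
    prompt_tokens: int = 0
    completion_tokens: int = 0
    prompt_eval_duration_ns: int = 0
    eval_duration_ns: int = 0
    ttft: float = 0.0


def from_openai_usage(usage: dict, ttft: float) -> ChatMetrics:
    """Map the `usage` object from the final SSE chunk."""
    return ChatMetrics(
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        ttft=ttft,
    )


def from_ollama_final(final: dict, ttft: float) -> ChatMetrics:
    """Map the last NDJSON line (the one with done: true)."""
    return ChatMetrics(
        prompt_tokens=final.get("prompt_eval_count", 0),
        completion_tokens=final.get("eval_count", 0),
        prompt_eval_duration_ns=final.get("prompt_eval_duration", 0),
        eval_duration_ns=final.get("eval_duration", 0),
        ttft=ttft,
    )


def estimate_tokens(text: str) -> int:
    """Fallback when no usage is reported: word count times 1.3 (rounded, by assumption)."""
    return round(len(text.split()) * 1.3)


def server_decode_tps(metrics: ChatMetrics) -> Optional[float]:
    """Decode TPS from server-side timing, when the server provided it."""
    if metrics.eval_duration_ns <= 0:
        return None
    return metrics.completion_tokens / (metrics.eval_duration_ns / 1e9)


final_line = {
    "done": True,
    "prompt_eval_count": 12,
    "eval_count": 90,
    "prompt_eval_duration": 250_000_000,
    "eval_duration": 2_000_000_000,
}
ollama_metrics = from_ollama_final(final_line, ttft=0.3)
```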


Thinking mode

Models that emit reasoning chains (Qwen 3 / 3.5, DeepSeek-R1, GLM Reasoner) ship them in one of two ways:

  1. As <think>…</think> tags inside the streamed content (most fine-tuned models).
  2. As a separate reasoning_content field in the delta (vLLM with Qwen 3 / 3.5).

The client normalizes the second form into the first:

reasoning = delta.get("reasoning_content", "")
if reasoning:
    if not thinking_active:
        yield "<think>"
        thinking_active = True
    yield reasoning

think_parser.py then splits the stream into thinking and content buffers for the UI. The thinking buffer renders in italics; the content buffer renders normally.
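The two steps, normalizing and then splitting, can be sketched end to end. This is not the code in the client or in think_parser.py: `normalize_deltas` and `split_thinking` are hypothetical names, emitting the closing `</think>` when regular content starts (or when the stream ends) is an assumption, and the splitter works on the full text rather than incrementally as a streaming parser would.

```python
from typing import Iterable, Iterator, Tuple

OPEN, CLOSE = "<think>", "</think>"


def normalize_deltas(deltas: Iterable[dict]) -> Iterator[str]:
    """Turn reasoning_content deltas into inline <think> tags."""
    thinking_active = False
    for delta in deltas:
        reasoning = delta.get("reasoning_content") or ""
        content = delta.get("content") or ""
        if reasoning:
            if not thinking_active:
                yield OPEN
                thinking_active = True
            yield reasoning
        if content:
            if thinking_active:
                # Assumption: reasoning ends when regular content begins.
                yield CLOSE
                thinking_active = False
            yield content
    if thinking_active:
        yield CLOSE


def split_thinking(text: str) -> Tuple[str, str]:
    """Split text into (thinking, content); tolerates an unclosed <think>."""
    thinking, content = [], []
    rest = text
    while True:
        start = rest.find(OPEN)
        if start == -1:
            content.append(rest)
            break
        content.append(rest[:start])
        rest = rest[start + len(OPEN):]
        end = rest.find(CLOSE)
        if end == -1:
            # Truncated output: everything after <think> counts as thinking.
            thinking.append(rest)
            break
        thinking.append(rest[:end])
        rest = rest[end + len(CLOSE):]
    return "".join(thinking), "".join(content)


deltas = [
    {"reasoning_content": "2 + 2 "},
    {"reasoning_content": "is 4."},
    {"content": "The answer "},
    {"content": "is 4."},
]
stream_text = "".join(normalize_deltas(deltas))
thinking_text, content_text = split_thinking(stream_text)
```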

Toggle with /think. When on, the stats line gains a thinking-token count:

  ↳ 234 tok · 412 think · 5.2s · 45.0 t/s · 320ms ttft

Enable-thinking flag

For models where the server controls reasoning (Qwen 3 / 3.5 on vLLM), the client passes enable_thinking through:

| Server | Where it goes |
|---|---|
| OpenAI | chat_template_kwargs.enable_thinking in body |
| Ollama | options.enable_thinking in body |

If your model doesn't support either, the flag is silently ignored — toggling /think then only affects how <think> tags in the stream are rendered.
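As a sketch of the placement described above, a request body could be assembled like this. `build_body` is a hypothetical helper, and the other body fields (`model`, `messages`, `stream`) are the usual chat request fields rather than something taken from the client's source.

```python
from typing import Any, Dict, List


def build_body(
    server_type: str,
    model: str,
    messages: List[Dict[str, str]],
    enable_thinking: bool,
) -> Dict[str, Any]:
    """Build a streaming chat request body with the thinking flag in place."""
    body: Dict[str, Any] = {"model": model, "messages": messages, "stream": True}
    if server_type == "openai":
        # OpenAI-compatible servers: flag goes into the chat template kwargs.
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    elif server_type == "ollama":
        # Ollama: flag goes into the options object.
        body["options"] = {"enable_thinking": enable_thinking}
    else:
        raise ValueError(f"unknown server type: {server_type!r}")
    return body


messages = [{"role": "user", "content": "hi"}]
openai_body = build_body("openai", "example-model", messages, enable_thinking=False)
ollama_body = build_body("ollama", "example-model", messages, enable_thinking=True)
```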


Commands

| Command | Action |
|---|---|
| /quit, /q | Exit the app |
| /switch | Back to model picker (see Discovery) |
| /clear | Clear conversation history |
| /export | Export to markdown (writes ./chat__*.md) |
| /system | View / edit system prompt |
| /think | Toggle reasoning mode |
| /arena | Multi-model arena (see Arena) |
| /promptarena | System-prompt tournament (see Prompt-Arena) |
| /stress | Stress testing (see Stress-Testing) |
| /help | Show all commands |

Keyboard

| Key | Action |
|---|---|
| Ctrl+D | Back to model selection |
| Ctrl+C | Quit (with confirmation prompt) |

History format

Messages are stored as a flat list of {role, content} dicts, the format that OpenAI-style chat completion APIs and Ollama's chat endpoint both expect:

history = [
    {"role": "system",    "content": "You are…"},
    {"role": "user",      "content": "what is…"},
    {"role": "assistant", "content": "…"},
]

ModelClient.chat_stream(message, history=…) appends the new user message to a copy of history before sending, so the caller's list isn't mutated until the response is fully captured. This matters if a stream errors out mid-response — the next retry sees a clean history.
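The copy-on-send behavior can be illustrated with a small stand-in. Everything here is a sketch under assumptions: `chat_stream` is a free function rather than the real ModelClient method, `run_turn` and the two fake backends are hypothetical, and a `ConnectionError` stands in for whatever a dropped stream would actually raise.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, List

Message = Dict[str, str]
Backend = Callable[[List[Message]], AsyncIterator[str]]


async def chat_stream(
    message: str, history: List[Message], backend: Backend
) -> AsyncIterator[str]:
    # Send a copy with the new user message; the caller's list is untouched.
    outgoing = [*history, {"role": "user", "content": message}]
    async for token in backend(outgoing):
        yield token


async def run_turn(message: str, history: List[Message], backend: Backend) -> bool:
    """Commit the turn to history only after the stream completes."""
    parts: List[str] = []
    try:
        async for token in chat_stream(message, history, backend):
            parts.append(token)
    except ConnectionError:
        return False  # history still clean, safe to retry
    history.append({"role": "user", "content": message})
    history.append({"role": "assistant", "content": "".join(parts)})
    return True


async def ok_backend(messages: List[Message]) -> AsyncIterator[str]:
    for token in ("Hel", "lo"):
        yield token


async def broken_backend(messages: List[Message]) -> AsyncIterator[str]:
    yield "Hel"
    raise ConnectionError("stream dropped")


history: List[Message] = [{"role": "system", "content": "You are helpful."}]
failed = asyncio.run(run_turn("hi", history, broken_backend))
len_after_failure = len(history)
succeeded = asyncio.run(run_turn("hi", history, ok_backend))
```

After the failed attempt the history still holds only the system message, so the retry sends exactly one user message instead of two.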

/clear resets the list. /system rewrites the first entry. /export dumps the whole thing to a markdown file with timestamps and metrics inline.


Common surprises

| You see… | Cause |
|---|---|
| TPS looks ridiculously high | Server returned cached response, no real generation |
| TPS = 0 / -- | Server didn't report usage and the response was empty |
| Thinking text never closes | Server emitted <think> but not </think>; happens on truncation |
| Long pause then full response (no streaming) | Server is buffering: vLLM with certain configs, or Cloudflare |
| First message takes 30s, rest are instant | Cold model load (Ollama, LM Studio with auto-unload) |
