Chat
The chat view (ui/chat.py) sits on top of client.ModelClient and streams tokens to a rich.live.Live panel while measuring time-to-first-token and decode throughput separately.
```mermaid
flowchart LR
    A[User input] --> B[history.append user msg]
    B --> C[ModelClient.chat_stream]
    C --> D{server.type}
    D -- openai --> E[_chat_stream_openai<br/>data: SSE]
    D -- ollama --> F[_chat_stream_ollama<br/>NDJSON]
    E --> G[aiter_bytes 64B]
    F --> G
    G --> H[Parse delta / chunk]
    H --> I[think_parser]
    I --> J[Live update Rich panel]
    J --> K[history.append assistant msg]
    style C fill:#0f3460,color:#eee
    style I fill:#e94560,color:#fff
```
Both backends read response bytes with aiter_bytes(chunk_size=64) rather than aiter_lines(). Line buffering inside HTTPX would otherwise hold output back until a full line has arrived, which breaks the token-by-token cadence and makes streaming feel jerky on slow tokens. 64-byte chunks keep the visible stream prompt without spending CPU on very small reads.
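Reading fixed-size byte chunks means the client has to reassemble records itself, because a chunk boundary can fall in the middle of an SSE or NDJSON line. The sketch below shows one way to do that. `LineAssembler` and `parse_sse_line` are hypothetical names, not taken from client.py, and the byte stream is simulated rather than read from HTTPX.

```python
import json


class LineAssembler:
    """Reassemble newline-delimited records from arbitrary byte chunks."""

    def __init__(self) -> None:
        self._buf = b""

    def feed(self, chunk: bytes) -> list[str]:
        """Add a chunk and return every complete, non-blank line it finished."""
        self._buf += chunk
        # Everything before the last newline is complete; the tail stays buffered.
        *complete, self._buf = self._buf.split(b"\n")
        return [line.decode("utf-8").rstrip("\r") for line in complete if line.strip()]


def parse_sse_line(line: str):
    """Return the JSON payload of a `data:` line, or None for anything else."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":  # end-of-stream sentinel used by OpenAI-style servers
        return None
    return json.loads(payload)


# Simulated response body, sliced into 64-byte chunks like aiter_bytes(chunk_size=64).
raw = (
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    b'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    b"data: [DONE]\n\n"
)
assembler = LineAssembler()
tokens = []
for i in range(0, len(raw), 64):
    for line in assembler.feed(raw[i:i + 64]):
        event = parse_sse_line(line)
        if event is not None:
            tokens.append(event["choices"][0]["delta"].get("content", ""))

print("".join(tokens))
```

Decoding only complete lines also avoids splitting a multi-byte UTF-8 character across two chunks. For NDJSON the same assembler applies, with `json.loads(line)` in place of `parse_sse_line`.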
A status line appears under every assistant reply:
↳ 32 tok · 0.7s · 45.7 t/s · 142ms ttft
| Field | Meaning |
|---|---|
| tok | Output tokens (best-effort count if not reported by server) |
| s | Decode time; starts on first content token |
| t/s | tok / s → true decode TPS, excludes TTFT |
| ttft | Time-to-first-token in ms (request send → first content byte) |
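The fields above can be captured with two timestamps beyond the request time. The following is a minimal sketch, not the code in ui/chat.py: `StreamTimer` and `format_stats` are hypothetical names, and the clock is injectable only so the example is deterministic.

```python
import time
from typing import Callable, Optional


class StreamTimer:
    """Track TTFT and decode time separately for one streamed reply."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self._sent: Optional[float] = None
        self._first: Optional[float] = None
        self._last: Optional[float] = None

    def request_sent(self) -> None:
        self._sent = self._clock()

    def content_received(self) -> None:
        now = self._clock()
        if self._first is None:
            self._first = now  # decode time starts on the first content token
        self._last = now

    @property
    def ttft(self) -> float:
        """Seconds from request send to first content."""
        if self._sent is None or self._first is None:
            return 0.0
        return self._first - self._sent

    @property
    def decode_time(self) -> float:
        """Seconds from first content to the most recent content."""
        if self._first is None or self._last is None:
            return 0.0
        return self._last - self._first


def format_stats(tokens: int, timer: StreamTimer) -> str:
    """Render the four fields in the order the status line uses."""
    tps = tokens / timer.decode_time if timer.decode_time > 0 else 0.0
    return (f"↳ {tokens} tok · {timer.decode_time:.1f}s · "
            f"{tps:.1f} t/s · {timer.ttft * 1000:.0f}ms ttft")


# Fake clock: request at 0.000s, first content at 0.142s, last content at 0.842s.
ticks = iter([0.0, 0.142, 0.842])
timer = StreamTimer(clock=lambda: next(ticks))
timer.request_sent()
timer.content_received()
timer.content_received()
print(format_stats(32, timer))
```

With the default clock, `request_sent()` would be called just before the HTTP request goes out and `content_received()` once per parsed content delta.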
Many tools report tokens-per-second as total_tokens / wall_clock. That's wrong: a 5-second prefill on a 70B model makes a fast decode look slow. By separating TTFT and decode, you can see prefill cost vs throughput cost separately — critical when you're tuning batch sizes or quant levels.
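A small worked example makes the difference concrete. The numbers are invented for illustration (200 output tokens, a 5-second prefill, 9 seconds of wall clock in total), and the function names are hypothetical.

```python
def naive_tps(tokens: int, wall_clock_s: float) -> float:
    """Tokens divided by total wall-clock time, prefill included."""
    return tokens / wall_clock_s


def decode_tps(tokens: int, wall_clock_s: float, ttft_s: float) -> float:
    """Tokens divided by decode time only (wall clock minus TTFT)."""
    decode_s = wall_clock_s - ttft_s
    return tokens / decode_s if decode_s > 0 else 0.0


tokens, ttft_s, wall_clock_s = 200, 5.0, 9.0
print(f"naive:  {naive_tps(tokens, wall_clock_s):.1f} t/s")
print(f"decode: {decode_tps(tokens, wall_clock_s, ttft_s):.1f} t/s")
```

Here the naive figure is 22.2 t/s while the decode figure is 50.0 t/s: the same run looks less than half as fast when prefill is folded into the denominator.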
For OpenAI-compatible servers, usage in the final SSE chunk gives token counts. For Ollama, prompt_eval_count, eval_count, prompt_eval_duration, and eval_duration come back on the last NDJSON line (done: true). Both paths land in ChatMetrics:
```python
from dataclasses import dataclass


@dataclass
class ChatMetrics:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    prompt_eval_duration_ns: int = 0  # Ollama prefill
    eval_duration_ns: int = 0         # Ollama decode
    ttft: float = 0.0                 # Time to first content token
```

When the server doesn't report usage (as with some custom OpenAI-compatible shims), the UI estimates the count as len(text.split()) × 1.3. The result is explicitly an estimate and is marked with ~ in the stats line.
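As a rough illustration of how both payload shapes could land in the same dataclass, here is a sketch. The helper names `metrics_from_openai`, `metrics_from_ollama` and `estimate_tokens` are hypothetical, and rounding the estimate to the nearest integer is an assumption; the payload field names are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class ChatMetrics:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    prompt_eval_duration_ns: int = 0
    eval_duration_ns: int = 0
    ttft: float = 0.0


def metrics_from_openai(final_chunk: dict) -> ChatMetrics:
    """Read token counts from the `usage` object of the final SSE chunk, if present."""
    usage = final_chunk.get("usage") or {}
    return ChatMetrics(
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
    )


def metrics_from_ollama(last_line: dict) -> ChatMetrics:
    """Read counts and durations (nanoseconds) from the `done: true` NDJSON line."""
    return ChatMetrics(
        prompt_tokens=last_line.get("prompt_eval_count", 0),
        completion_tokens=last_line.get("eval_count", 0),
        prompt_eval_duration_ns=last_line.get("prompt_eval_duration", 0),
        eval_duration_ns=last_line.get("eval_duration", 0),
    )


def estimate_tokens(text: str) -> int:
    """Fallback when no usage is reported: word count × 1.3 (rounding is an assumption)."""
    return round(len(text.split()) * 1.3)


# Invented sample payloads for illustration.
ollama_done = {
    "done": True,
    "prompt_eval_count": 12,
    "eval_count": 32,
    "prompt_eval_duration": 142_000_000,
    "eval_duration": 700_000_000,
}
m = metrics_from_ollama(ollama_done)
print(f"{m.completion_tokens / (m.eval_duration_ns / 1e9):.1f} t/s")

no_usage = metrics_from_openai({"choices": []})
if no_usage.completion_tokens == 0:
    print(f"~{estimate_tokens('one two three four five six seven eight nine ten')} tok")
```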
Models that emit reasoning chains (Qwen 3 / 3.5, DeepSeek-R1, GLM Reasoner) ship them in one of two ways:
- As <think>…</think> tags inside the streamed content (most fine-tuned models).
- As a separate reasoning_content field in the delta (vLLM with Qwen 3 / 3.5).
The client normalizes the second form into the first:
```python
# Excerpt from inside the streaming generator
reasoning = delta.get("reasoning_content", "")
if reasoning:
    if not thinking_active:
        yield "<think>"  # open the tag once, on the first reasoning delta
        thinking_active = True
    yield reasoning
```

think_parser.py then splits the stream into thinking and content buffers for the UI. The thinking buffer renders in italics; the content buffer renders normally.
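The two stages can be sketched end to end as follows. This is not the code in client.py or think_parser.py: `normalize_deltas` and `ThinkSplitter` are hypothetical names, and emitting `</think>` as soon as regular content resumes is an assumption about how the tag gets closed. The splitter also holds back a trailing partial tag, since a tag can be split across stream chunks.

```python
from typing import Iterable, Iterator


def normalize_deltas(deltas: Iterable[dict]) -> Iterator[str]:
    """Turn reasoning_content deltas into inline <think>…</think> text."""
    thinking_active = False
    for delta in deltas:
        reasoning = delta.get("reasoning_content") or ""
        content = delta.get("content") or ""
        if reasoning:
            if not thinking_active:
                yield "<think>"
                thinking_active = True
            yield reasoning
        if content:
            if thinking_active:
                yield "</think>"  # assumption: close when regular content resumes
                thinking_active = False
            yield content


class ThinkSplitter:
    """Route streamed text into `thinking` and `content` buffers."""

    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self) -> None:
        self.thinking = ""
        self.content = ""
        self._pending = ""
        self._in_think = False

    def feed(self, text: str) -> None:
        self._pending += text
        while self._pending:
            tag = self.CLOSE if self._in_think else self.OPEN
            idx = self._pending.find(tag)
            if idx != -1:
                self._emit(self._pending[:idx])
                self._pending = self._pending[idx + len(tag):]
                self._in_think = not self._in_think
                continue
            # No full tag: keep a trailing partial tag (e.g. "<thi") for the next chunk.
            keep = 0
            for n in range(min(len(tag) - 1, len(self._pending)), 0, -1):
                if self._pending.endswith(tag[:n]):
                    keep = n
                    break
            cut = len(self._pending) - keep
            self._emit(self._pending[:cut])
            self._pending = self._pending[cut:]
            break

    def _emit(self, text: str) -> None:
        if self._in_think:
            self.thinking += text
        else:
            self.content += text


splitter = ThinkSplitter()
for piece in normalize_deltas([
    {"reasoning_content": "pl"},
    {"reasoning_content": "an"},
    {"content": "Answer"},
]):
    splitter.feed(piece)
print(repr(splitter.thinking), repr(splitter.content))
```

If a stream is truncated before `</think>` arrives, this splitter simply stays in thinking mode, so everything received so far remains in the thinking buffer.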
Toggle reasoning mode with /think. When it is on, the stats line gains a thinking-token count:
↳ 234 tok · 412 think · 5.2s · 45.0 t/s · 320ms ttft
For models where the server controls reasoning (Qwen 3 / 3.5 on vLLM), the client passes enable_thinking through:
| Server | Where it goes |
|---|---|
| OpenAI | chat_template_kwargs.enable_thinking in body |
| Ollama | options.enable_thinking in body |
If your model doesn't support either, the flag is silently ignored — toggling /think then only affects how <think> tags in the stream are rendered.
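The placement in the table can be sketched as a small request-body builder. `build_body` is a hypothetical helper, the model name is a placeholder, and the real client may assemble its payload differently; only the two `enable_thinking` locations come from the table above.

```python
def build_body(server_type: str, model: str, messages: list[dict],
               enable_thinking: bool) -> dict:
    """Build a streaming chat request body with the thinking flag in the right place."""
    body: dict = {"model": model, "messages": messages, "stream": True}
    if server_type == "openai":
        body["chat_template_kwargs"] = {"enable_thinking": enable_thinking}
    elif server_type == "ollama":
        body["options"] = {"enable_thinking": enable_thinking}
    else:
        raise ValueError(f"unknown server type: {server_type!r}")
    return body


messages = [{"role": "user", "content": "hi"}]
print(build_body("openai", "placeholder-model", messages, enable_thinking=False))
print(build_body("ollama", "placeholder-model", messages, enable_thinking=True))
```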
| Command | Action |
|---|---|
| /quit, /q | Exit the app |
| /switch | Back to model picker (Discovery) |
| /clear | Clear conversation history |
| /export | Export to markdown (writes ./chat__*.md) |
| /system | View / edit system prompt |
| /think | Toggle reasoning mode |
| /arena | Multi-model arena (Arena) |
| /promptarena | System-prompt tournament (Prompt-Arena) |
| /stress | Stress testing (Stress-Testing) |
| /help | Show all commands |
| Key | Action |
|---|---|
| Ctrl+D | Back to model selection |
| Ctrl+C | Quit (with confirmation prompt) |
Messages are stored as a flat list of {role, content} dicts — the format every chat completion API expects:
```python
history = [
    {"role": "system", "content": "You are…"},
    {"role": "user", "content": "what is…"},
    {"role": "assistant", "content": "…"},
]
```

ModelClient.chat_stream(message, history=…) appends the new user message to a copy of history before sending, so the caller's list isn't mutated until the response is fully captured. This matters if a stream errors out mid-response: the next retry sees a clean history.
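The copy-then-commit pattern described here can be sketched as follows. The helpers `build_request_messages`, `fake_stream` and `ask` are hypothetical stand-ins, and `fake_stream` replaces the real network call so the failure case can be shown deterministically.

```python
def build_request_messages(message: str, history: list[dict]) -> list[dict]:
    """Return a new list with the user message appended; `history` is untouched."""
    messages = list(history)  # shallow copy is enough, entries are not modified
    messages.append({"role": "user", "content": message})
    return messages


def fake_stream(messages: list[dict], fail: bool) -> str:
    """Stand-in for the streaming call; raises to simulate a dropped connection."""
    if fail:
        raise ConnectionError("stream dropped mid-response")
    return "Paris."


def ask(message: str, history: list[dict], fail: bool = False) -> str:
    messages = build_request_messages(message, history)
    reply = fake_stream(messages, fail)  # may raise before history is touched
    # Commit both turns only after the full reply has been captured.
    history.append({"role": "user", "content": message})
    history.append({"role": "assistant", "content": reply})
    return reply


history = [{"role": "system", "content": "You are a helpful assistant."}]
try:
    ask("Capital of France?", history, fail=True)
except ConnectionError:
    pass
print(len(history))  # still only the system message

ask("Capital of France?", history)
print([m["role"] for m in history])
```

Because the failed attempt never reached the commit step, the retry sends exactly one copy of the user message rather than two.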
/clear resets the list. /system rewrites the first entry. /export dumps the whole thing to a markdown file with timestamps and metrics inline.
| You see… | Cause |
|---|---|
| TPS looks ridiculously high | Server returned cached response, no real generation |
| TPS = 0 / -- | Server didn't report usage and the response was empty |
| Thinking text never closes | Server emitted <think> but not </think>; happens on truncation |
| Long pause then full response (no streaming) | Server is buffering (vLLM with certain configs, or Cloudflare) |
| First message takes 30s, rest are instant | Cold model load (Ollama, LM Studio with auto-unload) |