quiet-node · Jun 18, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -69,7 +69,7 @@ User-facing reference for all commands lives in `docs/commands.md`. **Any new sl
 ### Backend (`src-tauri/src/`)
 
 - **`lib.rs`**: app setup: loads `AppConfig` via `config::load`, converts window to NSPanel (fullscreen overlay), registers tray, spawns hotkey listener, spawns the engine runner actor, intercepts close events (hides instead of quits), and on `RunEvent::Exit` kills the engine sidecar and awaits its confirmed exit so no orphan `llama-server` survives quit
-- **`config/`**: typed TOML-backed application configuration. Loaded once at startup from `~/Library/Application Support/com.quietnode.thuki/config.toml` (seeded with defaults on first run), installed as Tauri managed state, exposed to the frontend via the `get_config` command. Every subsystem that needs model, prompt, window, activation, or quote values reads from `State<AppConfig>`. The `[inference]` section holds `active_provider`, `num_ctx`, `keep_warm_inactivity_minutes` (Ollama only), `idle_unload_minutes` (built-in engine only), and the typed providers list (`[[inference.providers]]`, each `{id, kind, label, base_url, model, vision}`; `kind` is `builtin`, `ollama`, or `openai`, anything else is dropped on load). Fresh installs default `active_provider` to `builtin`; the loader pins any pre-providers config (no `[[inference.providers]]` array) to `ollama`, because no working built-in provider existed when that file was written. The loader also migrates a legacy flat `ollama_url` onto a synthesized Ollama provider, and `config/migrate.rs` folds the legacy SQLite `active_model` onto the active provider when it is Ollama-kind. See `docs/configurations.md` for the user-facing schema.
+- **`config/`**: typed TOML-backed application configuration. Loaded once at startup from `~/Library/Application Support/com.quietnode.thuki/config.toml` (seeded with defaults on first run), installed as Tauri managed state, exposed to the frontend via the `get_config` command. Every subsystem that needs model, prompt, window, activation, or quote values reads from `State<AppConfig>`. The `[inference]` section holds `active_provider`, `num_ctx`, `keep_warm_inactivity_minutes` (unified residency knob governing both local providers: the built-in engine's idle-unload timer and Ollama's `keep_alive`; not applicable to OpenAI), and the typed providers list (`[[inference.providers]]`, each `{id, kind, label, base_url, model, vision}`; `kind` is `builtin`, `ollama`, or `openai`, anything else is dropped on load). Fresh installs default `active_provider` to `builtin`; the loader pins any pre-providers config (no `[[inference.providers]]` array) to `ollama`, because no working built-in provider existed when that file was written. The loader also migrates a legacy flat `ollama_url` onto a synthesized Ollama provider, and `config/migrate.rs` folds the legacy SQLite `active_model` onto the active provider when it is Ollama-kind. See `docs/configurations.md` for the user-facing schema.
 - **`commands.rs`**: `ask_model` Tauri command: routes by the active provider's kind. `builtin` resolves the installed model from the manifest, ensures the sidecar is loaded via the engine runner, and streams OpenAI-compatible `/v1/chat/completions` SSE through `openai.rs` (`V1Flavor::Builtin`); `ollama` streams the native `/api/chat` newline-delimited JSON; `openai` streams `/v1` SSE against the provider's `base_url` (`V1Flavor::Remote`). All paths emit the same `StreamChunk` contract via Tauri Channel and read the active provider, the resolved system prompt, and the in-memory `ActiveModelState` from managed state.
 - **`keychain.rs`**: write-only storage for `openai`-provider API keys in the macOS Keychain via the `keyring` crate. The Keychain is the only place keys ever live: they are never written to the TOML config and never returned to the frontend (only existence is queryable via `has_provider_api_key`); the `SecretStore` trait decouples callers from the real Keychain for tests.
 - **`screenshot.rs`** — `capture_full_screen_command` Tauri command: uses CoreGraphics FFI (`CGWindowListCreateImage`) to capture all displays excluding Thuki's own windows, writes a JPEG to a temp dir, and returns the path

diff --git a/docs/configurations.md b/docs/configurations.md
@@ -38,14 +38,13 @@ active_provider = "builtin"
 # system prompt are reused. Raise to fit longer conversations; lower to reduce
 # GPU memory use. Valid range: 2048-1048576.
 num_ctx = 16384
-# Minutes of inactivity before Thuki tells Ollama to release the model.
-# 0 = let Ollama manage (its own 5-minute default applies).
-# -1 = never release. Applies to the Ollama provider only.
+# Minutes of inactivity before Thuki releases the active model from memory.
+# Applies to both local providers (built-in engine and Ollama); not applicable
+# to a remote OpenAI-compatible server.
+# 0 = use the provider's natural short default (~5 min): Ollama defers to its
+#     own timer, the built-in engine applies its own ~5-minute timer.
+# -1 = keep resident forever. Valid range: -1 or 0-1440.
 keep_warm_inactivity_minutes = 0
-# Minutes of inactivity before Thuki stops the built-in engine to free RAM.
-# 0 keeps the model loaded indefinitely for instant first tokens (default).
-# Applies to the built-in engine only. Valid range: 0-1440.
-idle_unload_minutes = 0
 
 # One block per provider. The built-in entry is always present. A provider's
 # selected model lives on its own `model` field (empty until you pick one in
@@ -149,8 +148,7 @@ Upgrading from an older version is automatic: a pre-providers config with a flat
 | :---------------- | :--------- | :------- | :------------------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `active_provider` | `"builtin"` | Yes      | id of a provider    | Which provider receives inference. Must match the `id` of one of the `[[inference.providers]]` entries; an empty or dangling value resets to `builtin`. Exception: a config that predates the providers list is pinned to `ollama` on load, because no working built-in provider existed when that file was written.                                                                                                                                                                                                                                                                                              |
 | `num_ctx`         | `16384`    | Yes      | `[2048, 1048576]`   | Context window size in tokens sent to the active provider with every request. For the built-in engine, the value becomes `--ctx-size` when the `llama-server` process starts, so changing it restarts the engine. For Ollama, warmup and chat share this value so the same runner instance and its cached KV prefix for the system prompt are reused: they must match or Ollama creates a second runner and the warmup saves nothing. Ollama silently clamps this to the model's physical maximum. For OpenAI-compatible providers the value is informational only; the server controls the actual context. Raise to fit longer conversations: each doubling roughly doubles VRAM for the KV cache; lower to reclaim GPU memory. See [Tuning the Context Window](./tuning-context-window.md). |
-| `keep_warm_inactivity_minutes` | `0` | Yes | `-1` or `[0, 1440]` | Minutes of inactivity before Thuki tells Ollama to release the model from VRAM. Applies to the Ollama provider only. `0` means do not manage: Ollama's own 5-minute default applies. `-1` means never release. Raise for longer sessions between uses; lower to reclaim VRAM sooner.                                                                                                                                                                                                                                            |
-| `idle_unload_minutes`          | `0` | Yes | `[0, 1440]`         | Minutes of inactivity before Thuki stops the built-in engine to free RAM. Applies to the built-in engine only; the Ollama provider uses `keep_warm_inactivity_minutes` instead. `0` keeps the model loaded indefinitely so the first token after a pause stays instant. Raise to free RAM on an idle Mac; keep `0` for instant first tokens.                                                                                                                                                                                   |
+| `keep_warm_inactivity_minutes` | `0` | Yes | `-1` or `[0, 1440]` | Minutes of inactivity before Thuki releases the active model from memory. Governs both local providers: the built-in engine stops its sidecar to free RAM, and Ollama is told to release the model from VRAM. Not applicable to a remote OpenAI-compatible server, whose residency Thuki does not manage. `0` uses the provider's natural short default (about 5 minutes): Ollama defers to its own timer, the built-in engine applies its own ~5-minute timer (`DEFAULT_BUILTIN_IDLE_MINUTES`). `-1` keeps the model resident forever. Raise for longer sessions between uses; lower to reclaim memory sooner. |
 
 Each `[[inference.providers]]` block has these fields:
 
@@ -179,7 +177,8 @@ The table below also lists the baked-in safety limits that govern Thuki's commun
 | `VRAM_POLL_INTERVAL_SECS`                   | `5 s`    | No       | Tuning this trades responsiveness against localhost polling load; 5 s is the sweet spot for loopback calls and matches Ollama's internal TTL resolution granularity. | —      | How often Thuki polls Ollama's `/api/ps` to detect VRAM changes made outside Thuki (for example, running `ollama stop` or a TTL expiry). The Settings panel VRAM indicator reflects these changes within one interval. |
 | `ENGINE_HEALTH_DEADLINE_SECS`               | `300 s`  | No       | Engine lifecycle contract: this bounds the worst-case "warming up" wait the UI can show before a start is declared failed, so changing it alters the UX contract rather than tuning a preference. | —      | How long Thuki waits for a freshly spawned built-in engine to pass its `/health` check before giving up and killing the process. Large GGUF models loading from a cold disk can legitimately take minutes, so the deadline is generous. |
 | `ENGINE_HEALTH_POLL_INTERVAL_MS`            | `250 ms` | No       | Pure loopback-load tuning: 250 ms detects readiness promptly without hammering the local server while it is busy loading the model.                                  | —      | How often Thuki probes the built-in engine's `/health` endpoint while it starts up. A `503` answer means the model is still loading and the poll continues; `200` means ready.       |
-| `ENGINE_IDLE_CHECK_INTERVAL_SECS`           | `30 s`   | No       | Internal timer granularity behind the user-facing `idle_unload_minutes` knob; 30 s keeps the unload within a minute-scale setting's precision at negligible cost.    | —      | How often the engine runner checks whether `idle_unload_minutes` of inactivity have elapsed and the built-in engine should be stopped to free RAM.                                   |
+| `ENGINE_IDLE_CHECK_INTERVAL_SECS`           | `30 s`   | No       | Internal timer granularity behind the user-facing `keep_warm_inactivity_minutes` knob; 30 s keeps the unload within a minute-scale setting's precision at negligible cost.    | —      | How often the engine runner checks whether the configured idle window has elapsed and the built-in engine should be stopped to free RAM.                                   |
+| `DEFAULT_BUILTIN_IDLE_MINUTES`              | `5 min`  | No       | The fixed translation of the `keep_warm_inactivity_minutes = 0` sentinel for the built-in engine, not a separate preference. The built-in engine has no external daemon to defer to, so `0` ("use the provider's natural short default") resolves to this value. Users who want a different timeout set `keep_warm_inactivity_minutes` directly (`N` minutes, or `-1` for forever). | —      | The idle window the built-in engine applies when `keep_warm_inactivity_minutes` is `0`. After this many minutes of inactivity the sidecar is stopped to free RAM. |
 | `ENGINE_HEALTH_PROBE_TIMEOUT_SECS`          | `5 s`    | No       | Internal lifecycle contract between the runner and the engine process. A wedged-but-connected server must not park the poll loop forever; loopback probes are normally instant so 5 s is generous. The poll interval and deadline are the user-facing knobs. | —      | How long a single `/health` GET is allowed to take inside the startup poll loop. If the engine has accepted the TCP connection but stopped responding, this timeout causes the probe to return an error (treated as Wait and retried after `ENGINE_HEALTH_POLL_INTERVAL_MS`). |
 | `ENGINE_COMMAND_QUEUE_CAPACITY`             | `64`     | No       | Bounds memory under command bursts; 64 slots is ample for all UI-driven traffic (Ensure, Touch, SetIdleMinutes, Shutdown) under any realistic usage pattern. | —      | Capacity of the bounded `mpsc` channel that carries commands from `EngineHandle` to the runner actor task. Back-pressure from a full queue is not observable in normal use. |
 | `DOWNLOAD_PROGRESS_MIN_INTERVAL_MS`         | `500 ms` | No       | Pure IPC hygiene: a fast local connection can deliver thousands of chunks per second and the UI only needs a few updates per second, so throttling below the UI refresh rate is invisible to the user. | —      | Minimum interval between `Progress` events emitted while a model file downloads. An update is also emitted whenever at least 1% of the file has arrived since the last one, whichever comes first, and a final 100% update always precedes verification. |
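
The unified `keep_warm_inactivity_minutes` semantics documented in this diff can be sketched as a small resolution function. This is a minimal illustration under stated assumptions, not the code in `src-tauri/src/`: `ProviderKind`, `Residency`, and `resolve_residency` are hypothetical names, and returning `None` for out-of-range values is an assumption (the docs only state the valid range, not how the loader reacts). Only the 5-minute value of `DEFAULT_BUILTIN_IDLE_MINUTES` and the meanings of `0`, `-1`, and `N` come from the documentation above.

```rust
/// Value documented in the constants table; everything else here is illustrative.
const DEFAULT_BUILTIN_IDLE_MINUTES: u64 = 5;

/// Hypothetical stand-in for the provider `kind` field (`builtin`, `ollama`, `openai`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProviderKind {
    Builtin,
    Ollama,
    OpenAi,
}

/// Hypothetical result type describing what happens to the loaded model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Residency {
    /// Remote OpenAI-compatible server: Thuki does not manage residency.
    Unmanaged,
    /// Ollama with `0`: defer to Ollama's own timer.
    DeferToProvider,
    /// `-1`: keep the model resident forever.
    KeepForever,
    /// Release the model after this many minutes of inactivity.
    UnloadAfterMinutes(u64),
}

/// Maps a provider kind and a `keep_warm_inactivity_minutes` value to a policy.
/// Returns `None` for values outside `-1` or `0..=1440` (assumed behaviour).
fn resolve_residency(kind: ProviderKind, keep_warm_minutes: i64) -> Option<Residency> {
    // Validate the documented range before looking at the provider kind.
    if !(-1..=1440).contains(&keep_warm_minutes) {
        return None;
    }
    let policy = match (kind, keep_warm_minutes) {
        // The knob is not applicable to a remote server.
        (ProviderKind::OpenAi, _) => Residency::Unmanaged,
        (_, -1) => Residency::KeepForever,
        // `0` is a sentinel: each local provider uses its natural short default.
        (ProviderKind::Builtin, 0) => Residency::UnloadAfterMinutes(DEFAULT_BUILTIN_IDLE_MINUTES),
        (ProviderKind::Ollama, 0) => Residency::DeferToProvider,
        // Any other in-range value is an explicit timeout in minutes.
        (_, n) => Residency::UnloadAfterMinutes(n as u64),
    };
    Some(policy)
}
```

Under these assumptions, `resolve_residency(ProviderKind::Builtin, 0)` resolves to a 5-minute unload window, while the same `0` for Ollama defers to Ollama's own timer. That asymmetry is the reason the docs describe `DEFAULT_BUILTIN_IDLE_MINUTES` as a fixed translation of the sentinel rather than a separate preference.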

diff --git a/docs/tuning-context-window.md b/docs/tuning-context-window.md
@@ -13,7 +13,7 @@ The Context Window value (`num_ctx`) is sent to whichever provider is active:
 - **Built-in engine (the default):** the value is passed to the bundled `llama-server` process as `--ctx-size` when it starts. The context size is fixed for the lifetime of the process, so changing it in Settings restarts the engine (a model reload, a few seconds). The three signals below and the Activity Monitor steps apply unchanged; the `ollama ps` steps do not, so watch Memory Pressure and GPU History instead.
 - **Ollama provider:** everything in this guide applies as written, including the `ollama ps` checks.
 
-The Keep Warm knob is Ollama-only. The built-in engine's counterpart is `idle_unload_minutes` (Settings, or `[inference]` in `config.toml`): minutes of inactivity before Thuki stops the engine to free memory, with `0` meaning keep it loaded indefinitely.
+The Keep Warm knob (`keep_warm_inactivity_minutes`, in Settings or `[inference]` in `config.toml`) governs both local providers: minutes of inactivity before Thuki releases the active model from memory. For the built-in engine it stops the `llama-server` sidecar; for Ollama it sets the `keep_alive`. `0` uses the provider's natural short default (about 5 minutes), and `-1` keeps the model resident forever.
 
 ## Quick vocabulary
 

diff --git a/src-tauri/src/commands.rs b/src-tauri/src/commands.rs
@@ -2849,7 +2849,7 @@ mod tests {
         assert_eq!(messages[1].images.as_ref().unwrap().len(), 1);
     }
 
-    // ─── classify_http_error: Phase B picker hint ────────────────────────────
+    // ─── classify_http_error: capability picker hint ─────────────────────────
 
     #[test]
     fn classify_http_500_appends_picker_hint_when_body_mentions_image() {
@@ -3191,7 +3191,7 @@ mod tests {
     /// Locks the native `/api/chat` wire contract across the routing change:
     /// the exact request body (model, messages, stream, think, options
     /// {temperature, top_p, top_k, num_ctx}, keep_alive) must be identical
-    /// to the pre-routing Phase 1 payload.
+    /// to the payload Thuki sent before provider routing was introduced.
     #[tokio::test]
     async fn ollama_request_body_unchanged() {
         use crate::config::defaults::PROVIDER_KIND_OLLAMA;