quiet-node · Jun 13, 2026 · Jun 10, 2026
diff --git a/.github/workflows/nightly-release.yml b/.github/workflows/nightly-release.yml
@@ -94,6 +94,12 @@ jobs:
       - name: Run all tests with coverage enforcement
         run: bun run test:all:coverage
 
+      - name: Cache llama.cpp sidecar
+        uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830  # v4.3.0
+        with:
+          path: src-tauri/binaries
+          key: llama-cpp-${{ runner.os }}-b9590-b12cb8851ea60433
+
       - name: Build Tauri app
         # VITE_GIT_COMMIT_SHA is set here, not on a separate frontend step, because
         # tauri build runs beforeBuildCommand (bun run build:frontend) internally.

diff --git a/.github/workflows/pr-backend-tests.yml b/.github/workflows/pr-backend-tests.yml
@@ -27,6 +27,20 @@ jobs:
         with:
           tool: cargo-llvm-cov
 
+      - name: Setup Bun
+        uses: oven-sh/setup-bun@0c5077e51419868618aeaa5fe8019c62421857d6  # v2.2.0
+        with:
+          bun-version: 1.3.11
+
+      - name: Cache llama.cpp sidecar
+        uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830  # v4.3.0
+        with:
+          path: src-tauri/binaries
+          key: llama-cpp-${{ runner.os }}-b9590-b12cb8851ea60433
+
+      - name: Fetch llama-server sidecar
+        run: bun run engine:ensure
+
       - name: Run backend tests with coverage
         working-directory: src-tauri
         run: |

diff --git a/.github/workflows/pr-build-validation.yml b/.github/workflows/pr-build-validation.yml
@@ -27,6 +27,15 @@ jobs:
       - name: Install dependencies
         run: bun install --frozen-lockfile
 
+      - name: Cache llama.cpp sidecar
+        uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830  # v4.3.0
+        with:
+          path: src-tauri/binaries
+          key: llama-cpp-${{ runner.os }}-b9590-b12cb8851ea60433
+
+      - name: Fetch llama-server sidecar
+        run: bun run engine:ensure
+
       - name: Security vulnerability scan
         run: |
           AUDIT=$(bun audit 2>&1 || true)

diff --git a/.github/workflows/release-please.yml b/.github/workflows/release-please.yml
@@ -95,6 +95,12 @@ jobs:
       - name: Build frontend
         run: bun run build:frontend
 
+      - name: Cache llama.cpp sidecar
+        uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830  # v4.3.0
+        with:
+          path: src-tauri/binaries
+          key: llama-cpp-${{ runner.os }}-b9590-b12cb8851ea60433
+
       - name: Build Tauri app
         run: bun run build:backend
 

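Reviewer note: the four workflows above repeat the same literal cache key, and the release notes describe it as derived from the pinned version and hash. The sketch below shows one way such a key could be built. It is an illustration only: `cacheKey` is a hypothetical helper, the digest is a placeholder rather than the real asset hash, and treating the key's last segment as a 16-character prefix of the asset sha256 is an assumption the diff does not confirm.

```typescript
// Hypothetical helper: builds a cache key shaped like the one in the workflows.
// ASSUMPTION: the last segment is the first 16 hex characters of the asset sha256.
function cacheKey(runnerOs: string, tag: string, assetSha256: string): string {
  if (!/^[0-9a-f]{64}$/.test(assetSha256)) {
    throw new Error("assetSha256 must be 64 lowercase hex characters");
  }
  return `llama-cpp-${runnerOs}-${tag}-${assetSha256.slice(0, 16)}`;
}

// Placeholder digest for illustration only; this is not the real asset hash.
const placeholderSha256 = "0123456789abcdef".repeat(4);
console.log(cacheKey("macOS", "b9590", placeholderSha256));
```

Deriving the key from the pin in one place would mean a version bump changes every workflow's key together, instead of relying on four manual edits.
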
diff --git a/.gitignore b/.gitignore
@@ -50,3 +50,7 @@ docs/superpowers/
 # SearXNG container dumps upstream defaults here on each start
 sandbox/search-box/searxng/settings.yml.new
 
+
+# Bundled inference engine artifacts (fetched by scripts/ensure-llama-server.ts)
+src-tauri/binaries/
+*.gguf
diff --git a/docs/configurations.md b/docs/configurations.md
@@ -39,6 +39,10 @@ num_ctx = 16384
 # 0 = let Ollama manage (its own 5-minute default applies).
 # -1 = never release. Applies to the Ollama provider only.
 keep_warm_inactivity_minutes = 0
+# Minutes of inactivity before Thuki stops the built-in engine to free RAM.
+# 0 keeps the model loaded indefinitely for instant first tokens (default).
+# Applies to the built-in engine only. Valid range: 0-1440.
+idle_unload_minutes = 0
 
 # One block per provider. The built-in entry is always present. A provider's
 # selected model lives on its own `model` field (empty until you pick one in
@@ -143,6 +147,7 @@ Upgrading from an older version is automatic: a pre-providers config with a flat
 | `active_provider` | `"ollama"` | Yes      | id of a provider    | Which provider receives inference. Must match the `id` of one of the `[[inference.providers]]` entries; an empty or dangling value resets to `ollama`. Phase 1: leave this on `ollama` (the Built-in engine is not available yet).                                                                                                                                                                                                                                                                                              |
 | `num_ctx`         | `16384`    | Yes      | `[2048, 1048576]`   | Context window size in tokens sent to the active provider with every request. Warmup and chat share this value so Ollama reuses the same runner instance and its cached KV prefix for the system prompt: they must match or Ollama creates a second runner and the warmup saves nothing. Ollama silently clamps this to the model's physical maximum. Raise to fit longer conversations: each doubling roughly doubles VRAM for the KV cache; lower to reclaim GPU memory. See [Tuning the Context Window](./tuning-context-window.md). |
 | `keep_warm_inactivity_minutes` | `0` | Yes | `-1` or `[0, 1440]` | Minutes of inactivity before Thuki tells Ollama to release the model from VRAM. Applies to the Ollama provider only. `0` means do not manage: Ollama's own 5-minute default applies. `-1` means never release. Raise for longer sessions between uses; lower to reclaim VRAM sooner.                                                                                                                                                                                                                                            |
+| `idle_unload_minutes`          | `0` | Yes | `[0, 1440]`         | Minutes of inactivity before Thuki stops the built-in engine to free RAM. Applies to the built-in engine only; the Ollama provider uses `keep_warm_inactivity_minutes` instead. `0` keeps the model loaded indefinitely so the first token after a pause stays instant. Raise to free RAM on an idle Mac; keep `0` for instant first tokens.                                                                                                                                                                                   |
 
 Each `[[inference.providers]]` block has these fields:
 
@@ -156,7 +161,7 @@ Each `[[inference.providers]]` block has these fields:
 
 If the active model has been removed from Ollama between launches, Thuki silently falls back to the first installed model the next time you open the picker. If no models are installed at all, the next request surfaces a "Model not found" error with the exact `ollama pull <name>` command to run.
 
-The table below also lists the baked-in safety limits that govern Thuki's communication with the Ollama HTTP API. None are tunable.
+The table below also lists the baked-in safety limits that govern Thuki's communication with the Ollama HTTP API and the lifecycle of the built-in engine process. None are tunable.
 
 | Constant                                    | Default  | Tunable? | Why not tunable                                                                                                                                                         | Bounds | Description                                                                                                                                                                          |
 | :------------------------------------------ | :------- | :------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -166,6 +171,15 @@ The table below also lists the baked-in safety limits that govern Thuki's commun
 | `MAX_OLLAMA_SHOW_BODY_BYTES`                | `4 MiB`  | No       | Defense-in-depth bound on attacker-controlled response body. Same rationale as `MAX_OLLAMA_TAGS_BODY_BYTES`.                                                            | —      | The largest `/api/show` response body Thuki will accept. Full Modelfiles and parameters can be sizable, but 4 MiB is well above any real model; larger responses are rejected.      |
 | `MAX_MODEL_SLUG_LEN`                        | `256 B`  | No       | Defense-in-depth bound on adversarial input. Real Ollama slugs are a handful of characters; capping the length stops malformed values long before any network or DB work. | —      | The longest model slug Thuki will accept from `set_active_model`. Anything longer is rejected immediately by `validate_model_slug`.                                                  |
 | `VRAM_POLL_INTERVAL_SECS`                   | `5 s`    | No       | Tuning this trades responsiveness against localhost polling load; 5 s is the sweet spot for loopback calls and matches Ollama's internal TTL resolution granularity. | —      | How often Thuki polls Ollama's `/api/ps` to detect VRAM changes made outside Thuki (for example, running `ollama stop` or a TTL expiry). The Settings panel VRAM indicator reflects these changes within one interval. |
+| `ENGINE_HEALTH_DEADLINE_SECS`               | `300 s`  | No       | Engine lifecycle contract: this bounds the worst-case "warming up" wait the UI can show before a start is declared failed, so changing it alters the UX contract rather than tuning a preference. | —      | How long Thuki waits for a freshly spawned built-in engine to pass its `/health` check before giving up and killing the process. Large GGUF models loading from a cold disk can legitimately take minutes, so the deadline is generous. |
+| `ENGINE_HEALTH_POLL_INTERVAL_MS`            | `250 ms` | No       | Pure loopback-load tuning: 250 ms detects readiness promptly without hammering the local server while it is busy loading the model.                                  | —      | How often Thuki probes the built-in engine's `/health` endpoint while it starts up. A `503` answer means the model is still loading and the poll continues; `200` means ready.       |
+| `ENGINE_IDLE_CHECK_INTERVAL_SECS`           | `30 s`   | No       | Internal timer granularity behind the user-facing `idle_unload_minutes` knob; 30 s keeps the unload within a minute-scale setting's precision at negligible cost.    | —      | How often the engine runner checks whether `idle_unload_minutes` of inactivity have elapsed and the built-in engine should be stopped to free RAM.                                   |
+| `ENGINE_HEALTH_PROBE_TIMEOUT_SECS`          | `5 s`    | No       | Internal lifecycle contract between the runner and the engine process. A wedged-but-connected server must not park the poll loop forever; loopback probes are normally instant, so 5 s is generous. The poll interval and deadline, not this timeout, shape the startup wait the user actually sees. | —      | How long a single `/health` GET is allowed to take inside the startup poll loop. If the engine has accepted the TCP connection but stopped responding, this timeout causes the probe to return an error (treated as Wait and retried after `ENGINE_HEALTH_POLL_INTERVAL_MS`). |
+| `ENGINE_COMMAND_QUEUE_CAPACITY`             | `64`     | No       | Bounds memory under command bursts; 64 slots is ample for all UI-driven traffic (Ensure, Touch, SetIdleMinutes, Shutdown) under any realistic usage pattern. | —      | Capacity of the bounded `mpsc` channel that carries commands from `EngineHandle` to the runner actor task. Back-pressure from a full queue is not observable in normal use. |
+| `DOWNLOAD_PROGRESS_MIN_INTERVAL_MS`         | `500 ms` | No       | Pure IPC hygiene: a fast local connection can deliver thousands of chunks per second and the UI only needs a few updates per second, so throttling below the UI refresh rate is invisible to the user. | —      | Minimum interval between `Progress` events emitted while a model file downloads. An update is also emitted whenever at least 1% of the file has arrived since the last one, whichever comes first, and a final 100% update always precedes verification. |
+| `MAX_HF_API_BODY_BYTES`                     | `4 MiB`  | No       | Defense-in-depth bound on attacker-controlled data from a remote service, mirroring `MAX_OLLAMA_TAGS_BODY_BYTES`. | —      | The largest Hugging Face API response body (repo file listings) Thuki will accept while resolving a model to download. Larger responses are rejected mid-stream and the request returns an error. |
+| `HF_API_TIMEOUT_SECS`                       | `15 s`   | No       | Protocol cap on a hung remote service so the download UI cannot stall on metadata resolution; 15 s is generous for a small metadata call over the internet. | —      | How long Thuki waits for a Hugging Face API metadata call (repo file listing) to respond before giving up. Applies to resolving pasted repo ids and listing a repo's GGUF files, not to the model download itself. |
+| `HF_BASE_URL`                               | `https://huggingface.co` | No | Single origin for model metadata and downloads; the sha256-pinning and provenance model assume the canonical Hub. Pointing downloads at an arbitrary mirror would bypass the integrity guarantees that make the curated starter registry safe. | — | The Hugging Face origin Thuki uses for all model metadata calls and blob downloads. Every starter in the registry pins a repo at an exact revision and carries a sha256 digest verified on install; those digests are read from this origin and only meaningful against it. |
 
 ### `[prompt]`
 
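Reviewer note: the three `ENGINE_HEALTH_*` rows describe a startup poll loop. The sketch below captures only its decision rule, written in TypeScript for illustration. `nextPollAction` and `PollAction` are hypothetical names, the real runner may be structured differently, and treating every non-`200` status (not just `503`) as "keep waiting" is an assumption.

```typescript
// Constants mirror the values documented in the table above.
const ENGINE_HEALTH_DEADLINE_SECS = 300;
const ENGINE_HEALTH_POLL_INTERVAL_MS = 250;

type PollAction = "ready" | "retry" | "give_up";

// Decides what the poll loop does after one /health probe.
// `status` is the HTTP status, or null when the probe errored or timed out
// (the table says such a probe is treated as Wait).
// ASSUMPTION: any status other than 200 means "still loading".
function nextPollAction(status: number | null, elapsedMs: number): PollAction {
  if (status === 200) return "ready";
  if (elapsedMs >= ENGINE_HEALTH_DEADLINE_SECS * 1000) return "give_up";
  return "retry";
}

// Simulated startup: two "still loading" answers, one failed probe, then ready.
const observed: Array<number | null> = [503, 503, null, 200];
let elapsedMs = 0;
for (const status of observed) {
  const action = nextPollAction(status, elapsedMs);
  console.log(`${elapsedMs} ms: status=${status} -> ${action}`);
  if (action !== "retry") break;
  elapsedMs += ENGINE_HEALTH_POLL_INTERVAL_MS;
}
```

A real loop would sleep for the poll interval between probes and kill the engine process on `give_up`; the simulation advances a counter instead so it runs instantly.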

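Reviewer note: `idle_unload_minutes` and `ENGINE_IDLE_CHECK_INTERVAL_SECS` interact, and a small sketch makes the timing concrete. `shouldUnload` is a hypothetical name and the code is a reading of the documented behaviour, not the project's implementation.

```typescript
// Mirrors the documented check interval.
const ENGINE_IDLE_CHECK_INTERVAL_SECS = 30;

// Returns true when the built-in engine should be stopped to free RAM.
// A setting of 0 keeps the model loaded indefinitely, as documented.
function shouldUnload(
  idleUnloadMinutes: number,
  lastActivityMs: number,
  nowMs: number,
): boolean {
  if (idleUnloadMinutes === 0) return false;
  return nowMs - lastActivityMs >= idleUnloadMinutes * 60_000;
}

// With a 1-minute setting and the last activity 10 s after a check, find the
// first 30 s tick at which the engine would be stopped.
const lastActivityMs = 10_000;
let unloadTickMs = 0;
while (!shouldUnload(1, lastActivityMs, unloadTickMs)) {
  unloadTickMs += ENGINE_IDLE_CHECK_INTERVAL_SECS * 1000;
}
console.log(`unload at ${unloadTickMs} ms`); // prints "unload at 90000 ms"
```

Because the check only runs on 30 s ticks, the engine stops up to one interval after the configured idle time has elapsed. In this example that is 80 s after the last activity rather than exactly 60 s.
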
diff --git a/docs/release-process.md b/docs/release-process.md
@@ -30,6 +30,26 @@ A backup copy of both keys lives in the private `quiet-node/thuki-confidential`
 
 There is nothing to set up on your laptop. No env vars, no key files, no `.zshrc.local` overrides. New contributors clone the repo and start working.
 
+## Bundled inference engine
+
+Every build embeds llama.cpp's `llama-server` as a Tauri sidecar. The binary and the dylibs it links are fetched and verified by `scripts/ensure-llama-server.ts`, which pins an exact llama.cpp release tag and the sha256 of its macOS arm64 asset; a hash mismatch aborts the build. The script runs automatically before `dev`, `build:backend`, and `build:release`, and is an instant no-op once the pinned version is installed under `src-tauri/binaries/` (gitignored, never committed). CI caches that directory with a key derived from the pinned version and hash, so release builds only hit GitHub's release CDN when the pin changes. Because the script adds an `@loader_path/../Frameworks` rpath for bundle-time dylib resolution, it ad-hoc re-signs the binary and each dylib after the edit.
+
+Deferred: Developer ID re-signing, deep-signing of the nested dylibs, and notarization land as a release-please workflow step when the Apple Developer certificate exists.
+
+### Bumping the pinned llama.cpp version
+
+The pin in `scripts/ensure-llama-server.ts` is two constants. `LLAMA_CPP_TAG` names a published llama.cpp release (for example `b9590`, listed at https://github.com/ggml-org/llama.cpp/releases), and `ASSET_SHA256` is the sha256 of that release's `llama-<tag>-bin-macos-arm64.tar.gz` asset. This is a release pin, not a git commit: llama.cpp's `main` branch moving forward does not affect a pinned build, and a newer release does not make the current one stop working. The pin is updated only when we deliberately adopt a newer engine.
+
+There is no automatic bump, and that is intentional: a new engine version has to clear the manual checks below on real hardware before it ships. Upgrade when there is a concrete reason: a newer model architecture we want to load, a `llama-server` bug or security fix, or a Metal/performance improvement. Otherwise the existing pin keeps working indefinitely.
+
+To bump:
+
+1. Pick the target release on https://github.com/ggml-org/llama.cpp/releases and set `LLAMA_CPP_TAG` to its tag.
+2. Set `ASSET_SHA256` to the macOS arm64 asset's hash. Read it from the GitHub Releases API (the asset's `digest` field) or compute it locally with `shasum -a 256 llama-<tag>-bin-macos-arm64.tar.gz`.
+3. Run `bun run engine:ensure`. It fetches the new asset, verifies the new hash, and re-derives the dylib link closure. If the new release adds, renames, or drops a dylib, the script aborts and names exactly which entries differ from `bundle.macOS.frameworks` in `src-tauri/tauri.conf.json`; update that list to match so the closure check passes.
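+4. Bump the cache key in every workflow that caches `src-tauri/binaries` (`nightly-release.yml`, `pr-backend-tests.yml`, `pr-build-validation.yml`, and `release-please.yml`) so CI does not restore the old binaries from a stale cache.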
+5. Re-run the binary-dependent checks on a real machine: the sidecar spawns and streams a response, and `codesign -vv` is clean on the `llama-server` binary and every bundled dylib.
+
 ## Cutting a release manually (rare)
 
 If for some reason a release must be cut outside of CI (incident response, rolling back a bad release-please commit, etc.), the procedure is:

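Reviewer note: the "hash mismatch aborts the build" check and step 2 of the bump procedure both come down to comparing a sha256 digest. The sketch below shows that comparison with `node:crypto`. `sha256Hex` and `verifyAsset` are hypothetical names, not functions from `scripts/ensure-llama-server.ts`, and stripping a leading `sha256:` prefix is an assumption about how a digest value copied from the GitHub Releases API may be formatted.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: lowercase hex sha256 of a byte buffer.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Throws when the bytes do not match the pinned digest.
// ASSUMPTION: the expected value may carry a "sha256:" prefix, which is stripped.
function verifyAsset(bytes: Uint8Array, expectedSha256: string): void {
  const expected = expectedSha256.replace(/^sha256:/, "").toLowerCase();
  const actual = sha256Hex(bytes);
  if (actual !== expected) {
    throw new Error(`sha256 mismatch: expected ${expected}, got ${actual}`);
  }
}

// Demo on in-memory bytes rather than the real release tarball.
// The digest is the well-known sha256 test vector for the string "abc".
const sample = new TextEncoder().encode("abc");
verifyAsset(
  sample,
  "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
);
console.log("digest matches");
```

Failing loudly with both digests in the message makes a bad pin easy to tell apart from a corrupted download when reading CI logs.
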
diff --git a/package.json b/package.json
@@ -10,12 +10,13 @@
   "homepage": "https://www.thuki.app/",
   "type": "module",
   "scripts": {
-    "dev": "tauri dev",
+    "dev": "bun run engine:ensure && tauri dev",
     "frontend:dev": "vite",
     "generate:commands": "bun scripts/generate-commands.ts",
+    "engine:ensure": "bun scripts/ensure-llama-server.ts",
     "build:frontend": "tsc && vite build",
-    "build:backend": "tauri build --bundles app",
-    "build:release": "tauri build --bundles app -c \"{\\\"bundle\\\":{\\\"createUpdaterArtifacts\\\":true}}\"",
+    "build:backend": "bun run engine:ensure && tauri build --bundles app",
+    "build:release": "bun run engine:ensure && tauri build --bundles app -c \"{\\\"bundle\\\":{\\\"createUpdaterArtifacts\\\":true}}\"",
     "build:all": "bun run build:frontend && bun run build:backend",
     "preview": "vite preview",
     "tauri": "tauri",