44 changes: 29 additions & 15 deletions .github/workflows/deploy-docs.yml
@@ -4,39 +4,53 @@ on:
   push:
     branches: [main]
     paths:
-      - "docs/**"
-  workflow_dispatch:
+      - 'docs/**'
+      - '.github/workflows/deploy-docs.yml'
+  pull_request:
+    paths:
+      - 'docs/**'
+      - '.github/workflows/deploy-docs.yml'

 jobs:
   deploy:
     runs-on: ubuntu-latest
     permissions:
       contents: read
+      deployments: write
+      pull-requests: write
     steps:
       - uses: actions/checkout@v4

-      - uses: pnpm/action-setup@v4
-        with:
-          version: 9
-
       - uses: actions/setup-node@v4
         with:
-          node-version: "22"
-          cache: pnpm
-          cache-dependency-path: docs/pnpm-lock.yaml
+          node-version: 22
+          cache: npm

       - name: Install dependencies
         working-directory: docs
-        run: pnpm install --frozen-lockfile
+        run: npm ci

-      - name: Build
-        working-directory: docs
-        run: pnpm run build
+      - name: Build docs
+        run: cd docs && npm run build

       - name: Deploy to Cloudflare Pages
+        id: deploy
         uses: cloudflare/wrangler-action@v3
         with:
           apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
           accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
-          command: pages deploy docs/build --project-name=attache-docs
+          command: pages deploy docs/build --project-name=attache-docs --branch=${{ github.head_ref || github.ref_name }} --commit-dirty=true
+
+      - name: Comment preview URL on PR
+        if: github.event_name == 'pull_request'
+        uses: actions/github-script@v7
+        with:
+          script: |
+            const url = '${{ steps.deploy.outputs.deployment-url }}';
+            if (url) {
+              github.rest.issues.createComment({
+                owner: context.repo.owner,
+                repo: context.repo.repo,
+                issue_number: context.issue.number,
+                body: `📚 Docs preview: ${url}`
+              });
+            }
63 changes: 63 additions & 0 deletions docs/docs/architecture/agent-orchestration.md
@@ -0,0 +1,63 @@
---
sidebar_label: Agent Orchestration
sidebar_position: 2
---

# Agent Orchestration

Evie Platform's orchestration layer is runtime-agnostic. Any agent that can read files, write files, and commit to Git works as a participant. The coordination layer is the Git repository itself -- not a proprietary protocol, not a message queue, not a shared database.

## Runtime-Agnostic Delegation

The orchestrating agent (typically OpenClaw) delegates tasks to coding agents without coupling to a specific runtime. Claude Code, Codex, Gemini CLI, Aider, or a local model running via Ollama -- all participate through the same interface:

1. The orchestrator writes a brief (a markdown file describing the task, context, and constraints)
2. The coding agent picks up the brief, does the work, and commits the result
3. The orchestrator reads the commit, evaluates the output, and decides what's next

This works because the contract is files and Git, not an API. A coding agent doesn't need a plugin, SDK, or integration to participate. It needs a shell, a file system, and Git.
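To make the contract concrete, here is a minimal sketch of the orchestrator's side of the handoff. The brief format, file paths, and branch name are illustrative, not a fixed schema -- the only real interface is a markdown file and a Git commit.

```typescript
import { $ } from "bun";

// Illustrative brief -- the task, context, and constraints an agent needs.
const brief = `# Task: rewrite auth middleware
## Context
The codebase uses barrel exports; follow that convention.
## Constraints
- No public API changes
- All tests must pass`;

// 1. Write the brief where the coding agent will look for it.
await Bun.write("briefs/auth-rewrite.md", brief);

// 2. The agent works and commits on its own branch (see worktrees below).
// 3. Read the result back through Git -- no SDK, no plugin.
const diff = await $`git diff main...agent/auth-rewrite`.text();
```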

The `evie-orchestrate` bounded context manages the delegation lifecycle: brief generation, session launch, progress monitoring, and result collection. It's the switchboard, not the worker.

## Git as Coordination Layer

Git is the universal coordination layer because every coding agent already speaks it. Branches isolate parallel work. Commits provide atomic checkpoints. Diffs show exactly what changed. Merge conflicts surface when two agents touch the same code.

The orchestrator uses Git worktrees to give each coding agent an isolated copy of the repository. Agents work in parallel on separate branches without stepping on each other. When work completes, the orchestrator evaluates the branch and decides whether to merge, request changes, or discard.
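A sketch of what that isolation looks like in practice, assuming a hypothetical harness layout (`.worktrees/`, `agent/*` branch names) rather than evie-orchestrate's actual conventions:

```typescript
import { $ } from "bun";

// One worktree + one branch per agent task: isolated checkout, shared history.
async function createWorkspace(task: string): Promise<string> {
  const branch = `agent/${task}`;
  const dir = `.worktrees/${task}`;
  await $`git worktree add -b ${branch} ${dir} main`;
  return dir; // hand this directory to the coding agent
}

// After evaluation: merge the branch or discard it, then clean up.
async function teardown(task: string, merge: boolean) {
  if (merge) await $`git merge --no-ff agent/${task}`;
  await $`git worktree remove .worktrees/${task} --force`;
}
```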

This pattern scales to any number of concurrent agents. The coordination overhead is Git's merge machinery, which handles the hard problems (conflict detection, three-way merge, history linearization) that a custom protocol would need to reimplement.

## Multi-Model Evaluation

When output quality matters more than speed, the orchestrator runs blind parallel evaluation from two to three different model providers. See [Blind Multi-Model Evaluation](./design-decisions/blind-multi-model) for the full design decision.

In practice, this applies to:

- **Code review** -- send the same diff to Claude, GPT, and Gemini. Disagreements surface real issues that single-model review would miss.
- **Research synthesis** -- three models independently summarize source material. Overlapping conclusions are high-confidence; divergent conclusions need human judgment.
- **Risk assessment** -- independent security evaluations of a proposed change. Unanimous "safe" is a stronger signal than one model saying "safe."

The orchestrator collects all evaluations before synthesizing a result. Models don't see each other's assessments. This eliminates anchoring and herding biases.
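A minimal sketch of the blind fan-out, with provider clients abstracted behind a common function type -- the names and shapes here are assumptions, not the evie-orchestrate API:

```typescript
type Reviewer = (diff: string) => Promise<string[]>;
type Verdict = { reviewer: string; issues: string[] };

// Fan the same diff out to every model in parallel. Each call is
// independent: no reviewer sees another's assessment, and provider
// identities are reduced to opaque labels before synthesis.
async function blindReview(diff: string, reviewers: Reviewer[]): Promise<Verdict[]> {
  return Promise.all(
    reviewers.map(async (review, i) => ({
      reviewer: `model-${i}`, // blind: identity stripped
      issues: await review(diff),
    })),
  );
}
```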

## Local Reflection

Quantized local models (Llama 3, Mistral, Phi) running on Apple Silicon handle a specific class of work: extracting procedural knowledge from session logs.

After a coding session, the local model reads the session transcript and extracts patterns: "When reviewing TypeScript, always check for unhandled promise rejections." "This codebase uses barrel exports -- follow that convention." These observations become candidate entries for procedural memory (SKILL.md updates or new skills).

Local reflection runs on-device with no API calls. The session logs -- which may contain sensitive code, references to credentials, or internal discussion -- never leave the machine. The local model's output is lower quality than a frontier model's, but the privacy tradeoff is worth it for this use case.
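A sketch of that loop against Ollama's local HTTP API -- the model name, prompt, and file paths are assumptions:

```typescript
// Read a finished session transcript and ask a local model for patterns.
const transcript = await Bun.file("logs/session-2024-06-01.txt").text();

const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  body: JSON.stringify({
    model: "llama3",
    prompt: `Extract reusable procedural rules from this session:\n${transcript}`,
    stream: false,
  }),
});
const { response } = await res.json();

// Candidate entries for procedural memory (e.g. SKILL.md updates).
await Bun.write("memory/candidates.md", response);
```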

The Dream Cycle (overnight consolidation) uses a mix of local and API-based models depending on the task. Privacy-sensitive consolidation runs locally; quality-critical synthesis uses frontier models.

## Tmux as Session Management

Each coding agent runs in a tmux session. Tmux provides the session lifecycle that agent orchestration needs:

- **Named sessions** -- `evie-cc-auth-rewrite`, `evie-codex-review-42`. Find any agent's work by name.
- **Detached execution** -- agents run in the background. The orchestrator launches a session and checks back later.
- **Output capture** -- tmux's scrollback buffer captures the full session transcript for post-hoc analysis and local reflection.
- **Multiplexing** -- multiple agents run concurrently in separate sessions on the same machine.

The dispatch harness creates a tmux session, injects the brief, starts the coding agent, and monitors for completion. When the agent finishes (detected by process exit or a sentinel file), the harness collects the results and notifies the orchestrator.
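A condensed sketch of that dispatch flow, assuming an illustrative session name, a hypothetical agent CLI, and sentinel-file completion detection:

```typescript
import { $ } from "bun";

const session = "evie-cc-auth-rewrite";             // named session
const sentinel = ".worktrees/auth-rewrite/DONE";    // completion signal

// Detached session: launch the agent in its worktree and check back later.
await $`tmux new-session -d -s ${session} -c .worktrees/auth-rewrite`;
await $`tmux send-keys -t ${session} "coding-agent --brief briefs/auth-rewrite.md && touch DONE" Enter`;

// Poll for the sentinel file, then capture the full transcript from
// tmux's scrollback for post-hoc analysis and local reflection.
while (!(await Bun.file(sentinel).exists())) {
  await Bun.sleep(30_000);
}
const transcript = await $`tmux capture-pane -t ${session} -p -S -`.text();
```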

This is intentionally low-tech. Tmux is battle-tested, available on every Unix system, and requires zero infrastructure. The alternative -- a custom daemon managing agent processes -- would add complexity without meaningful benefit.
6 changes: 6 additions & 0 deletions docs/docs/architecture/design-decisions/_category_.json
@@ -0,0 +1,6 @@
{
"label": "Design Decisions",
"position": 4,
"collapsible": true,
"collapsed": false
}
34 changes: 34 additions & 0 deletions docs/docs/architecture/design-decisions/blind-multi-model.md
@@ -0,0 +1,34 @@
---
sidebar_label: Blind Multi-Model Evaluation
---

# Blind Multi-Model Evaluation

## Problem

When an AI agent evaluates its own output, it has a systematic bias toward confirming its own work. A Claude-based agent reviewing Claude-generated code will find fewer issues than an independent reviewer would. Self-evaluation is better than nothing, but it creates a ceiling on quality assurance.

## Options Considered

1. **Single-model self-evaluation** -- the same model that generates output also reviews it
2. **Human review for everything** -- highest quality, doesn't scale
3. **Blind parallel evaluation** -- send the same prompt to two or three different models, compare results without revealing which model produced what

## Decision

Blind parallel evaluation from two to three different model providers (Anthropic, OpenAI, Google). The evaluating models don't know which model produced the original output or which models are co-evaluating.

"Blind" means two things: the evaluating model doesn't know the identity of the model that produced the work, and evaluating models don't see each other's assessments until all responses are collected. This eliminates anchoring bias (where a reviewer defers to a known-good model) and herding (where later reviewers converge on early assessments).

The orchestrator collects all evaluations, then synthesizes a final assessment. Disagreements between models are flagged for human review rather than auto-resolved. When two out of three models agree on an issue, it's likely real. When all three disagree, the uncertainty itself is the signal.
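A sketch of that synthesis rule -- naive string matching stands in for real issue deduplication, which is the hard part noted in the tradeoffs below:

```typescript
type Verdict = { reviewer: string; issues: string[] };

// Majority agreement => likely real; anything short of a majority
// is flagged for human review rather than auto-resolved.
function triage(verdicts: Verdict[]) {
  const counts = new Map<string, number>();
  for (const v of verdicts)
    for (const issue of new Set(v.issues))
      counts.set(issue, (counts.get(issue) ?? 0) + 1);

  const majority = verdicts.length / 2;
  return {
    likelyReal: [...counts].filter(([, n]) => n > majority).map(([i]) => i),
    needsHuman: [...counts].filter(([, n]) => n <= majority).map(([i]) => i),
  };
}
```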

This pattern applies to code review, research synthesis, risk assessment, and any task where confidence in the output matters more than speed.

## Tradeoffs

- **Won.** Eliminates self-evaluation bias. Independent models catch different classes of errors.
- **Won.** Disagreement detection surfaces genuine ambiguity that single-model evaluation would miss.
- **Won.** Model-agnostic by design. Swap providers without changing behavior. If one provider has an outage, the system degrades to two-model evaluation rather than failing.
- **Lost.** Two to three times the API cost per evaluation. Acceptable for high-stakes operations (code review, security assessment), expensive for routine tasks.
- **Lost.** Latency increases -- you wait for the slowest model to respond. Mitigated by parallel execution, but still slower than single-model.
- **Lost.** Synthesizing disagreements is a hard problem. The orchestrator's merge logic is itself a source of potential error.
20 changes: 20 additions & 0 deletions docs/docs/architecture/design-decisions/index.md
@@ -0,0 +1,20 @@
---
sidebar_label: Design Decisions
sidebar_position: 1
---

# Design Decisions

Each page in this section documents a key architectural choice using a consistent format: the problem, the options considered, the decision, and the tradeoffs.

These are not retrospective justifications. They capture the reasoning at the time the decision was made, so future contributors understand the constraints and can revisit decisions when the constraints change.

## Decisions

- **[Why Postgres](./why-postgres)** -- single database with JSONB, pgvector, ParadeDB, TimescaleDB, and pg_trgm. Graph traversal via recursive CTEs.
- **[Why Mac Mini](./why-mac-mini)** -- per-user dedicated hardware, Apple Silicon for local inference, physical data sovereignty.
- **[Why Bun](./why-bun)** -- one language ecosystem, native TypeScript, Python only as escape hatch.
- **[Why Discord](./why-discord)** -- named threads, auto-hide, personal server model. Slack and others come later.
- **[Why Not Neo4j](./why-not-neo4j)** -- heavyweight Java dependency, CTEs outperform AGE by 40x for our query patterns.
- **[Why Local First](./why-local-first)** -- progressive trust, no cloud dependency for core ops, network for enrichment only.
- **[Blind Multi-Model Evaluation](./blind-multi-model)** -- parallel eval from two to three models eliminates self-evaluation bias.
32 changes: 32 additions & 0 deletions docs/docs/architecture/design-decisions/why-bun.md
@@ -0,0 +1,32 @@
---
sidebar_label: Why Bun
---

# Why Bun

## Problem

Evie Platform scripts, skills, and tooling need a runtime. OpenClaw itself runs on Node.js. The question is whether to standardize on Node, adopt Bun, or split between TypeScript and Python.

## Options Considered

1. **Node.js only** -- the runtime OpenClaw already uses
2. **Bun** -- binary drop-in replacement for Node with built-in TypeScript, bundler, and package manager
3. **Python for tooling, Node for runtime** -- common in AI/ML ecosystems
4. **Mixed Bun + Python** -- Bun as primary, Python as escape hatch

## Decision

Bun as the primary runtime for all Evie Platform scripts, skills, and tooling. Python only as an escape hatch for workloads that have no TypeScript equivalent (e.g., OpenCV keyframe extraction from video).

One language ecosystem eliminates the dev-tool drift that comes from maintaining both `pyproject.toml` and `package.json`, both `pip` and `npm`, both `ruff` and `eslint`. Bun is a binary drop-in: it runs TypeScript natively, bundles without a separate tool, and manages packages faster than npm.

Bun's built-in test runner, HTTP server, and file I/O APIs reduce the dependency count for common operations. A skill script that needs to make HTTP calls and parse JSON doesn't need axios or node-fetch.
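For example, a complete skill script under this decision needs no `package.json` at all -- built-in `fetch` and `Bun.write` cover it (the URL is a placeholder):

```typescript
// Run with: bun run status.ts -- no install step, no dependencies.
const res = await fetch("https://api.example.com/status"); // placeholder URL
if (!res.ok) throw new Error(`status check failed: ${res.status}`);

const status = await res.json();
await Bun.write("out/status.json", JSON.stringify(status, null, 2));
console.log(`wrote ${Object.keys(status).length} fields to out/status.json`);
```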

## Tradeoffs

- **Won.** Single language ecosystem for the entire platform. Every contributor needs to know TypeScript, not TypeScript and Python.
- **Won.** Native TypeScript execution -- no compile step, no tsconfig complexity for scripts.
- **Won.** Faster package installs and script startup vs. Node.
- **Lost.** Bun's Node.js compatibility is not 100%. Some npm packages with native addons or Node-specific APIs may not work. Mitigated by falling back to Node for those cases.
- **Lost.** Python's ML/AI library ecosystem is unmatched. The escape hatch exists because some tasks (video processing, specialized ML inference) have no viable TypeScript alternative.
36 changes: 36 additions & 0 deletions docs/docs/architecture/design-decisions/why-discord.md
@@ -0,0 +1,36 @@
---
sidebar_label: Why Discord
---

# Why Discord

## Problem

Evie Platform agents need a messaging surface for human-agent interaction: approval prompts, status updates, conversational commands, and trust-tier escalations. The channel needs to support structured conversations that stay organized over time.

## Options Considered

1. **Slack** -- dominant in enterprise, rich API, but threads are unnamed and don't auto-hide
2. **Discord** -- named threads, auto-hide for inactive threads, strong bot API
3. **Telegram** -- lightweight, good bot API, limited thread support
4. **Signal** -- privacy-first, minimal bot support
5. **Custom web UI** -- full control, high development cost

## Decision

Discord as the V1 messaging channel. Slack, Telegram, and Signal come later as additional surfaces.

Discord's named threads solve a real organizational problem. When your agent opens a thread called "PR Review: auth-middleware-rewrite," you can find it by name, archive it, and come back to it. Slack threads are unnamed replies to a message -- they disappear into the scroll. For an agent that opens dozens of threads per day, the naming matters.

Inactive threads auto-hide after a configurable period. Your channel stays clean without manual archiving. Active conversations surface; finished ones fade.

The personal server model (one Discord server per agent) creates a defensible 1:1 space. Your agent's server is yours. No shared workspace admins, no IT policies restricting bot permissions, no enterprise licensing.

## Tradeoffs

- **Won.** Named threads keep agent conversations organized and searchable.
- **Won.** Auto-hide prevents channel clutter from resolved conversations.
- **Won.** Personal server model -- no dependency on organizational Slack admin permissions.
- **Lost.** Enterprise teams already on Slack face friction adopting a second messaging tool. The Slack integration is planned but not yet built.
- **Lost.** Discord's reputation as a "gaming platform" can create perception issues in professional contexts.
- **Lost.** No built-in email integration. Slack Connect bridges to external parties; Discord doesn't.
32 changes: 32 additions & 0 deletions docs/docs/architecture/design-decisions/why-local-first.md
@@ -0,0 +1,32 @@
---
sidebar_label: Why Local First
---

# Why Local First

## Problem

AI agent platforms face a fundamental tension: cloud services offer convenience and scale, but they require sending your data to someone else's infrastructure. For a personal agent with access to your email, calendar, credentials, and file system, the data sovereignty question is not abstract.

## Options Considered

1. **Cloud-hosted** -- agent runs on managed infrastructure (AWS, GCP, or a SaaS platform)
2. **Hybrid** -- agent runs locally, memory and state stored in cloud
3. **Local-first** -- everything runs on your hardware, network used only for enrichment

## Decision

Local-first architecture with a progressive trust model. The agent sees your data locally. No cloud dependency for core operations. Network connectivity is used for enrichment only: LLM API calls, web fetches, integration syncs.

The progressive trust model means the agent starts with no network access to external services and gains it incrementally as you configure integrations. Your Postgres instance, memory files, knowledge graph, and activity log all live on your Mac. If you disconnect from the internet, the agent still works -- it just can't call LLM APIs or sync external services.
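One illustrative shape for that progressive model -- this configuration is a sketch, not the platform's actual schema:

```typescript
// Hypothetical network allowlist: everything defaults to local-only,
// and each tier is granted explicitly when an integration is configured.
type Access = "blocked" | "enrichment" | "integration";

const networkPolicy: Record<string, Access> = {
  "api.anthropic.com": "enrichment",    // LLM calls, opted in
  "calendar.google.com": "integration", // granted with the calendar sync
};

const defaultAccess: Access = "blocked"; // core operations never need the network
```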

This design aligns with the dedicated Mac mini model. The hardware is yours. The data is yours. The agent process runs under a restricted OS user on hardware you physically control.

## Tradeoffs

- **Won.** Data sovereignty -- your conversations, credentials, and knowledge graph never leave your hardware unless you explicitly configure an integration to sync them.
- **Won.** Latency -- local Postgres queries are faster than round-trips to a cloud database. Memory retrieval and knowledge graph lookups run in single-digit milliseconds.
- **Won.** Availability -- core agent functionality works offline. No dependency on cloud uptime for local operations.
- **Lost.** No automatic backups without configuration. Cloud-hosted solutions handle this by default. You need to set up your own backup strategy (Time Machine, rsync, or Restic).
- **Lost.** No multi-device sync out of the box. Your agent's state lives on one machine. Accessing it from elsewhere requires Tailscale or similar remote access.
- **Lost.** Compute is bounded by your hardware. Cloud solutions can scale up for heavy workloads. A Mac mini has fixed CPU, memory, and storage.
33 changes: 33 additions & 0 deletions docs/docs/architecture/design-decisions/why-mac-mini.md
@@ -0,0 +1,33 @@
---
sidebar_label: Why Mac Mini
---

# Why Mac Mini

## Problem

Every Evie Platform agent needs dedicated compute. The agent runs Docker containers, a Postgres instance, local inference models, and the OpenClaw gateway. It needs to be always-on, physically isolated from your primary workstation, and powerful enough for real-time work.

## Options Considered

1. **Cloud VM** -- AWS EC2, GCP, or a VPS provider
2. **Linux mini PC** -- Intel NUC or similar
3. **Mac mini with Apple Silicon** -- M4 now, M5 when available (May 2026)

## Decision

Dedicated Mac mini per agent, 512 GB SSD minimum. M4 for current deployments, upgrading to M5 when it ships.

The model is the same as Vision Pro: the device is yours. Your agent runs on your hardware, on your desk or in your closet, under your physical control.

Apple Silicon provides the unified memory architecture that makes local inference practical. A Mac mini with 24 GB unified memory can run quantized models (Llama 3, Mistral, Phi) for local reflection tasks without a discrete GPU. The Neural Engine accelerates inference workloads that would require an expensive GPU on x86.

macOS is the native target for OpenClaw. The gateway, tools, and ecosystem assume macOS or Linux -- and macOS has better support for the desktop integration patterns Evie Platform uses (launchd agents, Keychain Access, Shortcuts).

## Tradeoffs

- **Won.** Physical sovereignty -- no cloud provider can access, throttle, or terminate your agent. Data never leaves the box unless the agent explicitly sends it.
- **Won.** Local inference capability -- Apple Silicon's unified memory makes running quantized models practical without a separate GPU budget.
- **Lost.** Higher upfront cost than a cloud VM (though break-even is typically three to five months vs. a comparable EC2 instance).
- **Lost.** macOS-specific. Teams running Linux infrastructure need the [macOS vs. Linux](../macos-vs-linux.md) guide to evaluate the gap.
- **Lost.** Single point of failure without redundancy planning. A dead Mac mini means a dead agent until you replace it.