Detached Claude worker: let Klonode edit itself without severing its own chat stream

## Why this is the critical path

Klonode's whole point is to be a workstation where Claude can improve the workstation itself. That only works if Claude can edit Klonode's own source files **without killing the conversation doing the editing**. Today it can't:

- Claude edits `packages/ui/src/lib/components/ChatPanel/ChatPanel.svelte` → Vite HMR re-mounts the component → the in-flight SSE reader in the browser is torn down → the stream dies → the conversation is lost and has to restart from scratch.
- Claude edits `packages/ui/src/routes/api/chat/stream/+server.ts` → Vite restarts the dev server → the spawned Claude CLI subprocess is killed mid-turn → the SSE stream errors out → Klonode renders "Feil: network error" ("Feil" is Norwegian for "Error").
- Claude edits a store under `packages/ui/src/lib/stores/` → store re-exports invalidate all importers → full page reload → `location.reload()` severs the fetch.

Every path by which Klonode could improve its own UI dies at the first `Edit` tool call.

Workarounds we've tried and why they don't solve it:

- Persisting `chatStore.messages` and `cliSessionIds` to localStorage (PR #66) — survives the reload but loses the in-flight tokens and interrupts whatever file operation was mid-flight
- Dropping `--resume` and always re-spawning (PR #70) — solves a different bug but makes every edit cost a cold restart of the conversation
- Adding a send queue so the user can type follow-ups (PR #72) — polish for the case when Claude is NOT editing the UI, doesn't help when it is

The only real fix is: **the Claude CLI process must outlive Vite**. That means running it as a detached worker the dev server can't kill.

## Architecture

### Core idea

Move the `spawn('claude.exe', [...])` out of the request-handler lifecycle and into a **long-lived worker** managed by Klonode's backend. The worker:

- Lives in its own Node process (or a `pm2` / `forever` / Windows Service handle)
- Has its own stdout/stderr captured to an append-only log file under `.klonode/workers/<worker-id>.log`
- Exposes an IPC channel (named pipe on Windows, unix socket on Mac/Linux, or a localhost-only HTTP endpoint with a random token)
- Persists its state (active Claude session, last message, streaming position) to `.klonode/workers/<worker-id>.state.json`

The Klonode Workstation UI connects to the worker over the IPC channel instead of spawning Claude directly. When Vite HMRs the browser tab, the remounted UI reconnects to the same worker by ID and receives only the events emitted after the last byte offset the old connection acknowledged.

### Minimum viable worker protocol

```
POST /worker/spawn    { repoPath, cwd, systemPrompt, prompt } → { workerId }
GET  /worker/:id/stream?since=<byte-offset>  → SSE of events from offset
POST /worker/:id/stop → aborts (SIGTERM the child)
GET  /worker/:id/status → { alive, lastActivity, tokensConsumed }
```

The worker process keeps a ring buffer of recent events in memory AND appends every event to the disk log. On reconnect the Workstation tails the log from the last offset the browser confirmed.
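The offset-based resume can be sketched as follows. This is a minimal in-memory stand-in for the log, not the worker's real implementation: the `LoggedEvent` shape, the `EventLog` class, and the JSON-lines encoding are all assumptions made for illustration.

```typescript
// Hypothetical event shape; the issue does not define one.
interface LoggedEvent {
  type: string;
  data: string;
}

// In-memory stand-in for the append-only log, indexed by byte offset.
class EventLog {
  private chunks: { start: number; end: number; event: LoggedEvent }[] = [];
  private size = 0;

  // Append one event as a JSON line and return the offset just past it.
  append(event: LoggedEvent): number {
    const line = JSON.stringify(event) + "\n";
    const bytes = new TextEncoder().encode(line).length;
    this.chunks.push({ start: this.size, end: this.size + bytes, event });
    this.size += bytes;
    return this.size;
  }

  // Return every event that starts at or after `since`, plus the offset to resume from next.
  readSince(since: number): { events: LoggedEvent[]; nextOffset: number } {
    const events = this.chunks.filter((c) => c.start >= since).map((c) => c.event);
    return { events, nextOffset: this.size };
  }
}

// Usage: a client that acknowledged the first event resumes with only the second.
const eventLog = new EventLog();
const afterFirst = eventLog.append({ type: "token", data: "Hel" });
eventLog.append({ type: "token", data: "lo" });
const resumed = eventLog.readSince(afterFirst);
console.log(resumed.events.map((e) => e.data).join(""));
```

Using byte offsets rather than event counters means the same number works for both the in-memory buffer and a plain seek into the on-disk log.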

### Surviving HMR

When Vite HMRs ChatPanel.svelte:

1. Svelte remounts the component
2. `onMount` runs, checks `sessionsStore` for an active `workerId`
3. If present, opens an EventSource to `GET /worker/:id/stream?since=<offset>`
4. Events resume streaming from where they left off — no tokens lost, no tool calls retriggered

When Vite restarts the whole dev server (because a store or API route changed):

1. Server process exits — but the detached worker is NOT a child of the Vite process, so it survives
2. Browser reloads, reconnects by `workerId`
3. Same resume path
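Both resume paths reduce to the same client-side bookkeeping: remember the worker ID and the last processed offset, then ask for everything after it. A rough sketch, assuming a hypothetical `ResumeState` record persisted by the tab and an event payload that carries its own end offset (neither is specified in this issue):

```typescript
// Hypothetical shape of what the tab persists per worker (e.g. in sessionsStore).
interface ResumeState {
  workerId: string;
  offset: number; // last byte offset this tab has fully processed
}

// Build the stream URL from the protocol sketched in this issue.
function streamUrl(s: ResumeState): string {
  return `/worker/${encodeURIComponent(s.workerId)}/stream?since=${s.offset}`;
}

// Only ever move the offset forward, so duplicate or late events are ignored.
function acknowledge(s: ResumeState, eventEndOffset: number): ResumeState {
  return eventEndOffset > s.offset ? { ...s, offset: eventEndOffset } : s;
}

// Usage: after a remount, the persisted state yields the URL to reconnect to.
let resumeState: ResumeState = { workerId: "w1", offset: 0 };
resumeState = acknowledge(resumeState, 30);
resumeState = acknowledge(resumeState, 12); // stale event delivered late; ignored
const reconnectUrl = streamUrl(resumeState);
console.log(reconnectUrl);
// In the browser, onMount would then call something like: new EventSource(reconnectUrl)
```

Keeping `acknowledge` monotonic is what prevents tool calls from being replayed if the old and new connections briefly overlap.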

When the detached worker itself crashes (Claude CLI bug, OOM, etc):

1. Worker manager detects the exit (the child's `exit` event in Node; SIGCHLD is not available on Windows)
2. State file is marked `crashed: true` with last-known-good byte offset
3. Browser shows a recoverable error banner with a "resume from last checkpoint" button
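The crash-marking step could look roughly like this. The `WorkerState` fields other than `crashed` are assumptions, as is the write-then-rename approach, which is used here so that a reader never sees a half-written state file.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical state-file shape; only `crashed` and the offset come from the steps above.
interface WorkerState {
  workerId: string;
  crashed: boolean;
  lastGoodOffset: number;
}

// Write to a temp file and rename it into place so the update is all-or-nothing.
function writeState(dir: string, s: WorkerState): string {
  const file = join(dir, `${s.workerId}.state.json`);
  const tmp = `${file}.tmp`;
  writeFileSync(tmp, JSON.stringify(s, null, 2));
  renameSync(tmp, file);
  return file;
}

// Record the crash together with the last offset known to be fully written.
function markCrashed(dir: string, s: WorkerState, lastGoodOffset: number): WorkerState {
  const next: WorkerState = { ...s, crashed: true, lastGoodOffset };
  writeState(dir, next);
  return next;
}

// Usage, in a temp directory standing in for .klonode/workers/.
const stateDir = mkdtempSync(join(tmpdir(), "klonode-workers-"));
const before: WorkerState = { workerId: "w1", crashed: false, lastGoodOffset: 0 };
writeState(stateDir, before);
markCrashed(stateDir, before, 4096);
const onDisk: WorkerState = JSON.parse(readFileSync(join(stateDir, "w1.state.json"), "utf8"));
console.log(onDisk);
```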

### Why a separate Node process (not a worker_thread)

Worker threads die with the Vite parent. Child processes spawned by the Vite handler die with the Vite parent (which is why today's code has this problem). The only thing that survives is a process spawned with `detached: true`, with stdio that is not connected to the parent (`'ignore'` or file descriptors, never pipes), released with `child.unref()`, and with its PID recorded somewhere durable. That has to be a full standalone Node process with its own module graph, not a thread sharing the SvelteKit runtime.
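A minimal sketch of such a spawn using Node's `child_process` API. The `spawnDetached` helper, the file names, and the temp directory are illustrative, and a one-line `node -e` script stands in for the future `klonode-worker` binary.

```typescript
import { spawn } from "node:child_process";
import { closeSync, mkdtempSync, openSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Spawn a process whose lifetime is not tied to the caller's.
function spawnDetached(command: string, args: string[], dir: string, id: string): number | undefined {
  const logFd = openSync(join(dir, `${id}.log`), "a");
  const child = spawn(command, args, {
    detached: true, // own process group (POSIX) / keeps running after the parent exits (Windows)
    windowsHide: true, // no console window on Windows
    stdio: ["ignore", logFd, logFd], // no pipes back to the parent
  });
  child.on("error", (err) => console.error("spawn failed:", err.message));
  child.unref(); // the parent's event loop no longer waits for the child
  closeSync(logFd); // the child holds its own copy of the descriptor
  if (child.pid !== undefined) {
    // Durable PID record so a later boot can find or clean up the worker.
    writeFileSync(join(dir, `${id}.pid`), String(child.pid));
  }
  return child.pid;
}

// Usage: the spawned command is a placeholder for the real worker binary.
const workDir = mkdtempSync(join(tmpdir(), "klonode-workers-"));
const workerPid = spawnDetached(process.execPath, ["-e", "console.log('worker alive')"], workDir, "w1");
const recordedPid = readFileSync(join(workDir, "w1.pid"), "utf8");
console.log("spawned worker pid", workerPid);
```

Passing file descriptors instead of pipes matters here: a child whose stdout is a pipe to the parent can break as soon as the parent goes away.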

### Windows specifics

- Use `detached: true, windowsHide: true, stdio: ['ignore', outputFileFd, outputFileFd]`
- Open a named pipe with `net.createServer` on `\\.\pipe\klonode-worker-<id>` (each backslash is doubled when the path is written as a JS string literal)
- Register the PID in `.klonode/workers/<id>.pid` for cleanup on next boot
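Choosing the IPC endpoint per platform might look like the sketch below. The `ipcPath` helper and the `.sock` location under `.klonode/workers` are assumptions; only the Windows pipe name comes from the list above.

```typescript
// Pick an IPC endpoint for a worker. `platform` is what process.platform would report.
function ipcPath(workerId: string, platform: string, socketDir = ".klonode/workers"): string {
  if (platform === "win32") {
    // Real path is \\.\pipe\klonode-worker-<id>; backslashes are doubled in the literal.
    return `\\\\.\\pipe\\klonode-worker-${workerId}`;
  }
  // Unix domain socket next to the worker's log and state files (assumed location).
  return `${socketDir}/${workerId}.sock`;
}

// Usage: either result can be passed to net.createServer().listen(path).
const winPath = ipcPath("w1", "win32");
const unixPath = ipcPath("w1", "linux");
console.log(winPath);
console.log(unixPath);
```

Unix socket paths have a short length limit (roughly 100 bytes), so a deep repository path may call for a socket directory under the system temp directory instead.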

### Files to touch

**New:**

- `packages/ui/src/lib/workers/worker-client.ts` — browser-side client that spawns, streams, reconnects
- `packages/ui/src/lib/workers/worker-manager.ts` — server-side worker registry / spawner
- `packages/ui/src/lib/workers/worker-protocol.ts` — shared types for the IPC events
- `packages/ui/src/routes/api/worker/spawn/+server.ts` — POST endpoint
- `packages/ui/src/routes/api/worker/[id]/stream/+server.ts` — GET SSE with `?since=` offset
- `packages/ui/src/routes/api/worker/[id]/stop/+server.ts` — POST abort
- `packages/worker/` — NEW package. A standalone Node binary `klonode-worker` that wraps the Claude CLI spawn, writes the log, handles the IPC, and persists state. Built with tsup like the existing CLI package.
- `docs/self-hosting.md` — update to describe the detached-worker flow

**Modified:**

- `packages/ui/src/routes/api/chat/stream/+server.ts` — either delegate to the worker manager or keep as a legacy path for non-self-hosting setups with a warning
- `packages/ui/src/lib/components/ChatPanel/ChatPanel.svelte` — replace direct fetch to `/api/chat/stream` with `workerClient.spawn(...)` + reconnect logic in `onMount`
- `packages/ui/src/lib/stores/agents.ts` — `cliSessionIds` becomes `workerIds: Record<tabId, workerId>`; persisted across reloads as already done for the old session IDs

## Acceptance criteria

- [ ] Open the Workstation, send a long-running task (e.g. "add Kotlin support to the content extractor"), wait for it to start streaming
- [ ] Edit `packages/ui/src/lib/components/ChatPanel/ChatPanel.svelte` manually in your editor while the stream is live
- [ ] Vite HMR fires, the ChatPanel remounts — **the streaming tokens keep flowing** into the new instance with no user-visible interruption
- [ ] Edit `packages/ui/src/routes/api/chat/stream/+server.ts` manually while the stream is live
- [ ] Vite restarts the dev server, browser reloads, ChatPanel remounts — **the same streaming tokens pick up from where they left off** (not a fresh conversation)
- [ ] Kill the Vite dev server completely (Ctrl-C the terminal), restart, reload the browser — **the conversation is still there** because the worker is still running detached
- [ ] `ps` / `tasklist` shows a `node klonode-worker` process even when Vite is down
- [ ] If the worker crashes, the UI shows a recoverable error banner with a "resume from checkpoint" button
- [ ] Works on Windows (named pipes) and macOS / Linux (unix sockets)

## Out of scope for the first PR

- Multi-worker orchestration beyond one worker per tab
- Worker sharing across windows / users
- Authentication beyond "localhost-only + random token"
- GUI for inspecting worker state

## Why this is help-wanted

The IPC transport, log-tailing, and reconnect protocol are well-understood problems with good reference implementations (pm2, forever, tmux, GoTTY). The Klonode-specific part is narrow: wrap the existing Claude CLI spawn in a persistent wrapper and make the ChatPanel reconnect by worker ID instead of spawning fresh. A contributor who has shipped a Node process supervisor before could land the minimum viable version in a weekend.

Sibling issues:

- #64 (Workstation self-introspection) — related but independent
- #62 (Playwright MCP in spawned sessions) — the spawned sessions become detached workers in this model
- #71 (Gemma backend) — the detached worker layer should also wrap Gemma, so the two PRs should be designed to compose

Sibling PRs:

- #66 persisted chat + cliSessionIds (the client-side half of resume)
- #72 send queue (polish for when self-editing is NOT happening)

