Detached Claude worker: let Klonode edit itself without severing its own chat stream #73

@smorchj

Why this is the critical path

Klonode's whole point is to be a workstation where Claude can improve the workstation itself. That only works if Claude can edit Klonode's own source files without killing the conversation doing the editing. Today it can't:

  • Claude edits packages/ui/src/lib/components/ChatPanel/ChatPanel.svelte → Vite HMR re-mounts the component → the in-flight SSE reader in the browser is torn down → the stream dies → the conversation is lost and has to restart from scratch.
  • Claude edits packages/ui/src/routes/api/chat/stream/+server.ts → Vite restarts the dev server → the spawned Claude CLI subprocess is killed mid-turn → the SSE stream errors out → Klonode renders "Feil: network error" ("Feil" is Norwegian for "Error").
  • Claude edits a store under packages/ui/src/lib/stores/ → store re-exports invalidate all importers → full page reload → location.reload() severs the fetch.

Every path by which Klonode could improve its own UI dies at the first Edit tool call.

Workarounds we've tried and why they don't solve it:

The only real fix is: the Claude CLI process must outlive Vite. That means running it as a detached worker the dev server can't kill.

Architecture

Core idea

Move the spawn('claude.exe', [...]) out of the request-handler lifecycle and into a long-lived worker managed by Klonode's backend. The worker:

  • Lives in its own Node process (or a pm2 / forever / Windows Service handle)
  • Has its own stdout/stderr captured to an append-only log file under .klonode/workers/<worker-id>.log
  • Exposes an IPC channel (named pipe on Windows, unix socket on Mac/Linux, or a localhost-only HTTP endpoint with a random token)
  • Persists its state (active Claude session, last message, streaming position) to .klonode/workers/<worker-id>.state.json
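
For the localhost-only HTTP option, the random token could be minted once per worker and checked on every request. This is a minimal sketch, not the project's implementation: createWorkerToken and isAuthorized are hypothetical names, and the 32-byte token length is an assumption.

```typescript
import { randomBytes, timingSafeEqual } from "node:crypto";

// Hypothetical helper: mint a per-worker secret when the worker starts.
// 32 random bytes, hex-encoded, gives a 64-character token.
function createWorkerToken(): string {
  return randomBytes(32).toString("hex");
}

// Compare a presented token against the expected one in constant time.
function isAuthorized(presented: string | undefined, expected: string): boolean {
  if (presented === undefined) return false;
  const a = Buffer.from(presented, "utf8");
  const b = Buffer.from(expected, "utf8");
  // timingSafeEqual throws on a length mismatch, so check the length first.
  return a.length === b.length && timingSafeEqual(a, b);
}

// Usage: the worker keeps the token and rejects requests that lack it.
const workerToken = createWorkerToken();
console.log(isAuthorized(workerToken, workerToken)); // true
console.log(isAuthorized("guess", workerToken)); // false
```

The token would still need to reach the browser somehow (for example through the spawn response), which this sketch leaves open.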

The Klonode Workstation UI connects to the worker over the IPC channel instead of spawning Claude directly. When Vite HMRs the browser tab, the remounted UI reconnects to the same worker by ID and receives only the events emitted after the last byte offset the old connection acknowledged.

Minimum viable worker protocol

POST /worker/spawn    { repoPath, cwd, systemPrompt, prompt } → { workerId }
GET  /worker/:id/stream?since=<byte-offset>  → SSE of events from offset
POST /worker/:id/stop → aborts (SIGTERM the child)
GET  /worker/:id/status → { alive, lastActivity, tokensConsumed }

The worker process keeps a ring buffer of recent events in memory AND appends every event to the disk log. On reconnect the Workstation tails the log from the last offset the browser confirmed.
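
The log-plus-offset idea can be sketched roughly as follows. This assumes one JSON event per line in the log file; the EventLog class, the WorkerEvent shape, and the default buffer capacity are all hypothetical, since the issue does not specify them.

```typescript
import { appendFileSync, closeSync, fstatSync, mkdtempSync, openSync, readSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical event shape; the real protocol types are not specified in this issue.
interface WorkerEvent {
  type: string;
  data: unknown;
}

class EventLog {
  private readonly logPath: string;
  private readonly capacity: number;
  private recent: WorkerEvent[] = [];

  constructor(logPath: string, capacity = 256) {
    this.logPath = logPath;
    this.capacity = capacity;
    closeSync(openSync(logPath, "a")); // make sure the file exists before any read
  }

  // Append one event as a JSON line and keep it in the bounded in-memory buffer.
  append(event: WorkerEvent): void {
    appendFileSync(this.logPath, JSON.stringify(event) + "\n", "utf8");
    this.recent.push(event);
    if (this.recent.length > this.capacity) this.recent.shift();
  }

  // Copy of the most recent events still held in memory.
  recentEvents(): WorkerEvent[] {
    return [...this.recent];
  }

  // Read every complete event stored at or after `offset` (in bytes).
  // Returns the events plus the offset the client should acknowledge next.
  readSince(offset: number): { events: WorkerEvent[]; nextOffset: number } {
    const fd = openSync(this.logPath, "r");
    try {
      const size = fstatSync(fd).size;
      const start = Math.min(Math.max(offset, 0), size);
      const buf = Buffer.alloc(size - start);
      const bytesRead = buf.length > 0 ? readSync(fd, buf, 0, buf.length, start) : 0;
      // Only consume up to the last newline so a half-written event is never parsed.
      const end = buf.subarray(0, bytesRead).lastIndexOf(0x0a) + 1;
      const events = buf
        .subarray(0, end)
        .toString("utf8")
        .split("\n")
        .filter((line) => line.length > 0)
        .map((line) => JSON.parse(line) as WorkerEvent);
      return { events, nextOffset: start + end };
    } finally {
      closeSync(fd);
    }
  }
}

// Usage: a reconnecting client only receives what it has not yet acknowledged.
const demoLog = new EventLog(join(mkdtempSync(join(tmpdir(), "klonode-log-")), "w1.log"));
demoLog.append({ type: "token", data: "Hello" });
const firstRead = demoLog.readSince(0);
demoLog.append({ type: "token", data: " world" });
const secondRead = demoLog.readSince(firstRead.nextOffset);
console.log(secondRead.events.length); // 1
```

Using byte offsets rather than event counters keeps the resume point meaningful even if the worker restarts and has to rebuild its in-memory buffer from the file.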

Surviving HMR

When Vite HMRs ChatPanel.svelte:

  1. Svelte remounts the component
  2. onMount runs, checks sessionsStore for an active workerId
  3. If present, opens an EventSource to GET /worker/:id/stream?since=<offset>
  4. Events resume streaming from where they left off — no tokens lost, no tool calls retriggered
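
The remount steps above could look roughly like this on the client. It is a sketch under assumptions: the storage key layout and the helper names (loadOffset, ackOffset, resumeUrlOnMount) are hypothetical, and the acknowledged offset is assumed to be persisted somewhere that survives a remount, such as sessionStorage.

```typescript
// Minimal storage surface so the sketch also runs outside a browser;
// in the browser, `sessionStorage` satisfies this interface.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Build the stream URL from the worker protocol.
function streamUrl(workerId: string, since: number): string {
  return `/worker/${encodeURIComponent(workerId)}/stream?since=${since}`;
}

// Hypothetical key layout for the persisted offset.
function offsetKey(workerId: string): string {
  return `klonode:worker:${workerId}:offset`;
}

// Last acknowledged byte offset, or 0 if nothing valid is stored.
function loadOffset(store: KeyValueStore, workerId: string): number {
  const raw = store.getItem(offsetKey(workerId));
  const parsed = raw === null ? 0 : Number(raw);
  return Number.isSafeInteger(parsed) && parsed >= 0 ? parsed : 0;
}

// Record an acknowledged offset. Offsets only move forward, so a late
// event from a torn-down connection cannot rewind the cursor.
function ackOffset(store: KeyValueStore, workerId: string, offset: number): void {
  if (Number.isSafeInteger(offset) && offset > loadOffset(store, workerId)) {
    store.setItem(offsetKey(workerId), String(offset));
  }
}

// What onMount would do: return the URL to reconnect to, or null when
// there is no active worker and a fresh spawn is needed.
function resumeUrlOnMount(activeWorkerId: string | null, store: KeyValueStore): string | null {
  return activeWorkerId === null
    ? null
    : streamUrl(activeWorkerId, loadOffset(store, activeWorkerId));
}

// Usage with an in-memory stand-in for sessionStorage.
const memory = new Map<string, string>();
const store: KeyValueStore = {
  getItem: (key) => memory.get(key) ?? null,
  setItem: (key, value) => {
    memory.set(key, value);
  },
};
ackOffset(store, "w1", 2048);
console.log(resumeUrlOnMount("w1", store)); // /worker/w1/stream?since=2048
```

In the browser, the returned URL would be handed to new EventSource(url), and each received event would call ackOffset with the offset the server attached to it.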

When Vite restarts the whole dev server (because a store or API route changed):

  1. Server process exits — but the detached worker is NOT a child of the Vite process, so it survives
  2. Browser reloads, reconnects by workerId
  3. Same resume path

When the detached worker itself crashes (Claude CLI bug, OOM, etc):

  1. Worker manager detects SIGCHLD / exit
  2. State file is marked crashed: true with last-known-good byte offset
  3. Browser shows a recoverable error banner with a "resume from last checkpoint" button
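
Marking the state file on a crash might be sketched like this. Only the crashed flag and the byte offset come from the description above; the other fields and the function names are hypothetical.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical state shape; only `crashed` and the offset are named in the issue.
interface WorkerState {
  workerId: string;
  crashed: boolean;
  lastGoodOffset: number;
  exitCode: number | null;
  signal: string | null;
}

// Write via a temp file plus rename so a reader never sees a half-written file.
function writeState(statePath: string, state: WorkerState): void {
  const tmpPath = `${statePath}.tmp`;
  writeFileSync(tmpPath, JSON.stringify(state, null, 2), "utf8");
  renameSync(tmpPath, statePath);
}

function readState(statePath: string): WorkerState {
  return JSON.parse(readFileSync(statePath, "utf8")) as WorkerState;
}

// Intended to be called from the child's "exit" handler, for example:
// child.once("exit", (code, signal) => markCrashed(statePath, offset, code, signal));
function markCrashed(
  statePath: string,
  lastGoodOffset: number,
  exitCode: number | null,
  signal: string | null,
): WorkerState {
  const next: WorkerState = {
    ...readState(statePath),
    crashed: true,
    lastGoodOffset,
    exitCode,
    signal,
  };
  writeState(statePath, next);
  return next;
}

// Usage: a healthy worker's state file is flipped to crashed at offset 4096.
const demoStatePath = join(mkdtempSync(join(tmpdir(), "klonode-state-")), "w1.state.json");
writeState(demoStatePath, {
  workerId: "w1",
  crashed: false,
  lastGoodOffset: 0,
  exitCode: null,
  signal: null,
});
markCrashed(demoStatePath, 4096, 1, null);
console.log(readState(demoStatePath).crashed); // true
```

The "resume from last checkpoint" button could then read lastGoodOffset from this file to decide where to restart the stream.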

Why a separate Node process (not a worker_thread)

Worker threads die with the Vite parent. Child processes spawned by the Vite handler die with the Vite parent (which is why today's code has this problem). The only thing that survives is a process spawned with detached: true AND stdio that is not connected to the parent ('ignore' or file descriptors, not pipes) AND its PID recorded somewhere durable, with unref() called so the parent does not wait for it. That has to be a full standalone Node process with its own module graph, not a thread sharing the SvelteKit runtime.
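
A minimal sketch of such a spawn, using Node's child_process API. The spawnDetached helper is a hypothetical name, and the demo launches a throwaway Node script rather than the real klonode-worker binary, which does not exist yet.

```typescript
import { spawn } from "node:child_process";
import { closeSync, mkdtempSync, openSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Spawn a process that can outlive the current one and return its PID.
function spawnDetached(command: string, args: string[], logPath: string): number {
  const out = openSync(logPath, "a");
  const child = spawn(command, args, {
    detached: true, // own process group / session, not tied to the parent
    windowsHide: true, // no console window on Windows
    stdio: ["ignore", out, out], // stdout and stderr go to the log file, never to a pipe
  });
  // Spawn failures such as ENOENT arrive asynchronously as an "error" event.
  child.once("error", (err) => console.error("worker failed to start:", err.message));
  child.unref(); // let the parent's event loop exit without waiting for the child
  closeSync(out); // the child holds its own copy of the descriptor
  if (child.pid === undefined) {
    throw new Error(`could not spawn ${command}`);
  }
  return child.pid;
}

// Usage: launch a short-lived Node script as a stand-in for the worker.
const demoDir = mkdtempSync(join(tmpdir(), "klonode-worker-"));
const demoPid = spawnDetached(
  process.execPath,
  ["-e", "console.log('worker alive')"],
  join(demoDir, "worker.log"),
);
console.log(Number.isInteger(demoPid)); // true
```

The returned PID is what would be written to the durable record so a later boot can find or clean up the worker.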

Windows specifics

  • Use detached: true, windowsHide: true, stdio: ['ignore', outputFileFd, outputFileFd]
  • Open a named pipe with net.createServer().listen() on \\.\pipe\klonode-worker-<id> (written '\\\\.\\pipe\\klonode-worker-<id>' as a JavaScript string literal)
  • Register the PID in .klonode/workers/<id>.pid for cleanup on next boot
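
The endpoint path and the PID-file cleanup could be sketched as follows. The helper names are hypothetical, and placing the unix socket in the OS temp directory is an assumption; the issue only fixes the Windows pipe name.

```typescript
import { existsSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Per-platform IPC endpoint. The socket location on macOS / Linux is an assumption.
function ipcPath(workerId: string, platform: string = process.platform): string {
  return platform === "win32"
    ? `\\\\.\\pipe\\klonode-worker-${workerId}`
    : join(tmpdir(), `klonode-worker-${workerId}.sock`);
}

// Signal 0 performs the existence check without delivering a signal.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// On boot: delete a PID file whose process is gone. Returns true if it was removed.
function removeIfStale(pidFile: string): boolean {
  if (!existsSync(pidFile)) return false;
  const pid = Number(readFileSync(pidFile, "utf8").trim());
  if (Number.isInteger(pid) && pid > 0 && isAlive(pid)) return false;
  rmSync(pidFile, { force: true });
  return true;
}

// Usage: a PID file pointing at this very process is kept, a garbage one is removed.
const pidDir = mkdtempSync(join(tmpdir(), "klonode-pid-"));
const livePidFile = join(pidDir, "live.pid");
const stalePidFile = join(pidDir, "stale.pid");
writeFileSync(livePidFile, String(process.pid), "utf8");
writeFileSync(stalePidFile, "not-a-pid", "utf8");
console.log(removeIfStale(livePidFile)); // false
console.log(removeIfStale(stalePidFile)); // true
```

A server created with net.createServer() can listen on the string returned by ipcPath on either platform, which keeps the transport code identical across Windows and unix.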

Files to touch

New:

  • packages/ui/src/lib/workers/worker-client.ts — browser-side client that spawns, streams, reconnects
  • packages/ui/src/lib/workers/worker-manager.ts — server-side worker registry / spawner
  • packages/ui/src/lib/workers/worker-protocol.ts — shared types for the IPC events
  • packages/ui/src/routes/api/worker/spawn/+server.ts — POST endpoint
  • packages/ui/src/routes/api/worker/[id]/stream/+server.ts — GET SSE with ?since= offset
  • packages/ui/src/routes/api/worker/[id]/stop/+server.ts — POST abort
  • packages/worker/ — NEW package. A standalone Node binary klonode-worker that wraps the Claude CLI spawn, writes the log, handles the IPC, and persists state. Built with tsup like the existing CLI package.
  • docs/self-hosting.md — update to describe the detached-worker flow

Modified:

  • packages/ui/src/routes/api/chat/stream/+server.ts — either delegate to the worker manager or keep as a legacy path for non-self-hosting setups with a warning
  • packages/ui/src/lib/components/ChatPanel/ChatPanel.svelte — replace direct fetch to /api/chat/stream with workerClient.spawn(...) + reconnect logic in onMount
  • packages/ui/src/lib/stores/agents.ts: cliSessionIds becomes workerIds: Record<tabId, workerId>; persisted across reloads as already done for the old session IDs

Acceptance criteria

  • Open the Workstation, send a long-running task (e.g. "add Kotlin support to the content extractor"), wait for it to start streaming
  • Edit packages/ui/src/lib/components/ChatPanel/ChatPanel.svelte manually in your editor while the stream is live
  • Vite HMR fires, the ChatPanel remounts — the streaming tokens keep flowing into the new instance with no user-visible interruption
  • Edit packages/ui/src/routes/api/chat/stream/+server.ts manually while the stream is live
  • Vite restarts the dev server, browser reloads, ChatPanel remounts — the same streaming tokens pick up from where they left off (not a fresh conversation)
  • Kill the Vite dev server completely (Ctrl-C the terminal), restart, reload the browser — the conversation is still there because the worker is still running detached
  • ps / tasklist shows a node klonode-worker process even when Vite is down
  • If the worker crashes, the UI shows a recoverable error banner with a "resume from checkpoint" button
  • Works on Windows (named pipes) and macOS / Linux (unix sockets)

Out of scope for the first PR

  • Multi-worker orchestration beyond one worker per tab
  • Worker sharing across windows / users
  • Authentication beyond "localhost-only + random token"
  • GUI for inspecting worker state

Why this is help-wanted

The IPC transport, log-tailing, and reconnect protocol are well-understood problems with good reference implementations (pm2, forever, tmux, GoTTY). The Klonode-specific part is narrow: wrap the existing Claude CLI spawn in a persistent wrapper and make the ChatPanel reconnect by worker ID instead of spawning fresh. A contributor who has shipped a Node process supervisor before could land the minimum viable version in a weekend.

Sibling issues:

Sibling PRs:

Labels: enhancement, framework-support, help wanted