
Initial findings and follow-on iteration to find actual RC: Electron terminal distortion with Codex/zellij #324

@whoisasx

Initial findings and follow-on iteration to find the actual RC

Related / same symptom reports: #314, #280

This issue captures initial RCA findings from a live debugging session. Treat this as a starting point, not a final root cause. We should follow on with a tighter reproduction and instrumentation pass to confirm the actual RC.

User-visible symptom

When spawning an orchestrator with the Codex agent, the terminal initially works in the Electron app. After some time, or after navigating around the UI, the terminal UI becomes distorted and typed input stops appearing in the Codex query panel. Plain zellij ls can report the session as EXITED - attach to resurrect.

Observed dev log excerpts included:

daemon already running (pid 4426, port 3001); refusing to start
[vite] ws proxy error: Error: write EPIPE
[vite] ws proxy socket error: Error: write EPIPE
[vite] ws proxy error: Error: read ECONNRESET

Initial RCA hypothesis

The strongest current hypothesis is a dev-mode daemon ownership / readiness split-brain:

  1. A daemon is already running on 127.0.0.1:3001.
  2. pnpm dev starts Electron.
  3. Electron attempts to start another daemon via go run ./cmd/ao daemon.
  4. The backend correctly refuses because running.json points at a live daemon.
  5. Electron treats the failed child process as stopped instead of adopting the existing daemon.
  6. The renderer can still initially reach the daemon through Vite's proxy, so the terminal works at first.
  7. Later, /mux websocket churn from navigation, Vite proxy resets, or renderer remounts closes the terminal socket.
  8. The frontend refuses to reattach because daemonReady=false.
  9. xterm does not locally echo input, so keystrokes appear to do nothing.
  10. The stale terminal surface remains and can look distorted because the fresh zellij attach/repaint path never completes.

Evidence from code

Electron auto-starts the daemon on app ready:

  • frontend/src/main.ts calls startDaemon() during app.whenReady().

The duplicate daemon refusal is expected backend behavior:

  • backend/internal/daemon/daemon.go returns daemon already running ... refusing to start when a live run file exists.

Electron currently marks the failed child as stopped:

  • frontend/src/main.ts child exit handler sets daemon status to stopped.

Terminal initial attach does not require daemon readiness, but reconnect does:

  • frontend/src/renderer/hooks/useTerminalSession.ts attaches if a terminal handle exists.
  • The reconnect path gates on daemonReady; when false, it sets reattaching and waits instead of building a fresh mux.
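
To make the asymmetry concrete, here is a minimal model of the two paths described above. It is a sketch, not the code in useTerminalSession.ts: the type and function names (SessionState, canInitialAttach, onSocketClosed) are hypothetical and only mirror the behavior stated in the bullets.

```typescript
// Hypothetical model of the attach/reconnect gate; names are illustrative
// and are not taken from useTerminalSession.ts.
type ReconnectAction = "build-fresh-mux" | "wait-reattaching";

interface SessionState {
  hasTerminalHandle: boolean; // the initial attach only looks at this
  daemonReady: boolean;       // the reconnect path additionally looks at this
}

// Initial attach: proceeds whenever a terminal handle exists.
function canInitialAttach(s: SessionState): boolean {
  return s.hasTerminalHandle;
}

// Reconnect after the socket closes: gated on daemonReady.
function onSocketClosed(s: SessionState): ReconnectAction {
  return s.daemonReady ? "build-fresh-mux" : "wait-reattaching";
}

// Split-brain case: the daemon is reachable, but Electron recorded it as stopped.
const splitBrain: SessionState = { hasTerminalHandle: true, daemonReady: false };
console.log(canInitialAttach(splitBrain)); // true
console.log(onSocketClosed(splitBrain));   // wait-reattaching
```

Under this model the first attach succeeds and the first socket close after it leaves the session waiting indefinitely, which matches the "works at first, then stops" symptom.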

In dev, /mux can go through the Vite proxy:

  • frontend/src/renderer/lib/api-client.ts initializes the dev API base from window.location.origin.
  • frontend/vite.renderer.config.ts proxies /api and /mux to 127.0.0.1:3001.
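
The difference between the proxied route and a direct route can be sketched as a single base-URL decision. This is an assumption-laden illustration: resolveBases and its input fields are hypothetical, and the /mux path and the VITE_NO_ELECTRON=1 preview mode are the only parts taken from this issue.

```typescript
// Hypothetical helper; not the logic in api-client.ts.
interface BaseInputs {
  origin: string;            // window.location.origin in the renderer
  daemonPort: number | null; // null until the daemon port is known
  browserPreview: boolean;   // true when running with VITE_NO_ELECTRON=1
}

function resolveBases(i: BaseInputs): { api: string; mux: string } {
  if (!i.browserPreview && i.daemonPort !== null) {
    // Direct route: talk to the daemon without going through Vite.
    const host = `127.0.0.1:${i.daemonPort}`;
    return { api: `http://${host}`, mux: `ws://${host}/mux` };
  }
  // Proxied route: same origin as the page, so Vite forwards /api and /mux.
  const wsOrigin = i.origin.replace(/^http/, "ws"); // http -> ws, https -> wss
  return { api: i.origin, mux: `${wsOrigin}/mux` };
}
```

With the direct route, a Vite proxy reset can no longer close the terminal socket, which removes one of the suspected sources of /mux churn.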

Important zellij debugging note

Plain zellij ls can be misleading for AO sessions. AO uses a custom socket dir:

ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-$(id -u)

Use this when checking AO-owned sessions:

env ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-$(id -u) zellij list-sessions --no-formatting

During the RCA, plain zellij list-sessions showed EXITED - attach to resurrect, while the AO socket-dir query showed the AO zellij session still alive.
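
For scripted checks, the same query can be wrapped in two small helpers. This is a sketch under assumptions: the function names are hypothetical, and the parser assumes the --no-formatting output has one session per line, with the session name as the first token and the literal text EXITED on dead sessions. Verify that against your zellij version before relying on it.

```typescript
// Hypothetical helpers for checking AO-owned zellij sessions.
interface SessionInfo {
  name: string;
  exited: boolean;
}

// Mirrors ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-$(id -u) from this issue.
function aoSocketDir(uid: number): string {
  return `/tmp/ao-zellij-${uid}`;
}

// Assumed line format: "<name> ... (EXITED - attach to resurrect)" for dead
// sessions, "<name> ..." otherwise.
function parseSessionList(output: string): SessionInfo[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => ({
      name: line.split(/\s+/)[0] ?? "",
      exited: line.includes("EXITED"),
    }));
}
```

Running the parser on the output of both the plain query and the socket-dir query makes the mismatch described above easy to spot in logs.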

Secondary findings / contributors

  1. There were stale daemon-owned zellij attach processes.

    The live process table showed multiple long-lived zellij attach test-reverb-1 options --pane-frames false children under the daemon. The active zellij metadata showed multiple connected clients and the pane pinned at 80x24. That matches the stale-client size pinning failure mode described in backend comments.

  2. WebSocket disconnects line up with zellij attach churn.

    ~/.ao/daemon.log showed repeated /mux connections ending and terminal re-attaching for the same terminal id. Around one observed failure, /mux lifetimes dropped to sub-second intervals.

  3. The backend terminal attach model intentionally creates one private zellij attach per mux open.

    This is expected, but it means route navigation, StrictMode dev remounts, Vite proxy resets, or websocket churn can create rapid attach/detach cycles.

  4. xterm WebGL context loss has weak recovery.

    XtermTerminal.tsx prefers WebglAddon; on context loss it disposes the addon, but does not fall back to CanvasAddon or recreate the terminal. That may explain purely visual distortion even when the PTY is alive.

  5. The frontend does not send an explicit terminal close frame on teardown.

    TerminalMux exposes close(id), but useTerminalSession teardown disposes the socket. Backend websocket cleanup should close attachments, but an explicit close frame may reduce stale attach windows during navigation.
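
For item 5, an explicit close before disposal could look roughly like the following. MuxLike is a hypothetical structural stand-in for TerminalMux; only close(id) comes from this issue, while the dispose() method name and the string id type are assumptions.

```typescript
// Hypothetical stand-in for the mux object; not the real TerminalMux type.
interface MuxLike {
  close(id: string): void; // sends the terminal close frame
  dispose(): void;         // tears down the underlying socket
}

// Send the close frame first, then dispose the socket. The finally block
// guarantees disposal even if close() throws; the error still propagates.
function teardownTerminal(mux: MuxLike, terminalId: string): void {
  try {
    mux.close(terminalId);
  } finally {
    mux.dispose();
  }
}

// Usage with a fake mux that records call order.
const calls: string[] = [];
teardownTerminal(
  {
    close: (id) => { calls.push(`close:${id}`); },
    dispose: () => { calls.push("dispose"); },
  },
  "term-1",
);
console.log(calls.join(",")); // close:term-1,dispose
```

Whether the close frame is flushed before the socket goes away depends on the real transport, so this ordering alone does not prove the backend sees the frame.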

Follow-on iteration needed

To find the actual RC, the next pass should instrument and verify:

  • Whether Electron should adopt an already-running daemon from running.json instead of marking status stopped after duplicate-start refusal.
  • Whether Electron dev should connect directly to http://127.0.0.1:<daemon-port> instead of routing /mux through Vite, except in VITE_NO_ELECTRON=1 browser-preview mode.
  • Whether websocket close always calls backend terminal cleanup and actually terminates the private zellij attach process.
  • Whether stale zellij clients remain registered after SIGTERM / SIGKILL fallback and pin the session size.
  • Whether WebGL context loss is occurring in Electron during navigation/panel churn.
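
For the first question above, a minimal sketch of what adoption could look like in the child exit handler. Everything here is an assumption except the stderr text: the running.json shape ({ pid, port }), the DaemonStatus type, and the function names are hypothetical, and the readiness probe is passed in as a synchronous callback only to keep the sketch short (a real probe would be asynchronous).

```typescript
// Hypothetical status model; not the types used in frontend/src/main.ts.
type DaemonStatus = { state: "ready"; port: number } | { state: "stopped" };

// Assumed run-file shape; the real running.json format is not confirmed here.
interface RunFile {
  pid: number;
  port: number;
}

function parseRunFile(raw: string): RunFile | null {
  try {
    const value: unknown = JSON.parse(raw);
    if (typeof value === "object" && value !== null) {
      const rec = value as Record<string, unknown>;
      const pid = rec["pid"];
      const port = rec["port"];
      if (typeof pid === "number" && typeof port === "number") {
        return { pid, port };
      }
    }
  } catch {
    // Malformed JSON: treat as no usable run file.
  }
  return null;
}

// Adopt the existing daemon only on a duplicate-start refusal, with a valid
// run file, and after the readiness probe succeeds. Otherwise stay stopped.
function statusAfterChildExit(
  stderr: string,
  rawRunFile: string | null,
  isReady: (port: number) => boolean,
): DaemonStatus {
  if (!stderr.includes("daemon already running") || rawRunFile === null) {
    return { state: "stopped" };
  }
  const run = parseRunFile(rawRunFile);
  if (run !== null && isReady(run.port)) {
    return { state: "ready", port: run.port };
  }
  return { state: "stopped" };
}
```

Instrumenting the three inputs (stderr, run-file contents, probe result) at this decision point would show whether the split-brain state is actually reached in the failing sessions.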

Candidate fixes

  • In Electron startup, if daemon launch exits with daemon already running, read running.json, verify readiness, and set daemon status to ready with the existing port.
  • In dev Electron, rebase API/mux URL directly to the daemon port once known, bypassing Vite proxy for /mux.
  • Send explicit terminal close(id) before socket disposal in terminal teardown.
  • Add backend test/instrumentation proving websocket close kills the corresponding private zellij attach process.
  • Add xterm WebGL context-loss fallback to canvas or terminal recreation.
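
The last candidate fix can be sketched without depending on xterm itself. The interfaces below are structural stand-ins, not the xterm.js types: they assume the WebGL addon exposes an onContextLoss subscription and that the terminal has loadAddon, which is how the issue describes XtermTerminal.tsx. The attachRenderer name and the factory parameters are hypothetical.

```typescript
// Structural stand-ins so the sketch stays self-contained.
interface Addon {
  dispose(): void;
}
interface ContextLossAddon extends Addon {
  onContextLoss(listener: () => void): void;
}
interface TerminalLike {
  loadAddon(addon: Addon): void;
}

// Prefer WebGL; fall back immediately if it cannot be created or loaded,
// and fall back later if the WebGL context is lost.
function attachRenderer(
  term: TerminalLike,
  createWebgl: () => ContextLossAddon,
  createFallback: () => Addon,
): "webgl" | "fallback" {
  let webgl: ContextLossAddon | undefined;
  try {
    const candidate = createWebgl();
    term.loadAddon(candidate);
    webgl = candidate;
  } catch {
    webgl = undefined; // WebGL unavailable in this environment
  }
  if (webgl === undefined) {
    term.loadAddon(createFallback());
    return "fallback";
  }
  const active = webgl;
  active.onContextLoss(() => {
    active.dispose();
    term.loadAddon(createFallback());
  });
  return "webgl";
}
```

In the real component the fallback factory would construct the canvas addon, or the handler would trigger terminal recreation instead. Logging inside the context-loss handler would also answer the open question of whether context loss happens during navigation at all.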

Workaround while investigating

Avoid running two daemon owners. For dev with Codex, start from a clean environment (no daemon already running) and let Electron own the daemon:

cd frontend
AO_AGENT=codex pnpm dev

Then use the AO socket dir when checking zellij state:

env ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-$(id -u) zellij list-sessions
