Initial findings and follow-on iteration to find the actual RC
Related / same symptom reports: #314, #280
This issue captures initial RCA findings from a live debugging session. Treat this as a starting point, not a final root cause. We should follow on with a tighter reproduction and instrumentation pass to confirm the actual RC.
User-visible symptom
When spawning an orchestrator with the Codex agent, the terminal initially works in the Electron app. After some time, or after navigating around the UI, the terminal UI becomes distorted. Typing appears to stop showing in the Codex query panel. Plain zellij ls can show the session as EXITED - attach to resurrect.
Observed dev log excerpts included:
daemon already running (pid 4426, port 3001); refusing to start
[vite] ws proxy error: Error: write EPIPE
[vite] ws proxy socket error: Error: write EPIPE
[vite] ws proxy error: Error: read ECONNRESET
Initial RCA hypothesis
The strongest current hypothesis is a dev-mode daemon ownership / readiness split-brain:
- A daemon is already running on 127.0.0.1:3001.
- pnpm dev starts Electron.
- Electron attempts to start another daemon via go run ./cmd/ao daemon.
- The backend correctly refuses because running.json points at a live daemon.
- Electron treats the failed child process as stopped instead of adopting the existing daemon.
- The renderer can still initially reach the daemon through Vite's proxy, so the terminal works at first.
- Later, /mux websocket churn from navigation, Vite proxy resets, or renderer remounts closes the terminal socket.
- The frontend refuses to reattach because daemonReady=false.
- xterm does not locally echo input, so keystrokes appear to do nothing.
- The stale terminal surface remains and can look distorted because the fresh zellij attach/repaint path never completes.
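The gating step in this sequence can be modeled as a small pure function. This is only a sketch of the hypothesized behavior, not the real hook: the type names, the action labels, and the function nextMuxAction are all hypothetical.

```typescript
// Hypothetical model of the attach/reconnect gate described in the hypothesis.
// None of these names come from the codebase.
type MuxAction = "none" | "attach" | "build-fresh-mux" | "wait-reattaching";

interface TerminalGateInput {
  hasTerminalHandle: boolean; // a terminal handle exists for this pane
  socketDropped: boolean;     // the /mux websocket has closed at least once
  daemonReady: boolean;       // Electron's view of daemon readiness
}

function nextMuxAction(input: TerminalGateInput): MuxAction {
  if (!input.hasTerminalHandle) return "none";
  // Initial attach only needs a handle, so it works even with daemonReady=false.
  if (!input.socketDropped) return "attach";
  // Reconnect is gated on readiness; with daemonReady=false it waits indefinitely.
  return input.daemonReady ? "build-fresh-mux" : "wait-reattaching";
}
```

Under this model, the split-brain case is hasTerminalHandle=true, socketDropped=true, daemonReady=false: the daemon is reachable, yet the terminal stays in the waiting state.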
Evidence from code
Electron auto-starts the daemon on app ready:
frontend/src/main.ts calls startDaemon() during app.whenReady().
The duplicate daemon refusal is expected backend behavior:
backend/internal/daemon/daemon.go returns daemon already running ... refusing to start when a live run file exists.
Electron currently marks the failed child as stopped:
frontend/src/main.ts child exit handler sets daemon status to stopped.
Terminal initial attach does not require daemon readiness, but reconnect does:
frontend/src/renderer/hooks/useTerminalSession.ts attaches if a terminal handle exists.
- The reconnect path gates on daemonReady; when false, it sets reattaching and waits instead of building a fresh mux.
In dev, /mux can go through the Vite proxy:
frontend/src/renderer/lib/api-client.ts initializes the dev API base from window.location.origin.
frontend/vite.renderer.config.ts proxies /api and /mux to 127.0.0.1:3001.
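To make the two /mux routes concrete, here is a minimal sketch of how the websocket URL could be derived in dev. The function resolveMuxWsUrl and its input shape are hypothetical, and the example origin http://localhost:5173 is only Vite's default dev port, not something confirmed for this repo.

```typescript
// Hypothetical helper; not the actual api-client.ts implementation.
interface MuxRouteInput {
  pageOrigin: string;        // e.g. window.location.origin in the renderer
  daemonPort: number | null; // known once Electron has started or adopted a daemon
  browserPreview: boolean;   // true when running with VITE_NO_ELECTRON=1
}

function resolveMuxWsUrl(input: MuxRouteInput): string {
  // Go direct only when the port is known and we are inside Electron.
  const direct = input.daemonPort !== null && !input.browserPreview;
  const httpBase = direct ? "http://127.0.0.1:" + input.daemonPort : input.pageOrigin;
  // http -> ws and https -> wss; everything after the scheme is unchanged.
  return httpBase.replace(/^http/, "ws") + "/mux";
}
```

With daemonPort unset, the URL stays on the page origin and therefore goes through the Vite proxy, which is the path that produced the EPIPE and ECONNRESET errors in the log.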
Important zellij debugging note
Plain zellij ls can be misleading for AO sessions. AO uses a custom socket dir:
ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-$(id -u)
Use this when checking AO-owned sessions:
env ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-$(id -u) zellij list-sessions --no-formatting
During the RCA, plain zellij list-sessions showed EXITED - attach to resurrect, while the AO socket-dir query showed the AO zellij session still alive.
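For scripted checks, the same query can be described as data and handed to a process launcher. This is a sketch under assumptions: aoZellijListSessions and sessionLooksExited are hypothetical helpers, and the EXITED check only looks for the marker text quoted above rather than parsing zellij's full output format.

```typescript
// Hypothetical helpers for querying AO-owned zellij sessions.
interface ZellijInvocation {
  command: string;
  args: string[];
  env: { ZELLIJ_SOCKET_DIR: string };
}

function aoZellijListSessions(uid: number): ZellijInvocation {
  return {
    command: "zellij",
    args: ["list-sessions", "--no-formatting"],
    // Mirrors ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-$(id -u) from the note above.
    env: { ZELLIJ_SOCKET_DIR: "/tmp/ao-zellij-" + uid },
  };
}

// Substring check only; the surrounding line layout is not assumed.
function sessionLooksExited(line: string): boolean {
  return line.indexOf("EXITED") !== -1;
}
```

In Node, the result could be passed to child_process.execFile with the env merged over process.env and the uid taken from process.getuid().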
Secondary findings / contributors
- There were stale daemon-owned zellij attach processes.
  The live process table showed multiple long-lived zellij attach test-reverb-1 options --pane-frames false children under the daemon. The active zellij metadata showed multiple connected clients and the pane pinned at 80x24. That matches the stale-client size pinning failure mode described in backend comments.
- WebSocket disconnects line up with zellij attach churn.
  ~/.ao/daemon.log showed repeated /mux connections ending and terminal re-attaching for the same terminal id. Around one observed failure, /mux lifetimes dropped to sub-second intervals.
- The backend terminal attach model intentionally creates one private zellij attach per mux open.
  This is expected, but it means route navigation, StrictMode dev remounts, Vite proxy resets, or websocket churn can create rapid attach/detach cycles.
- xterm WebGL context loss has weak recovery.
  XtermTerminal.tsx prefers WebglAddon; on context loss it disposes the addon, but does not fall back to CanvasAddon or recreate the terminal. That may explain purely visual distortion even when the PTY is alive.
- The frontend does not send an explicit terminal close frame on teardown.
  TerminalMux exposes close(id), but useTerminalSession teardown disposes the socket. Backend websocket cleanup should close attachments, but an explicit close frame may reduce stale attach windows during navigation.
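The teardown ordering in the last finding could look roughly like the sketch below. Only close(id) is named in this issue; the MuxLike interface, the dispose() method name, and teardownTerminal are assumptions standing in for the real TerminalMux and hook code.

```typescript
// Structural stand-in for TerminalMux; dispose() is an assumed name.
interface MuxLike {
  close(id: string): void; // sends an explicit terminal close frame
  dispose(): void;         // tears down the underlying websocket
}

function teardownTerminal(mux: MuxLike, terminalId: string): void {
  try {
    // Tell the backend to drop this terminal's attach before the socket goes away.
    mux.close(terminalId);
  } finally {
    // Always dispose, even if the close frame could not be sent.
    mux.dispose();
  }
}
```

The try/finally keeps today's behavior as the fallback: if the close frame fails, the socket is still disposed and backend websocket cleanup remains responsible for the attachment.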
Follow-on iteration needed
To find the actual RC, the next pass should instrument and verify:
- Whether Electron should adopt an already-running daemon from running.json instead of marking status stopped after duplicate-start refusal.
- Whether Electron dev should connect directly to http://127.0.0.1:<daemon-port> instead of routing /mux through Vite, except in VITE_NO_ELECTRON=1 browser-preview mode.
- Whether websocket close always calls backend terminal cleanup and actually terminates the private zellij attach process.
- Whether stale zellij clients remain registered after SIGTERM / SIGKILL fallback and pin the session size.
- Whether WebGL context loss is occurring in Electron during navigation/panel churn.
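One way to instrument the zellij attach checks is to snapshot the process table before and after closing a terminal and count attach children of the daemon. The sketch below assumes input shaped like the output of ps -A -o pid=,ppid=,command= and uses hypothetical names (parsePsRows, attachChildren); it is not existing tooling in the repo.

```typescript
// Hypothetical instrumentation helpers for counting daemon-owned zellij attach children.
interface ProcRow {
  pid: number;
  ppid: number;
  command: string;
}

// Expects one process per line: "<pid> <ppid> <command...>".
function parsePsRows(psOutput: string): ProcRow[] {
  const rows: ProcRow[] = [];
  for (const line of psOutput.split("\n")) {
    const m = /^\s*(\d+)\s+(\d+)\s+(.*)$/.exec(line);
    if (m) rows.push({ pid: parseInt(m[1], 10), ppid: parseInt(m[2], 10), command: m[3] });
  }
  return rows;
}

function attachChildren(rows: ProcRow[], daemonPid: number, sessionName: string): ProcRow[] {
  const needle = "zellij attach " + sessionName + " ";
  // The trailing space avoids matching session names that share a prefix.
  return rows.filter((r) => r.ppid === daemonPid && (r.command + " ").indexOf(needle) !== -1);
}
```

Since one private attach per open mux is expected, only a count that stays above the number of open terminals after a websocket close would point at stale clients.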
Candidate fixes
- In Electron startup, if daemon launch exits with daemon already running, read running.json, verify readiness, and set daemon status to ready with the existing port.
- In dev Electron, rebase API/mux URL directly to the daemon port once known, bypassing Vite proxy for /mux.
- Send explicit terminal close(id) before socket disposal in terminal teardown.
- Add backend test/instrumentation proving websocket close kills the corresponding private zellij attach process.
- Add xterm WebGL context-loss fallback to canvas or terminal recreation.
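The first candidate fix, adopting an existing daemon, can be sketched as a pure decision function. Everything here is an assumption for illustration: the shape of running.json (pid and port fields), the type names, and decideAfterDaemonExit itself. Reading the file and probing readiness are left to the caller.

```typescript
// Assumed shape of running.json; the real file may differ.
interface RunFileInfo {
  pid: number;
  port: number;
}

type DaemonDecision =
  | { status: "ready"; port: number; adopted: true }
  | { status: "stopped"; reason: string };

function decideAfterDaemonExit(
  stderrText: string,          // stderr captured from the exited daemon child
  runFile: RunFileInfo | null, // parsed running.json, or null if missing/unreadable
  probeOk: boolean             // result of a readiness probe against runFile.port
): DaemonDecision {
  const refused = stderrText.indexOf("daemon already running") !== -1;
  if (!refused) return { status: "stopped", reason: "daemon child exited" };
  if (runFile === null) return { status: "stopped", reason: "running.json missing or unreadable" };
  if (!probeOk) return { status: "stopped", reason: "existing daemon failed readiness probe" };
  // Duplicate-start refusal plus a healthy existing daemon: adopt it.
  return { status: "ready", port: runFile.port, adopted: true };
}
```

Keeping the decision separate from I/O makes it straightforward to unit-test the refusal case using the exact log line observed in this session.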
Workaround while investigating
Avoid running two daemon owners. For dev with Codex, start a clean environment and let Electron own the daemon:
cd frontend
AO_AGENT=codex pnpm dev
Then use the AO socket dir when checking zellij state:
env ZELLIJ_SOCKET_DIR=/tmp/ao-zellij-$(id -u) zellij list-sessions