Summary
After ~24h of uptime with active streaming and several large sessions, the pi-web server (the supervised child) becomes completely unresponsive: it accepts TCP connections but never writes a response. The browser UI shows the sessions drawer stuck on "No saved sessions yet." with the /api/sessions request hanging forever, and the tunnel/proxy eventually returns a "Tunnel Timed Out" page.
This is not a memory problem — it's single-threaded event-loop / syscall saturation.
Symptoms
- Sessions drawer renders the empty placeholder ("No saved sessions yet.") because the /api/sessions fetch never resolves.
- curl requests to both the supervisor (127.0.0.1:8787) and the child directly (127.0.0.1:8788) time out (HTTP 000), even though both processes are alive and LISTENing.
- Connections are accepted but never get a response → they pile up (observed 38 ESTAB connections to the child backing up).
- Restarting the child clears it immediately (sessions persist on disk, nothing lost).
Evidence it is CPU / event loop, not memory
System:
- 494 GB RAM, ~320 GB free, no swap, no OOM.
Child process:
- RSS only ~833 MB (far below V8's default ~4 GB old-space limit). 37 GB VmSize is just reserved address space.
Main thread is saturated in the kernel:
MainThread: utime 291,825 vs stime 1,715,164 (~85% of CPU in kernel/syscalls)
voluntary_ctxt_switches: 31,571,312 (31.5M wakeups)
state: running / ep_poll, ~20-30% CPU sustained for the full ~24h uptime
V8Worker / libuv threads are idle — this is a main-thread saturation, which is exactly why Node accepts sockets but never responds.
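The ~85% figure follows directly from the two tick counters quoted above. A quick check, using only the utime and stime values from this report (kernelShare is an illustrative name, not something from the codebase):

```typescript
// Share of a thread's CPU time spent in the kernel, from /proc-style tick counters.
// Both counters use the same unit, so it cancels out and no conversion is needed.
function kernelShare(utime: number, stime: number): number {
  const total = utime + stime;
  return total === 0 ? 0 : stime / total;
}

// Values copied from the MainThread line in this report.
const share = kernelShare(291825, 1715164);
console.log(`${(share * 100).toFixed(1)}% of main-thread CPU time was spent in the kernel`);
```

A healthy Node process that is busy in JavaScript would normally show the opposite ratio, with utime dominating.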
Root cause (two compounding issues in server.ts)
- Per-streaming-event fan-out is O(events x clients x messages).
  - Every pi_event (fires on every token/chunk while a turn streams) triggers two broadcast() calls. broadcast() does a JSON.stringify then a .send() (a write syscall) per connected realtime client.
  - On message_end / agent_end / compaction_end it also calls sessionStats(value), which iterates the entire message branch (one affected session had 892 messages).
  - With high-frequency events x several clients x large sessions, this becomes a syscall storm (the 31.5M context switches + huge stime).
- File-descriptor leak on session files.
  - The same session .jsonl files are held open repeatedly: observed one file open 18x, others 9x / 7x / 6x, totaling 244 open fds.
  - Session listing/reading opens files without closing them; with frequent /api/sessions + state polling over a long-lived server, fds and parse/I/O work accumulate.
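For the file-descriptor issue above, the usual remedy is to scope each handle so it is closed on every path, including errors. A minimal sketch, assuming session files are newline-delimited JSON; readSessionLines is a hypothetical helper, not a function from server.ts:

```typescript
import { open } from "node:fs/promises";

// Hypothetical helper: read a session .jsonl file and always release the descriptor.
async function readSessionLines(path: string): Promise<unknown[]> {
  const handle = await open(path, "r");
  try {
    const text = await handle.readFile({ encoding: "utf8" });
    return text
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as unknown);
  } finally {
    // Runs on success and on read/parse errors alike, so the fd cannot leak.
    await handle.close();
  }
}
```

If the real code only needs the whole file, calling readFile with a path (rather than opening a handle explicitly) would also avoid holding descriptors, since it opens and closes the file internally.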
Why it tips from "slow" to "fully hung"
The server runs ~20-30% kernel CPU continuously (chronic). Streaming a turn on top of large sessions + accumulated connections/fds pushes the event loop past the point where it can drain incoming HTTP requests between broadcast bursts. Requests then queue indefinitely → empty drawer, perpetually-pending /api/sessions, tunnel timeout.
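One way to keep broadcast bursts from starving request handling is to coalesce events and flush them on a timer, serializing each batch once. The sketch below is an assumption about how that could look, not the code in server.ts: RealtimeClient, EventCoalescer, the 50 ms interval, and the pi_event_batch envelope are all hypothetical, and the client shape simply mirrors a WebSocket-style send().

```typescript
// Minimal client shape: anything with a send(string) method, such as a WebSocket.
interface RealtimeClient {
  send(data: string): void;
}

// Buffers events and sends one serialized batch per flush instead of one frame per token.
class EventCoalescer<E> {
  private pending: E[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;
  private readonly clients: Set<RealtimeClient>;
  private readonly flushMs: number;

  // flushMs is an assumed default; tune it against the latency the UI can tolerate.
  constructor(clients: Set<RealtimeClient>, flushMs = 50) {
    this.clients = clients;
    this.flushMs = flushMs;
  }

  push(event: E): void {
    this.pending.push(event);
    // Arm the timer only once per batch window.
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.flushMs);
    }
  }

  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending.length === 0) return;
    // Serialize once per batch, not once per client.
    const frame = JSON.stringify({ type: "pi_event_batch", events: this.pending });
    this.pending = [];
    for (const client of this.clients) {
      try {
        client.send(frame);
      } catch {
        // Drop clients whose send throws so they stop costing a write per flush.
        this.clients.delete(client);
      }
    }
  }
}
```

With this shape, a burst of N tokens inside one window costs one stringify and one send per client rather than N of each. The browser side would need to unpack the batch envelope, which is a protocol change this sketch does not cover.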
Suggested fixes
- Throttle / coalesce pi_event broadcasts (batch instead of fanning out every token).
- Prune dead realtime clients; avoid the double-broadcast per event.
- Memoize or incrementally maintain sessionStats instead of re-scanning the full branch on every event.
- Close session-file handles after reading (fix the fd leak).
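The sessionStats suggestion above could be approached by keeping running totals per session. The report does not show what sessionStats actually computes, so the Message and Stats shapes and the IncrementalStats class below are purely hypothetical stand-ins:

```typescript
// Hypothetical shapes; the real sessionStats() fields are not shown in this report.
interface Message {
  role: string;
  text: string;
}

interface Stats {
  messages: number;
  chars: number;
}

// Keeps running totals per session so each new message costs O(1) instead of a full rescan.
class IncrementalStats {
  private bySession = new Map<string, Stats>();

  // Call once per appended message.
  add(sessionId: string, message: Message): Stats {
    const current = this.bySession.get(sessionId) ?? { messages: 0, chars: 0 };
    const next: Stats = {
      messages: current.messages + 1,
      chars: current.chars + message.text.length,
    };
    this.bySession.set(sessionId, next);
    return next;
  }

  // Rebuild from scratch when the branch changes shape, for example after compaction.
  reset(sessionId: string, branch: Message[]): Stats {
    this.bySession.delete(sessionId);
    let stats: Stats = { messages: 0, chars: 0 };
    for (const message of branch) stats = this.add(sessionId, message);
    return stats;
  }
}
```

Under this approach the full scan of a large branch (such as the 892-message session) would only happen on reset, not on every message_end or agent_end.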
Workaround
Restart the supervised child (sessions persist on disk):
curl -X POST http://127.0.0.1:8787/api/restart
# or
kill -9 <child-pid> # supervisor auto-respawns