
pi-web server becomes unresponsive over time: event-loop saturation from per-token broadcast fan-out + session-file fd leak #24

@ashwin-pc


Summary

After ~24h of uptime with active streaming and several large sessions, the pi-web server (the supervised child) becomes completely unresponsive: it accepts TCP connections but never writes a response. The browser UI shows the sessions drawer stuck on "No saved sessions yet." with the /api/sessions request hanging forever, and the tunnel/proxy eventually returns a "Tunnel Timed Out" page.

This is not a memory problem — it's single-threaded event-loop / syscall saturation.

Symptoms

  • Sessions drawer renders the empty placeholder ("No saved sessions yet.") because the /api/sessions fetch never resolves.
  • curl requests to both the supervisor (127.0.0.1:8787) and the child directly (127.0.0.1:8788) time out (HTTP 000), even though both processes are alive and in LISTEN state.
  • Connections are accepted but never get a response, so they pile up (38 ESTAB connections to the child were observed backing up).
  • Restarting the child clears it immediately (sessions persist on disk, nothing lost).

Evidence it is CPU / event loop, not memory

System:

  • 494 GB RAM, ~320 GB free, no swap, no OOM.

Child process:

  • RSS is only ~833 MB (far below V8's default ~4 GB old-space limit). The 37 GB VmSize is just reserved address space.

Main thread is saturated in the kernel:

MainThread:  utime 291,825  vs  stime 1,715,164   (~85% of CPU in kernel/syscalls)
voluntary_ctxt_switches: 31,571,312               (31.5M wakeups)
state: running / ep_poll, ~20-30% CPU sustained for the full ~24h uptime

The V8Worker / libuv threads are idle. This is main-thread saturation, which is exactly why Node accepts sockets but never responds.
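
For reference, the ~85% figure follows directly from the two counters above. A minimal sketch of the arithmetic, assuming the values are the per-thread utime/stime tick counters (for example fields 14 and 15 of /proc/<pid>/task/<tid>/stat); kernelShare is a hypothetical helper name, not something in server.ts:

```typescript
// Fraction of a thread's CPU time spent in kernel mode, computed from its
// utime (user) and stime (system) tick counters.
// Assumption: both counters use the same unit (clock ticks).
function kernelShare(utimeTicks: number, stimeTicks: number): number {
  const total = utimeTicks + stimeTicks;
  if (total === 0) return 0; // thread has not accumulated any CPU time yet
  return stimeTicks / total;
}

// Values reported above for MainThread.
const share = kernelShare(291825, 1715164);
console.log(`kernel share of main-thread CPU time: ${(share * 100).toFixed(1)}%`);
```

A healthy Node process doing mostly JavaScript work would normally show the opposite ratio (utime dominating), so a share this high points at syscall volume rather than compute.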

Root cause (two compounding issues in server.ts)

  1. Per-streaming-event fan-out is O(events x clients x messages).

    • Every pi_event (fires on every token/chunk while a turn streams) triggers two broadcast() calls.
    • broadcast() does a JSON.stringify then a .send() (a write syscall) per connected realtime client.
    • On message_end / agent_end / compaction_end it also calls sessionStats(value), which iterates the entire message branch (one affected session had 892 messages).
    • With high-frequency events x several clients x large sessions, this becomes a syscall storm (the 31.5M context switches + huge stime).
  2. File-descriptor leak on session files.

    • The same session .jsonl files are held open repeatedly — observed one file open 18x, others 9x / 7x / 6x — totaling 244 open fds.
    • Session listing/reading opens files without closing them; with frequent /api/sessions + state polling over a long-lived server, fds and parse/I/O work accumulate.
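
For item 2, a leak of this shape usually comes from opening a handle and returning (or throwing) without closing it. A minimal sketch of a reader that always releases its descriptor, assuming a hypothetical readSessionLines helper; the actual reader in server.ts may be structured differently:

```typescript
import { open } from "node:fs/promises";

// Hypothetical helper: reads a session .jsonl file and returns its non-empty
// lines. The handle is closed in `finally`, so the fd is released even when
// the read throws.
async function readSessionLines(path: string): Promise<string[]> {
  const handle = await open(path, "r");
  try {
    const text = await handle.readFile({ encoding: "utf8" });
    return text.split("\n").filter((line) => line.trim() !== "");
  } finally {
    await handle.close();
  }
}
```

If the current code uses streams or readline instead, the equivalent is to make sure the underlying stream is destroyed on every exit path, including early returns.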

Why it tips from "slow" to "fully hung"

The server runs ~20-30% kernel CPU continuously (chronic). Streaming a turn on top of large sessions + accumulated connections/fds pushes the event loop past the point where it can drain incoming HTTP requests between broadcast bursts. Requests then queue indefinitely → empty drawer, perpetually-pending /api/sessions, tunnel timeout.

Suggested fixes

  • Throttle / coalesce pi_event broadcasts (batch instead of fanning out every token).
  • Prune dead realtime clients; avoid the double-broadcast per event.
  • Memoize or incrementally maintain sessionStats instead of re-scanning the full branch on every event.
  • Close session-file handles after reading (fix the fd leak).
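
The first two suggestions could look roughly like the sketch below. It is not the existing broadcast() implementation: CoalescingBroadcaster, the RealtimeClient shape, the 50 ms flush interval, and the { type: "batch" } envelope are all assumptions for illustration, and the browser client would need to unpack batches.

```typescript
// Minimal client shape for this sketch (assumption). A WebSocket object
// exposes these two members.
interface RealtimeClient {
  readyState: number;
  send(data: string): void;
}

const OPEN = 1; // numeric value of WebSocket.OPEN

class CoalescingBroadcaster {
  private clients: Set<RealtimeClient>;
  private flushIntervalMs: number;
  private pending: unknown[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(clients: Set<RealtimeClient>, flushIntervalMs = 50) {
    this.clients = clients;
    this.flushIntervalMs = flushIntervalMs;
  }

  // Queue an event instead of fanning it out immediately. At most one
  // flush is scheduled per interval, however many events arrive.
  enqueue(event: unknown): void {
    this.pending.push(event);
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.flushIntervalMs);
    }
  }

  // Serialize the batch once, then do one send per live client.
  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending.length === 0) return;
    const payload = JSON.stringify({ type: "batch", events: this.pending });
    this.pending = [];
    for (const client of this.clients) {
      if (client.readyState !== OPEN) {
        // Prune dead clients; deleting during Set iteration is safe.
        this.clients.delete(client);
        continue;
      }
      client.send(payload);
    }
  }
}
```

With this shape, a burst of N token events inside one interval costs one JSON.stringify and one send per client rather than N of each, which is the main lever against the syscall volume described above.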

Workaround

Restart the supervised child (sessions persist on disk):

curl -X POST http://127.0.0.1:8787/api/restart
# or
kill -9 <child-pid>   # supervisor auto-respawns
