fix(ipc): close uninitialized supervised process #1774
rosetta-livekit-bot[bot] wants to merge 1 commit
    if (!this.init.done) {
      this.init.reject(new Error('process closed before initialization completed'));
      this.proc?.kill();
      await this.#join.await.then(() => {
        this.clearTimers();
      });
      return;
🚩 initialize() caller may hang when close() is called during initialization
The new close() path correctly rejects init and resolves #join, fixing the issue where close() itself would hang. However, if a caller (e.g. proc_pool.ts:108-109 in procWatchTask) is concurrently awaiting proc.initialize(), that call hangs forever on once(this.proc!, 'message') at supervised_proc.ts:217 because killing a child process emits 'exit'/'close' but NOT 'error', so events.once() never resolves or rejects. This is a pre-existing issue not introduced by this PR — in fact, the old code was worse because close() itself would also hang. The fix here is a meaningful improvement, but a complete solution would also need initialize() to detect the killed process (e.g., by racing once(proc, 'message') with once(proc, 'exit') or using an AbortSignal).
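A minimal sketch of the race suggested above, assuming the child is any Node.js EventEmitter that emits 'message' and 'exit'. The helper name waitForMessageOrExit is hypothetical and is not part of agents-js; it only illustrates how initialize() could stop waiting once the process has been killed.

```typescript
import { EventEmitter, once } from 'node:events';

// Hypothetical helper (not an agents-js API): resolves with the first
// 'message' payload, or rejects if the process emits 'exit' first.
export async function waitForMessageOrExit(proc: EventEmitter): Promise<unknown> {
  const ac = new AbortController();
  try {
    return await Promise.race([
      // Winner when the child answers normally.
      once(proc, 'message', { signal: ac.signal }).then(([msg]) => msg),
      // Winner when the child is killed or dies before answering.
      once(proc, 'exit', { signal: ac.signal }).then(([code, signal]) => {
        throw new Error(
          `process exited before initialization completed (code=${code}, signal=${signal})`,
        );
      }),
    ]);
  } finally {
    // Abort the losing once() call so its listeners are removed from the emitter.
    ac.abort();
  }
}
```

With a helper like this, a close() that kills the child would cause the pending wait to reject instead of hanging, and the AbortController keeps the losing listener from leaking.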
Ports livekit/agents#6051 to agents-js.

When a supervised process is closed before initialization completes, reject the initialization future to unblock the supervisor task, kill the child process directly, and wait for join cleanup instead of sending a graceful shutdown the child cannot handle yet.

Tests:
- pnpm test agents/src/ipc/supervised_proc.test.ts
- pnpm build:agents

No tests were added, matching the porting instruction.
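The close-before-initialization behaviour described above can be illustrated with a small self-contained sketch. Everything in it (Deferred, FakeChild, SketchProc, sentShutdown) is a hypothetical stand-in written for this example, not the actual SupervisedProc implementation or the agents-js Future class.

```typescript
import { EventEmitter } from 'node:events';

// Hypothetical minimal deferred with a `done` flag.
class Deferred {
  done = false;
  readonly promise: Promise<void>;
  resolve!: () => void;
  reject!: (err: Error) => void;
  constructor() {
    this.promise = new Promise<void>((res, rej) => {
      this.resolve = () => { this.done = true; res(); };
      this.reject = (err) => { this.done = true; rej(err); };
    });
    // Avoid an unhandled-rejection warning when nobody awaits the promise.
    this.promise.catch(() => {});
  }
}

// Hypothetical stand-in for a child process: kill() emits 'exit' asynchronously.
class FakeChild extends EventEmitter {
  killed = false;
  kill(): void {
    this.killed = true;
    setImmediate(() => this.emit('exit', null, 'SIGKILL'));
  }
}

class SketchProc {
  readonly init = new Deferred();
  readonly join = new Deferred();
  readonly child: FakeChild;
  sentShutdown = false;

  constructor(child: FakeChild) {
    this.child = child;
    // Join completes once the child has exited.
    child.once('exit', () => this.join.resolve());
  }

  async close(): Promise<void> {
    if (!this.init.done) {
      // Never initialized: the child cannot handle a shutdown request yet,
      // so unblock waiters on init, kill directly, and wait for join.
      this.init.reject(new Error('process closed before initialization completed'));
      this.child.kill();
      await this.join.promise;
      return;
    }
    // Initialized: a graceful shutdown request would be sent here (elided).
    // The fake child cannot ack one, so the sketch records the choice and exits it.
    this.sentShutdown = true;
    this.child.kill();
    await this.join.promise;
  }
}
```

The point of the early return is that close() never waits on an acknowledgement the uninitialized child is unable to send.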
Ported from livekit/agents#6051
Original PR description
Summary
test_slow_initialization flakes on CI with a leaked ProcJobExecutor._supervise_task at teardown. The root cause is a real shutdown race in SupervisedProc, not a test issue.

When the owning task (e.g. ProcPool._proc_spawn_task during pool close) is cancelled while awaiting start(), the shielded _start() keeps running and creates the supervise task afterwards. The cleanup path then calls aclose(), which no-ops because the proc isn't marked started yet, leaking the supervise task and the child process. The orphaned child blocks forever waiting for InitializeRequest, and its non-daemon join thread can hang worker shutdown indefinitely.

Changes
- start() keeps a handle on the shielded _start() task; aclose() waits for it before checking started, so an abandoned start can't race past the check.
- aclose() kills a never-initialized process instead of attempting a graceful shutdown: the child reads InitializeRequest before servicing any other message, so a ShutdownRequest could never be acked.
- kill() resolves a pending _initialize_fut so the supervise task (which waits on it before supervising) can observe the process exit.
- test_aclose_after_cancelled_start, which deterministically reproduces the leak (fails the leaked-tasks check and hangs pytest exit without the fix).

nit: Also sets the agent session tests to speed = 1 and raises the default drain_delay; they run under virtual time, so the speed factor no longer buys wall-clock time.
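As a rough TypeScript analogy for the first change in the upstream list (the original is Python asyncio, so this is only an illustration of the idea, not a port), the class below keeps a handle on the in-flight start and has close() wait for it before checking the started flag. SketchLifecycle and all of its members are hypothetical names.

```typescript
// Hypothetical lifecycle sketch; not the SupervisedProc API in either repository.
class SketchLifecycle {
  started = false;
  closed = false;
  superviseRunning = false;
  private startTask: Promise<void> | undefined;

  // Keeps a handle on the in-flight start so close() can wait for it,
  // even if the original caller stops awaiting the returned promise.
  start(): Promise<void> {
    if (!this.startTask) {
      this.startTask = this.doStart();
    }
    return this.startTask;
  }

  private async doStart(): Promise<void> {
    // Simulated spawn latency.
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
    this.started = true;
    this.superviseRunning = true; // stands in for creating the supervise task
  }

  async close(): Promise<void> {
    // Wait for an abandoned start to settle before looking at `started`;
    // otherwise close() could no-op and leave the supervise task running.
    if (this.startTask) {
      await this.startTask.catch(() => {});
    }
    if (!this.started) return;
    this.superviseRunning = false;
    this.closed = true;
  }
}
```

Without the wait at the top of close(), calling close() right after an abandoned start() would see started as false and return early, which mirrors the leak described in the summary.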