fix(ipc): make SupervisedProc.initialize() always settle (closes #1748 wedge)#1768
Open
tsushanth wants to merge 1 commit into
Open
fix(ipc): make SupervisedProc.initialize() always settle (closes #1748 wedge)#1768tsushanth wants to merge 1 commit into
tsushanth wants to merge 1 commit into
Conversation
…wedge) initialize() only completed via `await once(proc, 'message')`, so a warming child that died or hung before sending its first IPC message left initialize() pending forever. The init timeout rejected the side `init` future but did not unblock initialize() itself, so ProcPool.procWatchTask was parked at `await proc.initialize()` holding both initMutex and its procMutex slot — the worker kept reporting available and accepting jobs that could never launch. In production this black-holed every job routed to the worker until the process was externally restarted. initialize() now races three signals — first message, child exit, and the init timeout — and SIGKILLs the child on timeout. Late race losers swallow their own rejection so a normal child exit after a successful init never surfaces as an unhandled rejection. procWatchTask already catches initialization failures, so its mutex slots release and the pool replenishes as intended. Update the stale comment in start() and the two existing init-timeout tests to assert initialize() now rejects; add a regression test for the never-sends-a-message wedge. Closes livekit#1748
🦋 Changeset detectedLatest commit: df6592f The changes in this PR will be included in the next version bump. This PR includes changesets to release 35 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
In
@livekit/agents≥1.4.x, a warming child process that dies or hangs before sending its first IPC message permanently wedges the worker's process pool: the worker keeps reporting available and acceptingreceived job request, but never launches another job until externally restarted.I hit this against the production wedge described in #1748 — the reporter saw 5 wedge events across 2 days in a contact-center deployment, each black-holing every job routed to the worker for hours.
Mechanism
SupervisedProc.initialize()only completes viaawait once(proc, 'message')(supervised_proc.ts:217). A child killed mid-prewarm (OOM, V8 heap abort, import crash) emits neither'message'nor'error', soonce()pends forever.initializeProcessTimeouttimer only rejects the sideinitfuture — it does not kill the child and does not unblockinitialize().ProcPool.procWatchTaskis parked atawait proc.initialize()(proc_pool.ts:109) holding bothinitMutexand itsprocMutexslot. WithnumIdleProcesses: 1that is the only slot, sorun()blocks atprocMutex.lock()forever,warmedProcQueuenever refills, and every accepted job blocks atwarmedProcQueue.get()— unbounded, no timeout.Fix
initialize()now races three signals so it always settles:once(proc, 'message')as today)initializeResponsethis.init— the timeout above (or any other init rejection)On timeout, the child is SIGKILLed so the exit-race can settle. Late race losers (
firstMessageafter a successful resolve viaexit, orexitafter a successful first message) attach a no-op.catchso a normal post-init child exit does not surface as an unhandled rejection.ProcPool.procWatchTaskalready has atry { ... } catch {}aroundproc.initialize()(proc_pool.ts:108-118) that releases the acquired slot on failure, so onceinitialize()throws, the pool replenishes as intended. The early-return when!proc.connectedis also converted to a throw for the same reason.A stale comment in
start()(supervised_proc.ts:99-102) said the run-catch intentionally avoids killing the child because killing would race theonce('message')and deadlockinitialize(). That race no longer exists; updated the comment.Test plan
initialize()now rejects (instead of silently resolving) and thatjoin()still resolves.initialize()must reject within the timeout — confirmed it pends past 2 s onmainand rejects in <200 ms after the fix.pnpm vitest run agents/src/ipc/supervised_proc.test.ts— 16/16 pass (13 existing + 3 in this area). Verified the new test fails onmainby reverting only the production change.pnpm build:agents— cleanpnpm lint— no new warnings on touched filespnpm format:check— cleanCloses #1748