fix(ipc): make SupervisedProc.initialize() always settle (closes #1748 wedge) #1768

Open
tsushanth wants to merge 1 commit into livekit:main from tsushanth:fix/supervised-proc-init-wedge

@tsushanth

Summary

In @livekit/agents ≥1.4.x, a warming child process that dies or hangs before sending its first IPC message permanently wedges the worker's process pool: the worker keeps reporting itself as available and keeps accepting job requests, but never launches another job until it is externally restarted.

I investigated this against the production wedge described in #1748, where the reporter saw 5 wedge events across 2 days in a contact-center deployment, each black-holing every job routed to the worker for hours.

Mechanism

  1. SupervisedProc.initialize() only completes via await once(proc, 'message') (supervised_proc.ts:217). A child killed mid-prewarm (OOM, V8 heap abort, import crash) emits neither 'message' nor 'error', so once() pends forever.
  2. The initializeProcessTimeout timer only rejects the side init future — it does not kill the child and does not unblock initialize().
  3. ProcPool.procWatchTask is parked at await proc.initialize() (proc_pool.ts:109) holding both initMutex and its procMutex slot. With numIdleProcesses: 1 that is the only slot, so run() blocks at procMutex.lock() forever, warmedProcQueue never refills, and every accepted job blocks at warmedProcQueue.get() — unbounded, no timeout.
  4. The worker's load loop is CPU-based, so a wedged worker still looks lightly loaded and the server keeps dispatching jobs into the black hole.
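
Step 1 of the mechanism can be reproduced in isolation. The sketch below is not the code from supervised_proc.ts; it uses a plain EventEmitter as a hypothetical stand-in for the forked child (fakeDeadChild and settlesWithin are illustrative names) and shows that a promise from once(child, 'message') stays pending when the only event emitted is 'exit'.

```typescript
import { EventEmitter, once } from 'node:events';

// Hypothetical stand-in for a child killed mid-prewarm: it emits 'exit'
// but never 'message' or 'error'.
function fakeDeadChild(): EventEmitter {
  const child = new EventEmitter();
  setTimeout(() => child.emit('exit', null, 'SIGKILL'), 10);
  return child;
}

// Reports whether a promise settles (resolves or rejects) within `ms`.
async function settlesWithin(p: Promise<unknown>, ms: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), ms);
  });
  const settled = p.then(
    () => true,
    () => true,
  );
  const result = await Promise.race([settled, timeout]);
  clearTimeout(timer);
  return result;
}

async function demo(): Promise<boolean> {
  const child = fakeDeadChild();
  // Mirrors the old wait: only 'message' (or 'error') can settle this promise,
  // so the 'exit' emitted after 10 ms goes unnoticed.
  const firstMessage = once(child, 'message');
  return settlesWithin(firstMessage, 50);
}
```

Anything awaiting firstMessage alone, as initialize() did, inherits that pending state, which is what parks procWatchTask in step 3.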

Fix

initialize() now races three signals so it always settles:

  1. firstMessage — the happy path (the same once(proc, 'message') as today)
  2. exit — the child crashed/exited before initializeResponse
  3. this.init — the timeout above (or any other init rejection)

On timeout, the child is SIGKILLed so the exit race can settle. Late race losers (firstMessage after initialize() has already settled via exit, or exit after a successful first message) attach a no-op .catch so a normal post-init child exit does not surface as an unhandled rejection.
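
For readers who want the shape of the race without opening the diff, here is a minimal sketch under stated assumptions. It is not the patched initialize(): ChildLike, FakeChild, and initializeSketch are hypothetical names, the timeout is modeled as a local timer rather than the existing init future, and the fake child simply emits 'exit' when killed.

```typescript
import { EventEmitter, once } from 'node:events';

// Hypothetical minimal surface of a child process for this sketch.
interface ChildLike extends EventEmitter {
  kill(signal: string): void;
}

// Hypothetical fake child: records the kill signal and then emits 'exit'.
class FakeChild extends EventEmitter implements ChildLike {
  killedWith: string | undefined;
  kill(signal: string): void {
    this.killedWith = signal;
    setImmediate(() => this.emit('exit', null, signal));
  }
}

// Settles on whichever comes first: first message, child exit, or timeout.
async function initializeSketch(proc: ChildLike, timeoutMs: number): Promise<unknown> {
  const firstMessage = once(proc, 'message').then(([msg]) => msg);
  const exit = once(proc, 'exit').then(([code, signal]) => {
    throw new Error(`child exited before init (code=${code}, signal=${signal})`);
  });
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      proc.kill('SIGKILL'); // make sure a hung child actually goes away
      reject(new Error('initialization timed out'));
    }, timeoutMs);
  });
  // Late race losers swallow their own rejection so that, for example, a
  // normal exit after a successful init is not an unhandled rejection.
  firstMessage.catch(() => {});
  exit.catch(() => {});
  timeout.catch(() => {});
  try {
    return await Promise.race([firstMessage, exit, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

With this shape, a child that sends a message resolves the call, a child that exits early rejects it, and a child that hangs is killed and rejects it once timeoutMs elapses; in no case does the caller wait indefinitely.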

ProcPool.procWatchTask already has a try { ... } catch {} around proc.initialize() (proc_pool.ts:108-118) that releases the acquired slot on failure, so once initialize() throws, the pool replenishes as intended. The early-return when !proc.connected is also converted to a throw for the same reason.
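
The reason a throwing initialize() is enough can be sketched with a toy slot counter. This is an illustration only: SlotSketch and watchSketch are hypothetical and do not mirror the mutex types or control flow in proc_pool.ts.

```typescript
// Hypothetical counting slot standing in for the pool's procMutex.
class SlotSketch {
  private free: number;
  constructor(slots: number) {
    this.free = slots;
  }
  tryAcquire(): boolean {
    if (this.free === 0) return false;
    this.free -= 1;
    return true;
  }
  release(): void {
    this.free += 1;
  }
  get available(): number {
    return this.free;
  }
}

type WatchResult = 'busy' | 'warmed' | 'failed';

// Acquires a slot, awaits initialization, and releases the slot on failure.
async function watchSketch(
  slot: SlotSketch,
  initialize: () => Promise<void>,
): Promise<WatchResult> {
  if (!slot.tryAcquire()) return 'busy';
  try {
    await initialize();
    return 'warmed'; // slot stays held while the warmed process is in use
  } catch {
    slot.release(); // failed init: free the slot so the pool can try again
    return 'failed';
  }
}
```

If initialize() rejects, the catch branch runs and the slot is free again. If initialize() never settles, neither branch runs and the slot stays held, which is the wedge this PR removes.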

A stale comment in start() (supervised_proc.ts:99-102) said the run-catch intentionally avoids killing the child because killing would race the once('message') and deadlock initialize(). That race no longer exists; updated the comment.

Test plan

Closes #1748

…wedge)

initialize() only completed via `await once(proc, 'message')`, so a
warming child that died or hung before sending its first IPC message
left initialize() pending forever. The init timeout rejected the side
`init` future but did not unblock initialize() itself, so
ProcPool.procWatchTask was parked at `await proc.initialize()` holding
both initMutex and its procMutex slot — the worker kept reporting
available and accepting jobs that could never launch. In production this
black-holed every job routed to the worker until the process was
externally restarted.

initialize() now races three signals — first message, child exit, and
the init timeout — and SIGKILLs the child on timeout. Late race losers
swallow their own rejection so a normal child exit after a successful
init never surfaces as an unhandled rejection. procWatchTask already
catches initialization failures, so its mutex slots release and the
pool replenishes as intended. Update the stale comment in start() and
the two existing init-timeout tests to assert initialize() now rejects;
add a regression test for the never-sends-a-message wedge.

Closes livekit#1748
@changeset-bot

changeset-bot Bot commented Jun 11, 2026

🦋 Changeset detected

Latest commit: df6592f

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 35 packages
Name Type
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-assemblyai Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-cerebras Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-did Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-fishaudio Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-hume Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-lemonslice Patch
@livekit/agents-plugin-liveavatar Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-minimax Patch
@livekit/agents-plugin-mistral Patch
@livekit/agents-plugin-mistralai Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-perplexity Patch
@livekit/agents-plugin-phonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-runway Patch
@livekit/agents-plugin-sarvam Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugin-soniox Patch
@livekit/agents-plugin-tavus Patch
@livekit/agents-plugins-test Patch
@livekit/agents-plugin-trugen Patch
@livekit/agents-plugin-xai Patch

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

@CLAassistant

CLAassistant commented Jun 11, 2026

CLA assistant check
All committers have signed the CLA.
