ProcPool wedges permanently when a warming child dies before its first IPC message — initialize() never settles, worker keeps accepting jobs it can never launch #1748

@rinarakaki

Summary

In @livekit/agents ≥1.4.x (verified on 1.4.4 and 1.4.5; the relevant files are byte-identical), a warming child process that dies or hangs during import/prewarm without sending its first IPC message permanently wedges the worker's process pool: the worker stays registered and available, keeps logging "received job request" and accepting jobs, but never launches another job until the worker process is externally restarted.

We hit this repeatedly in a production contact-center deployment (5 independent wedge events across 2 days, each black-holing all calls routed to the worker for hours). Related: the review on #1669 already confirmed "the bug is real in JS" before that PR was closed unmerged; see also #909 and #927.

Mechanism (file:line refs from 1.4.4 dist)

  1. SupervisedProc.initialize()'s only completion path is the child's first IPC message: await once(this.proc, "message") (dist/ipc/supervised_proc.js:139). A child killed mid-prewarm (e.g. kernel OOM, V8 heap abort, import crash in the agent module) emits neither "message" nor "error" — normal exit fires no "error" — so events.once pends forever.
  2. The initializeProcessTimeout timer only rejects the side future this.init (supervised_proc.js:123-125). It does not kill the child and does not unblock initialize().
  3. The init rejection makes run() throw at await this.init.await, which start()'s catch swallows into a single WARN — supervised process run failed (supervised_proc.js:46-49). Because run() died at its first await, the exit listener / ping loop / memory monitor were never attached.
  4. Meanwhile ProcPool.procWatchTask is still parked at await proc.initialize() (dist/ipc/proc_pool.js:85), holding both initMutex and its procMutex slot — the releases sit in a finally that is unreachable while initialize() pends. With numIdleProcesses: 1 that is the only slot, so run() blocks at procMutex.lock() (proc_pool.js:120) forever, warmedProcQueue never refills, and every accepted job blocks at warmedProcQueue.get() (proc_pool.js:41) — unbounded, no timeout.
  5. The worker keeps reporting available (load loop is CPU-based), so the server keeps dispatching jobs into the black hole.
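Steps 1 and 4 can be checked in isolation, without @livekit/agents. The sketch below is only an illustration under stated assumptions: a plain EventEmitter stands in for the child process, and `slotsAfter` and the one-slot `pool` object are hypothetical names, not library code. It shows that `events.once` waiting for "message" does not settle on "exit", so a release placed in `finally` never runs.

```javascript
const { EventEmitter, once } = require("events");

// Hypothetical helper: runs a watch task against a fake child, lets `emit`
// fire one event on it, and reports how many slots are free `ms` later.
async function slotsAfter(emit, ms = 50) {
  const pool = { freeSlots: 1 }; // stand-in for numIdleProcesses: 1
  const child = new EventEmitter(); // stand-in for a ChildProcess

  const watch = (async () => {
    pool.freeSlots -= 1; // take the only slot
    try {
      // Settles on "message" (resolve) or "error" (reject), never on "exit".
      await once(child, "message");
    } finally {
      pool.freeSlots += 1; // unreachable while the await above pends
    }
  })();
  watch.catch(() => {}); // the "error" case rejects; keep it handled

  setImmediate(() => emit(child));
  await new Promise((resolve) => setTimeout(resolve, ms));
  return pool.freeSlots;
}

// A child killed before its first message: the slot is never returned.
slotsAfter((c) => c.emit("exit", null, "SIGKILL")).then(console.log); // 0
```

Emitting "message" or "error" instead returns the slot, which is why only the silent-exit and hang cases wedge.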

Production log signature: exactly one "supervised process run failed" WARN per wedge (≈60 s after the last successful prewarm), followed by "received job request" lines indefinitely with zero launches. All 5 of our wedge onsets matched this signature.

Regression vs 1.0.46

In 1.0.46, start() called this.run() bare (no .catch), so the same init-timeout rejection escaped as an unhandledRejection — in our deployment that crashed the worker process and the orchestrator restarted it. Accidentally self-healing. 1.4.x's added .catch turns the same failure into a silent permanent wedge. (Not arguing for the crash — arguing the timeout should actually recover the pool.)
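The difference between the two start() shapes can be reproduced with plain promises. This is a minimal sketch, not the library's code: `run`, `startBare`, and `startWithCatch` are hypothetical stand-ins, and the logger is passed in as a callback.

```javascript
// Hypothetical stand-in for run(): rejects the way an init timeout would.
async function run() {
  throw new Error("runner initialization timed out");
}

// 1.0.46-style start(): nothing handles the rejection, so Node reports an
// unhandledRejection (fatal by default on current Node versions).
function startBare() {
  run();
}

// 1.4.x-style start(): the rejection is reduced to one log line and dropped.
function startWithCatch(log) {
  run().catch(() => log("supervised process run failed"));
}
```

With the second shape the process survives, so whatever was waiting on the failed initialization stays stuck unless the timeout path also cleans up.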

Minimal repro

const path = require("path");
const { fork } = require("child_process");
const fs = require("fs");
const base = path.dirname(require.resolve("@livekit/agents")); // dist/
const { initializeLogger } = require(path.join(base, "log.cjs"));
initializeLogger({ pretty: false, level: "silent" });
const { SupervisedProc } = require(path.join(base, "ipc/supervised_proc.cjs"));

const hang = "/tmp/lk-hang.cjs";
fs.writeFileSync(hang, "setInterval(() => {}, 1000);\n"); // never sends a message

const p = new SupervisedProc(300 /* initializeTimeout */, 500, 0, 0, 60000, 60000, 60000);
p.createProcess = () => fork(hang, [], { stdio: ["ignore", "ignore", "ignore", "ipc"] });

(async () => {
  await p.start();
  const outcome = await Promise.race([
    p.initialize().then(() => "resolved", (e) => "rejected: " + e.message),
    new Promise((r) => setTimeout(() => r("STILL PENDING AFTER 2s (WEDGED)"), 2000)),
  ]);
  console.log("initialize() outcome:", outcome); // 1.4.4/1.4.5: STILL PENDING AFTER 2s (WEDGED)
  process.exit(0);
})();

(Two variants also wedge. A child that exits before its first message behaves the same as the hanging child above. A child whose first message isn't initializeResponse throws after clearTimeout has already run; procWatchTask's empty catch then proceeds to await proc.join(), where init never settles, so the pool wedges with zero log lines.)

Suggested fix

Make initialize() always settle and reclaim the child:

  • race the first message against the child's exit event and the init timeout,
  • on any failure: kill the child, reject init, and throw, so procWatchTask's catch/finally release the mutex slots and the pool replenishes.

We are running exactly that as a local patch (diff below) in production; happy to open it as a PR if useful. Care is needed that a late-losing once(proc, "exit") rejection is pre-handled, otherwise every normal child exit after a successful init becomes an unhandledRejection.

   async initialize() {
     var _a;
     const timer = setTimeout(() => {
       this.init.reject(new Error("runner initialization timed out"));
+      try { this.proc?.kill("SIGKILL"); } catch {}
     }, this.#opts.initializeTimeout);
     if (!((_a = this.proc) == null ? void 0 : _a.connected)) {
-      this.init.reject(new Error("process not connected"));
-      return;
+      const err = new Error("process not connected");
+      this.init.reject(err);
+      clearTimeout(timer);
+      throw err;
     }
     this.proc.send({ case: "initializeRequest", value: { ... } });
-    await once(this.proc, "message").then(([msg]) => {
-      clearTimeout(timer);
-      if (msg.case !== "initializeResponse") {
-        throw new Error("first message must be InitializeResponse");
-      }
-    });
+    const firstMessage = once(this.proc, "message").then(([msg]) => {
+      if (msg.case !== "initializeResponse") throw new Error("first message must be InitializeResponse");
+    });
+    const exited = once(this.proc, "exit").then(() => {
+      throw new Error("process exited before initialization completed");
+    });
+    firstMessage.catch(() => {});
+    exited.catch(() => {}); // late race losers must not become unhandledRejection
+    try {
+      await Promise.race([firstMessage, exited, this.init.await]);
+    } catch (err) {
+      this.init.reject(err);
+      try { this.proc?.kill("SIGKILL"); } catch {}
+      throw err;
+    } finally {
+      clearTimeout(timer);
+    }
     this.init.resolve();
   }
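The same race can be exercised outside the library. The following is a sketch under assumptions, not the patch itself: `waitForInit` is a hypothetical helper, the timeout is a local promise rather than the `init` future, and `child` is anything that emits "message"/"exit" and has a `kill()` method.

```javascript
const { EventEmitter, once } = require("events");

// Hypothetical helper: settles on whichever comes first of the first
// "message", the child's "exit", or the timeout; kills the child on failure.
async function waitForInit(child, timeoutMs) {
  let timer;
  const timedOut = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error("runner initialization timed out")),
      timeoutMs,
    );
  });
  const firstMessage = once(child, "message").then(([msg]) => {
    if (msg.case !== "initializeResponse") {
      throw new Error("first message must be InitializeResponse");
    }
  });
  const exited = once(child, "exit").then(() => {
    throw new Error("process exited before initialization completed");
  });

  // Pre-handle every branch: a branch that loses the race may still reject
  // later (e.g. a normal exit long after a successful init).
  firstMessage.catch(() => {});
  exited.catch(() => {});
  timedOut.catch(() => {});

  try {
    await Promise.race([firstMessage, exited, timedOut]);
  } catch (err) {
    child.kill("SIGKILL"); // reclaim the child so the caller can replace it
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

Because the function always resolves or rejects, a caller that releases its locks in `finally` gets to run that block in all three cases.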

Environment

  • @livekit/agents 1.4.4 (also inspected 1.4.5 — supervised_proc.js / proc_pool.js identical)
  • Node.js 24.x, Linux (ECS/EC2), self-hosted workers (apiKey/apiSecret, no workerToken)
  • numIdleProcesses: 1, initializeProcessTimeout: 60s; trigger in our case was memory pressure killing/stalling the warming child during a heavy import+VAD prewarm
