ProcPool wedges permanently when a warming child dies before its first IPC message — initialize() never settles, worker keeps accepting jobs it can never launch #1748

## Summary

In `@livekit/agents` ≥1.4.x (verified on 1.4.4 and 1.4.5 — the relevant files are byte-identical), a warming child process that **dies or hangs during import/prewarm without sending its first IPC message** permanently wedges the worker's process pool: the worker stays registered and available, keeps logging `received job request` and accepting, but never launches another job until the process is externally restarted.

We hit this repeatedly in a production contact-center deployment (5 independent wedge events across 2 days, each black-holing all calls routed to the worker for hours). Related: the review on #1669 already confirmed "the bug is real in JS" before that PR was closed unmerged; see also #909 and #927.

## Mechanism (file:line refs from 1.4.4 dist)

1. `SupervisedProc.initialize()`'s only completion path is the child's first IPC message: `await once(this.proc, "message")` (`dist/ipc/supervised_proc.js:139`). A child killed mid-prewarm (e.g. kernel OOM, V8 heap abort, import crash in the agent module) emits neither `"message"` nor `"error"`: a `ChildProcess` emits `"error"` only when spawning, killing, or sending fails, while a process that simply dies emits `"exit"`/`"close"`. `events.once` listens only for the named event and `"error"`, so it **pends forever**.
2. The `initializeProcessTimeout` timer only rejects the side future `this.init` (`supervised_proc.js:123-125`). It does **not** kill the child and does **not** unblock `initialize()`.
3. The `init` rejection makes `run()` throw at `await this.init.await`, which `start()`'s catch swallows into a single WARN — `supervised process run failed` (`supervised_proc.js:46-49`). Because `run()` died at its first await, the exit listener / ping loop / memory monitor were never attached.
4. Meanwhile `ProcPool.procWatchTask` is still parked at `await proc.initialize()` (`dist/ipc/proc_pool.js:85`), holding **both** `initMutex` and its `procMutex` slot; the releases sit in a `finally` that is unreachable while `initialize()` pends. With `numIdleProcesses: 1` that is the **only** slot, so `run()` blocks at `procMutex.lock()` (`proc_pool.js:120`) forever and `warmedProcQueue` never refills. Every accepted job then blocks at `warmedProcQueue.get()` (`proc_pool.js:41`), which is unbounded and has no timeout.
5. The worker keeps reporting available (load loop is CPU-based), so the server keeps dispatching jobs into the black hole.
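
The pending behaviour in step 1 can be reproduced without the library. This is a minimal sketch, assuming only Node's `events` module: a plain `EventEmitter` stands in for the `ChildProcess`, and `settlesWithin` and `probe` are hypothetical helper names, not part of `@livekit/agents`.

```js
const { EventEmitter, once } = require("node:events");

// Hypothetical helper: reports how `promise` settled, or "pending" if it has
// not settled within `ms` milliseconds.
function settlesWithin(promise, ms) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve("pending"), ms);
  });
  const settled = promise.then(() => "resolved", () => "rejected");
  return Promise.race([settled, timeout]).finally(() => clearTimeout(timer));
}

// Waits for the first "message" the way initialize() does, then emits one
// event on the stand-in child and reports what happened to the wait.
async function probe(eventName, payload) {
  const fakeChild = new EventEmitter(); // stands in for a ChildProcess
  const firstMessage = once(fakeChild, "message");
  fakeChild.emit(eventName, payload);
  return settlesWithin(firstMessage, 50);
}

Promise.all([
  probe("message", { case: "initializeResponse" }),
  probe("error", new Error("spawn failed")),
  probe("exit", 137),
]).then((states) => console.log(states)); // [ 'resolved', 'rejected', 'pending' ]
```

Only the `"exit"` case leaves the wait pending, which is the case a killed child produces.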

Production log signature: exactly one `supervised process run failed` WARN per wedge (≈60 s after the last successful prewarm), then `received job request` lines forever with zero launches. We matched this 5/5 against our wedge onsets.

## Regression vs 1.0.46

In 1.0.46, `start()` called `this.run()` bare (no `.catch`), so the same init-timeout rejection escaped as an **unhandledRejection** — in our deployment that crashed the worker process and the orchestrator restarted it. Accidentally self-healing. 1.4.x's added `.catch` turns the same failure into a silent permanent wedge. (Not arguing for the crash — arguing the timeout should actually recover the pool.)
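
The difference between the two `start()` styles can be shown in isolation. This is a hedged sketch, not the library source: `run`, `startBare`, `startCaught`, and `countUnhandled` are hypothetical reductions of the behaviour described above, and the WARN text is copied from the log signature.

```js
// Stand-in for run() failing on the init-timeout rejection.
async function run() {
  throw new Error("runner initialization timed out");
}

// 1.0.46 style as described above: the rejection escapes.
const startBare = () => {
  run();
};

// 1.4.x style as described above: the rejection becomes a single WARN.
const startCaught = () => {
  run().catch((err) => console.warn("supervised process run failed:", err.message));
};

// Counts unhandledRejection events raised while `start` runs.
function countUnhandled(start) {
  return new Promise((resolve) => {
    let count = 0;
    const onUnhandled = () => {
      count += 1;
    };
    process.on("unhandledRejection", onUnhandled);
    start();
    setTimeout(() => {
      process.off("unhandledRejection", onUnhandled);
      resolve(count);
    }, 20);
  });
}

// Run the two cases one after the other so their listeners do not overlap.
const demo = (async () => {
  const bare = await countUnhandled(startBare);
  const caught = await countUnhandled(startCaught);
  return { bare, caught };
})();

demo.then((result) => console.log(result)); // { bare: 1, caught: 0 }
```

Without a listener installed, current Node versions terminate the process on the bare case by default, which matches the crash-and-restart behaviour we saw on 1.0.46.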

## Minimal repro

```js
const path = require("path");
const { fork } = require("child_process");
const fs = require("fs");
const base = path.dirname(require.resolve("@livekit/agents")); // dist/
const { initializeLogger } = require(path.join(base, "log.cjs"));
initializeLogger({ pretty: false, level: "silent" });
const { SupervisedProc } = require(path.join(base, "ipc/supervised_proc.cjs"));

const hang = "/tmp/lk-hang.cjs";
fs.writeFileSync(hang, "setInterval(() => {}, 1000);\n"); // never sends a message

const p = new SupervisedProc(300 /* initializeTimeout */, 500, 0, 0, 60000, 60000, 60000);
p.createProcess = () => fork(hang, [], { stdio: ["ignore", "ignore", "ignore", "ipc"] });

(async () => {
  await p.start();
  const outcome = await Promise.race([
    p.initialize().then(() => "resolved", (e) => "rejected: " + e.message),
    new Promise((r) => setTimeout(() => r("STILL PENDING AFTER 2s (WEDGED)"), 2000)),
  ]);
  console.log("initialize() outcome:", outcome); // 1.4.4/1.4.5: STILL PENDING AFTER 2s (WEDGED)
  p.proc?.kill("SIGKILL"); // don't leave the hanging child running after the repro exits
  process.exit(0);
})();
```

(The same scenario with a child that exits before its first message also wedges. A child whose first message isn't `initializeResponse` wedges differently: the throw happens after `clearTimeout`, so the timer can no longer reject `init`; `procWatchTask`'s empty catch then proceeds to `await proc.join()`, where `init` never settles. That variant wedges with zero log lines.)

## Suggested fix

Make `initialize()` always settle and reclaim the child:
- race the first message against the child's `exit` event and the init timeout,
- on any failure: kill the child, reject `init`, and **throw**, so `procWatchTask`'s catch/finally release the mutex slots and the pool replenishes.

We are running exactly that as a local patch (diff below) in production; happy to open it as a PR if useful. One caveat: the `once(proc, "exit")` promise that loses the race must have a rejection handler attached up front, otherwise every *normal* child exit after a successful init becomes an unhandledRejection.

```diff
   async initialize() {
     var _a;
     const timer = setTimeout(() => {
       this.init.reject(new Error("runner initialization timed out"));
+      try { this.proc?.kill("SIGKILL"); } catch {}
     }, this.#opts.initializeTimeout);
     if (!((_a = this.proc) == null ? void 0 : _a.connected)) {
-      this.init.reject(new Error("process not connected"));
-      return;
+      const err = new Error("process not connected");
+      this.init.reject(err);
+      clearTimeout(timer);
+      throw err;
     }
     this.proc.send({ case: "initializeRequest", value: { ... } });
-    await once(this.proc, "message").then(([msg]) => {
-      clearTimeout(timer);
-      if (msg.case !== "initializeResponse") {
-        throw new Error("first message must be InitializeResponse");
-      }
-    });
+    const firstMessage = once(this.proc, "message").then(([msg]) => {
+      if (msg.case !== "initializeResponse") throw new Error("first message must be InitializeResponse");
+    });
+    const exited = once(this.proc, "exit").then(() => {
+      throw new Error("process exited before initialization completed");
+    });
+    firstMessage.catch(() => {});
+    exited.catch(() => {}); // late race losers must not become unhandledRejection
+    try {
+      await Promise.race([firstMessage, exited, this.init.await]);
+    } catch (err) {
+      this.init.reject(err);
+      try { this.proc?.kill("SIGKILL"); } catch {}
+      throw err;
+    } finally {
+      clearTimeout(timer);
+    }
     this.init.resolve();
   }
```
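
The same race can be exercised standalone. This is a sketch under stated assumptions, not the patched library code: `waitForInit` and `makeFakeChild` are hypothetical names, an `EventEmitter` with a fake `kill()` stands in for the `ChildProcess`, and a local timeout promise replaces `this.init`.

```js
const { EventEmitter, once } = require("node:events");

// Hypothetical stand-in for a ChildProcess: kill() records the call and emits "exit".
function makeFakeChild() {
  const child = new EventEmitter();
  child.killed = false;
  child.kill = (signal) => {
    child.killed = true;
    child.emit("exit", null, signal);
    return true;
  };
  return child;
}

// Settles on the first message, on child exit, or on timeout, whichever comes first.
async function waitForInit(child, timeoutMs) {
  let timer;
  const timedOut = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("runner initialization timed out")), timeoutMs);
  });
  const firstMessage = once(child, "message").then(([msg]) => {
    if (msg.case !== "initializeResponse") throw new Error("first message must be InitializeResponse");
  });
  const exited = once(child, "exit").then(() => {
    throw new Error("process exited before initialization completed");
  });
  // Pre-handle every racer so a late loser never becomes an unhandledRejection.
  for (const racer of [firstMessage, exited, timedOut]) racer.catch(() => {});
  try {
    await Promise.race([firstMessage, exited, timedOut]);
  } catch (err) {
    child.kill("SIGKILL"); // reclaim the child on any failure
    throw err;
  } finally {
    clearTimeout(timer);
  }
}

// Starts the wait, applies `trigger` to the child, and reports the outcome.
async function outcome(trigger) {
  const child = makeFakeChild();
  const pending = waitForInit(child, 50);
  trigger(child);
  try {
    await pending;
    return "initialized";
  } catch (err) {
    return `${err.message} (killed=${child.killed})`;
  }
}

const demo = Promise.all([
  outcome((c) => c.emit("message", { case: "initializeResponse" })),
  outcome((c) => c.emit("exit", 137, null)),
  outcome(() => {}), // child hangs: only the timeout can settle the wait
]);

demo.then((results) => console.log(results));
```

All three cases settle: the first reports `initialized`, and the other two reject with the child marked as killed, which is what lets the caller's catch/finally release its slots.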

## Environment

- `@livekit/agents` 1.4.4 (also inspected 1.4.5 — `supervised_proc.js` / `proc_pool.js` identical)
- Node.js 24.x, Linux (ECS/EC2), self-hosted workers (`apiKey`/`apiSecret`, no `workerToken`)
- `numIdleProcesses: 1`, `initializeProcessTimeout: 60s`; trigger in our case was memory pressure killing/stalling the warming child during a heavy import+VAD prewarm
