Summary
In @livekit/agents ≥1.4.x (verified on 1.4.4 and 1.4.5; the relevant files are byte-identical), a warming child process that dies or hangs during import/prewarm without sending its first IPC message permanently wedges the worker's process pool: the worker stays registered and available, keeps logging "received job request" and accepting jobs, but never launches another job until the process is externally restarted.
We hit this repeatedly in a production contact-center deployment (5 independent wedge events across 2 days, each black-holing all calls routed to the worker for hours). Related: the review on #1669 already confirmed "the bug is real in JS" before that PR was closed unmerged; see also #909 and #927.
Mechanism (file:line refs from 1.4.4 dist)
- SupervisedProc.initialize()'s only completion path is the child's first IPC message: await once(this.proc, "message") (dist/ipc/supervised_proc.js:139). A child killed mid-prewarm (e.g. kernel OOM, V8 heap abort, import crash in the agent module) emits neither "message" nor "error" (exit alone fires no "error"), so events.once pends forever.
- The initializeProcessTimeout timer only rejects the side future this.init (supervised_proc.js:123-125). It does not kill the child and does not unblock initialize().
- The init rejection makes run() throw at await this.init.await, which start()'s catch swallows into a single WARN, "supervised process run failed" (supervised_proc.js:46-49). Because run() died at its first await, the exit listener / ping loop / memory monitor were never attached.
- Meanwhile ProcPool.procWatchTask is still parked at await proc.initialize() (dist/ipc/proc_pool.js:85), holding both initMutex and its procMutex slot; the releases sit in a finally that is unreachable while initialize() pends. With numIdleProcesses: 1 that is the only slot, so run() blocks at procMutex.lock() (proc_pool.js:120) forever, warmedProcQueue never refills, and every accepted job blocks at warmedProcQueue.get() (proc_pool.js:41), unbounded, with no timeout.
- The worker keeps reporting available (load loop is CPU-based), so the server keeps dispatching jobs into the black hole.
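The chain above can be modelled without the library. The following is a minimal sketch, not the library's code: makeModel, slotFree, warmed and initRejected are hypothetical names, a plain EventEmitter stands in for the forked child, and a boolean stands in for the single procMutex slot. It only assumes what the bullets describe: a timer that rejects a side value, an await that only "message" can complete, and a release that lives in a finally.

```javascript
const { EventEmitter, once } = require("node:events");

// Hypothetical one-slot pool: `slotFree` stands in for procMutex with numIdleProcesses: 1.
function makeModel() {
  const state = { slotFree: true, warmed: 0, initRejected: false };
  const child = new EventEmitter(); // stand-in for the forked child process

  // Same shape as initialize(): the timer only flips a side flag, while the
  // await below can be settled only by "message" (or rejected by "error").
  async function initialize(timeoutMs) {
    const timer = setTimeout(() => {
      state.initRejected = true;
    }, timeoutMs);
    await once(child, "message");
    clearTimeout(timer);
  }

  // Same shape as procWatchTask: the slot is released in a finally.
  async function watch(timeoutMs) {
    state.slotFree = false;
    try {
      await initialize(timeoutMs);
      state.warmed += 1;
    } finally {
      state.slotFree = true; // never reached while initialize() pends
    }
  }

  return { state, child, watch };
}

async function demo() {
  const { state, child, watch } = makeModel();
  watch(20); // intentionally not awaited: it never settles in this scenario
  setTimeout(() => child.emit("exit", null, "SIGKILL"), 5); // child dies, no message
  await new Promise((resolve) => setTimeout(resolve, 100));
  return state;
}

demo().then((state) => console.log(state));
```

In this model the timeout fires and the child has exited, yet the slot is still held and nothing was warmed, which is the wedge condition described above.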
Production log signature: exactly one "supervised process run failed" WARN per wedge (≈60 s after the last successful prewarm), then "received job request" lines forever with zero launches. We matched this 5/5 against our wedge onsets.
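For anyone wanting to check their own logs for this signature, here is a rough sketch of a scanner. The two substrings are the ones quoted in this report; looksWedged, isLaunchLine and minRequests are hypothetical, and the launch predicate is left to the caller because the exact text of a job-launch log line is not something this report establishes.

```javascript
// Log substrings quoted in this report.
const RUN_FAILED = "supervised process run failed";
const JOB_REQUEST = "received job request";

// Returns true when a run-failed line is followed by at least `minRequests`
// job-request lines with no launch line in between.
// `isLaunchLine` is a caller-supplied predicate (an assumption, see above).
function looksWedged(lines, isLaunchLine, minRequests = 3) {
  let armed = false;
  let requests = 0;
  for (const line of lines) {
    if (line.includes(RUN_FAILED)) {
      armed = true;
      requests = 0;
    } else if (armed && isLaunchLine(line)) {
      armed = false; // the pool launched a job after the failure, so no wedge
    } else if (armed && line.includes(JOB_REQUEST)) {
      requests += 1;
      if (requests >= minRequests) return true;
    }
  }
  return false;
}
```

The threshold is arbitrary; a deployment with sparse traffic may prefer a time window over a request count.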
Regression vs 1.0.46
In 1.0.46, start() called this.run() bare (no .catch), so the same init-timeout rejection escaped as an unhandledRejection — in our deployment that crashed the worker process and the orchestrator restarted it. Accidentally self-healing. 1.4.x's added .catch turns the same failure into a silent permanent wedge. (Not arguing for the crash — arguing the timeout should actually recover the pool.)
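The difference between the two start() shapes can be shown in isolation. This is a simplified sketch rather than the library's code: startBare, startCaught and observe are hypothetical names, and a listener is installed for "unhandledRejection" so the demo can record the event instead of letting Node's default behaviour terminate the process.

```javascript
// 1.0.46 shape: the promise is dropped, so a rejection goes unhandled.
function startBare(run) {
  run();
}

// 1.4.x shape: the rejection is caught and reduced to one log line.
function startCaught(run, warn) {
  run().catch(() => warn("supervised process run failed"));
}

// Records what the process observes for one start() variant.
function observe(start) {
  return new Promise((resolve) => {
    const seen = [];
    const failure = new Error("runner initialization timed out");
    // Stand-in for run() throwing at its first await after the init timeout.
    const run = async () => {
      throw failure;
    };
    const onUnhandled = (reason) => {
      if (reason === failure) seen.push("unhandledRejection: " + reason.message);
    };
    process.on("unhandledRejection", onUnhandled);
    start(run, (line) => seen.push("WARN: " + line));
    setTimeout(() => {
      process.off("unhandledRejection", onUnhandled);
      resolve(seen);
    }, 20);
  });
}

async function demo() {
  return { bare: await observe(startBare), caught: await observe(startCaught) };
}

demo().then((result) => console.log(result));
```

Neither shape recovers the pool by itself: the bare call only surfaces the failure loudly enough for an external supervisor to react, while the caught call hides it.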
Minimal repro
const path = require("path");
const { fork } = require("child_process");
const fs = require("fs");
// Load the internals directly from the installed package's dist/ directory.
const base = path.dirname(require.resolve("@livekit/agents")); // dist/
const { initializeLogger } = require(path.join(base, "log.cjs"));
initializeLogger({ pretty: false, level: "silent" });
const { SupervisedProc } = require(path.join(base, "ipc/supervised_proc.cjs"));
// Child script that stays alive but never talks to the parent.
const hang = "/tmp/lk-hang.cjs";
fs.writeFileSync(hang, "setInterval(() => {}, 1000);\n"); // never sends a message
const p = new SupervisedProc(300 /* initializeTimeout */, 500, 0, 0, 60000, 60000, 60000);
p.createProcess = () => fork(hang, [], { stdio: ["ignore", "ignore", "ignore", "ipc"] });
(async () => {
  await p.start();
  // initialize() should settle once the 300 ms init timeout fires; probe it for 2 s.
  const outcome = await Promise.race([
    p.initialize().then(() => "resolved", (e) => "rejected: " + e.message),
    new Promise((r) => setTimeout(() => r("STILL PENDING AFTER 2s (WEDGED)"), 2000)),
  ]);
  console.log("initialize() outcome:", outcome); // 1.4.4/1.4.5: STILL PENDING AFTER 2s (WEDGED)
  process.exit(0);
})();
(Same scenario with a child that exits before its first message also wedges; with a child whose first message isn't initializeResponse, the throw happens after clearTimeout and procWatchTask's empty catch proceeds to await proc.join() where init never settles — wedge with zero log lines.)
Suggested fix
Make initialize() always settle and reclaim the child:
- race the first message against the child's exit event and the init timeout,
- on any failure: kill the child, reject init, and throw, so procWatchTask's catch/finally release the mutex slots and the pool replenishes.
We are running exactly that as a local patch (diff below) in production; happy to open it as a PR if useful. Care is needed that a late-losing once(proc, "exit") rejection is pre-handled, otherwise every normal child exit after a successful init becomes an unhandledRejection.
  async initialize() {
    var _a;
    const timer = setTimeout(() => {
      this.init.reject(new Error("runner initialization timed out"));
+     try { this.proc?.kill("SIGKILL"); } catch {}
    }, this.#opts.initializeTimeout);
    if (!((_a = this.proc) == null ? void 0 : _a.connected)) {
-     this.init.reject(new Error("process not connected"));
-     return;
+     const err = new Error("process not connected");
+     this.init.reject(err);
+     clearTimeout(timer);
+     throw err;
    }
    this.proc.send({ case: "initializeRequest", value: { ... } });
-   await once(this.proc, "message").then(([msg]) => {
-     clearTimeout(timer);
-     if (msg.case !== "initializeResponse") {
-       throw new Error("first message must be InitializeResponse");
-     }
-   });
+   const firstMessage = once(this.proc, "message").then(([msg]) => {
+     if (msg.case !== "initializeResponse") throw new Error("first message must be InitializeResponse");
+   });
+   const exited = once(this.proc, "exit").then(() => {
+     throw new Error("process exited before initialization completed");
+   });
+   firstMessage.catch(() => {});
+   exited.catch(() => {}); // late race losers must not become unhandledRejection
+   try {
+     await Promise.race([firstMessage, exited, this.init.await]);
+   } catch (err) {
+     this.init.reject(err);
+     try { this.proc?.kill("SIGKILL"); } catch {}
+     throw err;
+   } finally {
+     clearTimeout(timer);
+   }
    this.init.resolve();
  }
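The race used in the patch can also be exercised on its own, outside the library. The sketch below is an illustration under stated assumptions, not the patch itself: waitForInit and outcome are hypothetical helpers, a plain EventEmitter stands in for the child, and a local timeout promise replaces this.init.await. It covers the three cases that matter: a child that answers, a child that exits first, and a child that hangs.

```javascript
const { EventEmitter, once } = require("node:events");

// Hypothetical helper: settles on the first message, on exit, or on timeout.
async function waitForInit(child, timeoutMs) {
  let timer;
  const timedOut = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("runner initialization timed out")), timeoutMs);
  });
  const firstMessage = once(child, "message").then(([msg]) => {
    if (msg.case !== "initializeResponse") {
      throw new Error("first message must be InitializeResponse");
    }
  });
  const exited = once(child, "exit").then(() => {
    throw new Error("process exited before initialization completed");
  });
  // Pre-handle every racer so a late-losing rejection is never reported as unhandled.
  for (const racer of [timedOut, firstMessage, exited]) racer.catch(() => {});
  try {
    await Promise.race([firstMessage, exited, timedOut]);
  } catch (err) {
    // Reclaim a real child; the EventEmitter stand-ins below have no kill().
    if (typeof child.kill === "function") child.kill("SIGKILL");
    throw err;
  } finally {
    clearTimeout(timer);
  }
}

// Runs one scenario against a fresh stand-in child and reports how it settled.
function outcome(setup) {
  const child = new EventEmitter();
  setup(child);
  return waitForInit(child, 50).then(() => "resolved", (e) => "rejected: " + e.message);
}

async function demo() {
  return {
    answered: await outcome((c) => setTimeout(() => c.emit("message", { case: "initializeResponse" }), 5)),
    died: await outcome((c) => setTimeout(() => c.emit("exit", 1, null), 5)),
    hung: await outcome(() => {}),
  };
}

demo().then((result) => console.log(result));
```

Every scenario settles, so a caller that releases its slot in a finally always gets to run it.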
Environment
- @livekit/agents 1.4.4 (also inspected 1.4.5; supervised_proc.js / proc_pool.js identical)
- Node.js 24.x, Linux (ECS/EC2), self-hosted workers (apiKey/apiSecret, no workerToken)
- numIdleProcesses: 1, initializeProcessTimeout: 60s; trigger in our case was memory pressure killing/stalling the warming child during a heavy import+VAD prewarm