Summary
On a forced AgentSession close (e.g. RoomInputOptions.closeOnDisconnect firing when a SIP caller hangs up), the close can hang indefinitely in activity.drain(). Because drain() runs before activity.close() — and holds the activity lock the whole time — the STT/realtime/TTS teardown never runs, so the inference STT websocket is orphaned. The idle socket is then closed by the server every ~30 min with code 2007 ("session closed due to agent inactivity") and the SpeechStream retry loop reconnects, repeating until maxRetry is exhausted.
Observed in a production session: closing agent session due to participant disconnect was logged, but AgentSession closed never was; STT code 2007 errors + retries recurred on a 30-minute cadence long after the participant left.
Root cause
closeImplInner (agents/src/voice/agent_session.ts) on a forced close does, in order:
interrupt({ force: true })
await this.activity.drain() ← hangs here, unbounded
await this.activity.close() ← never reached; this is what aborts the STT socket
drain() → _drainImpl (agent_activity.ts) takes the activity lock, runs on_exit, then await _pauseSchedulingTask() → await this._mainTask.result. The scheduling mainTask only exits when:
this._schedulingPaused && getDrainPendingSpeechTasks().length === 0
If any speech task never settles, that condition is never met and mainTask loops forever → _pauseSchedulingTask never returns → _drainImpl never returns → drain() never returns.
Two compounding factors:
interrupt({force:true}) doesn't guarantee tasks settle. It calls task.cancel() (aborts a signal), but a task parked on an await that doesn't observe the signal stays "running". The loop comment at agent_activity.ts (~// Skip speech handles that were already interrupted/done…) already documents this hang class; this is another instance of it.
- The lock makes it unrecoverable.
activity.close() (which tears down STT and does not itself wait on drain) starts with await this.lock.lock() — the same lock the wedged _drainImpl holds and never releases. So the one method that would abort the socket is blocked behind the hang.
The only existing guardrail, the job-level SESSION_CLOSE_TIMEOUT (60s, ipc/job_proc_lazy_main.ts), just stops awaiting the close and proceeds — it never cancels the drain or forces teardown, so the orphaned socket lives on.
Things that are NOT the cause (ruled out)
- Not
waitForPlayout(). rtc-node AudioSource.waitForPlayout() resolves on a local setTimeout(release, queuedDuration) timer (plus release() on clearQueue()/close()), so it does not depend on a subscriber and does not hang when the participant leaves.
- Not room deletion. The orphaned socket is the inference-gateway websocket, which room teardown never touches; the hang fires on the immediate
closeOnDisconnect path, independent of room state.
Python parity
The Python livekit-agents close path is structurally identical (_aclose_impl → unbounded await activity.drain(); _pause_scheduling_task → unbounded await asyncio.shield(self._scheduling_atask)), confirmed on current main. So this is latent in Python too — there is no upstream fix to port, and a fix here is a candidate to mirror upstream. (Note: Python's drain_timeout is the worker job-drain timeout, unrelated to this.)
Proposed direction (for discussion)
Primary — make forced-close teardown bounded and self-healing:
- Bound
drain() during a forced (!drain) close with a timeout; on timeout, force-cancel the scheduling task via a new lock-free AgentActivity method so _drainImpl unwinds, releases the lock, and activity.close() (STT abort) runs. Graceful drains (agent handoff, shutdown({drain:true})) stay unbounded/unchanged.
- Open question: bound
drain() itself, vs. make _pauseSchedulingTask's wait cancellable, vs. restructure so resource teardown lives in a finally independent of drain. Leaning toward the first (smallest, most targeted).
Secondary — stop the retry budget from eroding (matches Python): make SpeechStream's retry counter persistent and reset it on a successful FINAL_TRANSCRIPT, so a long-lived stream that drops/reconnects doesn't accumulate toward maxRetry. (Doesn't fix the orphan on its own — no transcripts flow when orphaned — but is correct independently and helps the legitimately-idle case.)
Tertiary — dedup close(): route AgentSession.close() through the shared closingTask (only _closeSoon/_onError check it today) so a direct close() racing a _closeSoon() can't run closeImplInner twice.
A prototype of all three exists locally with unit tests; holding the PR until we align on direction.
Questions
- Bound-and-cancel vs. a
finally-based teardown that doesn't depend on drain at all — preference?
- Reasonable default for the forced-drain timeout (prototype uses 5s)?
- Worth also special-casing
code 2007 as a clean reconnect (skip retry budget entirely) in the inference STT, or is the counter-reset enough?
- Should we mirror the primary fix upstream in Python at the same time?
Summary
On a forced
AgentSessionclose (e.g.RoomInputOptions.closeOnDisconnectfiring when a SIP caller hangs up), the close can hang indefinitely inactivity.drain(). Becausedrain()runs beforeactivity.close()— and holds the activity lock the whole time — the STT/realtime/TTS teardown never runs, so the inference STT websocket is orphaned. The idle socket is then closed by the server every ~30 min withcode 2007("session closed due to agent inactivity") and theSpeechStreamretry loop reconnects, repeating untilmaxRetryis exhausted.Observed in a production session:
closing agent session due to participant disconnectwas logged, butAgentSession closednever was; STTcode 2007errors + retries recurred on a 30-minute cadence long after the participant left.Root cause
closeImplInner(agents/src/voice/agent_session.ts) on a forced close does, in order:interrupt({ force: true })await this.activity.drain()← hangs here, unboundedawait this.activity.close()← never reached; this is what aborts the STT socketdrain()→_drainImpl(agent_activity.ts) takes the activitylock, runson_exit, thenawait _pauseSchedulingTask()→await this._mainTask.result. The schedulingmainTaskonly exits when:If any speech task never settles, that condition is never met and
mainTaskloops forever →_pauseSchedulingTasknever returns →_drainImplnever returns →drain()never returns.Two compounding factors:
interrupt({force:true})doesn't guarantee tasks settle. It callstask.cancel()(aborts a signal), but a task parked on anawaitthat doesn't observe the signal stays "running". The loop comment atagent_activity.ts(~// Skip speech handles that were already interrupted/done…) already documents this hang class; this is another instance of it.activity.close()(which tears down STT and does not itself wait on drain) starts withawait this.lock.lock()— the same lock the wedged_drainImplholds and never releases. So the one method that would abort the socket is blocked behind the hang.The only existing guardrail, the job-level
SESSION_CLOSE_TIMEOUT(60s,ipc/job_proc_lazy_main.ts), just stops awaiting the close and proceeds — it never cancels the drain or forces teardown, so the orphaned socket lives on.Things that are NOT the cause (ruled out)
waitForPlayout(). rtc-nodeAudioSource.waitForPlayout()resolves on a localsetTimeout(release, queuedDuration)timer (plusrelease()onclearQueue()/close()), so it does not depend on a subscriber and does not hang when the participant leaves.closeOnDisconnectpath, independent of room state.Python parity
The Python
livekit-agentsclose path is structurally identical (_aclose_impl→ unboundedawait activity.drain();_pause_scheduling_task→ unboundedawait asyncio.shield(self._scheduling_atask)), confirmed on currentmain. So this is latent in Python too — there is no upstream fix to port, and a fix here is a candidate to mirror upstream. (Note: Python'sdrain_timeoutis the worker job-drain timeout, unrelated to this.)Proposed direction (for discussion)
Primary — make forced-close teardown bounded and self-healing:
drain()during a forced (!drain) close with a timeout; on timeout, force-cancel the scheduling task via a new lock-freeAgentActivitymethod so_drainImplunwinds, releases the lock, andactivity.close()(STT abort) runs. Graceful drains (agent handoff,shutdown({drain:true})) stay unbounded/unchanged.drain()itself, vs. make_pauseSchedulingTask's wait cancellable, vs. restructure so resource teardown lives in afinallyindependent of drain. Leaning toward the first (smallest, most targeted).Secondary — stop the retry budget from eroding (matches Python): make
SpeechStream's retry counter persistent and reset it on a successfulFINAL_TRANSCRIPT, so a long-lived stream that drops/reconnects doesn't accumulate towardmaxRetry. (Doesn't fix the orphan on its own — no transcripts flow when orphaned — but is correct independently and helps the legitimately-idle case.)Tertiary — dedup
close(): routeAgentSession.close()through the sharedclosingTask(only_closeSoon/_onErrorcheck it today) so a directclose()racing a_closeSoon()can't runcloseImplInnertwice.A prototype of all three exists locally with unit tests; holding the PR until we align on direction.
Questions
finally-based teardown that doesn't depend on drain at all — preference?code 2007as a clean reconnect (skip retry budget entirely) in the inference STT, or is the counter-reset enough?