Skip to content

Forced AgentSession close can hang in drain(), orphaning the inference STT socket #1731

@chenghao-mou

Description

@chenghao-mou

Summary

On a forced AgentSession close (e.g. RoomInputOptions.closeOnDisconnect firing when a SIP caller hangs up), the close can hang indefinitely in activity.drain(). Because drain() runs before activity.close() — and holds the activity lock the whole time — the STT/realtime/TTS teardown never runs, so the inference STT websocket is orphaned. The idle socket is then closed by the server every ~30 min with code 2007 ("session closed due to agent inactivity") and the SpeechStream retry loop reconnects, repeating until maxRetry is exhausted.

Observed in a production session: closing agent session due to participant disconnect was logged, but AgentSession closed never was; STT code 2007 errors + retries recurred on a 30-minute cadence long after the participant left.

Root cause

closeImplInner (agents/src/voice/agent_session.ts) on a forced close does, in order:

  1. interrupt({ force: true })
  2. await this.activity.drain() ← hangs here, unbounded
  3. await this.activity.close() ← never reached; this is what aborts the STT socket

drain()_drainImpl (agent_activity.ts) takes the activity lock, runs on_exit, then await _pauseSchedulingTask()await this._mainTask.result. The scheduling mainTask only exits when:

this._schedulingPaused && getDrainPendingSpeechTasks().length === 0

If any speech task never settles, that condition is never met and mainTask loops forever → _pauseSchedulingTask never returns → _drainImpl never returns → drain() never returns.

Two compounding factors:

  • interrupt({force:true}) doesn't guarantee tasks settle. It calls task.cancel() (aborts a signal), but a task parked on an await that doesn't observe the signal stays "running". The loop comment at agent_activity.ts (~// Skip speech handles that were already interrupted/done…) already documents this hang class; this is another instance of it.
  • The lock makes it unrecoverable. activity.close() (which tears down STT and does not itself wait on drain) starts with await this.lock.lock() — the same lock the wedged _drainImpl holds and never releases. So the one method that would abort the socket is blocked behind the hang.

The only existing guardrail, the job-level SESSION_CLOSE_TIMEOUT (60s, ipc/job_proc_lazy_main.ts), just stops awaiting the close and proceeds — it never cancels the drain or forces teardown, so the orphaned socket lives on.

Things that are NOT the cause (ruled out)

  • Not waitForPlayout(). rtc-node AudioSource.waitForPlayout() resolves on a local setTimeout(release, queuedDuration) timer (plus release() on clearQueue()/close()), so it does not depend on a subscriber and does not hang when the participant leaves.
  • Not room deletion. The orphaned socket is the inference-gateway websocket, which room teardown never touches; the hang fires on the immediate closeOnDisconnect path, independent of room state.

Python parity

The Python livekit-agents close path is structurally identical (_aclose_impl → unbounded await activity.drain(); _pause_scheduling_task → unbounded await asyncio.shield(self._scheduling_atask)), confirmed on current main. So this is latent in Python too — there is no upstream fix to port, and a fix here is a candidate to mirror upstream. (Note: Python's drain_timeout is the worker job-drain timeout, unrelated to this.)

Proposed direction (for discussion)

Primary — make forced-close teardown bounded and self-healing:

  • Bound drain() during a forced (!drain) close with a timeout; on timeout, force-cancel the scheduling task via a new lock-free AgentActivity method so _drainImpl unwinds, releases the lock, and activity.close() (STT abort) runs. Graceful drains (agent handoff, shutdown({drain:true})) stay unbounded/unchanged.
  • Open question: bound drain() itself, vs. make _pauseSchedulingTask's wait cancellable, vs. restructure so resource teardown lives in a finally independent of drain. Leaning toward the first (smallest, most targeted).

Secondary — stop the retry budget from eroding (matches Python): make SpeechStream's retry counter persistent and reset it on a successful FINAL_TRANSCRIPT, so a long-lived stream that drops/reconnects doesn't accumulate toward maxRetry. (Doesn't fix the orphan on its own — no transcripts flow when orphaned — but is correct independently and helps the legitimately-idle case.)

Tertiary — dedup close(): route AgentSession.close() through the shared closingTask (only _closeSoon/_onError check it today) so a direct close() racing a _closeSoon() can't run closeImplInner twice.

A prototype of all three exists locally with unit tests; holding the PR until we align on direction.

Questions

  1. Bound-and-cancel vs. a finally-based teardown that doesn't depend on drain at all — preference?
  2. Reasonable default for the forced-drain timeout (prototype uses 5s)?
  3. Worth also special-casing code 2007 as a clean reconnect (skip retry budget entirely) in the inference STT, or is the counter-reset enough?
  4. Should we mirror the primary fix upstream in Python at the same time?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions