Forced AgentSession close can hang in drain(), orphaning the inference STT socket

## Summary

On a **forced** `AgentSession` close (e.g. `RoomInputOptions.closeOnDisconnect` firing when a SIP caller hangs up), the close can hang indefinitely in `activity.drain()`. Because `drain()` runs *before* `activity.close()` — and holds the activity lock the whole time — the STT/realtime/TTS teardown never runs, so the **inference STT websocket is orphaned**. The idle socket is then closed by the server every ~30 min with `code 2007` ("session closed due to agent inactivity") and the `SpeechStream` retry loop reconnects, repeating until `maxRetry` is exhausted.

Observed in a production session: `closing agent session due to participant disconnect` was logged, but `AgentSession closed` never was; STT `code 2007` errors + retries recurred on a 30-minute cadence long after the participant left.

## Root cause

`closeImplInner` (`agents/src/voice/agent_session.ts`) on a forced close does, in order:

1. `interrupt({ force: true })`
2. `await this.activity.drain()`   ← hangs here, **unbounded**
3. `await this.activity.close()`   ← never reached; this is what aborts the STT socket

`drain()` → `_drainImpl` (`agent_activity.ts`) takes the activity `lock`, runs `on_exit`, then `await _pauseSchedulingTask()` → `await this._mainTask.result`. The scheduling `mainTask` only exits when:

```js
this._schedulingPaused && getDrainPendingSpeechTasks().length === 0
```

If **any speech task never settles**, that condition is never met and `mainTask` loops forever → `_pauseSchedulingTask` never returns → `_drainImpl` never returns → `drain()` never returns.

Two compounding factors:

- **`interrupt({force:true})` doesn't guarantee tasks settle.** It calls `task.cancel()` (aborts a signal), but a task parked on an `await` that doesn't observe the signal stays "running". The loop comment at `agent_activity.ts` (~`// Skip speech handles that were already interrupted/done…`) already documents this hang class; this is another instance of it.
- **The lock makes it unrecoverable.** `activity.close()` (which tears down STT and does *not* itself wait on drain) starts with `await this.lock.lock()` — the same lock the wedged `_drainImpl` holds and never releases. So the one method that would abort the socket is blocked behind the hang.

The only existing guardrail, the job-level `SESSION_CLOSE_TIMEOUT` (60s, `ipc/job_proc_lazy_main.ts`), just *stops awaiting* the close and proceeds — it never cancels the drain or forces teardown, so the orphaned socket lives on.

### Things that are NOT the cause (ruled out)
- **Not `waitForPlayout()`.** rtc-node `AudioSource.waitForPlayout()` resolves on a local `setTimeout(release, queuedDuration)` timer (plus `release()` on `clearQueue()`/`close()`), so it does not depend on a subscriber and does not hang when the participant leaves.
- **Not room deletion.** The orphaned socket is the inference-gateway websocket, which room teardown never touches; the hang fires on the immediate `closeOnDisconnect` path, independent of room state.

### Python parity
The Python `livekit-agents` close path is structurally identical (`_aclose_impl` → unbounded `await activity.drain()`; `_pause_scheduling_task` → unbounded `await asyncio.shield(self._scheduling_atask)`), confirmed on current `main`. So this is **latent in Python too** — there is no upstream fix to port, and a fix here is a candidate to mirror upstream. (Note: Python's `drain_timeout` is the *worker* job-drain timeout, unrelated to this.)

## Proposed direction (for discussion)

Primary — **make forced-close teardown bounded and self-healing**:
- Bound `drain()` during a forced (`!drain`) close with a timeout; on timeout, force-cancel the scheduling task via a new **lock-free** `AgentActivity` method so `_drainImpl` unwinds, releases the lock, and `activity.close()` (STT abort) runs. Graceful drains (agent handoff, `shutdown({drain:true})`) stay unbounded/unchanged.
- Open question: bound `drain()` itself, vs. make `_pauseSchedulingTask`'s wait cancellable, vs. restructure so resource teardown lives in a `finally` independent of drain. Leaning toward the first (smallest, most targeted).

Secondary — **stop the retry budget from eroding** (matches Python): make `SpeechStream`'s retry counter persistent and reset it on a successful `FINAL_TRANSCRIPT`, so a long-lived stream that drops/reconnects doesn't accumulate toward `maxRetry`. (Doesn't fix the orphan on its own — no transcripts flow when orphaned — but is correct independently and helps the legitimately-idle case.)

Tertiary — **dedup `close()`**: route `AgentSession.close()` through the shared `closingTask` (only `_closeSoon`/`_onError` check it today) so a direct `close()` racing a `_closeSoon()` can't run `closeImplInner` twice.

A prototype of all three exists locally with unit tests; holding the PR until we align on direction.

## Questions
1. Bound-and-cancel vs. a `finally`-based teardown that doesn't depend on drain at all — preference?
2. Reasonable default for the forced-drain timeout (prototype uses 5s)?
3. Worth also special-casing `code 2007` as a clean reconnect (skip retry budget entirely) in the inference STT, or is the counter-reset enough?
4. Should we mirror the primary fix upstream in Python at the same time?


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Forced AgentSession close can hang in drain(), orphaning the inference STT socket #1731

Summary

Root cause

Things that are NOT the cause (ruled out)

Python parity

Proposed direction (for discussion)

Questions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Forced AgentSession close can hang in drain(), orphaning the inference STT socket #1731

Description

Summary

Root cause

Things that are NOT the cause (ruled out)

Python parity

Proposed direction (for discussion)

Questions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions