fix(voice): VAD-mode minEndpointingDelay collapses to ~0 (closes #1741)#1771

Open
tsushanth wants to merge 1 commit into
livekit:mainfrom
tsushanth:fix/vad-endpointing-delay-collapse

@tsushanth

Summary

In VAD-based turn detection, minEndpointingDelay was effectively a no-op whenever it was ≤ the VAD's minSilenceDuration — so with the defaults (minDelay = 500, Silero minSilenceDuration = 550), the post-EOS grouping window collapsed to ~−50 ms and the turn committed the instant END_OF_SPEECH fired. Any natural mid-sentence pause split into two segments, and with realtime models using manual activity detection the second segment's userTurnCompleted never fired (the agent stalled, never responding).

This is the same root cause as the Python issue livekit/agents#4325, now reproduced in @livekit/agents@1.4.5 (and identical on main).

Root cause

agents/src/voice/audio_recognition.ts, inside runEOUDetection's bounceEOUTask closure (lines 1155–1163 on main):

let extraSleep = endpointingDelay;
if (lastSpeakingTime !== undefined) {
  extraSleep += lastSpeakingTime - Date.now();  // subtracts already-elapsed silence
}
  • lastSpeakingTime is stamped on VAD INFERENCE_DONE (≈ when the user stops).
  • bounceEOUTask only runs at VAD END_OF_SPEECH, which Silero emits minSilenceDuration later.
  • So lastSpeakingTime - Date.now() ≈ −minSilenceDuration, giving extraSleep ≈ minDelay − minSilenceDuration.
  • With the defaults that is negative → no wait → immediate commit.
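The arithmetic in the bullets above can be checked with a small standalone sketch. The function name `elapsedAdjustedSleep` and the injected `now` parameter are illustrative assumptions (the real code reads `Date.now()` inline); only the formula itself comes from the snippet above.

```typescript
// Hypothetical helper mirroring the pre-fix formula:
//   extraSleep = endpointingDelay + (lastSpeakingTime - now)
// `now` is injected instead of calling Date.now() so the result is deterministic.
function elapsedAdjustedSleep(
  endpointingDelayMs: number,
  lastSpeakingTime: number | undefined,
  now: number,
): number {
  let extraSleep = endpointingDelayMs;
  if (lastSpeakingTime !== undefined) {
    extraSleep += lastSpeakingTime - now; // subtracts already-elapsed silence
  }
  return extraSleep;
}

// Defaults quoted in the summary: minDelay = 500, Silero minSilenceDuration = 550.
const endOfSpeechAt = 10_000; // arbitrary timestamp for END_OF_SPEECH
const sileroMinSilenceMs = 550; // silence already elapsed when END_OF_SPEECH fires
const collapsedSleep = elapsedAdjustedSleep(
  500,
  endOfSpeechAt - sileroMinSilenceMs, // lastSpeakingTime, stamped when the user stopped
  endOfSpeechAt,
);

console.log(collapsedSleep); // 500 - 550 = -50, so there is nothing left to wait for
```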

In STT-based turn detection the adjustment is intentional and correct — bounceEOUTask runs from STT's INFERENCE_DONE event, and subtracting elapsed time keeps the post-speech window roughly minDelay long even when transcription took a while.

Fix

Skip the elapsed-since-speech adjustment in VAD mode so minDelay actually provides a real post-EOS grouping window that an upcoming START_OF_SPEECH can cancel. STT mode is unchanged.

 let extraSleep = endpointingDelay;
-if (lastSpeakingTime !== undefined) {
+if (lastSpeakingTime !== undefined && !this.vadBaseTurnDetection) {
   extraSleep += lastSpeakingTime - Date.now();
 }

Matches solution #2 from the linked Python issue.
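To illustrate what a post-EOS grouping window that START_OF_SPEECH can cancel looks like, here is a minimal sketch. It is not the code in audio_recognition.ts: `GroupingWindow`, `Scheduler`, and `FakeScheduler` are hypothetical names, and the fake scheduler stands in for real timers so the example runs deterministically.

```typescript
// Hypothetical timer abstraction so the sketch can run against a fake clock.
interface Scheduler {
  set(fn: () => void, ms: number): number;
  clear(handle: number): void;
}

// Commits a turn `delayMs` after END_OF_SPEECH unless speech resumes first.
class GroupingWindow {
  private scheduler: Scheduler;
  private handle: number | undefined;

  constructor(scheduler: Scheduler) {
    this.scheduler = scheduler;
  }

  onEndOfSpeech(delayMs: number, commit: () => void): void {
    this.cancel();
    this.handle = this.scheduler.set(() => {
      this.handle = undefined;
      commit();
    }, Math.max(0, delayMs));
  }

  // The user resumed speaking, so the pending commit is dropped.
  onStartOfSpeech(): void {
    this.cancel();
  }

  private cancel(): void {
    if (this.handle !== undefined) {
      this.scheduler.clear(this.handle);
      this.handle = undefined;
    }
  }
}

// Manual clock: timers fire only when advance() moves time past their deadline.
class FakeScheduler implements Scheduler {
  private now = 0;
  private nextId = 1;
  private timers = new Map<number, { at: number; fn: () => void }>();

  set(fn: () => void, ms: number): number {
    const id = this.nextId++;
    this.timers.set(id, { at: this.now + ms, fn });
    return id;
  }

  clear(handle: number): void {
    this.timers.delete(handle);
  }

  advance(ms: number): void {
    this.now += ms;
    for (const [id, timer] of Array.from(this.timers.entries())) {
      if (timer.at <= this.now) {
        this.timers.delete(id);
        timer.fn();
      }
    }
  }
}

const scheduler = new FakeScheduler();
const grouping = new GroupingWindow(scheduler);
const commits: string[] = [];

grouping.onEndOfSpeech(500, () => commits.push('first')); // first segment ends
scheduler.advance(200); // 200 ms mid-sentence pause
grouping.onStartOfSpeech(); // user resumes: pending commit is cancelled
scheduler.advance(1000);
grouping.onEndOfSpeech(500, () => commits.push('second')); // second segment ends
scheduler.advance(500); // full window elapses, the turn commits once

console.log(commits);
```

With a window of roughly −50 ms there is no interval in which `onStartOfSpeech` could cancel anything, which is why the pause splits the turn before the fix.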

Tests

New file agents/src/voice/audio_recognition_endpointing_delay.test.ts with two cases:

  1. VAD mode regression (issue #1741): sets lastSpeakingTime = Date.now() − 550 (mirroring Silero's minSilenceDuration) and minEndpointingDelay = 300, then drives runEOUDetection(empty) and measures the time to onEndOfTurn. Expects ≥ 250 ms (i.e. roughly the configured minDelay).
    • On main: fails with elapsed ≈ 2 ms (turn commits immediately).
    • With the fix: passes (~300 ms).
  2. STT mode regression guard: turnDetectionMode: 'stt', no VAD, lastSpeakingTime 150 ms ago, minDelay = 400. Expects elapsed ≈ 250 ms (the existing subtraction stays).

Verified that test 1 fails on main and both tests pass after the production change.
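The expected timings in both cases follow directly from the delay formula. The sketch below recomputes them with a pure function; `extraSleepMs` and its `vadBasedTurnDetection` option are illustrative assumptions standing in for the guarded branch in the diff, not the test file's actual helpers.

```typescript
// Hypothetical pure version of the post-fix computation.
interface SleepOptions {
  endpointingDelayMs: number;
  lastSpeakingTime?: number;
  now: number;
  vadBasedTurnDetection: boolean; // assumed flag; the diff reads a field on `this`
}

function extraSleepMs(opts: SleepOptions): number {
  let extraSleep = opts.endpointingDelayMs;
  // Only STT mode subtracts the time already elapsed since the user stopped.
  if (opts.lastSpeakingTime !== undefined && !opts.vadBasedTurnDetection) {
    extraSleep += opts.lastSpeakingTime - opts.now;
  }
  return extraSleep;
}

const t0 = 50_000; // arbitrary "now" for both cases

// Case 1: VAD mode, minEndpointingDelay = 300, last speech 550 ms ago.
const vadSleep = extraSleepMs({
  endpointingDelayMs: 300,
  lastSpeakingTime: t0 - 550,
  now: t0,
  vadBasedTurnDetection: true,
});

// Same inputs through the subtracting branch, as on main before the fix.
const vadSleepBeforeFix = extraSleepMs({
  endpointingDelayMs: 300,
  lastSpeakingTime: t0 - 550,
  now: t0,
  vadBasedTurnDetection: false,
});

// Case 2: STT mode, minDelay = 400, last speech 150 ms ago.
const sttSleep = extraSleepMs({
  endpointingDelayMs: 400,
  lastSpeakingTime: t0 - 150,
  now: t0,
  vadBasedTurnDetection: false,
});

console.log({ vadSleep, vadSleepBeforeFix, sttSleep });
```

The negative pre-fix value is consistent with the ~2 ms commit observed on main, and the STT value matches the ≈ 250 ms expectation in case 2.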

Test plan

  • pnpm vitest run agents/src/voice/audio_recognition_endpointing_delay.test.ts — 2/2 pass
  • pnpm vitest run agents/src/voice/audio_recognition_endpointing.test.ts agents/src/voice/audio_recognition_vad_reset.test.ts — 5/5 pass (unchanged)
  • pnpm build:agents — clean
  • pnpm lint — no new warnings on touched files
  • pnpm format:check — clean
  • Changeset added (patch)

Closes #1741

In VAD-based turn detection, bounceEOUTask runs at VAD END_OF_SPEECH,
which Silero emits `minSilenceDuration` (~550 ms) after the user
actually stops. lastSpeakingTime is stamped earlier — at VAD
INFERENCE_DONE. The post-EOS delay was computed as

    extraSleep = endpointingDelay + (lastSpeakingTime - Date.now())

so it collapsed to `endpointingDelay - elapsedSilence` ≈ −50 ms with
the defaults (minDelay=500, minSilenceDuration=550). The turn committed
the instant END_OF_SPEECH fired and any natural mid-sentence pause —
or even any silence shorter than the configured min delay — split into
two segments. With realtime models using manual activity detection,
the second segment's userTurnCompleted never fires and the agent never
responds.

Skip the elapsed-since-speech adjustment in VAD mode so `minDelay`
actually provides a real post-EOS grouping window that an upcoming
START_OF_SPEECH can cancel. STT mode keeps the adjustment — there it
correctly compensates for transcription latency between
INFERENCE_DONE and END_OF_SPEECH on the STT side. Adds two regression
tests in audio_recognition_endpointing_delay.test.ts: a livekit#1741 repro
that fails on main (~2 ms vs the required ≥250 ms), and a guard for
the STT path so the fix can't regress that branch.

Closes livekit#1741
@changeset-bot

changeset-bot Bot commented Jun 11, 2026

🦋 Changeset detected

Latest commit: d715765

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 35 packages
Name Type
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-assemblyai Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-cerebras Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-did Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-fishaudio Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-hume Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-lemonslice Patch
@livekit/agents-plugin-liveavatar Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-minimax Patch
@livekit/agents-plugin-mistral Patch
@livekit/agents-plugin-mistralai Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-perplexity Patch
@livekit/agents-plugin-phonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-runway Patch
@livekit/agents-plugin-sarvam Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugin-soniox Patch
@livekit/agents-plugin-tavus Patch
@livekit/agents-plugins-test Patch
@livekit/agents-plugin-trugen Patch
@livekit/agents-plugin-xai Patch

@CLAassistant

CLAassistant commented Jun 11, 2026

CLA assistant check
All committers have signed the CLA.

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Development

Successfully merging this pull request may close these issues.

minEndpointingDelay is nullified in VAD turn-detection mode (swallowed by Silero minSilenceDuration), breaking multi-segment turn grouping