fix(voice): VAD-mode minEndpointingDelay collapses to ~0 (closes #1741) by tsushanth · Pull Request #1771 · livekit/agents-js

tsushanth · 2026-06-11T18:38:21Z

Summary

In VAD-based turn detection, minEndpointingDelay was effectively a no-op whenever it was ≤ the VAD's minSilenceDuration — so with the defaults (minDelay = 500, Silero minSilenceDuration = 550), the post-EOS grouping window collapsed to ~−50 ms and the turn committed the instant END_OF_SPEECH fired. Any natural mid-sentence pause split into two segments, and with realtime models using manual activity detection the second segment's userTurnCompleted never fired (the agent stalled, never responding).

This is the same root cause as the Python issue livekit/agents#4325, now reproduced in @livekit/agents@1.4.5 (and identical on main).

Root cause

agents/src/voice/audio_recognition.ts, inside runEOUDetection's bounceEOUTask closure (lines 1155–1163 on main):

let extraSleep = endpointingDelay;
if (lastSpeakingTime !== undefined) {
  extraSleep += lastSpeakingTime - Date.now();  // subtracts already-elapsed silence
}

lastSpeakingTime is stamped on VAD INFERENCE_DONE (≈ when the user stops).
bounceEOUTask only runs at VAD END_OF_SPEECH, which Silero emits minSilenceDuration later.
So lastSpeakingTime - Date.now() ≈ −minSilenceDuration, giving extraSleep ≈ minDelay − minSilenceDuration.
With the defaults that is negative → no wait → immediate commit.

In STT-based turn detection the adjustment is intentional and correct — bounceEOUTask runs from STT's INFERENCE_DONE event, and subtracting elapsed time keeps the post-speech window roughly minDelay long even when transcription took a while.
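
The arithmetic above can be checked with a small sketch. `extraSleepMs` is a hypothetical helper written for this illustration, not the library's code, and clamping a negative result to zero with `Math.max` is an assumption that models "negative means no wait".

```typescript
// Hypothetical helper that mirrors the delay arithmetic described above.
// elapsedSinceSpeechMs stands for Date.now() - lastSpeakingTime.
function extraSleepMs(
  endpointingDelayMs: number,
  elapsedSinceSpeechMs: number | undefined,
): number {
  let extraSleep = endpointingDelayMs;
  if (elapsedSinceSpeechMs !== undefined) {
    // Equivalent to: extraSleep += lastSpeakingTime - Date.now()
    extraSleep -= elapsedSinceSpeechMs;
  }
  return extraSleep;
}

// VAD mode with the defaults quoted above: minDelay = 500, minSilenceDuration = 550.
const vadDefault = extraSleepMs(500, 550); // -50

// Assumption: a negative sleep is treated as "do not wait at all".
const effectiveVadWait = Math.max(vadDefault, 0); // 0

// STT mode: minDelay = 400 with 150 ms already elapsed leaves 250 ms of window.
const sttExample = extraSleepMs(400, 150); // 250

console.log({ vadDefault, effectiveVadWait, sttExample });
```

In the VAD case the elapsed silence is always at least `minSilenceDuration`, so any `minDelay` at or below that value leaves no window. In the STT case the same subtraction only trims the window by the time already spent.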

Fix

Skip the elapsed-since-speech adjustment in VAD mode so minDelay actually provides a real post-EOS grouping window that an upcoming START_OF_SPEECH can cancel. STT mode is unchanged.

 let extraSleep = endpointingDelay;
-if (lastSpeakingTime !== undefined) {
+if (lastSpeakingTime !== undefined && !this.vadBaseTurnDetection) {
   extraSleep += lastSpeakingTime - Date.now();
 }

Matches solution #2 from the linked Python issue.
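
To illustrate why a real post-EOS window matters for grouping, here is a minimal simulation of a turn commit that a later START_OF_SPEECH can cancel. `ManualClock` and `countCommittedTurns` are hypothetical names invented for this sketch, the clock is a hand-rolled stand-in for real timers so the example runs synchronously, and the 200 ms pause is an arbitrary illustrative value measured from END_OF_SPEECH.

```typescript
// Hand-rolled clock for illustration only; not part of @livekit/agents.
class ManualClock {
  private nowMs = 0;
  private timers: { at: number; fn: () => void; done: boolean }[] = [];

  // Schedules fn after delayMs (negative delays fire as soon as time advances).
  // Returns a cancel function.
  schedule(delayMs: number, fn: () => void): () => void {
    const timer = { at: this.nowMs + Math.max(delayMs, 0), fn, done: false };
    this.timers.push(timer);
    return () => {
      timer.done = true;
    };
  }

  // Moves time forward and fires every pending timer that is now due.
  advance(ms: number): void {
    this.nowMs += ms;
    for (const timer of this.timers) {
      if (!timer.done && timer.at <= this.nowMs) {
        timer.done = true;
        timer.fn();
      }
    }
  }
}

// Simulates: END_OF_SPEECH, a pause, START_OF_SPEECH, END_OF_SPEECH, long silence.
// Returns how many turns were committed.
function countCommittedTurns(windowMs: number, pauseAfterEosMs: number): number {
  const clock = new ManualClock();
  let commits = 0;
  const commit = () => {
    commits += 1;
  };

  let cancel = clock.schedule(windowMs, commit); // first END_OF_SPEECH
  clock.advance(pauseAfterEosMs); // user pauses mid-sentence
  cancel(); // START_OF_SPEECH cancels the commit if it is still pending
  cancel = clock.schedule(windowMs, commit); // second END_OF_SPEECH
  clock.advance(10000); // user is done talking
  return commits;
}

const collapsedWindow = countCommittedTurns(-50, 200); // 2: pause splits the turn
const realWindow = countCommittedTurns(500, 200); // 1: segments are grouped
console.log({ collapsedWindow, realWindow });
```

With the collapsed window the first commit fires before the user resumes, so the pause produces two turns. With a 500 ms window the resumed speech cancels the pending commit and both segments land in one turn.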

Tests

New file agents/src/voice/audio_recognition_endpointing_delay.test.ts with two cases:

VAD mode regression (#1741): sets lastSpeakingTime = Date.now() − 550 (mirroring Silero's minSilenceDuration) and minEndpointingDelay = 300, then drives runEOUDetection(empty) and measures the time to onEndOfTurn. Expects ≥ 250 ms (i.e. roughly the configured minDelay).
- On main: fails with elapsed ≈ 2 ms (turn commits immediately).
- With the fix: passes (~300 ms).

STT mode regression guard: turnDetectionMode: 'stt', no VAD, lastSpeakingTime 150 ms ago, minDelay = 400. Expects elapsed ≈ 250 ms (the existing subtraction stays).

Verified that test 1 fails on main and both tests pass after the production change.
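
The expected numbers in both cases follow from the patched condition. The sketch below is not the test file itself: `postEosDelayMs` and `DelayInput` are hypothetical names, the clock value is injected rather than read from `Date.now()`, and clamping at zero is an assumption. Only the flag name `vadBaseTurnDetection` comes from the diff.

```typescript
// Hypothetical stand-in for the patched delay computation.
interface DelayInput {
  endpointingDelayMs: number;
  lastSpeakingTimeMs?: number;
  nowMs: number;
  vadBaseTurnDetection: boolean;
}

function postEosDelayMs(input: DelayInput): number {
  let extraSleep = input.endpointingDelayMs;
  // Patched condition: skip the elapsed-time adjustment in VAD mode.
  if (input.lastSpeakingTimeMs !== undefined && !input.vadBaseTurnDetection) {
    extraSleep += input.lastSpeakingTimeMs - input.nowMs;
  }
  return Math.max(extraSleep, 0); // assumption: a negative delay means no wait
}

const nowMs = 1000000; // arbitrary fixed clock value for the example

// Case 1 (VAD mode): speech ended 550 ms ago, minDelay = 300.
const vadCase = postEosDelayMs({
  endpointingDelayMs: 300,
  lastSpeakingTimeMs: nowMs - 550,
  nowMs,
  vadBaseTurnDetection: true,
}); // 300, satisfies the >= 250 ms expectation

// Same inputs with the adjustment still applied, as on main.
const vadWithoutFix = postEosDelayMs({
  endpointingDelayMs: 300,
  lastSpeakingTimeMs: nowMs - 550,
  nowMs,
  vadBaseTurnDetection: false,
}); // 0, the turn commits immediately

// Case 2 (STT mode): speech ended 150 ms ago, minDelay = 400.
const sttCase = postEosDelayMs({
  endpointingDelayMs: 400,
  lastSpeakingTimeMs: nowMs - 150,
  nowMs,
  vadBaseTurnDetection: false,
}); // 250, the existing subtraction stays

console.log({ vadCase, vadWithoutFix, sttCase });
```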

Test plan

pnpm vitest run agents/src/voice/audio_recognition_endpointing_delay.test.ts — 2/2 pass
pnpm vitest run agents/src/voice/audio_recognition_endpointing.test.ts agents/src/voice/audio_recognition_vad_reset.test.ts — 5/5 pass (unchanged)
pnpm build:agents — clean
pnpm lint — no new warnings on touched files
pnpm format:check — clean
Changeset added (patch)

Closes #1741

In VAD-based turn detection, bounceEOUTask runs at VAD END_OF_SPEECH, which Silero emits `minSilenceDuration` (~550 ms) after the user actually stops. lastSpeakingTime is stamped earlier, at VAD INFERENCE_DONE. The post-EOS delay was computed as

extraSleep = endpointingDelay + (lastSpeakingTime - Date.now())

so it collapsed to `endpointingDelay - elapsedSilence` ≈ −50 ms with the defaults (minDelay=500, minSilenceDuration=550). The turn committed the instant END_OF_SPEECH fired, and any natural mid-sentence pause (or even any silence shorter than the configured min delay) split into two segments. With realtime models using manual activity detection, the second segment's userTurnCompleted never fires and the agent never responds.

Skip the elapsed-since-speech adjustment in VAD mode so `minDelay` actually provides a real post-EOS grouping window that an upcoming START_OF_SPEECH can cancel. STT mode keeps the adjustment; there it correctly compensates for transcription latency between INFERENCE_DONE and END_OF_SPEECH on the STT side.

Adds two regression tests in audio_recognition_endpointing_delay.test.ts: a livekit#1741 repro that fails on main (~2 ms vs the required ≥250 ms), and a guard for the STT path so the fix can't regress that branch.

Closes livekit#1741

changeset-bot · 2026-06-11T18:38:27Z

🦋 Changeset detected

Latest commit: d715765

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 35 packages

Name	Type
@livekit/agents	Patch
@livekit/agents-plugin-anam	Patch
@livekit/agents-plugin-assemblyai	Patch
@livekit/agents-plugin-baseten	Patch
@livekit/agents-plugin-bey	Patch
@livekit/agents-plugin-cartesia	Patch
@livekit/agents-plugin-cerebras	Patch
@livekit/agents-plugin-deepgram	Patch
@livekit/agents-plugin-did	Patch
@livekit/agents-plugin-elevenlabs	Patch
@livekit/agents-plugin-fishaudio	Patch
@livekit/agents-plugin-google	Patch
@livekit/agents-plugin-hedra	Patch
@livekit/agents-plugin-hume	Patch
@livekit/agents-plugin-inworld	Patch
@livekit/agents-plugin-lemonslice	Patch
@livekit/agents-plugin-liveavatar	Patch
@livekit/agents-plugin-livekit	Patch
@livekit/agents-plugin-minimax	Patch
@livekit/agents-plugin-mistral	Patch
@livekit/agents-plugin-mistralai	Patch
@livekit/agents-plugin-neuphonic	Patch
@livekit/agents-plugin-openai	Patch
@livekit/agents-plugin-perplexity	Patch
@livekit/agents-plugin-phonic	Patch
@livekit/agents-plugin-resemble	Patch
@livekit/agents-plugin-rime	Patch
@livekit/agents-plugin-runway	Patch
@livekit/agents-plugin-sarvam	Patch
@livekit/agents-plugin-silero	Patch
@livekit/agents-plugin-soniox	Patch
@livekit/agents-plugin-tavus	Patch
@livekit/agents-plugins-test	Patch
@livekit/agents-plugin-trugen	Patch
@livekit/agents-plugin-xai	Patch


CLAassistant · 2026-06-11T18:38:28Z

All committers have signed the CLA.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

devin-ai-integration Bot reviewed Jun 11, 2026

tsushanth wants to merge 1 commit into livekit:main from tsushanth:fix/vad-endpointing-delay-collapse
