fix(voice): VAD-mode minEndpointingDelay collapses to ~0 (closes #1741) #1771
Open
tsushanth wants to merge 1 commit into
In VAD-based turn detection, bounceEOUTask runs at VAD END_OF_SPEECH,
which Silero emits `minSilenceDuration` (~550 ms) after the user
actually stops. lastSpeakingTime is stamped earlier — at VAD
INFERENCE_DONE. The post-EOS delay was computed as
extraSleep = endpointingDelay + (lastSpeakingTime - Date.now())
so it collapsed to `endpointingDelay - elapsedSilence` ≈ −50 ms with
the defaults (minDelay=500, minSilenceDuration=550). The turn committed
the instant END_OF_SPEECH fired and any natural mid-sentence pause —
or even any silence shorter than the configured min delay — split into
two segments. With realtime models using manual activity detection,
the second segment's userTurnCompleted never fires and the agent never
responds.
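The arithmetic behind the collapse can be checked with a small sketch. The helper name `legacyExtraSleepMs` and the timestamps are hypothetical, chosen only for illustration; the formula is the one quoted above, not a copy of the library's code.

```typescript
// Hypothetical helper mirroring the pre-fix formula quoted above.
function legacyExtraSleepMs(
  endpointingDelayMs: number,
  lastSpeakingTimeMs: number,
  nowMs: number,
): number {
  // (lastSpeakingTimeMs - nowMs) is negative: it is minus the silence
  // that has already elapsed when END_OF_SPEECH fires.
  return endpointingDelayMs + (lastSpeakingTimeMs - nowMs);
}

// Defaults from the description: minDelay = 500 ms, minSilenceDuration = 550 ms.
const eosAtMs = 10_000; // arbitrary timestamp at which END_OF_SPEECH fires
const lastSpeakingAtMs = eosAtMs - 550; // speech stopped 550 ms earlier
const legacyDelayMs = legacyExtraSleepMs(500, lastSpeakingAtMs, eosAtMs);
console.log(legacyDelayMs); // -50
```

A non-positive result leaves no waiting time at all, which is consistent with the turn committing as soon as END_OF_SPEECH fires.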
Skip the elapsed-since-speech adjustment in VAD mode so `minDelay`
actually provides a real post-EOS grouping window that an upcoming
START_OF_SPEECH can cancel. STT mode keeps the adjustment — there it
correctly compensates for transcription latency between
INFERENCE_DONE and END_OF_SPEECH on the STT side. Adds two regression
tests in audio_recognition_endpointing_delay.test.ts: a livekit#1741 repro
that fails on main (~2 ms vs the required ≥250 ms), and a guard for
the STT path so the fix can't regress that branch.
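What a "grouping window that an upcoming START_OF_SPEECH can cancel" means can be sketched with a tiny state holder. This is an illustration under stated assumptions, not the code in `audio_recognition.ts`: the class `GroupingWindow` and its method names are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```typescript
// Hypothetical sketch of a cancellable post-EOS grouping window.
class GroupingWindow {
  private deadlineMs: number | null = null;
  private minDelayMs: number;

  constructor(minDelayMs: number) {
    this.minDelayMs = minDelayMs;
  }

  // END_OF_SPEECH: schedule the commit a full minDelay into the future.
  onEndOfSpeech(nowMs: number): void {
    this.deadlineMs = nowMs + this.minDelayMs;
  }

  // START_OF_SPEECH: the user resumed, so drop the pending commit.
  onStartOfSpeech(): void {
    this.deadlineMs = null;
  }

  // True once a commit is pending and its deadline has passed.
  shouldCommit(nowMs: number): boolean {
    return this.deadlineMs !== null && nowMs >= this.deadlineMs;
  }
}

const win = new GroupingWindow(500);
win.onEndOfSpeech(1_000); // mid-sentence pause; commit pending for t = 1500
win.onStartOfSpeech(); // user resumes before the deadline: cancelled
console.log(win.shouldCommit(1_600)); // false
win.onEndOfSpeech(2_000); // real end of turn
console.log(win.shouldCommit(2_400)); // false
console.log(win.shouldCommit(2_500)); // true
```

With a window of zero or less there is no interval in which the cancellation can happen, which is why a short pause ends up as two separate segments.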
Closes livekit#1741
🦋 Changeset detected
Latest commit: d715765
The changes in this PR will be included in the next version bump. This PR includes changesets to release 35 packages.
Summary
In VAD-based turn detection, `minEndpointingDelay` was effectively a no-op whenever it was ≤ the VAD's `minSilenceDuration`. With the defaults (`minDelay = 500`, Silero `minSilenceDuration = 550`), the post-EOS grouping window collapsed to ~−50 ms and the turn committed the instant `END_OF_SPEECH` fired. Any natural mid-sentence pause split into two segments, and with realtime models using manual activity detection the second segment's `userTurnCompleted` never fired, so the agent stalled and never responded.

This is the same root cause as the Python issue livekit/agents#4325, now reproduced in `@livekit/agents@1.4.5` (and identical on `main`).

Root cause
`agents/src/voice/audio_recognition.ts`, inside `runEOUDetection`'s `bounceEOUTask` closure (lines 1155–1163 on `main`):
- `lastSpeakingTime` is stamped on VAD `INFERENCE_DONE` (≈ when the user stops).
- `bounceEOUTask` only runs at VAD `END_OF_SPEECH`, which Silero emits `minSilenceDuration` later.
- `lastSpeakingTime - Date.now() ≈ −minSilenceDuration`, giving `extraSleep ≈ minDelay − minSilenceDuration`.

In STT-based turn detection the adjustment is intentional and correct: `bounceEOUTask` runs from STT's `INFERENCE_DONE` event, and subtracting elapsed time keeps the post-speech window roughly `minDelay` long even when transcription took a while.

Fix
Skip the elapsed-since-speech adjustment in VAD mode so `minDelay` actually provides a real post-EOS grouping window that an upcoming `START_OF_SPEECH` can cancel. STT mode is unchanged.

Matches solution #2 from the linked Python issue.
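The mode-dependent behaviour described here can be sketched as a pure function. This is a minimal illustration, not the actual diff: the name `postSpeechDelayMs`, the `'vad'` mode literal, and the clamp to zero are assumptions made for the example; only `'stt'` and the numeric values come from this PR's description.

```typescript
// 'stt' appears in the tests of this PR; 'vad' is an assumed counterpart.
type TurnDetectionMode = 'vad' | 'stt';

// Hypothetical helper: how long to wait after the end-of-speech signal.
function postSpeechDelayMs(
  mode: TurnDetectionMode,
  minDelayMs: number,
  lastSpeakingTimeMs: number,
  nowMs: number,
): number {
  if (mode === 'vad') {
    // The VAD already waited minSilenceDuration before END_OF_SPEECH,
    // so the full minDelay is kept as the grouping window.
    return minDelayMs;
  }
  // STT: subtract the time already elapsed since speech stopped.
  // Clamping at zero is an assumption of this sketch.
  return Math.max(minDelayMs - (nowMs - lastSpeakingTimeMs), 0);
}

const nowAtMs = 10_000; // arbitrary timestamp
console.log(postSpeechDelayMs('vad', 300, nowAtMs - 550, nowAtMs)); // 300
console.log(postSpeechDelayMs('stt', 400, nowAtMs - 150, nowAtMs)); // 250
```

The two calls use the same numbers as the regression tests in this PR: 300 ms clears the ≥ 250 ms threshold in VAD mode, and the STT path still yields roughly 250 ms.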
Tests
New file `agents/src/voice/audio_recognition_endpointing_delay.test.ts` with two cases:
1. VAD-mode repro for #1741: sets `lastSpeakingTime = Date.now() − 550` (mirroring Silero's `minSilenceDuration`) and `minEndpointingDelay = 300`, then drives `runEOUDetection(empty)` and measures the time to `onEndOfTurn`. Expects ≥ 250 ms (i.e. roughly the configured `minDelay`). On `main` it fails with `elapsed ≈ 2 ms` (the turn commits immediately).
2. STT-path guard: `turnDetectionMode: 'stt'`, no VAD, `lastSpeakingTime` 150 ms ago, `minDelay = 400`. Expects `elapsed ≈ 250 ms` (the existing subtraction stays).

Verified that test 1 fails on `main` and both tests pass after the production change.

Test plan
- `pnpm vitest run agents/src/voice/audio_recognition_endpointing_delay.test.ts`: 2/2 pass
- `pnpm vitest run agents/src/voice/audio_recognition_endpointing.test.ts agents/src/voice/audio_recognition_vad_reset.test.ts`: 5/5 pass (unchanged)
- `pnpm build:agents`: clean
- `pnpm lint`: no new warnings on touched files
- `pnpm format:check`: clean

Closes #1741