Summary
In turnDetection: 'vad' mode, minEndpointingDelay has effectively no effect when it is ≤ the VAD's minSilenceDuration. The end-of-utterance grouping window collapses to ~0, so the turn commits the instant END_OF_SPEECH fires. This is the same root cause as Python issue livekit/agents#4325 (closed without a code fix), and it is still present in the latest @livekit/agents@1.4.5.
Root cause
agents/src/voice/audio_recognition.ts, in bounceEOUTask (compiled dist/voice/audio_recognition.js, 1.4.5 line ~818):
let extraSleep = endpointingDelay; // = endpointing.minDelay (default 500ms)
if (lastSpeakingTime !== void 0) {
extraSleep += lastSpeakingTime - Date.now(); // subtracts silence already elapsed
}
if (extraSleep > 0) {
await delay(Math.max(extraSleep, 0), { signal: controller.signal });
}
lastSpeakingTime is stamped on INFERENCE_DONE (≈ when the user stops speaking), but bounceEOUTask only runs at END_OF_SPEECH, which Silero emits minSilenceDuration (~550ms) later. So lastSpeakingTime - Date.now() ≈ -550ms, giving extraSleep ≈ minDelay - minSilenceDuration. With the defaults (minDelay=500, minSilenceDuration=550) that is negative → no wait → immediate commit. Measured from the moment the user stops speaking, the effective delay is ≈ max(minSilenceDuration, minDelay), so minDelay is silently ignored unless it exceeds minSilenceDuration.
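To make the arithmetic concrete, here is a minimal sketch of the same calculation as pure functions. The helper names (extraSleepMs, effectiveDelayMs) are hypothetical and not part of @livekit/agents; they only mirror the expression quoted above, assuming END_OF_SPEECH arrives exactly minSilenceDuration after the user stops.

```typescript
// Hypothetical helpers that mirror the quoted arithmetic; not library code.

// Extra wait applied after END_OF_SPEECH, given how much silence has already
// elapsed since lastSpeakingTime (i.e. Date.now() - lastSpeakingTime).
function extraSleepMs(minDelayMs: number, silenceElapsedMs: number): number {
  return Math.max(minDelayMs - silenceElapsedMs, 0);
}

// Total delay measured from the moment the user stops speaking, assuming
// END_OF_SPEECH fires minSilenceMs after that moment.
function effectiveDelayMs(minDelayMs: number, minSilenceMs: number): number {
  return minSilenceMs + extraSleepMs(minDelayMs, minSilenceMs);
}

console.log(extraSleepMs(500, 550));      // 0    -> commit fires immediately
console.log(effectiveDelayMs(500, 550));  // 550  -> minDelay has no effect
console.log(effectiveDelayMs(1000, 550)); // 1000 -> only 450ms of real grouping window
```

Even when minDelay exceeds minSilenceDuration, the window after END_OF_SPEECH is only the difference between the two, not the full minDelay.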
Why it matters (worse than latency in realtime/manual-activity mode)
With a realtime model using manual activity detection (e.g. @livekit/agents-plugin-google with automaticActivityDetection.disabled), the missing grouping window means a natural mid-sentence pause ("No, that's okay. … just use Alex") splits into two VAD segments:
- Segment 1 commits a turn immediately (generation starts).
- Segment 2 begins while segment 1 is still generating, so a second userTurnCompleted/generateReply never fires for it.
- The activity window opened for segment 2 is never closed → the model waits indefinitely → the agent never responds (dead call).
Reproduced consistently in low-concurrency local runs: any caller utterance containing a ~1s pause stalls the turn.
Expected
minDelay should provide a real grouping window after END_OF_SPEECH (so START_OF_SPEECH can cancel the pending commit), independent of how long silence detection took.
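The expected behavior can be illustrated with a small offline simulation. This is only a sketch under stated assumptions: VadEvent and commitTimes are hypothetical names, the timeline is made up for illustration, and the real implementation uses an abortable delay rather than replaying events. It shows how a grouping window measured from END_OF_SPEECH lets a following START_OF_SPEECH cancel the pending commit.

```typescript
// Hypothetical model of the grouping window; not the library's implementation.
type VadEvent = { type: 'START_OF_SPEECH' | 'END_OF_SPEECH'; atMs: number };

// Returns the timestamps at which a turn would be committed, given a window
// that starts at END_OF_SPEECH and is cancelled by START_OF_SPEECH.
function commitTimes(events: VadEvent[], windowMs: number): number[] {
  const commits: number[] = [];
  let pendingCommitAt: number | undefined;

  for (const ev of [...events].sort((a, b) => a.atMs - b.atMs)) {
    // A pending commit whose deadline has passed fires before this event.
    if (pendingCommitAt !== undefined && pendingCommitAt <= ev.atMs) {
      commits.push(pendingCommitAt);
      pendingCommitAt = undefined;
    }
    if (ev.type === 'START_OF_SPEECH') {
      pendingCommitAt = undefined; // speech resumed: cancel the pending commit
    } else {
      pendingCommitAt = ev.atMs + windowMs; // schedule commit after the window
    }
  }
  if (pendingCommitAt !== undefined) commits.push(pendingCommitAt);
  return commits;
}

// Illustrative timeline: user stops at ~450ms, END_OF_SPEECH at 1000ms
// (550ms later), speech resumes at 1450ms (a ~1s pause), final END_OF_SPEECH at 3000ms.
const timeline: VadEvent[] = [
  { type: 'START_OF_SPEECH', atMs: 0 },
  { type: 'END_OF_SPEECH', atMs: 1000 },
  { type: 'START_OF_SPEECH', atMs: 1450 },
  { type: 'END_OF_SPEECH', atMs: 3000 },
];

console.log(commitTimes(timeline, 0));   // [1000, 3000] -> split into two turns
console.log(commitTimes(timeline, 500)); // [3500]       -> grouped into one turn
```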
Suggested fix
Per #4325's proposed solution #2: in VAD-based turn detection, measure the endpointing delay from END_OF_SPEECH rather than from lastSpeakingTime — e.g. skip the lastSpeakingTime - Date.now() adjustment when vadBaseTurnDetection is true. STT mode (where the adjustment compensates for transcription latency) is unaffected.
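A rough sketch of what that change could look like, written as a standalone pure function so it can be reasoned about in isolation. EouDelayInput and computeExtraSleepMs are hypothetical names; only vadBaseTurnDetection and the lastSpeakingTime adjustment come from the code quoted above, and the actual patch would live inside bounceEOUTask.

```typescript
// Hypothetical standalone version of the proposed adjustment; not a patch.
interface EouDelayInput {
  endpointingDelayMs: number;    // endpointing.minDelay
  lastSpeakingTimeMs?: number;   // stamped on INFERENCE_DONE
  nowMs: number;                 // Date.now() when bounceEOUTask runs
  vadBaseTurnDetection: boolean; // true in turnDetection: 'vad' mode
}

function computeExtraSleepMs(input: EouDelayInput): number {
  let extraSleep = input.endpointingDelayMs;
  // Only subtract already-elapsed silence outside VAD-based turn detection,
  // where the adjustment compensates for transcription latency.
  if (!input.vadBaseTurnDetection && input.lastSpeakingTimeMs !== undefined) {
    extraSleep += input.lastSpeakingTimeMs - input.nowMs;
  }
  return Math.max(extraSleep, 0);
}

// With the defaults from this report (minDelay=500, END_OF_SPEECH 550ms after speech stops):
const base = { endpointingDelayMs: 500, lastSpeakingTimeMs: 0, nowMs: 550 };
console.log(computeExtraSleepMs({ ...base, vadBaseTurnDetection: true }));  // 500
console.log(computeExtraSleepMs({ ...base, vadBaseTurnDetection: false })); // 0
```

In VAD mode this gives the full minDelay as a grouping window after END_OF_SPEECH, while the non-VAD path keeps the current behavior.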
Environment
- @livekit/agents 1.3.4 (verified identical in 1.4.5)
- @livekit/agents-plugin-silero 1.3.4, @livekit/agents-plugin-google 1.3.4 (realtime, manual activity)
- turnHandling: { turnDetection: 'vad', endpointing: { minDelay: 500, maxDelay: 1500 } }, Silero minSilenceDuration 550ms
- Node.js, Linux/macOS
Related: #926 (unnecessary delay in manual mode — opposite direction).