minEndpointingDelay is nullified in VAD turn-detection mode (swallowed by Silero minSilenceDuration), breaking multi-segment turn grouping #1741

@dbloembe

Summary

In turnDetection: 'vad' mode, minEndpointingDelay has no practical effect whenever it is ≤ the VAD's minSilenceDuration. The end-of-utterance grouping window collapses to ~0, so the turn commits the instant END_OF_SPEECH fires. This is the same root cause as Python issue livekit/agents#4325 (closed without a code fix), and it is still present in the latest @livekit/agents@1.4.5.

Root cause

agents/src/voice/audio_recognition.ts, in bounceEOUTask (compiled dist/voice/audio_recognition.js, 1.4.5 line ~818):

let extraSleep = endpointingDelay;            // = endpointing.minDelay (default 500ms)
if (lastSpeakingTime !== void 0) {
  extraSleep += lastSpeakingTime - Date.now(); // subtracts silence already elapsed
}
if (extraSleep > 0) {
  await delay(Math.max(extraSleep, 0), { signal: controller.signal });
}

lastSpeakingTime is stamped on INFERENCE_DONE (≈ when the user stops), but bounceEOUTask only runs at END_OF_SPEECH, which Silero emits minSilenceDuration (~550ms) later. So lastSpeakingTime - Date.now() ≈ -550ms, giving extraSleep ≈ minDelay - minSilenceDuration. With the defaults (minDelay=500, minSilence=550) that's negative → no wait → immediate commit. Measured from the moment the user stops, the total delay is minSilenceDuration + max(minDelay - minSilenceDuration, 0) = max(minSilenceDuration, minDelay), so minDelay is silently ignored unless it exceeds minSilenceDuration.
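The arithmetic can be checked in isolation. The following is a minimal sketch, not library code: extraSleepMs is a hypothetical helper that mirrors the snippet above with the clock passed in explicitly, and the timestamps are made-up values chosen so that END_OF_SPEECH fires exactly minSilenceDuration after the user stops.

```typescript
// Hypothetical helper mirroring the extraSleep arithmetic in bounceEOUTask.
// `now` is passed in instead of calling Date.now() so the result is deterministic.
function extraSleepMs(
  minDelayMs: number,
  lastSpeakingTime: number | undefined,
  now: number,
): number {
  let extraSleep = minDelayMs;
  if (lastSpeakingTime !== undefined) {
    extraSleep += lastSpeakingTime - now; // subtracts silence already elapsed
  }
  return Math.max(extraSleep, 0); // a negative value means "do not wait"
}

// Assumed timeline (ms on an arbitrary clock): the user stops speaking at 9450,
// and Silero emits END_OF_SPEECH minSilenceDuration (550ms) later.
const minSilenceDurationMs = 550;
const endOfSpeechAt = 10_000;
const lastSpeakingTime = endOfSpeechAt - minSilenceDurationMs;

// Default minDelay (500ms): no wait at all after END_OF_SPEECH.
const waitWithDefault = extraSleepMs(500, lastSpeakingTime, endOfSpeechAt);

// minDelay raised above minSilenceDuration (800ms): only the excess is waited.
const waitWithLongerDelay = extraSleepMs(800, lastSpeakingTime, endOfSpeechAt);

console.log({ waitWithDefault, waitWithLongerDelay });
```

Under these assumptions the grouping window after END_OF_SPEECH is 0ms for the default and 250ms for minDelay=800, which matches max(minSilenceDuration, minDelay) when measured from the moment the user stops.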

Why it matters (worse than latency in realtime/manual-activity mode)

With a realtime model using manual activity detection (e.g. @livekit/agents-plugin-google with automaticActivityDetection.disabled), the missing grouping window means a natural mid-sentence pause ("No, that's okay. … just use Alex") splits into two VAD segments:

  1. Segment 1 commits a turn immediately (generation starts).
  2. Segment 2 begins while segment 1 is still generating, so a second userTurnCompleted/generateReply never fires for it.
  3. The activity window opened for segment 2 is never closed → the model waits indefinitely → the agent never responds (dead call).

Reproduced consistently in low-concurrency local runs: any caller utterance containing a ~1s pause stalls the turn.

Expected

minDelay should provide a real grouping window after END_OF_SPEECH (so START_OF_SPEECH can cancel the pending commit), independent of how long silence detection took.
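To illustrate the expected behaviour, here is a simplified model (an assumption for illustration, not how the library is structured): each VAD segment is reduced to the times at which START_OF_SPEECH and END_OF_SPEECH fire, a commit is scheduled windowMs after each END_OF_SPEECH, and a START_OF_SPEECH that arrives first cancels it. VadSegment and countCommittedTurns are hypothetical names, and the timestamps are invented.

```typescript
// Hypothetical model of one VAD segment (ms on an arbitrary clock).
interface VadSegment {
  startOfSpeech: number;
  endOfSpeech: number;
}

// Counts committed turns when each END_OF_SPEECH schedules a commit
// `windowMs` later and the next START_OF_SPEECH can cancel it.
function countCommittedTurns(segments: VadSegment[], windowMs: number): number {
  let turns = 0;
  segments.forEach((segment, i) => {
    const commitAt = segment.endOfSpeech + windowMs;
    const next = segments[i + 1];
    const cancelled = next !== undefined && next.startOfSpeech < commitAt;
    if (!cancelled) {
      turns++;
    }
  });
  return turns;
}

// Assumed utterance with a ~1s mid-sentence pause: the user stops at 1000,
// END_OF_SPEECH fires 550ms later at 1550, and speech resumes at 2000.
const utterance: VadSegment[] = [
  { startOfSpeech: 0, endOfSpeech: 1550 },
  { startOfSpeech: 2000, endOfSpeech: 3550 },
];

// Current behaviour: the window collapses to ~0, so the pause splits the turn.
const turnsWithCollapsedWindow = countCommittedTurns(utterance, 0);

// Expected behaviour: a real 500ms window lets segment 2 cancel the commit.
const turnsWithRealWindow = countCommittedTurns(utterance, 500);

console.log({ turnsWithCollapsedWindow, turnsWithRealWindow });
```

In this model the collapsed window yields two turns and the 500ms window yields one, because segment 2 starts 450ms after the first END_OF_SPEECH.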

Suggested fix

Per #4325's proposed solution #2: in VAD-based turn detection, measure the endpointing delay from END_OF_SPEECH rather than from lastSpeakingTime — e.g. skip the lastSpeakingTime - Date.now() adjustment when vadBaseTurnDetection is true. STT mode (where the adjustment compensates for transcription latency) is unaffected.
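A minimal sketch of that change, written as a standalone function rather than a patch: endpointingWaitMs and its options object are hypothetical, the flag name is taken from the description above, and the real bounceEOUTask may need to handle more state than is shown here.

```typescript
// Hypothetical options for illustration; not the library's actual types.
interface EndpointingWaitOptions {
  minDelayMs: number;
  lastSpeakingTime: number | undefined;
  now: number; // passed in instead of Date.now() to keep the sketch deterministic
  vadBaseTurnDetection: boolean;
}

// Returns how long to wait after END_OF_SPEECH before committing the turn.
function endpointingWaitMs(opts: EndpointingWaitOptions): number {
  let extraSleep = opts.minDelayMs;
  // Proposed change: only subtract already-elapsed silence outside VAD mode,
  // where the adjustment compensates for transcription latency.
  if (!opts.vadBaseTurnDetection && opts.lastSpeakingTime !== undefined) {
    extraSleep += opts.lastSpeakingTime - opts.now;
  }
  return Math.max(extraSleep, 0);
}

// Same assumed timeline as the defaults in this report:
// END_OF_SPEECH fires 550ms after the user stops, minDelay is 500ms.
const base = { minDelayMs: 500, lastSpeakingTime: 9_450, now: 10_000 };

const vadModeWait = endpointingWaitMs({ ...base, vadBaseTurnDetection: true });
const sttModeWait = endpointingWaitMs({ ...base, vadBaseTurnDetection: false });

console.log({ vadModeWait, sttModeWait });
```

With the flag set, the full 500ms window is kept after END_OF_SPEECH; without it, the result is the same 0ms wait as today, so STT mode behaves as before.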

Environment

  • @livekit/agents 1.3.4 (verified identical in 1.4.5)
  • @livekit/agents-plugin-silero 1.3.4, @livekit/agents-plugin-google 1.3.4 (realtime, manual activity)
  • turnHandling: { turnDetection: 'vad', endpointing: { minDelay: 500, maxDelay: 1500 } }, Silero minSilenceDuration 550ms
  • Node.js, Linux/macOS

Related: #926 (unnecessary delay in manual mode — opposite direction).
