fix(sarvam): use provider speech timing for eos #1763
rosetta-livekit-bot[bot] wants to merge 1 commit into
🦋 Changeset detected
Latest commit: 4baec50
The changes in this PR will be included in the next version bump. This PR includes changesets to release 34 packages.
} else if (this.#sendFinalTranscript(td, putMessage)) {
  this.#finalReceivedForUtterance = true;
}
🟡 Missing #eosEmittedForUtterance guard allows FINAL_TRANSCRIPT after END_OF_SPEECH
When the EOS fallback timer fires (because the server sent END_SPEECH but no transcript arrived within 1000ms), #pendingEos is set to false and #eosEmittedForUtterance is set to true. If a late transcript data message subsequently arrives, the code at line 909 takes the else if branch (since #pendingEos is false) and calls #sendFinalTranscript without checking #eosEmittedForUtterance. This emits a FINAL_TRANSCRIPT event after END_OF_SPEECH was already emitted, violating the expected event ordering (START_OF_SPEECH → FINAL_TRANSCRIPT → END_OF_SPEECH). Downstream in audio_recognition.ts:837-897, this late FINAL_TRANSCRIPT updates audioTranscript, triggers preemptive generation, and runs EOU detection again — all after the user turn was already committed at audio_recognition.ts:1047.
Note that #tryCommitUtterance at plugins/sarvam/src/stt.ts:661 correctly guards against this with this.#eosEmittedForUtterance, but the direct #sendFinalTranscript call path at line 909 does not.
Suggested change:
- } else if (this.#sendFinalTranscript(td, putMessage)) {
-   this.#finalReceivedForUtterance = true;
- }
+ } else if (!this.#eosEmittedForUtterance && this.#sendFinalTranscript(td, putMessage)) {
+   this.#finalReceivedForUtterance = true;
+ }
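To make the ordering problem concrete, here is a minimal sketch of the utterance state as described in this comment. It is not the plugin's actual class: `UtteranceState`, its method names, and the `guardLateFinals` switch are hypothetical, and the `pendingEos` branch only approximates what `#tryCommitUtterance` is described as doing.

```typescript
// Hypothetical, simplified model of the flags discussed in this review comment.
// None of these names come from plugins/sarvam/src/stt.ts.
type SpeechEvent = 'START_OF_SPEECH' | 'FINAL_TRANSCRIPT' | 'END_OF_SPEECH';

class UtteranceState {
  pendingEos = false; // server sent END_SPEECH, transcript not yet seen
  eosEmitted = false; // END_OF_SPEECH already emitted for this utterance
  finalReceived = false; // a FINAL_TRANSCRIPT was emitted for this utterance
  readonly events: SpeechEvent[] = [];

  // guardLateFinals = true models the suggested change; false models the current code.
  constructor(private readonly guardLateFinals: boolean = true) {}

  onStartOfSpeech(): void {
    this.pendingEos = false;
    this.eosEmitted = false;
    this.finalReceived = false;
    this.events.push('START_OF_SPEECH');
  }

  // Provider signalled end of speech; wait for the transcript or the fallback timer.
  onEndSpeech(): void {
    this.pendingEos = true;
  }

  // Fallback timer fired without a transcript: emit END_OF_SPEECH on its own.
  onFallbackTimeout(): void {
    if (!this.pendingEos) return;
    this.pendingEos = false;
    this.eosEmitted = true;
    this.events.push('END_OF_SPEECH');
  }

  onTranscript(text: string): void {
    if (text.trim().length === 0) return;
    if (this.pendingEos) {
      // Transcript arrived in time: commit the final, then end of speech.
      this.pendingEos = false;
      this.finalReceived = true;
      this.eosEmitted = true;
      this.events.push('FINAL_TRANSCRIPT', 'END_OF_SPEECH');
    } else if (!this.guardLateFinals || !this.eosEmitted) {
      // With the guard enabled, a transcript that arrives after END_OF_SPEECH is dropped.
      this.finalReceived = true;
      this.events.push('FINAL_TRANSCRIPT');
    }
  }
}
```

Driving this model with start of speech, `onEndSpeech()`, `onFallbackTimeout()`, and then a late `onTranscript(...)` yields no FINAL_TRANSCRIPT after END_OF_SPEECH when the guard is on, and a trailing FINAL_TRANSCRIPT when it is off. Whether a late transcript should be dropped or surfaced some other way is a design decision for the plugin; the sketch only shows the ordering effect of the guard.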
const SAMPLE_RATE = 16000;
const NUM_CHANNELS = 1;
const EOS_FALLBACK_TIMEOUT = 1000;
🚩 EOS_FALLBACK_TIMEOUT of 1000ms may need tuning
The EOS_FALLBACK_TIMEOUT constant is set to 1000 ms at line 38. This is the maximum time the system waits for a transcript after receiving END_SPEECH before it emits END_OF_SPEECH without one. If Sarvam's server processing latency sometimes exceeds 1000 ms (e.g., for longer utterances or under load), the fallback could fire prematurely, so the transcript would arrive after END_OF_SPEECH (the scenario in BUG-0001). The Sarvam STT metrics logging at lines 894-896 captures processing_latency; monitoring it in production would help determine whether the 1000 ms timeout is appropriate.
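One way to act on this is to measure how often observed latency would have outrun the timeout. The sketch below is illustrative only: `fractionExceedingTimeout` is a hypothetical helper, and it assumes the logged processing_latency values have been collected as numbers in milliseconds, which this PR does not confirm.

```typescript
// Same value as the constant under review.
const EOS_FALLBACK_TIMEOUT = 1000;

// Hypothetical helper: share of observed latencies (assumed to be in ms) that
// exceed the timeout, i.e. cases where the fallback would fire before the transcript.
function fractionExceedingTimeout(
  latenciesMs: number[],
  timeoutMs: number = EOS_FALLBACK_TIMEOUT,
): number {
  if (latenciesMs.length === 0) return 0;
  const over = latenciesMs.filter((latency) => latency > timeoutMs).length;
  return over / latenciesMs.length;
}
```

If a non-trivial share of samples exceeds the timeout, either raising the constant or handling late transcripts explicitly would be worth considering.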
Summary
Tests
Ported from livekit/agents#6052
Original PR description
Problem
The Sarvam streaming STT plugin tried to manufacture an audio-relative speech-end time from two sources that don't actually provide one:
Sarvam's streaming socket genuinely sends no usable word timing (no timestamps array; speech_start/speech_end come back null), so all this machinery produced misleading timestamps.
Change
Aligned Sarvam with how every other STT plugin (Deepgram, AssemblyAI, Google, Azure…) handles this: