fix(sarvam): use provider speech timing for eos#1763

Open
rosetta-livekit-bot[bot] wants to merge 1 commit into 1.5.0 from vatting-twilled-fifes

@rosetta-livekit-bot rosetta-livekit-bot Bot commented Jun 11, 2026

Summary

  • use Sarvam streaming speech_start/speech_end fields for final transcript timing
  • apply startTimeOffset to provider-relative speech timings
  • keep END_OF_SPEECH bare and delay it briefly so final transcripts normally arrive first
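The first two bullets can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the `ProviderTranscript` and `TranscriptTiming` shapes and the `resolveTiming` helper are hypothetical names, and the choice to leave a missing timing at 0 without adding the offset is an assumption.

```typescript
// Hypothetical message shape: only speech_start/speech_end are named in the PR.
interface ProviderTranscript {
  transcript: string;
  speech_start?: number | null; // seconds, relative to the provider stream (assumed unit)
  speech_end?: number | null;
}

interface TranscriptTiming {
  startTime: number;
  endTime: number;
}

// Shift a provider-relative time by the stream offset, or fall back to 0
// when the provider did not send a usable number (assumption: no offset on fallback).
function offsetOrZero(value: number | null | undefined, offset: number): number {
  return typeof value === 'number' && Number.isFinite(value) ? value + offset : 0;
}

function resolveTiming(td: ProviderTranscript, startTimeOffset: number): TranscriptTiming {
  return {
    startTime: offsetOrZero(td.speech_start, startTimeOffset),
    endTime: offsetOrZero(td.speech_end, startTimeOffset),
  };
}
```

When the provider returns null for both fields, the result is zero timing, which matches the "falling back to 0.0" behavior described in the original PR description further down.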

Tests

  • pnpm exec prettier --write plugins/sarvam/src/stt.ts
  • pnpm --filter @livekit/agents-plugin-sarvam lint
  • pnpm build:agents
  • pnpm --filter @livekit/agents-plugins-test --filter @livekit/agents-plugin-silero build
  • pnpm --filter @livekit/agents-plugin-openai build
  • pnpm --filter @livekit/agents-plugin-sarvam build
  • pnpm test -- plugins/sarvam/src/stt.test.ts

Ported from livekit/agents#6052

Original PR description

Problem

The Sarvam streaming STT plugin tried to manufacture an audio-relative speech-end time from two sources that don't actually provide one:

  • The VAD END_SPEECH event's occured_at field — which logging proved is a wall-clock Unix epoch, not an audio-stream offset, so it was rejected 100% of the time by a range-check heuristic.
  • A local send-clock counter (_audio_position) that counts audio uploaded, not processed — biased and fabricated.

Sarvam's streaming socket genuinely sends no usable word timing (no timestamps array; speech_start/speech_end come back null), so all this machinery produced misleading timestamps.
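To illustrate why a wall-clock epoch can never pass as an audio-stream offset, here is a small sketch of a range check of the kind described above. The removed heuristic itself is not shown in this PR, so `isPlausibleAudioOffset`, its tolerance, and the sample values are all hypothetical.

```typescript
// Hypothetical range check: a candidate speech-end time is only plausible if it
// falls within the audio sent so far, plus a small tolerance.
function isPlausibleAudioOffset(
  candidateSec: number,
  audioSentSec: number,
  toleranceSec = 1,
): boolean {
  return (
    Number.isFinite(candidateSec) &&
    candidateSec >= 0 &&
    candidateSec <= audioSentSec + toleranceSec
  );
}

// A Unix epoch value in seconds is around 1.7 billion, far beyond any
// realistic stream length, so the check rejects it every time.
const epochSeconds = 1781136000; // illustrative value only
const audioSentSec = 42; // illustrative value only
const accepted = isPlausibleAudioOffset(epochSeconds, audioSentSec);
```

The same holds if the epoch is in milliseconds, since that value is larger still.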

Change

Aligned Sarvam with how every other STT plugin (Deepgram, AssemblyAI, Google, Azure…) handles this:

  1. END_OF_SPEECH is now emitted bare — no alternatives/end_time. Previously it carried an empty-text SpeechData with a fabricated end_time + unused speech_end_wall_time metadata.
  2. FINAL_TRANSCRIPT timing comes only from the provider — start_time/end_time read from speech_start/speech_end, falling back to 0.0. The voice pipeline then uses wall-clock for EOU timing (standard behavior for providers without streaming word timing).
  3. Deleted the dead machinery: _interpret_signal_time, _resolved_speech_end, the _audio_position send-clock (+ its hot-loop increment), _utterance_server_speech_end, _utterance_speech_end_audio_pos, _utterance_speech_end_wall, and the require_end_time param.
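Items 1 and 2 amount to a difference in event shape, sketched below. The types here are local stand-ins defined for illustration, not the framework's actual exports, and the original change was made in the Python plugin, so this TypeScript is only an approximation of the idea.

```typescript
// Local stand-in types (assumptions), defined so the sketch is self-contained.
enum SpeechEventType {
  FINAL_TRANSCRIPT = 'final_transcript',
  END_OF_SPEECH = 'end_of_speech',
}

interface SpeechData {
  text: string;
  startTime: number;
  endTime: number;
}

interface SpeechEvent {
  type: SpeechEventType;
  alternatives?: SpeechData[];
}

// Item 1: END_OF_SPEECH is bare, with no alternatives and no end time.
function endOfSpeechEvent(): SpeechEvent {
  return { type: SpeechEventType.END_OF_SPEECH };
}

// Item 2: FINAL_TRANSCRIPT timing comes only from the provider fields,
// falling back to 0 when they are null or absent.
function finalTranscriptEvent(
  text: string,
  speechStart?: number | null,
  speechEnd?: number | null,
): SpeechEvent {
  return {
    type: SpeechEventType.FINAL_TRANSCRIPT,
    alternatives: [{ text, startTime: speechStart ?? 0, endTime: speechEnd ?? 0 }],
  };
}
```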

changeset-bot Bot commented Jun 11, 2026

🦋 Changeset detected

Latest commit: 4baec50

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 34 packages
Name Type
@livekit/agents-plugin-sarvam Patch
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-assemblyai Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-cerebras Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-fishaudio Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-hume Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-lemonslice Patch
@livekit/agents-plugin-liveavatar Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-minimax Patch
@livekit/agents-plugin-mistral Patch
@livekit/agents-plugin-mistralai Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-perplexity Patch
@livekit/agents-plugin-phonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-runway Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugin-soniox Patch
@livekit/agents-plugin-tavus Patch
@livekit/agents-plugin-trugen Patch
@livekit/agents-plugin-xai Patch
@livekit/agents-plugins-test Patch

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

Comment thread plugins/sarvam/src/stt.ts
Comment on lines +909 to +911
} else if (this.#sendFinalTranscript(td, putMessage)) {
this.#finalReceivedForUtterance = true;
}

🟡 Missing #eosEmittedForUtterance guard allows FINAL_TRANSCRIPT after END_OF_SPEECH

When the EOS fallback timer fires (because the server sent END_SPEECH but no transcript arrived within 1000ms), #pendingEos is set to false and #eosEmittedForUtterance is set to true. If a late transcript data message subsequently arrives, the code at line 909 takes the else if branch (since #pendingEos is false) and calls #sendFinalTranscript without checking #eosEmittedForUtterance. This emits a FINAL_TRANSCRIPT event after END_OF_SPEECH was already emitted, violating the expected event ordering (START_OF_SPEECH → FINAL_TRANSCRIPT → END_OF_SPEECH). Downstream in audio_recognition.ts:837-897, this late FINAL_TRANSCRIPT updates audioTranscript, triggers preemptive generation, and runs EOU detection again — all after the user turn was already committed at audio_recognition.ts:1047.

Note that #tryCommitUtterance at plugins/sarvam/src/stt.ts:661 correctly guards against this with this.#eosEmittedForUtterance, but the direct #sendFinalTranscript call path at line 909 does not.

Suggested change
- } else if (this.#sendFinalTranscript(td, putMessage)) {
-   this.#finalReceivedForUtterance = true;
- }
+ } else if (!this.#eosEmittedForUtterance && this.#sendFinalTranscript(td, putMessage)) {
+   this.#finalReceivedForUtterance = true;
+ }
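The ordering problem and the guard can be modeled with a small, simplified state machine. This is a sketch under assumptions, not the plugin's implementation: the class and method names are hypothetical, the fallback timer is replaced by an explicit `onFallbackTimeout()` call so the model stays synchronous, and the handling of a transcript that arrives before END_SPEECH is a guess.

```typescript
type SttEvent = 'START_OF_SPEECH' | 'FINAL_TRANSCRIPT' | 'END_OF_SPEECH';

// Simplified model of per-utterance state (hypothetical names).
class UtteranceOrdering {
  readonly events: SttEvent[] = [];
  private pendingEos = false;
  private eosEmitted = false;
  private finalReceived = false;

  onStartSpeech(): void {
    // A new utterance resets all per-utterance flags.
    this.pendingEos = false;
    this.eosEmitted = false;
    this.finalReceived = false;
    this.events.push('START_OF_SPEECH');
  }

  onEndSpeech(): void {
    if (this.eosEmitted) return;
    if (this.finalReceived) {
      this.emitEos(); // transcript already delivered, nothing to wait for
    } else {
      this.pendingEos = true; // real code would start the fallback timer here
    }
  }

  // Stands in for the fallback timer firing without a transcript.
  onFallbackTimeout(): void {
    if (this.pendingEos) this.emitEos();
  }

  onTranscript(text: string): void {
    // The guard: drop transcripts that arrive after END_OF_SPEECH was emitted.
    if (text.trim() === '' || this.eosEmitted) return;
    this.events.push('FINAL_TRANSCRIPT');
    this.finalReceived = true;
    if (this.pendingEos) this.emitEos();
  }

  private emitEos(): void {
    this.pendingEos = false;
    this.eosEmitted = true;
    this.events.push('END_OF_SPEECH');
  }
}
```

In this model a late transcript is dropped once END_OF_SPEECH has gone out. Whether dropping is preferable to some other handling of the late text is a design decision for the plugin's maintainers.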
Comment thread plugins/sarvam/src/stt.ts

const SAMPLE_RATE = 16000;
const NUM_CHANNELS = 1;
const EOS_FALLBACK_TIMEOUT = 1000;

🚩 EOS_FALLBACK_TIMEOUT of 1000ms may need tuning

The EOS_FALLBACK_TIMEOUT constant is set to 1000ms at line 38. This is the maximum time the system will wait for a transcript after receiving END_SPEECH before emitting END_OF_SPEECH without one. If Sarvam's server processing latency sometimes exceeds 1000ms (e.g., for longer utterances or under load), the fallback could fire prematurely, causing the transcript to arrive after END_OF_SPEECH (which is the scenario in BUG-0001). The Sarvam STT metrics logging at lines 894-896 captures processing_latency; monitoring this in production would help determine whether the 1000ms timeout is appropriate.
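One way to evaluate the timeout against collected latency samples is sketched below. The helper names are hypothetical, the sample values are made up for illustration, and the sketch assumes the logged processing_latency has been converted to milliseconds.

```typescript
// Fraction of samples that would have outlasted the fallback timeout.
function fractionOver(latenciesMs: number[], thresholdMs: number): number {
  if (latenciesMs.length === 0) return 0;
  const over = latenciesMs.filter((ms) => ms > thresholdMs).length;
  return over / latenciesMs.length;
}

// Nearest-rank percentile over the samples (p in the range 0-100).
function percentile(latenciesMs: number[], p: number): number {
  if (latenciesMs.length === 0) throw new Error('no samples');
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Made-up samples, for illustration only.
const samplesMs = [200, 450, 1200, 900];
const shareOverTimeout = fractionOver(samplesMs, 1000);
const p95 = percentile(samplesMs, 95);
```

If a meaningful share of samples exceeds the timeout, either the constant or the late-transcript handling would need attention.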
