Problem
VAPI delivers telephony audio at 8 kHz. Sarvam's streaming STT WebSocket
requires sample_rate to be set explicitly in BOTH the connection
parameters and the audio data parameters. If the two values mismatch,
which happens silently when using VAPI's default audio pipeline,
transcription quality degrades significantly and no error is thrown. The
API just returns bad transcripts. This is not obvious from the VAPI
integration docs and is likely causing silent WER degradation in the
current PR #12 implementation.
Proposed fix
Explicitly set sample_rate=8000 in both:
- WebSocket connection handshake
- Every audio chunk sent via the transcribe parameter
Also switch from Saarika v2.5 (being deprecated) to Saaras v3 with
mode="transcribe" as recommended in Sarvam's own deprecation notice.
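A minimal sketch of the idea, not the actual PR #12 code: keep the sample rate in one config object and derive both the handshake parameters and the per-chunk payload from it, so the two values cannot drift apart. The class name `SttStreamConfig`, the query-parameter names, the JSON payload shape, and the model string `"saaras:v3"` are all assumptions for illustration and should be checked against Sarvam's current API reference before use.

```python
import base64
import json
from dataclasses import dataclass
from urllib.parse import urlencode

# 8 kHz telephony audio from VAPI, per the problem statement above.
TELEPHONY_SAMPLE_RATE = 8000


@dataclass(frozen=True)
class SttStreamConfig:
    """Hypothetical single source of truth for values that must match."""

    sample_rate: int = TELEPHONY_SAMPLE_RATE
    model: str = "saaras:v3"  # assumed identifier string for Saaras v3
    mode: str = "transcribe"

    def connection_query(self) -> str:
        # Assumed query-parameter names for the WebSocket handshake.
        return urlencode(
            {
                "model": self.model,
                "mode": self.mode,
                "sample_rate": self.sample_rate,
            }
        )

    def audio_message(self, pcm_chunk: bytes) -> str:
        # Assumed per-chunk payload shape. The sample rate is repeated on
        # every chunk and always comes from the same field as the handshake.
        return json.dumps(
            {
                "audio": {
                    "data": base64.b64encode(pcm_chunk).decode("ascii"),
                    "sample_rate": self.sample_rate,
                }
            }
        )
```

With this shape, the connection URL would be built from `cfg.connection_query()` and each chunk sent as `cfg.audio_message(chunk)`, so changing the rate in one place updates both.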
Impact
Fixing this alone could recover several WER points on all telephony
calls without any model or architecture changes.