Disclosure up front: I work on speech-sdk (Apache 2.0). This is a proposal to add an optional TTS plugin built on it. The integration runs fully BYOK against provider APIs with your users' own keys; no account with us is needed.
What
A new @livekit/agents-plugin-speechsdk package: non-streaming TTS where the model is one provider/model string across 15 providers:
import * as speechsdk from '@livekit/agents-plugin-speechsdk';
const tts = new speechsdk.TTS({ model: 'openai/gpt-4o-mini-tts', voice: 'alloy' });
// or: { model: 'murf/FALCON', voice: 'en-US-amara' }
// or: { model: 'fal-ai/kokoro/american-english', voice: 'af_heart' }
Keys resolve from each provider's standard env var (OPENAI_API_KEY, MURF_API_KEY, ...), and calls go directly to that provider.
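To make the key-resolution step concrete, here is a minimal sketch of how a provider/model string could be split and mapped to an env var. The helper names (parseModelRef, resolveApiKey) and the lookup table are hypothetical and are not speech-sdk's or the plugin's actual API. Only the two env var names stated above are listed. The split is on the first slash because model IDs such as fal-ai/kokoro/american-english contain further slashes.

```typescript
// Hypothetical helpers for illustration; not speech-sdk's real API.
interface ModelRef {
  provider: string;
  model: string;
}

// Only the env var names mentioned in this proposal; other providers omitted.
const ENV_VARS: Record<string, string | undefined> = {
  openai: 'OPENAI_API_KEY',
  murf: 'MURF_API_KEY',
};

// Split on the FIRST slash only, so the model part may itself contain slashes.
function parseModelRef(ref: string): ModelRef {
  const slash = ref.indexOf('/');
  if (slash <= 0 || slash === ref.length - 1) {
    throw new Error(`expected "provider/model", got "${ref}"`);
  }
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}

// Look up the provider's env var name, then read the key from the given env.
function resolveApiKey(
  ref: string,
  env: Record<string, string | undefined>,
): string {
  const { provider } = parseModelRef(ref);
  const name = ENV_VARS[provider];
  if (!name) throw new Error(`no env var known for provider "${provider}"`);
  const key = env[name];
  if (!key) throw new Error(`missing ${name} for provider "${provider}"`);
  return key;
}
```

In Node this would be called as resolveApiKey('openai/gpt-4o-mini-tts', process.env). Passing the environment in as a parameter keeps the lookup easy to test.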
Why this might be useful despite the existing plugins
Being upfront: agents-js already has dedicated plugins for most major TTS vendors, and for production streaming those remain the better choice (this plugin is ChunkedStream-only; AgentSession wraps it in the sentence-level StreamAdapter). The cases this adds:
- Providers with no dedicated plugin today: Murf, Smallest.ai, and fal.ai-hosted open-weight models (Kokoro, Orpheus, F5), plus xAI TTS (the xai plugin currently covers Realtime/STT/LLM).
- Provider evaluation: swapping model: 'elevenlabs/eleven_flash_v2_5' for model: 'cartesia/sonic-3' is a one-string change with no new dependency, which makes A/B-ing voices across vendors cheap during development before committing to a dedicated streaming plugin.
- Default that works with keys most users already have: defaults to openai/gpt-4o-mini-tts, so anyone with OPENAI_API_KEY set can try it immediately.
Implementation sketch
- Mirrors the OpenAI TTS plugin's shape: a tts.TTS subclass with { streaming: false } and a ChunkedStream; stream() throws, as it does in openai.TTS.
- Requests raw PCM from speech-sdk and resamples with @livekit/rtc-node's AudioResampler when a provider's native rate differs from the configured frame rate (default 24 kHz).
- speech-sdk's internal retry is disabled (maxRetries: 0) so the framework's ChunkedStream retry policy owns retries; speech-sdk errors are mapped to APIError / APIStatusError with sensible retryable flags.
- speech-sdk is ESM-only, so the plugin loads it via dynamic import(), which tsup preserves in the CJS build (same pattern as the @huggingface/transformers import in the livekit plugin).
- Optionally, setting SPEECHBASE_API_KEY routes the same provider/model strings through speechbase.ai, the hosted gateway we run, so one key covers every provider; without it, calls go directly to the provider. Direct is the default.
- Dependency footprint: @speech-sdk/core (Apache 2.0) plus its four runtime deps (mediabunny, an mp3 encoder for it, p-retry, zod).
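As a rough illustration of the error-mapping bullet, the sketch below shows one plausible way to derive a retryable flag before constructing the framework error. Everything here is an assumption for discussion: the MappedError shape and the mapProviderError helper are hypothetical, the sketch assumes the thrown error carries a numeric status field, and the exact status policy (408, 429 and 5xx retryable, other 4xx not) may differ from what the PR does.

```typescript
// Hypothetical intermediate shape; the real plugin would construct
// APIStatusError / APIError from the framework instead.
interface MappedError {
  kind: 'status' | 'connection';
  statusCode?: number;
  retryable: boolean;
  message: string;
}

// Assumed policy: timeouts, rate limits and server errors are worth retrying;
// other client errors (bad key, bad request) are not.
function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || status >= 500;
}

// Assumes provider errors expose a numeric `status`; anything without one
// is treated as a connection-level failure and left retryable.
function mapProviderError(err: unknown): MappedError {
  const message = err instanceof Error ? err.message : String(err);
  const status =
    typeof err === 'object' && err !== null && 'status' in err
      ? (err as { status: unknown }).status
      : undefined;
  if (typeof status === 'number') {
    return {
      kind: 'status',
      statusCode: status,
      retryable: isRetryableStatus(status),
      message,
    };
  }
  return { kind: 'connection', retryable: true, message };
}
```

With speech-sdk's own retry disabled, a flag like this is the only retry signal, so the framework's ChunkedStream policy can decide alone whether to try again.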
I've opened #1754 alongside this issue so the diff is concrete; happy to close either if this isn't a fit.
I'll maintain this integration and take responsibility for breakage in it. If this isn't a direction you want, totally fine, close both with no hard feelings.