Proposal: speech-sdk TTS plugin (Murf, Smallest.ai, fal.ai-hosted open-weight models, one-string provider switching) #1753

@btpod

Disclosure up front: I work on speech-sdk (Apache 2.0). This is a proposal to add an optional TTS plugin built on it. The integration runs fully BYOK against provider APIs with your users' own keys; no account with us is needed.

What

A new @livekit/agents-plugin-speechsdk package: non-streaming TTS in which the model is selected with a single provider/model string, across 15 providers:

import * as speechsdk from '@livekit/agents-plugin-speechsdk';

const tts = new speechsdk.TTS({ model: 'openai/gpt-4o-mini-tts', voice: 'alloy' });
// or: { model: 'murf/FALCON', voice: 'en-US-amara' }
// or: { model: 'fal-ai/kokoro/american-english', voice: 'af_heart' }

Keys resolve from each provider's standard env var (OPENAI_API_KEY, MURF_API_KEY, ...), and calls go directly to that provider.
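To make the key resolution concrete, here is a minimal sketch of how a provider/model string could be split and matched to an env var. It is illustrative only and not speech-sdk's actual code: parseModelRef, resolveApiKey, and the ENV_VARS table are hypothetical names, splitting on the first slash is an assumption, and only the two env var names mentioned above are listed.

```typescript
// Illustrative sketch, not speech-sdk's implementation. All names are hypothetical.

interface ModelRef {
  provider: string;
  model: string;
}

// Assumption: the provider is everything before the first '/', and the
// remainder (which may itself contain slashes) is the model id.
function parseModelRef(ref: string): ModelRef {
  const slash = ref.indexOf('/');
  if (slash <= 0 || slash === ref.length - 1) {
    throw new Error(`expected "provider/model", got "${ref}"`);
  }
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}

// Only the env var names named in this proposal; other providers are omitted.
const ENV_VARS: Record<string, string> = {
  openai: 'OPENAI_API_KEY',
  murf: 'MURF_API_KEY',
};

// The environment is passed in explicitly so the function is easy to test.
function resolveApiKey(
  provider: string,
  env: Record<string, string | undefined>,
): string {
  const name = ENV_VARS[provider];
  if (name === undefined) {
    throw new Error(`no env var known for provider "${provider}"`);
  }
  const key = env[name];
  if (!key) {
    throw new Error(`${name} is not set`);
  }
  return key;
}
```

In a Node process this would be called as resolveApiKey(parseModelRef('murf/FALCON').provider, process.env). Under the first-slash assumption, 'fal-ai/kokoro/american-english' yields provider 'fal-ai' and model 'kokoro/american-english'.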

Why this might be useful despite the existing plugins

Being upfront: agents-js already has dedicated plugins for most major TTS vendors, and for production streaming those remain the better choice (this plugin is ChunkedStream-only; AgentSession wraps it in the sentence-level StreamAdapter). The cases this adds:

  1. Providers with no dedicated plugin today: Murf, Smallest.ai, and fal.ai-hosted open-weight models (Kokoro, Orpheus, F5), plus xAI TTS (the xai plugin currently covers Realtime/STT/LLM).
  2. Provider evaluation: swapping model: 'elevenlabs/eleven_flash_v2_5' for model: 'cartesia/sonic-3' is a one-string change with no new dependency, which makes A/B-ing voices across vendors cheap during development before committing to a dedicated streaming plugin.
  3. A default that works with a key most users already have: the plugin defaults to openai/gpt-4o-mini-tts, so anyone with OPENAI_API_KEY set can try it immediately.

Implementation sketch

  • Mirrors the OpenAI TTS plugin's shape: tts.TTS subclass with { streaming: false }, ChunkedStream, stream() throws like openai.TTS.
  • Requests raw PCM from speech-sdk and resamples with @livekit/rtc-node's AudioResampler when a provider's native rate differs from the configured frame rate (default 24 kHz).
  • speech-sdk's internal retry is disabled (maxRetries: 0) so the framework's ChunkedStream retry policy owns retries; speech-sdk errors are mapped to APIError / APIStatusError with sensible retryable flags.
  • speech-sdk is ESM-only, so the plugin loads it via dynamic import(), which tsup preserves in the CJS build (same pattern as the @huggingface/transformers import in the livekit plugin).
  • Optionally, setting SPEECHBASE_API_KEY routes the same provider/model strings through speechbase.ai, the hosted gateway we run, so one key covers every provider; without it, calls go directly to the provider. Direct is the default.
  • Dependency footprint: @speech-sdk/core (Apache 2.0) plus its four runtime deps (mediabunny, an mp3 encoder for it, p-retry, zod).
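
As a rough illustration of the error mapping in the third bullet, the sketch below classifies a thrown error as retryable or not. It uses local stand-in classes rather than the real APIError / APIStatusError from @livekit/agents, assumes the provider error carries a numeric status field (an assumption about speech-sdk's error shape), and the retry policy shown (408, 429, and 5xx retryable; other statuses not; errors without a status treated as transient) is one plausible choice, not necessarily what #1754 does.

```typescript
// Illustrative sketch with stand-in classes; not the @livekit/agents types.

class SketchAPIError extends Error {
  readonly retryable: boolean;
  constructor(message: string, retryable: boolean) {
    super(message);
    this.name = 'SketchAPIError';
    this.retryable = retryable;
  }
}

class SketchAPIStatusError extends SketchAPIError {
  readonly statusCode: number;
  constructor(message: string, statusCode: number, retryable: boolean) {
    super(message, retryable);
    this.name = 'SketchAPIStatusError';
    this.statusCode = statusCode;
  }
}

// Assumed policy: timeouts, rate limits, and server errors are worth retrying;
// every other status (bad request, auth failure, and so on) is not.
function isRetryableStatus(status: number): boolean {
  return status === 408 || status === 429 || status >= 500;
}

// Assumption: provider errors expose an HTTP status as a numeric `status` field.
function mapProviderError(err: unknown): SketchAPIError {
  const message = err instanceof Error ? err.message : String(err);
  const status = (err as { status?: unknown } | null)?.status;
  if (typeof status === 'number') {
    return new SketchAPIStatusError(message, status, isRetryableStatus(status));
  }
  // No status available: treat as a connection-level failure and allow a retry.
  return new SketchAPIError(message, true);
}
```

With the SDK's own retry disabled, a mapping like this is the only place that decides whether the framework's retry policy gets another attempt, which is why the flags matter.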

I've opened #1754 alongside this issue so the diff is concrete; happy to close either if this isn't a fit.

I'll maintain this integration and take responsibility for breakage in it. If this isn't a direction you want, that's totally fine: close both, no hard feelings.
