Proposal: speech-sdk TTS plugin (Murf, Smallest.ai, fal.ai-hosted open-weight models, one-string provider switching) #1753

Disclosure up front: I work on [speech-sdk](https://github.com/Jellypod-Inc/speech-sdk) (Apache 2.0). This is a proposal to add an optional TTS plugin built on it. The integration runs fully BYOK against provider APIs with your users' own keys; no account with us is needed.

## What

A new `@livekit/agents-plugin-speechsdk` package: non-streaming TTS where the model is selected with a single `provider/model` string, across 15 providers:

```ts
import * as speechsdk from '@livekit/agents-plugin-speechsdk';

const tts = new speechsdk.TTS({ model: 'openai/gpt-4o-mini-tts', voice: 'alloy' });
// or: { model: 'murf/FALCON', voice: 'en-US-amara' }
// or: { model: 'fal-ai/kokoro/american-english', voice: 'af_heart' }
```

Keys resolve from each provider's standard env var (`OPENAI_API_KEY`, `MURF_API_KEY`, ...), and calls go directly to that provider.
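To make the key-resolution rule concrete, here is a minimal sketch of how a `provider/model` string might be split and mapped to an env var. The helper names (`parseModelString`, `envVarFor`, `resolveKey`) are hypothetical and not speech-sdk's API, and the `<PROVIDER>_API_KEY` naming rule is an assumption generalized from the two examples above; individual providers may use a different variable name.

```ts
// Hypothetical helpers for illustration; not part of speech-sdk or the plugin.
interface ParsedModel {
  provider: string; // text before the first '/'
  model: string; // everything after it (may itself contain '/')
}

function parseModelString(id: string): ParsedModel {
  const slash = id.indexOf('/');
  // Reject strings with no '/', an empty provider, or an empty model.
  if (slash <= 0 || slash === id.length - 1) {
    throw new Error(`expected "provider/model", got "${id}"`);
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}

// Assumption: the env var is <PROVIDER>_API_KEY, upper-cased, with any
// non-alphanumeric character replaced by '_'. Real names may differ.
function envVarFor(provider: string): string {
  return `${provider.toUpperCase().replace(/[^A-Z0-9]/g, '_')}_API_KEY`;
}

function resolveKey(id: string, env: Record<string, string | undefined>): string {
  const { provider } = parseModelString(id);
  const name = envVarFor(provider);
  const key = env[name];
  if (!key) throw new Error(`missing ${name} for model "${id}"`);
  return key;
}
```

In a Node process this would be called as `resolveKey('murf/FALCON', process.env)`. Splitting on the first `/` only is what lets a multi-segment model such as `fal-ai/kokoro/american-english` keep `kokoro/american-english` as its model part.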

## Why this might be useful despite the existing plugins

Being upfront: agents-js already has dedicated plugins for most major TTS vendors, and for production streaming those remain the better choice (this plugin is `ChunkedStream`-only; `AgentSession` wraps it in the sentence-level `StreamAdapter`). The cases this adds:

1. **Providers with no dedicated plugin today**: Murf, Smallest.ai, and fal.ai-hosted open-weight models (Kokoro, Orpheus, F5), plus xAI TTS (the xai plugin currently covers Realtime/STT/LLM).
2. **Provider evaluation**: swapping `model: 'elevenlabs/eleven_flash_v2_5'` for `model: 'cartesia/sonic-3'` is a one-string change with no new dependency, which makes A/B-ing voices across vendors cheap during development before committing to a dedicated streaming plugin.
3. **Default that works with keys most users already have**: defaults to `openai/gpt-4o-mini-tts`, so anyone with `OPENAI_API_KEY` set can try it immediately.

## Implementation sketch

- Mirrors the OpenAI TTS plugin's shape: a `tts.TTS` subclass with `{ streaming: false }` that returns a `ChunkedStream`; `stream()` throws, as it does in `openai.TTS`.
- Requests raw PCM from speech-sdk and resamples with `@livekit/rtc-node`'s `AudioResampler` when a provider's native rate differs from the configured frame rate (default 24 kHz).
- speech-sdk's internal retry is disabled (`maxRetries: 0`) so the framework's `ChunkedStream` retry policy owns retries; speech-sdk errors are mapped to `APIError` / `APIStatusError` with sensible `retryable` flags.
- speech-sdk is ESM-only, so the plugin loads it via dynamic `import()`, which tsup preserves in the CJS build (same pattern as the `@huggingface/transformers` import in the livekit plugin).
- Optionally, setting `SPEECHBASE_API_KEY` routes the same `provider/model` strings through speechbase.ai, the hosted gateway we run, so one key covers every provider; without it, calls go directly to the provider. Direct is the default.
- Dependency footprint: `@speech-sdk/core` (Apache 2.0) plus its four runtime deps (mediabunny, an mp3 encoder for it, p-retry, zod).
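As a rough illustration of the error-mapping bullet above, the sketch below shows one way `retryable` could be decided. It is an assumption, not the code in the PR: `classifyError` and `MappedError` are hypothetical names, the choice of transient statuses (408, 429, 5xx) is a common convention rather than something this proposal specifies, and the framework's real `APIError` / `APIStatusError` classes are deliberately not imported.

```ts
// Hypothetical classification for illustration only. The real plugin maps
// onto the framework's APIError / APIStatusError, which are not used here.
interface MappedError {
  kind: 'status' | 'connection';
  statusCode?: number;
  retryable: boolean;
  message: string;
}

function classifyError(err: { message: string; status?: number }): MappedError {
  if (err.status === undefined) {
    // No HTTP status: assume a network-level failure that is worth retrying.
    return { kind: 'connection', retryable: true, message: err.message };
  }
  // Assumption: 408, 429 and 5xx are transient; other 4xx are caller errors
  // (bad key, unknown voice) that a retry cannot fix.
  const retryable = err.status === 408 || err.status === 429 || err.status >= 500;
  return { kind: 'status', statusCode: err.status, retryable, message: err.message };
}
```

With speech-sdk's own retry disabled (`maxRetries: 0`), a flag like this would be the only signal the framework's retry policy sees, so a request is never retried at two layers at once.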

I've opened #1754 alongside this issue so the diff is concrete; happy to close either if this isn't a fit.

I'll maintain this integration and take responsibility for breakage in it. If this isn't a direction you want, that's totally fine: close both, no hard feelings.

