feat(speechsdk): add speech-sdk multi-provider TTS plugin #1754
Open
btpod wants to merge 1 commit into
New @livekit/agents-plugin-speechsdk package: non-streaming TTS across 15 providers through one provider/model string, including providers without a dedicated plugin (Murf, Smallest.ai, fal.ai-hosted open-weight models). Synthesis requests raw PCM and resamples to the configured frame rate with AudioResampler when a provider's native rate differs. speech-sdk's internal retry is disabled so the framework's ChunkedStream retry policy owns retries.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
🦋 Changeset detected. Latest commit: 96ee459. The changes in this PR will be included in the next version bump. This PR includes changesets to release 36 packages.
Disclosure up front: I work on speech-sdk (Apache 2.0). The integration runs fully BYOK against provider APIs with your users' own keys; no account with us is needed.
Proposed in #1753 per the CONTRIBUTING "discuss first" guideline; opening the PR alongside so the diff is concrete. Happy to close either if this isn't a fit.
Summary
- `@livekit/agents-plugin-speechsdk` package: non-streaming TTS where the model is one `provider/model` string across 15 providers.
- `elevenlabs/eleven_flash_v2_5` to `cartesia/sonic-3`) with no new dependency per vendor; for production streaming, the dedicated provider plugins remain the better choice and the README says so.
- `openai/gpt-4o-mini-tts`, so anyone with `OPENAI_API_KEY` already set can use it immediately.
- `globalEnv` entry, and the changeset.

Implementation notes
- `tts.TTS` subclass with `{ streaming: false }` and a `ChunkedStream`; `stream()` throws, as in `openai.TTS` (AgentSession wraps non-streaming TTS in the sentence-level `StreamAdapter` automatically).
- `mediaType`, and `@livekit/rtc-node`'s `AudioResampler` resamples when a provider's native rate differs from the configured frame rate (default 24 kHz). This avoids per-provider sample-rate matrices: some providers are fixed at 48 kHz (Hume), some have no 24 kHz option (Resemble), and fal has no rate selection at all.
- speech-sdk's internal retry is disabled (`maxRetries: 0`) so the framework's `ChunkedStream` retry policy owns retries. speech-sdk errors map to `APIStatusError` (HTTP errors, retryable on 408/429/5xx) or non-retryable `APIError` (for example, a missing key produces "OpenAI API key is required. Pass it via apiKey option or set the OPENAI_API_KEY environment variable").
- `OPENAI_API_KEY`, `MURF_API_KEY`, ...) or an explicit `apiKey` option.
- `import()`, which tsup preserves in the CJS build (same pattern as the `@huggingface/transformers` import in the livekit plugin). Types come from static `import type`, which is erased.
- `SPEECHBASE_API_KEY` routes the same `provider/model` strings through speechbase.ai, the hosted gateway we run, so one key covers every provider; without it, calls go directly to the provider. Direct is the default.
- `fal-ai/kokoro/american-english`). Unknown provider prefixes are rejected at construction with the supported list.
- `@speech-sdk/core` plus four transitive runtime deps (mediabunny, its mp3 encoder, p-retry, zod, of which zod was already in the lockfile). The pnpm-lock.yaml diff is purely additive (81 lines); the existing Renovate-pinned vitest resolutions are deliberately left untouched.

Test plan
- `pnpm build` (all 37 turbo tasks), `pnpm lint`, `pnpm format:check`, `pnpm throws:check`, and `reuse lint` pass locally; `pnpm install --frozen-lockfile --ignore-scripts` (the CI install command) accepts the spliced lockfile. Re-verified on a clean checkout of current main before opening this PR.
- `APIError` with env-var guidance before any network call; an invalid gateway key produces a real HTTP 401 mapped to `APIStatusError` (statusCode 401, retryable false).
- `OPENAI_API_KEY=... npx vitest run plugins/speechsdk`, 4/4): streams text through `StreamAdapter`, synthesizes `openai/gpt-4o-mini-tts`, and validates the transcript with OpenAI STT. Note: the test constructs the STT as `new STT({ useRealtime: false, model: 'whisper-1' })`; bare `new STT()` (as the sibling plugin tests use) now defaults to the realtime model and throws without a VAD when run with keys.
- `sampleRate: 16000` the `AudioResampler` branch emits frames at 16000 Hz with matching duration.

I'll maintain this integration and take responsibility for breakage in it. If this isn't a direction you want, totally fine, close both with no hard feelings.
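For reviewers who want the two validation rules from the implementation notes in one place, here is a minimal standalone sketch. It is not the plugin's code: `SUPPORTED_PROVIDERS` is a hypothetical subset of the 15 providers, `parseModelId` and `isRetryableStatus` are illustrative names, and splitting at the first slash is an assumption about how a `provider/model` string is read.

```typescript
// Hypothetical subset of provider prefixes, for illustration only.
const SUPPORTED_PROVIDERS = ['openai', 'elevenlabs', 'cartesia', 'murf'] as const;
type Provider = (typeof SUPPORTED_PROVIDERS)[number];

interface ParsedModel {
  provider: Provider;
  model: string;
}

// Assumption: the provider is everything before the first slash,
// and the remainder (which may itself contain slashes) is the model id.
function parseModelId(id: string): ParsedModel {
  const slash = id.indexOf('/');
  const prefix = slash === -1 ? id : id.slice(0, slash);
  const model = slash === -1 ? '' : id.slice(slash + 1);
  const known = SUPPORTED_PROVIDERS.some((p) => p === prefix);
  if (!known || model === '') {
    // Reject early, listing what is supported, before any network call.
    throw new Error(
      `Unsupported model "${id}". Supported providers: ${SUPPORTED_PROVIDERS.join(', ')}`,
    );
  }
  return { provider: prefix as Provider, model };
}

// Retry classification as described above: 408, 429, and 5xx are retryable;
// every other status (for example 401) is not.
function isRetryableStatus(statusCode: number): boolean {
  return statusCode === 408 || statusCode === 429 || (statusCode >= 500 && statusCode < 600);
}
```

Under these assumptions, `parseModelId('cartesia/sonic-3')` yields provider `cartesia` and model `sonic-3`, and `isRetryableStatus(401)` is `false`, which matches the 401 case in the test plan.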
🤖 Generated with Claude Code