
Add Speechmatics as alternative speech provider#78

Closed
claudio-pi wants to merge 2 commits into main from feature/speechmatics-provider

Conversation

@claudio-pi
Collaborator

Summary

  • Adds Speechmatics as a configurable alternative to ElevenLabs for TTS and STT
  • New SPEECH_PROVIDER config option in service.env (default: elevenlabs, alternative: speechmatics)
  • Speechmatics TTS uses the preview endpoint (preview.tts.speechmatics.com) with WAV output
  • Speechmatics STT uses the batch API (asr.api.speechmatics.com/v2) with async job submission, polling, and plain-text transcript retrieval
  • Provider dispatch in handlers.py selects the appropriate TTS/STT functions based on config

New config variables (in service.env)

| Variable | Default | Description |
| --- | --- | --- |
| `SPEECH_PROVIDER` | `elevenlabs` | Speech provider: `elevenlabs` or `speechmatics` |
| `SPEECHMATICS_API_KEY` | (empty) | Speechmatics API key |
| `SPEECHMATICS_VOICE_ID` | `sarah` | Voice: `sarah`, `theo`, `megan`, `jack` |
| `SPEECHMATICS_STT_REGION` | `eu1` | STT API region: `eu1`, `us1`, `au1` |
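For example, switching a deployment to Speechmatics would look something like this in service.env (the API key value is a placeholder):

```
# service.env
SPEECH_PROVIDER=speechmatics
SPEECHMATICS_API_KEY=<your-api-key>
SPEECHMATICS_VOICE_ID=sarah
SPEECHMATICS_STT_REGION=eu1
```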

Files changed

  • New: lib/speechmatics.py — TTS and STT module (stdlib only)
  • New: tests/test_speechmatics.py — 33 tests
  • Modified: lib/config.py — Added Speechmatics config fields
  • Modified: lib/handlers.py — Provider dispatch functions
  • Modified: tests/test_handlers.py — Updated mocks for dispatch layer
  • Modified: CLAUDE.md — Documentation updates

Test plan

  • All 673 tests pass (640 existing + 33 new)
  • Speechmatics TTS API tested with real API key — returns valid WAV audio
  • Speechmatics STT batch API tested end-to-end — submit job, poll, get transcript
  • Verify ElevenLabs path still works unchanged (default config)
  • Test switching SPEECH_PROVIDER=speechmatics in production

🤖 Generated with Claude Code

Adds configurable speech provider selection via SPEECH_PROVIDER env var
(default: "elevenlabs"). When set to "speechmatics", voice transcription
uses the batch STT API and TTS uses the streaming WAV endpoint.

New files:
- lib/speechmatics.py: TTS (WAV output) and STT (async batch job API)
- tests/test_speechmatics.py: 33 tests covering TTS, STT, polling, validation

Modified:
- lib/config.py: SPEECH_PROVIDER, SPEECHMATICS_API_KEY, SPEECHMATICS_VOICE_ID,
  SPEECHMATICS_STT_REGION in BotConfig and ClaudioConfig
- lib/handlers.py: provider dispatch via _stt_transcribe(), _tts_convert(),
  _get_speech_api_key() — selects provider based on config
- tests/test_handlers.py: updated mocks for new dispatch functions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @claudio-pi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request enhances the system's voice processing by integrating Speechmatics as a configurable alternative to ElevenLabs for both text-to-speech and speech-to-text, letting users choose a speech provider to suit their preferences or requirements. The changes involve core logic updates, new API integrations, and configuration management to ensure seamless operation.

Highlights

  • Speechmatics Integration: Added Speechmatics as an alternative provider for both Text-to-Speech (TTS) and Speech-to-Text (STT) functionalities.
  • Configurable Speech Provider: Introduced a new SPEECH_PROVIDER configuration option, allowing users to switch between elevenlabs (default) and speechmatics.
  • Speechmatics TTS Implementation: Implemented Speechmatics TTS using its preview endpoint, generating WAV audio output, and including markdown stripping and text truncation.
  • Speechmatics STT Implementation: Implemented Speechmatics STT using its asynchronous batch API, which involves submitting transcription jobs, polling for completion, and retrieving plain-text transcripts.
  • Centralized Speech Dispatch: Refactored lib/handlers.py to include dispatch functions (_stt_transcribe, _tts_convert) that dynamically select the appropriate speech provider based on the configuration.
  • New Configuration Variables: Added SPEECHMATICS_API_KEY, SPEECHMATICS_VOICE_ID, and SPEECHMATICS_STT_REGION to the system's configuration.
  • Comprehensive Testing: Included a new dedicated test file (tests/test_speechmatics.py) with 33 new tests for Speechmatics functionalities and updated existing handler tests to reflect the new dispatch layer.
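The submit/poll/fetch STT workflow highlighted above can be sketched as a small polling loop. This is an illustration only: the status fetcher is injected so the loop can run offline, and the status strings and timing defaults are assumptions, not taken from the PR.

```python
import time

def wait_for_job(fetch_status, deadline_s=60.0, poll_interval_s=2.0):
    """Poll a batch STT job until it finishes or the deadline passes.

    fetch_status() stands in for a GET against the batch API
    (asr.api.speechmatics.com/v2) that returns the job's status string.
    """
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status == "done":
            return True              # transcript can now be fetched
        if status == "rejected":
            return False             # terminal failure, stop polling
        time.sleep(poll_interval_s)  # still running: wait and retry
    return False                     # deadline exceeded
```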
Changelog
  • CLAUDE.md
    • Updated the description of lib/handlers.py to reflect the new speech provider dispatch logic.
    • Added lib/speechmatics.py to the list of core modules.
    • Adjusted the total test count from 640 to 673 and included speechmatics in the list of covered test modules.
  • lib/config.py
    • Added speech_provider to the BotConfig class attributes.
    • Included speechmatics_api_key, speechmatics_voice_id, and speechmatics_stt_region in BotConfig attributes and their initialization.
    • Integrated new Speechmatics-related environment variables (SPEECH_PROVIDER, SPEECHMATICS_API_KEY, SPEECHMATICS_VOICE_ID, SPEECHMATICS_STT_REGION) into ClaudioConfig's managed keys and default service environment.
  • lib/handlers.py
    • Renamed imports for ElevenLabs TTS/STT functions and added imports for Speechmatics TTS/STT functions, aliasing them to prevent naming conflicts.
    • Introduced _get_speech_api_key, _stt_transcribe, and _tts_convert functions to dispatch speech processing requests to the configured provider.
    • Modified _process_message to use the new speech dispatch functions for voice transcription and API key validation.
    • Updated _deliver_voice_response to utilize the new _tts_convert dispatch function and dynamically set the temporary audio file extension based on the selected speech provider.
  • lib/speechmatics.py
    • Added a new module implementing Speechmatics Text-to-Speech (tts_convert) and Speech-to-Text (stt_transcribe) functionalities.
    • Implemented tts_convert to interact with the Speechmatics preview TTS endpoint, handling text processing, API requests, and WAV output validation.
    • Implemented stt_transcribe to manage the Speechmatics batch STT API workflow, including job submission, status polling, and transcript retrieval.
    • Included helper functions for WAV magic byte validation, job submission, job status polling, transcript fetching, and safe file deletion.
  • tests/test_handlers.py
    • Updated mock paths for stt_transcribe and tts_convert to target the new dispatch functions (_stt_transcribe, _tts_convert) within lib.handlers.
  • tests/test_speechmatics.py
    • Added a new test file containing extensive unit tests for the lib/speechmatics.py module.
    • Included tests for tts_convert covering success, API errors, URL errors, text truncation, invalid inputs, and non-audio responses.
    • Provided tests for stt_transcribe covering success, submission errors, file size validation, missing API keys, invalid regions, empty transcriptions, polling logic, and multipart request verification.
    • Added tests for _validate_wav_magic to ensure correct WAV file identification.
    • Included tests for _wait_for_job and _get_transcript to verify job polling and transcript retrieval mechanisms.
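The WAV magic byte validation mentioned in the changelog amounts to checking the RIFF container header. A sketch (the real `_validate_wav_magic` in lib/speechmatics.py may differ in detail):

```python
def validate_wav_magic(data: bytes) -> bool:
    """True if data starts with a RIFF/WAVE header.

    A WAV file begins with b'RIFF', a 4-byte little-endian chunk size,
    then b'WAVE' at offset 8.
    """
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"
```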


@cubic-dev-ai cubic-dev-ai bot left a comment


3 issues found across 6 files

Confidence score: 3/5

  • Telegram voice replies may fail because send_voice doesn’t accept WAV files, so Speechmatics responses may not reach Telegram users until a supported format is used.
  • Unbounded TTS response reads in lib/speechmatics.py could allow excessive memory use on malformed/large responses, which is a moderate stability risk.
  • Score reflects a couple of medium‑severity, user‑impacting issues but no critical blockers reported.
  • Pay close attention to lib/handlers.py and lib/speechmatics.py - media format compatibility and response/file handling safeguards.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="lib/speechmatics.py">

<violation number="1" location="lib/speechmatics.py:102">
P2: Cap the TTS response read to a maximum size so a malformed/large response can’t consume unbounded memory.

(Based on your team's feedback about capping HTTP response reads for downloaded media.) [FEEDBACK_USED]</violation>

<violation number="2" location="lib/speechmatics.py:123">
P2: Create the TTS output file with restrictive permissions (0o600) to avoid leaking voice content to other local users.

(Based on your team's feedback about creating downloaded media files with restrictive permissions.) [FEEDBACK_USED]</violation>
</file>

<file name="lib/handlers.py">

<violation number="1" location="lib/handlers.py:831">
P2: Telegram `send_voice` does not accept WAV files, so Speechmatics voice replies will fail for Telegram. Use a supported format (e.g., request MP3/OGG from Speechmatics or transcode before calling send_voice).</violation>
</file>
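Both lib/speechmatics.py findings above can be addressed with a pattern along these lines. The cap value and function names are illustrative, not from the PR:

```python
import os

MAX_TTS_BYTES = 10 * 1024 * 1024  # illustrative cap, not from the PR

def read_capped(resp, limit=MAX_TTS_BYTES):
    """Read at most limit bytes from a response; reject anything larger."""
    data = resp.read(limit + 1)
    if len(data) > limit:
        raise ValueError("TTS response exceeds size cap")
    return data

def write_private(path, data):
    """Create the output file with mode 0o600 so only the owner can read it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
```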

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


```python
# Write output file
try:
    with open(output_path, 'wb') as f:
```


P2: Create the TTS output file with restrictive permissions (0o600) to avoid leaking voice content to other local users.

(Based on your team's feedback about creating downloaded media files with restrictive permissions.)


Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/speechmatics.py, line 123:

<comment>Create the TTS output file with restrictive permissions (0o600) to avoid leaking voice content to other local users.

(Based on your team's feedback about creating downloaded media files with restrictive permissions.) </comment>

<file context>
@@ -0,0 +1,324 @@
+
+    # Write output file
+    try:
+        with open(output_path, 'wb') as f:
+            f.write(data)
+    except OSError as e:
</file context>


```python
try:
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = resp.read()
```


P2: Cap the TTS response read to a maximum size so a malformed/large response can’t consume unbounded memory.

(Based on your team's feedback about capping HTTP response reads for downloaded media.)


Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/speechmatics.py, line 102:

<comment>Cap the TTS response read to a maximum size so a malformed/large response can’t consume unbounded memory.

(Based on your team's feedback about capping HTTP response reads for downloaded media.) </comment>

<file context>
@@ -0,0 +1,324 @@
+
+    try:
+        with urllib.request.urlopen(req, timeout=120) as resp:
+            data = resp.read()
+    except urllib.error.HTTPError as e:
+        error_detail = f"HTTP {e.code}"
</file context>

```python
def _deliver_voice_response(response, config, client, msg, platform,
                            tmp_dir, tmp_files, bot_id):
    """Convert response to voice/audio and send, falling back to text."""
    tts_ext = '.wav' if config.speech_provider == 'speechmatics' else '.mp3'
```


P2: Telegram send_voice does not accept WAV files, so Speechmatics voice replies will fail for Telegram. Use a supported format (e.g., request MP3/OGG from Speechmatics or transcode before calling send_voice).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/handlers.py, line 831:

<comment>Telegram `send_voice` does not accept WAV files, so Speechmatics voice replies will fail for Telegram. Use a supported format (e.g., request MP3/OGG from Speechmatics or transcode before calling send_voice).</comment>

<file context>
@@ -787,15 +828,15 @@ def _typing_loop():
 def _deliver_voice_response(response, config, client, msg, platform,
                             tmp_dir, tmp_files, bot_id):
     """Convert response to voice/audio and send, falling back to text."""
+    tts_ext = '.wav' if config.speech_provider == 'speechmatics' else '.mp3'
     fd, tts_file = tempfile.mkstemp(
-        prefix='claudio-tts-', suffix='.mp3', dir=tmp_dir,
</file context>
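If WAV output is kept on the Speechmatics side, one option consistent with this comment is transcoding before sending: Telegram's sendVoice accepts OGG/Opus (and MP3/M4A). A sketch with ffmpeg, with illustrative file names:

```
ffmpeg -i claudio-tts.wav -c:a libopus -b:a 32k claudio-tts.ogg
```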


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds Speechmatics as an alternative speech provider, which is a great enhancement, and the implementation is generally clean and well structured. However, a high-severity prompt injection vulnerability was found in lib/handlers.py: the output of the new speech-to-text functionality is not sanitized before being used in a language model prompt. Two further improvements were identified in lib/speechmatics.py: the speech-to-text language is currently hardcoded to English and could be made configurable, and the polling logic for transcription jobs could be made more resilient to transient network errors, in line with best practices for handling non-fatal exceptions.

```diff
-        config.elevenlabs_api_key,
-        model=config.elevenlabs_stt_model,
-    )
+    transcription = _stt_transcribe(voice_file, config)
```


Severity: high (security)

The introduction of the new speech-to-text provider creates a prompt injection vector. The transcription variable, which holds the output from the _stt_transcribe function, should be treated as untrusted input. This variable is directly concatenated into the prompt text on line 662 and passed to the language model in run_claude without sanitization. A malicious user could craft an audio file that transcribes into malicious instructions, allowing them to manipulate the language model's behavior. This could lead to the model ignoring its system prompt, leaking sensitive data from the conversation history, or performing other unintended actions.

Suggested change:

```diff
-transcription = _stt_transcribe(voice_file, config)
+transcription = sanitize_for_prompt(_stt_transcribe(voice_file, config))
```

```python
    return True


def stt_transcribe(audio_path, api_key, region='eu1', language='en'):
```


Severity: high

The language parameter is hardcoded to 'en'. This limits the speech-to-text functionality to English only. The other speech provider, ElevenLabs, appears to support multilingual transcription, so this could be a feature regression for users switching to Speechmatics.

To better support multilingual users, this should be made configurable. I recommend the following changes:

  1. In lib/config.py, add a new configuration variable SPEECHMATICS_STT_LANGUAGE to ClaudioConfig and BotConfig (with a default of 'en' to maintain current behavior).
  2. In lib/handlers.py, update _stt_transcribe to pass this new config value to speechmatics_stt.
  3. Here in stt_transcribe, you could then remove the default value for the language parameter to ensure it's always explicitly provided by the caller.

This will make the Speechmatics integration more flexible and consistent with the multilingual capabilities of the project.

Comment on lines +275 to +281:

```python
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.loads(resp.read())
except (urllib.error.HTTPError, urllib.error.URLError, OSError,
        json.JSONDecodeError, ValueError) as e:
    log_error("stt", f"Error polling job {job_id}: {e}")
    return False
```


Severity: medium

The current error handling in the polling loop is not resilient to transient network errors. Any URLError or OSError will cause the entire transcription to fail immediately. It would be more robust to handle these potentially transient errors by logging them and continuing to poll until the deadline. Permanent errors like HTTPError (especially 4xx client errors) or JSONDecodeError should still cause a failure.

Suggested change (note that HTTPError subclasses URLError, so the HTTPError clause must come first or it would never be reached):

```diff
 try:
     with urllib.request.urlopen(req, timeout=30) as resp:
         data = json.loads(resp.read())
-except (urllib.error.HTTPError, urllib.error.URLError, OSError,
-        json.JSONDecodeError, ValueError) as e:
-    log_error("stt", f"Error polling job {job_id}: {e}")
-    return False
+except (urllib.error.HTTPError, json.JSONDecodeError, ValueError) as e:
+    log_error("stt", f"API/parsing error polling job {job_id}: {e}")
+    return False
+except (urllib.error.URLError, OSError) as e:
+    log("stt", f"Network error polling job {job_id}, will retry: {e}")
+    time.sleep(STT_POLL_INTERVAL)
+    continue
```
References
  1. In non-fatal contexts like hooks, catch broad exceptions but log them to stderr for debuggability instead of silently swallowing them.

@edgarjs edgarjs closed this Feb 12, 2026
@edgarjs edgarjs deleted the feature/speechmatics-provider branch February 12, 2026 16:09