Add Speechmatics as alternative speech provider #78
Adds configurable speech provider selection via SPEECH_PROVIDER env var (default: "elevenlabs"). When set to "speechmatics", voice transcription uses the batch STT API and TTS uses the streaming WAV endpoint.

New files:
- lib/speechmatics.py: TTS (WAV output) and STT (async batch job API)
- tests/test_speechmatics.py: 33 tests covering TTS, STT, polling, validation

Modified:
- lib/config.py: SPEECH_PROVIDER, SPEECHMATICS_API_KEY, SPEECHMATICS_VOICE_ID, SPEECHMATICS_STT_REGION in BotConfig and ClaudioConfig
- lib/handlers.py: provider dispatch via _stt_transcribe(), _tts_convert(), _get_speech_api_key(), which select the provider based on config
- tests/test_handlers.py: updated mocks for new dispatch functions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes (Gemini Code Assist)

This pull request integrates Speechmatics as a new, configurable alternative to ElevenLabs for both text-to-speech and speech-to-text, so users can choose a speech provider based on their preferences or requirements. The changes involve core logic updates, new API integrations, and configuration management for the new provider.
3 issues found across 6 files
Confidence score: 3/5
- Telegram voice replies may fail because `send_voice` doesn't accept WAV files, so Speechmatics responses may not reach Telegram users until a supported format is used.
- Unbounded TTS response reads in `lib/speechmatics.py` could allow excessive memory use on malformed/large responses, which is a moderate stability risk.
- Score reflects a couple of medium-severity, user-impacting issues but no critical blockers reported.
- Pay close attention to `lib/handlers.py` and `lib/speechmatics.py`: media format compatibility and response/file handling safeguards.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="lib/speechmatics.py">
<violation number="1" location="lib/speechmatics.py:102">
P2: Cap the TTS response read to a maximum size so a malformed/large response can’t consume unbounded memory.
(Based on your team's feedback about capping HTTP response reads for downloaded media.) [FEEDBACK_USED]</violation>
<violation number="2" location="lib/speechmatics.py:123">
P2: Create the TTS output file with restrictive permissions (0o600) to avoid leaking voice content to other local users.
(Based on your team's feedback about creating downloaded media files with restrictive permissions.) [FEEDBACK_USED]</violation>
</file>
<file name="lib/handlers.py">
<violation number="1" location="lib/handlers.py:831">
P2: Telegram `send_voice` does not accept WAV files, so Speechmatics voice replies will fail for Telegram. Use a supported format (e.g., request MP3/OGG from Speechmatics or transcode before calling send_voice).</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
    # Write output file
    try:
        with open(output_path, 'wb') as f:
P2: Create the TTS output file with restrictive permissions (0o600) to avoid leaking voice content to other local users.
(Based on your team's feedback about creating downloaded media files with restrictive permissions.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/speechmatics.py, line 123:
<comment>Create the TTS output file with restrictive permissions (0o600) to avoid leaking voice content to other local users.
(Based on your team's feedback about creating downloaded media files with restrictive permissions.) </comment>
<file context>
@@ -0,0 +1,324 @@
+
+    # Write output file
+    try:
+        with open(output_path, 'wb') as f:
+            f.write(data)
+    except OSError as e:
</file context>
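One way to address the permissions comment is to create the file through `os.open` with an explicit mode. The following is a minimal sketch; `write_private_file` is a hypothetical helper name, not a function from the PR.

```python
import os


def write_private_file(output_path, data):
    """Write bytes to output_path, creating the file with mode 0o600."""
    # The mode is applied only when the file is newly created, and the
    # process umask can further restrict (never widen) it.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    fd = os.open(output_path, flags, 0o600)
    # os.fdopen takes ownership of the descriptor and closes it on exit
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
```

If the path already exists, its current mode is kept. A file created earlier with `tempfile.mkstemp` is already readable and writable only by the creating user, so in that case the plain `open()` call would not widen access.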
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            data = resp.read()
P2: Cap the TTS response read to a maximum size so a malformed/large response can’t consume unbounded memory.
(Based on your team's feedback about capping HTTP response reads for downloaded media.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/speechmatics.py, line 102:
<comment>Cap the TTS response read to a maximum size so a malformed/large response can’t consume unbounded memory.
(Based on your team's feedback about capping HTTP response reads for downloaded media.) </comment>
<file context>
@@ -0,0 +1,324 @@
+
+    try:
+        with urllib.request.urlopen(req, timeout=120) as resp:
+            data = resp.read()
+    except urllib.error.HTTPError as e:
+        error_detail = f"HTTP {e.code}"
</file context>
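The cap requested in this comment could look roughly like the sketch below. `read_capped` and `MAX_TTS_BYTES` are hypothetical names, and the 10 MiB limit is an assumed value rather than one taken from the PR.

```python
MAX_TTS_BYTES = 10 * 1024 * 1024  # assumed cap; tune to the expected audio size


def read_capped(resp, max_bytes=MAX_TTS_BYTES, chunk_size=64 * 1024):
    """Read a file-like response in chunks, raising ValueError past max_bytes."""
    chunks = []
    total = 0
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            break  # end of body
        total += len(chunk)
        if total > max_bytes:
            # Stop before buffering more than the cap allows
            raise ValueError(f"response exceeds {max_bytes} bytes")
        chunks.append(chunk)
    return b''.join(chunks)
```

Inside the `with urllib.request.urlopen(...) as resp:` block, `data = read_capped(resp)` would then replace the unbounded `resp.read()`.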
def _deliver_voice_response(response, config, client, msg, platform,
                            tmp_dir, tmp_files, bot_id):
    """Convert response to voice/audio and send, falling back to text."""
    tts_ext = '.wav' if config.speech_provider == 'speechmatics' else '.mp3'
P2: Telegram send_voice does not accept WAV files, so Speechmatics voice replies will fail for Telegram. Use a supported format (e.g., request MP3/OGG from Speechmatics or transcode before calling send_voice).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At lib/handlers.py, line 831:
<comment>Telegram `send_voice` does not accept WAV files, so Speechmatics voice replies will fail for Telegram. Use a supported format (e.g., request MP3/OGG from Speechmatics or transcode before calling send_voice).</comment>
<file context>
@@ -787,15 +828,15 @@ def _typing_loop():
 def _deliver_voice_response(response, config, client, msg, platform,
                             tmp_dir, tmp_files, bot_id):
     """Convert response to voice/audio and send, falling back to text."""
+    tts_ext = '.wav' if config.speech_provider == 'speechmatics' else '.mp3'
     fd, tts_file = tempfile.mkstemp(
-        prefix='claudio-tts-', suffix='.mp3', dir=tmp_dir,
</file context>
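If transcoding is the chosen fix, a sketch along the following lines could convert the WAV output to OGG/Opus, a format the Telegram Bot API documents for `sendVoice`. It assumes an `ffmpeg` binary built with `libopus` is on the PATH; the function names and the 32k bitrate are illustrative choices, not part of the PR.

```python
import subprocess


def build_opus_cmd(wav_path, ogg_path):
    """Return an ffmpeg argv that transcodes a WAV file to OGG/Opus."""
    return [
        'ffmpeg', '-y',        # overwrite the output file if it exists
        '-i', wav_path,
        '-c:a', 'libopus',     # encode audio with the Opus codec
        '-b:a', '32k',         # assumed bitrate for speech
        ogg_path,
    ]


def transcode_to_ogg(wav_path, ogg_path, timeout=60):
    """Run ffmpeg; return True on success, False on any failure."""
    try:
        subprocess.run(build_opus_cmd(wav_path, ogg_path), check=True,
                       capture_output=True, timeout=timeout)
    except (OSError, subprocess.SubprocessError):
        # Covers a missing ffmpeg binary, a non-zero exit, and a timeout
        return False
    return True
```

Returning `False` rather than raising would let the caller fall back to a text reply, in line with the "falling back to text" behaviour described in the docstring of `_deliver_voice_response`.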
Code Review
This pull request adds Speechmatics as an alternative speech provider, which is a great enhancement. The implementation is generally clean and well-structured. However, a high-severity prompt injection vulnerability was found in lib/handlers.py where the output of the new speech-to-text functionality is not sanitized before being used in a language model prompt. Additionally, two suggestions for improvement were identified in lib/speechmatics.py: the speech-to-text language is currently hardcoded to English and could be made configurable for better flexibility, and the polling logic for transcription jobs could be made more resilient to transient network errors, aligning with best practices for handling non-fatal exceptions.
        config.elevenlabs_api_key,
        model=config.elevenlabs_stt_model,
    )
    transcription = _stt_transcribe(voice_file, config)
The introduction of the new speech-to-text provider creates a prompt injection vector. The transcription variable, which holds the output from the _stt_transcribe function, should be treated as untrusted input. This variable is directly concatenated into the prompt text on line 662 and passed to the language model in run_claude without sanitization. A malicious user could craft an audio file that transcribes into malicious instructions, allowing them to manipulate the language model's behavior. This could lead to the model ignoring its system prompt, leaking sensitive data from the conversation history, or performing other unintended actions.
Suggested change:
- transcription = _stt_transcribe(voice_file, config)
+ transcription = sanitize_for_prompt(_stt_transcribe(voice_file, config))
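The review does not show what `sanitize_for_prompt` does, so the sketch below is only a hypothetical stand-in illustrating the kind of filtering such a helper might apply. The name `sanitize_transcript` and the length limit are assumptions, and filtering of this kind reduces the risk of prompt injection without eliminating it.

```python
import unicodedata

MAX_TRANSCRIPT_CHARS = 4000  # assumed limit, not taken from the PR


def sanitize_transcript(text, max_chars=MAX_TRANSCRIPT_CHARS):
    """Strip control characters, escape angle brackets, and truncate."""
    # Drop control/format characters (Unicode categories C*), keeping
    # newlines and tabs
    cleaned = ''.join(
        ch for ch in text
        if ch in '\n\t' or not unicodedata.category(ch).startswith('C')
    )
    # Escape angle brackets so the text cannot open or close tag-like markers
    cleaned = cleaned.replace('<', '&lt;').replace('>', '&gt;')
    return cleaned[:max_chars].strip()
```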
    return True


def stt_transcribe(audio_path, api_key, region='eu1', language='en'):
The language parameter is hardcoded to 'en'. This limits the speech-to-text functionality to English only. The other speech provider, ElevenLabs, appears to support multilingual transcription, so this could be a feature regression for users switching to Speechmatics.
To better support multilingual users, this should be made configurable. I recommend the following changes:
- In `lib/config.py`, add a new configuration variable `SPEECHMATICS_STT_LANGUAGE` to `ClaudioConfig` and `BotConfig` (with a default of `'en'` to maintain current behavior).
- In `lib/handlers.py`, update `_stt_transcribe` to pass this new config value to `speechmatics_stt`.
- Here in `stt_transcribe`, you could then remove the default value for the `language` parameter to ensure it's always explicitly provided by the caller.
This will make the Speechmatics integration more flexible and consistent with the multilingual capabilities of the project.
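A minimal sketch of the first step, assuming the value is read from the environment like the other `service.env` settings. `get_stt_language` is a hypothetical helper, and lowercasing the value is an assumption about how language codes should be normalized.

```python
import os


def get_stt_language(env=None):
    """Return the configured STT language code, defaulting to 'en'."""
    env = os.environ if env is None else env
    value = env.get('SPEECHMATICS_STT_LANGUAGE', '').strip().lower()
    # An unset or blank variable keeps the current English-only behaviour
    return value or 'en'
```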
try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.loads(resp.read())
except (urllib.error.HTTPError, urllib.error.URLError, OSError,
        json.JSONDecodeError, ValueError) as e:
    log_error("stt", f"Error polling job {job_id}: {e}")
    return False
The current error handling in the polling loop is not resilient to transient network errors. Any URLError or OSError will cause the entire transcription to fail immediately. It would be more robust to handle these potentially transient errors by logging them and continuing to poll until the deadline. Permanent errors like HTTPError (especially 4xx client errors) or JSONDecodeError should still cause a failure.
Suggested change:
- try:
-     with urllib.request.urlopen(req, timeout=30) as resp:
-         data = json.loads(resp.read())
- except (urllib.error.HTTPError, urllib.error.URLError, OSError,
-         json.JSONDecodeError, ValueError) as e:
-     log_error("stt", f"Error polling job {job_id}: {e}")
-     return False
+ try:
+     with urllib.request.urlopen(req, timeout=30) as resp:
+         data = json.loads(resp.read())
+ # HTTPError subclasses URLError, so it must be handled before the retry clause
+ except (urllib.error.HTTPError, json.JSONDecodeError, ValueError) as e:
+     log_error("stt", f"API/parsing error polling job {job_id}: {e}")
+     return False
+ except (urllib.error.URLError, OSError) as e:
+     log("stt", f"Network error polling job {job_id}, will retry: {e}")
+     time.sleep(STT_POLL_INTERVAL)
+     continue
References
- In non-fatal contexts like hooks, catch broad exceptions but log them to stderr for debuggability instead of silently swallowing them.
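The retry behaviour described in this comment can be sketched as a self-contained loop. `poll_until_done`, the injected `fetch_status` callable, and the status strings `'done'` and `'rejected'` are assumptions for illustration. Because `urllib.error.HTTPError` is a subclass of `URLError`, it has to be caught first, or it would be retried along with transient network errors.

```python
import time
import urllib.error


def poll_until_done(fetch_status, timeout=300, interval=2,
                    sleep=time.sleep, now=time.monotonic):
    """Poll fetch_status() until it reports 'done'; return True on success."""
    deadline = now() + timeout
    while now() < deadline:
        try:
            status = fetch_status()
        except urllib.error.HTTPError:
            # Treated as permanent; must precede the URLError clause
            return False
        except (urllib.error.URLError, OSError):
            # Possibly transient: wait, then retry until the deadline
            sleep(interval)
            continue
        except ValueError:
            # Includes json.JSONDecodeError, a ValueError subclass
            return False
        if status == 'done':
            return True
        if status == 'rejected':
            return False
        sleep(interval)
    return False
```

Passing `sleep` and `now` as parameters keeps the loop testable without real delays.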
Summary
- `SPEECH_PROVIDER` config option in `service.env` (default: `elevenlabs`, alternative: `speechmatics`)
- TTS (`preview.tts.speechmatics.com`) with WAV output
- STT (`asr.api.speechmatics.com/v2`) with async job submission, polling, and plain-text transcript retrieval
- `handlers.py` selects the appropriate TTS/STT functions based on config

New config variables (in `service.env`)

| Variable | Default | Values |
|---|---|---|
| `SPEECH_PROVIDER` | `elevenlabs` | `elevenlabs` or `speechmatics` |
| `SPEECHMATICS_API_KEY` | | |
| `SPEECHMATICS_VOICE_ID` | `sarah` | `sarah`, `theo`, `megan`, `jack` |
| `SPEECHMATICS_STT_REGION` | `eu1` | `eu1`, `us1`, `au1` |

Files changed
- `lib/speechmatics.py`: TTS and STT module (stdlib only)
- `tests/test_speechmatics.py`: 33 tests
- `lib/config.py`: Added Speechmatics config fields
- `lib/handlers.py`: Provider dispatch functions
- `tests/test_handlers.py`: Updated mocks for dispatch layer
- `CLAUDE.md`: Documentation updates

Test plan
- `SPEECH_PROVIDER=speechmatics` in production

🤖 Generated with Claude Code
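The configuration variables and defaults in the PR description can be sketched as a small loader. This is an illustration, not the code in `lib/config.py`: `load_speech_settings` is a hypothetical name, and rejecting unknown values with `ValueError` is an assumption about how validation might work.

```python
import os

VALID_PROVIDERS = ('elevenlabs', 'speechmatics')
VALID_STT_REGIONS = ('eu1', 'us1', 'au1')


def load_speech_settings(env=None):
    """Read speech settings from env, applying the documented defaults."""
    env = os.environ if env is None else env
    provider = env.get('SPEECH_PROVIDER', 'elevenlabs').strip().lower()
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"unknown SPEECH_PROVIDER: {provider!r}")
    region = env.get('SPEECHMATICS_STT_REGION', 'eu1').strip().lower()
    if region not in VALID_STT_REGIONS:
        raise ValueError(f"unknown SPEECHMATICS_STT_REGION: {region!r}")
    return {
        'provider': provider,
        'api_key': env.get('SPEECHMATICS_API_KEY', ''),
        'voice_id': env.get('SPEECHMATICS_VOICE_ID', 'sarah'),
        'stt_region': region,
        # WAV for Speechmatics and MP3 otherwise, mirroring the handlers.py diff
        'tts_ext': '.wav' if provider == 'speechmatics' else '.mp3',
    }
```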