feat(voice): Azure Speech voice subsystem (dictation + hands-free, STT/TTS) #387

Open
patschmittdev wants to merge 2 commits into refactor/ui-foundation from feat/azure-speech-voice

@patschmittdev

Collaborator

Summary

Adds an optional Azure Speech voice subsystem, behind a feature flag (off by default):

  • Microphone dictation into the composer (speech-to-text).
  • Hands-free conversation mode (STT in, TTS out).

This is a standalone cloud voice alternative to the local Foundry dictation in #385. They are different architectures (cloud Azure vs on-device); shipping this does not preclude later converging Azure STT under #385's TranscriptionProvider contract.

Security posture (please review)

  • The subscription key is stored only via the keytar CredentialStore port. AzureSpeechStore throws rather than falling back if no OS keychain is available.
  • The JSON config file holds non-secret metadata only: writeConfig runs stripKey, and coerce never emits apiKey. The key never reaches disk in plaintext.
  • The renderer never receives the key. It authenticates with short-lived (~9 min) tokens minted in the main process via mintToken / issueToken.
  • SSRF defense: the region is validated against ^[a-z0-9-]+$ before building the issueToken URL host.
  • The Electron session security boundary (sessionSecurity.ts) is extended to scope the Azure Speech endpoints and mic permission.
  • AzureSpeechStore.ts is registered in the credential-write security invariant allowlist (security-boundaries.invariant.test.ts) with the review documented inline, same boundary contract as ByoLlmStore.
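For reviewers, the region check in the SSRF bullet above can be sketched roughly as follows. This is a minimal illustration, not the code in AzureSpeechStore.ts: the function name `buildIssueTokenUrl` and the constant `REGION_PATTERN` are hypothetical, and the host layout is the standard Azure Speech `issueToken` endpoint, assumed here to be what `mintToken` targets.

```typescript
// Hypothetical sketch: names are illustrative, not taken from AzureSpeechStore.ts.
// The pattern is the one quoted in the PR description.
const REGION_PATTERN = /^[a-z0-9-]+$/;

// Builds the token endpoint URL only after the region passes validation.
// Because the region is interpolated into the host, rejecting dots, slashes,
// "@", "#", whitespace and uppercase prevents a crafted value from redirecting
// the request (and the subscription key header) to another host.
export function buildIssueTokenUrl(region: string): string {
  if (!REGION_PATTERN.test(region)) {
    throw new Error(`Invalid Azure Speech region: ${JSON.stringify(region)}`);
  }
  // Assumed endpoint shape: the regional Azure Speech token service.
  return `https://${region}.api.cognitive.microsoft.com/sts/v1.0/issueToken`;
}
```

A value such as `westeurope` yields a URL on the Azure host, while something like `evil.com/x#` throws before any request is built. Validating before string interpolation keeps the check in one place, so every caller that needs the host goes through it.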

Branch shape (off the current master tip, 0 behind)

  • bb8b482 refactor(ui): extract shared UI foundation off master
  • 6b40ed8 feat(voice): Azure Speech voice subsystem (STT/TTS) + security boundary
  • 080bfa3 test(security): register AzureSpeechStore in credential-write allowlist + changelog

54 files.

Test evidence

  • npm run lint: green (tsc + eslint + dependency-cruiser 537 modules / 0 violations + yaml + markdown).
  • Security invariant security-boundaries.invariant.test.ts: 12/12 pass after the reviewed allowlist registration.
  • Full npm test before the fix was 2078 pass / 2 fail, where the two failures were exactly (a) this credential-boundary invariant and (b) MindProfileService > rejects symlinked profile files. The invariant now passes; the only remaining failure is the symlink test, a Windows fs.symlinkSync EPERM (Developer Mode) limitation that is unrelated and green in CI on Linux.

Notes

  • Feature-flagged off by default.
  • No linked issue.

Patrick Schmitt and others added 2 commits June 7, 2026 10:54
Split the Azure Speech voice FEATURE out of feat/webgl-ambient-background onto
the ui-foundation base. Delivers the full voice subsystem with its trust
boundary intact:

Main / security:
- AzureSpeechStore: subscription key lives in the OS keychain (injected
  CredentialStore); only non-secret metadata persists to disk; region is
  regex-validated (SSRF guard); renderer gets short-lived minted tokens, never
  the key
- azureSpeech IPC adapter (get/save/disable/test/mintToken) gated on the flag
- sessionSecurity: connect-src allows the Azure Speech STT/TTS endpoints; the
  permission handler grants microphone only when the voice flag is on and
  always denies camera (video). Theme-hash CSP changes are intentionally NOT
  here (they belong to shell-theming)
- azureSpeech feature flag across feature-flags / devFeatureFlags / docs
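
The mic/camera rule described for the permission handler can be sketched as a pure decision function. This is an assumption-laden illustration: `MediaRequest` and `shouldGrantPermission` are hypothetical names, and denying every non-media permission is a simplification made for this sketch, not something stated in this commit.

```typescript
// Hypothetical sketch of the decision described above, kept free of Electron
// imports so it can be exercised in isolation.
interface MediaRequest {
  permission: string;            // e.g. "media"
  mediaTypes: readonly string[]; // e.g. ["audio"], ["video"], or both
}

export function shouldGrantPermission(req: MediaRequest, voiceFlagOn: boolean): boolean {
  // Assumption for this sketch: anything that is not a media request is denied.
  if (req.permission !== "media") {
    return false;
  }
  // Camera is always denied, even when bundled with an audio request.
  if (req.mediaTypes.some((t) => t === "video")) {
    return false;
  }
  // Microphone is granted only while the voice feature flag is on.
  return voiceFlagOn && req.mediaTypes.some((t) => t === "audio");
}
```

In an Electron app, a function like this would typically be called from the callback passed to `session.setPermissionRequestHandler`, using the `mediaTypes` reported in the request details. Checking video before audio means a combined audio+video request is refused outright rather than partially granted.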

Renderer:
- components/voice: VoiceModeController + VoiceModeOverlay
- hooks: useVoiceInput (dictation) + useVoiceConversation (hands-free)
- lib: azureSpeechRecognizer / azureSpeechSynthesizer / sentenceChunker
- settings: AzureSpeechSettingsSection

Folded-in audit fixes: microsoft-cognitiveservices-speech-sdk pinned EXACT
(1.50.0, was a caret range) since it is a security-relevant SDK.

DIVERGENCE (intentional, owned): this branch is the voice *capability layer*.
The integration points into ChatInput/ChatPanel (dictation + voice-mode
buttons) and SettingsView (the Voice section) are NOT wired here, because those
files are owned by feat/chat-polish and feat/settings-ia. Re-wiring them is the
explicit reconciliation step when voice merges with those branches. The voice
components/hooks are covered directly by their own tests.

Verification: tsc clean; lint clean (537 modules, 0 violations); 93 voice +
security tests pass, including AzureSpeechStore key-isolation / SSRF-guard /
token-mint and the sessionSecurity mic/camera + speech-CSP tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The Azure Speech voice subsystem stores the subscription key only through the
keytar-backed CredentialStore port; its JSON config is key-stripped (stripKey
and coerce never persist apiKey) and the renderer authenticates with short-lived
issued tokens, never the key. Add AzureSpeechStore.ts to the security-boundary
invariant's approved credential-writer allowlist (same contract as ByoLlmStore),
with the review documented inline, and record the voice subsystem under the
Unreleased changelog.
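
The key-stripping contract described above could look roughly like this. It is a sketch under assumptions: the config fields `region` and `enabled` are hypothetical, and the real `stripKey` / `coerce` in this PR may be shaped differently. Only the property shown (no `apiKey` in the persisted object) comes from the description.

```typescript
// Hypothetical config shapes; the field names are assumptions for illustration.
interface AzureSpeechConfigInput {
  region: string;
  enabled: boolean;
  apiKey?: string; // secret: belongs in the OS keychain, never in the JSON file
}

interface PersistedAzureSpeechConfig {
  region: string;
  enabled: boolean;
}

// Allowlist-style strip: copy only the known non-secret fields, so a secret
// (or any unexpected extra property) cannot leak into the persisted object.
export function stripKey(config: AzureSpeechConfigInput): PersistedAzureSpeechConfig {
  return { region: config.region, enabled: config.enabled };
}
```

Copying an explicit allowlist of fields, rather than deleting `apiKey` from a clone, means a newly added secret field stays out of the file unless someone deliberately adds it to the persisted shape.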

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>