
feat: voice mode for hands-free interaction #60

@dimakis

Description

Problem

Mitzo is mobile-first but interaction is still text-only. When walking, driving, or doing anything with your hands occupied, you can't use it. The whole point of a mobile command center is that it's accessible anywhere — forcing keyboard input defeats that.

Proposal

Add a voice mode that enables hands-free interaction:

Input (speech-to-text)

  • Tap-to-talk button or push-to-talk gesture in the chat input area
  • Browser Web Speech API (SpeechRecognition) for on-device STT — no server roundtrip for transcription
  • Interim results shown as the user speaks, final transcript sent as the prompt
  • Fallback: if Web Speech API unavailable (some browsers), show a clear "not supported" message
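The STT flow above could be wired up roughly like this — a minimal sketch, where `onInterim` and `onFinal` are hypothetical callbacks the chat input component would supply, and a `null` return tells the UI to show the "not supported" message:

```javascript
// Minimal sketch of tap-to-talk STT via the Web Speech API.
// `onInterim` / `onFinal` are hypothetical callbacks from the chat input.
function createRecognizer({ onInterim, onFinal }) {
  // Safari exposes the prefixed webkitSpeechRecognition constructor.
  const SR = globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;
  if (!SR) return null; // caller shows the "not supported" message

  const rec = new SR();
  rec.continuous = false;    // one utterance per tap
  rec.interimResults = true; // show partial text while speaking

  rec.onresult = (event) => {
    let interim = "";
    for (const result of event.results) {
      if (result.isFinal) {
        onFinal(result[0].transcript); // final transcript becomes the prompt
      } else {
        interim += result[0].transcript;
      }
    }
    if (interim) onInterim(interim);
  };
  return rec; // tap handler calls rec.start(), release calls rec.stop()
}
```

Calling `rec.start()` on the first tap is what triggers the browser's microphone permission prompt, so no separate permission flow is needed.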

Output (text-to-speech)

  • Auto-read agent responses aloud when voice mode is active
  • Browser SpeechSynthesis API for on-device TTS
  • Stop/skip button to interrupt readback
  • Only read text blocks — skip tool calls, thinking blocks, and code blocks (or summarize them: "Running 3 tools...")
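The "only read text blocks" rule could be a small pure filter over the response blocks before handing the result to SpeechSynthesis — a sketch assuming a hypothetical block shape `{ type, text }` with types like `text`, `tool_call`, `thinking`, and `code`:

```javascript
// Turn an agent response into a transcript for SpeechSynthesis.
// The { type, text } block shape is a hypothetical model of the response.
function toSpeakable(blocks) {
  const parts = [];
  let toolCalls = 0;
  for (const block of blocks) {
    if (block.type === "text") {
      parts.push(block.text);
    } else if (block.type === "tool_call") {
      toolCalls += 1; // summarized below instead of read verbatim
    }
    // thinking and code blocks are skipped entirely
  }
  if (toolCalls > 0) {
    parts.push(`Running ${toolCalls} tool${toolCalls === 1 ? "" : "s"}...`);
  }
  return parts.join(" ");
}

// Speaking and interrupting (browser only):
//   const u = new SpeechSynthesisUtterance(toSpeakable(blocks));
//   speechSynthesis.speak(u);  // auto-read when voice mode is active
//   speechSynthesis.cancel();  // stop/skip button
```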

UX considerations

  • Voice mode toggle in the chat header (next to model/mode selectors)
  • Visual indicator when listening (pulsing microphone icon)
  • Works alongside text input — not a replacement, an addition
  • Permission prompt for microphone access on first use
  • Should work in iOS Safari PWA mode (the primary deployment target)

Non-goals (for now)

  • Real-time streaming STT (interim results from Web Speech API are sufficient)
  • Custom wake word / always-listening
  • Server-side STT/TTS (keep it client-side to avoid latency and API costs)
  • Voice-to-voice (Anthropic real-time audio API) — revisit when available

Technical notes

  • Web Speech API is available in Safari (iOS 14.5+), Chrome, Edge. Not Firefox.
  • SpeechRecognition requires a secure context (HTTPS or localhost). Mitzo currently runs plain HTTP over Tailscale, so this needs testing and may require serving over HTTPS. iOS Safari PWA may also have different permissions behavior.
  • No new dependencies needed — both APIs are browser-native.
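The constraints above could be checked up front so the voice toggle can explain exactly why it is disabled rather than failing silently — a sketch; the reason strings are illustrative:

```javascript
// Report why voice mode can't work in this environment, or [] if it can.
// Takes the environment as a parameter so it is easy to unit test.
function voiceModeBlockers(env = globalThis) {
  const blockers = [];
  // SpeechRecognition is gated on a secure context (HTTPS or localhost).
  if (!env.isSecureContext) {
    blockers.push("insecure-context");
  }
  if (!(env.SpeechRecognition || env.webkitSpeechRecognition)) {
    blockers.push("no-speech-recognition"); // e.g. Firefox
  }
  if (!env.speechSynthesis) {
    blockers.push("no-speech-synthesis");
  }
  return blockers;
}
```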

Priority

Medium-high. This is a UX multiplier for the mobile-first use case.
