Problem
Mitzo is mobile-first but interaction is still text-only. When walking, driving, or doing anything with your hands occupied, you can't use it. The whole point of a mobile command center is that it's accessible anywhere — forcing keyboard input defeats that.
Proposal
Add a voice mode that enables hands-free interaction:
Input (speech-to-text)
- Tap-to-talk button or push-to-talk gesture in the chat input area
- Browser Web Speech API (SpeechRecognition) for on-device STT — no server roundtrip for transcription
- Interim results shown as the user speaks, final transcript sent as the prompt
- Fallback: if Web Speech API unavailable (some browsers), show a clear "not supported" message
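The input flow above can be sketched as a small factory that either returns a configured recognizer or null for the fallback case. This is a sketch, not Mitzo's implementation: `createRecognizer`, `onInterim`, and `onFinal` are hypothetical names, and `webkitSpeechRecognition` is the prefixed form Safari and Chrome actually ship.

```javascript
// Sketch: create a one-shot recognizer for tap-to-talk, or return null
// when the Web Speech API is unavailable (the "not supported" fallback).
function createRecognizer(onInterim, onFinal) {
  const Impl =
    globalThis.SpeechRecognition ?? globalThis.webkitSpeechRecognition;
  if (!Impl) return null; // caller shows the "not supported" message

  const rec = new Impl();
  rec.interimResults = true; // show partial transcript while speaking
  rec.continuous = false;    // one utterance per tap-to-talk press

  rec.onresult = (event) => {
    let interim = "";
    for (const result of event.results) {
      if (result.isFinal) onFinal(result[0].transcript);
      else interim += result[0].transcript;
    }
    if (interim) onInterim(interim);
  };
  return rec;
}
```

A tap-to-talk handler would call `rec.start()` on press and `rec.stop()` on release; when `createRecognizer` returns null, the UI renders the fallback message instead of the microphone button.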
Output (text-to-speech)
- Auto-read agent responses aloud when voice mode is active
- Browser SpeechSynthesis API for on-device TTS
- Stop/skip button to interrupt readback
- Only read text blocks — skip tool calls, thinking blocks, and code blocks (or summarize them: "Running 3 tools...")
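The readback rules above boil down to a pure filter over response blocks plus a thin, guarded wrapper around SpeechSynthesis. The block shapes (a `type` field with `"text"`, `"tool_call"`, etc.) are assumptions about Mitzo's message format, not its actual schema.

```javascript
// Sketch: decide what to read aloud from a list of response blocks.
// Tool calls are summarized, thinking and code blocks are skipped,
// and text blocks are spoken verbatim.
function speakableText(blocks) {
  const parts = [];
  const toolCalls = blocks.filter((b) => b.type === "tool_call").length;
  if (toolCalls > 0) {
    parts.push(`Running ${toolCalls} tool${toolCalls > 1 ? "s" : ""}...`);
  }
  for (const b of blocks) {
    if (b.type === "text") parts.push(b.text);
  }
  return parts.join(" ");
}

// Read the filtered text aloud. speechSynthesis.cancel() also backs the
// stop/skip button; the guard makes this a no-op where TTS is missing.
function speak(text) {
  if (!globalThis.speechSynthesis) return;
  speechSynthesis.cancel(); // interrupt any in-progress readback
  speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}
```

Keeping the filter separate from the speaking code means it can be unit-tested without a browser.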
UX considerations
- Voice mode toggle in the chat header (next to model/mode selectors)
- Visual indicator when listening (pulsing microphone icon)
- Works alongside text input — not a replacement, an addition
- Permission prompt for microphone access on first use
- Should work in iOS Safari PWA mode (the primary deployment target)
Non-goals (for now)
- Real-time streaming STT (interim results from Web Speech API are sufficient)
- Custom wake word / always-listening
- Server-side STT/TTS (keep it client-side to avoid latency and API costs)
- Voice-to-voice (Anthropic real-time audio API) — revisit when available
Technical notes
- Web Speech API is available in Safari (iOS 14.5+), Chrome, Edge. Not Firefox.
- SpeechRecognition requires a secure context (HTTPS or localhost) — Mitzo currently runs over HTTP on Tailscale, so this needs testing. iOS Safari PWA mode may also have different permissions behavior.
- No new dependencies needed — both APIs are browser-native.
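Given the availability caveats above, the voice-mode toggle should only appear when the environment can actually support it. A minimal sketch, assuming a hypothetical `voiceModeAvailable` gate; `isSecureContext` is the standard browser flag for the HTTPS-or-localhost requirement.

```javascript
// Sketch: gate the voice-mode toggle on API presence and a secure
// context (SpeechRecognition needs HTTPS or localhost, which matters
// for Mitzo's HTTP-over-Tailscale setup).
function voiceModeAvailable() {
  const hasStt = Boolean(
    globalThis.SpeechRecognition ?? globalThis.webkitSpeechRecognition
  );
  const hasTts = Boolean(globalThis.speechSynthesis);
  const secure = globalThis.isSecureContext === true;
  return hasStt && hasTts && secure;
}
```

Running this check once at startup keeps the fallback logic in one place: if it returns false, the toggle is hidden and the "not supported" message is shown on any voice-related entry point.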
Priority
Medium-high. This is a UX multiplier for the mobile-first use case.