feat(voice): local voice dictation via Foundry Local + Nemotron streaming #385

Open
HashwanthVen wants to merge 31 commits into ianphil:master from HashwanthVen:feat/voice-dictation

@HashwanthVen

Contributor

Voice dictation: speak into Chamber locally (Nemotron + Foundry Local)

Summary

Adds local-first voice dictation to Chamber. Users hold a renderer-only shortcut (Alt+Shift+V) or click a mic button in the chat input to speak; transcripts insert at the caret of the active single-agent chat input. Never auto-sends — user always reviews before pressing Send.

Built on the same foundry-local-sdk@1.1.0 + Nemotron streaming model that the official GitHub Copilot CLI /voice uses, so the engineering pattern is upstream-validated.

Closes the gap in the PRD at VOICE_DICTATION_PRD.md (planning doc lives outside the repo).

Highlights

  • Local-only ASR. nemotron-speech-streaming-en-0.6b runs in a worker_threads.Worker via Foundry Local SDK. Audio never leaves the device. Model auto-downloads on first use (~696 MB CPU variant).
  • Push-to-talk keyboard shortcut, renderer-only (no globalShortcut — explicitly out of scope for Phase 1). Default Alt+Shift+V. Configurable hold-vs-toggle behavior.
  • Mic button in ChatInput with visible "Listening…" indicator. Hidden when feature flag is off; disabled with tooltip until the model is Ready.
  • Settings → Voice dictation section with six rows: device select, permission state + "Open preferences", Test mic readiness check, shortcut display, keyboard-shortcut-behavior toggle, model row with live download progress.
  • Feature-flagged behind voiceDictation (default off on stable, on for insiders + dev). Cleanly invisible when off.
  • Service-boundary clean. packages/services/** never imports electron or foundry-local-sdk; depcruise enforces the rule (new foundry-local-sdk-only-in-voice-workers rule added).
  • Pinned runtime. chamber-voice-runtime/ mirrors the existing chamber-copilot-runtime/ pattern. foundry-local-sdk@1.1.0 pinned exactly. forge.config.ts adds the SDK to asar.unpack for native prebuilds.

Architecture

Renderer (apps/web)
  ChatInput mic button + push-to-talk key handler
        ↓
  useVoiceDictation hook
        ↓  getUserMedia + AudioWorklet → PCM16 16 kHz mono frames
  window.electronAPI.voice.*  (preload contextBridge)
        ↓  IPC (IPC.VOICE.*)
Main (apps/desktop)
  setupVoiceIPC adapter  →  VoiceDictationService  →  VoiceWorkerPool
        ↓  worker_threads boundary
apps/desktop/src/main/voiceWorker/voiceWorker.ts
  FoundryLocalManager + LiveAudioTranscriptionSession (single worker; Foundry Core is a process-singleton)
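
The capture step in the diagram turns Web Audio float samples into 16-bit PCM before they cross IPC. A minimal sketch of that conversion follows. `floatToPcm16` is a hypothetical name; the PR's encoder lives in apps/web/src/renderer/lib/audio/ and may differ. Resampling to 16 kHz and downmixing to mono are assumed to happen elsewhere.

```typescript
// Hypothetical helper, not the PR's actual encoder.
// Converts Web Audio samples (floats in [-1, 1]) to signed 16-bit PCM.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp so out-of-range samples saturate instead of wrapping around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale by 32768, positive values by 32767,
    // so both ends of the int16 range are reachable.
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}
```

Saturating rather than wrapping matters for speech input: a clipped peak stays a loud peak instead of flipping sign and producing a click.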

Files changed

New:

  • packages/shared/src/voice-types.ts + test — config, IPC contracts, RPC verbs, payload size guard
  • packages/services/src/voice/VoiceDictationStore, VoiceDictationService, VoiceWorkerPool, FakeTranscriptionProvider, FoundryTranscriptionProvider, permissions/provider interfaces, all with colocated tests
  • apps/desktop/src/main/voice/PermissionInspector.ts + test — wraps systemPreferences.getMediaAccessStatus cross-platform
  • apps/desktop/src/main/voiceWorker/voiceWorker.ts + test — single worker handling both engine + installer verbs
  • apps/desktop/src/main/voiceWorker/workerProtocol.ts — shared RPC envelope
  • apps/desktop/src/main/ipc/voice.ts + test — thin IPC adapter with sender ownership + chunk size validation
  • apps/web/src/renderer/hooks/useVoiceDictation.ts + test — session lifecycle, sessionId routing, PTT, IME suppression, cleanup
  • apps/web/src/renderer/lib/audio/ — PCM16 encoder, AudioWorklet processor, captureMic helper
  • apps/web/src/renderer/components/settings/VoiceDictationSettingsSection.tsx + test
  • chamber-voice-runtime/ — pinned foundry-local-sdk@1.1.0
  • scripts/prepare-voice-runtime.js, scripts/check-voice-sdk-version.js, scripts/voice-runtime-smoke.js
  • apps/desktop/vite.voiceWorker.config.ts
  • tests/e2e/electron/voice-dictation-smoke.spec.ts

Modified (surgical):

  • packages/shared/src/ipc-channels.ts: IPC.VOICE.* + IPC.E2E.VOICE_*
  • packages/shared/src/electron-types.ts (+ test) — voice namespace
  • packages/shared/src/feature-flags.ts (+ test) — voiceDictation flag
  • apps/desktop/src/main.ts — wire VoiceDictationService when feature flag is on; honor CHAMBER_E2E_VOICE_FAKE=1 for deterministic tests
  • apps/desktop/src/preload.ts — expose voice namespace
  • apps/desktop/src/main/devFeatureFlags.ts — enable in dev
  • apps/desktop/src/main/security/sessionSecurity.ts (+ test) — allow media permission for Chamber renderer origins only
  • apps/web/src/renderer/components/chat/ChatInput.tsx (+ test) — mic button + PTT, never auto-submits
  • apps/web/src/renderer/components/settings/SettingsView.tsx (+ test) — gate VoiceDictationSettingsSection behind flag
  • forge.config.ts: asar.unpack for foundry-local-sdk native prebuilds; new build entry for voiceWorker.ts
  • config/dependency-cruiser.cjs — enforces foundry-local-sdk import is allowed only from the worker file
  • package.json / package-lock.json: foundry-local-sdk@1.1.0 pinned in dependencies + overrides + new smoke:voice-runtime script

Validation

  • npm run typecheck
  • npm run lint ✅ (tsc + eslint + deps:check 547 modules / 0 boundary violations + yaml + md)
  • 65+ voice unit + RTL tests ✅
  • npm run smoke:voice-runtime ✅ (new — exercises the compiled worker bundle against a mocked SDK; would have caught the singleton bug)
  • npm run smoke:desktop -- tests/e2e/electron/voice-dictation-smoke.spec.ts: 8 passed / 3 skipped / 0 failed. Three scenarios are test.fixme() with explanation: two need to spy on chat.send but contextBridge freezes electronAPI (covered by chat-mic-inserts-sentinel which proves the visible textarea path); one is the second contextBridge case.
  • chamber-ui-tester driven against the real Foundry path on Windows: Test mic readiness ✅, model download with live progress ✅, chat mic listening state ✅. Edge cases (re-entrant start during stop) found two bugs that are now fixed in this branch — see 94dda24 and the voice-rt-fix commit chain.

Privacy + security

  • Audio never persisted, never logged, never sent over the network. keytar is not used (no secrets in the voice path).
  • Microphone permission gated by sessionSecurity allowlist — only Chamber's own renderer origin can request media; foreign origins denied.
  • worker_threads isolation between Foundry SDK and main process.
  • webPreferences (contextIsolation: true, nodeIntegration: false) unchanged.
  • All renderer ↔ main calls go through IPC.VOICE.* constants with zod schema validation and payload size limits (VOICE_MAX_APPEND_CHUNK_BYTES).
  • IPC sessions are owner-scoped: appendAudio/endSession reject from a non-owner WebContents.
  • gpt-5.5 security review: APPROVE, no findings.
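
The owner-scoping and payload-size rules above can be sketched as a small guard that runs before any audio reaches the service. This is an illustration under stated assumptions, not the PR's adapter: `SessionOwners` and `validateAppend` are hypothetical names, the numeric limit is a placeholder (the real VOICE_MAX_APPEND_CHUNK_BYTES value is not given here), and the zod schema validation is omitted to keep the sketch dependency-free.

```typescript
// Placeholder value for illustration only; the real constant is defined in the PR.
const VOICE_MAX_APPEND_CHUNK_BYTES = 64 * 1024;

// Hypothetical registry mapping each sessionId to the id of the
// WebContents that started it.
class SessionOwners {
  private readonly owners = new Map<string, number>();

  register(sessionId: string, senderId: number): void {
    if (this.owners.has(sessionId)) {
      throw new Error(`session ${sessionId} already exists`);
    }
    this.owners.set(sessionId, senderId);
  }

  isOwner(sessionId: string, senderId: number): boolean {
    return this.owners.get(sessionId) === senderId;
  }

  release(sessionId: string): void {
    this.owners.delete(sessionId);
  }
}

// Rejects appends from non-owners and chunks that are empty or over the size limit.
function validateAppend(
  owners: SessionOwners,
  senderId: number,
  sessionId: string,
  chunk: Uint8Array,
): void {
  if (!owners.isOwner(sessionId, senderId)) {
    throw new Error("sender does not own this voice session");
  }
  if (chunk.byteLength === 0 || chunk.byteLength > VOICE_MAX_APPEND_CHUNK_BYTES) {
    throw new Error("audio chunk size out of bounds");
  }
}
```

Checking ownership before size means a foreign sender learns nothing about the limits of a session it does not own.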

Not in scope (Phase 2 candidates)

  • Streaming partial transcripts inserted live (currently only finals; partials shown as transient "Listening…")
  • Chatroom voice support (only single-agent chat)
  • Multiple model picker / cloud transcription
  • Voice activity detection / auto-stop on silence
  • System-global keyboard shortcut (Electron.globalShortcut)
  • Always-on wake word
  • Text-to-speech for agent responses
  • Click-to-rebind shortcut UI

Drift / mergeability

Branch is 31 commits ahead, 0 commits behind ianphil/chamber:master. Base point is 8834f0b (verified via gh api repos/ianphil/chamber/branches/master). Clean fast-forward / no-op rebase.

How to try

  1. Pull this branch, npm install, npm start.
  2. Settings → Voice dictation → click Download (~696 MB first time via Foundry Local).
  3. Activate a mind, click the mic button in the chat input, speak, click to stop, edit transcript, press Enter or click Send.
  4. Push-to-talk: hold Alt+Shift+V with the textarea focused.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

HashwanthVen and others added 30 commits June 9, 2026 15:42
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- engineWorker maps internal end verb to session.stop()+dispose()
- installerWorker handles model download/runtime install; cancel via worker terminate
- FoundryTranscriptionProvider wraps VoiceWorkerPool with sessionId filtering
- service skeleton cleanup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- mic button hidden unless voiceDictation flag on; disabled until model ready
- final transcripts inserted at caret via existing controlled value path; never auto-submits
- preserves Enter-to-send, Escape-cancel, draft persistence, IME, shortcode popover behaviors
- guards Promise.resolve() around mock-returned voice.start/stop to satisfy strict event handler
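
Inserting a final transcript at the caret through a controlled value can be sketched as a pure function that returns the next value and caret position. `insertAtCaret` is a hypothetical name, and replacing the current selection is an assumption; the PR's ChatInput may handle selections or spacing differently.

```typescript
// Hypothetical helper: computes the next textarea value and caret position.
// It never submits anything; the caller only updates controlled state.
function insertAtCaret(
  value: string,
  selectionStart: number,
  selectionEnd: number,
  transcript: string,
): { value: string; caret: number } {
  // Clamp the selection into the current value so stale offsets cannot
  // produce out-of-range slices.
  const start = Math.max(0, Math.min(selectionStart, value.length));
  const end = Math.max(start, Math.min(selectionEnd, value.length));
  return {
    value: value.slice(0, start) + transcript + value.slice(end),
    caret: start + transcript.length,
  };
}
```

For example, `insertAtCaret("hello world", 5, 5, " big")` yields the value `"hello big world"` with the caret at offset 9.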

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- PermissionInspector concrete impl in app layer (preserves no-electron-in-services boundary)
- cross-platform table covers darwin/win32/linux
- sessionSecurity allows media permission only for Chamber renderer origins
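
The origin allowlist for media permission can be sketched as a pure decision function. `isMediaRequestAllowed` and the example origin are hypothetical; in Electron, a check like this would typically be called from the handler passed to `session.setPermissionRequestHandler`, where the permission name for microphone access is `"media"`. How the PR's sessionSecurity derives Chamber's renderer origins is not shown here.

```typescript
// Hypothetical decision function for media permission requests only.
// Other permission types are assumed to be handled by separate rules.
function isMediaRequestAllowed(
  permission: string,
  requestingUrl: string,
  allowedOrigins: ReadonlySet<string>,
): boolean {
  if (permission !== "media") return false;
  try {
    // Compare by origin so paths and query strings cannot bypass the check.
    // Caveat: URL.origin is the string "null" for file: and custom schemes,
    // so those would need their own comparison rule.
    return allowedOrigins.has(new URL(requestingUrl).origin);
  } catch {
    // An unparsable URL is treated as a foreign origin.
    return false;
  }
}
```

Denying by default keeps the failure mode safe: anything the function cannot positively identify as a Chamber origin gets no microphone.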

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Input device, mic permission, Test mic, shortcut, push-to-talk, model row with download progress.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sh-to-talk

- mints UUID per session; drops transcript events for unknown sessionId
- push-to-talk and toggle modes; suppressed during IME composition
- cleanup on unmount/session-stop/flag-off (tracks, AudioContext, IPC listeners)
- browserApi/test helpers extended for renderer fakes
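
The sessionId routing described above can be sketched as a small router that ignores events for any session other than the active one. `TranscriptRouter` and the `TranscriptEvent` shape are assumptions for illustration; the hook's real event payloads are defined in the PR's shared voice types. The caller is assumed to mint the UUID.

```typescript
// Assumed event shape for illustration.
interface TranscriptEvent {
  sessionId: string;
  text: string;
  isFinal: boolean;
}

// Hypothetical router: only final transcripts for the active session pass through.
class TranscriptRouter {
  private activeSessionId: string | null = null;

  begin(sessionId: string): void {
    this.activeSessionId = sessionId;
  }

  end(): void {
    this.activeSessionId = null;
  }

  // Returns the text to insert, or null when the event must be dropped.
  accept(event: TranscriptEvent): string | null {
    if (event.sessionId !== this.activeSessionId) return null; // unknown or stale session
    return event.isFinal ? event.text : null; // partials are not inserted
  }
}
```

Dropping by sessionId protects against late events from a session that was already stopped, which would otherwise land in the next dictation.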

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- VoiceDictationStore + Foundry/Fake provider + VoiceWorkerPool constructed only when voiceDictation flag is enabled
- CHAMBER_E2E_VOICE_FAKE=1 selects FakeTranscriptionProvider
- preload exposes window.electronAPI.voice surface with sessionId-keyed channels

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ooks

- voice-dictation-smoke.spec.ts drives Settings + chat mic flows
- e2e:voice:setPermissionState / setModelStatus / getSessionState IPC hooks
- VoiceDictationService.startSession now accepts {sessionId,deviceId,modelId} object (legacy positional still works)
- Adjust voice IPC adapter to forward request object; update mock service port
- Fix FeatureFlagService.test expectation for new voiceDictation flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Blocking:
- VoiceWorkerPool: serialize per-role RPCs via FIFO queue (fast-path for first send keeps postMessage synchronous)
- useVoiceDictation: tear down session when options.enabled flips false mid-listen
- appendFrame: await each chunk so backpressure propagates from worklet; stop session if append throws
- stop(): always call endSession even if mic cleanup rejects (no leaked worker sessions)

Non-blocking:
- voice IPC: prune transcript targets on WebContents destroyed; isDestroyed guard before send; auto-cleanup empty target sets
- voice IPC: assert sender ownership on appendAudio/endSession; record sessionId -> WebContents.id at startSession
- depcheck: enforce foundry-local-sdk import only from apps/desktop/src/main/voiceWorker/{engineWorker,installerWorker}.ts
- voice.test.ts: extend MockWebContents with id/once/isDestroyed
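
Serializing RPCs through a FIFO queue can be sketched with a promise chain. `SerialQueue` is a hypothetical name, and this sketch deliberately leaves out the synchronous fast path for the first send mentioned above; it shows only the ordering guarantee.

```typescript
// Hypothetical FIFO queue: each task starts only after the previous one settles.
class SerialQueue {
  private tail: Promise<void> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Chain the task behind whatever is currently last in line.
    const result = this.tail.then(() => task());
    // Swallow the outcome on the internal chain so one failed task does not
    // block later ones; the caller still sees the rejection via `result`.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}
```

The same awaiting discipline is what lets backpressure propagate: a caller that awaits each `enqueue` cannot get ahead of the worker.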

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- review-before-send + enter-still-submits fixme'd: contextBridge freezes electronAPI
  so chat.send cannot be spied from inside the page. Covered by chat-mic-inserts-sentinel
  (visible textarea verification) and existing chatroom/byo-llm Playwright specs.
- model-not-downloaded-cta: pass expectMicEnabled:false through activateMind

Final: 8 passed, 3 skipped, 0 failed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Forge Vite plugin builds main-target entries flat into .vite/build/, so the
workers must be listed alongside main.ts/preload.ts entries to be compiled.
Without this, runtime crashes with Voice engine/installer worker is not started
because main.ts tries to spawn engineWorker.js / installerWorker.js that were
never emitted by Vite.

- new apps/desktop/vite.voiceWorker.config.ts (foundry-local-sdk external)
- forge.config.ts: add engineWorker.ts + installerWorker.ts to build entries
- main.ts: resolve workers from __dirname (flat layout), not voiceWorker/

Missed by validation: unit tests mock worker_threads + foundry-local-sdk;
Playwright UAT uses CHAMBER_E2E_VOICE_FAKE=1 fake provider. Neither exercised
the real worker bundle path. Manual real-mic QA surfaced the gap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Foundry Local Core is process-singleton across worker_threads, so route engine and installer RPCs through one Vite-built voice worker and one physical VoiceWorkerPool worker.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Subscribe to worker modelProgress events during downloads, forward typed percent updates through the existing IPC progress channel, and render percent-based progress in Settings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Build the Vite voice worker bundle, run it in a real worker_thread with a deterministic Foundry stub, and assert download progress plus selectModel succeed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep worker session state clear until start succeeds, dispose failed starts, and retry once after stopping a stale native audio-stream handle reported by Foundry.
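
The retry-once behavior can be sketched as a wrapper around the start call. All three parameters are hypothetical stand-ins: how the worker detects a stale native audio-stream handle and how it stops one are not shown in this description.

```typescript
// Hypothetical wrapper: retries start exactly once after clearing a stale handle.
async function startWithOneRetry<T>(
  start: () => Promise<T>,
  isStaleHandleError: (error: unknown) => boolean,
  stopStaleHandle: () => Promise<void>,
): Promise<T> {
  try {
    return await start();
  } catch (error) {
    // Only the stale-handle case is retried; every other failure propagates.
    if (!isStaleHandleError(error)) throw error;
    await stopStaleHandle();
    // A second failure propagates to the caller without another retry.
    return start();
  }
}
```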

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stop opening a Foundry streaming session for Test mic; check permission and model readiness instead, with E2E status overrides preserving deterministic smoke coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The previous stop() cleared sessionIdRef immediately and only then awaited
capture.stop() + endSession(). A second mic click during that await chain
slipped past start()'s only gate (sessionIdRef.current) and fired a new
startSession while the FoundryTranscriptionProvider on main was still
ending, producing 'Foundry transcription provider is already started' and
leaving the chat mic button stuck in listening state.

Fix:
- Add 'starting' and 'stopping' to VoiceDictationState (idle/starting/listening/stopping/error).
- stop() now sets state='stopping' first and keeps sessionIdRef populated until cleanup completes.
- start() rejects when stateRef is not idle/error, not just when sessionIdRef is null.
- New stateRef mirrors state for synchronous gating without callback rebinding.
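
The gating described in this fix can be sketched outside React as a small state holder. `DictationGate` is a hypothetical name, and returning `false` instead of rejecting is a simplification for the sketch; the actual hook uses refs and may surface refusals differently.

```typescript
type VoiceDictationState = "idle" | "starting" | "listening" | "stopping" | "error";

// Hypothetical gate mirroring the start/stop rules described in the fix.
class DictationGate {
  state: VoiceDictationState = "idle";
  sessionId: string | null = null;

  // Returns false when a start is refused or fails.
  async start(open: (sessionId: string) => Promise<void>, sessionId: string): Promise<boolean> {
    // Gate on state, not only on sessionId, so a click during
    // "starting" or "stopping" is refused.
    if (this.state !== "idle" && this.state !== "error") return false;
    this.state = "starting";
    this.sessionId = sessionId;
    try {
      await open(sessionId);
      this.state = "listening";
      return true;
    } catch {
      this.sessionId = null;
      this.state = "error";
      return false;
    }
  }

  async stop(close: (sessionId: string) => Promise<void>): Promise<void> {
    if (this.state !== "listening" || this.sessionId === null) return;
    const id = this.sessionId;
    // Mark "stopping" first and keep sessionId populated until cleanup completes.
    this.state = "stopping";
    try {
      await close(id);
    } finally {
      this.sessionId = null;
      this.state = "idle";
    }
  }
}
```

Because `stop` sets the state synchronously before its first `await`, a second mic click in the same tick already sees `"stopping"` and is refused.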

Caught by chamber-ui-tester driving the real Foundry path; previous Playwright
UAT used the fake provider and missed it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>