
Add AssemblyAI background STT with desktop cloud batch and speaker identity #7446

Open
Git-on-my-level wants to merge 45 commits into BasedHardware:main from Git-on-my-level:omi-groq


Git-on-my-level commented May 21, 2026

Summary

This PR adds the AssemblyAI-backed prerecorded/background STT path and wires it through desktop Audio Recording as a gated cloud batch flow.

At a high level it adds:

  • a provider abstraction for prerecorded STT with AssemblyAI and Deepgram adapters, workload-based routing, retries/fallbacks where appropriate, BYOK-aware provider selection, and provider run bookkeeping
  • desktop cloud batch transcription for microphone Audio Recording via /v2/desktop/background-conversation/* and /v2/desktop/background-transcribe
  • local desktop chunking, overlap merge, backpressure, speech-activity gating, explicit language handling, and best-effort finalization/reconciliation
  • provider speaker-cluster metadata and canonical speaker identity assignment so AssemblyAI diarization can flow into Omi speaker identity without pretending low-confidence matches are known people
  • provider usage ledgers, daily rollups, cost estimates, metrics, and Deepgram-vs-AssemblyAI evaluation tooling
  • rollout docs, local E2E scripts, and expanded backend/desktop/app tests
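The overlap merge listed above can be pictured with a small sketch. This is not the PR's implementation: the `Segment` type, the `merge_overlapping_chunks` name, and the rule of skipping any segment that starts inside already-covered audio are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds on the global session timeline (assumed)
    end: float
    text: str


def merge_overlapping_chunks(chunks: list[list[Segment]]) -> list[Segment]:
    """Concatenate per-chunk segments, skipping audio already covered.

    Hypothetical rule: a segment from a later chunk is kept only if it
    starts at or after the end of the last segment already merged.
    """
    merged: list[Segment] = []
    covered_until = 0.0
    for chunk in chunks:
        for seg in sorted(chunk, key=lambda s: s.start):
            if seg.start < covered_until:
                continue  # falls inside the overlap with the previous chunk
            merged.append(seg)
            covered_until = seg.end
    return merged
```

A real merge would likely also compare text in the overlap window; this sketch only shows the timeline bookkeeping.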

Intended Behavior

  • Realtime listen, streaming PTT, and voice-message paths stay on Deepgram.
  • AssemblyAI is eligible only for prerecorded-style workloads: sync, background, and postprocess.
  • Desktop batch is microphone-only and server-gated; BLE/non-mic sources continue through /v4/listen.
  • Desktop batch sends ~15s overlapping chunks, drops silent/noise-only chunks before upload, and does not stop recording just because ASR is slower than realtime.
  • Desktop batch uses the selected explicit language instead of multi to avoid AssemblyAI language-detection failures on short or quiet chunks.
  • Provider-local speaker labels are preserved as provider metadata; canonical identity is assigned only when Omi speaker matching has enough evidence.
  • AssemblyAI background is fail-closed once selected; other eligible prerecorded workloads can still fall back to Deepgram as configured.
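The routing rules above can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the `Workload` enum, its member names for the realtime paths, and `select_providers` are hypothetical, and the Deepgram fallback for non-background workloads is hard-coded here even though the description says it is configurable.

```python
from enum import Enum


class Workload(str, Enum):
    REALTIME = "realtime"            # hypothetical name for the /v4/listen path
    PTT = "ptt"                      # hypothetical name
    VOICE_MESSAGE = "voice_message"  # hypothetical name
    SYNC = "sync"
    BACKGROUND = "background"
    POSTPROCESS = "postprocess"


# Only prerecorded-style workloads are eligible for AssemblyAI.
PRERECORDED = {Workload.SYNC, Workload.BACKGROUND, Workload.POSTPROCESS}


def select_providers(workload, assemblyai_enabled, enabled_workloads):
    """Return the ordered list of providers to try for a workload."""
    if workload not in PRERECORDED:
        return ["deepgram"]  # realtime, PTT, voice messages stay on Deepgram
    if not assemblyai_enabled or workload not in enabled_workloads:
        return ["deepgram"]
    if workload is Workload.BACKGROUND:
        return ["assemblyai"]  # fail-closed: no Deepgram fallback once selected
    return ["assemblyai", "deepgram"]  # assumed fallback order
```

Returning an ordered list keeps the fail-closed case explicit: a single-element list means there is nothing to fall back to.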

Rollout / Config

AssemblyAI remains off by default.

Important knobs:

  • ASSEMBLYAI_API_KEY
  • ASSEMBLYAI_BACKGROUND_STT_ENABLED
  • ASSEMBLYAI_BACKGROUND_STT_WORKLOADS (sync,background,postprocess by default)
  • ASSEMBLYAI_STT_MODEL
  • ASSEMBLYAI_BASE_URL
  • ASSEMBLYAI_POLL_INTERVAL_SECONDS
  • ASSEMBLYAI_MAX_POLL_SECONDS
  • ASSEMBLYAI_SMOKE_AUDIO_URL

Rollback is config-only: set ASSEMBLYAI_BACKGROUND_STT_ENABLED=false or remove a workload from ASSEMBLYAI_BACKGROUND_STT_WORKLOADS.
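To illustrate why rollback needs no code change, here is a hedged sketch of how the two rollback knobs might be read. Only the environment variable names and the default workload list come from the description; `AssemblyAIConfig`, `load_config`, and the set of accepted truthy strings are assumptions.

```python
import os
from dataclasses import dataclass

DEFAULT_WORKLOADS = ("sync", "background", "postprocess")


@dataclass(frozen=True)
class AssemblyAIConfig:
    enabled: bool
    workloads: frozenset


def load_config(env=None) -> AssemblyAIConfig:
    """Read the rollback-relevant knobs; AssemblyAI is off unless enabled."""
    env = os.environ if env is None else env
    raw_enabled = env.get("ASSEMBLYAI_BACKGROUND_STT_ENABLED", "false")
    # Accepted truthy spellings are an assumption, not taken from the PR.
    enabled = raw_enabled.strip().lower() in {"1", "true", "yes"}
    raw = env.get("ASSEMBLYAI_BACKGROUND_STT_WORKLOADS")
    names = DEFAULT_WORKLOADS if raw is None else [w.strip() for w in raw.split(",")]
    return AssemblyAIConfig(enabled, frozenset(n for n in names if n))
```

Under this sketch, unsetting the flag or dropping `background` from the workload list changes routing on the next config read.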

BYOK adds optional X-BYOK-AssemblyAI; existing Deepgram BYOK behavior is preserved when users have only a Deepgram key.
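One way to read the BYOK rule, combined with the commit note that Deepgram BYOK is preferred over Omi's server AssemblyAI key when the user supplied no AssemblyAI key, is sketched below. `resolve_byok_provider` and the `X-BYOK-Deepgram` header name are hypothetical; only `X-BYOK-AssemblyAI` appears in the description, and the behavior for requests with no BYOK keys at all is an assumption.

```python
def resolve_byok_provider(headers: dict, assemblyai_routing_enabled: bool):
    """Return (provider, key_source) for a prerecorded request.

    `headers` is treated as a plain dict with exact-case keys; a real
    HTTP framework would do case-insensitive lookup.
    """
    assembly_key = headers.get("X-BYOK-AssemblyAI")
    deepgram_key = headers.get("X-BYOK-Deepgram")  # hypothetical header name
    if assemblyai_routing_enabled and assembly_key:
        return ("assemblyai", "byok")
    if deepgram_key:
        # Deepgram-only BYOK users never hit the server AssemblyAI key.
        return ("deepgram", "byok")
    if assemblyai_routing_enabled:
        return ("assemblyai", "server")  # assumed non-BYOK behavior
    return ("deepgram", "server")
```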

Notable Code Areas

  • Backend provider layer: backend/utils/stt/provider_service.py, backend/utils/stt/providers.py, backend/utils/stt/assemblyai_adapter.py, backend/utils/stt/deepgram_adapter.py
  • Desktop background API: backend/routers/desktop_background.py, backend/utils/conversations/desktop_background.py
  • Speaker identity and transcript normalization: backend/utils/stt/background_speaker_identity.py, backend/utils/stt/conversation_reconstructor.py, backend/models/transcript_segment.py
  • Provider observability/evaluation: backend/database/transcription_provider_usage.py, backend/utils/stt/provider_evaluation.py, backend/scripts/stt/provider_comparison_gate.py
  • Desktop client flow: desktop/Desktop/Sources/AppState.swift, desktop/Desktop/Sources/TranscriptionService.swift, desktop/Desktop/Sources/BackgroundTranscription/*
  • Desktop/BYOK settings: desktop/Desktop/Sources/MainWindow/Pages/SettingsPage.swift, desktop/Desktop/Sources/OnboardingBYOKStepView.swift
  • Docs and E2E: docs/doc/developer/backend/assemblyai_background_rollout.mdx, docs/doc/developer/backend/listen_pusher_pipeline.mdx, scripts/desktop_assemblyai_e2e.py

Validation

Automated checks run on this branch:

  • backend/venv/bin/python -m pytest tests/unit/test_assemblyai_adapter.py tests/unit/test_desktop_background_transcribe.py tests/unit/test_background_provider_service.py tests/unit/test_byok_assemblyai_routing.py tests/unit/test_rate_limiting.py::TestRouterPolicyMapping::test_all_router_policies_exist -v -> 41 passed, 2 skipped
  • xcrun swift test -c debug --package-path Desktop --filter BackgroundTranscription -> 15 passed
  • xcrun swift test -c debug --package-path Desktop --filter APIClientRoutingTests/testFinishBackgroundConversationRoutesToExplicitPythonConversation -> 1 passed
  • xcrun swift test -c debug --package-path Desktop --filter ListenProtocolTests -> 25 passed
  • .git/hooks/pre-commit -> Python formatting clean
  • git diff --check / git diff --cached --check -> clean

Live desktop evidence with Omi Dev (com.omi.desktop-dev) against local backend with AssemblyAI enabled:

  • conversation ba94a0a9-1af8-4d51-b98b-0a0f269bef65, local session 56
  • batch path used language=en
  • uploaded a 480000-byte chunk at 0ms, completed with provider=assemblyai, segments=1, run id 5849a05c-1c65-4072-915c-4e7959fad97a
  • no queue overflow, no Deepgram fallback, and no quiet-room empty-chunk storm after the speech window
  • session reconciled to the backend conversation and the conversation list refreshed/synced
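The 480000-byte upload is consistent with the ~15s chunk length if the audio is 16 kHz, 16-bit, mono PCM. That format is an assumption; the description does not state the sample rate or encoding. The arithmetic, with a hypothetical helper name:

```python
def pcm_duration_seconds(num_bytes, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Duration of raw PCM audio; the defaults are assumed, not from the PR."""
    return num_bytes / (sample_rate_hz * bytes_per_sample * channels)


print(pcm_duration_seconds(480_000))  # 15.0
```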

Manual Test Checklist

Before broad rollout, verify:

  • backend canary with ASSEMBLYAI_BACKGROUND_STT_ENABLED=true, ASSEMBLYAI_API_KEY set, and ASSEMBLYAI_BACKGROUND_STT_WORKLOADS=sync,background,postprocess
  • scripts/desktop_assemblyai_e2e.py --background-batch --api http://127.0.0.1:8080 --language en produces non-empty AssemblyAI segments
  • Omi Dev Audio Recording with Batch transcription on produces ~15s chunks, live transcript growth, no overflow alert, and no Deepgram fallback storm
  • quiet-room session does not upload repeated empty chunks
  • stopping desktop recording creates/reconciles a conversation in the Conversations tab
  • /v4/listen, PTT, and voice-message flows still route to Deepgram
  • BYOK behavior with Deepgram-only key and optional AssemblyAI key
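The quiet-room check depends on speech-activity gating before upload. The description does not say how the desktop client detects speech, so the following is only a sketch of one common approach, an RMS energy threshold over 16-bit PCM. The function names, the threshold value, and the native-byte-order assumption are all illustrative.

```python
import math
from array import array


def rms_int16(pcm: bytes) -> float:
    """Root-mean-square amplitude of native-endian 16-bit PCM samples."""
    samples = array("h")
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])  # drop a trailing odd byte
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_upload(pcm: bytes, rms_threshold: float = 300.0) -> bool:
    """Gate a chunk: skip upload when it is silent or noise-only.

    The threshold is a made-up value; a real gate would be tuned.
    """
    return rms_int16(pcm) >= rms_threshold
```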

Caveats / Reviewer Notes

  • AssemblyAI and Deepgram can differ in segment boundaries, diarization, language handling, and timestamps; the provider comparison tooling is included to make those differences measurable before rollout expansion.
  • Low-confidence speaker identity is intentionally left unknown rather than mapped to a fake person or speaker_id=0.
  • Provider ledger payloads reject raw audio/transcript payloads by design; run records should carry metrics and artifact refs only.
  • In one local desktop run, the explicit finish POST returned HTTP 500 during post-processing, but the existing retry service reconciled the session to the backend conversation. The explicit finish route is unit-covered; this did not reproduce as an ASR/chunking failure.
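The "leave low-confidence identity unknown" rule can be sketched as follows. This is an illustration under assumptions, not the code in background_speaker_identity.py: `SpeakerAssignment`, `assign_identity`, the score scale, and the 0.8 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeakerAssignment:
    provider_label: str       # provider-local cluster label, kept as metadata
    person_id: Optional[str]  # None means unknown, never a placeholder person
    score: float


def assign_identity(provider_label: str, candidates: dict, min_score: float = 0.8):
    """Pick the best-matching person, or leave the speaker unknown."""
    if not candidates:
        return SpeakerAssignment(provider_label, None, 0.0)
    person_id, score = max(candidates.items(), key=lambda kv: kv[1])
    if score < min_score:
        return SpeakerAssignment(provider_label, None, score)
    return SpeakerAssignment(provider_label, person_id, score)
```

Keeping `provider_label` on every assignment preserves the provider's clustering even when no canonical identity is assigned.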

David Zhang and others added 26 commits May 21, 2026 07:53
Update desktop PTT tests to patch stt_provider_service after chat.py refactor,
remove the unused duration parameter from postprocess_words, and skip self-voice
review candidates when no identity assignment is available.

BYOK users can supply a fifth AssemblyAI key for sync/background/postprocess
workloads; when Assembly routing is enabled but no Assembly key is present,
Deepgram BYOK is used instead of Omi's server Assembly key.

Co-authored-by: Cursor <cursoragent@cursor.com>
Wire desktop Audio Recording to POST /v2/desktop/background-transcribe via chunker/session queue, add backend endpoints and e2e script, and include an agent prompt for multi-chunk non-desktop E2E verification.

Co-authored-by: Cursor <cursoragent@cursor.com>
Stabilize desktop cloud batch recording by using 15s cloud chunks, nonfatal backpressure, speech activity gating, explicit batch language, and resilient ASR queue draining.

Add explicit desktop background conversation finish routing, AssemblyAI background fail-closed behavior, route/rate-limit coverage, and developer docs for the batch path.

Validation: backend focused pytest suite 41 passed, 2 skipped; Swift BackgroundTranscription 15 passed; APIClient finish route 1 passed; ListenProtocol 25 passed; git diff --check clean; live Omi Dev session uploaded 15s AssemblyAI chunks, suppressed quiet-room chunks, and reconciled conversation ba94a0a9-1af8-4d51-b98b-0a0f269bef65.
Git-on-my-level changed the title from "Fix desktop AssemblyAI background batch transcription" to "Add AssemblyAI background STT with desktop cloud batch and speaker identity" May 21, 2026

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1ebf27a4d


Comment on lines +1917 to +1925
if isStartingTranscription {
    cloudBackgroundStartTask?.cancel()
    cloudBackgroundStartTask = nil
    isStartingTranscription = false
    isCloudBackgroundTranscription = false
    cloudBackgroundSession = nil
    cloudBackgroundConversationId = nil
    AssistantSettings.shared.transcriptionEnabled = false
    return

P1: Stop capture when canceling startup

When stopTranscription() is called during startup, this branch cancels cloudBackgroundStartTask and returns without stopping active capture or resetting isTranscribing. In startCloudBackgroundTranscription, isTranscribing is set to true before startup completes (around line 1627), so if the user taps stop during that window, the app can keep recording until a second stop (or until other cleanup happens), which is a privacy/UX regression for the cloud background path.


Run Omi speaker identity matching on AssemblyAI desktop background chunks before applying global chunk offsets, update provider run identity metrics, and cover the Omi user match path in desktop background transcription tests.

Validation: backend focused pytest suite 42 passed, 2 skipped; speaker identity focused suite 23 passed; pre-commit Python formatting clean; git diff --check clean.
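The ordering in this commit (identity matching first, global offsets second) implies a step that shifts chunk-local timestamps onto the session timeline. A minimal sketch of that shift, assuming segments are dicts with `start` and `end` in seconds and the chunk offset is in milliseconds; `apply_chunk_offset` is a hypothetical name:

```python
def apply_chunk_offset(segments, chunk_start_ms):
    """Return copies of chunk-local segments shifted to global time."""
    offset_s = chunk_start_ms / 1000.0
    return [
        {**seg, "start": seg["start"] + offset_s, "end": seg["end"] + offset_s}
        for seg in segments
    ]
```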