Add AssemblyAI background STT with desktop cloud batch and speaker identity by Git-on-my-level · Pull Request #7446 · BasedHardware/omi

Git-on-my-level · 2026-05-21T16:39:13Z

Summary

This PR adds the AssemblyAI-backed prerecorded/background STT path and wires it through desktop Audio Recording as a gated cloud batch flow.

At a high level it adds:

a provider abstraction for prerecorded STT with AssemblyAI and Deepgram adapters, workload-based routing, retries/fallbacks where appropriate, BYOK-aware provider selection, and provider run bookkeeping
desktop cloud batch transcription for microphone Audio Recording via /v2/desktop/background-conversation/* and /v2/desktop/background-transcribe
local desktop chunking, overlap merge, backpressure, speech-activity gating, explicit language handling, and best-effort finalization/reconciliation
provider speaker-cluster metadata and canonical speaker identity assignment so AssemblyAI diarization can flow into Omi speaker identity without pretending low-confidence matches are known people
provider usage ledgers, daily rollups, cost estimates, metrics, and Deepgram-vs-AssemblyAI evaluation tooling
rollout docs, local E2E scripts, and expanded backend/desktop/app tests
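
The chunking and overlap-merge item above can be pictured with a small sketch. Only the roughly 15 s chunk length comes from this PR. The 2 s overlap, the names Segment, chunk_offsets, and merge_chunks, and the merge rule (drop a segment that starts inside the span the previous chunk already covered) are assumptions for illustration, not the desktop client's actual logic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    start_ms: int  # start time (chunk-local on input, global after merging)
    end_ms: int    # end time, same time base as start_ms
    text: str


def chunk_offsets(total_ms: int, chunk_ms: int = 15_000, overlap_ms: int = 2_000) -> list[int]:
    """Return the global start offset of each overlapping chunk."""
    step = chunk_ms - overlap_ms
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    offsets = [0]
    # Keep adding chunks until the last one reaches the end of the recording.
    while offsets[-1] + chunk_ms < total_ms:
        offsets.append(offsets[-1] + step)
    return offsets


def merge_chunks(chunks: list[tuple[int, list[Segment]]]) -> list[Segment]:
    """Shift chunk-local segments to global time and drop overlap duplicates."""
    merged: list[Segment] = []
    covered_until = 0  # global time already covered by earlier chunks
    for offset_ms, segments in chunks:
        for seg in segments:
            start = seg.start_ms + offset_ms
            end = seg.end_ms + offset_ms
            if start < covered_until:
                continue  # assumed duplicate from the overlap region
            merged.append(Segment(start, end, seg.text))
        if merged:
            covered_until = max(covered_until, merged[-1].end_ms)
    return merged
```

A real merge would likely also compare text or word timings inside the overlap; this version only shows why per-chunk results need a global offset before they can be deduplicated.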

Intended Behavior

Realtime listen, streaming PTT, and voice-message paths stay on Deepgram.
AssemblyAI is eligible only for prerecorded-style workloads: sync, background, and postprocess.
Desktop batch is microphone-only and server-gated; BLE/non-mic sources continue through /v4/listen.
Desktop batch sends ~15s overlapping chunks, drops silent/noise-only chunks before upload, and does not stop recording just because ASR is slower than realtime.
Desktop batch uses the selected explicit language instead of multi to avoid AssemblyAI language-detection failures on short or quiet chunks.
Provider-local speaker labels are preserved as provider metadata; canonical identity is assigned only when Omi speaker matching has enough evidence.
AssemblyAI background is fail-closed once selected; other eligible prerecorded workloads can still fall back to Deepgram as configured.
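
A minimal sketch of the routing rules listed above, assuming a hypothetical Workload enum and provider_chain helper (neither name is taken from the PR's provider_service.py). It returns the providers to try in order: realtime paths stay on Deepgram, background is fail-closed on AssemblyAI once selected, and the other prerecorded workloads keep a Deepgram fallback.

```python
from enum import Enum


class Workload(str, Enum):
    # Hypothetical workload names for illustration.
    LISTEN = "listen"
    PTT = "ptt"
    VOICE_MESSAGE = "voice_message"
    SYNC = "sync"
    BACKGROUND = "background"
    POSTPROCESS = "postprocess"


# Only prerecorded-style workloads are eligible for AssemblyAI.
PRERECORDED = frozenset({Workload.SYNC, Workload.BACKGROUND, Workload.POSTPROCESS})


def provider_chain(workload, assemblyai_enabled, enabled_workloads):
    """Return the providers to try, in order, for one workload."""
    eligible = (
        assemblyai_enabled
        and workload in PRERECORDED
        and workload in enabled_workloads
    )
    if not eligible:
        return ["deepgram"]
    if workload is Workload.BACKGROUND:
        # Fail-closed: no Deepgram fallback once AssemblyAI is selected.
        return ["assemblyai"]
    # Other prerecorded workloads may fall back (assumed always configured here).
    return ["assemblyai", "deepgram"]
```

For example, provider_chain(Workload.BACKGROUND, True, PRERECORDED) yields only AssemblyAI, while the same call with Workload.SYNC also lists Deepgram as a fallback.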

Rollout / Config

AssemblyAI remains off by default.

Important knobs:

ASSEMBLYAI_API_KEY
ASSEMBLYAI_BACKGROUND_STT_ENABLED
ASSEMBLYAI_BACKGROUND_STT_WORKLOADS (sync,background,postprocess by default)
ASSEMBLYAI_STT_MODEL
ASSEMBLYAI_BASE_URL
ASSEMBLYAI_POLL_INTERVAL_SECONDS
ASSEMBLYAI_MAX_POLL_SECONDS
ASSEMBLYAI_SMOKE_AUDIO_URL

Rollback is config-only: set ASSEMBLYAI_BACKGROUND_STT_ENABLED=false or remove a workload from ASSEMBLYAI_BACKGROUND_STT_WORKLOADS.
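
The config-only rollback can be illustrated with a small parser for the two gating variables. The variable names and the default workload list come from this PR; the helper name assemblyai_workloads and the accepted truthy spellings are assumptions, and the backend's real parsing may differ.

```python
import os

# Default from the PR description: sync, background, postprocess.
DEFAULT_WORKLOADS = ("sync", "background", "postprocess")
TRUTHY = {"1", "true", "yes"}  # assumed spellings


def assemblyai_workloads(env=None):
    """Return the set of workloads routed to AssemblyAI (empty when disabled)."""
    env = os.environ if env is None else env
    flag = env.get("ASSEMBLYAI_BACKGROUND_STT_ENABLED", "false")
    if flag.strip().lower() not in TRUTHY:
        return frozenset()  # off by default, and the global rollback switch
    raw = env.get("ASSEMBLYAI_BACKGROUND_STT_WORKLOADS")
    if raw is None:
        return frozenset(DEFAULT_WORKLOADS)
    # Removing a name from the list rolls back just that workload.
    return frozenset(w.strip().lower() for w in raw.split(",") if w.strip())
```

Passing a plain dict instead of os.environ keeps the helper easy to exercise in unit tests.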

BYOK adds optional X-BYOK-AssemblyAI; existing Deepgram BYOK behavior is preserved when users have only a Deepgram key.
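
The BYOK rule above, combined with the commit note in this PR that a Deepgram BYOK key is preferred over Omi's server AssemblyAI key, might be sketched as follows. The function name, parameters, and return shape are hypothetical; in particular, the keys are passed as plain arguments rather than read from headers.

```python
from typing import Optional, Tuple


def select_provider(
    assemblyai_routed: bool,
    byok_assemblyai: Optional[str],
    byok_deepgram: Optional[str],
) -> Tuple[str, str]:
    """Return (provider, key_source) for a prerecorded workload."""
    if not assemblyai_routed:
        # Existing behavior: Deepgram, with the user's key when present.
        return ("deepgram", "byok" if byok_deepgram else "server")
    if byok_assemblyai:
        return ("assemblyai", "byok")
    if byok_deepgram:
        # Deepgram-only BYOK users stay on their own key instead of
        # being moved onto the server's AssemblyAI key.
        return ("deepgram", "byok")
    return ("assemblyai", "server")
```

The ordering matters: a user who has opted into BYOK never has audio sent under a server-owned key in this sketch.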

Notable Code Areas

Backend provider layer: backend/utils/stt/provider_service.py, backend/utils/stt/providers.py, backend/utils/stt/assemblyai_adapter.py, backend/utils/stt/deepgram_adapter.py
Desktop background API: backend/routers/desktop_background.py, backend/utils/conversations/desktop_background.py
Speaker identity and transcript normalization: backend/utils/stt/background_speaker_identity.py, backend/utils/stt/conversation_reconstructor.py, backend/models/transcript_segment.py
Provider observability/evaluation: backend/database/transcription_provider_usage.py, backend/utils/stt/provider_evaluation.py, backend/scripts/stt/provider_comparison_gate.py
Desktop client flow: desktop/Desktop/Sources/AppState.swift, desktop/Desktop/Sources/TranscriptionService.swift, desktop/Desktop/Sources/BackgroundTranscription/*
Desktop/BYOK settings: desktop/Desktop/Sources/MainWindow/Pages/SettingsPage.swift, desktop/Desktop/Sources/OnboardingBYOKStepView.swift
Docs and E2E: docs/doc/developer/backend/assemblyai_background_rollout.mdx, docs/doc/developer/backend/listen_pusher_pipeline.mdx, scripts/desktop_assemblyai_e2e.py

Validation

Automated checks run on this branch:

backend/venv/bin/python -m pytest tests/unit/test_assemblyai_adapter.py tests/unit/test_desktop_background_transcribe.py tests/unit/test_background_provider_service.py tests/unit/test_byok_assemblyai_routing.py tests/unit/test_rate_limiting.py::TestRouterPolicyMapping::test_all_router_policies_exist -v -> 41 passed, 2 skipped
xcrun swift test -c debug --package-path Desktop --filter BackgroundTranscription -> 15 passed
xcrun swift test -c debug --package-path Desktop --filter APIClientRoutingTests/testFinishBackgroundConversationRoutesToExplicitPythonConversation -> 1 passed
xcrun swift test -c debug --package-path Desktop --filter ListenProtocolTests -> 25 passed
.git/hooks/pre-commit -> Python formatting clean
git diff --check / git diff --cached --check -> clean

Live desktop evidence with Omi Dev (com.omi.desktop-dev) against local backend with AssemblyAI enabled:

conversation ba94a0a9-1af8-4d51-b98b-0a0f269bef65, local session 56
batch path used language=en
uploaded a 480000-byte chunk at 0ms, completed with provider=assemblyai, segments=1, run id 5849a05c-1c65-4072-915c-4e7959fad97a
no queue overflow, no Deepgram fallback, and no quiet-room empty-chunk storm after the speech window
session reconciled to the backend conversation and the conversation list refreshed/synced
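
As a sanity check on the evidence above, the 480000-byte upload is consistent with one 15 s chunk if the audio is 16 kHz, mono, 16-bit PCM. That format is an assumption (the PR does not state it); the arithmetic below only shows that the numbers line up.

```python
def pcm_duration_seconds(num_bytes, sample_rate_hz=16_000, channels=1, bytes_per_sample=2):
    """Duration of raw PCM audio; the default format is an assumption."""
    return num_bytes / (sample_rate_hz * channels * bytes_per_sample)


print(pcm_duration_seconds(480_000))  # 15.0
```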

Manual Test Checklist

Before broad rollout, verify:

backend canary with ASSEMBLYAI_BACKGROUND_STT_ENABLED=true, ASSEMBLYAI_API_KEY set, and ASSEMBLYAI_BACKGROUND_STT_WORKLOADS=sync,background,postprocess
scripts/desktop_assemblyai_e2e.py --background-batch --api http://127.0.0.1:8080 --language en produces non-empty AssemblyAI segments
Omi Dev Audio Recording with Batch transcription on produces ~15s chunks, live transcript growth, no overflow alert, and no Deepgram fallback storm
quiet-room session does not upload repeated empty chunks
stopping desktop recording creates/reconciles a conversation in the Conversations tab
/v4/listen, PTT, and voice-message flows still route to Deepgram
BYOK behavior with Deepgram-only key and optional AssemblyAI key

Caveats / Reviewer Notes

AssemblyAI and Deepgram can differ in segment boundaries, diarization, language handling, and timestamps; the provider comparison tooling is included to make those differences measurable before rollout expansion.
Low-confidence speaker identity is intentionally left unknown rather than mapped to a fake person or speaker_id=0.
Provider ledger payloads reject raw audio/transcript payloads by design; run records should carry metrics and artifact refs only.
In one local desktop run, the explicit finish POST returned HTTP 500 during post-processing, but the existing retry service reconciled the session to the backend conversation. The explicit finish route is unit-covered; this did not reproduce as an ASR/chunking failure.
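
The low-confidence rule in the notes above can be sketched like this. ClusterMatch, canonical_identity, and the 0.8 threshold are hypothetical and are not taken from background_speaker_identity.py; the point is only that a weak or missing match yields no identity rather than a placeholder such as speaker_id=0.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterMatch:
    provider_label: str        # provider-local label, e.g. "A"; kept as metadata
    person_id: Optional[str]   # best candidate from speaker matching, if any
    score: float               # match confidence in [0, 1]


def canonical_identity(match: ClusterMatch, min_score: float = 0.8) -> Optional[str]:
    """Return a person id only when the evidence is strong enough."""
    if match.person_id is None or match.score < min_score:
        return None  # stay unknown; never map to a fake person or id 0
    return match.person_id
```

Returning None keeps "unknown" distinguishable from any real identifier, so downstream code cannot mistake a low-confidence cluster for a known person.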

Update desktop PTT tests to patch stt_provider_service after chat.py refactor, remove the unused duration parameter from postprocess_words, and skip self-voice review candidates when no identity assignment is available.

BYOK users can supply a fifth key, for AssemblyAI, covering sync/background/postprocess workloads; when AssemblyAI routing is enabled but no AssemblyAI key is present, Deepgram BYOK is used instead of Omi's server AssemblyAI key. Co-authored-by: Cursor <cursoragent@cursor.com>

Wire desktop Audio Recording to POST /v2/desktop/background-transcribe via chunker/session queue, add backend endpoints and e2e script, and include an agent prompt for multi-chunk non-desktop E2E verification. Co-authored-by: Cursor <cursoragent@cursor.com>

Stabilize desktop cloud batch recording by using 15s cloud chunks, nonfatal backpressure, speech activity gating, explicit batch language, and resilient ASR queue draining. Add explicit desktop background conversation finish routing, AssemblyAI background fail-closed behavior, route/rate-limit coverage, and developer docs for the batch path. Validation: backend focused pytest suite 41 passed, 2 skipped; Swift BackgroundTranscription 15 passed; APIClient finish route 1 passed; ListenProtocol 25 passed; git diff --check clean; live Omi Dev session uploaded 15s AssemblyAI chunks, suppressed quiet-room chunks, and reconciled conversation ba94a0a9-1af8-4d51-b98b-0a0f269bef65.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1ebf27a4d

chatgpt-codex-connector · 2026-05-21T16:46:55Z

+    if isStartingTranscription {
+      cloudBackgroundStartTask?.cancel()
+      cloudBackgroundStartTask = nil
+      isStartingTranscription = false
+      isCloudBackgroundTranscription = false
+      cloudBackgroundSession = nil
+      cloudBackgroundConversationId = nil
+      AssistantSettings.shared.transcriptionEnabled = false
+      return


Stop capture when canceling startup

When stopTranscription() is called during startup, this branch cancels cloudBackgroundStartTask and returns without stopping active capture or resetting isTranscribing. In startCloudBackgroundTranscription, isTranscribing is set to true before startup completes (around line 1627), so if the user taps stop during that window, the app can keep recording until a second stop (or until other cleanup happens), which is a privacy/UX regression for the cloud background path.

Run Omi speaker identity matching on AssemblyAI desktop background chunks before applying global chunk offsets, update provider run identity metrics, and cover the Omi user match path in desktop background transcription tests. Validation: backend focused pytest suite 42 passed, 2 skipped; speaker identity focused suite 23 passed; pre-commit Python formatting clean; git diff --check clean.

David Zhang and others added 26 commits May 21, 2026 07:53

Add provider-neutral transcript speaker metadata

45d9bba

Introduce Deepgram STT provider facade

487d7ba

Add conversation reconstructor for STT results

9bb1088

Add transcription provider usage ledger

790ad03

Route background transcription through provider service

1117eb8

Implement cluster-scoped speaker identity

79bfafc

Add client support for canonical speaker metadata

1770f34

Add AssemblyAI background STT provider

85f1c33

Add STT provider comparison gate

f9af9c6

Fix provider fallback metric direction

25c4a01

Add provider transcription cost estimates

e6e4e99

Tighten transcription provider retry metrics

373ff29

Add self voice review queue backend

77764e0

Document AssemblyAI background rollout readiness

b57ccd8

Address provider instrumentation review blockers

5db29cb

Hoist provider service imports

105502b

Apply CI Dart formatting

03b9f17

Match CI Dart formatter

4268628

Match CI Python formatter

225628b

Update AssemblyAI transcript API usage

ba1a5ab

Stabilize backend regression tests

165092d

Add AssemblyAI background batch E2E coverage

ca55053

Git-on-my-level changed the title from "Fix desktop AssemblyAI background batch transcription" to "Add AssemblyAI background STT with desktop cloud batch and speaker identity" on May 21, 2026

Merge upstream main to resolve changelog conflict

5cbfc55

chatgpt-codex-connector Bot reviewed May 21, 2026

Git-on-my-level and others added 17 commits May 21, 2026 13:15

Isolate desktop AssemblyAI e2e user

23cb244

Fix lint CI for Next 16

3bfe844

Finalize prerecorded STT provider policy

949ba39

Make desktop background batch resilient

be5a8f9

Make desktop background chunks idempotent

4efe0e6

Add AssemblyAI speaker identity diagnostics

83c53f6

Validate AssemblyAI background E2E

9cac76b

Fix AssemblyAI BYOK routing test isolation

908e483

Improve AssemblyAI speaker cluster handling

16870f3

Centralize AssemblyAI background provider policy

f9b931f

Verify silence-aware background chunking

2473a3e

Harden background speaker reconciliation

af35650

Add offline STT provider readiness gate

2ed8dee

Add AssemblyAI rollout observability

b09b55c

docs: add AssemblyAI background canary readiness plan

0a85375

Make AssemblyAI the background default policy

bc143cd

Fix STT provider cost assumptions

4ba2dcf

Git-on-my-level wants to merge 45 commits into BasedHardware:main from Git-on-my-level:omi-groq
