diff --git a/.gitignore b/.gitignore index f0e2ee37..bed69243 100644 --- a/.gitignore +++ b/.gitignore @@ -13,6 +13,10 @@ build/ *.pte *.ptd +# DMG files (should be distributed via GitHub Releases) +*.dmg +*.dmg.* + # Xcode xcuserdata/ .build/ @@ -29,3 +33,7 @@ local.properties .externalNativeBuild .cxx *.aar + +node_modules/ +__pycache__/ +.cursor/ diff --git a/LICENSE b/LICENSE index d79f406a..5651f756 100644 --- a/LICENSE +++ b/LICENSE @@ -2,7 +2,7 @@ BSD License For "ExecuTorch" software -Copyright (c) Meta Platforms, Inc. and affiliates. +Copyright (c) Meta Platforms, Inc. and affiliates. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: diff --git a/apps/macos/VoxtralRealtimeApp/README.md b/apps/macos/VoxtralRealtimeApp/README.md new file mode 100644 index 00000000..ded737eb --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/README.md @@ -0,0 +1,269 @@ +# Voxtral Realtime + +A native macOS showcase app for [Voxtral-Mini-4B-Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch) — Mistral's on-device real-time speech transcription model. All inference runs locally on Apple Silicon via [ExecuTorch](https://github.com/pytorch/executorch) with Metal acceleration. No cloud, no network required.
+ +https://github.com/user-attachments/assets/0ce4fb05-76b2-47e4-85e2-6f49b4e9e34e + +## Features + +- **Live transcription** — real-time token streaming with audio waveform visualization +- **System-wide dictation** — press `Ctrl+Space` in any app to transcribe speech and auto-paste the result +- **Model preloading** — load the model once, transcribe instantly across sessions +- **Pause / resume** — pause and resume within the same session without losing context +- **Session history** — searchable history with rename, copy, and delete +- **Silence detection** — dictation auto-stops after 2 seconds of silence +- **Self-contained DMG** — runner binary, model weights, and runtime libraries all bundled + +## Download + +**End users**: download the latest DMG from [GitHub Releases](https://github.com/pytorch/executorch-examples/releases). The DMG is self-contained — runner binary, model weights (~6.2 GB), and runtime libraries are all bundled. No terminal, no Python, no model downloads required. + +### Requirements + +- macOS 14.0+ (Sonoma) +- Apple Silicon (M1/M2/M3/M4) +- ~7 GB disk space + +## Usage + +### In-app transcription + +1. Click **Load Model** on the welcome screen (takes ~30s on first load) +2. Click **Start Transcription** (or `Cmd+Shift+R`) +3. Speak — text appears in real time +4. **Pause** (`Cmd+.`) / **Resume** (`Cmd+Shift+R`) within the same session +5. **Done** (`Cmd+Return`) to save the session to history + +### System-wide dictation + +1. Make sure the model is loaded +2. Focus any text field in any app (Notes, Slack, browser, etc.) +3. Press **`Ctrl+Space`** — a floating overlay appears with a waveform +4. Speak — live transcribed text appears in the overlay +5. Press **`Ctrl+Space`** again to stop, or wait for 2 seconds of silence +6. 
The transcribed text is automatically pasted into the focused text field + +### Keyboard shortcuts + +| Shortcut | Action | +|---|---| +| `Cmd+Shift+R` | Start / Resume transcription | +| `Cmd+.` | Pause transcription | +| `Cmd+Return` | End session and save | +| `Cmd+Shift+C` | Copy transcript | +| `Cmd+Shift+U` | Unload model | +| `Ctrl+Space` | Toggle system-wide dictation | +| `Cmd+,` | Settings | + +--- + +## Build from Source + +Model files (~6.2 GB) are not included in the git repo. The entire build chain — ExecuTorch installation, runner compilation, model downloading — runs inside a conda environment with Metal (MPS) backend support. The build script bundles everything into the `.app` so the resulting DMG is self-contained. + +### Quick build + +If you already have the conda env set up, ExecuTorch built, and models downloaded: + +```bash +conda activate et-metal +cd apps/macos/VoxtralRealtimeApp +./scripts/build.sh +``` + +Or to download the models and build in one step: + +```bash +conda activate et-metal +cd apps/macos/VoxtralRealtimeApp +./scripts/build.sh --download-models +``` + +The script checks that you're in a conda env, validates all prerequisites, builds the app, and creates a DMG with all models bundled. Run `./scripts/build.sh --help` for all options. + +### Full setup (from scratch) + +#### Prerequisites + +- macOS 14.0+ (Sonoma) +- Apple Silicon (M1/M2/M3/M4) +- Xcode 16+ +- [Conda](https://docs.conda.io/en/latest/miniconda.html) (Miniconda or Anaconda) + +```bash +brew install xcodegen libomp +``` + +#### 1. Create and activate the conda environment + +All subsequent steps must run inside this environment. The conda env isolates the Python packages and C++ build artifacts that ExecuTorch needs. + +```bash +conda create -n et-metal python=3.12 -y +conda activate et-metal +``` + +> You must run `conda activate et-metal` in every new terminal session before building or running the runner. + +#### 2. 
Install ExecuTorch with Metal backend + +```bash +export EXECUTORCH_PATH="$HOME/executorch" +git clone https://github.com/pytorch/executorch/ ${EXECUTORCH_PATH} +cd ${EXECUTORCH_PATH} +EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_executorch.sh +``` + +This installs ExecuTorch with the Metal (MPS) backend enabled, which is required for Apple Silicon GPU acceleration. If you run into installation problems, see the official [Voxtral Realtime installation guide](https://github.com/pytorch/executorch/tree/main/examples/models/voxtral_realtime). + +#### 3. Build the Voxtral Realtime runner + +```bash +cd ${EXECUTORCH_PATH} +make voxtral_realtime-metal +``` + +The runner binary will be at: +``` +${EXECUTORCH_PATH}/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner +``` + +#### 4. Install Python packages + +```bash +pip install huggingface_hub sounddevice +``` + +- `huggingface_hub` — to download model artifacts from Hugging Face +- `sounddevice` — for the CLI mic streaming test script + +#### 5. Download model artifacts + +```bash +export LOCAL_FOLDER="$HOME/voxtral_realtime_quant_metal" +hf download mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch --local-dir ${LOCAL_FOLDER} +``` + +This downloads the model artifacts (~6.2 GB total), including: +- `model-metal-int4.pte` — 4-bit quantized model (Metal) +- `preprocessor.pte` — audio-to-mel spectrogram preprocessor +- `tekken.json` — tokenizer + +Hugging Face repo: [`mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch) + +#### 6. 
Test with CLI (optional) + +Verify the runner works before building the app: + +```bash +export DYLD_LIBRARY_PATH=/usr/lib:$(brew --prefix libomp)/lib +export CMAKE_RUNNER="${EXECUTORCH_PATH}/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner" +export LOCAL_FOLDER="$HOME/voxtral_realtime_quant_metal" + +cd ${LOCAL_FOLDER} && chmod +x stream_audio.py +./stream_audio.py | \ + ${CMAKE_RUNNER} \ + --model_path ./model-metal-int4.pte \ + --tokenizer_path ./tekken.json \ + --preprocessor_path ./preprocessor.pte \ + --mic +``` + +#### 7. Build the app and create DMG + +```bash +cd apps/macos/VoxtralRealtimeApp +./scripts/build.sh +``` + +Or build manually: + +```bash +xcodegen generate +xcodebuild -project VoxtralRealtime.xcodeproj -scheme VoxtralRealtime -configuration Release -derivedDataPath build build +./scripts/create_dmg.sh "./build/Build/Products/Release/Voxtral Realtime.app" "./VoxtralRealtime.dmg" +``` + +The post-compile build script automatically bundles the runner binary, `libomp.dylib`, and all model files into `.app/Contents/Resources/`. The `create_dmg.sh` script validates that all required files are present before creating the DMG. + +The resulting DMG is self-contained — end users just drag to Applications and run. 
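The `--mic` pipeline in step 6 works because the runner consumes raw 16 kHz mono float32 little-endian PCM on stdin. The real `stream_audio.py` ships with the model artifacts; the following is only a rough sketch of the same contract (function names and block size here are illustrative, not the actual script):

```python
# Minimal sketch of a producer compatible with voxtral_realtime_runner's
# stdin contract: raw 16 kHz, mono, float32 little-endian PCM.
import struct
import sys

SAMPLE_RATE = 16_000  # Hz, mono


def to_f32le(samples):
    """Pack an iterable of float samples as little-endian float32 bytes."""
    samples = list(samples)
    return struct.pack(f"<{len(samples)}f", *samples)


def stream_microphone(block_ms=125):
    """Capture the default microphone and write f32le PCM to stdout.

    Requires the optional `sounddevice` package (pip install sounddevice),
    so the import is deferred until the function is actually called.
    """
    import sounddevice as sd

    frames = SAMPLE_RATE * block_ms // 1000
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
        while True:
            block, _overflowed = stream.read(frames)
            sys.stdout.buffer.write(to_f32le(block[:, 0]))  # mono channel 0
            sys.stdout.buffer.flush()
```

Wiring `stream_microphone()` under a `__main__` guard and piping its stdout into the runner reproduces the step 6 command line.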
+ +--- + +## Architecture + +``` +VoxtralRealtimeApp +├── TranscriptStore (@Observable, @MainActor) +│ ├── SessionState: idle → loading → transcribing ⇆ paused → idle +│ ├── ModelState: unloaded → loading → ready +│ └── RunnerBridge (actor) +│ ├── Process (voxtral_realtime_runner) +│ │ ├── stdin ← raw 16kHz mono f32le PCM +│ │ ├── stdout → transcript tokens +│ │ └── stderr → status messages +│ └── AudioEngine (actor) +│ └── AVAudioEngine → format conversion → pipe +├── DictationManager (@Observable, @MainActor) +│ ├── Global hotkey (Carbon RegisterEventHotKey) +│ ├── DictationPanel (NSPanel, non-activating, floating) +│ └── Paste via CGEvent (Cmd+V to frontmost app) +└── Views (SwiftUI) +``` + +## Troubleshooting + +### Auto-paste doesn't work in dictation mode + +The app needs **Accessibility** permission to simulate `Cmd+V` in other apps. + +1. Open **System Settings → Privacy & Security → Accessibility** +2. Click the **+** button and add `Voxtral Realtime.app` +3. If already listed, toggle it **off and back on** +4. **Quit and relaunch** the app — macOS caches the trust state at process launch + +When running Debug builds from Xcode, each rebuild produces a new binary signature. macOS tracks Accessibility trust per binary identity, so you may need to re-grant permission after rebuilding. To avoid this: +- Remove the old entry from Accessibility settings before re-adding +- Or run the Release build for testing dictation + +Even if Accessibility isn't granted, the transcribed text is always copied to the clipboard — you can paste manually with `Cmd+V`. 
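For intuition, the dictation overlay's 2-second silence auto-stop amounts to an energy gate over incoming PCM frames. Below is a hedged sketch of that idea; the default threshold value is an assumption mirroring the app's configurable `silenceThreshold` / `silenceTimeout` preferences, not the actual Swift implementation:

```python
# Sketch of RMS-based silence detection over float32 PCM frames.
# threshold/timeout defaults are illustrative assumptions.
import math


class SilenceDetector:
    def __init__(self, threshold=0.01, timeout_s=2.0):
        self.threshold = threshold  # RMS level below which a frame counts as silent
        self.timeout_s = timeout_s  # auto-stop after this much continuous silence
        self.silent_for = 0.0

    def feed(self, frame, frame_duration_s):
        """Consume one frame; return True when dictation should auto-stop."""
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < self.threshold:
            self.silent_for += frame_duration_s
        else:
            self.silent_for = 0.0  # any speech resets the countdown
        return self.silent_for >= self.timeout_s
```

With 100 ms frames, twenty consecutive silent frames accumulate the 2 s timeout and trip the stop condition, while a single loud frame resets it.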
+ +### Model fails to load / runner crashes + +Check Console.app (filter by `VoxtralRealtime`) for diagnostics: + +- `"Runner stderr: ..."` — shows the runner's internal loading messages +- `"Runner exited with code N"` — non-zero exit indicates a crash + +Common causes: +- **Missing model files** — verify `~/voxtral_realtime_quant_metal/` contains all three files (or check app bundle Resources) +- **libomp not found** — run `brew install libomp` and rebuild +- **Runner not built** — run `make voxtral_realtime-metal` in the ExecuTorch directory + +### Microphone permission denied + +The app requests microphone access on first use. If denied: + +1. Open **System Settings → Privacy & Security → Microphone** +2. Enable `Voxtral Realtime` +3. **Quit and relaunch** the app — macOS caches permission grants per process lifetime + +### Permission prompts don't appear (stale TCC entries) + +If you've built or installed the app multiple times (Debug builds, Release builds, DMG installs), macOS may have accumulated multiple permission entries for the same bundle ID. Reset them to get a clean slate: + +```bash +tccutil reset Microphone org.pytorch.executorch.VoxtralRealtime +tccutil reset Accessibility org.pytorch.executorch.VoxtralRealtime +``` + +Then quit and relaunch the app. You should see fresh permission prompts. + +### No transcription output (waveform animates but no text) + +This means audio is captured but the runner isn't producing tokens. Check: + +1. Console.app for `"Audio written: N bytes"` — confirms data flows to the runner +2. Console.app for `"Pipe write failed"` — indicates broken pipe +3. 
The runner may need a few seconds of speech before producing the first token diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime.xcodeproj/project.pbxproj b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime.xcodeproj/project.pbxproj new file mode 100644 index 00000000..7b9bfb75 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime.xcodeproj/project.pbxproj @@ -0,0 +1,446 @@ +// !$*UTF8*$! +{ + archiveVersion = 1; + classes = { + }; + objectVersion = 77; + objects = { + +/* Begin PBXBuildFile section */ + 00C95410EB7173CB8EC29546 /* DictationManager.swift in Sources */ = {isa = PBXBuildFile; fileRef = 085054419A1BAA91DC852C48 /* DictationManager.swift */; }; + 1850513C3735A844666D0217 /* DictationOverlayView.swift in Sources */ = {isa = PBXBuildFile; fileRef = 54933FCE568E7DC5C512B5D8 /* DictationOverlayView.swift */; }; + 2C0D9096EB59C859E877427A /* WelcomeView.swift in Sources */ = {isa = PBXBuildFile; fileRef = D2245F740610F505B2D2905E /* WelcomeView.swift */; }; + 2C387248936D9F2313B8E2FB /* AudioEngine.swift in Sources */ = {isa = PBXBuildFile; fileRef = D14DCD3E95F36CEACCEF4335 /* AudioEngine.swift */; }; + 437CDA8161D03BC53BF5ACAA /* SettingsView.swift in Sources */ = {isa = PBXBuildFile; fileRef = B977C4A3EF8EE9770D939E98 /* SettingsView.swift */; }; + 4E3EFCC72CC492D94CC9494E /* Session.swift in Sources */ = {isa = PBXBuildFile; fileRef = E0D0245890DF77D480572858 /* Session.swift */; }; + 4EDEB7964D71C06220AE2CE5 /* Preferences.swift in Sources */ = {isa = PBXBuildFile; fileRef = 74547233C8BECECCBCAC2E8C /* Preferences.swift */; }; + 50B57C3652FDC3D30EE08D50 /* SidebarView.swift in Sources */ = {isa = PBXBuildFile; fileRef = 68D216F7ED4F81E147F2CFEE /* SidebarView.swift */; }; + 51009A5D18D14D7B19A77645 /* DictationPanel.swift in Sources */ = {isa = PBXBuildFile; fileRef = 97AF5A7A49BCC1D9F9404C02 /* DictationPanel.swift */; }; + 54B04C6AFEC0BEC16D5A427B /* RunnerBridge.swift in Sources */ = {isa = PBXBuildFile; fileRef = 
65D90AECAFB4D759F980E25A /* RunnerBridge.swift */; }; + 5E5BEB5668CE4A42F6B67569 /* ContentView.swift in Sources */ = {isa = PBXBuildFile; fileRef = 9069BFAFAFDDBB137F34675E /* ContentView.swift */; }; + 6ABE5D7F989ACBF0E68BF8AD /* Assets.xcassets in Resources */ = {isa = PBXBuildFile; fileRef = E4B018A732C7D539547855A3 /* Assets.xcassets */; }; + 710AF905B60EB23C212AAC30 /* HealthCheck.swift in Sources */ = {isa = PBXBuildFile; fileRef = 6CAB3C82C8E7393E06E6EE6E /* HealthCheck.swift */; }; + 75FD2E801951086DFE9E28E3 /* AudioLevelView.swift in Sources */ = {isa = PBXBuildFile; fileRef = B5282DE2DFDC7D992AC7D81D /* AudioLevelView.swift */; }; + 83C2211EB3926FCE3AE55BA8 /* TranscriptView.swift in Sources */ = {isa = PBXBuildFile; fileRef = 9539C4044A50A585F04687CA /* TranscriptView.swift */; }; + 9D0149A93C0234587B0A7655 /* RecordingControls.swift in Sources */ = {isa = PBXBuildFile; fileRef = BE9D72CEB451260E46996613 /* RecordingControls.swift */; }; + A23029B5977019EBC0CC217F /* SetupGuideView.swift in Sources */ = {isa = PBXBuildFile; fileRef = A7DC59AEFE4416E91D2A001F /* SetupGuideView.swift */; }; + B4086DFD7477802852F31D0E /* VoxtralRealtimeApp.swift in Sources */ = {isa = PBXBuildFile; fileRef = D1FF01EEFDD18E056509CEB0 /* VoxtralRealtimeApp.swift */; }; + BA4D276D05BDEC9EE49C8979 /* RunnerError.swift in Sources */ = {isa = PBXBuildFile; fileRef = 887F4BCA7C5EBE3951E6BB56 /* RunnerError.swift */; }; + BDC0CBA04D99E79D5EBA5DC6 /* TranscriptStore.swift in Sources */ = {isa = PBXBuildFile; fileRef = 26CC8EBA770C5AE5AE9FE5AF /* TranscriptStore.swift */; }; + DB7FD45624EA9EECDA9DD799 /* ErrorBannerView.swift in Sources */ = {isa = PBXBuildFile; fileRef = A2CB74DFC873A590526E6138 /* ErrorBannerView.swift */; }; +/* End PBXBuildFile section */ + +/* Begin PBXFileReference section */ + 085054419A1BAA91DC852C48 /* DictationManager.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = DictationManager.swift; sourceTree = ""; }; + 
1FB850226CD44666C82C7740 /* VoxtralRealtime.entitlements */ = {isa = PBXFileReference; lastKnownFileType = text.plist.entitlements; path = VoxtralRealtime.entitlements; sourceTree = ""; }; + 26CC8EBA770C5AE5AE9FE5AF /* TranscriptStore.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = TranscriptStore.swift; sourceTree = ""; }; + 26D5AB3AA91015848E6CC122 /* VoxtralRealtime.app */ = {isa = PBXFileReference; explicitFileType = wrapper.application; includeInIndex = 0; path = VoxtralRealtime.app; sourceTree = BUILT_PRODUCTS_DIR; }; + 54933FCE568E7DC5C512B5D8 /* DictationOverlayView.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = DictationOverlayView.swift; sourceTree = ""; }; + 65D90AECAFB4D759F980E25A /* RunnerBridge.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = RunnerBridge.swift; sourceTree = ""; }; + 68D216F7ED4F81E147F2CFEE /* SidebarView.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = SidebarView.swift; sourceTree = ""; }; + 6CAB3C82C8E7393E06E6EE6E /* HealthCheck.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = HealthCheck.swift; sourceTree = ""; }; + 74547233C8BECECCBCAC2E8C /* Preferences.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = Preferences.swift; sourceTree = ""; }; + 887F4BCA7C5EBE3951E6BB56 /* RunnerError.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = RunnerError.swift; sourceTree = ""; }; + 9069BFAFAFDDBB137F34675E /* ContentView.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = ContentView.swift; sourceTree = ""; }; + 9539C4044A50A585F04687CA /* TranscriptView.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = TranscriptView.swift; sourceTree = ""; }; + 97AF5A7A49BCC1D9F9404C02 /* DictationPanel.swift */ = {isa = PBXFileReference; lastKnownFileType = 
sourcecode.swift; path = DictationPanel.swift; sourceTree = ""; }; + A2CB74DFC873A590526E6138 /* ErrorBannerView.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = ErrorBannerView.swift; sourceTree = ""; }; + A7DC59AEFE4416E91D2A001F /* SetupGuideView.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = SetupGuideView.swift; sourceTree = ""; }; + B5282DE2DFDC7D992AC7D81D /* AudioLevelView.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = AudioLevelView.swift; sourceTree = ""; }; + B977C4A3EF8EE9770D939E98 /* SettingsView.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = SettingsView.swift; sourceTree = ""; }; + BE9D72CEB451260E46996613 /* RecordingControls.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = RecordingControls.swift; sourceTree = ""; }; + C7C0FF6D0B3EC71D1E39E9A8 /* Info.plist */ = {isa = PBXFileReference; lastKnownFileType = text.plist; path = Info.plist; sourceTree = ""; }; + D14DCD3E95F36CEACCEF4335 /* AudioEngine.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = AudioEngine.swift; sourceTree = ""; }; + D1FF01EEFDD18E056509CEB0 /* VoxtralRealtimeApp.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = VoxtralRealtimeApp.swift; sourceTree = ""; }; + D2245F740610F505B2D2905E /* WelcomeView.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = WelcomeView.swift; sourceTree = ""; }; + E0D0245890DF77D480572858 /* Session.swift */ = {isa = PBXFileReference; lastKnownFileType = sourcecode.swift; path = Session.swift; sourceTree = ""; }; + E4B018A732C7D539547855A3 /* Assets.xcassets */ = {isa = PBXFileReference; lastKnownFileType = folder.assetcatalog; path = Assets.xcassets; sourceTree = ""; }; +/* End PBXFileReference section */ + +/* Begin PBXGroup section */ + 30572FD24E5528E224F7ED10 /* Utilities */ = { + isa = 
PBXGroup; + children = ( + 887F4BCA7C5EBE3951E6BB56 /* RunnerError.swift */, + ); + path = Utilities; + sourceTree = ""; + }; + 34E1BE9EC8ACD121D7795352 /* VoxtralRealtime */ = { + isa = PBXGroup; + children = ( + C7C0FF6D0B3EC71D1E39E9A8 /* Info.plist */, + 1FB850226CD44666C82C7740 /* VoxtralRealtime.entitlements */, + D1FF01EEFDD18E056509CEB0 /* VoxtralRealtimeApp.swift */, + F8D2535D13C2EF6E75B710DD /* Models */, + E26F854C184C60BB8A156578 /* Resources */, + D76E47F26C3F45C2FDFD34DF /* Services */, + 30572FD24E5528E224F7ED10 /* Utilities */, + C452E644F7EA840AFBDB43A4 /* Views */, + ); + path = VoxtralRealtime; + sourceTree = ""; + }; + C452E644F7EA840AFBDB43A4 /* Views */ = { + isa = PBXGroup; + children = ( + B5282DE2DFDC7D992AC7D81D /* AudioLevelView.swift */, + 9069BFAFAFDDBB137F34675E /* ContentView.swift */, + 54933FCE568E7DC5C512B5D8 /* DictationOverlayView.swift */, + 97AF5A7A49BCC1D9F9404C02 /* DictationPanel.swift */, + A2CB74DFC873A590526E6138 /* ErrorBannerView.swift */, + BE9D72CEB451260E46996613 /* RecordingControls.swift */, + B977C4A3EF8EE9770D939E98 /* SettingsView.swift */, + A7DC59AEFE4416E91D2A001F /* SetupGuideView.swift */, + 68D216F7ED4F81E147F2CFEE /* SidebarView.swift */, + 9539C4044A50A585F04687CA /* TranscriptView.swift */, + D2245F740610F505B2D2905E /* WelcomeView.swift */, + ); + path = Views; + sourceTree = ""; + }; + C62DD9BE8818F70C8044D7FF /* Products */ = { + isa = PBXGroup; + children = ( + 26D5AB3AA91015848E6CC122 /* VoxtralRealtime.app */, + ); + name = Products; + sourceTree = ""; + }; + D76E47F26C3F45C2FDFD34DF /* Services */ = { + isa = PBXGroup; + children = ( + D14DCD3E95F36CEACCEF4335 /* AudioEngine.swift */, + 085054419A1BAA91DC852C48 /* DictationManager.swift */, + 6CAB3C82C8E7393E06E6EE6E /* HealthCheck.swift */, + 65D90AECAFB4D759F980E25A /* RunnerBridge.swift */, + ); + path = Services; + sourceTree = ""; + }; + E26F854C184C60BB8A156578 /* Resources */ = { + isa = PBXGroup; + children = ( + E4B018A732C7D539547855A3 
/* Assets.xcassets */, + ); + path = Resources; + sourceTree = ""; + }; + E7943F393DA257C006D6411D = { + isa = PBXGroup; + children = ( + 34E1BE9EC8ACD121D7795352 /* VoxtralRealtime */, + C62DD9BE8818F70C8044D7FF /* Products */, + ); + sourceTree = ""; + }; + F8D2535D13C2EF6E75B710DD /* Models */ = { + isa = PBXGroup; + children = ( + 74547233C8BECECCBCAC2E8C /* Preferences.swift */, + E0D0245890DF77D480572858 /* Session.swift */, + 26CC8EBA770C5AE5AE9FE5AF /* TranscriptStore.swift */, + ); + path = Models; + sourceTree = ""; + }; +/* End PBXGroup section */ + +/* Begin PBXNativeTarget section */ + B8C73BD33D08CD15C03EC05D /* VoxtralRealtime */ = { + isa = PBXNativeTarget; + buildConfigurationList = 6C9DE1DB881A58FAD15232B9 /* Build configuration list for PBXNativeTarget "VoxtralRealtime" */; + buildPhases = ( + 590D6B292BC0BAAD0A99F381 /* Sources */, + 5A20038974E50FA223F732F2 /* Bundle Runner & Model Artifacts */, + 8379A0BD3744792FFBD2B0C8 /* Resources */, + ); + buildRules = ( + ); + dependencies = ( + ); + name = VoxtralRealtime; + packageProductDependencies = ( + ); + productName = VoxtralRealtime; + productReference = 26D5AB3AA91015848E6CC122 /* VoxtralRealtime.app */; + productType = "com.apple.product-type.application"; + }; +/* End PBXNativeTarget section */ + +/* Begin PBXProject section */ + E0C3063C8408EF5D145A2309 /* Project object */ = { + isa = PBXProject; + attributes = { + BuildIndependentTargetsInParallel = YES; + LastUpgradeCheck = 1600; + }; + buildConfigurationList = DD10F8C9604ABCBFD61FA78C /* Build configuration list for PBXProject "VoxtralRealtime" */; + compatibilityVersion = "Xcode 14.0"; + developmentRegion = en; + hasScannedForEncodings = 0; + knownRegions = ( + Base, + en, + ); + mainGroup = E7943F393DA257C006D6411D; + minimizedProjectReferenceProxies = 1; + preferredProjectObjectVersion = 77; + projectDirPath = ""; + projectRoot = ""; + targets = ( + B8C73BD33D08CD15C03EC05D /* VoxtralRealtime */, + ); + }; +/* End PBXProject section 
*/ + +/* Begin PBXResourcesBuildPhase section */ + 8379A0BD3744792FFBD2B0C8 /* Resources */ = { + isa = PBXResourcesBuildPhase; + buildActionMask = 2147483647; + files = ( + 6ABE5D7F989ACBF0E68BF8AD /* Assets.xcassets in Resources */, + ); + runOnlyForDeploymentPostprocessing = 0; + }; +/* End PBXResourcesBuildPhase section */ + +/* Begin PBXShellScriptBuildPhase section */ + 5A20038974E50FA223F732F2 /* Bundle Runner & Model Artifacts */ = { + isa = PBXShellScriptBuildPhase; + alwaysOutOfDate = 1; + buildActionMask = 2147483647; + files = ( + ); + inputFileListPaths = ( + ); + inputPaths = ( + ); + name = "Bundle Runner & Model Artifacts"; + outputFileListPaths = ( + ); + outputPaths = ( + ); + runOnlyForDeploymentPostprocessing = 0; + shellPath = /bin/sh; + shellScript = "set -euo pipefail\n\nET_PATH=\"${EXECUTORCH_PATH:-${HOME}/executorch}\"\nRUNNER_SRC=\"${ET_PATH}/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner\"\nMODEL_DIR=\"${MODEL_DIR:-${HOME}/voxtral_realtime_quant_metal}\"\nLIBOMP_SRC=\"/opt/homebrew/opt/libomp/lib/libomp.dylib\"\nDEST=\"${BUILT_PRODUCTS_DIR}/${CONTENTS_FOLDER_PATH}/Resources\"\n\nmkdir -p \"${DEST}\"\n\ncopy_if_newer() {\n local src=\"$1\" dst=\"$2\"\n if [ ! -f \"${src}\" ]; then\n echo \"warning: Not found: ${src}\"\n return\n fi\n if [ ! 
-f \"${dst}\" ] || [ \"${src}\" -nt \"${dst}\" ]; then\n cp -fL \"${src}\" \"${dst}\"\n echo \"✓ Bundled $(basename \"${dst}\")\"\n else\n echo \"· $(basename \"${dst}\") up to date\"\n fi\n}\n\n# Runner binary\ncopy_if_newer \"${RUNNER_SRC}\" \"${DEST}/voxtral_realtime_runner\"\nchmod +x \"${DEST}/voxtral_realtime_runner\" 2>/dev/null || true\n\n# libomp (runner dependency)\ncopy_if_newer \"${LIBOMP_SRC}\" \"${DEST}/libomp.dylib\"\n\n# Patch runner to find libomp via @executable_path (SIP strips DYLD_LIBRARY_PATH)\nif [ -f \"${DEST}/voxtral_realtime_runner\" ] && [ -f \"${DEST}/libomp.dylib\" ]; then\n install_name_tool -change /opt/llvm-openmp/lib/libomp.dylib @executable_path/libomp.dylib \"${DEST}/voxtral_realtime_runner\" 2>/dev/null || true\n echo \"✓ Patched runner rpath for libomp\"\nfi\n\n# Model artifacts\ncopy_if_newer \"${MODEL_DIR}/model-metal-int4.pte\" \"${DEST}/model-metal-int4.pte\"\ncopy_if_newer \"${MODEL_DIR}/preprocessor.pte\" \"${DEST}/preprocessor.pte\"\ncopy_if_newer \"${MODEL_DIR}/tekken.json\" \"${DEST}/tekken.json\"\n"; + }; +/* End PBXShellScriptBuildPhase section */ + +/* Begin PBXSourcesBuildPhase section */ + 590D6B292BC0BAAD0A99F381 /* Sources */ = { + isa = PBXSourcesBuildPhase; + buildActionMask = 2147483647; + files = ( + 2C387248936D9F2313B8E2FB /* AudioEngine.swift in Sources */, + 75FD2E801951086DFE9E28E3 /* AudioLevelView.swift in Sources */, + 5E5BEB5668CE4A42F6B67569 /* ContentView.swift in Sources */, + 00C95410EB7173CB8EC29546 /* DictationManager.swift in Sources */, + 1850513C3735A844666D0217 /* DictationOverlayView.swift in Sources */, + 51009A5D18D14D7B19A77645 /* DictationPanel.swift in Sources */, + DB7FD45624EA9EECDA9DD799 /* ErrorBannerView.swift in Sources */, + 710AF905B60EB23C212AAC30 /* HealthCheck.swift in Sources */, + 4EDEB7964D71C06220AE2CE5 /* Preferences.swift in Sources */, + 9D0149A93C0234587B0A7655 /* 
RecordingControls.swift in Sources */, + 54B04C6AFEC0BEC16D5A427B /* RunnerBridge.swift in Sources */, + BA4D276D05BDEC9EE49C8979 /* RunnerError.swift in Sources */, + 4E3EFCC72CC492D94CC9494E /* Session.swift in Sources */, + 437CDA8161D03BC53BF5ACAA /* SettingsView.swift in Sources */, + A23029B5977019EBC0CC217F /* SetupGuideView.swift in Sources */, + 50B57C3652FDC3D30EE08D50 /* SidebarView.swift in Sources */, + BDC0CBA04D99E79D5EBA5DC6 /* TranscriptStore.swift in Sources */, + 83C2211EB3926FCE3AE55BA8 /* TranscriptView.swift in Sources */, + B4086DFD7477802852F31D0E /* VoxtralRealtimeApp.swift in Sources */, + 2C0D9096EB59C859E877427A /* WelcomeView.swift in Sources */, + ); + runOnlyForDeploymentPostprocessing = 0; + }; +/* End PBXSourcesBuildPhase section */ + +/* Begin XCBuildConfiguration section */ + 28E8B7FC03C8076E0D628DB6 /* Debug */ = { + isa = XCBuildConfiguration; + buildSettings = { + ASSETCATALOG_COMPILER_APPICON_NAME = AppIcon; + ASSETCATALOG_COMPILER_GLOBAL_ACCENT_COLOR_NAME = AccentColor; + CODE_SIGN_ENTITLEMENTS = VoxtralRealtime/VoxtralRealtime.entitlements; + COMBINE_HIDPI_IMAGES = YES; + GENERATE_INFOPLIST_FILE = YES; + INFOPLIST_FILE = VoxtralRealtime/Info.plist; + INFOPLIST_KEY_NSMicrophoneUsageDescription = "Voxtral Realtime needs microphone access to capture audio for on-device speech transcription."; + LD_RUNPATH_SEARCH_PATHS = ( + "$(inherited)", + "@executable_path/../Frameworks", + ); + PRODUCT_BUNDLE_IDENTIFIER = org.pytorch.executorch.VoxtralRealtime; + PRODUCT_NAME = "Voxtral Realtime"; + SDKROOT = macosx; + }; + name = Debug; + }; + 2E01924CDA9D95CDEE1E6C79 /* Release */ = { + isa = XCBuildConfiguration; + buildSettings = { + ASSETCATALOG_COMPILER_APPICON_NAME = AppIcon; + ASSETCATALOG_COMPILER_GLOBAL_ACCENT_COLOR_NAME = AccentColor; + CODE_SIGN_ENTITLEMENTS = VoxtralRealtime/VoxtralRealtime.entitlements; + COMBINE_HIDPI_IMAGES = YES; + GENERATE_INFOPLIST_FILE = YES; + INFOPLIST_FILE = VoxtralRealtime/Info.plist; + 
INFOPLIST_KEY_NSMicrophoneUsageDescription = "Voxtral Realtime needs microphone access to capture audio for on-device speech transcription."; + LD_RUNPATH_SEARCH_PATHS = ( + "$(inherited)", + "@executable_path/../Frameworks", + ); + PRODUCT_BUNDLE_IDENTIFIER = org.pytorch.executorch.VoxtralRealtime; + PRODUCT_NAME = "Voxtral Realtime"; + SDKROOT = macosx; + }; + name = Release; + }; + 824D2A17D67FC8808FAC2F63 /* Release */ = { + isa = XCBuildConfiguration; + buildSettings = { + ALWAYS_SEARCH_USER_PATHS = NO; + CLANG_ANALYZER_NONNULL = YES; + CLANG_ANALYZER_NUMBER_OBJECT_CONVERSION = YES_AGGRESSIVE; + CLANG_CXX_LANGUAGE_STANDARD = "gnu++14"; + CLANG_CXX_LIBRARY = "libc++"; + CLANG_ENABLE_MODULES = YES; + CLANG_ENABLE_OBJC_ARC = YES; + CLANG_ENABLE_OBJC_WEAK = YES; + CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES; + CLANG_WARN_BOOL_CONVERSION = YES; + CLANG_WARN_COMMA = YES; + CLANG_WARN_CONSTANT_CONVERSION = YES; + CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES; + CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR; + CLANG_WARN_DOCUMENTATION_COMMENTS = YES; + CLANG_WARN_EMPTY_BODY = YES; + CLANG_WARN_ENUM_CONVERSION = YES; + CLANG_WARN_INFINITE_RECURSION = YES; + CLANG_WARN_INT_CONVERSION = YES; + CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES; + CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES; + CLANG_WARN_OBJC_LITERAL_CONVERSION = YES; + CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR; + CLANG_WARN_QUOTED_INCLUDE_IN_FRAMEWORK_HEADER = YES; + CLANG_WARN_RANGE_LOOP_ANALYSIS = YES; + CLANG_WARN_STRICT_PROTOTYPES = YES; + CLANG_WARN_SUSPICIOUS_MOVE = YES; + CLANG_WARN_UNGUARDED_AVAILABILITY = YES_AGGRESSIVE; + CLANG_WARN_UNREACHABLE_CODE = YES; + CLANG_WARN__DUPLICATE_METHOD_MATCH = YES; + COPY_PHASE_STRIP = NO; + DEBUG_INFORMATION_FORMAT = "dwarf-with-dsym"; + ENABLE_HARDENED_RUNTIME = YES; + ENABLE_NS_ASSERTIONS = NO; + ENABLE_STRICT_OBJC_MSGSEND = YES; + GCC_C_LANGUAGE_STANDARD = gnu11; + GCC_NO_COMMON_BLOCKS = YES; + GCC_WARN_64_TO_32_BIT_CONVERSION = YES; + GCC_WARN_ABOUT_RETURN_TYPE = 
YES_ERROR; + GCC_WARN_UNDECLARED_SELECTOR = YES; + GCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE; + GCC_WARN_UNUSED_FUNCTION = YES; + GCC_WARN_UNUSED_VARIABLE = YES; + MACOSX_DEPLOYMENT_TARGET = 14.0; + MTL_ENABLE_DEBUG_INFO = NO; + MTL_FAST_MATH = YES; + PRODUCT_NAME = "$(TARGET_NAME)"; + SDKROOT = macosx; + SWIFT_COMPILATION_MODE = wholemodule; + SWIFT_OPTIMIZATION_LEVEL = "-O"; + SWIFT_VERSION = 5.10; + }; + name = Release; + }; + B51BF99527053D861C58E2BE /* Debug */ = { + isa = XCBuildConfiguration; + buildSettings = { + ALWAYS_SEARCH_USER_PATHS = NO; + CLANG_ANALYZER_NONNULL = YES; + CLANG_ANALYZER_NUMBER_OBJECT_CONVERSION = YES_AGGRESSIVE; + CLANG_CXX_LANGUAGE_STANDARD = "gnu++14"; + CLANG_CXX_LIBRARY = "libc++"; + CLANG_ENABLE_MODULES = YES; + CLANG_ENABLE_OBJC_ARC = YES; + CLANG_ENABLE_OBJC_WEAK = YES; + CLANG_WARN_BLOCK_CAPTURE_AUTORELEASING = YES; + CLANG_WARN_BOOL_CONVERSION = YES; + CLANG_WARN_COMMA = YES; + CLANG_WARN_CONSTANT_CONVERSION = YES; + CLANG_WARN_DEPRECATED_OBJC_IMPLEMENTATIONS = YES; + CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR; + CLANG_WARN_DOCUMENTATION_COMMENTS = YES; + CLANG_WARN_EMPTY_BODY = YES; + CLANG_WARN_ENUM_CONVERSION = YES; + CLANG_WARN_INFINITE_RECURSION = YES; + CLANG_WARN_INT_CONVERSION = YES; + CLANG_WARN_NON_LITERAL_NULL_CONVERSION = YES; + CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES; + CLANG_WARN_OBJC_LITERAL_CONVERSION = YES; + CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR; + CLANG_WARN_QUOTED_INCLUDE_IN_FRAMEWORK_HEADER = YES; + CLANG_WARN_RANGE_LOOP_ANALYSIS = YES; + CLANG_WARN_STRICT_PROTOTYPES = YES; + CLANG_WARN_SUSPICIOUS_MOVE = YES; + CLANG_WARN_UNGUARDED_AVAILABILITY = YES_AGGRESSIVE; + CLANG_WARN_UNREACHABLE_CODE = YES; + CLANG_WARN__DUPLICATE_METHOD_MATCH = YES; + COPY_PHASE_STRIP = NO; + DEBUG_INFORMATION_FORMAT = dwarf; + ENABLE_HARDENED_RUNTIME = YES; + ENABLE_STRICT_OBJC_MSGSEND = YES; + ENABLE_TESTABILITY = YES; + GCC_C_LANGUAGE_STANDARD = gnu11; + GCC_DYNAMIC_NO_PIC = NO; + GCC_NO_COMMON_BLOCKS = YES; + 
GCC_OPTIMIZATION_LEVEL = 0; + GCC_PREPROCESSOR_DEFINITIONS = ( + "$(inherited)", + "DEBUG=1", + ); + GCC_WARN_64_TO_32_BIT_CONVERSION = YES; + GCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR; + GCC_WARN_UNDECLARED_SELECTOR = YES; + GCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE; + GCC_WARN_UNUSED_FUNCTION = YES; + GCC_WARN_UNUSED_VARIABLE = YES; + MACOSX_DEPLOYMENT_TARGET = 14.0; + MTL_ENABLE_DEBUG_INFO = INCLUDE_SOURCE; + MTL_FAST_MATH = YES; + ONLY_ACTIVE_ARCH = YES; + PRODUCT_NAME = "$(TARGET_NAME)"; + SDKROOT = macosx; + SWIFT_ACTIVE_COMPILATION_CONDITIONS = DEBUG; + SWIFT_OPTIMIZATION_LEVEL = "-Onone"; + SWIFT_VERSION = 5.10; + }; + name = Debug; + }; +/* End XCBuildConfiguration section */ + +/* Begin XCConfigurationList section */ + 6C9DE1DB881A58FAD15232B9 /* Build configuration list for PBXNativeTarget "VoxtralRealtime" */ = { + isa = XCConfigurationList; + buildConfigurations = ( + 28E8B7FC03C8076E0D628DB6 /* Debug */, + 2E01924CDA9D95CDEE1E6C79 /* Release */, + ); + defaultConfigurationIsVisible = 0; + defaultConfigurationName = Debug; + }; + DD10F8C9604ABCBFD61FA78C /* Build configuration list for PBXProject "VoxtralRealtime" */ = { + isa = XCConfigurationList; + buildConfigurations = ( + B51BF99527053D861C58E2BE /* Debug */, + 824D2A17D67FC8808FAC2F63 /* Release */, + ); + defaultConfigurationIsVisible = 0; + defaultConfigurationName = Debug; + }; +/* End XCConfigurationList section */ + }; + rootObject = E0C3063C8408EF5D145A2309 /* Project object */; +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Info.plist b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Info.plist new file mode 100644 index 00000000..3d98dcfb --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Info.plist @@ -0,0 +1,8 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> +<plist version="1.0"> +<dict> + <key>NSMicrophoneUsageDescription</key> + <string>Voxtral Realtime needs microphone access to capture audio for on-device speech transcription.</string> +</dict> +</plist> diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Models/Preferences.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Models/Preferences.swift new file mode 100644 index 00000000..0e55c92c --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Models/Preferences.swift @@ -0,0 +1,71 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. + */ + +import Foundation + +@MainActor @Observable +final class Preferences { + var runnerPath: String { + didSet { UserDefaults.standard.set(runnerPath, forKey: "runnerPath") } + } + + var modelDirectory: String { + didSet { UserDefaults.standard.set(modelDirectory, forKey: "modelDirectory") } + } + + var audioDeviceID: String? { + didSet { UserDefaults.standard.set(audioDeviceID, forKey: "audioDeviceID") } + } + + var silenceThreshold: Double { + didSet { UserDefaults.standard.set(silenceThreshold, forKey: "silenceThreshold") } + } + + var silenceTimeout: Double { + didSet { UserDefaults.standard.set(silenceTimeout, forKey: "silenceTimeout") } + } + + var modelPath: String { "\(modelDirectory)/model-metal-int4.pte" } + var tokenizerPath: String { "\(modelDirectory)/tekken.json" } + var preprocessorPath: String { "\(modelDirectory)/preprocessor.pte" } + + var usingBundledResources: Bool { + runnerPath.hasPrefix(Bundle.main.bundlePath) + && modelDirectory.hasPrefix(Bundle.main.bundlePath) + } + + init() { + let defaults = UserDefaults.standard + let home = FileManager.default.homeDirectoryForCurrentUser.path() + let bundleResources = Bundle.main.resourcePath ??
"" + + let bundledRunner = "\(bundleResources)/voxtral_realtime_runner" + let bundledModel = "\(bundleResources)/model-metal-int4.pte" + let buildRunner = "\(home)/executorch/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner" + + let hasBundledRunner = FileManager.default.isExecutableFile(atPath: bundledRunner) + let hasBundledModel = FileManager.default.fileExists(atPath: bundledModel) + + if hasBundledRunner { + self.runnerPath = bundledRunner + } else { + self.runnerPath = defaults.string(forKey: "runnerPath") ?? buildRunner + } + + if hasBundledModel { + self.modelDirectory = bundleResources + } else { + self.modelDirectory = defaults.string(forKey: "modelDirectory") + ?? "\(home)/voxtral_realtime_quant_metal" + } + + self.audioDeviceID = defaults.string(forKey: "audioDeviceID") + self.silenceThreshold = defaults.object(forKey: "silenceThreshold") as? Double ?? 0.02 + self.silenceTimeout = defaults.object(forKey: "silenceTimeout") as? Double ?? 2.0 + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Models/Session.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Models/Session.swift new file mode 100644 index 00000000..a57fe82b --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Models/Session.swift @@ -0,0 +1,39 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
+ */ + +import Foundation + +struct Session: Identifiable, Codable, Sendable, Hashable { + let id: UUID + let date: Date + var title: String + var transcript: String + var duration: TimeInterval + + init( + id: UUID = UUID(), + date: Date = .now, + title: String = "", + transcript: String = "", + duration: TimeInterval = 0 + ) { + self.id = id + self.date = date + self.title = title + self.transcript = transcript + self.duration = duration + } + + var displayTitle: String { + if !title.isEmpty { return title } + let formatter = DateFormatter() + formatter.dateStyle = .medium + formatter.timeStyle = .short + return formatter.string(from: date) + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Models/TranscriptStore.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Models/TranscriptStore.swift new file mode 100644 index 00000000..e57bea97 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Models/TranscriptStore.swift @@ -0,0 +1,354 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. + */ + +import Foundation + +@MainActor @Observable +final class TranscriptStore { + enum SessionState: Equatable { + case idle + case loading + case transcribing + case paused + } + + enum ModelState: Equatable { + case unloaded + case loading + case ready + } + + var sessions: [Session] = [] + var selectedSessionID: UUID? + var liveTranscript = "" + var sessionState: SessionState = .idle + var modelState: ModelState = .unloaded + var currentError: RunnerError? + var healthResult: HealthCheck.Result? 
+ var audioLevel: Float = 0 + var statusMessage = "" + + var dictationText = "" + var isDictating = false + + var hasActiveSession: Bool { sessionState != .idle } + var isTranscribing: Bool { sessionState == .transcribing } + var isPaused: Bool { sessionState == .paused } + var isLoading: Bool { sessionState == .loading } + var isModelReady: Bool { modelState == .ready } + + private let runner = RunnerBridge() + private let preferences: Preferences + private var startDate: Date? + private var streamTask: Task<Void, Never>? + + init(preferences: Preferences) { + self.preferences = preferences + loadSessions() + } + + // MARK: - Model lifecycle + + func preloadModel() async { + guard modelState == .unloaded else { return } + await ensureRunnerLaunched() + } + + func unloadModel() async { + if hasActiveSession { await endSession() } + await runner.stop() + modelState = .unloaded + statusMessage = "" + streamTask?.cancel() + streamTask = nil + } + + // MARK: - Transcription + + func startTranscription() async { + guard sessionState == .idle else { return } + + let micOK = await checkMicPermissionLive() + guard micOK else { return } + + if modelState != .ready { + await ensureRunnerLaunched() + while modelState == .loading { + try? await Task.sleep(for: .milliseconds(100)) + } + guard modelState == .ready else { return } + } + + sessionState = .transcribing + liveTranscript = "" + startDate = .now + currentError = nil + statusMessage = "Transcribing" + + do { + try await runner.startAudioCapture() + } catch { + currentError = .launchFailed(description: error.localizedDescription) + sessionState = .idle + } + } + + func resumeTranscription() async { + guard sessionState == .paused else { return } + + let micOK = await checkMicPermissionLive() + guard micOK else { return } + + if modelState != .ready { + await ensureRunnerLaunched() + while modelState == .loading { + try?
await Task.sleep(for: .milliseconds(100)) + } + guard modelState == .ready else { return } + } + + sessionState = .transcribing + statusMessage = "Transcribing" + + do { + try await runner.startAudioCapture() + } catch { + currentError = .launchFailed(description: error.localizedDescription) + sessionState = .paused + } + } + + func pauseTranscription() async { + guard sessionState == .transcribing || sessionState == .loading else { return } + await runner.stopAudioCapture() + audioLevel = 0 + statusMessage = "Paused" + sessionState = .paused + } + + func endSession() async { + if sessionState == .transcribing { + await runner.stopAudioCapture() + } + + let duration = startDate.map { Date.now.timeIntervalSince($0) } ?? 0 + + if !liveTranscript.isEmpty { + let session = Session( + date: startDate ?? .now, + transcript: liveTranscript, + duration: duration + ) + sessions.insert(session, at: 0) + selectedSessionID = session.id + saveSessions() + } + + liveTranscript = "" + audioLevel = 0 + statusMessage = isModelReady ? "Model ready" : "" + sessionState = .idle + startDate = nil + } + + func togglePauseResume() async { + switch sessionState { + case .idle: + await startTranscription() + case .loading, .transcribing: + await pauseTranscription() + case .paused: + await resumeTranscription() + } + } + + // MARK: - Private + + private func ensureRunnerLaunched() async { + guard await !runner.isRunnerAlive else { return } + + guard healthResult?.allGood == true else { + currentError = healthResult.flatMap { result in + if !result.runnerAvailable { return .binaryNotFound(path: preferences.runnerPath) } + if let missing = result.missingFiles.first { return .modelMissing(file: missing) } + if result.micPermission != .authorized { return .microphonePermissionDenied } + return nil + } + return + } + + modelState = .loading + statusMessage = "Launching runner..." 
+ + let streams = await runner.launchRunner( + runnerPath: preferences.runnerPath, + modelPath: preferences.modelPath, + tokenizerPath: preferences.tokenizerPath, + preprocessorPath: preferences.preprocessorPath + ) + + streamTask?.cancel() + streamTask = Task { + await withTaskGroup(of: Void.self) { group in + group.addTask { @MainActor [weak self] in + for await token in streams.tokens { + if self?.isDictating == true { + self?.dictationText += token + } else if self?.hasActiveSession == true { + self?.liveTranscript += token + } + } + } + + group.addTask { @MainActor [weak self] in + for await error in streams.errors { + self?.currentError = error + if self?.isTranscribing == true { + await self?.pauseTranscription() + } + } + } + + group.addTask { @MainActor [weak self] in + for await level in streams.audioLevel { + self?.audioLevel = level + } + } + + group.addTask { @MainActor [weak self] in + for await status in streams.status { + self?.statusMessage = status + } + } + + group.addTask { @MainActor [weak self] in + for await state in streams.modelState { + switch state { + case .unloaded: + self?.modelState = .unloaded + case .loading: + self?.modelState = .loading + case .ready: + self?.modelState = .ready + if self?.statusMessage == "Warming up..." || self?.statusMessage.contains("Loading") == true { + self?.statusMessage = "Model ready" + } + } + } + } + } + } + } + + // MARK: - Dictation + + func startDictation() async { + guard !isDictating else { return } + + let micOK = await checkMicPermissionLive() + guard micOK else { return } + + if modelState != .ready { + await ensureRunnerLaunched() + while modelState == .loading { + try? 
await Task.sleep(for: .milliseconds(100)) + } + guard modelState == .ready else { return } + } + + isDictating = true + dictationText = "" + audioLevel = 0 + + do { + try await runner.startAudioCapture() + } catch { + isDictating = false + } + } + + func stopDictation() async -> String { + guard isDictating else { return "" } + await runner.stopAudioCapture() + isDictating = false + audioLevel = 0 + let result = dictationText + dictationText = "" + return result + } + + // MARK: - Session Management + + func deleteSession(_ session: Session) { + sessions.removeAll { $0.id == session.id } + if selectedSessionID == session.id { + selectedSessionID = sessions.first?.id + } + saveSessions() + } + + func renameSession(_ session: Session, to newTitle: String) { + guard let idx = sessions.firstIndex(where: { $0.id == session.id }) else { return } + sessions[idx].title = newTitle + saveSessions() + } + + func clearError() { + currentError = nil + } + + // MARK: - Health Check + + func runHealthCheck() async { + healthResult = await HealthCheck.run( + runnerPath: preferences.runnerPath, + modelPath: preferences.modelPath, + tokenizerPath: preferences.tokenizerPath, + preprocessorPath: preferences.preprocessorPath + ) + } + + private func checkMicPermissionLive() async -> Bool { + let permission = await HealthCheck.liveMicPermission() + + if permission == .notDetermined { + let granted = await HealthCheck.requestMicrophoneAccess() + if granted { + await runHealthCheck() + return true + } + } + + if permission == .authorized { return true } + + currentError = .microphonePermissionDenied + await runHealthCheck() + return false + } + + // MARK: - Persistence + + private var sessionsURL: URL { + let appSupport = FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask).first! + let dir = appSupport.appendingPathComponent("VoxtralRealtime", isDirectory: true) + try? 
FileManager.default.createDirectory(at: dir, withIntermediateDirectories: true) + return dir.appendingPathComponent("sessions.json") + } + + private func saveSessions() { + guard let data = try? JSONEncoder().encode(sessions) else { return } + try? data.write(to: sessionsURL, options: .atomic) + } + + private func loadSessions() { + guard let data = try? Data(contentsOf: sessionsURL), + let decoded = try? JSONDecoder().decode([Session].self, from: data) + else { return } + sessions = decoded + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AccentColor.colorset/Contents.json b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AccentColor.colorset/Contents.json new file mode 100644 index 00000000..eb878970 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AccentColor.colorset/Contents.json @@ -0,0 +1,11 @@ +{ + "colors" : [ + { + "idiom" : "universal" + } + ], + "info" : { + "author" : "xcode", + "version" : 1 + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/Contents.json b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/Contents.json new file mode 100644 index 00000000..4ab5e2f3 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/Contents.json @@ -0,0 +1,68 @@ +{ + "images" : [ + { + "filename" : "icon_16.png", + "idiom" : "mac", + "scale" : "1x", + "size" : "16x16" + }, + { + "filename" : "icon_16@2x.png", + "idiom" : "mac", + "scale" : "2x", + "size" : "16x16" + }, + { + "filename" : "icon_32.png", + "idiom" : "mac", + "scale" : "1x", + "size" : "32x32" + }, + { + "filename" : "icon_32@2x.png", + "idiom" : "mac", + "scale" : "2x", + "size" : "32x32" + }, + { + "filename" : "icon_128.png", + "idiom" : "mac", + "scale" : "1x", + "size" : "128x128" + }, + { + "filename" : "icon_128@2x.png", + "idiom" 
: "mac", + "scale" : "2x", + "size" : "128x128" + }, + { + "filename" : "icon_256.png", + "idiom" : "mac", + "scale" : "1x", + "size" : "256x256" + }, + { + "filename" : "icon_256@2x.png", + "idiom" : "mac", + "scale" : "2x", + "size" : "256x256" + }, + { + "filename" : "icon_512.png", + "idiom" : "mac", + "scale" : "1x", + "size" : "512x512" + }, + { + "filename" : "icon_512@2x.png", + "idiom" : "mac", + "scale" : "2x", + "size" : "512x512" + } + ], + "info" : { + "author" : "xcode", + "version" : 1 + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_128.png b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_128.png new file mode 100644 index 00000000..cf71558e Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_128.png differ diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_128@2x.png b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_128@2x.png new file mode 100644 index 00000000..49e194e7 Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_128@2x.png differ diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_16.png b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_16.png new file mode 100644 index 00000000..a97bd5e8 Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_16.png differ diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_16@2x.png b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_16@2x.png new file mode 100644 index 
00000000..726d41e1 Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_16@2x.png differ diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_256.png b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_256.png new file mode 100644 index 00000000..49e194e7 Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_256.png differ diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_256@2x.png b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_256@2x.png new file mode 100644 index 00000000..5035e10a Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_256@2x.png differ diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_32.png b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_32.png new file mode 100644 index 00000000..726d41e1 Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_32.png differ diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_32@2x.png b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_32@2x.png new file mode 100644 index 00000000..164c19a5 Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_32@2x.png differ diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_512.png 
b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_512.png new file mode 100644 index 00000000..5035e10a Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_512.png differ diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_512@2x.png b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_512@2x.png new file mode 100644 index 00000000..1597ff5c Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/AppIcon.appiconset/icon_512@2x.png differ diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/Contents.json b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/Contents.json new file mode 100644 index 00000000..73c00596 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Resources/Assets.xcassets/Contents.json @@ -0,0 +1,6 @@ +{ + "info" : { + "author" : "xcode", + "version" : 1 + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/AudioEngine.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/AudioEngine.swift new file mode 100644 index 00000000..70651572 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/AudioEngine.swift @@ -0,0 +1,111 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. + */ + +import AVFoundation +import Accelerate +import os + +private let log = Logger(subsystem: "org.pytorch.executorch.VoxtralRealtime", category: "AudioEngine") + +actor AudioEngine { + private var engine: AVAudioEngine? 
+ + func startCapture( + writingTo handle: FileHandle, + levelHandler: @Sendable @escaping (Float) -> Void + ) throws { + let engine = AVAudioEngine() + let inputNode = engine.inputNode + let hwFormat = inputNode.outputFormat(forBus: 0) + + log.info("Hardware audio format: \(hwFormat.sampleRate)Hz, \(hwFormat.channelCount)ch") + + guard hwFormat.sampleRate > 0, hwFormat.channelCount > 0 else { + throw RunnerError.microphoneNotAvailable + } + + let targetFormat = AVAudioFormat( + commonFormat: .pcmFormatFloat32, + sampleRate: 16000, + channels: 1, + interleaved: false + )! + + guard let converter = AVAudioConverter(from: hwFormat, to: targetFormat) else { + throw RunnerError.launchFailed( + description: "Cannot convert audio from \(hwFormat.sampleRate)Hz/\(hwFormat.channelCount)ch to 16kHz mono" + ) + } + + let sampleRateRatio = 16000.0 / hwFormat.sampleRate + var totalBytesWritten: Int64 = 0 + var writeErrorCount = 0 + + inputNode.installTap(onBus: 0, bufferSize: 4096, format: hwFormat) { buffer, _ in + let capacity = AVAudioFrameCount(Double(buffer.frameLength) * sampleRateRatio) + 1 + guard let converted = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity) else { + log.warning("Failed to allocate conversion buffer") + return + } + + var consumed = false + var error: NSError? 
+ converter.convert(to: converted, error: &error) { _, outStatus in + if !consumed { + consumed = true + outStatus.pointee = .haveData + return buffer + } + outStatus.pointee = .noDataNow + return nil + } + + if let error { + log.warning("Audio conversion error: \(error.localizedDescription)") + return + } + + guard converted.frameLength > 0, + let channelData = converted.floatChannelData + else { return } + + let frameCount = Int(converted.frameLength) + let samples = channelData[0] + + var rms: Float = 0 + vDSP_rmsqv(samples, 1, &rms, vDSP_Length(frameCount)) + levelHandler(rms) + + let byteCount = frameCount * MemoryLayout<Float>.size + let data = Data(bytes: samples, count: byteCount) + do { + try handle.write(contentsOf: data) + totalBytesWritten += Int64(byteCount) + if totalBytesWritten % 160000 == 0 { + log.debug("Audio written: \(totalBytesWritten) bytes (\(totalBytesWritten / 64000)s @ 16kHz)") + } + } catch { + writeErrorCount += 1 + if writeErrorCount <= 3 { + log.error("Pipe write failed (#\(writeErrorCount)): \(error.localizedDescription)") + } + } + } + + try engine.start() + log.info("AVAudioEngine started — capturing and piping to runner stdin") + self.engine = engine + } + + func stopCapture() { + engine?.inputNode.removeTap(onBus: 0) + engine?.stop() + engine = nil + log.info("Audio capture stopped") + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/DictationManager.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/DictationManager.swift new file mode 100644 index 00000000..8986c4f8 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/DictationManager.swift @@ -0,0 +1,245 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree.
+ */ + +import AppKit +import Carbon.HIToolbox +import os + +private let log = Logger(subsystem: "org.pytorch.executorch.VoxtralRealtime", category: "Dictation") + +@MainActor @Observable +final class DictationManager { + enum State: Equatable { + case idle + case listening + } + + private(set) var state: State = .idle + var isListening: Bool { state == .listening } + + private let store: TranscriptStore + private let preferences: Preferences + private var panel: DictationPanel? + private var hotKeyRef: EventHotKeyRef? + private var eventHandlerRef: EventHandlerRef? + private var silenceTimer: Task<Void, Never>? + private var lastVoiceTime: Date = .now + private var targetApp: NSRunningApplication? + + init(store: TranscriptStore, preferences: Preferences) { + self.store = store + self.preferences = preferences + } + + nonisolated func cleanup() { + MainActor.assumeIsolated { + unregisterHotKey() + } + } + + // MARK: - Global Hotkey (Ctrl+Space) + + func registerHotKey() { + let hotKeyID = EventHotKeyID(signature: OSType(0x5654_5254), id: 1) + var ref: EventHotKeyRef?
+ let status = RegisterEventHotKey( + UInt32(kVK_Space), + UInt32(controlKey), + hotKeyID, + GetApplicationEventTarget(), + 0, + &ref + ) + if status == noErr { + hotKeyRef = ref + log.info("Global hotkey registered: Ctrl+Space") + } else { + log.error("Failed to register hotkey: \(status)") + } + + installEventHandler() + } + + func unregisterHotKey() { + if let ref = hotKeyRef { + UnregisterEventHotKey(ref) + hotKeyRef = nil + } + if let handler = eventHandlerRef { + RemoveEventHandler(handler) + eventHandlerRef = nil + } + } + + private func installEventHandler() { + var eventType = EventTypeSpec(eventClass: OSType(kEventClassKeyboard), eventKind: UInt32(kEventHotKeyPressed)) + let selfPtr = Unmanaged.passUnretained(self).toOpaque() + + InstallEventHandler( + GetApplicationEventTarget(), + { _, event, userData -> OSStatus in + guard let userData, let event else { return OSStatus(eventNotHandledErr) } + let manager = Unmanaged<DictationManager>.fromOpaque(userData).takeUnretainedValue() + Task { @MainActor in + await manager.toggle() + } + return noErr + }, + 1, + &eventType, + selfPtr, + &eventHandlerRef + ) + } + + // MARK: - Toggle + + func toggle() async { + switch state { + case .idle: + await startListening() + case .listening: + await stopAndPaste() + } + } + + // MARK: - Listening + + private func startListening() async { + guard store.isModelReady || store.modelState == .unloaded else { return } + + let micStatus = await HealthCheck.liveMicPermission() + if micStatus == .notDetermined { + _ = await HealthCheck.requestMicrophoneAccess() + } + guard await HealthCheck.liveMicPermission() == .authorized else { + log.error("Microphone permission not granted — cannot start dictation") + store.currentError = .microphonePermissionDenied + return + } + + if !store.isModelReady { + await store.preloadModel() + while store.modelState == .loading { + try?
await Task.sleep(for: .milliseconds(100)) + } + guard store.isModelReady else { return } + } + + targetApp = NSWorkspace.shared.frontmostApplication + log.info("Target app: \(self.targetApp?.localizedName ?? "none")") + + state = .listening + lastVoiceTime = .now + showPanel() + + await store.startDictation() + startSilenceMonitor() + log.info("Dictation started") + } + + private func stopAndPaste() async { + guard state == .listening else { return } + + silenceTimer?.cancel() + silenceTimer = nil + + let text = await store.stopDictation() + state = .idle + + dismissPanel() + log.info("Dictation stopped, text length: \(text.count)") + + guard !text.isEmpty else { return } + + NSPasteboard.general.clearContents() + NSPasteboard.general.setString(text, forType: .string) + + if let app = targetApp { + app.activate() + log.info("Re-activated: \(app.localizedName ?? "?")") + } + + try? await Task.sleep(for: .milliseconds(300)) + + if Self.checkAccessibility(prompt: false) { + pasteViaKeyEvent() + } else { + log.warning("Accessibility permission lost — text is on clipboard, prompting user to re-grant") + _ = Self.checkAccessibility(prompt: true) + } + } + + // MARK: - Silence Detection + + private func startSilenceMonitor() { + silenceTimer?.cancel() + silenceTimer = Task { @MainActor [weak self] in + while !Task.isCancelled { + try? 
await Task.sleep(for: .milliseconds(250)) + guard let self, self.state == .listening else { break } + + let level = self.store.audioLevel + + if level > Float(self.preferences.silenceThreshold) { + self.lastVoiceTime = .now + } + + let silenceDuration = Date.now.timeIntervalSince(self.lastVoiceTime) + let hasText = !self.store.dictationText.isEmpty + + if hasText && silenceDuration >= self.preferences.silenceTimeout { + log.info("Auto-stop: \(String(format: "%.1f", silenceDuration))s silence (level: \(String(format: "%.4f", level)))") + await self.stopAndPaste() + break + } + } + } + } + + // MARK: - Panel + + private func showPanel() { + let overlay = DictationOverlayView(store: store) + panel = DictationPanel(contentView: overlay) + panel?.showCentered() + } + + private func dismissPanel() { + panel?.dismiss() + panel = nil + } + + // MARK: - Paste + + private func pasteViaKeyEvent() { + guard let keyDown = CGEvent(keyboardEventSource: nil, virtualKey: CGKeyCode(kVK_ANSI_V), keyDown: true), + let keyUp = CGEvent(keyboardEventSource: nil, virtualKey: CGKeyCode(kVK_ANSI_V), keyDown: false) + else { + log.error("CGEvent creation failed — Accessibility permission is not granted for this binary. 
Remove the app from Accessibility settings, re-add it, and restart.") + return + } + + keyDown.flags = .maskCommand + keyDown.post(tap: .cgSessionEventTap) + + usleep(50_000) + + keyUp.flags = .maskCommand + keyUp.post(tap: .cgSessionEventTap) + + log.info("Auto-pasted via Cmd+V") + } + + // MARK: - Accessibility + + static func checkAccessibility(prompt: Bool = false) -> Bool { + let options = [kAXTrustedCheckOptionPrompt.takeUnretainedValue(): prompt] as CFDictionary + return AXIsProcessTrustedWithOptions(options) + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/HealthCheck.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/HealthCheck.swift new file mode 100644 index 00000000..d2af865f --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/HealthCheck.swift @@ -0,0 +1,71 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
+ */ + +import AVFoundation + +struct HealthCheck: Sendable { + struct Result: Sendable { + var runnerAvailable: Bool + var modelAvailable: Bool + var preprocessorAvailable: Bool + var tokenizerAvailable: Bool + var micPermission: MicPermission + + var allGood: Bool { + runnerAvailable && modelAvailable && preprocessorAvailable && tokenizerAvailable && micPermission == .authorized + } + + var missingFiles: [String] { + var missing: [String] = [] + if !runnerAvailable { missing.append("voxtral_realtime_runner") } + if !modelAvailable { missing.append("model-metal-int4.pte") } + if !preprocessorAvailable { missing.append("preprocessor.pte") } + if !tokenizerAvailable { missing.append("tekken.json") } + return missing + } + } + + enum MicPermission: Sendable { + case authorized, denied, notDetermined + } + + static func run( + runnerPath: String, + modelPath: String, + tokenizerPath: String, + preprocessorPath: String + ) async -> Result { + let fm = FileManager.default + let micPerm = await microphonePermission() + + return Result( + runnerAvailable: fm.isExecutableFile(atPath: runnerPath), + modelAvailable: fm.fileExists(atPath: modelPath), + preprocessorAvailable: fm.fileExists(atPath: preprocessorPath), + tokenizerAvailable: fm.fileExists(atPath: tokenizerPath), + micPermission: micPerm + ) + } + + static func requestMicrophoneAccess() async -> Bool { + await AVCaptureDevice.requestAccess(for: .audio) + } + + static func liveMicPermission() async -> MicPermission { + await microphonePermission() + } + + private static func microphonePermission() async -> MicPermission { + switch AVCaptureDevice.authorizationStatus(for: .audio) { + case .authorized: .authorized + case .denied, .restricted: .denied + case .notDetermined: .notDetermined + @unknown default: .notDetermined + } + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/RunnerBridge.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/RunnerBridge.swift new file mode 100644 
index 00000000..01d47667 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Services/RunnerBridge.swift @@ -0,0 +1,248 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. + */ + +import Foundation +import os + +private let log = Logger(subsystem: "org.pytorch.executorch.VoxtralRealtime", category: "RunnerBridge") + +actor RunnerBridge { + enum ModelState: Sendable, Equatable { + case unloaded + case loading + case ready + } + + struct Streams: Sendable { + let tokens: AsyncStream<String> + let errors: AsyncStream<RunnerError> + let modelState: AsyncStream<ModelState> + let audioLevel: AsyncStream<Float> + let status: AsyncStream<String> + } + + private var process: Process? + private var stdinPipe: Pipe? + private let audioEngine = AudioEngine() + private var levelContinuation: AsyncStream<Float>.Continuation? + private var tokenContinuation: AsyncStream<String>.Continuation? + private var statusContinuation: AsyncStream<String>.Continuation? + private var errorContinuation: AsyncStream<RunnerError>.Continuation? + private var modelStateContinuation: AsyncStream<ModelState>.Continuation? + + private(set) var modelState: ModelState = .unloaded + + var isRunnerAlive: Bool { process?.isRunning == true } + + // MARK: - Process lifecycle + + func launchRunner( + runnerPath: String, + modelPath: String, + tokenizerPath: String, + preprocessorPath: String + ) -> Streams { + if isRunnerAlive { stop() } + + let stdoutPipe = Pipe() + let stderrPipe = Pipe() + let stdinPipe = Pipe() + self.stdinPipe = stdinPipe + + let proc = Process() + proc.executableURL = URL(fileURLWithPath: runnerPath) + proc.arguments = [ + "--model_path", modelPath, + "--tokenizer_path", tokenizerPath, + "--preprocessor_path", preprocessorPath, + "--mic" + ] + + var env = ProcessInfo.processInfo.environment + if let bundleResources = Bundle.main.resourcePath { + let existing = env["DYLD_LIBRARY_PATH"] ?? 
"" + env["DYLD_LIBRARY_PATH"] = existing.isEmpty ? bundleResources : "\(bundleResources):\(existing)" + } + proc.environment = env + proc.standardInput = stdinPipe + proc.standardOutput = stdoutPipe + proc.standardError = stderrPipe + self.process = proc + + let (tokenStream, tokenCont) = AsyncStream.makeStream() + let (errorStream, errorCont) = AsyncStream.makeStream() + let (modelStateStream, modelStateCont) = AsyncStream.makeStream() + let (levelStream, levelCont) = AsyncStream.makeStream() + let (statusStream, statusCont) = AsyncStream.makeStream() + + self.tokenContinuation = tokenCont + self.errorContinuation = errorCont + self.modelStateContinuation = modelStateCont + self.levelContinuation = levelCont + self.statusContinuation = statusCont + + modelState = .loading + modelStateCont.yield(.loading) + + log.info("Launching runner: \(runnerPath)") + statusCont.yield("Launching runner...") + + DispatchQueue.global(qos: .userInitiated).async { [weak self] in + let handle = stdoutPipe.fileHandleForReading + var sawListening = false + + while true { + let data = handle.availableData + if data.isEmpty { + log.info("Runner stdout EOF") + tokenCont.finish() + break + } + + guard let text = String(data: data, encoding: .utf8) else { continue } + log.debug("Runner stdout (\(data.count) bytes): \(text.prefix(200))") + + if !sawListening && text.contains("Listening") { + sawListening = true + log.info("Model loaded — runner ready") + modelStateCont.yield(.ready) + Task { await self?.setModelState(.ready) } + + let remainder = text.replacingOccurrences(of: "Listening (Ctrl+C to stop)...", with: "") + .trimmingCharacters(in: .whitespacesAndNewlines) + if !remainder.isEmpty { tokenCont.yield(remainder) } + continue + } + + if text.contains("PyTorchObserver") { + let parts = text.components(separatedBy: "\n") + let nonStats = parts.filter { !$0.contains("PyTorchObserver") }.joined(separator: "\n") + if !nonStats.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty { + 
tokenCont.yield(nonStats) + } + continue + } + + let cleaned = text.replacingOccurrences( + of: "\u{1B}\\[[0-9;]*m", with: "", options: .regularExpression + ) + if !cleaned.isEmpty { tokenCont.yield(cleaned) } + } + } + + DispatchQueue.global(qos: .utility).async { + let handle = stderrPipe.fileHandleForReading + while true { + let data = handle.availableData + if data.isEmpty { break } + guard let text = String(data: data, encoding: .utf8) else { continue } + log.debug("Runner stderr: \(text.trimmingCharacters(in: .newlines))") + + let lastLine = text.components(separatedBy: "\n") + .filter { !$0.trimmingCharacters(in: .whitespaces).isEmpty } + .last ?? "" + if lastLine.contains("Loading model") { statusCont.yield("Loading model...") } + else if lastLine.contains("Loading tokenizer") { statusCont.yield("Loading tokenizer...") } + else if lastLine.contains("Loading preprocessor") { statusCont.yield("Loading preprocessor...") } + else if lastLine.contains("Warming up") { statusCont.yield("Warming up...") } + else if lastLine.contains("Warmup complete") { statusCont.yield("Model ready") } + } + } + + proc.terminationHandler = { [weak self] process in + let code = process.terminationStatus + log.info("Runner exited with code \(code)") + if code != 0 && code != 2 { + errorCont.yield(.runnerCrashed(exitCode: code, stderr: "Exit code: \(code)")) + } + tokenCont.finish() + errorCont.finish() + levelCont.finish() + statusCont.finish() + modelStateCont.yield(.unloaded) + modelStateCont.finish() + Task { await self?.onProcessExited() } + } + + do { + try proc.run() + log.info("Runner process started (pid: \(proc.processIdentifier))") + } catch { + log.error("Failed to launch runner: \(error.localizedDescription)") + errorCont.yield(.launchFailed(description: error.localizedDescription)) + errorCont.finish() + tokenCont.finish() + levelCont.finish() + statusCont.finish() + modelStateCont.yield(.unloaded) + modelStateCont.finish() + } + + return Streams( + tokens: tokenStream, 
errors: errorStream, modelState: modelStateStream, + audioLevel: levelStream, status: statusStream + ) + } + + // MARK: - Audio lifecycle (independent of process) + + func startAudioCapture() async throws { + guard let handle = stdinPipe?.fileHandleForWriting else { + throw RunnerError.launchFailed(description: "Runner stdin not available") + } + let cont = self.levelContinuation + log.info("Starting audio capture") + try await audioEngine.startCapture(writingTo: handle) { level in + cont?.yield(level) + } + } + + func stopAudioCapture() async { + await audioEngine.stopCapture() + levelContinuation?.yield(0) + log.info("Audio capture stopped (runner still alive)") + } + + // MARK: - Full shutdown + + func stop() { + Task { await audioEngine.stopCapture() } + + stdinPipe?.fileHandleForWriting.closeFile() + stdinPipe = nil + + if let proc = process, proc.isRunning { + proc.interrupt() + proc.waitUntilExit() + } + process = nil + modelState = .unloaded + clearContinuations() + } + + // MARK: - Private + + private func setModelState(_ state: ModelState) { + modelState = state + } + + private func onProcessExited() { + stdinPipe = nil + process = nil + modelState = .unloaded + clearContinuations() + } + + private func clearContinuations() { + tokenContinuation = nil + errorContinuation = nil + modelStateContinuation = nil + levelContinuation = nil + statusContinuation = nil + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Utilities/RunnerError.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Utilities/RunnerError.swift new file mode 100644 index 00000000..f5cf8ac5 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Utilities/RunnerError.swift @@ -0,0 +1,40 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
+ */ + +import Foundation + +enum RunnerError: Error, Sendable { + case binaryNotFound(path: String) + case modelMissing(file: String) + case microphonePermissionDenied + case microphoneNotAvailable + case runnerCrashed(exitCode: Int32, stderr: String) + case transcriptionInterrupted(partial: String) + case launchFailed(description: String) +} + +extension RunnerError: LocalizedError { + var errorDescription: String? { + switch self { + case .binaryNotFound(let path): + "Runner binary not found at \(path)" + case .modelMissing(let file): + "Model file missing: \(file)" + case .microphonePermissionDenied: + "Microphone access denied. Enable it in System Settings → Privacy & Security → Microphone, then quit and relaunch the app." + case .microphoneNotAvailable: + "No audio input available. Check that your microphone is connected, enable it in System Settings → Privacy & Security → Microphone, then quit and relaunch the app." + case .runnerCrashed(let code, let stderr): + "Runner exited with code \(code): \(stderr)" + case .transcriptionInterrupted: + "Transcription was interrupted" + case .launchFailed(let desc): + "Failed to launch runner: \(desc)" + } + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/AudioLevelView.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/AudioLevelView.swift new file mode 100644 index 00000000..68b503cb --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/AudioLevelView.swift @@ -0,0 +1,54 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
+ */ + +import SwiftUI + +struct AudioLevelView: View { + let level: Float + let barCount: Int + + @State private var barHeights: [CGFloat] + + init(level: Float, barCount: Int = 24) { + self.level = level + self.barCount = barCount + _barHeights = State(initialValue: Array(repeating: 0.08, count: barCount)) + } + + var body: some View { + HStack(spacing: 2) { + ForEach(0..<barCount, id: \.self) { index in + RoundedRectangle(cornerRadius: 1) + .fill(barColor(for: barHeights[index])) + .frame(maxWidth: .infinity) + .scaleEffect(y: barHeights[index], anchor: .bottom) + } + } + .onChange(of: level) { + withAnimation(.easeOut(duration: 0.1)) { + for index in barHeights.indices { + barHeights[index] = max(0.08, min(1.0, CGFloat(level) * CGFloat.random(in: 0.7...1.3))) + } + } + } + } + + private func barColor(for height: CGFloat) -> Color { + if height > 0.7 { return .orange } + if height > 0.15 { return .accentColor } + return .secondary.opacity(0.4) + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/ContentView.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/ContentView.swift new file mode 100644 index 00000000..18740d87 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/ContentView.swift @@ -0,0 +1,59 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
+ */ + +import SwiftUI + +struct ContentView: View { + @Environment(TranscriptStore.self) private var store + @State private var columnVisibility: NavigationSplitViewVisibility = .doubleColumn + + var body: some View { + @Bindable var store = store + + NavigationSplitView(columnVisibility: $columnVisibility) { + SidebarView() + .navigationSplitViewColumnWidth(min: 180, ideal: 220, max: 320) + } detail: { + detailContent + .frame(maxWidth: .infinity, maxHeight: .infinity) + } + .navigationSplitViewStyle(.balanced) + .toolbar { RecordingControls() } + .overlay(alignment: .top) { + if store.currentError != nil { + ErrorBannerView() + .transition(.move(edge: .top).combined(with: .opacity)) + } + } + .animation(.easeInOut(duration: 0.25), value: store.currentError != nil) + .task { + await store.runHealthCheck() + } + } + + @ViewBuilder + private var detailContent: some View { + if store.healthResult?.allGood == false && !store.hasActiveSession && store.modelState == .unloaded { + SetupGuideView() + } else if store.hasActiveSession { + TranscriptView( + text: store.liveTranscript, + isLive: store.isTranscribing, + isPaused: store.isPaused, + audioLevel: store.audioLevel, + statusMessage: store.statusMessage + ) + } else if let id = store.selectedSessionID, + let session = store.sessions.first(where: { $0.id == id }) { + TranscriptView(text: session.transcript, isLive: false) + .navigationTitle(session.displayTitle) + } else { + WelcomeView() + } + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/DictationOverlayView.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/DictationOverlayView.swift new file mode 100644 index 00000000..831099e0 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/DictationOverlayView.swift @@ -0,0 +1,73 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. 
+ * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. + */ + +import SwiftUI + +struct DictationOverlayView: View { + var store: TranscriptStore + + private var lineCount: Int { + let text = store.dictationText + guard !text.isEmpty else { return 0 } + let width: CGFloat = 268 // 300 - 32 padding + let font = NSFont.preferredFont(forTextStyle: .callout) + let attrs: [NSAttributedString.Key: Any] = [.font: font] + let size = (text as NSString).boundingRect( + with: CGSize(width: width, height: .greatestFiniteMagnitude), + options: [.usesLineFragmentOrigin, .usesFontLeading], + attributes: attrs + ) + return max(1, Int(ceil(size.height / font.pointSize))) + } + + private var textHeight: CGFloat { + let lines = lineCount + if lines <= 2 { return 40 } + return min(CGFloat(lines) * 18, 200) + } + + var body: some View { + VStack(spacing: 10) { + AudioLevelView(level: store.audioLevel, barCount: 20) + .frame(height: 36) + + if store.dictationText.isEmpty { + Text("Listening...") + .font(.callout) + .foregroundStyle(.secondary) + } else { + ScrollViewReader { proxy in + ScrollView { + Text(store.dictationText) + .font(.callout) + .frame(maxWidth: .infinity, alignment: .leading) + Color.clear.frame(height: 1).id("end") + } + .frame(height: textHeight) + .onChange(of: store.dictationText) { + withAnimation(.easeOut(duration: 0.15)) { + proxy.scrollTo("end", anchor: .bottom) + } + } + } + } + + Text("⌃Space to finish") + .font(.caption2) + .foregroundStyle(.tertiary) + } + .padding(16) + .frame(width: 300) + .animation(.easeInOut(duration: 0.2), value: textHeight) + .background(.ultraThinMaterial, in: RoundedRectangle(cornerRadius: 14)) + .overlay( + RoundedRectangle(cornerRadius: 14) + .strokeBorder(.quaternary, lineWidth: 0.5) + ) + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/DictationPanel.swift 
b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/DictationPanel.swift new file mode 100644 index 00000000..6b62c45b --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/DictationPanel.swift @@ -0,0 +1,46 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. + */ + +import AppKit +import SwiftUI + +final class DictationPanel: NSPanel { + init(contentView: some View) { + super.init( + contentRect: NSRect(x: 0, y: 0, width: 320, height: 140), + styleMask: [.nonactivatingPanel, .fullSizeContentView, .borderless], + backing: .buffered, + defer: true + ) + + level = .floating + isOpaque = false + backgroundColor = .clear + hasShadow = true + isMovableByWindowBackground = true + collectionBehavior = [.canJoinAllSpaces, .fullScreenAuxiliary] + animationBehavior = .utilityWindow + + let hosting = NSHostingView(rootView: contentView) + hosting.frame = contentRect(forFrameRect: frame) + self.contentView = hosting + } + + func showCentered() { + guard let screen = NSScreen.main else { return } + let screenFrame = screen.visibleFrame + let x = screenFrame.midX - frame.width / 2 + let y = screenFrame.midY - frame.height / 2 + 100 + setFrameOrigin(NSPoint(x: x, y: y)) + orderFrontRegardless() + } + + func dismiss() { + orderOut(nil) + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/ErrorBannerView.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/ErrorBannerView.swift new file mode 100644 index 00000000..87100128 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/ErrorBannerView.swift @@ -0,0 +1,39 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
+ */ + +import SwiftUI + +struct ErrorBannerView: View { + @Environment(TranscriptStore.self) private var store + + var body: some View { + if let error = store.currentError { + HStack(spacing: 12) { + Image(systemName: "exclamationmark.triangle.fill") + .foregroundStyle(.white) + Text(error.localizedDescription) + .font(.callout) + .foregroundStyle(.white) + .lineLimit(2) + Spacer() + Button { + store.clearError() + } label: { + Image(systemName: "xmark") + .foregroundStyle(.white.opacity(0.8)) + } + .buttonStyle(.plain) + } + .padding(.horizontal, 16) + .padding(.vertical, 10) + .background(.red.gradient, in: RoundedRectangle(cornerRadius: 8)) + .padding(.horizontal, 16) + .padding(.top, 8) + } + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/RecordingControls.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/RecordingControls.swift new file mode 100644 index 00000000..29cc878f --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/RecordingControls.swift @@ -0,0 +1,91 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. + */ + +import SwiftUI + +struct RecordingControls: ToolbarContent { + @Environment(TranscriptStore.self) private var store + + var body: some ToolbarContent { + ToolbarItem(placement: .primaryAction) { + HStack(spacing: 6) { + pauseResumeButton + if store.hasActiveSession { + endSessionButton + } + if store.isModelReady && !store.hasActiveSession { + unloadButton + } + } + } + } + + private var pauseResumeButton: some View { + Button { + Task { await store.togglePauseResume() } + } label: { + if store.isLoading { + ProgressView() + .controlSize(.small) + } else { + Label(pauseResumeLabel, systemImage: pauseResumeIcon) + .foregroundStyle(store.isTranscribing ? 
.orange : .primary) + } + } + .keyboardShortcut(store.isTranscribing ? "." : "R", + modifiers: store.isTranscribing ? .command : [.command, .shift]) + .disabled(store.isLoading) + .help(pauseResumeHelp) + } + + private var unloadButton: some View { + Button { + Task { await store.unloadModel() } + } label: { + Label("Unload Model", systemImage: "eject.fill") + } + .help("Unload model to free resources") + } + + private var endSessionButton: some View { + Button { + Task { await store.endSession() } + } label: { + Label("Done", systemImage: "checkmark.circle.fill") + } + .keyboardShortcut(.return, modifiers: .command) + .help("End session and save (⌘↩)") + } + + private var pauseResumeLabel: String { + switch store.sessionState { + case .idle: "Transcribe" + case .loading: "Loading..." + case .transcribing: "Pause" + case .paused: "Resume" + } + } + + private var pauseResumeIcon: String { + switch store.sessionState { + case .idle: "mic.fill" + case .loading: "hourglass" + case .transcribing: "pause.fill" + case .paused: "play.fill" + } + } + + private var pauseResumeHelp: String { + switch store.sessionState { + case .idle: "Start transcription (⌘⇧R)" + case .loading: "Loading model..." + case .transcribing: "Pause transcription (⌘.)" + case .paused: "Resume transcription (⌘⇧R)" + } + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/SettingsView.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/SettingsView.swift new file mode 100644 index 00000000..5a49cb90 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/SettingsView.swift @@ -0,0 +1,119 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
+ */ + +import SwiftUI + +struct SettingsView: View { + @Environment(Preferences.self) private var preferences + + var body: some View { + @Bindable var prefs = preferences + + TabView { + Form { + Section("Runner") { + LabeledContent("Binary path") { + HStack { + TextField("Path to voxtral_realtime_runner", text: $prefs.runnerPath) + .textFieldStyle(.roundedBorder) + browseButton(for: $prefs.runnerPath) + } + } + } + + Section("Model") { + LabeledContent("Model directory") { + HStack { + TextField("Path to model artifacts", text: $prefs.modelDirectory) + .textFieldStyle(.roundedBorder) + browseButton(for: $prefs.modelDirectory, directory: true) + } + } + + LabeledContent("Files") { + VStack(alignment: .leading, spacing: 4) { + fileStatus("model-metal-int4.pte", path: preferences.modelPath) + fileStatus("tekken.json", path: preferences.tokenizerPath) + fileStatus("preprocessor.pte", path: preferences.preprocessorPath) + } + } + } + } + .formStyle(.grouped) + .padding() + .tabItem { Label("General", systemImage: "gear") } + + Form { + Section("Silence Detection") { + LabeledContent("Silence threshold") { + VStack(alignment: .trailing, spacing: 4) { + Slider(value: $prefs.silenceThreshold, in: 0.005...0.1, step: 0.005) + .frame(width: 200) + Text(String(format: "%.3f RMS", preferences.silenceThreshold)) + .font(.caption) + .foregroundStyle(.secondary) + .monospacedDigit() + } + } + + LabeledContent("Auto-stop delay") { + VStack(alignment: .trailing, spacing: 4) { + Slider(value: $prefs.silenceTimeout, in: 0.5...5.0, step: 0.5) + .frame(width: 200) + Text(String(format: "%.1fs after silence", preferences.silenceTimeout)) + .font(.caption) + .foregroundStyle(.secondary) + .monospacedDigit() + } + } + + Text("Lower threshold = more sensitive (stops on softer sounds). Higher = only stops in true silence. 
Adjust based on your environment.") + .font(.caption) + .foregroundStyle(.tertiary) + } + + Section("Shortcut") { + LabeledContent("Dictation hotkey") { + Text("⌃Space") + .padding(.horizontal, 8) + .padding(.vertical, 4) + .background(.quaternary, in: RoundedRectangle(cornerRadius: 4)) + } + } + } + .formStyle(.grouped) + .padding() + .tabItem { Label("Dictation", systemImage: "mic.badge.plus") } + } + .frame(width: 500, height: 320) + } + + private func browseButton(for binding: Binding<String>, directory: Bool = false) -> some View { + Button("Browse...") { + let panel = NSOpenPanel() + panel.canChooseFiles = !directory + panel.canChooseDirectories = directory + panel.allowsMultipleSelection = false + if panel.runModal() == .OK, let url = panel.url { + binding.wrappedValue = url.path(percentEncoded: false) + } + } + .controlSize(.small) + } + + private func fileStatus(_ name: String, path: String) -> some View { + let exists = FileManager.default.fileExists(atPath: path) + return HStack(spacing: 4) { + Image(systemName: exists ? "checkmark.circle.fill" : "xmark.circle.fill") + .foregroundStyle(exists ? .green : .red) + .font(.caption) + Text(name) + .font(.caption) + } + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/SetupGuideView.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/SetupGuideView.swift new file mode 100644 index 00000000..f08bd013 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/SetupGuideView.swift @@ -0,0 +1,112 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
+ */ + +import SwiftUI + +struct SetupGuideView: View { + @Environment(TranscriptStore.self) private var store + @Environment(Preferences.self) private var preferences + + private var isBundledApp: Bool { + Bundle.main.resourcePath.map { + FileManager.default.fileExists(atPath: "\($0)/voxtral_realtime_runner") + } ?? false + } + + var body: some View { + VStack(spacing: 24) { + Image(systemName: "exclamationmark.triangle") + .font(.system(size: 48)) + .foregroundStyle(.orange) + + Text("Setup Required") + .font(.title2.bold()) + + if let result = store.healthResult { + VStack(alignment: .leading, spacing: 12) { + checkRow("Runner binary", ok: result.runnerAvailable) + checkRow("Model file", ok: result.modelAvailable) + checkRow("Preprocessor", ok: result.preprocessorAvailable) + checkRow("Tokenizer", ok: result.tokenizerAvailable) + checkRow("Microphone access", ok: result.micPermission == .authorized) + } + .padding() + .background(.background.secondary, in: RoundedRectangle(cornerRadius: 8)) + + if !result.missingFiles.isEmpty { + if isBundledApp { + bundledInstructions + } else { + developerInstructions + } + } + + if result.micPermission == .notDetermined { + Button("Grant Microphone Access") { + Task { + _ = await HealthCheck.requestMicrophoneAccess() + await store.runHealthCheck() + } + } + .buttonStyle(.borderedProminent) + } else if result.micPermission == .denied { + Text("Open System Settings → Privacy & Security → Microphone to grant access.") + .font(.caption) + .foregroundStyle(.secondary) + } + } + + Button("Recheck") { + Task { await store.runHealthCheck() } + } + .buttonStyle(.bordered) + } + .padding(40) + .frame(maxWidth: 500) + } + + private func checkRow(_ label: String, ok: Bool) -> some View { + HStack { + Image(systemName: ok ? "checkmark.circle.fill" : "xmark.circle.fill") + .foregroundStyle(ok ? 
.green : .red) + Text(label) + Spacer() + } + } + + private var bundledInstructions: some View { + VStack(alignment: .leading, spacing: 8) { + Text("App installation may be incomplete") + .font(.headline) + + Text("Model files are missing from the app bundle. Please download a complete DMG from the project releases or rebuild the app from source.") + .font(.caption) + .foregroundStyle(.secondary) + } + } + + private var developerInstructions: some View { + VStack(alignment: .leading, spacing: 8) { + Text("Download model files:") + .font(.headline) + + Text(""" + hf download mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch \\ + --local-dir ~/voxtral_realtime_quant_metal + """) + .font(.system(.caption, design: .monospaced)) + .textSelection(.enabled) + .padding(8) + .background(.background.tertiary, in: RoundedRectangle(cornerRadius: 4)) + + Text("Then configure paths in Settings (⌘,).") + .font(.caption) + .foregroundStyle(.secondary) + } + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/SidebarView.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/SidebarView.swift new file mode 100644 index 00000000..91524631 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/SidebarView.swift @@ -0,0 +1,162 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. + */ + +import SwiftUI + +struct SidebarView: View { + @Environment(TranscriptStore.self) private var store + @State private var searchText = "" + @State private var renamingSessionID: UUID? 
+ @State private var renameText = "" + + var body: some View { + @Bindable var store = store + + List(selection: $store.selectedSessionID) { + if store.hasActiveSession { + liveRow + } + + Section("History") { + ForEach(filteredSessions) { session in + sessionRow(session) + .tag(session.id) + .contextMenu { sessionContextMenu(session) } + } + } + } + .listStyle(.sidebar) + .searchable(text: $searchText, placement: .sidebar, prompt: "Search history") + .overlay { + if store.sessions.isEmpty && !store.hasActiveSession { + ContentUnavailableView( + "No History", + systemImage: "waveform", + description: Text("Transcriptions will appear here") + ) + } + } + .sheet(item: renamingBinding) { session in + RenameSheet(title: renameText) { newTitle in + store.renameSession(session, to: newTitle) + renamingSessionID = nil + } onCancel: { + renamingSessionID = nil + } + } + } + + private var renamingBinding: Binding<Session?> { + Binding( + get: { + guard let id = renamingSessionID else { return nil } + return store.sessions.first { $0.id == id } + }, + set: { _ in renamingSessionID = nil } + ) + } + + private var filteredSessions: [Session] { + if searchText.isEmpty { return store.sessions } + return store.sessions.filter { + $0.transcript.localizedCaseInsensitiveContains(searchText) || + $0.title.localizedCaseInsensitiveContains(searchText) + } + } + + private var liveRow: some View { + HStack { + if store.isPaused { + Image(systemName: "pause.fill") + .foregroundStyle(.orange) + .frame(width: 24) + } else { + AudioLevelView(level: store.audioLevel, barCount: 6) + .frame(width: 24) + } + VStack(alignment: .leading) { + Text(store.isPaused ? "Paused" : "Transcribing...") + .font(.headline) + Text(store.liveTranscript.prefix(60).description) + .font(.caption) + .foregroundStyle(.secondary) + .lineLimit(1) + } + } + .listRowBackground( + (store.isPaused ? 
Color.orange : Color.accentColor).opacity(0.08) + ) + } + + private func sessionRow(_ session: Session) -> some View { + VStack(alignment: .leading, spacing: 4) { + Text(session.displayTitle) + .font(.headline) + .lineLimit(1) + Text(session.transcript.prefix(80).description) + .font(.caption) + .foregroundStyle(.secondary) + .lineLimit(2) + HStack(spacing: 6) { + Text(session.date, format: .dateTime.month(.abbreviated).day().hour().minute()) + Text("·") + Text(formattedDuration(session.duration)) + } + .font(.caption2) + .foregroundStyle(.tertiary) + } + .padding(.vertical, 2) + } + + @ViewBuilder + private func sessionContextMenu(_ session: Session) -> some View { + Button("Rename...") { + renameText = session.title + renamingSessionID = session.id + } + Button("Copy Transcript") { + NSPasteboard.general.clearContents() + NSPasteboard.general.setString(session.transcript, forType: .string) + } + Divider() + Button("Delete", role: .destructive) { + store.deleteSession(session) + } + } + + private func formattedDuration(_ duration: TimeInterval) -> String { + let minutes = Int(duration) / 60 + let seconds = Int(duration) % 60 + return String(format: "%d:%02d", minutes, seconds) + } +} + +private struct RenameSheet: View { + @State var title: String + let onSave: (String) -> Void + let onCancel: () -> Void + + var body: some View { + VStack(spacing: 16) { + Text("Rename") + .font(.headline) + TextField("Title", text: $title) + .textFieldStyle(.roundedBorder) + .frame(minWidth: 250) + .onSubmit { onSave(title) } + HStack { + Button("Cancel", role: .cancel) { onCancel() } + .keyboardShortcut(.cancelAction) + Button("Save") { onSave(title) } + .keyboardShortcut(.defaultAction) + .disabled(title.trimmingCharacters(in: .whitespaces).isEmpty) + } + } + .padding(20) + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/TranscriptView.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/TranscriptView.swift new file mode 100644 index 
00000000..62a3e41b --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/TranscriptView.swift @@ -0,0 +1,111 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. + */ + +import SwiftUI + +struct TranscriptView: View { + let text: String + let isLive: Bool + var isPaused: Bool = false + var audioLevel: Float = 0 + var statusMessage: String = "" + + var body: some View { + ScrollViewReader { proxy in + ScrollView { + VStack(alignment: .leading, spacing: 0) { + if text.isEmpty && isLive { + listeningPlaceholder + } else { + Text(text) + .font(.body) + .textSelection(.enabled) + .frame(maxWidth: .infinity, alignment: .leading) + .padding() + } + + Color.clear + .frame(height: 1) + .id("bottom") + } + } + .onChange(of: text) { + withAnimation(.easeOut(duration: 0.15)) { + proxy.scrollTo("bottom", anchor: .bottom) + } + } + .onAppear { + proxy.scrollTo("bottom", anchor: .bottom) + } + } + .overlay(alignment: .topTrailing) { + if !text.isEmpty { + copyButton + .padding(8) + } + } + .overlay(alignment: .bottom) { + if isLive || isPaused { + statusIndicator + .padding(.bottom, 12) + } + } + } + + private var listeningPlaceholder: some View { + VStack(spacing: 16) { + Spacer() + AudioLevelView(level: audioLevel) + Text("Listening...") + .font(.title3) + .foregroundStyle(.secondary) + if !statusMessage.isEmpty { + Text(statusMessage) + .font(.caption) + .foregroundStyle(.tertiary) + } + Spacer() + } + .frame(maxWidth: .infinity, maxHeight: .infinity) + .padding() + } + + private var copyButton: some View { + Button { + NSPasteboard.general.clearContents() + NSPasteboard.general.setString(text, forType: .string) + } label: { + Label("Copy All", systemImage: "doc.on.doc") + .font(.caption) + } + .buttonStyle(.bordered) + .controlSize(.small) + .keyboardShortcut("C", modifiers: [.command, 
.shift]) + } + + private var statusIndicator: some View { + HStack(spacing: 8) { + if isPaused { + Image(systemName: "pause.fill") + .font(.caption2) + .foregroundStyle(.orange) + Text("Paused") + .font(.caption) + .foregroundStyle(.secondary) + } else { + AudioLevelView(level: audioLevel, barCount: 12) + Text("Transcribing") + .font(.caption) + .foregroundStyle(.secondary) + } + } + .padding(.horizontal, 14) + .padding(.vertical, 8) + .background(.ultraThinMaterial, in: Capsule()) + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/WelcomeView.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/WelcomeView.swift new file mode 100644 index 00000000..4661049a --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/Views/WelcomeView.swift @@ -0,0 +1,116 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
+ */ + +import SwiftUI + +struct WelcomeView: View { + @Environment(TranscriptStore.self) private var store + + var body: some View { + VStack(spacing: 24) { + Image(systemName: "waveform") + .font(.system(size: 56)) + .foregroundStyle(.secondary) + + Text("Voxtral Realtime") + .font(.title.bold()) + + Text("On-device speech transcription powered by ExecuTorch") + .font(.subheadline) + .foregroundStyle(.secondary) + + Spacer().frame(height: 8) + + modelSection + + if store.isModelReady { + Button { + Task { await store.startTranscription() } + } label: { + Label("Start Transcription", systemImage: "mic.fill") + .frame(minWidth: 180) + } + .buttonStyle(.borderedProminent) + .controlSize(.large) + .keyboardShortcut("R", modifiers: [.command, .shift]) + } + + shortcutHints + } + .padding(40) + .frame(maxWidth: 480) + } + + @ViewBuilder + private var modelSection: some View { + switch store.modelState { + case .unloaded: + Button { + Task { await store.preloadModel() } + } label: { + Label("Load Model", systemImage: "arrow.down.circle") + .frame(minWidth: 180) + } + .buttonStyle(.bordered) + .controlSize(.large) + + case .loading: + VStack(spacing: 12) { + ProgressView() + .controlSize(.regular) + Text(store.statusMessage) + .font(.callout) + .foregroundStyle(.secondary) + } + .padding() + .frame(minWidth: 220) + .background(.background.secondary, in: RoundedRectangle(cornerRadius: 10)) + + case .ready: + HStack(spacing: 8) { + Image(systemName: "checkmark.circle.fill") + .foregroundStyle(.green) + Text("Model loaded") + .foregroundStyle(.secondary) + Button { + Task { await store.unloadModel() } + } label: { + Label("Unload", systemImage: "xmark.circle") + .labelStyle(.iconOnly) + .font(.callout) + } + .buttonStyle(.plain) + .foregroundStyle(.secondary) + .help("Unload model to free resources") + } + .font(.callout) + } + } + + private var shortcutHints: some View { + HStack(spacing: 16) { + shortcutBadge("⌘⇧R", label: "Transcribe") + shortcutBadge("⌘.", label: 
"Pause") + shortcutBadge("⌘↩", label: "Done") + } + .padding(.top, 8) + } + + private func shortcutBadge(_ shortcut: String, label: String) -> some View { + VStack(spacing: 4) { + Text(shortcut) + .font(.caption.monospaced()) + .padding(.horizontal, 6) + .padding(.vertical, 3) + .background(.quaternary, in: RoundedRectangle(cornerRadius: 4)) + Text(label) + .font(.caption2) + .foregroundStyle(.tertiary) + } + } +} diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/VoxtralRealtime.entitlements b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/VoxtralRealtime.entitlements new file mode 100644 index 00000000..4f1a868a --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/VoxtralRealtime.entitlements @@ -0,0 +1,10 @@ + + + + + com.apple.security.cs.disable-library-validation + + com.apple.security.device.audio-input + + + diff --git a/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/VoxtralRealtimeApp.swift b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/VoxtralRealtimeApp.swift new file mode 100644 index 00000000..7c950dff --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/VoxtralRealtime/VoxtralRealtimeApp.swift @@ -0,0 +1,120 @@ +/* + * Copyright (c) Meta Platforms, Inc. and affiliates. + * All rights reserved. + * + * This source code is licensed under the BSD-style license found in the + * LICENSE file in the root directory of this source tree. 
+ */ + +import AVFoundation +import SwiftUI + +@main +struct VoxtralRealtimeApp: App { + @State private var preferences = Preferences() + @State private var store: TranscriptStore + @State private var dictation: DictationManager + + init() { + let prefs = Preferences() + let s = TranscriptStore(preferences: prefs) + _preferences = State(initialValue: prefs) + _store = State(initialValue: s) + _dictation = State(initialValue: DictationManager(store: s, preferences: prefs)) + } + + var body: some Scene { + WindowGroup { + ContentView() + .environment(store) + .environment(preferences) + .frame(minWidth: 600, minHeight: 400) + .task { + _ = await AVCaptureDevice.requestAccess(for: .audio) + dictation.registerHotKey() + } + .onReceive(NotificationCenter.default.publisher(for: NSApplication.didBecomeActiveNotification)) { _ in + Task { await store.runHealthCheck() } + } + } + .defaultSize(width: 900, height: 600) + .windowToolbarStyle(.unified) + .commands { + CommandGroup(replacing: .newItem) {} + + CommandMenu("Transcription") { + switch store.sessionState { + case .idle: + Button("Start Transcription") { + Task { await store.startTranscription() } + } + .keyboardShortcut("R", modifiers: [.command, .shift]) + + case .loading: + Button("Loading...") {} + .disabled(true) + + case .transcribing: + Button("Pause") { + Task { await store.pauseTranscription() } + } + .keyboardShortcut(".", modifiers: .command) + + case .paused: + Button("Resume") { + Task { await store.resumeTranscription() } + } + .keyboardShortcut("R", modifiers: [.command, .shift]) + } + + if store.hasActiveSession { + Button("End Session") { + Task { await store.endSession() } + } + .keyboardShortcut(.return, modifiers: .command) + } + + Divider() + + if store.isModelReady && !store.hasActiveSession { + Button("Unload Model") { + Task { await store.unloadModel() } + } + .keyboardShortcut("U", modifiers: [.command, .shift]) + + Divider() + } + + Button("Copy Transcript") { + let text = 
store.hasActiveSession ? store.liveTranscript : selectedSessionTranscript + guard !text.isEmpty else { return } + NSPasteboard.general.clearContents() + NSPasteboard.general.setString(text, forType: .string) + } + .keyboardShortcut("C", modifiers: [.command, .shift]) + .disabled(currentTranscript.isEmpty) + } + + CommandMenu("Dictation") { + Button(dictation.isListening ? "Stop Dictation" : "Start Dictation") { + Task { await dictation.toggle() } + } + .disabled(!store.isModelReady && store.modelState != .unloaded) + } + } + + Settings { + SettingsView() + .environment(preferences) + } + } + + private var selectedSessionTranscript: String { + guard let id = store.selectedSessionID else { return "" } + return store.sessions.first(where: { $0.id == id })?.transcript ?? "" + } + + private var currentTranscript: String { + store.hasActiveSession ? store.liveTranscript : selectedSessionTranscript + } +} diff --git a/apps/macos/VoxtralRealtimeApp/docs/context.md b/apps/macos/VoxtralRealtimeApp/docs/context.md new file mode 100644 index 00000000..0169bd17 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/docs/context.md @@ -0,0 +1,281 @@ +# Project Context + +Living document of architectural decisions, constraints, and accumulated knowledge. Read this before starting any new task. + +--- + +## Architecture overview + +``` +VoxtralRealtimeApp (@main) +├── TranscriptStore (@Observable, @MainActor) — single source of truth +│ ├── SessionState: idle → loading → transcribing ⇆ paused → idle +│ ├── ModelState: unloaded → loading → ready +│ └── RunnerBridge (actor) — owns the runner subprocess +│ ├── Process (voxtral_realtime_runner C++ binary) +│ │ ├── stdin ← raw 16kHz mono f32le PCM bytes +│ │ ├── stdout → transcript tokens (parsed line-by-line) +│ │ └── stderr → status messages ("Loading model...", etc.) 
+│ └── AudioEngine (actor) +│ └── AVAudioEngine → AVAudioConverter (resample to 16kHz mono) → pipe to stdin +├── DictationManager (@Observable, @MainActor) +│ ├── Global hotkey (Carbon RegisterEventHotKey, Ctrl+Space) +│ ├── DictationPanel (NSPanel, non-activating, floating, all spaces) +│ └── Paste via CGEvent (simulated Cmd+V to frontmost app) +└── Views (SwiftUI, NavigationSplitView) + ├── Sidebar: SidebarView (session list, search, context menus) + ├── Detail: TranscriptView | WelcomeView | SetupGuideView + ├── Toolbar: RecordingControls + ├── Overlay: ErrorBannerView + └── Settings: SettingsView (2 tabs: General, Dictation) +``` + +### Data flow + +1. User taps Record → `TranscriptStore.startTranscription()` +2. Store calls `runner.launchRunner()` if not already running → spawns `voxtral_realtime_runner` process +3. `launchRunner()` returns a `Streams` struct with 5 `AsyncStream`s (tokens, errors, audioLevel, status, modelState) +4. Store spawns a `TaskGroup` with five concurrent consumers, one per stream +5. Store calls `runner.startAudioCapture()` → `AudioEngine.startCapture(writingTo: stdinFileHandle)` +6. AVAudioEngine tap callback: converts to 16kHz mono f32le → computes RMS via vDSP → writes bytes to stdin pipe +7. Runner process reads stdin, runs inference, writes tokens to stdout +8. RunnerBridge reads stdout on `DispatchQueue.global`, yields tokens via `AsyncStream` +9. Store appends tokens to `liveTranscript` → SwiftUI auto-updates +10. Stop → audio capture stops, stdin pipe stays open, runner stays alive (model remains loaded) +11.
End session → creates `Session` value, saves to JSON, resets state + +### Concurrency model + +- `AudioEngine` and `RunnerBridge` are **custom actors** isolating I/O from the main thread +- Audio writes go directly from the AVAudioEngine tap to the stdin `FileHandle` — no extra actor hop +- Cross-isolation communication uses **5 `AsyncStream`s**: tokens, errors, audioLevel, status, modelState +- All UI state is `@MainActor`-isolated `@Observable` classes +- No `Task.detached` — structured child tasks via `TaskGroup` + +### Dependency injection + +`VoxtralRealtimeApp.init()` builds the object graph: +``` +Preferences → TranscriptStore(preferences:) → DictationManager(store:, preferences:) +``` +All three are injected into the SwiftUI environment via `.environment()`. Views access them with `@Environment(TranscriptStore.self)`. + +--- + +## Decisions + +| # | Decision | Rationale | Date | Status | +|---|---|---|---|---| +| 1 | Shell out to `voxtral_realtime_runner` via `Process` instead of linking ExecuTorch C++ | Simpler integration, runner already works as CLI, avoids complex C++/Swift bridging | Pre-project | active | +| 2 | JSON file persistence instead of SwiftData | Simpler, no migration headaches, sufficient for session list. 
File at `~/Library/Application Support/VoxtralRealtime/sessions.json` | Pre-project | active | +| 3 | `@Observable` + `@MainActor` pattern for state (not Combine/ObservableObject) | Modern Swift concurrency, less boilerplate, requires macOS 14+ | Pre-project | active | +| 4 | Custom actors for AudioEngine and RunnerBridge (not classes with locks) | Swift actor isolation provides thread safety without manual locking | Pre-project | active | +| 5 | 5 separate AsyncStreams from RunnerBridge (not a single multiplexed stream) | Each stream has different consumer behavior; keeps token stream unpolluted by errors/status | Pre-project | active | +| 6 | Carbon API for global hotkey (not modern alternative) | No modern macOS API exists for global hotkeys; Carbon `RegisterEventHotKey` is the standard approach | Pre-project | active | +| 7 | `NSPanel` (non-activating) for dictation overlay | Must not steal focus from the target app during dictation | Pre-project | active | +| 8 | Audio pipeline writes directly to FileHandle (no intermediate buffer) | Minimizes latency; 80ms PCM chunks go straight to runner stdin | Pre-project | active | +| 9 | Model preloading — runner stays alive between sessions | Avoids 30s+ model reload on each transcription start | Pre-project | active | +| 10 | XcodeGen (`project.yml`) for project generation | Keeps project config in version-controlled YAML, avoids Xcode project merge conflicts | Pre-project | active | +| 11 | Bundled runner + libomp + models via build script | Post-compile script in `project.yml` copies runner binary, patches `install_name_tool` for libomp, copies model files into .app Resources | Pre-project | active | +| 12 | Bundle models in .app via build script, DMG ships self-contained | Non-technical users should not need to download models separately | Pre-project | active | +| 13 | Preferences backed by UserDefaults with `didSet` writes | Simple, no framework dependency, auto-persists | Pre-project | active | +| 14 | Distribute 
via GitHub Releases DMG (not git-bundled models) | Model files are ~6.2 GB, too large for git. Developer builds DMG with models bundled, uploads to Releases. End users download DMG. | 2026-03-05 | active | +| 15 | `create_dmg.sh` validates bundled files before creating DMG | Prevents shipping an incomplete DMG missing model weights | 2026-03-05 | active | +| 16 | `scripts/build.sh` automates full pipeline | One command: check prereqs → xcodegen → xcodebuild → create DMG. Supports `--download-models` flag. | 2026-03-05 | active | + + + +--- + +## Constraints + +- Building requires `conda activate et-metal` — a dedicated conda env with ExecuTorch installed from a fresh `git clone` to `~/executorch` (Metal/MPS backend). Runner build, model download, and CLI testing all depend on it. +- Runner binary is a pre-built C++ executable (`voxtral_realtime_runner`) from ExecuTorch +- Audio format: 16kHz mono f32le PCM piped to runner's stdin (no WAV header, raw bytes) +- Model artifacts are ~6.2 GB total — not in git; developer downloads from HF, build script bundles into .app, DMG ships self-contained +- Notarization requires Hardened Runtime; App Sandbox is only needed for Mac App Store distribution. Signing is not configured yet, and the current entitlements only disable library validation and enable audio input +- macOS 14+ minimum (for `@Observable`, modern SwiftUI APIs) +- Apple Silicon only (Metal backend requires M1+) +- Runner expects exactly these CLI args: `--model_path`, `--tokenizer_path`, `--preprocessor_path`, `--mic` +- Runner's stdout protocol: emits `"Listening"` once model is ready, then transcript tokens; `PyTorchObserver` stats lines must be filtered +- Runner's stderr protocol: emits status strings like "Loading model", "Loading tokenizer", "Warming up", "Warmup complete" +- System-wide dictation requires Accessibility permission (for CGEvent paste simulation) +- Global hotkey uses Carbon API — cannot be changed to a modern API (none exists) +- No third-party dependencies — pure Apple frameworks (AVFoundation, CoreAudio, Accelerate, Carbon) +- `libomp` must be installed via
Homebrew and bundled in the .app for the runner to work + +--- + +## Known issues + +| Issue | Impact | Workaround | Status | +|---|---|---|---| +| Spin-wait polling for model load state | `ensureRunnerLaunched()` uses `while modelState == .loading { Task.sleep(100ms) }` — wasteful polling loop | Works but not ideal; could use `AsyncStream` signal or continuation | open | +| `DictationManager.cleanup()` uses `MainActor.assumeIsolated` in nonisolated context | Fragile pattern — crashes if called off main actor | Currently safe because only called from `deinit` which runs on main actor | open | +| Signing and notarization not configured | App won't pass Gatekeeper when distributed; needs Developer ID signing with Hardened Runtime (current entitlements only disable library validation and enable audio input) | Only affects distribution, not development builds | open | +| No error recovery for pipe write failures | AudioEngine logs first 3 write errors then suppresses; relies on runner termination handler | Runner crash eventually triggers error flow | open | +| AccentColor in assets has no custom color values | Uses system default accent color | Functional, just not branded | open | +| Accessibility permission prompt timing | `AXIsProcessTrustedWithOptions` prompt fires on startup even if user doesn't need dictation | Could defer to first dictation use | open | +| Debug builds invalidate Accessibility trust | Each Xcode rebuild changes binary signature; macOS requires re-granting Accessibility permission | Use Release build for testing dictation, or remove/re-add in System Settings | mitigated | + + + +--- + +## What works + +Things confirmed working — don't re-investigate these: + +- Full transcription pipeline: mic → AVAudioEngine → resample to 16kHz mono f32le → pipe to runner stdin → tokens from stdout → live UI +- Model preloading: runner launches once, stays alive across sessions (avoids ~30s reload) +- Pause / resume: stops audio capture but keeps runner process alive +- Session persistence: JSON save/load at `~/Library/Application Support/VoxtralRealtime/sessions.json` +-
Session management: create, rename, delete, search, copy transcript +- System-wide dictation: Ctrl+Space → floating panel → auto-paste via CGEvent Cmd+V +- Silence detection in dictation mode: polls audioLevel every 250ms, auto-stops after configured timeout +- Audio level visualization: RMS computed via vDSP, animated waveform bars with color transitions +- Health check: validates runner binary, model files, mic permission on startup +- DMG creation: `scripts/create_dmg.sh` with drag-to-Applications layout +- Build script: bundles runner binary, libomp.dylib, model files into .app Resources via post-compile script +- Full build pipeline: `scripts/build.sh` validates prereqs, builds app, creates DMG in one command +- DMG validation: `create_dmg.sh` refuses to create DMG if runner/models/libomp are missing from .app bundle +- Runner stdout parsing: detects "Listening" for model ready, filters PyTorchObserver stats, strips ANSI escapes +- Runner stderr parsing: extracts status messages for UI (Loading model, Loading tokenizer, Warming up, etc.) +- Error flow: runner crash → RunnerError → ErrorBannerView with dismiss +- Settings window: runner path, model directory, silence threshold/timeout (2 tabs) +- Keyboard shortcuts: Cmd+Shift+R (start/resume), Cmd+. 
(pause), Cmd+Return (end), Cmd+Shift+C (copy), Cmd+Shift+U (unload), Ctrl+Space (dictation) + +--- + +## What doesn't work + +Approaches tried and abandoned — don't retry without new information: + +- (none yet — no failed approaches to record) + +--- + +## Feature status + +| # | Feature | Status | Notes | +|---|---|---|---| +| 1 | Record / Stop | done | Toggle via toolbar + shortcuts | +| 2 | Model loading status | done | ProgressView + stderr-parsed status messages | +| 3 | Live scrolling transcript | done | Auto-scroll with ScrollViewReader | +| 4 | Copy transcript | done | Button + Cmd+Shift+C | +| 5 | Keyboard shortcuts | done | 6 shortcuts registered | +| 6 | Session history sidebar | done | List with selection, live row | +| 7 | Persist sessions | done | JSON file (not SwiftData) | +| 8 | Search sessions | done | `.searchable` in sidebar | +| 9 | Delete / rename sessions | done | Context menu + rename sheet | +| 10 | Export session (.txt/.srt/.json) | not started | — | +| 11 | Menu bar | partial | Transcription + Dictation menus; missing View, Export | +| 12 | Toolbar | done | Pause/Resume, End, Unload | +| 13 | Resizable window | done | Default 900×600 | +| 14 | Fullscreen / Split View | not started | No special handling | +| 15 | Settings window | done | 2 tabs (General, Dictation) | +| 16 | Audio input device picker | not started | `Preferences.audioDeviceID` exists but no UI | +| 17 | Audio level indicator | done | Animated waveform bars | +| 18 | Silence detection | done | Dictation mode only | +| 19 | Liquid Glass style | not started | Plan says macOS 26+ glassEffect | +| 20 | Light / dark mode | implicit | SwiftUI default behavior | +| 21 | Animations | done | Scroll, waveform, error banner transitions | +| 22 | Accessibility (VoiceOver) | not started | No accessibilityLabel/Hint on controls | +| 23 | Bundled runner binary | done | Build script in project.yml | +| 24 | First-run model download | not started | Manual download only; setup guide shows 
instructions | +| 25 | Developer ID signing + notarization | not started | Signing not configured yet | +| 26 | DMG installer | done | `scripts/create_dmg.sh` | +| 27 | Sparkle auto-update | not started | Stretch goal | +| 28 | System-wide dictation | done | Ctrl+Space, floating panel, auto-paste | +| 29 | Pause / Resume | done | Audio stops, runner stays alive | +| 30 | Model preloading | done | Load once, transcribe instantly | + +--- + +## Code patterns and conventions + +When adding new features, follow these existing patterns: + +- **State**: Add new state properties to `TranscriptStore` as `@Observable` properties. Views read directly, no bindings needed for read-only. +- **Two-way bindings in views**: Use `@Bindable var store = store` inside `body` to create bindings from `@Environment`. +- **New services**: Create as a Swift `actor` if it does I/O. Communicate with `TranscriptStore` via `AsyncStream`. +- **Views**: All views get state from `@Environment(TranscriptStore.self)` and/or `@Environment(Preferences.self)`. +- **Errors**: Add new cases to `RunnerError` enum. Set `store.currentError` to display in `ErrorBannerView`. +- **Logging**: Use `os.Logger(subsystem: "com.younghan.VoxtralRealtime", category: "YourCategory")`. +- **File references**: The planned `BundleResources.swift` was never created. Path resolution lives in `Preferences.init()` which auto-detects bundled vs. filesystem paths. +- **Keyboard shortcuts**: Register in `VoxtralRealtimeApp.swift` menu commands section. +- **Settings**: Add UI to `SettingsView.swift`, backing property to `Preferences.swift` with `UserDefaults` persistence. +- **No third-party deps**: Keep it pure Apple frameworks. If you need a dep, document it as a decision.
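
For quick experiments outside the app, the runner stdout protocol described under Constraints (emit "Listening" when ready, filter `PyTorchObserver` stats, strip ANSI escapes) can be sketched as a small line classifier. This is a Python illustration of what `RunnerBridge` does in Swift; the function name, return shape, and the exact `"Listening"` match are illustrative assumptions, not the app's actual API:

```python
import re

# ANSI color escapes the runner may emit; stripped before parsing.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")

def classify_stdout_line(line: str):
    """Classify one runner stdout line per the protocol above.

    Returns a (kind, text) pair where kind is "ready" (model loaded),
    "stats" (PyTorchObserver line to filter out), or "token".
    """
    clean = ANSI_ESCAPE.sub("", line).rstrip("\n")
    if clean.strip() == "Listening":
        return ("ready", "")
    if clean.lstrip().startswith("PyTorchObserver"):
        return ("stats", clean)
    return ("token", clean)
```

Feeding the runner's stdout through a classifier like this keeps the token stream clean, matching the "5 separate AsyncStreams" decision above.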
+ +--- + +## Key file paths + +### Source code + +| What | Path (relative to `apps/macos/VoxtralRealtimeApp/`) | +|---|---| +| App entry point | `VoxtralRealtime/VoxtralRealtimeApp.swift` | +| Central state machine | `VoxtralRealtime/Models/TranscriptStore.swift` | +| Session model | `VoxtralRealtime/Models/Session.swift` | +| User preferences | `VoxtralRealtime/Models/Preferences.swift` | +| Audio capture actor | `VoxtralRealtime/Services/AudioEngine.swift` | +| Runner process actor | `VoxtralRealtime/Services/RunnerBridge.swift` | +| Global dictation manager | `VoxtralRealtime/Services/DictationManager.swift` | +| Startup validation | `VoxtralRealtime/Services/HealthCheck.swift` | +| Error types | `VoxtralRealtime/Utilities/RunnerError.swift` | +| Root view (NavigationSplitView) | `VoxtralRealtime/Views/ContentView.swift` | +| Session list | `VoxtralRealtime/Views/SidebarView.swift` | +| Live transcript display | `VoxtralRealtime/Views/TranscriptView.swift` | +| Toolbar buttons | `VoxtralRealtime/Views/RecordingControls.swift` | +| Waveform visualization | `VoxtralRealtime/Views/AudioLevelView.swift` | +| Preferences window | `VoxtralRealtime/Views/SettingsView.swift` | +| First-run guide | `VoxtralRealtime/Views/SetupGuideView.swift` | +| Landing page | `VoxtralRealtime/Views/WelcomeView.swift` | +| Error banner overlay | `VoxtralRealtime/Views/ErrorBannerView.swift` | +| Dictation HUD content | `VoxtralRealtime/Views/DictationOverlayView.swift` | +| Floating NSPanel | `VoxtralRealtime/Views/DictationPanel.swift` | + +### Build configuration + +| What | Path | +|---|---| +| XcodeGen spec | `project.yml` | +| Entitlements | `VoxtralRealtime/VoxtralRealtime.entitlements` | +| Info.plist | `VoxtralRealtime/Info.plist` | +| App icon assets | `VoxtralRealtime/Resources/Assets.xcassets/` | +| DMG build script | `scripts/create_dmg.sh` | +| Full build pipeline | `scripts/build.sh` | + +### External dependencies + +| What | Path | +|---|---| +| Plan | `docs/plan.md` | 
+| Progress log | `docs/progress.md` | +| Runner source | `~/executorch/examples/models/voxtral_realtime` | +| Runner binary | `~/executorch/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner` | +| Model dir (HF) | `~/voxtral_realtime_quant_metal` | +| Model (Metal int4) | `~/voxtral_realtime_quant_metal/model-metal-int4.pte` | +| Preprocessor | `~/voxtral_realtime_quant_metal/preprocessor.pte` | +| Tokenizer | `~/voxtral_realtime_quant_metal/tekken.json` | +| Mic streamer (CLI test) | `~/voxtral_realtime_quant_metal/stream_audio.py` | +| Session data | `~/Library/Application Support/VoxtralRealtime/sessions.json` | + +### Runtime requirements + +| Requirement | How to install | +|---|---| +| libomp | `brew install libomp` | +| ExecuTorch (Metal) | `conda activate et-metal && cd ~/executorch && EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_executorch.sh` | +| Runner binary | `conda activate et-metal && cd ~/executorch && make voxtral_realtime-metal` | +| Model artifacts | `hf download mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch --local-dir ~/voxtral_realtime_quant_metal` | +| XcodeGen | `brew install xcodegen` | + +--- + +## Open questions + +- [ ] Use `Process` to shell out to the runner, or link ExecuTorch C++ as a Swift package? (Currently using Process) +- [ ] Metal vs XNNPACK as default — benchmark both on M1/M2/M3 before deciding. +- [ ] Minimum macOS version: 14 (Sonoma, current) vs 15 (Sequoia) for latest SwiftUI APIs? 
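
One number worth keeping handy when working on the audio pipeline: at 16 kHz mono f32le, an 80 ms chunk is a fixed byte count. A small Python sanity check (the constants come from the Constraints section above; the helper names are illustrative, not part of the codebase):

```python
import struct

SAMPLE_RATE = 16_000   # Hz, mono (runner stdin format)
BYTES_PER_SAMPLE = 4   # f32le
CHUNK_MS = 80          # chunk size used by the audio pipeline

def chunk_byte_count(ms: int = CHUNK_MS) -> int:
    """Bytes per chunk of raw f32le PCM at 16 kHz mono."""
    return SAMPLE_RATE * ms // 1000 * BYTES_PER_SAMPLE

def silence_chunk(ms: int = CHUNK_MS) -> bytes:
    """One chunk of silence: raw bytes, no WAV header, as the runner expects."""
    samples = SAMPLE_RATE * ms // 1000
    return struct.pack(f"<{samples}f", *([0.0] * samples))
```

So each 80 ms write to the runner's stdin is 1280 samples, or 5120 bytes; a byte count that is not a multiple of 4 indicates a framing bug upstream.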
diff --git a/apps/macos/VoxtralRealtimeApp/docs/demo.mp4 b/apps/macos/VoxtralRealtimeApp/docs/demo.mp4 new file mode 100644 index 00000000..6cf70ad9 Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/docs/demo.mp4 differ diff --git a/apps/macos/VoxtralRealtimeApp/docs/plan.md b/apps/macos/VoxtralRealtimeApp/docs/plan.md new file mode 100644 index 00000000..3dc49f49 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/docs/plan.md @@ -0,0 +1,306 @@ +# Voxtral Realtime — macOS App + +A native macOS app that runs Mistral's Voxtral-Mini-4B-Realtime on-device for real-time voice transcription. All inference runs locally via ExecuTorch — no cloud, no network required. + +## Goal + +Wrap the ExecuTorch `voxtral_realtime_runner` C++ binary in a polished SwiftUI macOS app with mic capture, live transcript display, session management, and DMG distribution. + +--- + +## Inference Runtime + +The app shells out to the pre-built runner binary. Audio is captured natively (AVFoundation / CoreAudio) and piped as 16kHz mono f32le PCM to stdin. + +```bash +./stream_audio.py | \ + voxtral_realtime_runner \ + --model_path ./model-metal-int4.pte \ + --tokenizer_path ./tekken.json \ + --preprocessor_path ./preprocessor.pte \ + --mic +``` + +Runner source: `~/executorch/examples/models/voxtral_realtime` + +### Model artifacts + +Pre-quantized 4-bit Metal checkpoint from HuggingFace — no manual export needed.
+ +| Artifact | Purpose | Local path | +|---|---|---| +| `model-metal-int4.pte` | 4-bit quantized streaming encoder + decoder (Metal) | `~/voxtral_realtime_quant_metal/model-metal-int4.pte` | +| `preprocessor.pte` | Audio-to-mel spectrogram | `~/voxtral_realtime_quant_metal/preprocessor.pte` | +| `tekken.json` | Tokenizer | `~/voxtral_realtime_quant_metal/tekken.json` | +| `stream_audio.py` | Mic capture script (16kHz mono f32le PCM to stdout) | `~/voxtral_realtime_quant_metal/stream_audio.py` | + +HF repo: [`mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch`](https://huggingface.co/mistral-labs/Voxtral-Mini-4B-Realtime-2602-Executorch) + +#### Download models + +```bash +export LOCAL_FOLDER="$HOME/voxtral_realtime_quant_metal" +hf download mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch --local-dir ${LOCAL_FOLDER} +``` + +### Build the runner + +```bash +export EXECUTORCH_PATH="$HOME/executorch" + +# Install ExecuTorch with Metal backend +cd ${EXECUTORCH_PATH} && \ + EXECUTORCH_BUILD_KERNELS_TORCHAO=1 \ + TORCHAO_BUILD_EXPERIMENTAL_MPS=1 \ + ./install_executorch.sh + +# Build the voxtral realtime runner +cd ${EXECUTORCH_PATH} && make voxtral_realtime-metal +``` + +The runner binary lands at: +``` +${EXECUTORCH_PATH}/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner +``` + +#### Runtime dependencies + +```bash +brew install libomp +export DYLD_LIBRARY_PATH=/usr/lib:$(brew --prefix libomp)/lib +pip install sounddevice # in the executorch conda env +``` + +### Run (CLI, for testing) + +```bash +export EXECUTORCH_PATH="$HOME/executorch" +export CMAKE_RUNNER="${EXECUTORCH_PATH}/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner" +export LOCAL_FOLDER="$HOME/voxtral_realtime_quant_metal" + +cd ${LOCAL_FOLDER} && chmod +x stream_audio.py && +./stream_audio.py | \ + ${CMAKE_RUNNER} \ + --model_path ./model-metal-int4.pte \ + --tokenizer_path ./tekken.json \ + --preprocessor_path ./preprocessor.pte \ + --mic +``` + +### Backend options + 
+| Backend | Hardware | Quantization | Notes | +|---|---|---|---| +| Metal | Apple GPU | int4 (`model-metal-int4.pte`) | Pre-quantized from HF, recommended for M-series | +| XNNPACK | CPU | `8da4w` | Requires manual export | + +Default: Metal on Apple Silicon (uses the HF pre-quantized checkpoint). + +--- + +## Architecture + +``` +┌─────────────────────────────────────────────────────┐ +│ VoxtralApp │ +│ (SwiftUI App) │ +├─────────────────────────────────────────────────────┤ +│ │ +│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Sidebar │ │ Transcript │ │ Toolbar / │ │ +│ │ Sessions │ │ DetailView │ │ StatusBar │ │ +│ └────┬─────┘ └──────┬───────┘ └──────┬───────┘ │ +│ │ │ │ │ +│ └───────────────┼─────────────────┘ │ +│ │ │ +│ ┌────────▼────────┐ │ +│ │ TranscriptStore │ @Observable │ +│ │ (shared state) │ @MainActor │ +│ └────────┬────────┘ │ +│ │ │ +│ ┌─────────────┼─────────────┐ │ +│ ▼ ▼ │ +│ ┌─────────────┐ ┌─────────────────┐ │ +│ │ AudioEngine │ │ RunnerBridge │ │ +│ │ (actor) │──PCM──▶ │ (actor) │ │ +│ │ CoreAudio │ pipe │ Process + stdin │ │ +│ └─────────────┘ │ stdout parsing │ │ +│ └─────────────────┘ │ +└─────────────────────────────────────────────────────┘ +``` + +### Key types + +| Type | Isolation | Responsibility | +|---|---|---| +| `VoxtralApp` | @MainActor | App entry, scene, menu commands | +| `TranscriptStore` | @MainActor, @Observable | Sessions list, active transcript text, recording state | +| `AudioEngine` | actor | CoreAudio capture → 16kHz mono f32le PCM stream | +| `RunnerBridge` | actor | Spawn `voxtral_realtime_runner` as `Process`, pipe PCM to stdin, parse stdout/stderr | +| `HealthCheck` | nonisolated | Startup validation: binary exists, model files present, mic permission | +| `Session` | value type (Sendable) | Immutable snapshot: id, date, transcript text, duration | +| `RunnerError` | value type (Sendable) | Error cases: binary not found, model missing, crash, permission denied | +| `Preferences` | @MainActor, 
@Observable | Model paths, backend selection, audio device | + +### Data flow + +1. User taps Record (or Cmd+Shift+R). +2. `TranscriptStore` tells `RunnerBridge` to start a session. +3. `RunnerBridge` spawns the runner process, obtaining a `FileHandle` to its stdin pipe. +4. `RunnerBridge` tells `AudioEngine` to start capture, passing the stdin `FileHandle` directly. +5. `AudioEngine` opens a CoreAudio input unit, writes 80ms PCM chunks directly to the stdin pipe (no intermediate actor hop). +6. `RunnerBridge` reads the runner's stdout line-by-line, parses transcript tokens. +7. New tokens flow back to `TranscriptStore` via `AsyncStream`. +8. `TranscriptStore` appends to the live transcript; SwiftUI updates the view. +9. User taps Stop (or Cmd+.). `TranscriptStore` tells `RunnerBridge` to stop. +10. `RunnerBridge` tells `AudioEngine` to stop capture, then closes stdin. Runner flushes remaining text before exiting. + +### Error flow + +1. `RunnerBridge` reads stderr asynchronously alongside stdout. +2. If the runner process exits with non-zero status or emits to stderr, `RunnerBridge` yields an error through a separate `AsyncStream`. +3. `TranscriptStore` receives the error, sets a published `currentError` property, and stops recording. +4. The UI displays an inline error banner (not a modal alert) with a retry action. + +Common failure modes: +- **Runner binary not found** — caught at startup by `HealthCheck`, surfaces a setup guide. +- **Model files missing/corrupt** — runner exits immediately with stderr message. +- **Runner crash mid-session** — `Process.terminationHandler` fires, `RunnerBridge` yields error, UI shows "transcription interrupted" with the partial transcript preserved. +- **Microphone permission denied** — `AudioEngine` checks `AVCaptureDevice.authorizationStatus` before opening; surfaces a permission prompt. + +### Concurrency design + +- `AudioEngine` and `RunnerBridge` are custom actors to isolate audio I/O and process management from the UI. 
+- `AudioEngine` writes PCM chunks directly to the runner's stdin `FileHandle` — avoids an extra actor hop per 80ms chunk. The `FileHandle` is safe to write from a single actor. +- Transcript tokens flow via `AsyncStream` — structured concurrency, no callbacks. +- Errors flow via a separate `AsyncStream` to avoid polluting the token stream. +- All UI state lives in `@MainActor`-isolated `@Observable` classes. +- No `Task.detached` unless profiling shows a need. Prefer structured child tasks. + +--- + +## Features + +### Core (MVP) + +1. **Record / Stop** — single toggle; mic capture starts/stops. +2. **Model loading status** — show progress while runner initializes (loading .pte files). +3. **Live scrolling transcript** — text appears token-by-token in real time, auto-scrolls to bottom. +4. **Copy transcript** — select text or one-click "Copy All" button. +5. **Keyboard shortcuts** — Cmd+Shift+R record, Cmd+. stop, Cmd+C copy selection, Cmd+Shift+C copy all. (Cmd+R avoided — conflicts with system "show ruler" in text views.) + +### Session management + +6. **Session history sidebar** — each recording saved as a session with timestamp and transcript. +7. **Persist sessions** — SwiftData (macOS 14+) for free search, sorting, and migration; JSON file fallback if targeting older OS. +8. **Search sessions** — Cmd+F to filter past transcripts by keyword. +9. **Delete / rename sessions** — right-click context menu. +10. **Export session** — save as `.txt`, `.srt` (subtitles), or `.json` with timestamps. + +### macOS integration + +11. **Menu bar** — App, Edit (Undo/Redo/Copy/Paste/Select All), View (toggle sidebar), Transcription (Record/Stop/Export), Window, Help. +12. **Toolbar** — record button (red dot when active), export, settings gear. +13. **Resizable window** — min 500x400, remembers position/size across launches. +14. **Fullscreen / Split View** support. +15. 
**Settings window** (Cmd+,) — model paths, audio input device picker, backend (Metal / XNNPACK), appearance. + +### Audio + +16. **Audio input device selection** — pick from available mics in Settings. +17. **Audio level indicator** — subtle waveform or level meter while recording. +18. **Silence detection** — visual indicator when no speech detected (dims the record button). + +### Polish + +19. **Liquid Glass style** — translucent window chrome, glass-effect toolbar and sidebar, matching macOS 26+ aesthetics with fallback for older OS. +20. **Light / dark mode** — respect system appearance. +21. **Animations** — smooth transcript scroll, record button pulse, fade-in for new tokens. +22. **Accessibility** — VoiceOver labels on all controls, Dynamic Type support, keyboard-navigable. + +### Distribution + +23. **Bundled runner binary** — `voxtral_realtime_runner` embedded in the .app bundle. +24. **First-run model download** — guide user to download model artifacts (or auto-download from HuggingFace with progress). +25. **Developer ID signed + notarized** — passes Gatekeeper on first launch. +26. **DMG installer** — drag-to-Applications layout. +27. **Sparkle auto-update** (stretch goal) — check for updates on launch. 
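Feature 10's `.srt` export is mostly a framing exercise: number each cue and render `HH:MM:SS,mmm` timestamps. A minimal sketch — the tab-separated `start end text` input (seconds) is an assumed intermediate format for illustration, not the app's actual session schema:

```shell
#!/usr/bin/env bash
# Sketch: format "start<TAB>end<TAB>text" lines (seconds) as SRT cues.
# The input layout is illustrative only — the real exporter would read
# the Session model, not a text stream.
to_srt() {
  awk -F'\t' '
    function ts(s,    h, m, sec, ms) {
      h = int(s / 3600); m = int((s % 3600) / 60)
      sec = int(s % 60); ms = int((s - int(s)) * 1000 + 0.5)
      return sprintf("%02d:%02d:%02d,%03d", h, m, sec, ms)
    }
    { printf "%d\n%s --> %s\n%s\n\n", NR, ts($1), ts($2), $3 }
  '
}

# Example:
#   printf "0\t1.5\thello\n" | to_srt
```

The same timestamp helper would serve the `.json` export; the `.txt` path needs no formatting at all.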
+
+---
+
+## Style
+
+- Liquid Glass translucent window (macOS 26+ `glassEffect`, `.ultraThinMaterial` fallback)
+- Light, airy palette — white/transparent surfaces, subtle gray borders
+- SF Symbols for all icons (mic.fill, stop.fill, doc.on.doc, gear)
+- Monospaced or serif font option for transcript text (user preference)
+- Minimal chrome — content-first, toolbar auto-hides in fullscreen
+
+---
+
+## Entitlements & permissions
+
+| Entitlement | Required | Reason |
+|---|---|---|
+| `com.apple.security.device.audio-input` | Yes | Microphone access |
+| `com.apple.security.app-sandbox` | App Store only | Required for App Store distribution; Developer ID notarization needs Hardened Runtime, not the sandbox |
+| `com.apple.security.files.user-selected.read-write` | Yes | User picks model file location |
+| `com.apple.security.network.client` | Optional | Only if auto-downloading models from HuggingFace |
+| Hardened Runtime | Yes | Required for notarization |
+
+The app must include an `NSMicrophoneUsageDescription` string in Info.plist explaining why mic access is needed.
+
+---
+
+## Startup health check
+
+On first launch (and on every launch before enabling Record):
+
+1. **Runner binary** — verify `voxtral_realtime_runner` exists in the app bundle and is executable.
+2. **Model files** — verify `model-metal-int4.pte`, `preprocessor.pte`, and `tekken.json` exist at the configured paths.
+3. **Microphone permission** — check `AVCaptureDevice.authorizationStatus(for: .audio)`.
+
+If any check fails, show a setup guide view instead of the main UI (not a blocking modal — the user can still browse past sessions).
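Checks 1 and 2 can also be replayed from a terminal against a built bundle, which is handy when debugging a bad DMG. A minimal sketch (the `check_bundle` helper is invented here; the file list mirrors the names used by the build scripts):

```shell
#!/usr/bin/env bash
# Sketch: mirror health checks 1-2 (runner binary + model files) against
# a built .app bundle, e.g. check_bundle "/Applications/Voxtral Realtime.app"
check_bundle() {
  local res="$1/Contents/Resources" missing=0
  for f in voxtral_realtime_runner model-metal-int4.pte preprocessor.pte tekken.json; do
    if [ ! -f "${res}/${f}" ]; then
      echo "MISSING: ${f}"
      missing=1
    fi
  done
  [ -x "${res}/voxtral_realtime_runner" ] || { echo "NOT EXECUTABLE: voxtral_realtime_runner"; missing=1; }
  [ "${missing}" -eq 0 ] && echo "bundle ok"
  return "${missing}"
}
```

Check 3 (microphone permission) has no shell equivalent; it must go through `AVCaptureDevice` in the app itself.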
+ +--- + +## File structure (planned) + +``` +VoxtralRealtime/ +├── VoxtralRealtimeApp.swift # @main, scenes, menu commands +├── Models/ +│ ├── Session.swift # Session value type +│ ├── TranscriptStore.swift # @Observable shared state +│ └── Preferences.swift # User settings +├── Services/ +│ ├── AudioEngine.swift # CoreAudio capture actor +│ ├── RunnerBridge.swift # Process management actor +│ └── HealthCheck.swift # Startup validation (binary, models, mic) +├── Utilities/ +│ ├── BundleResources.swift # Path resolution for bundled runner/models +│ └── RunnerError.swift # Error types for runner failures +├── Views/ +│ ├── ContentView.swift # NavigationSplitView root +│ ├── SidebarView.swift # Session list +│ ├── TranscriptView.swift # Live transcript detail +│ ├── RecordingControls.swift # Toolbar controls +│ ├── AudioLevelView.swift # Waveform / level meter +│ ├── SetupGuideView.swift # First-run / missing model guide +│ ├── ErrorBannerView.swift # Inline error display +│ └── SettingsView.swift # Preferences window +├── Resources/ +│ ├── Assets.xcassets +│ └── Runner/ # Bundled voxtral_realtime_runner +├── VoxtralRealtime.entitlements +└── Info.plist +``` + +--- + +## Open questions + +- [x] ~~Bundle model artifacts in the .app (large ~2-4GB) or download on first run?~~ → Download from HF on first run. Pre-quantized checkpoint at `mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch`. +- [ ] Use `Process` to shell out to the runner, or link ExecuTorch C++ as a Swift package? +- [ ] Metal vs XNNPACK as default — benchmark both on M1/M2/M3 before deciding. +- [ ] Minimum macOS version: 14 (Sonoma) for `@Observable`, or 15 (Sequoia) for latest SwiftUI APIs? 
\ No newline at end of file diff --git a/apps/macos/VoxtralRealtimeApp/docs/progress.md b/apps/macos/VoxtralRealtimeApp/docs/progress.md new file mode 100644 index 00000000..9f67aed0 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/docs/progress.md @@ -0,0 +1,29 @@ +# Progress Log + +Chronological record of implementation attempts, failures, and resolutions. Newest entries at top. + +--- + + diff --git a/apps/macos/VoxtralRealtimeApp/et-voice-logo.jpeg b/apps/macos/VoxtralRealtimeApp/et-voice-logo.jpeg new file mode 100644 index 00000000..c9613062 Binary files /dev/null and b/apps/macos/VoxtralRealtimeApp/et-voice-logo.jpeg differ diff --git a/apps/macos/VoxtralRealtimeApp/project.yml b/apps/macos/VoxtralRealtimeApp/project.yml new file mode 100644 index 00000000..2586da9a --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/project.yml @@ -0,0 +1,85 @@ +name: VoxtralRealtime +options: + bundleIdPrefix: org.pytorch.executorch + deploymentTarget: + macOS: "14.0" + xcodeVersion: "16.0" + createIntermediateGroups: true + generateEmptyDirectories: true + +settings: + base: + SWIFT_VERSION: "5.10" + MACOSX_DEPLOYMENT_TARGET: "14.0" + ENABLE_HARDENED_RUNTIME: YES + +targets: + VoxtralRealtime: + type: application + platform: macOS + sources: + - path: VoxtralRealtime + excludes: + - "Resources/**" + - path: VoxtralRealtime/Resources/Assets.xcassets + settings: + base: + PRODUCT_BUNDLE_IDENTIFIER: org.pytorch.executorch.VoxtralRealtime + PRODUCT_NAME: Voxtral Realtime + INFOPLIST_FILE: VoxtralRealtime/Info.plist + CODE_SIGN_ENTITLEMENTS: VoxtralRealtime/VoxtralRealtime.entitlements + GENERATE_INFOPLIST_FILE: YES + INFOPLIST_KEY_NSMicrophoneUsageDescription: "Voxtral Realtime needs microphone access to capture audio for on-device speech transcription." 
+ COMBINE_HIDPI_IMAGES: YES + ASSETCATALOG_COMPILER_APPICON_NAME: AppIcon + ASSETCATALOG_COMPILER_GLOBAL_ACCENT_COLOR_NAME: AccentColor + entitlements: + path: VoxtralRealtime/VoxtralRealtime.entitlements + properties: + com.apple.security.cs.disable-library-validation: true + com.apple.security.device.audio-input: true + postCompileScripts: + - script: | + set -euo pipefail + + ET_PATH="${EXECUTORCH_PATH:-${HOME}/executorch}" + RUNNER_SRC="${ET_PATH}/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner" + MODEL_DIR="${MODEL_DIR:-${HOME}/voxtral_realtime_quant_metal}" + LIBOMP_SRC="/opt/homebrew/opt/libomp/lib/libomp.dylib" + DEST="${BUILT_PRODUCTS_DIR}/${CONTENTS_FOLDER_PATH}/Resources" + + mkdir -p "${DEST}" + + copy_if_newer() { + local src="$1" dst="$2" + if [ ! -f "${src}" ]; then + echo "warning: Not found: ${src}" + return + fi + if [ ! -f "${dst}" ] || [ "${src}" -nt "${dst}" ]; then + cp -fL "${src}" "${dst}" + echo "✓ Bundled $(basename "${dst}")" + else + echo "· $(basename "${dst}") up to date" + fi + } + + # Runner binary + copy_if_newer "${RUNNER_SRC}" "${DEST}/voxtral_realtime_runner" + chmod +x "${DEST}/voxtral_realtime_runner" 2>/dev/null || true + + # libomp (runner dependency) + copy_if_newer "${LIBOMP_SRC}" "${DEST}/libomp.dylib" + + # Patch runner to find libomp via @executable_path (SIP strips DYLD_LIBRARY_PATH) + if [ -f "${DEST}/voxtral_realtime_runner" ] && [ -f "${DEST}/libomp.dylib" ]; then + install_name_tool -change /opt/llvm-openmp/lib/libomp.dylib @executable_path/libomp.dylib "${DEST}/voxtral_realtime_runner" 2>/dev/null || true + echo "✓ Patched runner rpath for libomp" + fi + + # Model artifacts + copy_if_newer "${MODEL_DIR}/model-metal-int4.pte" "${DEST}/model-metal-int4.pte" + copy_if_newer "${MODEL_DIR}/preprocessor.pte" "${DEST}/preprocessor.pte" + copy_if_newer "${MODEL_DIR}/tekken.json" "${DEST}/tekken.json" + name: Bundle Runner & Model Artifacts + basedOnDependencyAnalysis: false diff --git 
a/apps/macos/VoxtralRealtimeApp/scripts/build.sh b/apps/macos/VoxtralRealtimeApp/scripts/build.sh new file mode 100755 index 00000000..d8474738 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/scripts/build.sh @@ -0,0 +1,299 @@ +#!/usr/bin/env bash +# +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. +# +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. +# +# Build a self-contained Voxtral Realtime DMG with all model files bundled. +# +# Prerequisites: +# - Create and activate the et-metal conda env with ExecuTorch + Metal backend +# - Build the voxtral_realtime_runner binary +# - Download model files from HuggingFace (or pass --download-models) +# +# Usage: +# conda activate et-metal +# ./scripts/build.sh # uses default paths +# ./scripts/build.sh --download-models # also downloads models from HuggingFace +# +# Environment variables (override defaults): +# EXECUTORCH_PATH path to executorch repo (default: ~/executorch) +# MODEL_DIR path to model artifacts (default: ~/voxtral_realtime_quant_metal) +# +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +PROJECT_DIR="$(cd "${SCRIPT_DIR}/.." && pwd)" + +export EXECUTORCH_PATH="${EXECUTORCH_PATH:-${HOME}/executorch}" +export MODEL_DIR="${MODEL_DIR:-${HOME}/voxtral_realtime_quant_metal}" +RUNNER_PATH="${EXECUTORCH_PATH}/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner" +LIBOMP_PATH="/opt/homebrew/opt/libomp/lib/libomp.dylib" +EXPECTED_CONDA_ENV="et-metal" + +BUILD_DIR="${PROJECT_DIR}/build" +SCHEME="VoxtralRealtime" +CONFIG="Release" +APP_NAME="Voxtral Realtime" +DMG_OUTPUT="${PROJECT_DIR}/VoxtralRealtime.dmg" + +DOWNLOAD_MODELS=false +for arg in "$@"; do + case "${arg}" in + --download-models) DOWNLOAD_MODELS=true ;; + -h|--help) + echo "Usage: ./scripts/build.sh [--download-models]" + echo "" + echo "Builds a self-contained DMG with all model files bundled." 
+ echo "" + echo "Before running this script, set up the et-metal conda environment:" + echo "" + echo " # One-time setup" + echo " conda create -n et-metal python=3.12 -y" + echo " conda activate et-metal" + echo " git clone https://github.com/pytorch/executorch/ ~/executorch" + echo " cd ~/executorch" + echo " EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_executorch.sh" + echo " make voxtral_realtime-metal" + echo " brew install libomp xcodegen" + echo " pip install huggingface_hub" + echo "" + echo " # Then build" + echo " conda activate et-metal" + echo " ./scripts/build.sh --download-models" + echo "" + echo "Options:" + echo " --download-models Download model artifacts from HuggingFace before building" + echo " -h, --help Show this help message" + echo "" + echo "Environment variables:" + echo " EXECUTORCH_PATH Path to executorch repo (default: ~/executorch)" + echo " MODEL_DIR Path to model artifacts (default: ~/voxtral_realtime_quant_metal)" + exit 0 + ;; + *) echo "Unknown argument: ${arg}. Use --help for usage." >&2; exit 1 ;; + esac +done + +# --------------------------------------------------------------------------- +echo "" +echo "=== Voxtral Realtime — Build Pipeline ===" +echo "" + +# --------------------------------------------------------------------------- +# Step 0: Check conda environment +# --------------------------------------------------------------------------- +echo "--- Step 0: Checking conda environment ---" + +if [[ -z "${CONDA_DEFAULT_ENV:-}" || "${CONDA_DEFAULT_ENV}" == "base" ]]; then + echo "" + echo "ERROR: The et-metal conda environment is not active." >&2 + echo "" >&2 + echo "This build requires a dedicated conda env with ExecuTorch installed" >&2 + echo "(Metal/MPS backend). All steps below must run inside this env." >&2 + echo "" >&2 + + if [[ -z "${CONDA_DEFAULT_ENV:-}" ]]; then + echo "No conda environment is active." 
>&2
+  else
+    echo "You are in the 'base' env — ExecuTorch should not be installed in base." >&2
+  fi
+
+  echo "" >&2
+  echo "=== One-time setup ===" >&2
+  echo "" >&2
+  echo "  # 1. Create the et-metal conda environment" >&2
+  echo "  conda create -n et-metal python=3.12 -y" >&2
+  echo "  conda activate et-metal" >&2
+  echo "" >&2
+  echo "  # 2. Clone and install ExecuTorch with Metal (MPS) backend" >&2
+  echo "  git clone https://github.com/pytorch/executorch/ ~/executorch" >&2
+  echo "  cd ~/executorch" >&2
+  echo "  EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_executorch.sh" >&2
+  echo "" >&2
+  echo "  # 3. Build the voxtral realtime runner" >&2
+  echo "  make voxtral_realtime-metal" >&2
+  echo "" >&2
+  echo "  # 4. Install tools" >&2
+  echo "  brew install libomp xcodegen" >&2
+  echo "  pip install huggingface_hub" >&2
+  echo "" >&2
+  echo "=== Then build ===" >&2
+  echo "" >&2
+  echo "  conda activate et-metal" >&2
+  echo "  cd $(pwd)" >&2
+  echo "  ./scripts/build.sh --download-models" >&2
+  exit 1
+fi
+
+if [[ "${CONDA_DEFAULT_ENV}" != "${EXPECTED_CONDA_ENV}" ]]; then
+  echo "WARNING: Active conda env is '${CONDA_DEFAULT_ENV}', expected '${EXPECTED_CONDA_ENV}'." >&2
+  echo "         Continuing, but make sure ExecuTorch with Metal backend is installed in this env." >&2
+  echo ""
+fi
+
+echo "✓ Conda environment active: ${CONDA_DEFAULT_ENV}"
+echo ""
+
+# ---------------------------------------------------------------------------
+# Step 1: Check prerequisites
+# ---------------------------------------------------------------------------
+echo "--- Step 1: Checking prerequisites ---"
+
+ERRORS=()
+
+# Build tools
+if ! command -v xcodegen &>/dev/null; then
+  ERRORS+=("xcodegen not found — install with: brew install xcodegen")
+fi
+
+if ! command -v xcodebuild &>/dev/null; then
+  ERRORS+=("xcodebuild not found — install Xcode from the App Store")
+fi
+
+# ExecuTorch repo
+if [[ !
-d "${EXECUTORCH_PATH}" ]]; then + ERRORS+=("ExecuTorch repo not found at: ${EXECUTORCH_PATH}") + ERRORS+=(" Clone it: git clone https://github.com/pytorch/executorch/ ${EXECUTORCH_PATH}") + ERRORS+=(" Or set: EXECUTORCH_PATH=/your/path ./scripts/build.sh") +fi + +# Runner binary (built from ExecuTorch with Metal backend) +if [[ ! -f "${RUNNER_PATH}" ]]; then + ERRORS+=("Runner binary not found at: ${RUNNER_PATH}") + ERRORS+=(" Build it (inside conda env):") + ERRORS+=(" conda activate et-metal") + ERRORS+=(" cd ${EXECUTORCH_PATH}") + ERRORS+=(" EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_executorch.sh") + ERRORS+=(" make voxtral_realtime-metal") +fi + +# libomp (runner runtime dependency) +if [[ ! -f "${LIBOMP_PATH}" ]]; then + ERRORS+=("libomp not found at: ${LIBOMP_PATH}") + ERRORS+=(" Install: brew install libomp") +fi + +# Download models if requested +if [[ "${DOWNLOAD_MODELS}" == true ]]; then + echo "Downloading models from HuggingFace (~6.2 GB)..." + if ! command -v hf &>/dev/null && ! command -v huggingface-cli &>/dev/null; then + ERRORS+=("Neither 'hf' nor 'huggingface-cli' found — install with: pip install huggingface_hub") + else + HF_CMD="hf" + if ! command -v hf &>/dev/null; then + HF_CMD="huggingface-cli" + fi + ${HF_CMD} download mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch --local-dir "${MODEL_DIR}" + echo "✓ Models downloaded to ${MODEL_DIR}" + fi +fi + +# Model files +MODEL_FILES=("model-metal-int4.pte" "preprocessor.pte" "tekken.json") +MISSING_MODELS=() +for f in "${MODEL_FILES[@]}"; do + if [[ ! 
-f "${MODEL_DIR}/${f}" ]]; then + MISSING_MODELS+=("${f}") + fi +done + +if [[ ${#MISSING_MODELS[@]} -gt 0 ]]; then + ERRORS+=("Model files missing from ${MODEL_DIR}:") + for f in "${MISSING_MODELS[@]}"; do + ERRORS+=(" - ${f}") + done + ERRORS+=(" Download: hf download mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch --local-dir ${MODEL_DIR}") + ERRORS+=(" Or run: ./scripts/build.sh --download-models") +fi + +# Report errors +if [[ ${#ERRORS[@]} -gt 0 ]]; then + echo "" + echo "ERROR: Missing prerequisites:" >&2 + for e in "${ERRORS[@]}"; do + echo " ✗ ${e}" >&2 + done + exit 1 +fi + +echo "✓ xcodegen: $(which xcodegen)" +echo "✓ xcodebuild: $(which xcodebuild)" +echo "✓ ExecuTorch: ${EXECUTORCH_PATH}" +echo "✓ Runner binary: ${RUNNER_PATH}" +echo "✓ libomp: ${LIBOMP_PATH}" +echo "✓ Model files: ${MODEL_DIR}" +echo "" + +# --------------------------------------------------------------------------- +# Step 2: Generate Xcode project +# --------------------------------------------------------------------------- +echo "--- Step 2: Generating Xcode project ---" +cd "${PROJECT_DIR}" +xcodegen generate +echo "✓ VoxtralRealtime.xcodeproj generated" +echo "" + +# --------------------------------------------------------------------------- +# Step 3: Build the app (Release) +# --------------------------------------------------------------------------- +echo "--- Step 3: Building ${SCHEME} (${CONFIG}) ---" + +BUILD_LOG="${BUILD_DIR}/build.log" +mkdir -p "${BUILD_DIR}" + +set +e +xcodebuild \ + -project VoxtralRealtime.xcodeproj \ + -scheme "${SCHEME}" \ + -configuration "${CONFIG}" \ + -derivedDataPath "${BUILD_DIR}" \ + build \ + > "${BUILD_LOG}" 2>&1 +BUILD_EXIT=$? +set -e + +if [[ ${BUILD_EXIT} -ne 0 ]]; then + echo "" + echo "ERROR: xcodebuild failed (exit code ${BUILD_EXIT})." 
>&2 + echo "Last 20 lines of build log:" >&2 + echo "" >&2 + tail -20 "${BUILD_LOG}" >&2 + echo "" >&2 + echo "Full log: ${BUILD_LOG}" >&2 + exit 1 +fi + +tail -3 "${BUILD_LOG}" + +APP_PATH="${BUILD_DIR}/Build/Products/${CONFIG}/${APP_NAME}.app" +if [[ ! -d "${APP_PATH}" ]]; then + echo "ERROR: Build succeeded but app not found at: ${APP_PATH}" >&2 + echo "Full log: ${BUILD_LOG}" >&2 + exit 1 +fi +echo "✓ App built: ${APP_PATH}" +echo "" + +# --------------------------------------------------------------------------- +# Step 4: Create DMG +# --------------------------------------------------------------------------- +echo "--- Step 4: Creating DMG ---" +rm -f "${DMG_OUTPUT}" +"${SCRIPT_DIR}/create_dmg.sh" "${APP_PATH}" "${DMG_OUTPUT}" +echo "" + +# --------------------------------------------------------------------------- +# Summary +# --------------------------------------------------------------------------- +echo "=== Build Complete ===" +echo "" +echo " DMG: ${DMG_OUTPUT}" +echo " App: ${APP_PATH}" +echo " Conda env: ${CONDA_DEFAULT_ENV}" +echo " Build log: ${BUILD_LOG}" +echo "" +echo "To distribute: upload the DMG to GitHub Releases." +echo "" diff --git a/apps/macos/VoxtralRealtimeApp/scripts/create_dmg.sh b/apps/macos/VoxtralRealtimeApp/scripts/create_dmg.sh new file mode 100755 index 00000000..1c6e12f6 --- /dev/null +++ b/apps/macos/VoxtralRealtimeApp/scripts/create_dmg.sh @@ -0,0 +1,100 @@ +#!/usr/bin/env bash +# +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. +# +# This source code is licensed under the BSD-style license found in the +# LICENSE file in the root directory of this source tree. +# +set -euo pipefail + +APP_PATH="${1:-}" +DMG_PATH="${2:-}" +VOLUME_NAME="${3:-Voxtral Realtime}" + +if [[ -z "${APP_PATH}" || -z "${DMG_PATH}" ]]; then + echo "Usage: $(basename "$0") /path/to/App.app /path/to/output.dmg [Volume Name]" >&2 + exit 1 +fi + +if [[ ! 
-d "${APP_PATH}" ]]; then + echo "Error: App not found: ${APP_PATH}" >&2 + exit 1 +fi + +# --------------------------------------------------------------------------- +# Validate bundled resources — refuse to create a DMG with missing files +# --------------------------------------------------------------------------- +RESOURCES="${APP_PATH}/Contents/Resources" +REQUIRED_FILES=( + "voxtral_realtime_runner" + "libomp.dylib" + "model-metal-int4.pte" + "preprocessor.pte" + "tekken.json" +) + +MISSING=() +for f in "${REQUIRED_FILES[@]}"; do + if [[ ! -f "${RESOURCES}/${f}" ]]; then + MISSING+=("${f}") + fi +done + +if [[ ${#MISSING[@]} -gt 0 ]]; then + echo "Error: The following required files are missing from ${RESOURCES}:" >&2 + for f in "${MISSING[@]}"; do + echo " - ${f}" >&2 + done + echo "" >&2 + echo "The DMG must be self-contained. Make sure you:" >&2 + echo " 1. Downloaded model files: hf download mistralai/Voxtral-Mini-4B-Realtime-2602-Executorch --local-dir ~/voxtral_realtime_quant_metal" >&2 + echo " 2. Built the runner: conda activate et-metal && cd ~/executorch && make voxtral_realtime-metal" >&2 + echo " 3. Installed libomp: brew install libomp" >&2 + echo " 4. 
Built in Release mode: xcodebuild -scheme VoxtralRealtime -configuration Release" >&2
+  exit 1
+fi
+
+echo "✓ All required files present in app bundle"
+
+# ---------------------------------------------------------------------------
+# Create DMG with drag-to-Applications layout
+# ---------------------------------------------------------------------------
+APP_NAME="$(basename "${APP_PATH}")"
+WORK_DIR="$(mktemp -d)"
+STAGING_DIR="${WORK_DIR}/staging"
+DMG_RW="${WORK_DIR}/tmp.dmg"
+
+mkdir -p "${STAGING_DIR}"
+cp -R "${APP_PATH}" "${STAGING_DIR}/"
+ln -s /Applications "${STAGING_DIR}/Applications"
+
+hdiutil create -volname "${VOLUME_NAME}" -srcfolder "${STAGING_DIR}" -ov -format UDRW "${DMG_RW}" >/dev/null
+
+DEVICE="$(hdiutil attach -readwrite -noverify -noautoopen "${DMG_RW}" | awk 'NR==1{print $1}')"
+
+# Lay out the Finder window via AppleScript; tolerate failure (e.g. headless CI).
+osascript <<EOF >/dev/null 2>&1 && echo "✓ DMG window layout configured" || echo "· Skipped DMG window layout (Finder not available in this context)"
+tell application "Finder"
+  tell disk "${VOLUME_NAME}"
+    open
+    set current view of container window to icon view
+    set toolbar visible of container window to false
+    set statusbar visible of container window to false
+    set the bounds of container window to {100, 100, 700, 420}
+    set icon size of icon view options of container window to 128
+    set arrangement of icon view options of container window to not arranged
+    set position of item "${APP_NAME}" of container window to {150, 200}
+    set position of item "Applications" of container window to {500, 200}
+    update without registering applications
+    delay 1
+    close
+  end tell
+end tell
+EOF
+
+hdiutil detach "${DEVICE}" >/dev/null 2>&1 || true
+hdiutil convert "${DMG_RW}" -format UDZO -o "${DMG_PATH}" >/dev/null
+rm -rf "${WORK_DIR}"
+
+DMG_SIZE=$(du -sh "${DMG_PATH}" | cut -f1)
+echo "✓ Created ${DMG_PATH} (${DMG_SIZE})"