AI speech models for Apple Silicon, powered by MLX Swift and CoreML.
📖 Read in: English · 中文 · 日本語 · 한국어 · Español · Deutsch · Français · हिन्दी · Português · Русский
On-device speech recognition, synthesis, and understanding for Mac and iOS. Runs locally on Apple Silicon — no cloud, no API keys, no data leaves your device.
📚 Full Documentation → · 🤗 HuggingFace Models · 📝 Blog
- Qwen3-ASR — Speech-to-text (automatic speech recognition, 52 languages, MLX + CoreML)
- Parakeet TDT — Speech-to-text via CoreML (Neural Engine, NVIDIA FastConformer + TDT decoder, 25 languages)
- Omnilingual ASR — Speech-to-text (Meta wav2vec2 + CTC, 1,672 languages across 32 scripts, CoreML 300M + MLX 300M/1B/3B/7B)
- Streaming Dictation — Real-time dictation with partials and end-of-utterance detection (Parakeet-EOU-120M)
- Nemotron Streaming — Low-latency streaming ASR with native punctuation and capitalization (NVIDIA Nemotron-Speech-Streaming-0.6B, CoreML, English)
- Qwen3-ForcedAligner — Word-level timestamp alignment (audio + text → timestamps)
- Qwen3-TTS — Text-to-speech (highest quality, streaming, custom speakers, 10 languages)
- CosyVoice TTS — Streaming TTS with voice cloning, multi-speaker dialogue, emotion tags (9 languages)
- Kokoro TTS — On-device TTS (82M, CoreML/Neural Engine, 54 voices, iOS-ready, 10 languages)
- Qwen3.5-Chat — On-device LLM chat (0.8B, MLX INT4 + CoreML INT8, DeltaNet hybrid, streaming tokens)
- PersonaPlex — Full-duplex speech-to-speech (7B, audio in → audio out, 18 voice presets)
- DeepFilterNet3 — Real-time noise suppression (2.1M params, 48 kHz)
- Source Separation — Music source separation via Open-Unmix (UMX-HQ / UMX-L, 4 stems: vocals/drums/bass/other, 44.1 kHz stereo)
- Wake-word — On-device keyword spotting (KWS Zipformer 3M, CoreML, 26× real-time, configurable keyword list)
- VAD — Voice activity detection (Silero streaming, Pyannote offline, FireRedVAD 100+ languages)
- Speaker Diarization — Who spoke when (Pyannote pipeline, Sortformer end-to-end on Neural Engine)
- Speaker Embeddings — WeSpeaker ResNet34 (256-dim), CAM++ (192-dim)
Papers: Qwen3-ASR (Alibaba) · Qwen3-TTS (Alibaba) · Omnilingual ASR (Meta) · Parakeet TDT (NVIDIA) · CosyVoice 3 (Alibaba) · Kokoro (StyleTTS 2) · PersonaPlex (NVIDIA) · Mimi (Kyutai) · Sortformer (NVIDIA)
- 19 Apr 2026 — MLX vs CoreML on Apple Silicon — A Practical Guide to Picking the Right Backend
- 20 Mar 2026 — We Beat Whisper Large v3 with a 600M Model Running Entirely on Your Mac
- 26 Feb 2026 — Speaker Diarization and Voice Activity Detection on Apple Silicon — Native Swift with MLX
- 23 Feb 2026 — NVIDIA PersonaPlex 7B on Apple Silicon — Full-Duplex Speech-to-Speech in Native Swift with MLX
- 12 Feb 2026 — Qwen3-ASR Swift: On-Device ASR + TTS for Apple Silicon — Architecture and Benchmarks
Add the package to your `Package.swift`:

```swift
.package(url: "https://github.com/soniqo/speech-swift", branch: "main")
```

Import only the modules you need — every model is its own SPM library, so you don't pay for what you don't use:

```swift
.product(name: "ParakeetStreamingASR", package: "speech-swift"),
.product(name: "SpeechUI", package: "speech-swift"), // optional SwiftUI views
```

Transcribe an audio buffer in 3 lines:

```swift
import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```

Live streaming with partials:

```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    print(partial.isFinal ? "FINAL: \(partial.text)" : "... \(partial.text)")
}
```

SwiftUI dictation view in ~10 lines:
```swift
import SwiftUI
import ParakeetStreamingASR
import SpeechUI

@MainActor
struct DictateView: View {
    @State private var store = TranscriptionStore()

    var body: some View {
        TranscriptionView(finals: store.finalLines, currentPartial: store.currentPartial)
            .task {
                let model = try? await ParakeetStreamingASRModel.fromPretrained()
                guard let model else { return }
                for await p in model.transcribeStream(audio: samples, sampleRate: 16000) {
                    store.apply(text: p.text, isFinal: p.isFinal)
                }
            }
    }
}
```

SpeechUI ships only `TranscriptionView` (finals + partials) and `TranscriptionStore` (streaming ASR adapter). Use AVFoundation for audio visualization and playback.
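If all you need next to the transcript is a simple level meter, you can compute one from the raw sample buffers without AVFoundation. A minimal, package-free sketch (the 440 Hz test tone is only for illustration):

```swift
import Foundation

// Root-mean-square level of a buffer of mono Float samples: a cheap
// signal for driving a level meter alongside the transcript.
func rmsLevel(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

// A full-scale 440 Hz sine at 16 kHz has an RMS of 1/sqrt(2) ≈ 0.707.
let tone = (0..<1600).map { Float(sin(Double($0) * 2 * .pi * 440 / 16000)) }
print(rmsLevel([Float](repeating: 0, count: 1600))) // 0.0
print(rmsLevel(tone))                               // ≈ 0.707
```

Feed it the same buffers you push into the recognizer and map the value to a bar height or dBFS scale.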
Available SPM products: Qwen3ASR, Qwen3TTS, Qwen3TTSCoreML, ParakeetASR, ParakeetStreamingASR, NemotronStreamingASR, OmnilingualASR, KokoroTTS, CosyVoiceTTS, PersonaPlex, SpeechVAD, SpeechEnhancement, SourceSeparation, Qwen3Chat, SpeechCore, SpeechUI, AudioCommon.
Compact view below. Full model catalogue with sizes, quantisations, download URLs, and memory tables → soniqo.audio/architecture.
| Model | Task | Backends | Sizes | Languages |
|---|---|---|---|---|
| Qwen3-ASR | Speech → Text | MLX, CoreML (hybrid) | 0.6B, 1.7B | 52 |
| Parakeet TDT | Speech → Text | CoreML (ANE) | 0.6B | 25 European |
| Parakeet EOU | Speech → Text (streaming) | CoreML (ANE) | 120M | 25 European |
| Nemotron Streaming | Speech → Text (streaming, punctuated) | CoreML (ANE) | 0.6B | EN |
| Omnilingual ASR | Speech → Text | CoreML (ANE), MLX | 300M / 1B / 3B / 7B | 1,672 |
| Qwen3-ForcedAligner | Audio + Text → Timestamps | MLX, CoreML | 0.6B | Multi |
| Qwen3-TTS | Text → Speech | MLX, CoreML | 0.6B, 1.7B | 10 |
| CosyVoice3 | Text → Speech | MLX | 0.5B | 9 |
| Kokoro-82M | Text → Speech | CoreML (ANE) | 82M | 10 |
| Qwen3.5-Chat | Text → Text (LLM) | MLX, CoreML | 0.8B | Multi |
| PersonaPlex | Speech → Speech | MLX | 7B | EN |
| Silero VAD | Voice Activity Detection | MLX, CoreML | 309K | Agnostic |
| Pyannote | VAD + Diarization | MLX | 1.5M | Agnostic |
| Sortformer | Diarization (E2E) | CoreML (ANE) | — | Agnostic |
| DeepFilterNet3 | Speech Enhancement | CoreML | 2.1M | Agnostic |
| Open-Unmix | Source Separation | MLX | 8.6M | Agnostic |
| WeSpeaker | Speaker Embedding | MLX, CoreML | 6.6M | Agnostic |
Requires native ARM Homebrew (/opt/homebrew). Rosetta/x86_64 Homebrew is not supported.
```shell
brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech
```

Then:
```shell
audio transcribe recording.wav
audio speak "Hello world"
audio respond --input question.wav --transcript
audio-server --port 8080   # local HTTP / WebSocket server (OpenAI-compatible /v1/realtime)
```

Or add the package via Swift Package Manager:

```swift
dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", branch: "main")
]
```

Import only what you need — every model is its own SPM target:
```swift
import Qwen3ASR             // Speech recognition (MLX)
import ParakeetASR          // Speech recognition (CoreML, batch)
import ParakeetStreamingASR // Streaming dictation with partials + EOU
import NemotronStreamingASR // English streaming ASR with native punctuation (0.6B)
import OmnilingualASR       // 1,672 languages (CoreML + MLX)
import Qwen3TTS             // Text-to-speech
import CosyVoiceTTS         // Text-to-speech with voice cloning
import KokoroTTS            // Text-to-speech (iOS-ready)
import Qwen3Chat            // On-device LLM chat
import PersonaPlex          // Full-duplex speech-to-speech
import SpeechVAD            // VAD + speaker diarization + embeddings
import SpeechEnhancement    // Noise suppression
import SourceSeparation     // Music source separation (Open-Unmix, 4 stems)
import SpeechUI             // SwiftUI components for streaming transcripts
import AudioCommon          // Shared protocols and utilities
```

- Swift 6+, Xcode 16+ (with Metal Toolchain)
- macOS 15+ (Sequoia) or iOS 18+, Apple Silicon (M1/M2/M3/M4)
The macOS 15 / iOS 18 minimum comes from `MLState` — Apple's persistent ANE state API used by the CoreML pipelines (Qwen3-ASR, Qwen3-Chat, Qwen3-TTS) to keep KV caches resident on the Neural Engine across token steps.
```shell
git clone https://github.com/soniqo/speech-swift
cd speech-swift
make build
```

`make build` compiles the Swift package and the MLX Metal shader library. The Metal library is required for GPU inference — without it you'll see `Failed to load the default metallib` at runtime. Use `make debug` for debug builds and `make test` for the test suite.
Full build and install guide →
- DictateDemo (docs) — macOS menu-bar streaming dictation with live partials, VAD-driven end-of-utterance detection, and one-click copy. Runs as a background agent (Parakeet-EOU-120M + Silero VAD).
- iOSEchoDemo — iOS echo demo (Parakeet ASR + Kokoro TTS). Device and simulator.
- PersonaPlexDemo — Conversational voice assistant with mic input, VAD, and multi-turn context. macOS. RTF ~0.94 on M2 Max (faster than real-time).
- SpeechDemo — Dictation and TTS synthesis in a tabbed interface. macOS.
Each demo's README has build instructions.
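The RTF figure quoted for PersonaPlexDemo is the standard real-time factor: compute time divided by audio duration, where values under 1.0 mean the pipeline keeps up with live audio. As a quick arithmetic sketch (the 56.4 s figure is an illustrative example, not a benchmark):

```swift
// Real-time factor: seconds of compute per second of audio.
// RTF < 1.0 means the model runs faster than real time.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    processingSeconds / audioSeconds
}

// At RTF 0.94, a 60 s utterance takes about 56.4 s to process.
let rtf = realTimeFactor(processingSeconds: 56.4, audioSeconds: 60)
print(rtf) // 0.94
```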
The snippets below show the minimal path for each domain. Every section links to a full guide on soniqo.audio with configuration options, multiple backends, streaming patterns, and CLI recipes.
Speech-to-Text — full guide →
```swift
import Qwen3ASR

let model = try await Qwen3ASRModel.fromPretrained()
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)
```

Alternative backends: Parakeet TDT (CoreML, 32× realtime), Omnilingual ASR (1,672 languages, CoreML or MLX), Streaming dictation (live partials).
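All of the recognizers here take mono Float32 samples at a stated sample rate. If your capture path hands you 16-bit PCM instead, the conversion is a few lines of plain Swift (no package APIs assumed; the stereo test frame is illustrative):

```swift
// Convert interleaved 16-bit PCM to normalized mono Float32 in [-1, 1].
func floatSamples(fromPCM16 pcm: [Int16], channels: Int = 1) -> [Float] {
    let mono: [Int16]
    if channels == 1 {
        mono = pcm
    } else {
        // Average interleaved channels down to mono, frame by frame.
        mono = stride(from: 0, to: pcm.count, by: channels).map { i in
            let frame = pcm[i..<min(i + channels, pcm.count)]
            let sum = frame.reduce(0) { $0 + Int($1) }
            return Int16(sum / frame.count)
        }
    }
    return mono.map { Float($0) / Float(Int16.max) }
}

let stereo: [Int16] = [16383, 16385, -16384, -16384]
print(floatSamples(fromPCM16: stereo, channels: 2)) // two mono samples, ≈ [0.5, -0.5]
```

Remember that sample rate matters too: the models above expect the rate stated in each snippet (16 kHz here), so resample before converting if your source differs.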
Forced Alignment — full guide →
```swift
import Qwen3ASR

let aligner = try await Qwen3ForcedAligner.fromPretrained()
let aligned = aligner.align(
    audio: audioSamples,
    text: "Can you guarantee that the replacement part will be shipped tomorrow?",
    sampleRate: 24000
)
for word in aligned {
    print("[\(word.startTime)s - \(word.endTime)s] \(word.text)")
}
```

Text-to-Speech — full guide →
```swift
import Qwen3TTS
import AudioCommon

let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello world", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```

Alternative TTS engines: CosyVoice3 (streaming + voice cloning + emotion tags), Kokoro-82M (iOS-ready, 54 voices), Voice cloning.
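The `WAVWriter` above comes from AudioCommon. If you want to avoid that dependency, a 16-bit PCM mono WAV container is small enough to assemble by hand; a minimal sketch (not the AudioCommon implementation, just the standard 44-byte RIFF header):

```swift
import Foundation

// Build a minimal 16-bit PCM mono WAV file from Float samples in [-1, 1].
func wavData(samples: [Float], sampleRate: Int) -> Data {
    var data = Data()
    func append32(_ v: UInt32) { withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) } }
    func append16(_ v: UInt16) { withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) } }
    let pcm = samples.map { Int16(max(-1, min(1, $0)) * Float(Int16.max)) }
    let dataSize = UInt32(pcm.count * 2)
    data.append(contentsOf: Array("RIFF".utf8))
    append32(36 + dataSize)             // remaining file size
    data.append(contentsOf: Array("WAVEfmt ".utf8))
    append32(16)                        // fmt chunk size
    append16(1)                         // audio format: PCM
    append16(1)                         // channels: mono
    append32(UInt32(sampleRate))
    append32(UInt32(sampleRate * 2))    // byte rate: sampleRate × blockAlign
    append16(2)                         // block align: 16-bit mono
    append16(16)                        // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append32(dataSize)
    for s in pcm { withUnsafeBytes(of: s.littleEndian) { data.append(contentsOf: $0) } }
    return data
}

let wav = wavData(samples: [0, 0.5, -0.5], sampleRate: 24000)
print(wav.count) // 44-byte header + 6 bytes of samples = 50
```

Write the returned `Data` to disk with `try wav.write(to: outputURL)`.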
Speech-to-Speech — full guide →
```swift
import PersonaPlex

let model = try await PersonaPlexModel.fromPretrained()
let responseAudio = model.respond(userAudio: userSamples)
// 24 kHz mono Float32 output ready for playback
```

LLM Chat — full guide →
```swift
import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()
chat.chat(messages: [(.user, "Explain MLX in one sentence")]) { token, isFinal in
    print(token, terminator: "")
}
```

Voice Activity Detection — full guide →
```swift
import SpeechVAD

let vad = try await SileroVADModel.fromPretrained()
let segments = vad.detectSpeech(audio: samples, sampleRate: 16000)
for s in segments { print("\(s.startTime)s → \(s.endTime)s") }
```

Speaker Diarization — full guide →
```swift
import SpeechVAD

let diarizer = try await DiarizationPipeline.fromPretrained()
let segments = diarizer.diarize(audio: samples, sampleRate: 16000)
for s in segments { print("Speaker \(s.speakerId): \(s.startTime)s - \(s.endTime)s") }
```

Speech Enhancement — full guide →
```swift
import SpeechEnhancement

let denoiser = try await DeepFilterNet3Model.fromPretrained()
let clean = try denoiser.enhance(audio: noisySamples, sampleRate: 48000)
```

Voice Pipeline (ASR → LLM → TTS) — full guide →
```swift
import SpeechCore

let pipeline = VoicePipeline(
    stt: parakeetASR,
    tts: qwen3TTS,
    vad: sileroVAD,
    config: .init(mode: .voicePipeline),
    onEvent: { event in print(event) }
)
pipeline.start()
pipeline.pushAudio(micSamples)
```

VoicePipeline is the real-time voice-agent state machine (powered by speech-core) with VAD-driven turn detection, interruption handling, and eager STT. It connects any `SpeechRecognitionModel` + `SpeechGenerationModel` + `StreamingVADProvider`.
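For intuition, the VAD-driven turn detection inside such a pipeline reduces to a small state machine: speech opens a user turn, sustained silence ends it, and speech during playback is a barge-in. An illustrative sketch in plain Swift (not the speech-core implementation; the frame size and silence threshold are made-up example values):

```swift
// Illustrative turn-taking logic only: not the speech-core implementation.
enum TurnState { case listening, userSpeaking, responding }

struct TurnDetector {
    var state: TurnState = .listening
    var silentFrames = 0
    let endOfUtteranceFrames = 25   // e.g. 25 frames of 32 ms ≈ 800 ms of silence

    // Feed one VAD decision per audio frame; returns true when the user's turn ends.
    mutating func push(isSpeech: Bool) -> Bool {
        switch state {
        case .listening where isSpeech:
            state = .userSpeaking
            silentFrames = 0
        case .userSpeaking:
            silentFrames = isSpeech ? 0 : silentFrames + 1
            if silentFrames >= endOfUtteranceFrames {
                state = .responding   // hand the turn to LLM + TTS
                return true
            }
        case .responding where isSpeech:
            state = .userSpeaking     // barge-in: the user interrupted playback
            silentFrames = 0
        default:
            break
        }
        return false
    }
}

var detector = TurnDetector()
var turnEnded = false
for isSpeech in [true, true, true] + Array(repeating: false, count: 25) {
    if detector.push(isSpeech: isSpeech) { turnEnded = true }
}
print(turnEnded) // true
```

The real pipeline layers eager STT and event callbacks on top of this loop, but the turn boundaries come from the same speech/silence bookkeeping.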
```shell
audio-server --port 8080
```

Exposes every model via HTTP REST + WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at /v1/realtime. See Sources/AudioServer/.
speech-swift is split into one SPM target per model so consumers only pay for what they import. Shared infrastructure lives in AudioCommon (protocols, audio I/O, HuggingFace downloader, SentencePieceModel) and MLXCommon (weight loading, QuantizedLinear helpers, SDPA multi-head attention helper).
Full architecture diagram with backends, memory tables, and module map → soniqo.audio/architecture · API reference → soniqo.audio/api · Benchmarks → soniqo.audio/benchmarks
Local docs (repo):
- Models: Qwen3-ASR · Qwen3-TTS · CosyVoice · Kokoro · Parakeet TDT · Parakeet Streaming · Nemotron Streaming · Omnilingual ASR · PersonaPlex · FireRedVAD · Source Separation
- Inference: Qwen3-ASR · Parakeet TDT · Parakeet Streaming · Nemotron Streaming · Omnilingual ASR · TTS · Forced Aligner · Silero VAD · Speaker Diarization · Speech Enhancement
- Reference: Shared Protocols
Model weights download from HuggingFace on first use and cache to `~/Library/Caches/qwen3-speech/`. Override with `QWEN3_CACHE_DIR` (CLI) or `cacheDir:` (Swift API). All `fromPretrained()` entry points also accept `offlineMode: true` to skip network when weights are already cached.
See docs/inference/cache-and-offline.md for full details including sandboxed iOS container paths.
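For offline or CI machines, one pattern is to warm the cache once while online and point later runs at the same directory. A sketch using only the documented `QWEN3_CACHE_DIR` variable and `audio` CLI (the path is an example; the layout inside the cache is managed by the library):

```shell
# Warm the cache once while online...
export QWEN3_CACHE_DIR="$HOME/models/speech-swift"
audio transcribe warmup.wav

# ...then later runs read the same cache with no re-download.
audio transcribe recording.wav
```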
If you see `Failed to load the default metallib` at runtime, the Metal shader library is missing. Run `make build` or `./scripts/build_mlx_metallib.sh release` after a manual `swift build`. If the Metal Toolchain is missing, install it first:

```shell
xcodebuild -downloadComponent MetalToolchain
```

To run the test suite:

```shell
make test                           # full suite (unit + E2E with model downloads)
swift test --skip E2E               # unit only (CI-safe, no downloads)
swift test --filter Qwen3ASRTests   # specific module
```

E2E test classes use the E2E prefix so CI can filter them out with `--skip E2E`. See CLAUDE.md for the full testing convention.
PRs welcome — bug fixes, new model integrations, documentation. Fork, create a feature branch, run `make build && make test`, and open a PR against main.
Apache 2.0