Know who said what, when, and how they said it — with streaming, topic structure, and stable contracts. Local-first, from any audio file, on your machine.
Transcription gives you words. slower-whisper gives you the whole conversation: speakers, timing, tone, emotion, and topic structure — as stable, versioned JSON you can build on. Local-first. No cloud, no API keys, no per-minute billing.
```python
from transcription import TranscriptionConfig, transcribe_file

cfg = TranscriptionConfig(model="base", device="auto", language="en")
transcript = transcribe_file("meeting.wav", root=".", config=cfg)

print(transcript.full_text)
print(transcript.segments[0].speaker)      # "spk_0"
print(transcript.segments[0].audio_state)  # prosody, emotion, timing
```

Install with uv or pip:

```shell
uv add slower-whisper
# or: pip install slower-whisper
```

Add what you need:
| Extra | What it adds |
|---|---|
| `enrich-basic` | DSP stack (numpy, librosa, soundfile) |
| `enrich-prosody` | Praat-based prosody (praat-parselmouth) |
| `emotion` | Emotion models (torch, torchaudio, transformers) |
| `diarization` | Speaker diarization (pyannote.audio) |
| `full` | Everything above |
| `api` | FastAPI service runtime |
| `integrations` | LangChain + LlamaIndex adapters |
```shell
uv sync                          # transcription only
uv sync --extra full             # all enrichment
uv sync --extra full --extra dev # contributor toolchain
```

| Package | What it is |
|---|---|
| `transcription` | Core Python API — batch, file, streaming, enrichment, post-processing |
| `slower_whisper` | faster-whisper drop-in replacement (WhisperModel, Segment, Word) |
| `slower-whisper` | CLI — transcribe, enrich, benchmark, export, validate |
Change one import:

```python
# Before
from faster_whisper import WhisperModel
# After
from slower_whisper import WhisperModel

model = WhisperModel("base", device="auto")
segments, info = model.transcribe("audio.wav", word_timestamps=True)

# slower-whisper extension — full transcript with enrichment
transcript = model.last_transcript
```

Full option mapping: docs/FASTER_WHISPER_MIGRATION.md
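With `word_timestamps=True`, each segment carries per-word timing via its `words` attribute, matching the shape of faster-whisper's `Segment` and `Word`. A minimal sketch of flattening those into `(word, start, end)` rows, using namedtuples as stand-ins for the real objects:

```python
from collections import namedtuple

# Stand-ins for faster_whisper's Word and Segment; the real objects
# expose the same attribute names used here.
Word = namedtuple("Word", ["start", "end", "word"])
Segment = namedtuple("Segment", ["start", "end", "text", "words"])

segments = [
    Segment(0.0, 1.2, " Hello there",
            [Word(0.0, 0.5, " Hello"), Word(0.5, 1.2, " there")]),
]

# Flatten nested per-word timing into simple rows.
rows = [(w.word.strip(), w.start, w.end) for seg in segments for w in seg.words]
print(rows)  # [('Hello', 0.0, 0.5), ('there', 0.5, 1.2)]
```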
```shell
# Transcribe (reads raw_audio/, writes whisper_json/)
uv run slower-whisper transcribe --root .

# Enrich with prosody + emotion
uv run slower-whisper enrich --root .

# Benchmark against baselines
uv run slower-whisper benchmark run --track asr --dataset smoke
uv run slower-whisper benchmark compare --track asr --dataset smoke --gate
```

- Local-first — all processing on your hardware. No accounts, no rate limits, no per-minute billing.
- Beyond words — speaker diarization, prosody, emotion, semantic annotation as opt-in layers.
- Stable contracts — JSON schema v2 with `schema_version`, `receipt`, and `run_id`. Same input + config = same output.
- Streaming — WebSocket/SSE with event envelopes, resume protocol, and backpressure.
- Post-processing — topic segmentation (TF-IDF), turn-taking policies, domain presets for call centers and meetings.
- Benchmarks — ASR WER, diarization DER, emotion, streaming latency — with baseline gating in CI.
- `faster-whisper` compatible — change one import line.
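A hedged sketch of how a downstream consumer might pin itself to the v2 contract before reading a transcript JSON file. `schema_version` is named in the feature list above; the version-string format and everything else about the file layout are assumptions for this sketch:

```python
import json

def load_transcript(path):
    """Load a transcript JSON, rejecting schema versions we don't support."""
    with open(path) as f:
        doc = json.load(f)
    # Accept "2", "2.x", or a bare integer 2; the exact format is an assumption.
    major = int(str(doc["schema_version"]).split(".")[0])
    if major != 2:
        raise ValueError(f"unsupported schema_version: {doc['schema_version']}")
    return doc
```

Pinning on the major version lets additive minor revisions through while failing loudly on breaking ones.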
| Start here | Then |
|---|---|
| Quickstart | CLI Reference |
| Python API | Configuration |
| faster-whisper Migration | Streaming Architecture |
| Post-Processing | Benchmarks |
Full docs index: docs/INDEX.md | Vision and strategy: VISION.md | Changelog: CHANGELOG.md | Roadmap: ROADMAP.md
- Issues — Bug reports and feature requests
- Security — Report a vulnerability (see SECURITY.md)
- Contributing — See CONTRIBUTING.md for setup, PR process, and coding standards
- Support — Getting help
Dual-licensed under Apache 2.0 or MIT, at your option. See LICENSE-APACHE and LICENSE-MIT.