Merged
16 commits
b7638a7
feat(convert): emit prompt-conditioning KVs + use_bias; handle nested…
mudler Jun 6, 2026
b0e71f1
feat(loader): read prompt-conditioning config + encoder.use_bias
mudler Jun 6, 2026
8d2fdb2
test: make nemotron loader assertions reachable without PARAKEET_TEST…
mudler Jun 6, 2026
ff37ed8
feat(baseline): dump prompt_kernel_out + per-language RNNT reference …
mudler Jun 6, 2026
1dd074e
feat: PromptKernel post-encoder conditioning unit + isolated NeMo par…
mudler Jun 6, 2026
9642ac6
feat(model): apply PromptKernel + resolve target_lang in offline decode
mudler Jun 6, 2026
ae0243b
test: offline nemotron end-to-end NeMo parity (multi-language)
mudler Jun 6, 2026
9d50385
feat(streaming): apply PromptKernel per chunk; target_lang on session
mudler Jun 6, 2026
db11296
test: streaming nemotron end-to-end NeMo parity
mudler Jun 6, 2026
3d1436c
feat(capi): target_lang variants for transcribe + stream (ABI bump)
mudler Jun 6, 2026
f97f20f
feat(cli): --lang flag for multilingual prompt models
mudler Jun 6, 2026
4a229ba
test: e2e NeMo-vs-parakeet.cpp comparison harness (per-language, offl…
mudler Jun 6, 2026
1594484
docs: document nemotron multilingual streaming support + prompt KVs
mudler Jun 6, 2026
673d0b6
feat(publish): add nemotron-3.5-asr-streaming-0.6b (5 variants, OpenM…
mudler Jun 6, 2026
3d4c143
bench: nemotron-3.5-asr CPU benchmark vs NeMo (WER 0, 2.4x f32 / 2.5x…
mudler Jun 6, 2026
411ece3
fix(streaming): reject unknown target_lang for prompt models (match o…
mudler Jun 6, 2026
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -74,6 +74,7 @@ set(PARAKEET_SRC
src/ctc_decoder.cpp
src/prediction.cpp
src/joint.cpp
src/prompt_kernel.cpp
src/tdt.cpp
src/rnnt.cpp
src/transducer_batch.cpp
2 changes: 1 addition & 1 deletion README.md
@@ -8,7 +8,7 @@

parakeet.cpp is a C++17 inference port of NVIDIA's [NeMo](https://github.com/NVIDIA-NeMo/NeMo) Parakeet speech-recognition models, built on [ggml](https://github.com/ggml-org/ggml). It gives you fast, dependency-light automatic speech recognition on CPU (and on GPU through ggml's backends), with no Python runtime needed at inference time.

It covers all the offline Parakeet families (CTC, RNNT, TDT, and hybrid TDT-CTC, in 0.6B/1.1B/110M sizes, English plus multilingual v3), each validated at WER 0 against NeMo on every published checkpoint. It also does **cache-aware streaming with end-of-utterance (EOU) detection** for `parakeet_realtime_eou_120m-v1`, where the streaming transcript matches NeMo's cache-aware streaming byte for byte. The full coverage matrix lives in `docs/parity.md`.
It covers all the offline Parakeet families (CTC, RNNT, TDT, and hybrid TDT-CTC, in 0.6B/1.1B/110M sizes, English plus multilingual v3), each validated at WER 0 against NeMo on every published checkpoint. It also does **cache-aware streaming with end-of-utterance (EOU) detection** for `parakeet_realtime_eou_120m-v1`, where the streaming transcript matches NeMo's cache-aware streaming byte for byte. And it supports the **multilingual, prompt-conditioned streaming model** `nvidia/nemotron-3.5-asr-streaming-0.6b` (40+ locales): pass a target language with `--lang <locale>` (default `auto`) and both the offline and the cache-aware streaming transcripts match NeMo per language at WER 0. The full coverage matrix lives in `docs/parity.md`.

It's faster than NeMo's PyTorch runtime on both CPU and GPU, with byte-identical transcripts. The full numbers, methodology, and all the plots are in [benchmarks/BENCHMARK.md](benchmarks/BENCHMARK.md).

18 changes: 18 additions & 0 deletions benchmarks/BENCHMARK.md
@@ -69,6 +69,24 @@ Versus whisper.cpp turbo, same accuracy (WER 1.6% on this clip) and far less com

> **Speedup** = ours RTFx / NeMo RTFx (>1 = faster than NeMo). f32 reproduces NeMo's transcript (agreement ≈ 0).

## Nemotron (streaming, multilingual, prompt-conditioned)

`nemotron-3.5-asr-streaming-0.6b` is a FastConformer transducer with a per-language prompt: a one-hot language vector drives a PromptKernel between the encoder and the RNN-T decoder. It runs both offline and cache-aware streaming. Because it loads from a local `.nemo` plus its GGUF, it sits outside the LibriSpeech pipeline above and is measured on its own here.

One clip (`speech.wav`, 7.43 s), language prompt `en`, 8 threads, median of 7 passes after one warmup. Ours is `parakeet-cli bench --decoder tdt --lang en` (load once, time transcribe only); NeMo runs the same prompt-conditioned forward pass (preprocessor, encoder, PromptKernel, RNN-T greedy) on PyTorch CPU. RTFx = seconds of audio processed per second of compute; higher is faster.

Host: AMD Ryzen 9 9950X3D (20 cores), CPU-only. NeMo 2.8.0rc0.

| Engine | RTFx | Speedup vs NeMo | Agreement WER vs NeMo |
|---|---|---|---|
| NeMo (PyTorch CPU) | 12.2 | 1.00× | reference |
| parakeet.cpp f32 | 29.4 | 2.40× | 0.0000% |
| parakeet.cpp q8_0 | 30.8 | 2.52× | 0.0000% |

Accuracy is **WER 0 vs NeMo**: the f32 and q8_0 transcripts are byte-identical to NeMo's on the timed runs (agreement WER 0.0000%), so the speed numbers compare equal work. parakeet.cpp is **2.40× faster than NeMo at f32** and **2.52× at q8_0**.

Streaming path (f32, cache-aware): compute RTFx **3.80** (median wall 2503 ms over the 7.43 s clip, one-time model load of 548 ms subtracted). Streaming is latency-oriented: it runs many small chunked forward passes rather than one offline pass, so its RTFx sits well below the offline number by design while staying several times real time. The streaming transcript matches the offline and NeMo transcripts.
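
The figures in this section can be re-derived from the raw timings recorded in `benchmarks/results/nemotron/bench.json`. The sketch below is only an arithmetic check, not the benchmark harness itself: the `rtfx` helper and the variable names are illustrative, and it assumes the streaming compute time is the median wall time minus the one-time model load, as the paragraph above states.

```python
# Raw timings copied from benchmarks/results/nemotron/bench.json.
AUDIO_SEC = 7.435


def rtfx(audio_sec: float, proc_sec: float) -> float:
    """Seconds of audio processed per second of compute (higher is faster)."""
    return audio_sec / proc_sec


# Offline: median processing time per engine.
nemo = rtfx(AUDIO_SEC, 0.6080228190403432)
f32 = rtfx(AUDIO_SEC, 0.253288)
q8_0 = rtfx(AUDIO_SEC, 0.241244)

# Streaming: the one-time model load is subtracted from the median wall time.
stream_wall_s = 2.5034774669911712
stream_load_s = 0.54768
stream_compute_rtfx = rtfx(AUDIO_SEC, stream_wall_s - stream_load_s)
stream_wall_rtfx = rtfx(AUDIO_SEC, stream_wall_s)

print(f"NeMo {nemo:.1f} | f32 {f32:.1f} ({f32 / nemo:.2f}x) | "
      f"q8_0 {q8_0:.1f} ({q8_0 / nemo:.2f}x)")
print(f"streaming compute RTFx {stream_compute_rtfx:.2f}, "
      f"wall RTFx {stream_wall_rtfx:.2f}")
```

Speedup is the ratio of the two RTFx values, which is why it is independent of the clip length.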

## Quantization — size / speed / accuracy tradeoff

Averaged over all models (LibriSpeech). Size is the mean GGUF size as a fraction of the f32 GGUF.
40 changes: 40 additions & 0 deletions benchmarks/results/nemotron/bench.json
@@ -0,0 +1,40 @@
{
  "clip": "speech.wav",
  "audio_sec": 7.435,
  "lang": "en",
  "threads": 8,
  "passes": 7,
  "nemo": {
    "rtfx": 12.228159482130682,
    "median_proc_s": 0.6080228190403432,
    "text": "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. <en-US> It is certainly very like the old portrait. <en-US>",
    "version": "2.8.0rc0",
    "load_s": 25.837787624972407
  },
  "ours": {
    "f32": {
      "rtfx": 29.353937020308894,
      "speedup": 2.400519641831998,
      "median_proc_s": 0.253288,
      "agreement_wer": 0.0,
      "text": "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. <en-US> It is certainly very like the old portrait. <en-US>",
      "load_ms": 547.68
    },
    "q8_0": {
      "rtfx": 30.819419343071743,
      "speedup": 2.5203645232227254,
      "median_proc_s": 0.241244,
      "agreement_wer": 0.0,
      "text": "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. <en-US> It is certainly very like the old portrait. <en-US>",
      "load_ms": 247.568
    }
  },
  "stream": {
    "dtype": "f32",
    "compute_rtfx": 3.801518370630737,
    "wall_rtfx": 2.96986895150122,
    "median_wall_s": 2.5034774669911712,
    "compute_s": 1.9557974669911713,
    "load_s": 0.54768
  }
}
24 changes: 24 additions & 0 deletions docs/conversion.md
@@ -35,6 +35,16 @@ offline checkpoints omit them entirely (so they keep converting byte-identically
and the C++ loader falls back to offline-safe defaults (`att_context [-1,-1]`,
style `regular`, causal flags `false`, no streaming block).

The `parakeet.prompt.*` keys and `parakeet.encoder.use_bias` /
`parakeet.encoder.att_context_presets` are emitted **only for the
prompt-conditioned multilingual model** `nvidia/nemotron-3.5-asr-streaming-0.6b`
(`model_defaults.initialize_prompt_feature == true`). Every other checkpoint omits
them, and the loader defaults to `prompt.present=false` (the prompt stage is skipped)
and `use_bias=true`. nemotron stores `att_context_size` as a **list of presets**
(`[[56,3],[56,0],[56,6],[56,13]]`, the first is the default 320 ms preset) rather
than a single `[left,right]`, so the converter records all of them in
`att_context_presets` and uses the first pair for the scalar left/right keys.
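
As a rough illustration of the preset handling described above, the following sketch flattens a list of `[left,right]` pairs into the `[l,r,l,r,...]` layout and picks the first pair for the scalar keys. It is not the converter's code: `flatten_presets` and `unflatten_presets` are hypothetical names, and returning an empty preset list for a single pair is an assumption based on the key being described as multi-context only.

```python
def flatten_presets(att_context_size):
    """Return (presets_flat, left, right) for one pair or a list of pairs."""
    is_list_of_pairs = bool(att_context_size) and isinstance(
        att_context_size[0], (list, tuple)
    )
    if is_list_of_pairs:
        pairs = [(int(l), int(r)) for l, r in att_context_size]
        # Flattened [l, r, l, r, ...] layout for an ARRAY<INT32> value.
        flat = [v for pair in pairs for v in pair]
    else:
        # Single [left, right]: assume no preset array is written at all.
        left, right = att_context_size
        pairs = [(int(left), int(right))]
        flat = []
    # The first pair is the default and feeds the scalar left/right keys.
    left, right = pairs[0]
    return flat, left, right


def unflatten_presets(flat):
    """Inverse of the flattening, as a reader of the array might do it."""
    if len(flat) % 2 != 0:
        raise ValueError("flattened preset list must have even length")
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


flat, left, right = flatten_presets([[56, 3], [56, 0], [56, 6], [56, 13]])
print(flat, left, right)
```

Keeping the array flat avoids nested array types in the metadata; a reader recovers the pairs by stepping through the list two values at a time.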

| Key | GGUF type | Meaning | Source | 110m value |
| --- | --- | --- | --- | --- |
| `parakeet.arch` | STRING | One of `ctc` / `rnnt` / `tdt` / `hybrid_rnnt_ctc` / `hybrid_tdt_ctc` | arch detection (below) | `hybrid_tdt_ctc` |
@@ -61,6 +71,13 @@ style `regular`, causal flags `false`, no streaming block).
| `parakeet.streaming.valid_out_len` | INT32 | Valid encoder frames per step. **Streaming only.** | `encoder.streaming_cfg.valid_out_len` | (n/a) |
| `parakeet.streaming.pre_encode_cache_size` | ARRAY\<INT32\> | Pre-encode (mel) cache frames `[first, rest]`. **Streaming only.** | `encoder.streaming_cfg.pre_encode_cache_size` | (n/a) |
| `parakeet.streaming.drop_extra_pre_encoded` | INT32 | Steps dropped after pre-encode. **Streaming only.** | `encoder.streaming_cfg.drop_extra_pre_encoded` | (n/a) |
| `parakeet.encoder.use_bias` | BOOL | Whether the encoder linear layers carry a bias. `false` for nemotron (`use_bias=false`); the loader reads biases optionally and tolerates their absence. Defaults to `true`. | `cfg.encoder.use_bias` | `true` |
| `parakeet.encoder.att_context_presets` | ARRAY\<INT32\> | Flattened `[l,r,l,r,...]` list of all `att_context_size` presets when the model stores a **list** of `[left,right]` pairs (multi-latency streaming, e.g. nemotron `[[56,3],[56,0],[56,6],[56,13]]`). The first pair is the default and is also written to `att_context_left`/`att_context_right`. **Streaming, multi-context only.** | `cfg.encoder.att_context_size` | (n/a) |
| `parakeet.prompt.present` | BOOL | Marks a prompt-conditioned multilingual model (nemotron). When `true` the C++ engine inserts the `prompt_kernel` (Linear, ReLU, Linear) on the encoder output, selected by a per-utterance language one-hot. Absent/`false` for every other model (which skip the stage entirely). | `model_defaults.initialize_prompt_feature` | (n/a) |
| `parakeet.prompt.num_prompts` | UINT32 | Width of the language one-hot appended to the encoder output (`prompt_kernel.0` input = `d_model + num_prompts`). **Prompt only.** | `model_defaults.num_prompts` | 128 |
| `parakeet.prompt.default_lang` | STRING | Locale used when no `--lang`/`target_lang` is given (nemotron: `auto`, prompt index 101). **Prompt only.** | derived (`auto` if present) | `auto` |
| `parakeet.prompt.dictionary.keys` | ARRAY\<STRING\> | Locale strings (e.g. `en`, `en-US`, `de`, `es`, `ja-JP`, `auto`) parallel to `dictionary.values`. The loader resolves a `target_lang` to its prompt index by lookup. **Prompt only.** | `model_defaults.prompt_dictionary` keys | len 121 |
| `parakeet.prompt.dictionary.values` | ARRAY\<INT32\> | Prompt index for each parallel key (multiple locales may share an index, e.g. `en` and `en-US` both map to 0). **Prompt only.** | `model_defaults.prompt_dictionary` values | (n/a) |
| `parakeet.preprocessor.sample_rate` | UINT32 | Audio sample rate | `featurizer.sample_rate` | 16000 |
| `parakeet.preprocessor.n_mels` | UINT32 | Mel filterbank count | `featurizer.nfilt` | 80 |
| `parakeet.preprocessor.n_fft` | UINT32 | FFT size | `featurizer.n_fft` | 512 |
@@ -122,6 +139,13 @@ State-dict prefixes present in the hybrid anchor (690 tensors total):
| CTC head (hybrid aux CTC) | `ctc_decoder.decoder_layers.0.*` | `ctc_decoder.decoder_layers.0.weight` shape `(vocab+1, d_model, 1)` |
| Prediction net (LSTM) | `decoder.prediction.*` | `decoder.prediction.embed.weight`, `decoder.prediction.dec_rnn.lstm.weight_ih_l0` |
| Joint net | `joint.{enc,pred,joint_net}.*` | `joint.joint_net.2.weight` shape `(vocab+1+D, joint_hidden)` |
| Prompt kernel (nemotron only) | `prompt_kernel.{0,2}.*` | `prompt_kernel.0.weight` `(2048, d_model+num_prompts)`, `prompt_kernel.0.bias` `(2048,)`, `prompt_kernel.2.weight` `(d_model, 2048)`, `prompt_kernel.2.bias` `(d_model,)` |

> The `prompt_kernel.*` projection weights are written verbatim by the generic
> tensor loop (no special handling); only the `parakeet.prompt.*` KV metadata is
> added by the converter. Like the LSTM prediction net and the featurizer buffers,
> the prompt kernel is **not** on the quantization allowlist, so it stays F32 in
> every quantized variant.
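
To make the tensor shapes above concrete, here is a minimal pure-Python sketch of a Linear, ReLU, Linear stack applied to encoder frames with a language one-hot appended. It is an illustration under assumptions, not the code in `src/prompt_kernel.cpp`: the function names, the toy sizes, and the made-up weights are all hypothetical, and anything the real engine does beyond these three layers is not shown.

```python
def linear(w, b, x):
    """y = W x + b, with W stored as (out_features, in_features)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def one_hot(index, width):
    v = [0.0] * width
    v[index] = 1.0
    return v


def prompt_kernel(frames, prompt_index, num_prompts, w0, b0, w2, b2):
    """Apply Linear -> ReLU -> Linear to each encoder frame.

    The same one-hot is appended to every frame, so each frame is
    processed independently of its neighbours.
    """
    lang = one_hot(prompt_index, num_prompts)
    out = []
    for frame in frames:
        hidden = [max(0.0, h) for h in linear(w0, b0, frame + lang)]
        out.append(linear(w2, b2, hidden))
    return out


# Toy sizes for illustration (the source gives hidden=2048, num_prompts=128).
D_MODEL, NUM_PROMPTS, HIDDEN = 4, 3, 5
# Arbitrary deterministic weights with the documented shape layout.
w0 = [[0.01 * ((i + 1) * (j + 2) % 7 - 3) for j in range(D_MODEL + NUM_PROMPTS)]
      for i in range(HIDDEN)]                      # (hidden, d_model + num_prompts)
b0 = [0.1] * HIDDEN                                # (hidden,)
w2 = [[0.02 * ((i + 2) * (j + 1) % 5 - 2) for j in range(HIDDEN)]
      for i in range(D_MODEL)]                     # (d_model, hidden)
b2 = [0.0] * D_MODEL                               # (d_model,)

frames = [[0.1 * (t + k) for k in range(D_MODEL)] for t in range(6)]
full = prompt_kernel(frames, 1, NUM_PROMPTS, w0, b0, w2, b2)
chunked = (prompt_kernel(frames[:2], 1, NUM_PROMPTS, w0, b0, w2, b2)
           + prompt_kernel(frames[2:], 1, NUM_PROMPTS, w0, b0, w2, b2))
print(len(full), len(full[0]), full == chunked)
```

Because the stack acts on one frame at a time, running it on consecutive chunks and concatenating the results gives the same output as running it on the whole sequence.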

> Pure-CTC checkpoints (`EncDecCTCModelBPE`) put the CTC head under `decoder.*`
> instead of `ctc_decoder.*`; the verbatim rule preserves whatever the checkpoint
14 changes: 14 additions & 0 deletions docs/parity.md
@@ -30,6 +30,7 @@ CPU, batch 1, deterministic greedy (NeMo 2.7.3).
| `parakeet-rnnt-0.6b` | RNNT | `rnnt` | 1024 / 24 | 80 | **true** | 1024 | RNNT | **0.0** | PASS |
| `parakeet-rnnt-1.1b` | RNNT | `rnnt` | 1024 / 42 | 80 | **true** | 1024 | RNNT | **0.0** | PASS |
| `parakeet_realtime_eou_120m-v1` | Streaming + EOU | `rnnt` | 512 / 17 | 128 | false | 1026 | RNNT (offline, limited-context) | **0.0** | PASS (Phase 5 — 5a milestone) |
| `nemotron-3.5-asr-streaming-0.6b` | Streaming, multilingual, prompt-conditioned | `rnnt` | 1024 / 24 | 128 | false | 13087 | RNNT offline + cache-aware streaming, per language | **0.0** | PASS (offline + streaming, langs en/de/es/ja-JP/auto) |

Notes:
- `xscaling` = NeMo FastConformer `xscale=sqrt(d_model)` (true) vs `xscale=None` (false).
@@ -53,6 +54,19 @@ Notes:
clip NeMo's streaming does NOT emit `<EOU>` (the final-chunk tail has incomplete
right context); the C++ streaming session/C-API/CLI match that exactly and do
not fabricate one. See "Phase 5 — Streaming + EOU" below.
- `nemotron-3.5-asr-streaming-0.6b` (multilingual, prompt-conditioned): a target
language one-hot (`--lang <locale>`, default `auto`) is projected through the
`prompt_kernel` (Linear, ReLU, Linear) on the encoder output before the RNNT
decode, both offline and per streaming chunk (the one-hot is constant over time,
so per-chunk application is exact). The authoritative NeMo reference decodes the
prompt-conditioned encoder output via `m.decoding.rnnt_decoder_predictions_tensor`
(the lhotse `transcribe(target_lang=...)` path needs per-cut language metadata our
bare wav fixtures lack). `scripts/e2e_nemo_compare.py` cross-checks the C++ CLI
against this reference for `tests/fixtures/{speech,clip}.wav` × `{en, de, es,
ja-JP, auto}` × `{offline, stream}`: **all 20 rows WER 0.0**. The `prompt_kernel`,
LSTM prediction net, and featurizer tensors stay F32 in every quantized variant
(f16 and q8_0 also verified WER 0.0; see `docs/quantization.md`). Note `ja` is not
a dictionary key — the Japanese locale is `ja-JP`.
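
The locale handling in the last note can be pictured as a lookup over the parallel `dictionary.keys` / `dictionary.values` arrays. The sketch below is illustrative only: `resolve_prompt_index` is a hypothetical name, the three-entry excerpt stands in for the real 121-key dictionary, and raising an error for an unknown locale is an assumption about how rejection is surfaced.

```python
def resolve_prompt_index(keys, values, target_lang=None, default_lang="auto"):
    """Map a locale string to its prompt index via parallel key/value lists."""
    if len(keys) != len(values):
        raise ValueError("dictionary keys and values must be parallel")
    # Fall back to the model's default locale when none is given.
    lang = target_lang if target_lang else default_lang
    table = dict(zip(keys, values))
    if lang not in table:
        # Unknown locales are rejected rather than silently remapped.
        raise ValueError(f"unknown target_lang {lang!r}")
    return table[lang]


# Illustrative excerpt: only entries stated in the docs above are used.
KEYS = ["en", "en-US", "auto"]
VALUES = [0, 0, 101]

print(resolve_prompt_index(KEYS, VALUES, "en-US"))
print(resolve_prompt_index(KEYS, VALUES))  # no locale given, falls back to "auto"
```

With this shape, a request for `ja` fails the lookup instead of being guessed as `ja-JP`, which matches the note that only the full locale is a dictionary key.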

---
