mudler · Jun 6, 2026
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -74,6 +74,7 @@ set(PARAKEET_SRC
     src/ctc_decoder.cpp
     src/prediction.cpp
     src/joint.cpp
+    src/prompt_kernel.cpp
     src/tdt.cpp
     src/rnnt.cpp
     src/transducer_batch.cpp

diff --git a/README.md b/README.md
@@ -8,7 +8,7 @@
 
 parakeet.cpp is a C++17 inference port of NVIDIA's [NeMo](https://github.com/NVIDIA-NeMo/NeMo) Parakeet speech-recognition models, built on [ggml](https://github.com/ggml-org/ggml). It gives you fast, dependency-light automatic speech recognition on CPU (and on GPU through ggml's backends), with no Python runtime needed at inference time.
 
-It covers all the offline Parakeet families (CTC, RNNT, TDT, and hybrid TDT-CTC, in 0.6B/1.1B/110M sizes, English plus multilingual v3), each validated at WER 0 against NeMo on every published checkpoint. It also does **cache-aware streaming with end-of-utterance (EOU) detection** for `parakeet_realtime_eou_120m-v1`, where the streaming transcript matches NeMo's cache-aware streaming byte for byte. The full coverage matrix lives in `docs/parity.md`.
+It covers all the offline Parakeet families (CTC, RNNT, TDT, and hybrid TDT-CTC, in 0.6B/1.1B/110M sizes, English plus multilingual v3), each validated at WER 0 against NeMo on every published checkpoint. It also does **cache-aware streaming with end-of-utterance (EOU) detection** for `parakeet_realtime_eou_120m-v1`, where the streaming transcript matches NeMo's cache-aware streaming byte for byte. And it supports the **multilingual, prompt-conditioned streaming model** `nvidia/nemotron-3.5-asr-streaming-0.6b` (40+ locales): pass a target language with `--lang <locale>` (default `auto`) and both the offline and the cache-aware streaming transcripts match NeMo per language at WER 0. The full coverage matrix lives in `docs/parity.md`.
 
 It's faster than NeMo's PyTorch runtime on both CPU and GPU, with byte-identical transcripts. The full numbers, methodology, and all the plots are in [benchmarks/BENCHMARK.md](benchmarks/BENCHMARK.md).
 

diff --git a/benchmarks/BENCHMARK.md b/benchmarks/BENCHMARK.md
@@ -69,6 +69,24 @@ Versus whisper.cpp turbo, same accuracy (WER 1.6% on this clip) and far less com
 
 > **Speedup** = ours RTFx / NeMo RTFx (>1 = faster than NeMo). f32 reproduces NeMo's transcript (agreement ≈ 0).
 
+## Nemotron (streaming, multilingual, prompt-conditioned)
+
+`nemotron-3.5-asr-streaming-0.6b` is a FastConformer transducer with a per-language prompt: a one-hot language vector drives a PromptKernel between the encoder and the RNN-T decoder. It runs both offline and cache-aware streaming. Because it loads from a local `.nemo` plus its GGUF, it sits outside the LibriSpeech pipeline above and is measured on its own here.
+
+One clip (`speech.wav`, 7.43 s), language prompt `en`, 8 threads, median of 7 passes after one warmup. The parakeet.cpp side ("ours") is `parakeet-cli bench --decoder tdt --lang en` (load once, time transcribe only); NeMo runs the same prompt forward (preprocessor, encoder, PromptKernel, RNN-T greedy) on PyTorch CPU. RTFx = audio seconds per second of compute; higher is faster.
+
+Host: AMD Ryzen 9 9950X3D (20 cores), CPU-only. NeMo 2.8.0rc0.
+
+| Engine | RTFx | Speedup vs NeMo | Agreement WER vs NeMo |
+|---|---|---|---|
+| NeMo (PyTorch CPU) | 12.2 | 1.00× | reference |
+| parakeet.cpp f32 | 29.4 | 2.40× | 0.0000% |
+| parakeet.cpp q8_0 | 30.8 | 2.52× | 0.0000% |
+
+Accuracy is **WER 0 vs NeMo**: the f32 and q8_0 transcripts are byte-identical to NeMo's on the timed runs (agreement WER 0.0000%), so the speed numbers compare equal work. parakeet.cpp is **2.40× faster than NeMo at f32** and **2.52× at q8_0**.
+
+Streaming path (f32, cache-aware): compute RTFx **3.80** (median wall 2503 ms over the 7.43 s clip, one-time model load of 548 ms subtracted). Streaming is latency-oriented: it runs many small chunked forward passes rather than one offline pass, so its RTFx sits well below the offline number by design while staying several times real time. The streaming transcript matches the offline and NeMo transcripts.
+
 ## Quantization — size / speed / accuracy tradeoff
 
 Averaged over all models (LibriSpeech). Size is the mean GGUF size as a fraction of the f32 GGUF.

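Review note on the "Agreement WER vs NeMo" column: a minimal sketch of a word-level WER, assuming the usual definition (word-level edit distance divided by the reference word count, whitespace tokenization, no text normalization). The function name and the tokenization are illustrative assumptions; the repository's own scoring script may normalize differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + substitution,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


nemo_text = "It is certainly very like the old portrait."
identical = word_error_rate(nemo_text, nemo_text)
one_substitution = word_error_rate(
    nemo_text, "It is certainly very like the new portrait."
)
```

A byte-identical transcript always scores 0 under this definition; the converse does not hold in general once a scorer normalizes case or punctuation, which is why the text states byte identity separately.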
diff --git a/benchmarks/results/nemotron/bench.json b/benchmarks/results/nemotron/bench.json
@@ -0,0 +1,40 @@
+{
+  "clip": "speech.wav",
+  "audio_sec": 7.435,
+  "lang": "en",
+  "threads": 8,
+  "passes": 7,
+  "nemo": {
+    "rtfx": 12.228159482130682,
+    "median_proc_s": 0.6080228190403432,
+    "text": "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. <en-US> It is certainly very like the old portrait. <en-US>",
+    "version": "2.8.0rc0",
+    "load_s": 25.837787624972407
+  },
+  "ours": {
+    "f32": {
+      "rtfx": 29.353937020308894,
+      "speedup": 2.400519641831998,
+      "median_proc_s": 0.253288,
+      "agreement_wer": 0.0,
+      "text": "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. <en-US> It is certainly very like the old portrait. <en-US>",
+      "load_ms": 547.68
+    },
+    "q8_0": {
+      "rtfx": 30.819419343071743,
+      "speedup": 2.5203645232227254,
+      "median_proc_s": 0.241244,
+      "agreement_wer": 0.0,
+      "text": "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. <en-US> It is certainly very like the old portrait. <en-US>",
+      "load_ms": 247.568
+    }
+  },
+  "stream": {
+    "dtype": "f32",
+    "compute_rtfx": 3.801518370630737,
+    "wall_rtfx": 2.96986895150122,
+    "median_wall_s": 2.5034774669911712,
+    "compute_s": 1.9557974669911713,
+    "load_s": 0.54768
+  }
+}
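Review note: the headline numbers in `BENCHMARK.md` can be re-derived from `bench.json`. This is a sketch that assumes RTFx is `audio_sec / median_proc_s` and that the streaming compute time is the median wall time minus the one-time model load, as the prose describes; the helper name `rtfx` is illustrative and not taken from the benchmark script.

```python
AUDIO_SEC = 7.435  # "audio_sec" in bench.json


def rtfx(audio_sec: float, proc_sec: float) -> float:
    """Audio seconds processed per second of compute (higher is faster)."""
    return audio_sec / proc_sec


# Offline runs: median processing time per engine, copied from bench.json.
nemo_rtfx = rtfx(AUDIO_SEC, 0.6080228190403432)
f32_rtfx = rtfx(AUDIO_SEC, 0.253288)
q8_0_rtfx = rtfx(AUDIO_SEC, 0.241244)

# Speedup is the ratio of RTFx values against the NeMo reference.
f32_speedup = f32_rtfx / nemo_rtfx
q8_0_speedup = q8_0_rtfx / nemo_rtfx

# Streaming: subtract the one-time model load from the median wall time.
stream_wall_s = 2.5034774669911712
stream_load_s = 0.54768
stream_compute_s = stream_wall_s - stream_load_s
stream_compute_rtfx = rtfx(AUDIO_SEC, stream_compute_s)
stream_wall_rtfx = rtfx(AUDIO_SEC, stream_wall_s)

for name, value in [
    ("nemo_rtfx", nemo_rtfx),
    ("f32_rtfx", f32_rtfx),
    ("q8_0_rtfx", q8_0_rtfx),
    ("f32_speedup", f32_speedup),
    ("q8_0_speedup", q8_0_speedup),
    ("stream_compute_rtfx", stream_compute_rtfx),
    ("stream_wall_rtfx", stream_wall_rtfx),
]:
    print(f"{name}: {value:.4f}")
```

Under these assumptions the results agree with the recorded `rtfx`, `speedup`, `compute_rtfx`, and `wall_rtfx` fields to within rounding.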
diff --git a/docs/conversion.md b/docs/conversion.md
@@ -35,6 +35,16 @@ offline checkpoints omit them entirely (so they keep converting byte-identically
 and the C++ loader falls back to offline-safe defaults (`att_context [-1,-1]`,
 style `regular`, causal flags `false`, no streaming block).
 
+The `parakeet.prompt.*` keys and `parakeet.encoder.use_bias` /
+`parakeet.encoder.att_context_presets` are emitted **only for the
+prompt-conditioned multilingual model** `nvidia/nemotron-3.5-asr-streaming-0.6b`
+(`model_defaults.initialize_prompt_feature == true`). Every other checkpoint omits
+them, and the loader defaults to `prompt.present=false` (the prompt stage is skipped)
+and `use_bias=true`. The nemotron checkpoint stores `att_context_size` as a **list of presets**
+(`[[56,3],[56,0],[56,6],[56,13]]`; the first is the default 320 ms preset) rather
+than a single `[left,right]`, so the converter records all of them in
+`att_context_presets` and uses the first pair for the scalar left/right keys.
+
 | Key | GGUF type | Meaning | Source | 110m value |
 | --- | --- | --- | --- | --- |
 | `parakeet.arch` | STRING | One of `ctc` / `rnnt` / `tdt` / `hybrid_rnnt_ctc` / `hybrid_tdt_ctc` | arch detection (below) | `hybrid_tdt_ctc` |
@@ -61,6 +71,13 @@ style `regular`, causal flags `false`, no streaming block).
 | `parakeet.streaming.valid_out_len` | INT32 | Valid encoder frames per step. **Streaming only.** | `encoder.streaming_cfg.valid_out_len` | (n/a) |
 | `parakeet.streaming.pre_encode_cache_size` | ARRAY\<INT32\> | Pre-encode (mel) cache frames `[first, rest]`. **Streaming only.** | `encoder.streaming_cfg.pre_encode_cache_size` | (n/a) |
 | `parakeet.streaming.drop_extra_pre_encoded` | INT32 | Steps dropped after pre-encode. **Streaming only.** | `encoder.streaming_cfg.drop_extra_pre_encoded` | (n/a) |
+| `parakeet.encoder.use_bias` | BOOL | Whether the encoder linear layers carry a bias. `false` for nemotron (`use_bias=false`); the loader reads biases optionally and tolerates their absence. Defaults to `true`. | `cfg.encoder.use_bias` | `true` |
+| `parakeet.encoder.att_context_presets` | ARRAY\<INT32\> | Flattened `[l,r,l,r,...]` list of all `att_context_size` presets when the model stores a **list** of `[left,right]` pairs (multi-latency streaming, e.g. nemotron `[[56,3],[56,0],[56,6],[56,13]]`). The first pair is the default and is also written to `att_context_left`/`att_context_right`. **Streaming, multi-context only.** | `cfg.encoder.att_context_size` | (n/a) |
+| `parakeet.prompt.present` | BOOL | Marks a prompt-conditioned multilingual model (nemotron). When `true` the C++ engine inserts the `prompt_kernel` (Linear, ReLU, Linear) on the encoder output, selected by a per-utterance language one-hot. Absent/`false` for every other model (which skip the stage entirely). | `model_defaults.initialize_prompt_feature` | (n/a) |
+| `parakeet.prompt.num_prompts` | UINT32 | Width of the language one-hot appended to the encoder output (`prompt_kernel.0` input = `d_model + num_prompts`). **Prompt only.** | `model_defaults.num_prompts` | 128 |
+| `parakeet.prompt.default_lang` | STRING | Locale used when no `--lang`/`target_lang` is given (nemotron: `auto`, prompt index 101). **Prompt only.** | derived (`auto` if present) | `auto` |
+| `parakeet.prompt.dictionary.keys` | ARRAY\<STRING\> | Locale strings (e.g. `en`, `en-US`, `de`, `es`, `ja-JP`, `auto`) parallel to `dictionary.values`. The loader resolves a `target_lang` to its prompt index by lookup. **Prompt only.** | `model_defaults.prompt_dictionary` keys | len 121 |
+| `parakeet.prompt.dictionary.values` | ARRAY\<INT32\> | Prompt index for each parallel key (multiple locales may share an index, e.g. `en` and `en-US` both map to 0). **Prompt only.** | `model_defaults.prompt_dictionary` values | (n/a) |
 | `parakeet.preprocessor.sample_rate` | UINT32 | Audio sample rate | `featurizer.sample_rate` | 16000 |
 | `parakeet.preprocessor.n_mels` | UINT32 | Mel filterbank count | `featurizer.nfilt` | 80 |
 | `parakeet.preprocessor.n_fft` | UINT32 | FFT size | `featurizer.n_fft` | 512 |
@@ -122,6 +139,13 @@ State-dict prefixes present in the hybrid anchor (690 tensors total):
 | CTC head (hybrid aux CTC) | `ctc_decoder.decoder_layers.0.*` | `ctc_decoder.decoder_layers.0.weight` shape `(vocab+1, d_model, 1)` |
 | Prediction net (LSTM) | `decoder.prediction.*` | `decoder.prediction.embed.weight`, `decoder.prediction.dec_rnn.lstm.weight_ih_l0` |
 | Joint net | `joint.{enc,pred,joint_net}.*` | `joint.joint_net.2.weight` shape `(vocab+1+D, joint_hidden)` |
+| Prompt kernel (nemotron only) | `prompt_kernel.{0,2}.*` | `prompt_kernel.0.weight` `(2048, d_model+num_prompts)`, `prompt_kernel.0.bias` `(2048,)`, `prompt_kernel.2.weight` `(d_model, 2048)`, `prompt_kernel.2.bias` `(d_model,)` |
+
+> The `prompt_kernel.*` projection weights are written verbatim by the generic
+> tensor loop (no special handling); only the `parakeet.prompt.*` KV metadata is
+> added by the converter. Like the LSTM prediction net and the featurizer buffers,
+> the prompt kernel is **not** on the quantization allowlist, so it stays F32 in
+> every quantized variant.
 
 > Pure-CTC checkpoints (`EncDecCTCModelBPE`) put the CTC head under `decoder.*`
 > instead of `ctc_decoder.*`; the verbatim rule preserves whatever the checkpoint

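Review note: the prompt stage documented in `docs/conversion.md` can be sketched in plain Python. This is a toy illustration under stated assumptions, not the C++ implementation: the one-hot is concatenated after the encoder frame (matching `prompt_kernel.0` input = `d_model + num_prompts`), weights are stored as `(out_features, in_features)`, and all sizes, weights, and the `de`/`auto` indices below are made up for the example (the real model uses `num_prompts = 128` and maps `auto` to index 101). Whether the kernel output is combined with the encoder output in any further way is not stated in the docs and is not modelled here.

```python
import random


def resolve_prompt_index(keys, values, target_lang):
    """Resolve a locale through the parallel dictionary.keys / dictionary.values arrays."""
    try:
        return values[keys.index(target_lang)]
    except ValueError:
        raise KeyError(f"unknown locale: {target_lang!r}") from None


def one_hot(index, num_prompts):
    vec = [0.0] * num_prompts
    vec[index] = 1.0
    return vec


def linear(x, weight, bias):
    """y = W x + b, with W stored row-major as (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def prompt_kernel(frame, prompt, w0, b0, w2, b2):
    """Linear -> ReLU -> Linear on one encoder frame with the one-hot appended."""
    hidden = [max(0.0, h) for h in linear(frame + prompt, w0, b0)]
    return linear(hidden, w2, b2)


def apply_prompt(frames, prompt, params):
    """Apply the kernel independently to every frame of a sequence."""
    return [prompt_kernel(frame, prompt, *params) for frame in frames]


# Toy sizes and random weights (hypothetical, for illustration only).
D_MODEL, NUM_PROMPTS, HIDDEN = 4, 3, 5
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


params = (
    rand_matrix(HIDDEN, D_MODEL + NUM_PROMPTS),  # prompt_kernel.0.weight
    rand_matrix(1, HIDDEN)[0],                   # prompt_kernel.0.bias
    rand_matrix(D_MODEL, HIDDEN),                # prompt_kernel.2.weight
    rand_matrix(1, D_MODEL)[0],                  # prompt_kernel.2.bias
)

# Toy dictionary: `en` and `en-US` share index 0, as in the real model.
keys = ["en", "en-US", "de", "auto"]
values = [0, 0, 1, 2]

prompt = one_hot(resolve_prompt_index(keys, values, "en-US"), NUM_PROMPTS)
frames = rand_matrix(6, D_MODEL)  # six fake encoder frames

full = apply_prompt(frames, prompt, params)
chunked = apply_prompt(frames[:2], prompt, params) + apply_prompt(frames[2:], prompt, params)
```

Because the kernel acts on each frame independently and the one-hot does not change over time, `chunked` equals `full` exactly in this sketch; a locale that is not a dictionary key (for example `ja`) raises `KeyError` instead of silently falling back.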
diff --git a/docs/parity.md b/docs/parity.md
@@ -30,6 +30,7 @@ CPU, batch 1, deterministic greedy (NeMo 2.7.3).
 | `parakeet-rnnt-0.6b` | RNNT | `rnnt` | 1024 / 24 | 80 | **true** | 1024 | RNNT | **0.0** | PASS |
 | `parakeet-rnnt-1.1b` | RNNT | `rnnt` | 1024 / 42 | 80 | **true** | 1024 | RNNT | **0.0** | PASS |
 | `parakeet_realtime_eou_120m-v1` | Streaming + EOU | `rnnt` | 512 / 17 | 128 | false | 1026 | RNNT (offline, limited-context) | **0.0** | PASS (Phase 5 — 5a milestone) |
+| `nemotron-3.5-asr-streaming-0.6b` | Streaming, multilingual, prompt-conditioned | `rnnt` | 1024 / 24 | 128 | false | 13087 | RNNT offline + cache-aware streaming, per language | **0.0** | PASS (offline + streaming, langs en/de/es/ja-JP/auto) |
 
 Notes:
 - `xscaling` = NeMo FastConformer `xscale=sqrt(d_model)` (true) vs `xscale=None` (false).
@@ -53,6 +54,19 @@ Notes:
   clip NeMo's streaming does NOT emit `<EOU>` (the final-chunk tail has incomplete
   right context); the C++ streaming session/C-API/CLI match that exactly and do
   not fabricate one. See "Phase 5 — Streaming + EOU" below.
+- `nemotron-3.5-asr-streaming-0.6b` (multilingual, prompt-conditioned): a target
+  language one-hot (`--lang <locale>`, default `auto`) is projected through the
+  `prompt_kernel` (Linear, ReLU, Linear) on the encoder output before the RNNT
+  decode, both offline and per streaming chunk (the one-hot is constant over time,
+  so per-chunk application is exact). The authoritative NeMo reference decodes the
+  prompt-conditioned encoder output via `m.decoding.rnnt_decoder_predictions_tensor`
+  (the lhotse `transcribe(target_lang=...)` path needs per-cut language metadata our
+  bare wav fixtures lack). `scripts/e2e_nemo_compare.py` cross-checks the C++ CLI
+  against this reference for `tests/fixtures/{speech,clip}.wav` × `{en, de, es,
+  ja-JP, auto}` × `{offline, stream}`: **all 20 rows WER 0.0**. The `prompt_kernel`,
+  LSTM prediction net, and featurizer tensors stay F32 in every quantized variant
+  (f16 and q8_0 also verified WER 0.0; see `docs/quantization.md`). Note that `ja` is
+  not a dictionary key; the Japanese locale is `ja-JP`.
 
 ---