Add nemotron-3.5-asr-streaming-0.6b (multilingual prompt-conditioned streaming) by localai-bot · Pull Request #10 · mudler/parakeet.cpp

localai-bot · 2026-06-06T02:14:51Z

Summary

Adds support for nvidia/nemotron-3.5-asr-streaming-0.6b, the multilingual (40+ locales), prompt-conditioned, cache-aware streaming NeMo model.

The model is a standard cache-aware streaming FastConformer + RNN-T that parakeet.cpp already implements, plus one new piece: a post-encoder prompt module that conditions on the target language. We model it as an orthogonal prompt.present capability (like streaming.present), so all existing models stay byte-identical.

The new prompt module is small: take the encoder output [d_model=1024, T], concatenate a one-hot language vector [128, T] (constant over time), and project it back with Linear(1152 to 2048) -> ReLU -> Linear(2048 to 1024). The language is chosen with --lang <locale> (default auto, which is just prompt index 101, so there is no separate language-ID head).
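To make the shape arithmetic concrete (1024 + 128 = 1152 inputs to the first layer), here is a minimal pure-Python sketch of the concat-then-MLP projection. It is not the PromptKernel source: the function names, the toy dimensions, and the weights are all assumptions chosen for illustration.

```python
# Illustrative sketch only. The real module uses d_model=1024, 128 prompt slots
# and a 2048-wide hidden layer; the tiny sizes and weights here are made up.

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def linear(x, weight, bias):
    # weight is laid out as [out][in]; computes weight @ x + bias
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def prompt_project(frames, lang_index, num_prompts, w1, b1, w2, b2):
    """frames: list of T encoder vectors, each d_model long."""
    prompt = one_hot(lang_index, num_prompts)  # identical for every frame
    out = []
    for frame in frames:
        x = frame + prompt  # concatenation: d_model + num_prompts features
        out.append(linear(relu(linear(x, w1, b1)), w2, b2))
    return out

# Toy sizes: d_model=2, num_prompts=3, hidden=4, T=2 (all hypothetical).
frames = [[1.0, 2.0], [0.5, -1.0]]
w1 = [[0.1 * (r + c) for c in range(5)] for r in range(4)]
b1 = [0.0] * 4
w2 = [[0.1 * (r + 1)] * 4 for r in range(2)]
b2 = [0.0] * 2
projected = prompt_project(frames, 1, 3, w1, b1, w2, b2)
```

The output keeps the encoder's [d_model, T] shape, which is why the RNN-T decoder downstream does not need to change.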

What is in here

Converter emits parakeet.prompt.* KVs + encoder.use_bias, and handles the nested att_context_size presets.
New PromptKernel unit (src/prompt_kernel.{hpp,cpp}), wired into both the offline and the per-chunk streaming decode, gated on prompt.present.
target_lang plumbed through Model, StreamingSession, the C-API (new _lang variants, ABI bumped to 3), and the CLI (--lang). Unknown locales error cleanly and consistently across offline, streaming, and the C-API.
Quantize + publish support: added to publish_hf.py (5 variants) with an OpenMDW-1.1 license card (the upstream license differs from the other checkpoints).
README, docs/conversion.md, and docs/parity.md updated.

Validation (parity-first, against NeMo)

NeMo reference comes from NeMo main (the model class EncDecRNNTBPEModelWithPrompt is newer than the pinned 2.7.3; baselines were generated in a dedicated venv).

PromptKernel tensor parity vs NeMo: max abs diff ~1.9e-6.
Offline end-to-end transcript: WER 0 vs NeMo for en, de, es, ja-JP, and auto.
Cache-aware streaming transcript: WER 0 vs NeMo (matches NeMo's cache_aware_stream_step reference).
Quantized f16/q8_0/q6_k/q5_k/q4_k: WER 0 vs NeMo (prompt_kernel, LSTM, and featurizer tensors stay F32).
scripts/e2e_nemo_compare.py: 20-row offline+streaming comparison, all WER 0.
New ctest targets: test_prompt_kernel, test_transcribe_nemotron, test_streaming_nemotron (fixture-gated, skip 77 when the model gguf/baseline are absent). Full suite 50/50 green; the existing 10 models are unaffected (the prompt path is gated and never runs for them).

Benchmark (CPU, AMD Ryzen 9 9950X3D, 8 threads, speech.wav 7.43 s)

Engine	RTFx	Speedup vs NeMo	Agreement WER
NeMo (PyTorch CPU)	12.2	1.00x	reference
parakeet.cpp f32	29.4	2.40x	0.0000%
parakeet.cpp q8_0	30.8	2.52x	0.0000%

Notes

The HuggingFace upload of the quantized GGUFs is prepared but NOT performed here (publish dry-run only); it is left for a maintainer to run, pending the OpenMDW-1.1 redistribution check.
third_party/ggml submodule pointer is intentionally untouched.

🤖 Generated with Claude Code

… att_context presets
Emit parakeet.prompt.{present,num_prompts,dictionary.keys,dictionary.values,default_lang} and parakeet.encoder.use_bias for prompt-conditioned multilingual checkpoints (nvidia/nemotron-3.5-asr-streaming-0.6b). Handle the nested att_context_size preset list ([[56,3],[56,0],...]) by taking the first preset as the default and recording all presets in parakeet.encoder.att_context_presets.
Also refine detect_arch: a bare aux_ctc config block is no longer enough to mark a model hybrid. The nemotron prompt RNNT carries an unconfigured aux_ctc stub (num_classes=-1, empty vocabulary) but has no ctc decoder and zero ctc_decoder.* weights (NeMo initializes it RNNT-only), so require an actual model.ctc_decoder before classifying as hybrid. This makes the model convert as arch=rnnt.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
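The two converter rules in this commit can be sketched as follows. This is a hedged illustration, not the converter's code: normalize_att_context is a hypothetical helper name, and this detect_arch stand-in uses the presence of ctc_decoder.* tensor names as a proxy for "an actual model.ctc_decoder" and only distinguishes hybrid from rnnt.

```python
def normalize_att_context(att_context_size):
    """Return (default_preset, all_presets).

    Accepts either a flat [left, right] pair or a nested list of such pairs.
    """
    if att_context_size and isinstance(att_context_size[0], (list, tuple)):
        presets = [list(p) for p in att_context_size]
    else:
        presets = [list(att_context_size)]
    return presets[0], presets  # first preset is the default

def detect_arch(cfg, tensor_names):
    """Hypothetical stand-in: an aux_ctc config block alone is not enough."""
    has_ctc_decoder = any(name.startswith("ctc_decoder.") for name in tensor_names)
    if "aux_ctc" in cfg and has_ctc_decoder:
        return "hybrid"
    return "rnnt"

default, presets = normalize_att_context([[56, 3], [56, 0]])
arch = detect_arch({"aux_ctc": {"num_classes": -1}}, ["encoder.w", "decoder.w"])
```

With the unconfigured stub and no CTC weights, the sketch classifies the checkpoint as rnnt, matching the behavior the commit describes.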

Add PromptCfg (present, num_prompts, default_lang, dict_keys/vals, lang_to_index) and use_bias to ParakeetConfig, and read the new KVs in ModelLoader::load via a new kv_str_arr helper. present=false / use_bias=true defaults keep every existing model byte-identical. Extend test_model_loader with a PARAKEET_TEST_GGUF_NEMOTRON block asserting the resolved prompt dictionary (de=9, auto=101, unknown=-1) and use_bias=false; it skips silently when the fixture env var is unset. The encoder attention/FFN linear bias loads were already optional (clone_weight_opt + ml.tensor guards across relpos_attention/conformer/streaming_encoder), and every subsampling bias is present in this checkpoint, so the use_bias=false model loads and its encoder graph builds with no further changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…_GGUF The nemotron prompt-config block was unreachable when PARAKEET_TEST_GGUF was unset, because main() returned 77 before it. Guard the base-model checks behind PARAKEET_TEST_GGUF and run the nemotron block whenever PARAKEET_TEST_GGUF_NEMOTRON is set. Only skip (return 77) when neither env var is present. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
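The corrected gating logic is easier to see in isolation. The sketch below mirrors it in Python rather than in the C++ test; run_checks and the callback parameters are hypothetical names, and only the two environment variable names and the skip code 77 come from the commit.

```python
SKIP = 77  # exit code ctest treats as "skipped" for these tests

def run_checks(env, base_checks, nemotron_checks):
    """Run whichever model blocks have a fixture; skip only if neither is set."""
    base = env.get("PARAKEET_TEST_GGUF")
    nemotron = env.get("PARAKEET_TEST_GGUF_NEMOTRON")
    if not base and not nemotron:
        return SKIP
    if base:
        base_checks(base)
    if nemotron:
        nemotron_checks(nemotron)  # reachable even when the base fixture is unset
    return 0

calls = []
status = run_checks(
    {"PARAKEET_TEST_GGUF_NEMOTRON": "nemotron.gguf"},
    lambda path: calls.append(("base", path)),
    lambda path: calls.append(("nemotron", path)),
)
```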

…for nemotron Decode the prompt-conditioned encoder output directly via the model's RNNT decoding object: the prompt model's transcribe dataloader resolves the prompt index from per-cut language metadata, which a bare wav fixture lacks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ity test Concat the constant language one-hot onto the encoder output, then Linear->ReLU->Linear (prompt_kernel.0/2) on the persistent backend via run_graph. Parity vs NeMo prompt_kernel_out: max|d|=1.9e-6. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

resolve_prompt_index maps a locale to its prompt index (empty -> default_lang), and the offline + batch decode paths project the encoder output through the PromptKernel when prompt.present. Threaded target_lang through the transcribe entry points (default empty); non-prompt models take the no-op path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Resolve a language prompt index in the StreamingSession constructor (new target_lang param, default empty -> model default_lang) and apply the prompt_kernel projection to each chunk's encoder frames before the RNN-T decode. The one-hot is constant over time, so per-chunk application is exact and equals the offline forward's single application. Non-prompt models take the no-op path (prompt_.present()==false) and stay byte-identical. run_stream_over_pcm gains a trailing target_lang param (default empty) so a language can route through one entry point; the session already owns its resolved index, so the driver leaves it unused for now (Phase 4 wires it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
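The exactness claim follows from the projection being applied frame by frame with a time-constant prompt vector: no frame's output depends on its neighbours, so chunk boundaries cannot matter. A small sketch of that argument, using a single Linear + ReLU layer as a stand-in for the full prompt_kernel (all names and values here are illustrative assumptions):

```python
def project_frame(frame, prompt, weight, bias):
    x = frame + prompt  # concat the constant prompt vector onto this frame
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weight, bias)]

def project(frames, prompt, weight, bias):
    return [project_frame(f, prompt, weight, bias) for f in frames]

def project_chunked(frames, prompt, weight, bias, chunk):
    """Apply the same projection chunk by chunk, as a streaming session would."""
    out = []
    for start in range(0, len(frames), chunk):
        out.extend(project(frames[start:start + chunk], prompt, weight, bias))
    return out

frames = [[0.1 * t, -0.2 * t] for t in range(7)]  # T=7 toy encoder frames
prompt = [0.0, 1.0, 0.0]                          # one-hot, constant over time
weight = [[0.3, -0.1, 0.2, 0.5, -0.4], [0.1, 0.2, -0.3, 0.4, 0.6]]
bias = [0.05, -0.05]

offline = project(frames, prompt, weight, bias)
streamed = project_chunked(frames, prompt, weight, bias, chunk=3)
```

Here offline and streamed hold the same values for any chunk size. The streaming encoder itself still carries cache state between chunks; only the prompt projection is stateless.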

Extend dump_prompt_baseline to emit baseline.stream_text: run NeMo's cache-aware streaming encoder, apply m.prompt_kernel to the concatenated streamed output for the target_lang, and RNN-T greedy decode it (specials stripped). Add tests/test_streaming_nemotron.cpp: drive a prompt-aware StreamingSession over the clip and assert sess.text() == baseline.stream_text. Parity gate (lang=en, speech.wav): got == ref EXACTLY: "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. <en-US> It is certainly very like the old portrait. <en-US>" Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add parakeet_capi_transcribe_path_lang, parakeet_capi_transcribe_pcm_lang and parakeet_capi_stream_begin_lang for multilingual prompt-conditioned (nemotron) models. target_lang is a locale string; NULL or "" selects the model default and non-prompt models ignore it. An unknown locale on a prompt model is caught at the boundary, returning NULL with the message set on the ctx last error. The original non-lang entry points delegate to the new ones with the model default, preserving behavior. ABI version bumped to 3. test_capi gains a PARAKEET_TEST_GGUF_NEMOTRON-guarded block asserting a known lang transcribes (non-NULL) and an unknown lang returns NULL with a non-empty last_error; the two model blocks are now independent and skip cleanly (77) when neither env var is set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

transcribe gains --lang <locale> to select the language prompt for multilingual (nemotron) prompt models; empty -> the model default and non-prompt models ignore it. The plain offline path routes through the C-API parakeet_capi_transcribe_path_lang when --lang is set (so an unknown locale is a clean error), and keeps the existing free-function path otherwise so behavior for every other model is unchanged. --timestamps threads lang into transcribe_path_with_timestamps; --stream threads it into the StreamingSession ctor (what stream_begin_lang forwards), keeping the rich per-word/EOU output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ine+stream) Refactor gen_nemo_baseline.dump_prompt_baseline into an importable compute_prompt_reference helper and reuse it from the new e2e driver, which runs the built parakeet-cli per (clip, lang, mode) and asserts WER 0 vs NeMo. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
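For readers unfamiliar with the metric, "WER 0" means the word-level edit distance between the two transcripts is zero. A minimal sketch of that computation follows; it is an assumption about the general technique, not a copy of scripts/e2e_nemo_compare.py, whose text normalization may differ.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)

same = wer("it is certainly very like the old portrait",
           "it is certainly very like the old portrait")
one_sub = wer("a b c", "a x c")
```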

README: add the prompt-conditioned multilingual streaming model (nvidia/nemotron-3.5-asr-streaming-0.6b, 40+ locales, --lang, WER 0 offline + streaming). conversion.md: document the parakeet.prompt.* KV schema, encoder.use_bias, att_context_presets, and the prompt_kernel tensors (stay F32). parity.md: add the nemotron coverage row + e2e cross-check note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…DW-1.1 card) Add the model to ALL_MODELS and KNOWN_WER (f16/q8_0/q6_k/q5_k/q4_k, all WER 0.0 offline vs NeMo with recorded sizes). Add a per-id LICENSES map (default CC-BY-4.0) so the generated card states OpenMDW-1.1 for this entry, wired into both the per-model and the collection cards (frontmatter license/license_name/license_link, License section, per-model rows). Quant allowlist unchanged: the prompt_kernel, LSTM prediction net, and featurizer tensors stay F32. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
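The per-id license lookup with a default can be sketched as a plain dictionary lookup. The key string and the helper name below are assumptions for illustration; publish_hf.py may key its map differently.

```python
DEFAULT_LICENSE = "CC-BY-4.0"

# Hypothetical key format; only the two license names come from the commit.
LICENSES = {
    "nemotron-3.5-asr-streaming-0.6b": "OpenMDW-1.1",
}

def license_for(model_id):
    """Return the license to print on a model's card, falling back to the default."""
    return LICENSES.get(model_id, DEFAULT_LICENSE)

nemotron_license = license_for("nemotron-3.5-asr-streaming-0.6b")
other_license = license_for("some-other-model")  # placeholder id
```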

… q8_0) Benchmarks the prompt-conditioned nemotron-3.5-asr-streaming-0.6b port on CPU against NeMo (PyTorch CPU), following the existing parakeet.cpp methodology: load once, warm up once, time transcribe only, median of N passes, RTFx = audio_sec / proc_sec.
- Adds scripts/bench_nemotron.py. Our side runs parakeet-cli bench with the en language prompt; NeMo runs the same prompt forward (preprocessor, encoder, PromptKernel, RNN-T greedy) reusing gen_nemo_baseline.resolve_prompt_lang. Optionally times the cache-aware streaming path too.
- Adds --lang to the CLI bench subcommand so the prompt-conditioned timing path selects the same language prompt as transcribe (passed to transcribe_pcm).
- Adds build_nemotron_section to gen_benchmark_md.py, fed by the new benchmarks/results/nemotron/bench.json, so the section is reproducible.
Results on AMD Ryzen 9 9950X3D (20 cores, CPU-only, 8 threads), speech.wav (7.43 s), lang en, median of 7 passes:
NeMo RTFx 12.2
parakeet.cpp f32 RTFx 29.4 2.40x agreement WER 0.0000%
parakeet.cpp q8_0 RTFx 30.8 2.52x agreement WER 0.0000%
streaming f32 compute RTFx 3.80 (latency-oriented)
Transcripts are byte-identical to NeMo on the timed runs, so the speed numbers compare equal work. Full suite green (ctest 48/48 non-nemotron, 2/2 nemotron).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
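The timing methodology (warm up once, time the transcribe call only, take the median of N passes, RTFx = audio_sec / proc_sec) can be sketched like this. It is a simplified illustration, not scripts/bench_nemotron.py; measure_rtfx and the dummy workload are hypothetical.

```python
import statistics
import time

def rtfx(audio_sec, proc_sec):
    """Real-time factor: seconds of audio processed per second of compute."""
    return audio_sec / proc_sec

def measure_rtfx(transcribe, audio_sec, passes=7):
    """Warm up once, then time `passes` calls and use the median duration."""
    transcribe()  # warm-up pass, not timed
    durations = []
    for _ in range(passes):
        start = time.perf_counter()
        transcribe()
        durations.append(time.perf_counter() - start)
    return rtfx(audio_sec, statistics.median(durations))

# Dummy workload standing in for a real transcribe call on a 7.43 s clip.
measured = measure_rtfx(lambda: sum(range(10_000)), audio_sec=7.43)
```

Using the median rather than the mean keeps a single slow pass (for example, one hit by a scheduler hiccup) from skewing the reported number.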

…ffline + capi contract) The streaming StreamingSession ctor silently fell back to the default language on an unknown locale, contradicting the parakeet_capi_stream_begin_lang header contract (NULL on an unknown locale) and diverging from the offline Model::resolve_prompt_index path, which throws. A typo like --stream --lang xx produced wrong-language output with no error. Factor the throwing resolution into PromptCfg::resolve_index_or_throw and use it from both the offline path and the StreamingSession ctor so both reject typos identically. The empty-lang default and the non-prompt no-op are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
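The unified resolution contract (empty string selects the model default, non-prompt models are a no-op, unknown locales raise) is sketched below in Python. The class and method names are illustrative stand-ins for PromptCfg::resolve_index_or_throw, and the dictionary holds only the two entries quoted in the loader test (de=9, auto=101).

```python
class PromptConfig:
    """Illustrative stand-in for the C++ PromptCfg."""

    def __init__(self, present, default_lang="", lang_to_index=None):
        self.present = present
        self.default_lang = default_lang
        self.lang_to_index = lang_to_index or {}

    def resolve_index_or_raise(self, target_lang=""):
        if not self.present:
            return None  # non-prompt model: the language is ignored
        lang = target_lang or self.default_lang  # empty -> model default
        if lang not in self.lang_to_index:
            raise ValueError(f"unknown locale: {lang!r}")
        return self.lang_to_index[lang]

cfg = PromptConfig(True, default_lang="auto", lang_to_index={"de": 9, "auto": 101})
german = cfg.resolve_index_or_raise("de")
default = cfg.resolve_index_or_raise("")
try:
    cfg.resolve_index_or_raise("xx")
    rejected = False
except ValueError:
    rejected = True  # a typo fails loudly instead of silently using the default
```

Routing both the offline path and the streaming constructor through one function is what guarantees they reject the same inputs.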

Mathnerd314 mentioned this pull request Jun 6, 2026

demo: Use Kroko ASR, show icon in tray richiejp/VoxInput#42

Draft

mudler and others added 16 commits June 6, 2026 07:15

test: offline nemotron end-to-end NeMo parity (multi-language)

ae0243b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mudler force-pushed the feat/nemotron-3.5-asr-streaming branch from 3bbb387 to 411ece3 Compare June 6, 2026 07:17

mudler merged commit 271e70b into master Jun 6, 2026
8 checks passed

localai-bot mentioned this pull request Jun 6, 2026

Batch mode for nemotron: batched causal subsampling + batched target_lang C-API #11

Merged
