Add nemotron-3.5-asr-streaming-0.6b (multilingual prompt-conditioned streaming)#10

Merged
mudler merged 16 commits into master from feat/nemotron-3.5-asr-streaming on Jun 6, 2026

@localai-bot
Collaborator

Summary

Adds support for nvidia/nemotron-3.5-asr-streaming-0.6b, the multilingual (40+ locales), prompt-conditioned, cache-aware streaming NeMo model.

The model is a standard cache-aware streaming FastConformer + RNN-T that parakeet.cpp already implements, plus one new piece: a post-encoder prompt module that conditions on the target language. We model it as an orthogonal prompt.present capability (like streaming.present), so all existing models stay byte-identical.

The new prompt module is small: take the encoder output [d_model=1024, T], concatenate a one-hot language vector [128, T] (constant over time) to get [1152, T], and project it back to d_model with Linear(1152 to 2048) -> ReLU -> Linear(2048 to 1024). The language is chosen with --lang <locale>. The default is auto, which is simply prompt index 101, so there is no separate language-ID head.
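
As a rough illustration of the data flow described above, here is a minimal pure-Python sketch. It is not the parakeet.cpp PromptKernel: the function names, the frame-major layout (a list of T frames rather than [d_model, T]), and the toy dimensions are assumptions made for readability. In the real model the sizes are 1024, 128, and 2048.

```python
# Illustrative sketch only; names and layout are assumptions, not the project's API.

def linear(x, weight, bias):
    """Compute y = W x + b for one frame; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def prompt_kernel(enc_frames, lang_index, num_prompts, w0, b0, w2, b2):
    """Concatenate a constant one-hot, then Linear -> ReLU -> Linear per frame."""
    one_hot = [0.0] * num_prompts
    one_hot[lang_index] = 1.0
    out = []
    for frame in enc_frames:                 # each frame has d_model entries
        x = list(frame) + one_hot            # d_model + num_prompts entries
        hidden = [max(0.0, v) for v in linear(x, w0, b0)]
        out.append(linear(hidden, w2, b2))   # projected back to d_model entries
    return out


# Toy sizes: d_model=2, num_prompts=3, hidden=4.
W0 = [
    [1.0, 0.0, 0.0, 0.0, 0.0],   # copies frame[0]
    [0.0, 1.0, 0.0, 0.0, 0.0],   # copies frame[1]
    [0.0, 0.0, 0.0, 1.0, 0.0],   # reads the one-hot slot for prompt index 1
    [-1.0, 0.0, 0.0, 0.0, 0.0],  # negative of frame[0], clipped by ReLU when positive
]
B0 = [0.0, 0.0, 0.0, 0.0]
W2 = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
B2 = [0.0, 0.0]

frames = [[1.0, 2.0]]
with_lang_1 = prompt_kernel(frames, 1, 3, W0, B0, W2, B2)
with_lang_0 = prompt_kernel(frames, 0, 3, W0, B0, W2, B2)
```

With these toy weights the same encoder frame maps to different outputs depending on the selected prompt index, which is the whole point of the conditioning.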

What is in here

  • Converter emits parakeet.prompt.* KVs + encoder.use_bias, and handles the nested att_context_size presets.
  • New PromptKernel unit (src/prompt_kernel.{hpp,cpp}), wired into both the offline and the per-chunk streaming decode, gated on prompt.present.
  • target_lang plumbed through Model, StreamingSession, the C-API (new _lang variants, ABI bumped to 3), and the CLI (--lang). Unknown locales error cleanly and consistently across offline, streaming, and the C-API.
  • Quantize + publish support: added to publish_hf.py (5 variants) with an OpenMDW-1.1 license card (the upstream license differs from the other checkpoints).
  • README, docs/conversion.md, and docs/parity.md updated.

Validation (parity-first, against NeMo)

NeMo reference comes from NeMo main (the model class EncDecRNNTBPEModelWithPrompt is newer than the pinned 2.7.3; baselines were generated in a dedicated venv).

  • PromptKernel tensor parity vs NeMo: max abs diff ~1.9e-6.
  • Offline end-to-end transcript: WER 0 vs NeMo for en, de, es, ja-JP, and auto.
  • Cache-aware streaming transcript: WER 0 vs NeMo (matches NeMo's cache_aware_stream_step reference).
  • Quantized f16/q8_0/q6_k/q5_k/q4_k: WER 0 vs NeMo (prompt_kernel, LSTM, and featurizer tensors stay F32).
  • scripts/e2e_nemo_compare.py: 20-row offline+streaming comparison, all WER 0.
  • New ctest targets: test_prompt_kernel, test_transcribe_nemotron, test_streaming_nemotron (fixture-gated, skip 77 when the model gguf/baseline are absent). Full suite 50/50 green; the existing 10 models are unaffected (the prompt path is gated and never runs for them).
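
For readers unfamiliar with the metric, "WER 0" in the bullets above means zero word-level edits between the two transcripts. The following is a generic sketch of word error rate via edit distance. It is not the code in scripts/e2e_nemo_compare.py, and it assumes plain whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    prev = list(range(len(hyp) + 1))          # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cur[j] = min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            )
        prev = cur
    return prev[-1] / len(ref)
```

A result of 0.0 requires the hypothesis to match the reference word for word.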

Benchmark (CPU, AMD Ryzen 9 9950X3D, 8 threads, speech.wav 7.43 s)

| Engine | RTFx | Speedup vs NeMo | Agreement WER |
| --- | --- | --- | --- |
| NeMo (PyTorch CPU) | 12.2 | 1.00x | reference |
| parakeet.cpp f32 | 29.4 | 2.40x | 0.0000% |
| parakeet.cpp q8_0 | 30.8 | 2.52x | 0.0000% |

Notes

  • The HuggingFace upload of the quantized GGUFs is prepared but NOT performed here (publish dry-run only); it is left for a maintainer to run, pending the OpenMDW-1.1 redistribution check.
  • third_party/ggml submodule pointer is intentionally untouched.

🤖 Generated with Claude Code

mudler and others added 16 commits June 6, 2026 07:15
… att_context presets

Emit parakeet.prompt.{present,num_prompts,dictionary.keys,dictionary.values,
default_lang} and parakeet.encoder.use_bias for prompt-conditioned multilingual
checkpoints (nvidia/nemotron-3.5-asr-streaming-0.6b). Handle the nested
att_context_size preset list ([[56,3],[56,0],...]) by taking the first preset as
the default and recording all presets in parakeet.encoder.att_context_presets.
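
A possible shape for the preset handling described above, as a hedged sketch: the helper name is hypothetical, and treating a flat [left, right] pair as a single preset is an assumption about how older checkpoints are handled.

```python
def normalize_att_context(att_context_size):
    """Return (default_preset, all_presets) for a flat or nested att_context_size."""
    if att_context_size and isinstance(att_context_size[0], (list, tuple)):
        presets = [list(p) for p in att_context_size]   # nested: list of presets
    else:
        presets = [list(att_context_size)]              # flat: a single preset
    return presets[0], presets                          # first preset is the default
```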

Also refine detect_arch: a bare aux_ctc config block is no longer enough to mark
a model hybrid. The nemotron prompt RNNT carries an unconfigured aux_ctc stub
(num_classes=-1, empty vocabulary) but has no ctc decoder and zero ctc_decoder.*
weights (NeMo initializes it RNNT-only), so require an actual model.ctc_decoder
before classifying as hybrid. This makes the model convert as arch=rnnt.
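
The refined rule can be sketched roughly as follows. This covers only the hybrid-versus-rnnt decision, uses weight-name prefixes as a stand-in for checking the model's ctc_decoder, and every name in it is hypothetical rather than taken from the converter.

```python
def classify_hybrid_or_rnnt(cfg: dict, weight_names) -> str:
    """Treat a model as hybrid only if it carries real CTC decoder weights."""
    has_aux_ctc_block = "aux_ctc" in cfg
    # Stand-in for "the model has an actual ctc_decoder" (assumed naming scheme).
    has_ctc_decoder = any(name.startswith("ctc_decoder.") for name in weight_names)
    if has_aux_ctc_block and has_ctc_decoder:
        return "hybrid"
    return "rnnt"


# An unconfigured aux_ctc stub, as described above, with no CTC weights.
stub_cfg = {"aux_ctc": {"decoder": {"num_classes": -1, "vocabulary": []}}}
prompt_model_weights = ["encoder.pre_encode.out.weight", "prompt_kernel.0.weight"]
hybrid_weights = prompt_model_weights + ["ctc_decoder.decoder_layers.0.weight"]
```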

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add PromptCfg (present, num_prompts, default_lang, dict_keys/vals, lang_to_index)
and use_bias to ParakeetConfig, and read the new KVs in ModelLoader::load via a
new kv_str_arr helper. present=false / use_bias=true defaults keep every existing
model byte-identical. Extend test_model_loader with a PARAKEET_TEST_GGUF_NEMOTRON
block asserting the resolved prompt dictionary (de=9, auto=101, unknown=-1) and
use_bias=false; it skips silently when the fixture env var is unset.
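
To make the asserted values concrete, here is a small sketch of resolving a locale from parallel key/value arrays, with -1 for unknown locales as in the test described above. The helper names are hypothetical and the two-entry dictionary is a toy excerpt, not the model's full table.

```python
def build_lang_to_index(dict_keys, dict_values):
    """Zip the parallel dictionary.keys / dictionary.values arrays into a map."""
    if len(dict_keys) != len(dict_values):
        raise ValueError("prompt dictionary keys and values differ in length")
    return dict(zip(dict_keys, dict_values))


def lookup_prompt_index(lang_to_index, lang):
    """Return the prompt index for lang, or -1 when the locale is unknown."""
    return lang_to_index.get(lang, -1)


# Toy excerpt using the two indices quoted in the commit message.
toy_map = build_lang_to_index(["de", "auto"], [9, 101])
```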

The encoder attention/FFN linear bias loads were already optional (clone_weight_opt
+ ml.tensor guards across relpos_attention/conformer/streaming_encoder), and every
subsampling bias is present in this checkpoint, so the use_bias=false model loads
and its encoder graph builds with no further changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_GGUF

The nemotron prompt-config block was unreachable when PARAKEET_TEST_GGUF
was unset, because main() returned 77 before it. Guard the base-model
checks behind PARAKEET_TEST_GGUF and run the nemotron block whenever
PARAKEET_TEST_GGUF_NEMOTRON is set. Only skip (return 77) when neither
env var is present.
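
The corrected control flow amounts to the following, shown here as a Python sketch of the logic rather than the actual C++ test. The callback parameters are illustrative placeholders for the two blocks of checks.

```python
SKIP_EXIT_CODE = 77  # exit code the suite treats as "skipped"


def run_model_loader_tests(env, run_base_checks, run_nemotron_checks):
    """Run whichever fixture blocks are configured; skip only if none are."""
    base_gguf = env.get("PARAKEET_TEST_GGUF")
    nemotron_gguf = env.get("PARAKEET_TEST_GGUF_NEMOTRON")
    if not base_gguf and not nemotron_gguf:
        return SKIP_EXIT_CODE
    if base_gguf:
        run_base_checks(base_gguf)
    if nemotron_gguf:
        run_nemotron_checks(nemotron_gguf)
    return 0
```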

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…for nemotron

Decode the prompt-conditioned encoder output directly via the model's RNNT
decoding object: the prompt model's transcribe dataloader resolves the prompt
index from per-cut language metadata, which a bare wav fixture lacks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ity test

Concat the constant language one-hot onto the encoder output, then Linear->ReLU
->Linear (prompt_kernel.0/2) on the persistent backend via run_graph. Parity vs
NeMo prompt_kernel_out: max|d|=1.9e-6.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
resolve_prompt_index maps a locale to its prompt index (empty -> default_lang),
and the offline + batch decode paths project the encoder output through the
PromptKernel when prompt.present. Threaded target_lang through the transcribe
entry points (default empty); non-prompt models take the no-op path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve a language prompt index in the StreamingSession constructor (new
target_lang param, default empty -> model default_lang) and apply the
prompt_kernel projection to each chunk's encoder frames before the RNN-T
decode. The one-hot is constant over time, so per-chunk application is exact
and equals the offline forward's single application. Non-prompt models take
the no-op path (prompt_.present()==false) and stay byte-identical.
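
The exactness claim follows from the projection being applied independently to every frame. A toy sketch of that argument is below; it uses a single Linear + ReLU as a stand-in for the full kernel, and the names are illustrative.

```python
def project_frame(frame, one_hot, weight):
    """Frame-wise stand-in for the prompt projection: concat, linear, ReLU."""
    x = list(frame) + list(one_hot)
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weight]


def project(frames, one_hot, weight):
    """Offline-style: one application over the whole sequence."""
    return [project_frame(f, one_hot, weight) for f in frames]


def project_chunked(frames, one_hot, weight, chunk):
    """Streaming-style: apply the same projection chunk by chunk."""
    out = []
    for start in range(0, len(frames), chunk):
        out.extend(project(frames[start:start + chunk], one_hot, weight))
    return out


frames = [[1.0, -2.0], [0.5, 0.25], [3.0, 1.0]]
one_hot = [0.0, 1.0]
weight = [[1.0, 0.5, 0.0, 2.0], [-1.0, 1.0, 3.0, 0.0]]
```

Because no frame depends on its neighbours, the chunk boundaries cannot change any output value.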

run_stream_over_pcm gains a trailing target_lang param (default empty) so a
language can route through one entry point; the session already owns its
resolved index, so the driver leaves it unused for now (Phase 4 wires it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend dump_prompt_baseline to emit baseline.stream_text: run NeMo's
cache-aware streaming encoder, apply m.prompt_kernel to the concatenated
streamed output for the target_lang, and RNN-T greedy decode it (specials
stripped). Add tests/test_streaming_nemotron.cpp: drive a prompt-aware
StreamingSession over the clip and assert sess.text() == baseline.stream_text.

Parity gate (lang=en, speech.wav): got == ref EXACTLY:
  "Well, I don't wish to see it any more, observed Phoebe, turning away her
   eyes. <en-US> It is certainly very like the old portrait. <en-US>"

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add parakeet_capi_transcribe_path_lang, parakeet_capi_transcribe_pcm_lang and
parakeet_capi_stream_begin_lang for multilingual prompt-conditioned (nemotron)
models. target_lang is a locale string; NULL or "" selects the model default
and non-prompt models ignore it. An unknown locale on a prompt model is caught
at the boundary, returning NULL with the message set on the ctx last error. The
original non-lang entry points delegate to the new ones with the model default,
preserving behavior. ABI version bumped to 3.

test_capi gains a PARAKEET_TEST_GGUF_NEMOTRON-guarded block asserting a known
lang transcribes (non-NULL) and an unknown lang returns NULL with a non-empty
last_error; the two model blocks are now independent and skip cleanly (77) when
neither env var is set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
transcribe gains --lang <locale> to select the language prompt for multilingual
(nemotron) prompt models; empty -> the model default and non-prompt models
ignore it. The plain offline path routes through the C-API
parakeet_capi_transcribe_path_lang when --lang is set (so an unknown locale is a
clean error), and keeps the existing free-function path otherwise so behavior
for every other model is unchanged. --timestamps threads lang into
transcribe_path_with_timestamps; --stream threads it into the StreamingSession
ctor (what stream_begin_lang forwards), keeping the rich per-word/EOU output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ine+stream)

Refactor gen_nemo_baseline.dump_prompt_baseline into an importable
compute_prompt_reference helper and reuse it from the new e2e driver, which
runs the built parakeet-cli per (clip, lang, mode) and asserts WER 0 vs NeMo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
README: add the prompt-conditioned multilingual streaming model
(nvidia/nemotron-3.5-asr-streaming-0.6b, 40+ locales, --lang, WER 0 offline +
streaming). conversion.md: document the parakeet.prompt.* KV schema,
encoder.use_bias, att_context_presets, and the prompt_kernel tensors (stay F32).
parity.md: add the nemotron coverage row + e2e cross-check note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…DW-1.1 card)

Add the model to ALL_MODELS and KNOWN_WER (f16/q8_0/q6_k/q5_k/q4_k, all WER 0.0
offline vs NeMo with recorded sizes). Add a per-id LICENSES map (default
CC-BY-4.0) so the generated card states OpenMDW-1.1 for this entry, wired into
both the per-model and the collection cards (frontmatter license/license_name/
license_link, License section, per-model rows). Quant allowlist unchanged: the
prompt_kernel, LSTM prediction net, and featurizer tensors stay F32.
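
The per-id license lookup with a default can be pictured like this. It is a sketch only: the dictionary key format and the helper name are assumptions, not the contents of publish_hf.py.

```python
DEFAULT_LICENSE = "CC-BY-4.0"

# Only entries that differ from the default need to be listed (assumed key format).
LICENSES = {
    "nemotron-3.5-asr-streaming-0.6b": "OpenMDW-1.1",
}


def license_for(model_id: str) -> str:
    """Return the license for a model id, falling back to the default."""
    return LICENSES.get(model_id, DEFAULT_LICENSE)
```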

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… q8_0)

Benchmarks the prompt-conditioned nemotron-3.5-asr-streaming-0.6b port on CPU
against NeMo (PyTorch CPU), following the existing parakeet.cpp methodology:
load once, warm up once, time transcribe only, median of N passes, RTFx =
audio_sec / proc_sec.

- Adds scripts/bench_nemotron.py. The parakeet.cpp side runs parakeet-cli bench
  with the en language prompt; the NeMo side runs the same prompt forward
  (preprocessor, encoder, PromptKernel, RNN-T greedy), reusing
  gen_nemo_baseline.resolve_prompt_lang. Optionally times the cache-aware
  streaming path too.
- Adds --lang to the CLI bench subcommand so the prompt-conditioned timing path
  selects the same language prompt as transcribe (passed to transcribe_pcm).
- Adds build_nemotron_section to gen_benchmark_md.py, fed by the new
  benchmarks/results/nemotron/bench.json, so the section is reproducible.

Results on AMD Ryzen 9 9950X3D (20 cores, CPU-only, 8 threads), speech.wav
(7.43 s), lang en, median of 7 passes:

  NeMo            RTFx 12.2
  parakeet.cpp f32  RTFx 29.4  2.40x  agreement WER 0.0000%
  parakeet.cpp q8_0 RTFx 30.8  2.52x  agreement WER 0.0000%
  streaming f32     compute RTFx 3.80 (latency-oriented)

Transcripts are byte-identical to NeMo on the timed runs, so the speed numbers
compare equal work. Full suite green (ctest 48/48 non-nemotron, 2/2 nemotron).
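
The timing methodology in this commit (warm up once, time transcribe only, take the median of N passes, RTFx = audio_sec / proc_sec) can be sketched as below. This is not scripts/bench_nemotron.py: the function name and the injectable clock parameter are assumptions, and model loading is taken to happen before the call.

```python
import statistics
import time


def measure_rtfx(transcribe, audio_sec, passes=7, clock=time.perf_counter):
    """Return RTFx = audio_sec / median(processing seconds) over timed passes."""
    transcribe()                      # warm-up pass, not timed
    durations = []
    for _ in range(passes):
        start = clock()
        transcribe()                  # only the transcribe call is timed
        durations.append(clock() - start)
    proc_sec = statistics.median(durations)
    return audio_sec / proc_sec
```

An RTFx above 1 means the audio is processed faster than real time; the median keeps a single slow pass from skewing the figure.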

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ffline + capi contract)

The streaming StreamingSession ctor silently fell back to the default
language on an unknown locale, contradicting the parakeet_capi_stream_begin_lang
header contract (NULL on an unknown locale) and diverging from the offline
Model::resolve_prompt_index path, which throws. A typo like --stream --lang xx
produced wrong-language output with no error.

Factor the throwing resolution into PromptCfg::resolve_index_or_throw and use
it from both the offline path and the StreamingSession ctor so both reject
typos identically. The empty-lang default and the non-prompt no-op are
unchanged.
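
The shared resolution behaviour can be sketched as follows. This is a Python illustration of the contract, not the C++ PromptCfg::resolve_index_or_throw; the exception type and argument layout are assumptions, and the non-prompt no-op path is left out.

```python
class UnknownLocaleError(ValueError):
    """Raised when a locale is not in the model's prompt dictionary."""


def resolve_index_or_throw(lang_to_index, default_lang, target_lang):
    """Empty target_lang selects the default; unknown locales raise."""
    lang = target_lang or default_lang
    if lang not in lang_to_index:
        raise UnknownLocaleError(f"unknown locale '{lang}'")
    return lang_to_index[lang]


# Toy excerpt of a prompt dictionary, using indices quoted earlier in this PR.
toy_prompts = {"de": 9, "auto": 101}
```

Calling the same helper from both the offline path and the streaming constructor is what keeps a typo from silently producing default-language output in one mode but not the other.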

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mudler force-pushed the feat/nemotron-3.5-asr-streaming branch from 3bbb387 to 411ece3 on June 6, 2026 07:17
@mudler merged commit 271e70b into master on Jun 6, 2026
8 checks passed

2 participants