Add nemotron-3.5-asr-streaming-0.6b (multilingual prompt-conditioned streaming) #10
Merged
… att_context presets
Emit parakeet.prompt.{present,num_prompts,dictionary.keys,dictionary.values,
default_lang} and parakeet.encoder.use_bias for prompt-conditioned multilingual
checkpoints (nvidia/nemotron-3.5-asr-streaming-0.6b). Handle the nested
att_context_size preset list ([[56,3],[56,0],...]) by taking the first preset as
the default and recording all presets in parakeet.encoder.att_context_presets.
Also refine detect_arch: a bare aux_ctc config block is no longer enough to mark
a model hybrid. The nemotron prompt RNNT carries an unconfigured aux_ctc stub
(num_classes=-1, empty vocabulary) but has no ctc decoder and zero ctc_decoder.*
weights (NeMo initializes it RNNT-only), so require an actual model.ctc_decoder
before classifying as hybrid. This makes the model convert as arch=rnnt.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
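The two converter decisions described in this commit can be sketched roughly as follows. This is a minimal illustration, not the converter's actual code: the function signatures, the `tensor_names` argument, and the use of a `ctc_decoder.` weight-name prefix as the test for "an actual model.ctc_decoder" are assumptions made for the example, and only the rnnt/hybrid distinction is modeled.

```python
from __future__ import annotations

from typing import Any


def normalize_att_context(att_context_size: list) -> tuple[list[int], list[list[int]]]:
    """Return (default_context, all_presets) for a flat or nested config value.

    A nested list such as [[56, 3], [56, 0]] is treated as a list of presets
    and the first preset becomes the default. A flat [left, right] pair is
    treated as a single preset.
    """
    if att_context_size and isinstance(att_context_size[0], (list, tuple)):
        presets = [list(p) for p in att_context_size]
    else:
        presets = [list(att_context_size)]
    return presets[0], presets


def detect_arch(cfg: dict[str, Any], tensor_names: set[str]) -> str:
    """Classify as hybrid only when CTC decoder weights really exist.

    Assumption: the presence of tensors named "ctc_decoder.*" stands in for
    "the model has an actual ctc_decoder". A bare aux_ctc config stub is not
    enough on its own.
    """
    has_ctc_weights = any(name.startswith("ctc_decoder.") for name in tensor_names)
    if "aux_ctc" in cfg and has_ctc_weights:
        return "hybrid"
    return "rnnt"
```

With the stub described above (num_classes=-1, no `ctc_decoder.*` weights), `detect_arch` in this sketch returns `"rnnt"`.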
Add PromptCfg (present, num_prompts, default_lang, dict_keys/vals, lang_to_index) and use_bias to ParakeetConfig, and read the new KVs in ModelLoader::load via a new kv_str_arr helper. present=false / use_bias=true defaults keep every existing model byte-identical. Extend test_model_loader with a PARAKEET_TEST_GGUF_NEMOTRON block asserting the resolved prompt dictionary (de=9, auto=101, unknown=-1) and use_bias=false; it skips silently when the fixture env var is unset. The encoder attention/FFN linear bias loads were already optional (clone_weight_opt + ml.tensor guards across relpos_attention/conformer/streaming_encoder), and every subsampling bias is present in this checkpoint, so the use_bias=false model loads and its encoder graph builds with no further changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
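As a rough illustration of how parallel dictionary key/value arrays could resolve to a locale lookup, here is a small Python sketch. The helper names and the choice of integer values are assumptions; the real loader is C++ and may store the values differently. Only the `de=9`, `auto=101`, and `unknown=-1` results come from the commit message.

```python
from __future__ import annotations


def build_lang_to_index(keys: list[str], values: list[int]) -> dict[str, int]:
    """Zip the parallel dictionary arrays into a locale -> prompt index map."""
    if len(keys) != len(values):
        raise ValueError("prompt dictionary keys/values length mismatch")
    return dict(zip(keys, values))


def lookup_prompt_index(lang_to_index: dict[str, int], locale: str) -> int:
    """Return the prompt index for a locale, or -1 when it is not known."""
    return lang_to_index.get(locale, -1)


# Only the two entries named in the commit message are used here.
lang_to_index = build_lang_to_index(["de", "auto"], [9, 101])
```

In this sketch, looking up `"de"` gives 9, `"auto"` gives 101, and any locale missing from the dictionary gives -1.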
…_GGUF The nemotron prompt-config block was unreachable when PARAKEET_TEST_GGUF was unset, because main() returned 77 before it. Guard the base-model checks behind PARAKEET_TEST_GGUF and run the nemotron block whenever PARAKEET_TEST_GGUF_NEMOTRON is set. Only skip (return 77) when neither env var is present. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
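The corrected control flow can be sketched like this. It is a Python illustration of the gating logic only; the real test is C++, the function names are hypothetical, and the sketch assumes every block that runs also passes.

```python
from __future__ import annotations

SKIP_EXIT_CODE = 77  # exit code the commit message uses for "skipped"


def blocks_to_run(env: dict[str, str]) -> list[str]:
    """Decide which test blocks run, each gated on its own env var."""
    blocks = []
    if env.get("PARAKEET_TEST_GGUF"):
        blocks.append("base")
    if env.get("PARAKEET_TEST_GGUF_NEMOTRON"):
        blocks.append("nemotron")
    return blocks


def exit_code(env: dict[str, str]) -> int:
    """Skip only when neither fixture is configured (assumes blocks pass)."""
    return 0 if blocks_to_run(env) else SKIP_EXIT_CODE
```

The point of the fix is visible in the second case of the test: with only `PARAKEET_TEST_GGUF_NEMOTRON` set, the nemotron block is now reachable instead of being cut off by an early skip.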
…for nemotron Decode the prompt-conditioned encoder output directly via the model's RNNT decoding object: the prompt model's transcribe dataloader resolves the prompt index from per-cut language metadata, which a bare wav fixture lacks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ity test Concat the constant language one-hot onto the encoder output, then Linear -> ReLU -> Linear (prompt_kernel.0/2) on the persistent backend via run_graph. Parity vs NeMo prompt_kernel_out: max|d|=1.9e-6. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
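For a single encoder frame, the computation described here looks roughly like the following pure-Python sketch. The weight layout (`[out][in]`), the presence of bias terms, and all function names are assumptions for illustration. The real kernel runs as a ggml graph, and the PR summary gives the real sizes as 1024 + 128 -> 2048 -> 1024.

```python
from __future__ import annotations


def one_hot(index: int, size: int) -> list[float]:
    """One-hot language vector of length `size`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def linear(x: list[float], weight: list[list[float]], bias: list[float]) -> list[float]:
    """y = W x + b, with W stored as [out][in] (an assumed layout)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: list[float]) -> list[float]:
    return [max(0.0, v) for v in x]


def prompt_kernel(frame, lang_index, num_prompts, w0, b0, w2, b2):
    """Concatenate the language one-hot onto one frame, then Linear -> ReLU -> Linear."""
    x = list(frame) + one_hot(lang_index, num_prompts)
    return linear(relu(linear(x, w0, b0)), w2, b2)


# Toy sizes: d_model=2, num_prompts=2, hidden=2.
W0 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 10.0, -10.0]]
B0 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
B2 = [0.0, 0.0]
```

With these toy weights the same frame `[1.0, 2.0]` maps to `[1.0, 12.0]` for language 0 and to `[1.0, 0.0]` for language 1, which shows how the one-hot steers the projection.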
resolve_prompt_index maps a locale to its prompt index (empty -> default_lang), and the offline + batch decode paths project the encoder output through the PromptKernel when prompt.present. Threaded target_lang through the transcribe entry points (default empty); non-prompt models take the no-op path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve a language prompt index in the StreamingSession constructor (new target_lang param, default empty -> model default_lang) and apply the prompt_kernel projection to each chunk's encoder frames before the RNN-T decode. The one-hot is constant over time, so per-chunk application is exact and equals the offline forward's single application. Non-prompt models take the no-op path (prompt_.present()==false) and stay byte-identical. run_stream_over_pcm gains a trailing target_lang param (default empty) so a language can route through one entry point; the session already owns its resolved index, so the driver leaves it unused for now (Phase 4 wires it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
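The exactness claim follows from the projection being frame-wise: each output frame depends only on its own input frame and the fixed language index, so chunk boundaries cannot change the result. A small sketch, using a stand-in projection rather than the real prompt kernel (all names here are hypothetical):

```python
from __future__ import annotations


def toy_projection(frame: list[float], lang_index: int) -> list[float]:
    """Stand-in for the prompt projection: uses one frame and a fixed language only."""
    return [max(0.0, v + lang_index) for v in frame]


def project_offline(frames: list[list[float]], lang_index: int) -> list[list[float]]:
    """Single application over the whole sequence."""
    return [toy_projection(f, lang_index) for f in frames]


def project_streaming(frames: list[list[float]], lang_index: int, chunk_size: int) -> list[list[float]]:
    """Apply the same projection chunk by chunk and concatenate the outputs."""
    out: list[list[float]] = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        out.extend(toy_projection(f, lang_index) for f in chunk)
    return out


FRAMES = [[-1.0, 0.5], [2.0, -3.0], [0.0, 1.0], [4.0, -0.5], [1.5, 1.5]]
```

Any chunk size gives the same frames as the offline pass. This would not hold for a module that mixes information across time, which is why the argument rests on the one-hot being constant.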
Extend dump_prompt_baseline to emit baseline.stream_text: run NeMo's cache-aware streaming encoder, apply m.prompt_kernel to the concatenated streamed output for the target_lang, and RNN-T greedy decode it (specials stripped). Add tests/test_streaming_nemotron.cpp: drive a prompt-aware StreamingSession over the clip and assert sess.text() == baseline.stream_text. Parity gate (lang=en, speech.wav): got == ref EXACTLY: "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. <en-US> It is certainly very like the old portrait. <en-US>" Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add parakeet_capi_transcribe_path_lang, parakeet_capi_transcribe_pcm_lang and parakeet_capi_stream_begin_lang for multilingual prompt-conditioned (nemotron) models. target_lang is a locale string; NULL or "" selects the model default and non-prompt models ignore it. An unknown locale on a prompt model is caught at the boundary, returning NULL with the message set on the ctx last error. The original non-lang entry points delegate to the new ones with the model default, preserving behavior. ABI version bumped to 3. test_capi gains a PARAKEET_TEST_GGUF_NEMOTRON-guarded block asserting a known lang transcribes (non-NULL) and an unknown lang returns NULL with a non-empty last_error; the two model blocks are now independent and skip cleanly (77) when neither env var is set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
transcribe gains --lang <locale> to select the language prompt for multilingual (nemotron) prompt models; empty -> the model default and non-prompt models ignore it. The plain offline path routes through the C-API parakeet_capi_transcribe_path_lang when --lang is set (so an unknown locale is a clean error), and keeps the existing free-function path otherwise so behavior for every other model is unchanged. --timestamps threads lang into transcribe_path_with_timestamps; --stream threads it into the StreamingSession ctor (what stream_begin_lang forwards), keeping the rich per-word/EOU output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ine+stream) Refactor gen_nemo_baseline.dump_prompt_baseline into an importable compute_prompt_reference helper and reuse it from the new e2e driver, which runs the built parakeet-cli per (clip, lang, mode) and asserts WER 0 vs NeMo. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
README: add the prompt-conditioned multilingual streaming model (nvidia/nemotron-3.5-asr-streaming-0.6b, 40+ locales, --lang, WER 0 offline + streaming). conversion.md: document the parakeet.prompt.* KV schema, encoder.use_bias, att_context_presets, and the prompt_kernel tensors (stay F32). parity.md: add the nemotron coverage row + e2e cross-check note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…DW-1.1 card) Add the model to ALL_MODELS and KNOWN_WER (f16/q8_0/q6_k/q5_k/q4_k, all WER 0.0 offline vs NeMo with recorded sizes). Add a per-id LICENSES map (default CC-BY-4.0) so the generated card states OpenMDW-1.1 for this entry, wired into both the per-model and the collection cards (frontmatter license/license_name/license_link, License section, per-model rows). Quant allowlist unchanged: the prompt_kernel, LSTM prediction net, and featurizer tensors stay F32. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… q8_0) Benchmarks the prompt-conditioned nemotron-3.5-asr-streaming-0.6b port on CPU against NeMo (PyTorch CPU), following the existing parakeet.cpp methodology: load once, warm up once, time transcribe only, median of N passes, RTFx = audio_sec / proc_sec.

- Adds scripts/bench_nemotron.py. ours runs parakeet-cli bench with the en language prompt; NeMo runs the same prompt forward (preprocessor, encoder, PromptKernel, RNN-T greedy) reusing gen_nemo_baseline.resolve_prompt_lang. Optionally times the cache-aware streaming path too.
- Adds --lang to the CLI bench subcommand so the prompt-conditioned timing path selects the same language prompt as transcribe (passed to transcribe_pcm).
- Adds build_nemotron_section to gen_benchmark_md.py, fed by the new benchmarks/results/nemotron/bench.json, so the section is reproducible.

Results on AMD Ryzen 9 9950X3D (20 cores, CPU-only, 8 threads), speech.wav (7.43 s), lang en, median of 7 passes:

  NeMo                     RTFx 12.2
  parakeet.cpp f32         RTFx 29.4   2.40x   agreement WER 0.0000%
  parakeet.cpp q8_0        RTFx 30.8   2.52x   agreement WER 0.0000%
  streaming f32 compute    RTFx 3.80   (latency-oriented)

Transcripts are byte-identical to NeMo on the timed runs, so the speed numbers compare equal work. Full suite green (ctest 48/48 non-nemotron, 2/2 nemotron).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
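The timing methodology stated in this commit (warm up once, time only the transcribe call, take the median of N passes, RTFx = audio_sec / proc_sec) can be sketched as below. This is not scripts/bench_nemotron.py. The function names and the injectable `timer` parameter are assumptions made so the example can be checked deterministically.

```python
from __future__ import annotations

import statistics
import time
from typing import Callable


def rtfx(audio_sec: float, proc_sec: float) -> float:
    """Real-time factor: seconds of audio processed per second of compute."""
    return audio_sec / proc_sec


def bench(
    transcribe: Callable[[], object],
    audio_sec: float,
    passes: int = 7,
    timer: Callable[[], float] = time.perf_counter,
) -> float:
    """Warm up once untimed, then return RTFx from the median pass time."""
    transcribe()  # warm-up pass, excluded from the measurement
    durations = []
    for _ in range(passes):
        start = timer()
        transcribe()
        durations.append(timer() - start)
    return rtfx(audio_sec, statistics.median(durations))
```

Using the median rather than the mean keeps a single slow pass (for example, one hit by a scheduler hiccup) from skewing the reported number.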
…ffline + capi contract) The streaming StreamingSession ctor silently fell back to the default language on an unknown locale, contradicting the parakeet_capi_stream_begin_lang header contract (NULL on an unknown locale) and diverging from the offline Model::resolve_prompt_index path, which throws. A typo like --stream --lang xx produced wrong-language output with no error. Factor the throwing resolution into PromptCfg::resolve_index_or_throw and use it from both the offline path and the StreamingSession ctor so both reject typos identically. The empty-lang default and the non-prompt no-op are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
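The shared resolution rule can be sketched in Python as follows. The actual helper is the C++ PromptCfg::resolve_index_or_throw, so the signature, the exception type, and the toy dictionary below are illustrative assumptions. The non-prompt no-op path is not modeled.

```python
from __future__ import annotations


class UnknownLocaleError(ValueError):
    """Raised when a requested locale is not in the prompt dictionary."""


def resolve_index_or_throw(
    lang_to_index: dict[str, int], default_lang: str, target_lang: str
) -> int:
    """Resolve a locale to its prompt index, rejecting unknown locales.

    An empty target_lang selects the model default. Anything else must be
    present in the dictionary, so a typo fails loudly instead of silently
    falling back to the default language.
    """
    locale = target_lang or default_lang
    if locale not in lang_to_index:
        raise UnknownLocaleError(f"unknown locale: {locale!r}")
    return lang_to_index[locale]


# Toy dictionary using only indices quoted elsewhere in this PR.
LANGS = {"de": 9, "auto": 101}
```

Calling the same function from both the offline path and the streaming constructor is what makes the two reject a typo such as `xx` identically.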
3bbb387 to
411ece3
Compare
Summary
Adds support for nvidia/nemotron-3.5-asr-streaming-0.6b, the multilingual (40+ locales), prompt-conditioned, cache-aware streaming NeMo model.

The model is a standard cache-aware streaming FastConformer + RNN-T that parakeet.cpp already implements, plus one new piece: a post-encoder prompt module that conditions on the target language. We model it as an orthogonal prompt.present capability (like streaming.present), so all existing models stay byte-identical.

The new prompt module is small: take the encoder output [d_model=1024, T], concatenate a one-hot language vector [128, T] (constant over time), and project it back with Linear(1152 to 2048) -> ReLU -> Linear(2048 to 1024). The language is chosen with --lang <locale> (default auto, which is just prompt index 101, so there is no separate language-ID head).

What is in here
- … parakeet.prompt.* KVs + encoder.use_bias, and handles the nested att_context_size presets.
- PromptKernel unit (src/prompt_kernel.{hpp,cpp}), wired into both the offline and the per-chunk streaming decode, gated on prompt.present.
- target_lang plumbed through Model, StreamingSession, the C-API (new _lang variants, ABI bumped to 3), and the CLI (--lang). Unknown locales error cleanly and consistently across offline, streaming, and the C-API.
- publish_hf.py (5 variants) with an OpenMDW-1.1 license card (the upstream license differs from the other checkpoints).
- … docs/conversion.md, and docs/parity.md updated.

Validation (parity-first, against NeMo)
NeMo reference comes from NeMo main (the model class EncDecRNNTBPEModelWithPrompt is newer than the pinned 2.7.3; baselines were generated in a dedicated venv).

- … cache_aware_stream_step reference).
- scripts/e2e_nemo_compare.py: 20-row offline+streaming comparison, all WER 0.
- test_prompt_kernel, test_transcribe_nemotron, test_streaming_nemotron (fixture-gated, skip 77 when the model gguf/baseline are absent). Full suite 50/50 green; the existing 10 models are unaffected (the prompt path is gated and never runs for them).

Benchmark (CPU, AMD Ryzen 9 9950X3D, 8 threads, speech.wav 7.43 s)
Notes
- third_party/ggml submodule pointer is intentionally untouched.

🤖 Generated with Claude Code