[speechlm2] Add streaming inference pipeline for NemotronVoiceChat#15571
erastorgueva-nv wants to merge 39 commits into NVIDIA-NeMo:main
Conversation
Resolved review threads (outdated):
- nemo/collections/speechlm2/inference/model_wrappers/model_factory.py
- nemo/collections/speechlm2/inference/model_wrappers/decode_state.py
- examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py (2 threads)
- nemo/collections/speechlm2/inference/streaming/state/s2s_state.py (2 threads)
- nemo/collections/speechlm2/inference/pipelines/s2s_pipeline_interface.py
Resolved review thread (outdated): nemo/collections/speechlm2/inference/model_wrappers/perception_cache.py
pzelasko left a comment:

Partial review up to streaming_s2s_pipeline.py at line 511 (note to self where to pick up later)
```python
from nemo.collections.speechlm2.inference import S2SPipelineBuilder

pipeline = S2SPipelineBuilder.build_pipeline(cfg)
output = pipeline.run(audio_filepaths, options=options)
```
Does this assume a single-turn evaluation? Or can the audio file have multiple turns, with the agent expected to handle that correctly? Let's clarify this in the docs.
Not sure what you mean - it's full-duplex, so it just generates one frame of output for every frame of audio input. Audio input can contain single-turn, multi-turn, whatever.

Or if you're asking about "evaluation" - the code doesn't support detailed "evaluation". We just generate text & audio for the full audio file (plus with an option to add silence padding at the end, so the agent can finish speaking). The one bit of "evaluation" we have is WER.
> it just generates one frame of output for every frame of audio input. Audio input can contain single-turn, multi-turn, whatever.

> We just generate text & audio for the full audio file (plus with an option to add silence padding at the end, so the agent can finish speaking).

Let's write these in here - it's not obvious to an outside reader what characterizes the inputs and outputs in this API.
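The contract described in the replies above can be sketched as a simple loop: exactly one output frame per input frame, with optional trailing silence padding so the agent can finish speaking. This is a hypothetical illustration only; `infer_one_step` stands in for the wrapper's real per-frame method, and the byte-frame representation is assumed:

```python
def run_full_duplex(frames, infer_one_step, silence_pad_frames=0):
    """Feed every input audio frame, plus optional trailing silence padding,
    and collect exactly one (text, audio) output frame per input frame."""
    frame_len = len(frames[0]) if frames else 0
    # Hypothetical silence padding: zero-valued frames appended at the end.
    padded = list(frames) + [b"\x00" * frame_len] * silence_pad_frames
    # Frame-for-frame invariant: len(outputs) == len(padded).
    return [infer_one_step(frame) for frame in padded]
```

The number of turns inside the audio never appears in this loop, which is why the pipeline is agnostic to single-turn vs multi-turn input.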
```bash
python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
    audio_file=/path/to/audio \
```
Both examples here showcase audio_file. We need to mention how to perform live streaming inference (using mic or other streaming audio input connector) if it is supported by this API; or that it is not supported.
```python
from nemo.collections.speechlm2.inference import S2SPipelineBuilder

pipeline = S2SPipelineBuilder.build_pipeline(cfg)
output = pipeline.run(audio_filepaths, options=options)
```
Is there another entry-point with a streaming input connector (mic)? We should mention.
```python
pipeline.open_session()
for frames in streamer:
```
Can we show how streamer is constructed? You'd normally refer the user to ASR pipelines documentation but it doesn't exist yet in main IIRC, so we need to describe at least basic concepts / APIs.
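Since the streamer construction isn't shown in this excerpt, here is a hypothetical minimal equivalent to illustrate the concept the docs should describe: an iterable that yields fixed-size audio chunks. The class name, 16 kHz sample rate, and chunk size are assumptions, not the actual NeMo API:

```python
import numpy as np

class FileAudioStreamer:
    """Yields fixed-size chunks from a preloaded mono waveform,
    zero-padding the final chunk so every yielded frame has equal length."""

    def __init__(self, samples: np.ndarray, sample_rate: int = 16000,
                 chunk_size_in_secs: float = 0.08):
        self.samples = samples
        self.chunk = int(sample_rate * chunk_size_in_secs)  # 1280 samples at defaults

    def __iter__(self):
        for start in range(0, len(self.samples), self.chunk):
            frame = self.samples[start:start + self.chunk]
            if len(frame) < self.chunk:
                frame = np.pad(frame, (0, self.chunk - len(frame)))
            yield frame
```

A mic-based streamer would follow the same iterator protocol, pulling frames from an audio capture buffer instead of a preloaded array.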
Resolved review thread (outdated): nemo/collections/speechlm2/inference/model_wrappers/nemotron_voicechat_inference_wrapper.py
```python
# infer_one_step sub-stages
# ------------------------------------------------------------------

def _build_input_embedding(
```
It looks like this method should live in the DuplexSTT class? It's exposing the inner workings of input construction to a high-level inference API.

If we build a DuplexSTTv2 which does it completely differently, we don't want to rewrite this wrapper - we should just call stt_model.build_input_embedding()
```python
    return emb

def _run_llm_step(
```
This method should be split in two and live in the native / vLLM LLM class.
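The split proposed here can be sketched as a common LLM-step interface with one implementation per backend, so the inference wrapper calls the step method without knowing which engine is underneath. Class and method names below are illustrative, not the actual NeMo API:

```python
from abc import ABC, abstractmethod

class LLMStepBackend(ABC):
    """Hypothetical per-backend interface for one LLM decode step."""

    @abstractmethod
    def run_step(self, input_emb, decode_state):
        """Advance the LLM by one frame; return logits (or token ids)."""

class NativeLLMStep(LLMStepBackend):
    def run_step(self, input_emb, decode_state):
        # A real implementation would run the HF/native forward pass here.
        decode_state["steps"] = decode_state.get("steps", 0) + 1
        return f"native-logits@{decode_state['steps']}"

class VLLMStep(LLMStepBackend):
    def run_step(self, input_emb, decode_state):
        # A real implementation would submit the step to the vLLM engine here.
        decode_state["steps"] = decode_state.get("steps", 0) + 1
        return f"vllm-logits@{decode_state['steps']}"
```

With this shape, the high-level wrapper holds a single `LLMStepBackend` reference and never branches on the engine type.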
```python
@dataclass
class PerceptionCUDAGraphState:
```
Should this (partially) live in the ASR collection? Could we re-use your work here to accelerate streaming models like nemotron-speech-asr?
```python
    state.static_cache_channel_len_in = cache_last_channel_len.clone()

    logging.info(f" Warming up encoder for CUDA graph capture...")
    for _ in range(3):
```
pzelasko left a comment:

Finalized my first review pass :)
Resolved review threads: nemo/collections/speechlm2/inference/pipelines/streaming_s2s_pipeline.py (2 threads)
```python
    system_prompt: str | None = None

    top_p: float | None = None  # (0, 1]
```
Is it possible to support different top_p/temperature/repetition_penalty for different examples in a batch without using a for loop over them?

Is there a strong motivation to support that? Or could we expect the user to define session-level decoding parameters and have easier batching?
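Heterogeneous per-example sampling parameters can in principle be batched without a Python loop by broadcasting a per-row parameter vector. A hypothetical sketch of vectorized per-row top-p masking (function name and NumPy formulation are mine, not NeMo's; real decoders would do this on GPU tensors):

```python
import numpy as np

def batched_top_p_mask(logits: np.ndarray, top_p: np.ndarray) -> np.ndarray:
    """Set logits outside each row's own top-p nucleus to -inf.

    logits: (batch, vocab); top_p: (batch,) with values in (0, 1].
    """
    order = np.argsort(-logits, axis=-1)                       # descending per row
    sorted_logits = np.take_along_axis(logits, order, axis=-1)
    probs = np.exp(sorted_logits - sorted_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    cum = np.cumsum(probs, axis=-1)
    # Keep tokens whose cumulative mass (excluding themselves) is below
    # that row's top_p; the broadcast over top_p[:, None] is what lets
    # every row use a different threshold with no loop.
    keep_sorted = (cum - probs) < top_p[:, None]
    keep = np.empty_like(keep_sorted)
    np.put_along_axis(keep, order, keep_sorted, axis=-1)       # unsort the mask
    return np.where(keep, logits, -np.inf)
```

The same broadcasting trick applies to per-row temperature (`logits / temperature[:, None]`), so supporting per-example parameters need not cost a for loop, though session-level parameters are still simpler to reason about.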
```python
    )


class S2SContextManager:
```
Needs a docstring with an overview of what this does.
```python
            raise RuntimeError("s2s_model must provide create_decode_state(max_len)")
        return self.s2s_model.create_decode_state(self.max_len)

    def _ensure_slot(self, stream_id: int) -> int:
```
How is a "slot" defined here?
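One plausible answer (worth stating in a docstring): a slot is a fixed batch row in the preallocated decode state, leased to a `stream_id` for the lifetime of its session. A hypothetical standalone sketch of that allocation scheme (class and method names are illustrative, not the PR's code):

```python
class SlotAllocator:
    """Leases fixed batch rows ("slots") of a preallocated decode state
    to stream ids, so each active stream keeps a stable batch index."""

    def __init__(self, batch_size: int):
        self.free = list(range(batch_size))
        self.slot_of = {}  # stream_id -> batch row in the decode state

    def ensure_slot(self, stream_id: int) -> int:
        if stream_id not in self.slot_of:
            if not self.free:
                raise RuntimeError("no free slots: batch is full")
            self.slot_of[stream_id] = self.free.pop()
        return self.slot_of[stream_id]

    def release(self, stream_id: int) -> None:
        slot = self.slot_of.pop(stream_id, None)
        if slot is not None:
            self.free.append(slot)  # row becomes reusable for a new stream
```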
```python
            )
        else:
            config = AutoConfig.from_pretrained(model_path_or_name, trust_remote_code=trust_remote_code)
            if use_meta_device:
```
Suggestion:

```python
with (torch.device('meta') if use_meta_device else nullcontext()):
```
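The suggested one-liner works because `contextlib.nullcontext` is a drop-in no-op context manager, which lets the branch collapse into a single `with` statement. A torch-free sketch of the same pattern, where `DeviceCtx` is a hypothetical stand-in for `torch.device('meta')`:

```python
from contextlib import nullcontext

class DeviceCtx:
    """Toy context manager that records which 'device' is active."""
    active = None

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        DeviceCtx.active = self.name
        return self

    def __exit__(self, *exc):
        DeviceCtx.active = None

def build_model(use_meta_device: bool) -> str:
    # Conditional context without duplicating the body in two branches.
    with (DeviceCtx("meta") if use_meta_device else nullcontext()):
        # Model construction would happen here; we just report the device.
        return DeviceCtx.active or "default"
```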
```python
        pretrained_weights: Whether to load pretrained weights (True) or random init (False)
        dtype: Data type for the model
        trust_remote_code: Whether to trust remote code when loading model (needed for some models like Nemotron)
        use_meta_device: If True, create the model on the meta device (no memory allocation).
```
Is this compatible with transformers v5 or v4? Or both? We already bumped NeMo to v5.
```python
from nemo.collections.speechlm2.models import NemotronVoiceChat

_pretrained_llm = "TinyLlama/TinyLlama_v1.1"
```
Since so much logic is dedicated to Nemotron v2 in VoiceChat, shouldn't we test against Nemotron v2 as well? Or are you concerned it will take a very long time to load in CI?
```python
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```
These fixtures are only really shared between the Nemotron VoiceChat tests, not all speechlm2 tests, so how about creating a new tests/collections/speechlm2/voicechat directory and moving that conftest + related tests there?
Fixed review threads:
- nemo/collections/speechlm2/inference/model_wrappers/backend/vllm/eartts.py
- nemo/collections/speechlm2/inference/model_wrappers/backend/vllm/base.py (2 threads)
- nemo/collections/speechlm2/inference/pipelines/streaming_s2s_pipeline.py
Important: The "Update branch" button must only be pressed on very rare occasions. An outdated branch is never blocking the merge of a PR. Please reach out to the automation team before pressing that button.
What does this PR do?
Add a streaming (real-time, chunk-by-chunk) inference pipeline for NemotronVoiceChat,
following the same architecture as the NeMo ASR Inference Pipelines.
Collection: speechlm2
Changelog
- `StreamingS2SPipeline` with `generate_step()` API for both batch file processing and server integration
- `NemotronVoicechatInferenceWrapper` with `infer_one_step()` implementing perception → LLM → TTS → codec decode
- `S2SPipelineBuilder` factory and Hydra config (`s2s_streaming.yaml`) for easy setup
- `S2SContextManager` for decode state lifecycle, `S2SStreamingState` for output accumulation
- `s2s_streaming_infer.py` entry script for batch inference on files/manifests
- `DuplexSTTModel`: KV cache support for Nemotron hybrid Mamba/Attention (with monkey-patches for upstream HF bugs), `save_pretrained` with tokenizer export, function head, ASR logit boosts, `cache_position` forwarding
- `conftest.py` fixtures, offline-vs-streaming parity test, no-crash config sweep
- `streaming_inference.rst` with architecture, config reference, and server integration guide

Modifications to more general code - FYI @kevinhu-nv @Edresson
- `NemotronVoiceChat`: `from_pretrained` supports loading from an HF-format checkpoint with `llm_artifacts/`
- `EarTTSModel`: vectorized depth-sum, precomputed RVQ schedule, optional `torch.compile`, subword cache
- The `_patch_nemotron_cache_bugs` and `_patch_nemotron_block_forward` methods in `DuplexSTTModel` are patching bugs in the HF Nemotron model code so we can get the KV caching to work. The patches seem to work for me, though I wonder if we can use more up-to-date code that doesn't need the patches.

Usage
```bash
python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
    audio_file=/path/to/audio.wav \
    s2s.model_path=/path/to/checkpoint \
    s2s.speaker_name="<name>" \
    s2s.engine_type="native" \
    streaming.chunk_size_in_secs=0.08 \
    streaming.buffer_size_in_secs=1.68
```

GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information