[speechlm2] Add streaming inference pipeline for NemotronVoiceChat by erastorgueva-nv · Pull Request #15571 · NVIDIA-NeMo/NeMo

erastorgueva-nv · 2026-04-01T06:27:05Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Add a streaming (real-time, chunk-by-chunk) inference pipeline for NemotronVoiceChat,
following the same architecture as the NeMo ASR Inference Pipelines.

Collection: speechlm2

Changelog

Add StreamingS2SPipeline with generate_step() API for both batch file processing and server integration
Add NemotronVoicechatInferenceWrapper with infer_one_step() implementing perception → LLM → TTS → codec decode
Add perception cache with optional CUDA graph support for cache-aware streaming encoding
Add S2SPipelineBuilder factory and Hydra config (s2s_streaming.yaml) for easy setup
Add state management: slot-based S2SContextManager for decode state lifecycle, S2SStreamingState for output accumulation
Add s2s_streaming_infer.py entry script for batch inference on files/manifests
Extend DuplexSTTModel: KV cache support for Nemotron hybrid Mamba/Attention (with monkey-patches for upstream HF bugs), save_pretrained with tokenizer export, function head, ASR logit boosts, cache_position forwarding
Speed up model loading: meta device initialization, skip codec silence token computation when codec has random weights
Fix sampling: nan/inf check before top-p filtering, vectorized repetition penalty
Fix byte-level BPE decoding and BOS/EOS preservation in text output
Refactor tests: shared conftest.py fixtures, offline-vs-streaming parity test, no-crash config sweep
Add docs: streaming_inference.rst with architecture, config reference, and server integration guide
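
As background for the byte-level BPE decoding fix listed above, here is a minimal sketch of why byte-level tokens have to be joined before UTF-8 decoding. It assumes a GPT-2 style byte-to-character table, which may not match the tokenizer used in this PR, and every helper name is hypothetical rather than taken from the PR's code.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2 style table mapping each byte value to a printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    chars = printable[:]
    extra = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted into code points above 255.
            printable.append(b)
            chars.append(256 + extra)
            extra += 1
    return {b: chr(c) for b, c in zip(printable, chars)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def decode_tokens(tokens: list[str]) -> str:
    """Join all token strings first, then map back to bytes and decode once."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


def encode_as_byte_tokens(text: str) -> list[str]:
    """Toy tokenizer for illustration only: one token per UTF-8 byte."""
    return [BYTE_TO_CHAR[b] for b in text.encode("utf-8")]


tokens = encode_as_byte_tokens("caf\u00e9")  # the last character takes two bytes
joined = decode_tokens(tokens)                # decode the whole sequence at once
per_token = "".join(decode_tokens([t]) for t in tokens)  # decode token by token
```

Decoding the whole sequence recovers the accented character, while decoding token by token splits its two bytes and yields replacement characters instead. A streaming decoder therefore has to buffer incomplete byte sequences across steps.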

Modifications to more general code - FYI @kevinhu-nv @Edresson

Extend NemotronVoiceChat: from_pretrained supports loading from HF-format checkpoint with llm_artifacts/
Extend EarTTSModel: vectorized depth-sum, precomputed RVQ schedule, optional torch.compile, subword cache
Patches: _patch_nemotron_cache_bugs and _patch_nemotron_block_forward methods in DuplexSTTModel are patching bugs in the HF Nemotron model code so we can get the KV caching to work. The patches seem to work for me, though I wonder if we can use more up-to-date code that doesn't have the patches.

Usage

python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
    audio_file=/path/to/audio.wav \
    s2s.model_path=/path/to/checkpoint \
    s2s.speaker_name="<name>" \
    s2s.engine_type="native" \
    streaming.chunk_size_in_secs=0.08 \
    streaming.buffer_size_in_secs=1.68

from nemo.collections.speechlm2.inference import S2SPipelineBuilder

pipeline = S2SPipelineBuilder.build_pipeline(cfg)
output = pipeline.run(audio_filepaths, options=options)

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

pzelasko

Partial review up to streaming_s2s_pipeline.py at line 511 (note to self where to pick up later)

pzelasko · 2026-04-03T15:20:29Z

docs/source/speechlm2/intro.rst

+    from nemo.collections.speechlm2.inference import S2SPipelineBuilder
+
+    pipeline = S2SPipelineBuilder.build_pipeline(cfg)
+    output = pipeline.run(audio_filepaths, options=options)


Does this assume a single-turn evaluation? Or the audio file can have multiple turns and the agent is expected to handle that correctly? Let's clarify this in the docs.

Not sure what you mean - it's full-duplex, so it just generates one frame of output for every frame of audio input. The audio input can contain a single turn, multiple turns, whatever.

Or if you're asking about "evaluation" - the code doesn't support detailed "evaluation". We just generate text & audio for the full audio file (with an option to add silence padding at the end, so the agent can finish speaking). The one bit of "evaluation" we have is WER.

> it just generates one frame of output for every frame of audio input. Audio input can contain single-turn, multi-turn, whatever.

> We just generate text & audio for the full audio file (with an option to add silence padding at the end, so the agent can finish speaking).

Let's write these in here - it's not obvious to an outside reader what characterizes the inputs and outputs of this API.
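
The input/output contract described in this thread can be sketched as follows. This is only an illustration under stated assumptions: `pad_with_silence`, `run_full_duplex`, and `echo_step` are hypothetical names, frames are plain lists of floats, and the placeholder step function stands in for the real model.

```python
from typing import Callable

Frame = list[float]


def pad_with_silence(frames: list[Frame], pad_frames: int, frame_len: int) -> list[Frame]:
    """Append all-zero frames so the agent has time to finish speaking."""
    return frames + [[0.0] * frame_len for _ in range(pad_frames)]


def run_full_duplex(frames: list[Frame], step: Callable[[Frame], Frame]) -> list[Frame]:
    """Call `step` once per input frame and collect exactly one output frame each time."""
    return [step(frame) for frame in frames]


def echo_step(frame: Frame) -> Frame:
    """Placeholder for the model: returns an output frame of the same length."""
    return [0.5 * x for x in frame]


user_audio = [[0.1] * 4, [0.2] * 4, [0.3] * 4]  # three input frames
padded = pad_with_silence(user_audio, pad_frames=2, frame_len=4)
agent_audio = run_full_duplex(padded, echo_step)
```

The number of output frames always equals the number of input frames after padding, regardless of how many turns the input contains.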

pzelasko · 2026-04-03T15:21:52Z

docs/source/speechlm2/intro.rst

+.. code-block:: bash
+
+    python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
+        audio_file=/path/to/audio \


Both examples here showcase audio_file. We need to mention how to perform live streaming inference (using mic or other streaming audio input connector) if it is supported by this API; or that it is not supported.

pzelasko · 2026-04-03T15:23:30Z

docs/source/speechlm2/streaming_inference.rst

+    from nemo.collections.speechlm2.inference import S2SPipelineBuilder
+
+    pipeline = S2SPipelineBuilder.build_pipeline(cfg)
+    output = pipeline.run(audio_filepaths, options=options)


Is there another entry-point with a streaming input connector (mic)? We should mention.

pzelasko · 2026-04-03T15:24:46Z

docs/source/speechlm2/streaming_inference.rst

+.. code-block:: python
+
+    pipeline.open_session()
+    for frames in streamer:


Can we show how streamer is constructed? You'd normally refer the user to ASR pipelines documentation but it doesn't exist yet in main IIRC, so we need to describe at least basic concepts / APIs.
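
Until the streamer is documented, the basic idea can be sketched as a generator that cuts audio into fixed-size chunks and maintains a rolling buffer. This is a hypothetical stand-in, not the NeMo streamer class: the function name, the 16 kHz sample rate, and the zero-initialized buffer are assumptions, while the chunk and buffer durations are the values from the usage example in the PR description.

```python
from collections import deque
from typing import Iterator


def stream_chunks(
    samples: list[float],
    sample_rate: int = 16000,           # assumed sample rate
    chunk_size_in_secs: float = 0.08,
    buffer_size_in_secs: float = 1.68,
) -> Iterator[tuple[list[float], list[float]]]:
    """Yield (new_chunk, rolling_buffer) pairs, one per chunk of audio."""
    chunk_len = round(chunk_size_in_secs * sample_rate)
    buffer_len = round(buffer_size_in_secs * sample_rate)
    buffer: deque[float] = deque([0.0] * buffer_len, maxlen=buffer_len)
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        # Zero-pad the final chunk so every step sees the same chunk length.
        chunk = chunk + [0.0] * (chunk_len - len(chunk))
        buffer.extend(chunk)            # oldest samples fall off the left
        yield chunk, list(buffer)


one_second = [1.0] * 16000
steps = list(stream_chunks(one_second))
```

With these settings each step carries 1280 new samples and a buffer of 26880 samples, so one second of audio produces 13 steps, the last of which is half padding.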

pzelasko · 2026-04-03T20:35:18Z

nemo/collections/speechlm2/inference/model_wrappers/nemotron_voicechat_inference_wrapper.py

+    # infer_one_step sub-stages
+    # ------------------------------------------------------------------
+
+    def _build_input_embedding(


It looks like this method should live in DuplexSTT class? It's exposing inner workings of input construction to a high-level inference API.

If we build DuplexSTTv2 which does it completely differently, we don't want to re-write this wrapper - we should just call stt_model.build_input_embedding()

pzelasko · 2026-04-03T20:36:04Z

nemo/collections/speechlm2/inference/model_wrappers/nemotron_voicechat_inference_wrapper.py

+
+        return emb
+
+    def _run_llm_step(


This method should be split in two, with one half living in the native LLM class and the other in the vLLM LLM class.

pzelasko · 2026-04-03T20:42:23Z

nemo/collections/speechlm2/inference/model_wrappers/perception_cache.py

+
+
+@dataclass
+class PerceptionCUDAGraphState:


Should this (partially) live in ASR collection? Could we re-use your work here to accelerate streaming models like nemotron-speech-asr?

pzelasko · 2026-04-03T20:43:28Z

nemo/collections/speechlm2/inference/model_wrappers/perception_cache.py

+            state.static_cache_channel_len_in = cache_last_channel_len.clone()
+
+        logging.info(f"   Warming up encoder for CUDA graph capture...")
+        for _ in range(3):


what's magical about 3?

pzelasko

Finalized my first review pass :)

pzelasko · 2026-04-06T15:13:13Z

nemo/collections/speechlm2/inference/streaming/framing/s2s_request_options.py

+
+    system_prompt: str | None = None
+
+    top_p: float | None = None          # (0, 1]


Is it possible to support different top_p/temperature/repetition_penalty in different examples in a batch without using a for loop over them?

Is there a strong motivation to support that? Or could we expect the user to define session-level decoding parameters and have easier batching?
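
For readers unfamiliar with the `top_p` option under discussion, here is a minimal single-example sketch of nucleus filtering with a non-finite check performed first, in the spirit of the "nan/inf check before top-p filtering" changelog entry. It is plain Python rather than the PR's tensor code, the function name is hypothetical, and raising on bad input is an assumption; the PR may handle that case differently.

```python
import math


def top_p_filter(logits: list[float], top_p: float) -> list[float]:
    """Return renormalized probabilities restricted to the top-p nucleus."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Guard first: NaN or +inf would poison the softmax and the sort below.
    if any(math.isnan(x) or x == math.inf for x in logits):
        raise ValueError("logits contain NaN or +inf")
    peak = max(logits)
    if peak == -math.inf:
        raise ValueError("all logits are -inf")
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    probs = [e / total for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: set[int] = set()
    mass = 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:  # smallest prefix whose mass reaches top_p
            break
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


filtered = top_p_filter([2.0, 1.0, 0.0, -math.inf], top_p=0.8)
```

Because the cutoff depends on each example's own sorted distribution, applying a different `top_p` per batch row requires either a loop like this one or a batched formulation that broadcasts a per-row threshold, which is the trade-off the question above is about.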

pzelasko · 2026-04-06T15:15:29Z

nemo/collections/speechlm2/inference/streaming/state/s2s_context_manager.py

+)
+
+
+class S2SContextManager:


This needs a docstring with an overview of what it does.

pzelasko · 2026-04-06T15:16:17Z

nemo/collections/speechlm2/inference/streaming/state/s2s_context_manager.py

+            raise RuntimeError("s2s_model must provide create_decode_state(max_len)")
+        return self.s2s_model.create_decode_state(self.max_len)
+
+    def _ensure_slot(self, stream_id: int) -> int:


How is a "slot" defined here?
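
One common meaning of "slot" in this kind of code is a fixed index into preallocated per-stream state, handed out when a stream first appears and recycled when it ends. The sketch below illustrates that reading only; `SlotTable` and its methods are hypothetical and are not the PR's `S2SContextManager`, whose actual definition of a slot is what the question asks the author to document.

```python
class SlotTable:
    """Hypothetical slot bookkeeping: maps stream ids to a fixed pool of indices."""

    def __init__(self, num_slots: int) -> None:
        # Stored in reverse so that pop() hands out slot 0 first.
        self.free: list[int] = list(range(num_slots - 1, -1, -1))
        self.slot_of: dict[int, int] = {}  # stream_id -> slot index

    def ensure_slot(self, stream_id: int) -> int:
        """Return the stream's slot, assigning a free one on first use."""
        if stream_id not in self.slot_of:
            if not self.free:
                raise RuntimeError("no free slots")
            self.slot_of[stream_id] = self.free.pop()
        return self.slot_of[stream_id]

    def release(self, stream_id: int) -> None:
        """Return the stream's slot to the free list so another stream can reuse it."""
        self.free.append(self.slot_of.pop(stream_id))


table = SlotTable(num_slots=2)
a = table.ensure_slot(stream_id=101)
b = table.ensure_slot(stream_id=202)
again = table.ensure_slot(stream_id=101)  # same stream, same slot
table.release(101)
c = table.ensure_slot(stream_id=303)      # reuses the released slot
```

Under this reading, the decode state for a stream would live at its slot index for the lifetime of the stream.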

pzelasko · 2026-04-06T16:03:08Z

nemo/collections/speechlm2/parts/pretrained.py

        )
    else:
        config = AutoConfig.from_pretrained(model_path_or_name, trust_remote_code=trust_remote_code)
+        if use_meta_device:


with (torch.device('meta') if use_meta_device else nullcontext()):

pzelasko · 2026-04-06T16:03:42Z

nemo/collections/speechlm2/parts/pretrained.py

        pretrained_weights: Whether to load pretrained weights (True) or random init (False)
        dtype: Data type for the model
        trust_remote_code: Whether to trust remote code when loading model (needed for some models like Nemotron)
+        use_meta_device: If True, create the model on the meta device (no memory allocation).


Is this compatible with transformers v5 or v4? Or both? We already bumped NeMo to v5

pzelasko · 2026-04-06T16:14:19Z

tests/collections/speechlm2/conftest.py

+
+from nemo.collections.speechlm2.models import NemotronVoiceChat
+
+_pretrained_llm = "TinyLlama/TinyLlama_v1.1"


Since so much logic is dedicated to Nemotron v2 in VoiceChat, shouldn't we test against Nemotron v2 as well? Or are you concerned it will take a very long time to load in CI?

pzelasko · 2026-04-06T16:17:23Z

tests/collections/speechlm2/conftest.py

@@ -0,0 +1,280 @@
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.


These fixtures are only really shared between nemotron voicechat tests, not all speechlm2 tests, so how about creating a new tests/collections/speechlm2/voicechat directory and moving that conftest + related tests there?

github-advanced-security bot found potential problems Apr 1, 2026

erastorgueva-nv added 27 commits April 1, 2026 06:33

85f9406 add changes from duplex-realtime-inference branch, except duplex_stt_model.py modification for function_head
98da1ad add on: asr_logits boosts, speaker embedding, fc head
52813de add use_llm_cache option, will use HybridMambaAttentionDynamicCache, with patches
a65916a add tts inference speedups: vectorize depthsum, precompute rvq schedule, optional torch.compile & subword cache
2b21753 allow using speaker_latent with vllm (need to update vllm eartts.py)
11abaa0 add flag for speaker_name if doing standalone inference
0b506b2 remove standalone code path; add parity check for offline vs streaming - adjusted infer_one_step code so operations will match offline
a7c61d9 skip pretrained ASR/LLM downloads in from_pretrained; simplify inference wrapper loading
40475f0 quickfix for parity harness regarding speaker name / reference in tts
1449183 speed up model loading: use meta device, dont get codec silence tokens which will be ignored anyway
ef06833 normalize indentation to 4-space
a485dcc remove hardcoded env var, simple tidy: remove dead code atc
dd88987 always use codec cache => remove use_codec_cache flag and codec_token_history_size parameter
adc42f7 remove newlines in logs
8babc04 further tidying: pass StreamingDecodeState directly, return InferenceStepResult etc
0285589 Add pytest-based offline vs. incremental inference parity test with logit comparison
a918f7f refactor streaming S2S pipeline: extract helpers, factor infer_one_step, add docs
277511b in test: use existing audio file, allow system prompt, specify params for parity
dc6a759 Refactor voicechat tests: shared fixtures, no-crash sweep, deterministic parity
6e98c85 Fix byte-level BPE decoding in raw output: unify tokens_to_str and tokens_to_str_raw
819e5f4 use whisper normalizer for wer calculation
e8e7151 remove unnecessary logging in perception cache step
eebea30 vectorize rep penalty; fix sampling - nan/inf check before top-p filtering
de230d9 Preserve BOS/EOS as literal strings in decoded text output
8b849c1 update triton code; bugfix for vllm dtype/device
d3db700 Always send prefill before audio streaming; fix bfloat16 audio output
81a752e remove triton code to keep PR simple

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> (every commit above)

erastorgueva-nv force-pushed the duplex-realtime-inference-rebase branch from 150aab1 to 81a752e on April 1, 2026 06:33

b7673a4 add missing __init__.py

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

erastorgueva-nv requested review from naymaraq and pzelasko April 1, 2026 06:38

naymaraq requested changes Apr 1, 2026

erastorgueva-nv added 3 commits April 1, 2026 19:14

df2e3bb use built-in type hints (X | None, dict, list) instead of typing imports
c3c0d7e use nemo_asr.metrics.wer.word_error_rate for wer calc
770efb4 use SimpleTimer in s2s_streaming_infer.py script

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> (every commit above)

naymaraq requested changes Apr 2, 2026

nemo/collections/speechlm2/inference/model_wrappers/perception_cache.py (outdated, resolved)

erastorgueva-nv added 2 commits April 3, 2026 02:15

89f818f simplify flow for prefill and use per-stream options, including sampling params
3e9e3e1 perception_cache: check all three fields in is_initialized

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> (every commit above)

github-advanced-security bot found potential problems Apr 3, 2026

nemo/collections/speechlm2/inference/model_wrappers/perception_cache.py (fixed)

nemo/collections/speechlm2/inference/vllm/streaming_llm_engine.py (fixed)

erastorgueva-nv added 2 commits April 3, 2026 18:07

84eeec5 move silence padding from pipeline run-loop into streamer classes
800bcc2 return incremental GenerateStepOutput from generate_step

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> (every commit above)

pzelasko reviewed Apr 3, 2026

pzelasko requested changes Apr 6, 2026

88543cf refactor: split model_factory into backend/ modules; unify vLLM engine classes

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

github-advanced-security bot found potential problems Apr 9, 2026

erastorgueva-nv added 3 commits April 10, 2026 00:09

e0db2ca address CodeQL errors
65d14e6 Clean up debug/logging: logger pattern, keep logits on GPU, per-frame logs to debug
c3af8fa Add per-step progress bar, timing summary, and pad-visible logging

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> (every commit above)


