39 commits
85f9406
add changes from duplex-realtime-inference branch, except duplex_stt_…
erastorgueva-nv Mar 11, 2026
98da1ad
add on: asr_logits boosts, speaker embedding, fc head
erastorgueva-nv Mar 12, 2026
52813de
add use_llm_cache option, will use HybridMambaAttentionDynamicCache, …
erastorgueva-nv Mar 12, 2026
a65916a
add tts inference speedups: vectorize depthsum, precompute rvq schedu…
erastorgueva-nv Mar 13, 2026
2b21753
allow using speaker_latent with vllm (need to update vllm eartts.py)
erastorgueva-nv Mar 13, 2026
11abaa0
add flag for speaker_name if doing standalone inference
erastorgueva-nv Mar 13, 2026
0b506b2
remove standalone code path; add parity check for offline vs streamin…
erastorgueva-nv Mar 19, 2026
a7c61d9
skip pretrained ASR/LLM downloads in from_pretrained; simplify infere…
erastorgueva-nv Mar 19, 2026
40475f0
quickfix for parity harness regarding speaker name / reference in tts
erastorgueva-nv Mar 19, 2026
1449183
speed up model loading: use meta device, dont get codec silence token…
erastorgueva-nv Mar 19, 2026
ef06833
normalize indentation to 4-space
erastorgueva-nv Mar 20, 2026
a485dcc
remove hardcoded env var, simple tidy: remove dead code atc
erastorgueva-nv Mar 20, 2026
dd88987
always use codec cache => remove use_codec_cache flag and codec_token…
erastorgueva-nv Mar 20, 2026
adc42f7
remove newlines in logs
erastorgueva-nv Mar 20, 2026
8babc04
further tidying: pass StreamingDecodeState directly, return Inference…
erastorgueva-nv Mar 23, 2026
0285589
Add pytest-based offline vs. incremental inference parity test with l…
erastorgueva-nv Mar 24, 2026
a918f7f
refactor streaming S2S pipeline: extract helpers, factor infer_one_st…
erastorgueva-nv Mar 25, 2026
277511b
in test: use existing audio file, allow system prompt, specify params…
erastorgueva-nv Mar 25, 2026
dc6a759
Refactor voicechat tests: shared fixtures, no-crash sweep, determinis…
erastorgueva-nv Mar 30, 2026
6e98c85
Fix byte-level BPE decoding in raw output: unify tokens_to_str and to…
erastorgueva-nv Mar 30, 2026
819e5f4
use whisper normalizer for wer calculation
erastorgueva-nv Mar 31, 2026
e8e7151
remove unnecessary logging in perception cache step
erastorgueva-nv Mar 31, 2026
eebea30
vectorize rep penalty; fix sampling - nan/inf check before top-p filt…
erastorgueva-nv Mar 31, 2026
de230d9
Preserve BOS/EOS as literal strings in decoded text output
erastorgueva-nv Mar 31, 2026
8b849c1
update triton code; bugfix for vllm dtype/device
erastorgueva-nv Apr 1, 2026
d3db700
Always send prefill before audio streaming; fix bfloat16 audio output
erastorgueva-nv Apr 1, 2026
81a752e
remove triton code to keep PR simple
erastorgueva-nv Apr 1, 2026
b7673a4
add missing __init__.py
erastorgueva-nv Apr 1, 2026
df2e3bb
use built-in type hints (X | None, dict, list) instead of typing imports
erastorgueva-nv Apr 1, 2026
c3c0d7e
use nemo_asr.metrics.wer.word_error_rate for wer calc
erastorgueva-nv Apr 1, 2026
770efb4
use SimpleTimer in s2s_streaming_infer.py script
erastorgueva-nv Apr 1, 2026
89f818f
simplify flow for prefill and use per-stream options, including sampl…
erastorgueva-nv Apr 3, 2026
3e9e3e1
perception_cache: check all three fields in is_initialized
erastorgueva-nv Apr 3, 2026
84eeec5
move silence padding from pipeline run-loop into streamer classes
erastorgueva-nv Apr 3, 2026
800bcc2
return incremental GenerateStepOutput from generate_step
erastorgueva-nv Apr 3, 2026
88543cf
refactor: split model_factory into backend/ modules; unify vLLM engin…
erastorgueva-nv Apr 9, 2026
e0db2ca
address CodeQL errors
erastorgueva-nv Apr 10, 2026
65d14e6
Clean up debug/logging: logger pattern, keep logits on GPU, per-frame…
erastorgueva-nv Apr 10, 2026
c3af8fa
Add per-step progress bar, timing summary, and pad-visible logging
erastorgueva-nv Apr 14, 2026
31 changes: 30 additions & 1 deletion docs/source/speechlm2/intro.rst
@@ -246,7 +246,35 @@ You can evaluate and run full-duplex inference using the `NemotronVoiceChat` pip

print(f"Agent response: {generated_text}")
# generated_speech can now be saved or played (sampled at model.target_sample_rate)


NemotronVoiceChat Streaming Inference
*************************************

For real-time, chunk-by-chunk inference (as opposed to the offline mode shown
above), use the Streaming S2S Pipeline:

.. code-block:: python

from nemo.collections.speechlm2.inference import S2SPipelineBuilder

pipeline = S2SPipelineBuilder.build_pipeline(cfg)
output = pipeline.run(audio_filepaths, options=options)
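For intuition about the chunk-by-chunk mode, the toy loop below sketches a sliding audio buffer updated by fixed-size chunks. The buffer mechanics, sample rate, and sizes here are assumptions for illustration (mirroring the `streaming.chunk_size_in_secs=0.24` / `streaming.buffer_size_in_secs=1.68` options used elsewhere in this PR), not the actual NeMo implementation:

```python
# Illustrative sketch only: these buffer mechanics and the 16 kHz sample rate
# are assumptions, not the real NeMo streaming code.

SAMPLE_RATE = 16000                  # assumed model sample rate
CHUNK = int(0.24 * SAMPLE_RATE)      # samples per incoming chunk (3840)
BUFFER = int(1.68 * SAMPLE_RATE)     # sliding context seen per step (26880)

def stream_buffers(audio):
    """Yield the sliding buffer contents after each incoming chunk."""
    buf = [0.0] * BUFFER
    for start in range(0, len(audio), CHUNK):
        chunk = audio[start:start + CHUNK]
        # drop the oldest samples, append the newest chunk on the right
        buf = buf[len(chunk):] + list(chunk)
        yield buf

one_second = [0.0] * SAMPLE_RATE     # 1 s of silence
steps = list(stream_buffers(one_second))
print(len(steps))                    # 5 buffer snapshots cover 1 s of audio
```

With these numbers, the model's context window holds seven chunks' worth of audio, so each inference step sees roughly the last 1.68 s while only 0.24 s is new.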
Collaborator:

Does this assume a single-turn evaluation? Or can the audio file have multiple turns, with the agent expected to handle that correctly? Let's clarify this in the docs.

Collaborator (Author):

Not sure what you mean - it's full-duplex, so it just generates one frame of output for every frame of audio input. Audio input can contain single-turn, multi-turn, whatever.

Or if you're asking about "evaluation" - the code doesn't support detailed "evaluation". We just generate text & audio for the full audio file (with an option to add silence padding at the end, so the agent can finish speaking). The one bit of "evaluation" we have is WER.
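For reference, WER here means the usual word-level edit distance divided by the reference length. The sketch below uses a generic textbook implementation as an assumption; the PR itself routes this through `nemo_asr.metrics.wer.word_error_rate` rather than hand-rolling it:

```python
# Generic WER illustration (standard Levenshtein over words), NOT the NeMo
# implementation referenced in the commits.

def word_error_rate(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # d[j] holds the edit distance between the ref prefix seen so far
    # and the first j hypothesis words
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / max(len(r), 1)

print(word_error_rate("the cat sat", "the cat sat"))   # 0.0
print(word_error_rate("the cat sat", "the bat sat"))   # one substitution -> 1/3
```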

Collaborator:

> it just generates one frame of output for every frame of audio input. Audio input can contain single-turn, multi-turn, whatever.

> We just generate text & audio for the full audio file (with an option to add silence padding at the end, so the agent can finish speaking).

Let's write these in here - it's not obvious to an outside reader what characterizes the inputs and outputs of this API.
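The full-duplex contract described in this thread, one output frame per input frame plus optional trailing silence so the agent can finish speaking, can be sketched as follows. The `step` callable is a hypothetical stand-in for the model, not the real NeMo interface:

```python
# Toy sketch of the full-duplex contract: one output frame per input frame,
# plus optional trailing silence frames. `step` is a hypothetical stand-in
# for the model, not the NeMo API.

def run_full_duplex(input_frames, step, silence_frame, pad_frames=0):
    outputs = [step(f) for f in input_frames]                    # 1:1 mapping
    outputs += [step(silence_frame) for _ in range(pad_frames)]  # let it finish
    return outputs

# Usage with a trivial "model" that uppercases each text frame:
out = run_full_duplex(["hi", "there"], step=str.upper,
                      silence_frame="", pad_frames=2)
print(out)   # ['HI', 'THERE', '', '']
```

The point of the padding is that the agent may still be mid-utterance when the input audio ends; feeding extra silence frames gives it room to complete its response.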


Or from the command line:

.. code-block:: bash

python examples/speechlm2/nemo_inference_pipelines/s2s_streaming_infer.py \
audio_file=/path/to/audio \
s2s.model_path=/path/to/checkpoint \
s2s.speaker_name="<speaker>" \
s2s.engine_type=native \
s2s.system_prompt="You are a helpful assistant." \
streaming.chunk_size_in_secs=0.24 \
streaming.buffer_size_in_secs=1.68

Collaborator:

Both examples here showcase audio_file. We need to mention how to perform live streaming inference (using a mic or other streaming audio input connector) if it is supported by this API, or state that it is not supported.

See :doc:`streaming_inference` for full details on configuration, architecture,
and server integration.

Training a Model
----------------
@@ -341,3 +369,4 @@ For more information, see additional sections in the SpeechLM2 docs:
datasets
configs
training_and_scaling
streaming_inference