fix: reset the streaming decoder on <EOU>/<EOB> so transcription continues (#13) #15

Merged
mudler merged 1 commit into master from fix/streaming-eou-reset on Jun 6, 2026

Conversation

@localai-bot
Collaborator

Issue

#13: with --stream, transcription stops after the first [EOU]. The model keeps processing the rest of the audio but no further text comes out.

Root cause

The realtime EOU model (parakeet_realtime_eou_120m-v1) emits <EOU> / <EOB> as ordinary vocab tokens marking end of utterance. The cache-aware streaming decode carried the RNN-T decoder state across chunks but never reset it, so once <EOU> was emitted the prediction net stayed conditioned on it and the joint scored blank on every following frame. The stream went silent after the first utterance.

This is not a divergence from NeMo's greedy decode: NeMo's plain rnnt_decoder_predictions_tensor does exactly the same thing, offline and streaming. The model is simply not meant to run continuously without a reset.

What upstream does

NeMo's reference streaming driver for this model, examples/voice_agent/.../nemo/streaming_asr.py (NemoStreamingASRService.transcribe), carries partial_hypotheses + encoder caches across chunks just like our StreamingSession, and calls reset_state() whenever <EOU>/<EOB> appears in a chunk so the next utterance decodes from a fresh state.
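To illustrate the failure mode and the reset-on-EOU behaviour described above, here is a minimal toy sketch. It is not the NeMo or StreamingSession code: `DecoderState`, `toy_decode_chunk`, and `stream` are hypothetical names, and the "decoder" merely simulates a prediction net that scores blank forever once it is conditioned on <EOU>/<EOB>.

```python
from dataclasses import dataclass, field

EOU, EOB, SOS = "<EOU>", "<EOB>", "<SOS>"


@dataclass
class DecoderState:
    """Toy stand-in for the RNN-T decoder state (last token + LSTM h/c)."""
    last_token: str = SOS
    h: list = field(default_factory=lambda: [0.0, 0.0])
    c: list = field(default_factory=lambda: [0.0, 0.0])

    def reset(self):
        # Back to a fresh state: h/c zeroed, last token back to SOS.
        self.last_token = SOS
        self.h = [0.0] * len(self.h)
        self.c = [0.0] * len(self.c)


def toy_decode_chunk(chunk, state):
    """Simulated greedy decode: emits nothing while conditioned on EOU/EOB."""
    emitted = []
    for token in chunk:
        if state.last_token in (EOU, EOB):
            continue  # simulates the joint scoring blank on every frame
        emitted.append(token)
        state.last_token = token
        state.h = [x + 1.0 for x in state.h]  # pretend recurrent update
    return emitted


def stream(chunks, reset_on_eou):
    """Carry decoder state across chunks, optionally resetting after EOU/EOB."""
    state = DecoderState()
    transcript = []
    for chunk in chunks:
        emitted = toy_decode_chunk(chunk, state)
        transcript.extend(emitted)
        if reset_on_eou and any(t in (EOU, EOB) for t in emitted):
            state.reset()  # next chunk decodes from a fresh state
    return transcript


chunks = [["hello", EOU], ["world", EOU]]
without_reset = stream(chunks, reset_on_eou=False)
with_reset = stream(chunks, reset_on_eou=True)
```

In this toy, `without_reset` stops after the first utterance while `with_reset` contains both, which mirrors the symptom in #13 and the effect of the reset.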

Fix

StreamingSession::feed_mel_chunk now resets the RNN-T decoder state (LSTM h/c to zero, last token back to SOS) after a chunk emits <EOU>/<EOB>.

Only the decoder is reset, not the StreamingEncoder cache. NeMo's reset_state also drops the encoder cache, but keeping it was verified to produce a byte-identical transcript (decoder-only reset == full reset on the diffusion 60s / 2-EOU and 180s / 5-EOU clips), so the validated streaming-encoder path is left untouched. enc_frame_ keeps running so EOU timestamps stay absolute within the clip, and the offline path is unchanged (it matches NeMo offline on single utterances).
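The decoder-only reset can be pictured with the following rough sketch. `ToySession` and its fields are hypothetical and only loosely modelled on the names in this PR; the real session is C++ and operates on mel chunks, not token lists. The point is that the frame counter and the encoder cache survive the reset, so EOU positions remain absolute.

```python
class ToySession:
    """Hypothetical stand-in for a streaming session (not the real class)."""

    def __init__(self):
        self.enc_frame = 0        # running encoder frame index, never reset
        self.encoder_cache = []   # stand-in for the streaming encoder cache
        self.last_token = "<SOS>"
        self.eou_frames = []      # absolute frame index of each EOU/EOB

    def feed(self, frames):
        """Consume one chunk; each entry is a token or None (blank) per frame."""
        saw_eou = False
        for token in frames:
            self.encoder_cache.append(self.enc_frame)
            if token in ("<EOU>", "<EOB>"):
                self.eou_frames.append(self.enc_frame)
                saw_eou = True
            if token is not None:
                self.last_token = token
            self.enc_frame += 1
        if saw_eou:
            # Decoder-only reset: encoder cache and enc_frame are left alone.
            self.last_token = "<SOS>"


session = ToySession()
session.feed([None, "a", "<EOU>"])
session.feed(["b", None, "<EOU>"])
```

After the two chunks, the second EOU is recorded at frame 5 rather than frame 2, because the counter was not rewound by the reset.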

Verification

On the issue's diffusion2023-07-03 sample, --stream now emits every utterance (2 EOUs at 60s, 5 at 180s), byte-identical to NeMo's reset-on-EOU output.

  • New gated regression test test_streaming_eou_reset + scripts/gen_stream_reset_baseline.py: builds a two-utterance clip (speech.wav + silence + speech.wav) so an <EOU> fires mid-stream, runs NeMo's cache-aware streaming loop with reset-on-EOU, and asserts our streamed transcript matches it exactly and that the second utterance is recovered. Confirmed it fails with the reset disabled.
  • No regression: test_streaming_decode, test_transcribe_eou, test_capi_stream still pass; full suite 56/56.
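For readers who want to reproduce the two-utterance setup, the clip construction (speech + silence + speech) might look like the sketch below. The function name, the 16 kHz sample rate, and the 2-second gap are assumptions for illustration; the PR does not state what scripts/gen_stream_reset_baseline.py actually uses.

```python
def build_two_utterance_clip(speech, sample_rate=16000, silence_seconds=2.0):
    """Return speech + silence + speech as one list of float samples.

    sample_rate and silence_seconds are illustrative defaults, not values
    taken from the PR.
    """
    silence = [0.0] * int(sample_rate * silence_seconds)
    return list(speech) + silence + list(speech)


# Tiny synthetic example: 3 "speech" samples, 0.5 s of silence at 4 Hz.
clip = build_two_utterance_clip([0.1, -0.2, 0.3], sample_rate=4, silence_seconds=0.5)
```

The silent gap is what gives the model a chance to emit <EOU> mid-stream, so the second copy of the speech exercises the post-reset path.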

The one intentional difference vs NeMo's offline-clip driver: we may emit one extra trailing end-of-clip <EOU> event (the documented streaming-tail artifact). It never changes the transcript, and the real-time NeMo service would emit it too once more audio arrives.

🤖 Generated with Claude Code

…inues (#13)

The realtime EOU model (parakeet_realtime_eou_120m-v1) emits <EOU> / <EOB> as
ordinary vocab tokens to mark end of utterance. The cache-aware streaming decode
carried the RNN-T decoder state across chunks but never reset it, so once <EOU>
was emitted the prediction net stayed conditioned on it and the joint scored
blank on every following frame: the stream went silent after the first
utterance (issue #13). This matched NeMo's plain rnnt_decoder_predictions_tensor
(which does the same), but that is not how the model is meant to run.

NeMo's reference streaming driver for this model
(examples/voice_agent/.../nemo/streaming_asr.py NemoStreamingASRService.transcribe)
calls reset_state() whenever <EOU>/<EOB> appears in a chunk, so the next
utterance decodes from a fresh decoder state. StreamingSession::feed_mel_chunk
now does the same: after a chunk emits <EOU>/<EOB> it resets the RNN-T decoder
state (LSTM h/c to zero, last token back to SOS) for the next chunk.

Only the decoder is reset, not the StreamingEncoder cache. NeMo's reset_state
also drops the encoder cache, but that was verified byte-identical on the
transcript (decoder-only reset == full reset_state on the diffusion 60s/2-EOU
and 180s/5-EOU clips), so the validated streaming-encoder path is left
untouched. enc_frame_ keeps running so EOU timestamps stay absolute in the clip,
and the offline path is unchanged (it matches NeMo offline on single utterances).

Adds a gated regression test (test_streaming_eou_reset) plus a NeMo reset-on-EOU
baseline generator (gen_stream_reset_baseline.py) that builds a two-utterance
clip so an <EOU> fires mid-stream; the test asserts our streamed transcript
matches NeMo's reset reference exactly and that the second utterance is
recovered. Confirmed it fails with the reset disabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mudler mudler merged commit abd0087 into master Jun 6, 2026
8 checks passed

2 participants