fix: reset the streaming decoder on <EOU>/<EOB> so transcription continues (#13) by localai-bot · Pull Request #15 · mudler/parakeet.cpp

localai-bot · 2026-06-06T18:26:30Z

Issue

#13: with --stream, transcription stops after the first [EOU]. The model keeps processing the rest of the audio but no further text comes out.

Root cause

The realtime EOU model (parakeet_realtime_eou_120m-v1) emits <EOU> / <EOB> as ordinary vocab tokens marking end of utterance. The cache-aware streaming decode carried the RNN-T decoder state across chunks but never reset it, so once <EOU> was emitted the prediction net stayed conditioned on it and the joint scored blank on every following frame. The stream went silent after the first utterance.

This is not a divergence from NeMo's greedy decode: NeMo's plain rnnt_decoder_predictions_tensor does exactly the same thing, offline and streaming. The problem is that the model is not meant to run continuously without a reset.

What upstream does

NeMo's reference streaming driver for this model, examples/voice_agent/.../nemo/streaming_asr.py (NemoStreamingASRService.transcribe), carries partial_hypotheses + encoder caches across chunks just like our StreamingSession, and calls reset_state() whenever <EOU>/<EOB> appears in a chunk so the next utterance decodes from a fresh state.

Fix

StreamingSession::feed_mel_chunk now resets the RNN-T decoder state (LSTM h/c to zero, last token back to SOS) after a chunk emits <EOU>/<EOB>.
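
To illustrate the reset described above, here is a minimal sketch. It is not the actual parakeet.cpp code: `DecoderState`, `reset_decoder`, `reset_if_end_of_utterance`, and the token ids `kSos`, `kEou`, `kEob` are hypothetical names and values chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the RNN-T prediction-net state carried across chunks.
struct DecoderState {
    std::vector<float> h;    // LSTM hidden state
    std::vector<float> c;    // LSTM cell state
    int32_t last_token;      // last token fed to the prediction net
};

// Assumed token ids, for illustration only; the real ids come from the model vocab.
constexpr int32_t kSos = 0;
constexpr int32_t kEou = 1024;
constexpr int32_t kEob = 1025;

inline bool is_end_marker(int32_t tok) { return tok == kEou || tok == kEob; }

// Zero the LSTM state and put the last token back to SOS.
inline void reset_decoder(DecoderState & st) {
    assert(st.h.size() == st.c.size());
    std::fill(st.h.begin(), st.h.end(), 0.0f);
    std::fill(st.c.begin(), st.c.end(), 0.0f);
    st.last_token = kSos;
}

// Call once per chunk, after the chunk has been decoded.
// Returns true if the chunk emitted <EOU>/<EOB> and the state was reset.
inline bool reset_if_end_of_utterance(DecoderState & st,
                                      const std::vector<int32_t> & emitted) {
    if (std::any_of(emitted.begin(), emitted.end(), is_end_marker)) {
        reset_decoder(st);
        return true;
    }
    return false;
}
```

In this sketch the reset happens after the whole chunk has been decoded, so it only affects the next chunk, matching the "after a chunk emits <EOU>/<EOB>" wording above.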

Only the decoder is reset, not the StreamingEncoder cache. NeMo's reset_state also drops the encoder cache, but a decoder-only reset was verified to produce a byte-identical transcript to a full reset (on the diffusion 60s / 2-EOU and 180s / 5-EOU clips), so the validated streaming-encoder path is left untouched. enc_frame_ keeps running so EOU timestamps stay absolute in the clip, and the offline path is unchanged (it matches NeMo offline on single utterances).
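
A small sketch of why a running frame counter keeps EOU timestamps absolute. `EouTimeline`, `EouEvent`, and `on_chunk` are hypothetical names, and the seconds-per-frame value passed in is an assumption of the example, not a figure taken from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct EouEvent {
    int64_t frame;     // absolute encoder-frame index within the clip
    double  time_sec;  // frame index converted to seconds
};

// Hypothetical helper: tracks a running encoder-frame counter across chunks.
class EouTimeline {
public:
    explicit EouTimeline(double sec_per_frame) : sec_per_frame_(sec_per_frame) {
        assert(sec_per_frame > 0.0);
    }

    // n_frames: encoder frames produced by this chunk.
    // eou_offset: frame index inside the chunk where <EOU> fired, or -1 if none.
    void on_chunk(int64_t n_frames, int64_t eou_offset) {
        if (eou_offset >= 0 && eou_offset < n_frames) {
            const int64_t abs_frame = enc_frame_ + eou_offset;
            events_.push_back({abs_frame, static_cast<double>(abs_frame) * sec_per_frame_});
        }
        // The counter is deliberately NOT reset on <EOU>: only the decoder state is.
        enc_frame_ += n_frames;
    }

    const std::vector<EouEvent> & events() const { return events_; }
    int64_t enc_frame() const { return enc_frame_; }

private:
    double sec_per_frame_;
    int64_t enc_frame_ = 0;
    std::vector<EouEvent> events_;
};
```

If the counter were zeroed together with the decoder, the second event would be reported relative to the previous EOU rather than to the start of the clip.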

Verification

On the issue's diffusion2023-07-03 sample, --stream now emits every utterance (2 EOUs at 60s, 5 at 180s), byte-identical to NeMo's reset-on-EOU output.

New gated regression test test_streaming_eou_reset + scripts/gen_stream_reset_baseline.py: builds a two-utterance clip (speech.wav + silence + speech.wav) so an <EOU> fires mid-stream, runs NeMo's cache-aware streaming loop with reset-on-EOU, and asserts our streamed transcript matches it exactly and that the second utterance is recovered. Confirmed it fails with the reset disabled.
No regression: test_streaming_decode, test_transcribe_eou, test_capi_stream still pass; full suite 56/56.
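
The two-utterance clip used by the regression test (speech + silence + speech) can be pictured with the sketch below. This is a C++ illustration only, not the contents of scripts/gen_stream_reset_baseline.py; `make_two_utterance_clip` is a hypothetical helper, and the sample rate and silence length are parameters the caller would have to choose.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: concatenates speech, a run of zero samples, and the same
// speech again, so an end-of-utterance marker can fire mid-stream.
std::vector<float> make_two_utterance_clip(const std::vector<float> & speech,
                                           int sample_rate,
                                           double silence_sec) {
    assert(sample_rate > 0 && silence_sec >= 0.0);
    const auto n_silence = static_cast<std::size_t>(sample_rate * silence_sec);

    std::vector<float> clip;
    clip.reserve(2 * speech.size() + n_silence);
    clip.insert(clip.end(), speech.begin(), speech.end());  // first utterance
    clip.insert(clip.end(), n_silence, 0.0f);               // silence gap
    clip.insert(clip.end(), speech.begin(), speech.end());  // second utterance
    return clip;
}
```

Without the reset, only the first copy of the speech would be transcribed from such a clip, which is the failure the test is meant to catch.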

The one intentional difference vs NeMo's offline-clip driver: we may emit one extra trailing end-of-clip <EOU> event (the documented streaming-tail artifact). It never changes the transcript, and the real-time NeMo service would emit it too once more audio arrives.

🤖 Generated with Claude Code

…inues (#13)

The realtime EOU model (parakeet_realtime_eou_120m-v1) emits <EOU> / <EOB> as ordinary vocab tokens to mark end of utterance. The cache-aware streaming decode carried the RNN-T decoder state across chunks but never reset it, so once <EOU> was emitted the prediction net stayed conditioned on it and the joint scored blank on every following frame: the stream went silent after the first utterance (issue #13).

This matched NeMo's plain rnnt_decoder_predictions_tensor (which does the same), but that is not how the model is meant to run. NeMo's reference streaming driver for this model (examples/voice_agent/.../nemo/streaming_asr.py NemoStreamingASRService.transcribe) calls reset_state() whenever <EOU>/<EOB> appears in a chunk, so the next utterance decodes from a fresh decoder state.

StreamingSession::feed_mel_chunk now does the same: after a chunk emits <EOU>/<EOB> it resets the RNN-T decoder state (LSTM h/c to zero, last token back to SOS) for the next chunk.

Only the decoder is reset, not the StreamingEncoder cache. NeMo's reset_state also drops the encoder cache, but that was verified byte-identical on the transcript (decoder-only reset == full reset_state on the diffusion 60s/2-EOU and 180s/5-EOU clips), so the validated streaming-encoder path is left untouched. enc_frame_ keeps running so EOU timestamps stay absolute in the clip, and the offline path is unchanged (it matches NeMo offline on single utterances).

Adds a gated regression test (test_streaming_eou_reset) plus a NeMo reset-on-EOU baseline generator (gen_stream_reset_baseline.py) that builds a two-utterance clip so an <EOU> fires mid-stream; the test asserts our streamed transcript matches NeMo's reset reference exactly and that the second utterance is recovered. Confirmed it fails with the reset disabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mudler merged commit abd0087 into master Jun 6, 2026
8 checks passed

mudler merged 1 commit into master from fix/streaming-eou-reset
