fix: reset the streaming decoder on <EOU>/<EOB> so transcription continues (#13) #15
Merged
…inues (#13)

The realtime EOU model (parakeet_realtime_eou_120m-v1) emits <EOU> / <EOB> as ordinary vocab tokens to mark end of utterance. The cache-aware streaming decode carried the RNN-T decoder state across chunks but never reset it, so once <EOU> was emitted the prediction net stayed conditioned on it and the joint scored blank on every following frame: the stream went silent after the first utterance (issue #13).

This matched NeMo's plain rnnt_decoder_predictions_tensor (which does the same), but that is not how the model is meant to run. NeMo's reference streaming driver for this model (examples/voice_agent/.../nemo/streaming_asr.py NemoStreamingASRService.transcribe) calls reset_state() whenever <EOU>/<EOB> appears in a chunk, so the next utterance decodes from a fresh decoder state.

StreamingSession::feed_mel_chunk now does the same: after a chunk emits <EOU>/<EOB> it resets the RNN-T decoder state (LSTM h/c to zero, last token back to SOS) for the next chunk.

Only the decoder is reset, not the StreamingEncoder cache. NeMo's reset_state also drops the encoder cache, but that was verified byte-identical on the transcript (decoder-only reset == full reset_state on the diffusion 60s/2-EOU and 180s/5-EOU clips), so the validated streaming-encoder path is left untouched. enc_frame_ keeps running so EOU timestamps stay absolute in the clip, and the offline path is unchanged (it matches NeMo offline on single utterances).

Adds a gated regression test (test_streaming_eou_reset) plus a NeMo reset-on-EOU baseline generator (gen_stream_reset_baseline.py) that builds a two-utterance clip so an <EOU> fires mid-stream; the test asserts our streamed transcript matches NeMo's reset reference exactly and that the second utterance is recovered. Confirmed it fails with the reset disabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
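
The reset-after-chunk behaviour described in the commit message can be illustrated with a toy model. This is a minimal sketch, not the code in StreamingSession::feed_mel_chunk: ToySession, DecoderState, feed_chunk, the token ids, and the hidden size are all hypothetical, and the "joint" is replaced by a stub that emits the given candidate token unless the decoder is still conditioned on <EOU>/<EOB>, in which case it emits blank.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical token ids for this sketch only; the real vocabulary ids differ.
constexpr int kSos = 0;
constexpr int kEou = 1;
constexpr int kEob = 2;

inline bool is_end_token(int t) { return t == kEou || t == kEob; }

// Toy stand-in for the RNN-T prediction-net state carried across chunks.
struct DecoderState {
    std::vector<float> h;  // LSTM hidden state
    std::vector<float> c;  // LSTM cell state
    int last_token = kSos;

    explicit DecoderState(std::size_t hidden) : h(hidden, 0.0f), c(hidden, 0.0f) {
        assert(hidden > 0);
    }

    // LSTM h/c back to zero, last token back to SOS.
    void reset() {
        std::fill(h.begin(), h.end(), 0.0f);
        std::fill(c.begin(), c.end(), 0.0f);
        last_token = kSos;
    }
};

struct ToySession {
    DecoderState dec{4};           // hidden size 4 is arbitrary
    long enc_frame = 0;            // absolute encoder frame index, never reset
    bool reset_on_eou = true;      // switch used to reproduce the old behaviour
    std::vector<int> tokens;       // every non-blank token emitted so far
    std::vector<long> eou_frames;  // absolute frame index of each end token

    // Each element of `chunk` is the token a healthy decoder would emit
    // for one encoder frame (an assumption of this toy, one token per frame).
    void feed_chunk(const std::vector<int>& chunk) {
        bool saw_end = false;
        for (int candidate : chunk) {
            // Toy model of the bug: while conditioned on <EOU>/<EOB>,
            // the joint only ever predicts blank.
            if (!is_end_token(dec.last_token)) {
                tokens.push_back(candidate);
                dec.last_token = candidate;
                dec.h[0] += 1.0f;  // pretend LSTM update
                if (is_end_token(candidate)) {
                    saw_end = true;
                    eou_frames.push_back(enc_frame);
                }
            }
            ++enc_frame;  // the encoder side keeps running either way
        }
        // Decoder-only reset, applied once the chunk is done.
        if (saw_end && reset_on_eou) dec.reset();
    }
};
```

With reset_on_eou left at true, feeding two chunks that each end in kEou yields tokens from both utterances and end-token frames 2 and 5; with it set to false, the second chunk produces nothing, which is the silence reported in #13. Because the reset only happens after the chunk, frames following an end token inside the same chunk still decode as blank in this toy.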
Issue

#13: with --stream, transcription stops after the first [EOU]. The model keeps processing the rest of the audio but no further text comes out.

Root cause

The realtime EOU model (parakeet_realtime_eou_120m-v1) emits <EOU>/<EOB> as ordinary vocab tokens marking end of utterance. The cache-aware streaming decode carried the RNN-T decoder state across chunks but never reset it, so once <EOU> was emitted the prediction net stayed conditioned on it and the joint scored blank on every following frame. The stream went silent after the first utterance.

This is not a divergence from NeMo's greedy decode: NeMo's plain rnnt_decoder_predictions_tensor does the exact same thing, offline and streaming. The model is simply not meant to run continuously without a reset.

What upstream does

NeMo's reference streaming driver for this model, examples/voice_agent/.../nemo/streaming_asr.py (NemoStreamingASRService.transcribe), carries partial_hypotheses + encoder caches across chunks just like our StreamingSession, and calls reset_state() whenever <EOU>/<EOB> appears in a chunk so the next utterance decodes from a fresh state.

Fix

StreamingSession::feed_mel_chunk now resets the RNN-T decoder state (LSTM h/c to zero, last token back to SOS) after a chunk emits <EOU>/<EOB>.

Only the decoder is reset, not the StreamingEncoder cache. NeMo's reset_state also drops the encoder cache, but the transcript was verified byte-identical either way (decoder-only reset == full reset on the diffusion 60s / 2-EOU and 180s / 5-EOU clips), so the validated streaming-encoder path is left untouched. enc_frame_ keeps running so EOU timestamps stay absolute in the clip, and the offline path is unchanged (it matches NeMo offline on single utterances).

Verification

On the issue's diffusion2023-07-03 sample, --stream now emits every utterance (2 EOUs at 60s, 5 at 180s), byte-identical to NeMo's reset-on-EOU output.

test_streaming_eou_reset + scripts/gen_stream_reset_baseline.py: builds a two-utterance clip (speech.wav + silence + speech.wav) so an <EOU> fires mid-stream, runs NeMo's cache-aware streaming loop with reset-on-EOU, and asserts our streamed transcript matches it exactly and that the second utterance is recovered. Confirmed it fails with the reset disabled.

test_streaming_decode, test_transcribe_eou, test_capi_stream still pass; full suite 56/56.

The one intentional difference vs NeMo's offline-clip driver: we may emit one extra trailing end-of-clip <EOU> event (the documented streaming-tail artifact). It never changes the transcript, and the real-time NeMo service would emit it too once more audio arrives.

🤖 Generated with Claude Code
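
For readers who want to reproduce the two-utterance input by hand, the speech + silence + speech layout used for the regression clip can be sketched as below. This is only an illustration of the concatenation, not the contents of scripts/gen_stream_reset_baseline.py: the function name make_two_utterance_clip, the float sample format, and the silence length and sample rate passed by the caller are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Concatenate speech + silence + speech so that an end of utterance
// falls in the middle of the stream rather than at the end of the clip.
// Samples are assumed to be mono floats; zeros stand in for silence.
std::vector<float> make_two_utterance_clip(const std::vector<float>& speech,
                                           double silence_seconds,
                                           int sample_rate_hz) {
    assert(silence_seconds >= 0.0 && sample_rate_hz > 0);
    const auto gap = static_cast<std::size_t>(silence_seconds * sample_rate_hz);

    std::vector<float> clip;
    clip.reserve(2 * speech.size() + gap);
    clip.insert(clip.end(), speech.begin(), speech.end());  // first utterance
    clip.insert(clip.end(), gap, 0.0f);                     // silence gap
    clip.insert(clip.end(), speech.begin(), speech.end());  // second utterance
    return clip;
}
```

The gap has to be long enough for the model to emit <EOU> before the second copy of the speech starts; how long that is for this model is not stated here, so the value is left to the caller.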