feat(parakeet-cpp): real segment timestamps (NeMo-faithful) #10207
Merged
Offline: replace the single synthetic whole-clip segment with multiple
segments grouped exactly like NeMo's get_segment_offsets - a new segment
after sentence-ending punctuation ('. ? !'), each carrying start/end and
its time-window token ids. The optional model option segment_gap_threshold
(NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split,
converted to seconds via the JSON frame_sec the engine now reports.
Per-segment words are still gated behind timestamp_granularities=["word"];
a zero-word document falls back to a single text segment.
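The grouping rules above can be sketched in a few lines of Go. This is a minimal illustration, not the PR's splitWordsIntoSegments: the Word and Segment types, the splitSketch name, and the frame_sec value of 0.08 are all hypothetical, and the precedence between the gap rule and the punctuation rule is an assumption that does not attempt to reproduce NeMo's elif order.

```go
package main

import (
	"fmt"
	"strings"
)

// Word is a hypothetical per-word timestamp record, in seconds.
type Word struct {
	Text  string
	Start float64
	End   float64
}

// Segment is a hypothetical group of consecutive words.
type Segment struct {
	Text  string
	Start float64
	End   float64
	Words []Word
}

// endsSentence reports whether a word ends with '.', '?' or '!'.
func endsSentence(w string) bool {
	return strings.HasSuffix(w, ".") || strings.HasSuffix(w, "?") || strings.HasSuffix(w, "!")
}

// splitSketch groups words into segments. gapFrames <= 0 disables the
// silence-gap rule; otherwise the threshold is gapFrames*frameSec seconds.
func splitSketch(words []Word, gapFrames int, frameSec float64) []Segment {
	gapSec := float64(gapFrames) * frameSec
	var segs []Segment
	var cur []Word
	flush := func() {
		if len(cur) == 0 {
			return
		}
		texts := make([]string, len(cur))
		for i, w := range cur {
			texts[i] = w.Text
		}
		segs = append(segs, Segment{
			Text:  strings.Join(texts, " "),
			Start: cur[0].Start,
			End:   cur[len(cur)-1].End,
			Words: cur,
		})
		cur = nil
	}
	for _, w := range words {
		// Gap rule (assumed precedence): a long enough silence closes the
		// open segment before the current word is added.
		if gapSec > 0 && len(cur) > 0 && w.Start-cur[len(cur)-1].End >= gapSec {
			flush()
		}
		cur = append(cur, w)
		// Punctuation rule: a sentence-ending word closes the segment.
		if endsSentence(w.Text) {
			flush()
		}
	}
	flush() // trailing words without final punctuation
	return segs
}

func main() {
	words := []Word{
		{"Hello", 0.0, 0.4}, {"world.", 0.5, 0.9},
		{"How", 1.2, 1.4}, {"are", 1.5, 1.6}, {"you?", 1.7, 2.0},
	}
	// 0 frames = gap rule off; 0.08 is an illustrative frame_sec value.
	for _, s := range splitSketch(words, 0, 0.08) {
		fmt.Printf("%.2f-%.2f %s\n", s.Start, s.End, s.Text)
	}
}
```

Keeping the threshold in frames at the option level and converting once with frame_sec means the same option value carries over from a NeMo configuration, while the comparison itself happens in seconds.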
Streaming: when libparakeet.so exposes the ABI v4 JSON entry points
(probed), drive parakeet_capi_stream_feed_json / _finalize_json and
accumulate the streamed per-word timestamps into per-utterance segments
(EOU stays the boundary), so streaming FinalResult segments now carry
start/end. Falls back to the text-only feed against an older library.
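As a rough illustration of the streaming side, the sketch below accumulates per-word timestamps and closes a segment when the caller signals an utterance boundary. The StreamWord, Utterance, and utteranceAccumulator names are hypothetical and not taken from the PR; how the real backend parses the JSON feed or detects EOU is not shown, and the fallback to the text-only feed is out of scope here.

```go
package main

import "fmt"

// StreamWord is a hypothetical per-word timestamp delivered by a streaming feed.
type StreamWord struct {
	Text  string
	Start float64
	End   float64
}

// Utterance is a hypothetical finished segment, closed at an EOU boundary.
type Utterance struct {
	Text  string
	Start float64
	End   float64
}

// utteranceAccumulator collects words until the utterance ends.
type utteranceAccumulator struct {
	words []StreamWord
}

// Add appends the words reported by one feed call.
func (a *utteranceAccumulator) Add(ws ...StreamWord) {
	a.words = append(a.words, ws...)
}

// Flush closes the current utterance (for example on EOU or finalize) and
// resets the accumulator. ok is false when no words were accumulated.
func (a *utteranceAccumulator) Flush() (u Utterance, ok bool) {
	if len(a.words) == 0 {
		return Utterance{}, false
	}
	text := ""
	for i, w := range a.words {
		if i > 0 {
			text += " "
		}
		text += w.Text
	}
	u = Utterance{
		Text:  text,
		Start: a.words[0].Start,
		End:   a.words[len(a.words)-1].End,
	}
	a.words = a.words[:0]
	return u, true
}

func main() {
	var acc utteranceAccumulator
	acc.Add(StreamWord{"turn", 0.2, 0.5}, StreamWord{"left", 0.6, 0.9})
	acc.Add(StreamWord{"now", 1.0, 1.3})
	if u, ok := acc.Flush(); ok { // caller saw an EOU here
		fmt.Printf("%.2f-%.2f %s\n", u.Start, u.End, u.Text)
	}
}
```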
Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo
elif order, fallback), transcriptResultFromDoc (multi-segment, token
windows, word-granularity gate), and the streaming segmenter.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
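The "time-window token ids" mentioned in the commit message can be pictured as a simple filter over token timestamps. The sketch below is an assumption-laden illustration: the Token type and tokensInWindow helper are hypothetical, and treating the window as half-open ([start, end)) is a choice made here so that a token on a shared boundary lands in only one segment; the PR may define the window differently.

```go
package main

import "fmt"

// Token is a hypothetical token record: its id and start time in seconds.
type Token struct {
	ID    int
	Start float64
}

// tokensInWindow returns the ids of tokens whose start time falls in the
// half-open window [start, end). The half-open choice is an assumption.
func tokensInWindow(tokens []Token, start, end float64) []int {
	var ids []int
	for _, t := range tokens {
		if t.Start >= start && t.Start < end {
			ids = append(ids, t.ID)
		}
	}
	return ids
}

func main() {
	tokens := []Token{{ID: 7, Start: 0.0}, {ID: 8, Start: 0.4}, {ID: 9, Start: 1.2}}
	fmt.Println(tokensInWindow(tokens, 0.0, 1.0))
	fmt.Println(tokensInWindow(tokens, 1.0, 2.0))
}
```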
…hreshold
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The offline AudioTranscription specs asserted the old single synthetic segment (Segments HaveLen(1), Segments[0].Text == res.Text). With NeMo-faithful segmentation a multi-sentence clip now yields multiple punctuation-delimited segments, so assert the new contract instead: one or more time-ordered segments, each with text and (under word granularity) per-segment words whose span tracks the segment start/end.

Caught by running the model-gated suite on the dgx (GB10) against the real tdt_ctc-110m + realtime_eou models.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
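The new contract can be stated as a plain validation function. The real specs are written with Ginkgo; the version below is only a dependency-free sketch with hypothetical names (W, Seg, checkSegments), and the 1e-6 tolerance used for "span tracks the segment start/end" is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"strings"
)

// W is a hypothetical per-word record, in seconds.
type W struct {
	Text  string
	Start float64
	End   float64
}

// Seg is a hypothetical transcription segment.
type Seg struct {
	Text  string
	Start float64
	End   float64
	Words []W
}

// checkSegments verifies: at least one segment, non-empty text, segments
// ordered by start time, and (when wantWords is set) words whose first start
// and last end match the segment bounds within a small tolerance.
func checkSegments(segs []Seg, wantWords bool) error {
	const tol = 1e-6 // assumed tolerance
	if len(segs) == 0 {
		return errors.New("no segments")
	}
	for i, s := range segs {
		if strings.TrimSpace(s.Text) == "" {
			return fmt.Errorf("segment %d: empty text", i)
		}
		if s.End < s.Start {
			return fmt.Errorf("segment %d: end before start", i)
		}
		if i > 0 && s.Start < segs[i-1].Start {
			return fmt.Errorf("segment %d: not time-ordered", i)
		}
		if !wantWords {
			continue
		}
		if len(s.Words) == 0 {
			return fmt.Errorf("segment %d: no words", i)
		}
		first, last := s.Words[0], s.Words[len(s.Words)-1]
		if math.Abs(first.Start-s.Start) > tol || math.Abs(last.End-s.End) > tol {
			return fmt.Errorf("segment %d: word span does not match bounds", i)
		}
	}
	return nil
}

func main() {
	segs := []Seg{
		{Text: "Hi there.", Start: 0, End: 1, Words: []W{{"Hi", 0, 0.5}, {"there.", 0.5, 1}}},
		{Text: "Bye.", Start: 1.5, End: 2, Words: []W{{"Bye.", 1.5, 2}}},
	}
	fmt.Println(checkSegments(segs, true))
}
```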
What
Gives the parakeet-cpp backend real, NeMo-faithful segment timestamps instead of the single synthetic whole-clip segment it emits today.

Offline (/v1/audio/transcriptions):
- Segments are grouped like NeMo's get_segment_offsets: a new segment starts after sentence-ending punctuation ('.', '?', '!'), and each segment carries start/end plus the token ids whose timestamps fall in its window.
- Splitting is punctuation-only by default (NeMo's segment_gap_threshold=None). An opt-in model option segment_gap_threshold (NeMo's unit: encoder frames, default 0 = off) additionally splits on inter-word silence; it is converted to seconds via the new frame_sec the engine reports.
- Per-segment words remain gated behind timestamp_granularities=["word"]; a zero-word document falls back to a single text segment (no regression).

Streaming (stream=true):
- When libparakeet.so exposes the new ABI v4 JSON entry points (probed at load), the backend drives parakeet_capi_stream_feed_json / _finalize_json and accumulates the streamed per-word timestamps into per-utterance segments (EOU stays the boundary), so streaming FinalResult segments now carry start/end.
- Against an older library it falls back to the existing text-only feed, so there is no hard version coupling.

Why
Matches what NeMo produces for these checkpoints (model.transcribe(..., timestamps=True) at the segment level), so downstream consumers get usable segment timing. Diarization/speaker labels are explicitly out of scope: the Parakeet/Nemotron models don't support them.

Depends on
mudler/parakeet.cpp#16 (adds frame_sec to the JSON + the ABI v4 streaming JSON entry points). The Go side probes for the new symbols, so it builds and runs against an older libparakeet.so (punctuation-only, text-only streaming) until that lands.

Tests
Pure-Go Ginkgo specs (no model needed) cover splitWordsIntoSegments (punctuation + gap rules, NeMo elif order, empty/fallback), transcriptResultFromDoc (multi-segment output, token-window assignment, word-granularity gate, zero-word fallback), and the streaming segmenter. make lint (new-from-merge-base) is clean; existing model-gated specs still skip without a model.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]