feat(parakeet-cpp): real segment timestamps (NeMo-faithful) #10207

Merged: mudler merged 3 commits into master from feat/parakeet-segment-timestamps on Jun 7, 2026
@localai-bot
Collaborator

What

Gives the parakeet-cpp backend real, NeMo-faithful segment timestamps instead of the single synthetic whole-clip segment it emits today.

Offline (/v1/audio/transcriptions):

  • Words are grouped into segments exactly like NeMo's get_segment_offsets — a new segment starts after sentence-ending punctuation (., ?, !), and each segment carries start/end plus the token ids whose timestamps fall in its window.
  • Punctuation-only by default (matches NeMo's default segment_gap_threshold=None). An opt-in model option segment_gap_threshold (NeMo's unit: encoder frames, default 0=off) additionally splits on inter-word silence; it's converted to seconds via the new frame_sec the engine reports.
  • Per-segment words remain gated behind timestamp_granularities=["word"]; a zero-word document falls back to a single text segment (no regression).
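The grouping rules above can be sketched in a few lines of Go. This is a minimal illustration, not the backend's actual code: the `Word` and `Segment` types, the `splitWords` name and signature, and the choice to apply the gap check before the punctuation check are all assumptions made for the example, and it does not try to reproduce NeMo's exact branch ordering. It also assumes that a missing `frame_sec` (reported as 0) simply disables the gap rule.

```go
package main

import (
	"fmt"
	"strings"
)

// Word is a hypothetical per-word timestamp record (times in seconds).
type Word struct {
	Text       string
	Start, End float64
}

// Segment is a hypothetical output record for one group of words.
type Segment struct {
	Text       string
	Start, End float64
}

// splitWords groups words into segments. gapFrames is the silence threshold
// in encoder frames (0 disables the gap rule) and frameSec is the duration of
// one encoder frame in seconds. If either is 0, only punctuation splits.
func splitWords(words []Word, gapFrames int, frameSec float64) []Segment {
	gapSec := float64(gapFrames) * frameSec
	var segs []Segment
	var cur []Word

	// flush closes the current group, if any, and appends it as a segment.
	flush := func() {
		if len(cur) == 0 {
			return
		}
		texts := make([]string, len(cur))
		for i, w := range cur {
			texts[i] = w.Text
		}
		segs = append(segs, Segment{
			Text:  strings.Join(texts, " "),
			Start: cur[0].Start,
			End:   cur[len(cur)-1].End,
		})
		cur = nil
	}

	for _, w := range words {
		// Gap rule: close the open segment before this word when the
		// silence since the previous word reaches the threshold.
		if gapSec > 0 && len(cur) > 0 && w.Start-cur[len(cur)-1].End >= gapSec {
			flush()
		}
		cur = append(cur, w)
		// Punctuation rule: a word ending in '.', '?' or '!' closes the segment.
		if w.Text != "" && strings.ContainsAny(w.Text[len(w.Text)-1:], ".?!") {
			flush()
		}
	}
	flush() // trailing words without closing punctuation
	return segs
}

func main() {
	words := []Word{
		{"Hello", 0.0, 0.4}, {"world.", 0.5, 0.9},
		{"How", 1.2, 1.4}, {"are", 1.5, 1.6}, {"you?", 1.7, 2.0},
	}
	for _, s := range splitWords(words, 0, 0) {
		fmt.Printf("%.1f-%.1f %s\n", s.Start, s.End, s.Text)
	}
}
```

An empty word list yields no segments here; the single-text-segment fallback described above would be handled by the caller.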

Streaming (stream=true):

  • When libparakeet.so exposes the new ABI v4 JSON entry points (probed at load), the backend drives parakeet_capi_stream_feed_json / _finalize_json and accumulates the streamed per-word timestamps into per-utterance segments (EOU stays the boundary), so streaming FinalResult segments now carry start/end. Falls back to the existing text-only feed against an older library — no hard version coupling.
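The per-utterance accumulation described above might look roughly like the following. This is a sketch under stated assumptions: `utteranceSegmenter`, its `Feed` method, and the `Word`/`Segment` types are hypothetical names for illustration, and the real backend receives words by parsing the JSON returned from the C API rather than as Go values.

```go
package main

import (
	"fmt"
	"strings"
)

// Word is a hypothetical streamed per-word timestamp (times in seconds).
type Word struct {
	Text       string
	Start, End float64
}

// Segment is a hypothetical finalized utterance with timing.
type Segment struct {
	Text       string
	Start, End float64
}

// utteranceSegmenter buffers streamed words until an end-of-utterance (EOU)
// signal arrives, then emits one segment spanning the buffered words.
type utteranceSegmenter struct {
	pending []Word
}

// Feed adds the words from one streamed chunk. When eou is true and at least
// one word is buffered, it returns the finished segment and true, and resets
// the buffer for the next utterance.
func (s *utteranceSegmenter) Feed(words []Word, eou bool) (Segment, bool) {
	s.pending = append(s.pending, words...)
	if !eou || len(s.pending) == 0 {
		return Segment{}, false
	}
	texts := make([]string, len(s.pending))
	for i, w := range s.pending {
		texts[i] = w.Text
	}
	seg := Segment{
		Text:  strings.Join(texts, " "),
		Start: s.pending[0].Start,
		End:   s.pending[len(s.pending)-1].End,
	}
	s.pending = nil
	return seg, true
}

func main() {
	var seg utteranceSegmenter
	seg.Feed([]Word{{"turn", 0.2, 0.5}}, false)
	if out, ok := seg.Feed([]Word{{"left", 0.6, 1.0}}, true); ok {
		fmt.Printf("%.1f-%.1f %s\n", out.Start, out.End, out.Text)
	}
}
```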

Why

Matches what NeMo produces for these checkpoints (model.transcribe(..., timestamps=True) at the segment level), so downstream consumers get usable segment timing. Diarization/speaker labels are explicitly out of scope; the Parakeet/Nemotron models don't support them.

Depends on

mudler/parakeet.cpp#16 (adds frame_sec to the JSON + the ABI v4 streaming JSON entry points). The Go side probes for the new symbols, so it builds and runs against an older libparakeet.so (punctuation-only, text-only streaming) until that lands.
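The probe-then-fall-back decision can be illustrated without any dynamic loading by abstracting the symbol lookup as a function. Everything here is an assumption for the sake of the example: `pickStreamMode`, `streamMode`, and the `hasSymbol` callback are hypothetical, and the full name `parakeet_capi_stream_finalize_json` is inferred from the `_finalize_json` shorthand in the description.

```go
package main

import "fmt"

// streamMode is a hypothetical enum for how the backend drives streaming.
type streamMode int

const (
	textOnly       streamMode = iota // older library: text-only feed
	jsonTimestamps                   // ABI v4: JSON entry points with word timing
)

// pickStreamMode returns jsonTimestamps only when every required symbol is
// present. hasSymbol stands in for whatever lookup the loader provides.
func pickStreamMode(hasSymbol func(name string) bool) streamMode {
	required := []string{
		"parakeet_capi_stream_feed_json",
		"parakeet_capi_stream_finalize_json", // assumed full name
	}
	for _, name := range required {
		if !hasSymbol(name) {
			return textOnly
		}
	}
	return jsonTimestamps
}

func main() {
	// Simulate an older library that exports neither JSON entry point.
	old := map[string]bool{}
	fmt.Println(pickStreamMode(func(n string) bool { return old[n] }) == textOnly)
}
```

Requiring both symbols before switching modes avoids a half-upgraded state where feeding returns JSON but finalizing does not.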

Tests

Pure-Go Ginkgo specs (no model needed) cover splitWordsIntoSegments (punctuation + gap rules, NeMo elif order, empty/fallback), transcriptResultFromDoc (multi-segment output, token-window assignment, word-granularity gate, zero-word fallback), and the streaming segmenter. make lint (new-from-merge-base) clean; existing model-gated specs still skip without a model.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

mudler added 2 commits June 7, 2026 08:47
Offline: replace the single synthetic whole-clip segment with multiple
segments grouped exactly like NeMo's get_segment_offsets - a new segment
after sentence-ending punctuation ('. ? !'), each carrying start/end and
its time-window token ids. The optional model option segment_gap_threshold
(NeMo's unit: encoder FRAMES, default 0=off) adds NeMo's silence-gap split,
converted to seconds via the JSON frame_sec the engine now reports.
Per-segment words are still gated behind timestamp_granularities=["word"];
a zero-word document falls back to a single text segment.

Streaming: when libparakeet.so exposes the ABI v4 JSON entry points
(probed), drive parakeet_capi_stream_feed_json / _finalize_json and
accumulate the streamed per-word timestamps into per-utterance segments
(EOU stays the boundary), so streaming FinalResult segments now carry
start/end. Falls back to the text-only feed against an older library.

Pure-Go specs cover splitWordsIntoSegments (punctuation + gap rules, NeMo
elif order, fallback), transcriptResultFromDoc (multi-segment, token
windows, word-granularity gate), and the streaming segmenter.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…hreshold

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The offline AudioTranscription specs asserted the old single synthetic
segment (Segments HaveLen(1), Segments[0].Text == res.Text). With
NeMo-faithful segmentation a multi-sentence clip now yields multiple
punctuation-delimited segments, so assert the new contract instead:
one-or-more time-ordered segments, each with text and (under word
granularity) per-segment words whose span tracks the segment start/end.
Caught by running the model-gated suite on the dgx (GB10) against the
real tdt_ctc-110m + realtime_eou models.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
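The new contract described in the commit message above (one or more time-ordered segments, each with text, and word spans that track the segment bounds) could be checked with a plain helper like the one below. This is only a sketch: `checkSegments`, the types, and the tolerance parameter are hypothetical, and the actual specs use Ginkgo matchers rather than a function like this.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// Word and Segment are hypothetical result types (times in seconds).
type Word struct {
	Text       string
	Start, End float64
}

type Segment struct {
	Text       string
	Start, End float64
	Words      []Word // populated only under word granularity
}

// checkSegments verifies the contract: at least one segment, non-empty text,
// non-decreasing start times, and (when words are present) first/last word
// times within tol seconds of the segment's start/end.
func checkSegments(segs []Segment, tol float64) error {
	if len(segs) == 0 {
		return errors.New("expected at least one segment")
	}
	for i, s := range segs {
		if s.Text == "" {
			return fmt.Errorf("segment %d: empty text", i)
		}
		if s.End < s.Start {
			return fmt.Errorf("segment %d: end before start", i)
		}
		if i > 0 && s.Start < segs[i-1].Start {
			return fmt.Errorf("segment %d: not time-ordered", i)
		}
		if n := len(s.Words); n > 0 {
			if math.Abs(s.Words[0].Start-s.Start) > tol ||
				math.Abs(s.Words[n-1].End-s.End) > tol {
				return fmt.Errorf("segment %d: word span does not match bounds", i)
			}
		}
	}
	return nil
}

func main() {
	segs := []Segment{
		{Text: "Hello world.", Start: 0.0, End: 0.9,
			Words: []Word{{"Hello", 0.0, 0.4}, {"world.", 0.5, 0.9}}},
		{Text: "How are you?", Start: 1.2, End: 2.0},
	}
	fmt.Println(checkSegments(segs, 0.01))
}
```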
@mudler mudler merged commit a7cb587 into master Jun 7, 2026
74 of 75 checks passed
@mudler mudler deleted the feat/parakeet-segment-timestamps branch June 7, 2026 20:08