This document records the numerical and end-to-end parity of the C++/ggml inference path against the NeMo reference implementation.
Anchor checkpoint: nvidia/parakeet-tdt_ctc-110m — the CTC head of the
hybrid TDT/CTC model (selected via m.change_decoding_strategy(decoder_type='ctc')).
NeMo version 2.7.3. All comparisons are CPU, batch size 1, deterministic
(CTC greedy decode).
Every published offline Parakeet checkpoint validated end-to-end against NeMo on
tests/fixtures/speech.wav (LibriSpeech 2086-149220-0033, ~7.4 s, English).
WER vs NeMo is word-error-rate (0.0 = byte-for-byte identical). All runs are
CPU, batch 1, deterministic greedy (NeMo 2.7.3).
| Checkpoint | Family | arch | d_model / layers | mel | xscaling | vocab | Validated head(s) | WER vs NeMo | Status |
|---|---|---|---|---|---|---|---|---|---|
parakeet-tdt_ctc-110m |
Hybrid TDT+CTC | hybrid_tdt_ctc |
512 / 17 | 80 | false | 1024 | TDT + CTC | 0.0 | PASS |
parakeet-tdt-0.6b-v2 |
TDT (hybrid) | hybrid_tdt_ctc |
1024 / 24 | 128 | false | 1024 | TDT | 0.0 | PASS |
parakeet-tdt-0.6b-v3 |
TDT (hybrid, multilingual) | hybrid_tdt_ctc |
1024 / 24 | 128 | false | 8192 | TDT | 0.0 | PASS |
parakeet-tdt-0.6b (v1) |
TDT | — | — | — | — | — | — | — | N/A — not published (superseded by v2) |
parakeet-tdt-1.1b |
TDT | tdt |
1024 / 42 | 80 | false | 1024 | TDT | 0.0 | PASS |
parakeet-tdt_ctc-1.1b |
Hybrid TDT+CTC | hybrid_tdt_ctc |
1024 / 42 | 80 | false | 1024 | TDT + CTC | 0.0 | PASS |
parakeet-ctc-0.6b |
CTC | ctc |
1024 / 24 | 80 | true | 1024 | CTC | 0.0 | PASS |
parakeet-ctc-1.1b |
CTC | ctc |
1024 / 42 | 80 | true | 1024 | CTC | 0.0 | PASS |
parakeet-rnnt-0.6b |
RNNT | rnnt |
1024 / 24 | 80 | true | 1024 | RNNT | 0.0 | PASS |
parakeet-rnnt-1.1b |
RNNT | rnnt |
1024 / 42 | 80 | true | 1024 | RNNT | 0.0 | PASS |
parakeet_realtime_eou_120m-v1 |
Streaming + EOU | rnnt |
512 / 17 | 128 | false | 1026 | RNNT (offline, limited-context) | 0.0 | PASS (Phase 5 — 5a milestone) |
nemotron-3.5-asr-streaming-0.6b |
Streaming, multilingual, prompt-conditioned | rnnt |
1024 / 24 | 128 | false | 13087 | RNNT offline + cache-aware streaming, per language | 0.0 | PASS (offline + streaming, langs en/de/es/ja-JP/auto) |
Notes:
xscaling= NeMo FastConformerxscale=sqrt(d_model)(true) vsxscale=None(false).- Hybrid models validated on both TDT and CTC heads where practical.
parakeet-tdt-1.1bis labelledtdt(pure, noaux_ctc); the two 0.6B TDT checkpoints are labelledhybrid_tdt_ctc(NeMo stores them asEncDecRNNTBPEModelwithcfg.aux_ctc).- Committed regression tests lock the CTC and RNNT decoder paths against
regression:
tests/test_transcribe_ctc.cpp(PARAKEET_TEST_GGUF_CTC) andtests/test_transcribe_rnnt.cpp(PARAKEET_TEST_GGUF_RNNT) — both skip (exit 77) in CI unless the corresponding GGUF env var is set. parakeet_realtime_eou_120m-v1offline (5a): vocab 1026 includes<EOU>=1024 and<EOB>=1025; blank_id=1026 (V_plus=1027); the transducer emits<EOU>literally at end-of-utterance — the C++ offline transcript matches NeMo including the trailing<EOU>token.parakeet_realtime_eou_120m-v1cache-aware streaming (5b/5c, Tasks 5-7): the streaming encoder output == offline (leading frames, max|d|≈3e-6) and == NeMo'scache_aware_stream_stepoutput; streaming decode tokens == NeMo cache-aware streaming == offline minus the trailing streaming-tail<EOU>. EOU/EOB are surfaced as timed events (stripped from the running text). For this clip NeMo's streaming does NOT emit<EOU>(the final-chunk tail has incomplete right context); the C++ streaming session/C-API/CLI match that exactly and do not fabricate one. See "Phase 5 — Streaming + EOU" below.nemotron-3.5-asr-streaming-0.6b(multilingual, prompt-conditioned): a target language one-hot (--lang <locale>, defaultauto) is projected through theprompt_kernel(Linear, ReLU, Linear) on the encoder output before the RNNT decode, both offline and per streaming chunk (the one-hot is constant over time, so per-chunk application is exact). The authoritative NeMo reference decodes the prompt-conditioned encoder output viam.decoding.rnnt_decoder_predictions_tensor(the lhotsetranscribe(target_lang=...)path needs per-cut language metadata our bare wav fixtures lack).scripts/e2e_nemo_compare.pycross-checks the C++ CLI against this reference fortests/fixtures/{speech,clip}.wav×{en, de, es, ja-JP, auto}×{offline, stream}: all 20 rows WER 0.0. Theprompt_kernel, LSTM prediction net, and featurizer tensors stay F32 in every quantized variant (f16 and q8_0 also verified WER 0.0; seedocs/quantization.md). Notejais not a dictionary key — the Japanese locale isja-JP.
The decisive proof: a real English utterance transcribed by the full C++ pipeline (mel → subsampling → FastConformer encoder → CTC head → CTC greedy decode → BPE detokenize) compared word-for-word against NeMo's CTC-head transcript of the same clip.
Fixture: tests/fixtures/speech.wav — LibriSpeech dev-clean sample
2086-149220-0033, 16 kHz mono, 16-bit PCM, ~7.4 s. Obtained from the public
NeMo tutorial mirror
https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
(already 16 kHz mono; re-encoded to canonical 16-bit PCM via soundfile).
This clip is substantially harder than the synthetic-tone clip used for the
per-stage tensor-parity tests: it is ~3.7x longer and contains real speech, so
it exercises per-feature normalization and valid-length masking at a different
sequence length T, and a non-trivial greedy collapse over a real token stream
— exactly the length-dependent code paths a single fixed-length tone clip could
not have validated.
| Source | Transcript |
|---|---|
| NeMo CTC reference | Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. |
C++ parakeet-cli transcribe |
Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. |
Result: byte-for-byte identical. WER = 0.0 (0 edits over 23 reference words).
Reproduce:
# NeMo CTC reference
.venv/bin/python -c "from nemo.collections.asr.models import ASRModel; \
m=ASRModel.from_pretrained('nvidia/parakeet-tdt_ctc-110m'); \
m.change_decoding_strategy(decoder_type='ctc'); \
print(m.transcribe(['tests/fixtures/speech.wav'])[0].text)"
# C++ pipeline
./build/examples/cli/parakeet-cli transcribe --model /tmp/pk110m.gguf \
--input tests/fixtures/speech.wavA committed regression test asserts this match deterministically:
tests/test_transcribe_speech.cpp (ctest test_transcribe_speech, label
model). It runs pk::transcribe on the committed speech.wav and asserts the
output equals the stored NeMo CTC reference string.
Each stage diffs its C++ output against the matching tensor dumped from NeMo by
scripts/gen_nemo_baseline.py (stored in baseline.gguf). Numbers below are
the measured max|diff| from the ctest suite on tests/fixtures/clip.wav.
| Stage | ctest | max|diff| | Notes |
|---|---|---|---|
| Log-mel front end | test_mel |
4.13e-05 | mel [80, T] |
| dw_striding subsampling (÷8) | test_subsampling |
6.35e-03 | subsampling_out; largest-magnitude stage |
| Relative-position MHA (layer 0) | test_relpos_attention |
1.71e-03 | l0_attn_out |
| Conformer conv submodule | test_conformer |
1.58e-03 | l0_conv_out |
| Conformer layer 0 (full) | test_conformer |
5.34e-05 | enc_layer_0 |
| Relative positional encoding | test_encoder |
1.55e-06 | pos_emb [2T'-1, d_model] |
| Encoder (full, last layer) | test_encoder |
2.40e-05 | encoder_out [d_model, T'] |
| CTC head (log-softmax logits) | test_ctc |
1.72e-05 | ctc_logits [T', vocab+1] |
Notes:
- Diffs are absolute. The subsampling stage carries the largest absolute diff (6.3e-3) because its outputs have large magnitudes (O(1e3)); the relative error there is ~1e-6. Downstream stages (encoder, CTC) re-normalize and the absolute diff collapses back to O(1e-5).
- The CTC
exp-row-sums equal 1.0 (the baseline stores post-log_softmaxlogits, and the C++ head matches that orientation). - All stage diffs are well within the plan's tolerances and the greedy argmax / collapse is unaffected, which is why the end-to-end transcript is exact.
| Component | max|diff| | tolerance |
|---|---|---|
| prediction net (embedding + 1-layer LSTM) | 1.5e-06 | 2e-3 |
| joint network (enc/pred proj → ReLU → linear, raw logits) | 1.1e-05 | 5e-3 |
| composed pred→joint (integration) | 1.3e-05 | 5e-3 |
- The joint emits raw logits
[T, U, 1030]= 1024 vocab + 1 blank + 5 TDT durations[0,1,2,3,4]; the token/duration split + separate log_softmax is a Phase 3 (greedy) concern. - The prediction net prepends a literal zero SOS step (
add_sos), so 4 input ids → 5 hidden states. PyTorch LSTM gate order[i,f,g,o], both biases summed.
The decisive Phase 3 deliverable: the real Parakeet-TDT models people use,
transcribed by the full C++ TDT path (mel → FastConformer encoder → prediction
net + joint → duration-aware greedy decode → BPE detokenize) and compared
word-for-word against NeMo's TDT-head transcript of the same clip. All on
tests/fixtures/speech.wav (LibriSpeech 2086-149220-0033, English, ~7.4 s).
NeMo 2.7.3, CPU, batch 1, deterministic greedy.
| Model | arch | d_model / layers / heads | mels | conv norm | xscaling | durations | vocab | pred LSTM layers |
|---|---|---|---|---|---|---|---|---|
parakeet-tdt_ctc-110m (TDT head) |
hybrid_tdt_ctc |
512 / 17 / 8 | 80 | batch_norm | false | [0,1,2,3,4] |
1024 | 1 |
parakeet-tdt-0.6b-v2 |
hybrid_tdt_ctc |
1024 / 24 / 8 | 128 | batch_norm | false | [0,1,2,3,4] |
1024 | 2 |
parakeet-tdt-0.6b-v3 |
hybrid_tdt_ctc |
1024 / 24 / 8 | 128 | batch_norm | false | [0,1,2,3,4] |
8192 | 2 |
(xscaling is false on all three — NeMo's FastConformer xscale=None. The
0.6B checkpoints load in NeMo as EncDecRNNTBPEModel; the converter labels them
hybrid_tdt_ctc because cfg.aux_ctc is present, and the C++ default routes to
the TDT head either way, matching NeMo's default transducer transcription.)
| Model | NeMo (TDT) | C++ parakeet-cli transcribe --decoder tdt |
WER |
|---|---|---|---|
parakeet-tdt_ctc-110m |
Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. |
Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. |
0.0 |
parakeet-tdt-0.6b-v2 |
Well, I don't wish to see it any more, observed Phebe, turning away her eyes. It is certainly very like the old portrait. |
Well, I don't wish to see it any more, observed Phebe, turning away her eyes. It is certainly very like the old portrait. |
0.0 |
parakeet-tdt-0.6b-v3 |
Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. |
Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. |
0.0 |
All three are byte-for-byte identical to NeMo (0 edits over 23 reference
words). The only inter-model difference is the BPE spelling Phebe (v2 1024-token
vocab) vs Phoebe (v3 8192-token vocab) — both faithfully reproduce their own
tokenizer, exactly as NeMo does. This is the Phase 3 headline: the production
parakeet-tdt-0.6b-v2 and multilingual -v3 transcribe real speech identically
to NeMo through the C++/ggml port.
Reproduce:
.venv/bin/python scripts/convert_parakeet_to_gguf.py \
--model nvidia/parakeet-tdt-0.6b-v2 --output /tmp/tdt06v2.gguf
./build/examples/cli/parakeet-cli info /tmp/tdt06v2.gguf
./build/examples/cli/parakeet-cli transcribe \
--model /tmp/tdt06v2.gguf --input tests/fixtures/speech.wav --decoder tdt
# NeMo reference (TDT is the default head for this model):
.venv/bin/python -c "from nemo.collections.asr.models import ASRModel; \
m=ASRModel.from_pretrained('nvidia/parakeet-tdt-0.6b-v2'); \
print(m.transcribe(['tests/fixtures/speech.wav'])[0].text)"The 110m anchor (xscaling=False, 1-layer LSTM, conv/linear biases present) had
matched NeMo, so two latent assumptions surfaced only on the larger checkpoints.
The encoder output was already byte-exact on 0.6B-v2 (enc_out mean 1.80e-06,
std 0.0763, range ±0.69 — identical to NeMo); the divergences were:
-
Optional Conv1d / Linear biases. NeMo configures the FastConformer FFN and attention
nn.Linears and the conv submodule'sConv1ds withbias=Falseonparakeet-tdt-0.6b-v2/-v3(they arebias=Trueon the 110m). The C++ conformer/attention unconditionally fetched the.biastensors → null-deref / segfault. Fix:src/conformer.cppandsrc/relpos_attention.cppnow add the bias only when the tensor is actually present in the checkpoint (clone_weight_opt/ml.tensor(...)guard). -
Stacked prediction LSTM.
parakeet.decoder.pred_rnn_layers = 2on the 0.6B models (vs 1 on the 110m). The C++PredictionNetran only LSTM layer_l0, ignoring_l1, producing a wrong prediction-net output and a garbage transcript. Fix:src/prediction.{hpp,cpp}now readpred_rnn_layersfrom the GGUF and run a true stacked LSTM (PredStatecarries per-layer(h,c); layerl>0consumes layerl-1's hidden output), defaulting to 1 layer.
Both fixes honor the GGUF metadata / actual tensors; no model is special-cased. Per-stage tensor parity (above) and the 110m end-to-end TDT/CTC transcripts are unchanged (all earlier ctests still pass).
The standalone EncDecCTCModelBPE checkpoints (parakeet-ctc-*) are the first
end-to-end validation of two paths that no prior checkpoint exercised:
- Standalone CTC head name. A hybrid model stores the CTC linear head at
ctc_decoder.decoder_layers.0.{weight,bias}; a standalone CTC model stores the SAME layer atdecoder.decoder_layers.0.{weight,bias}.pk::CTCDecodernow tries the hybrid name first and falls back to the standalone name (clear error only if neither exists) —src/ctc_decoder.cpp, no model special-cased. - Encoder xscaling=True. Every previously validated checkpoint had
xscaling=False;parakeet-ctc-0.6b/-1.1bset NeMo's FastConformerxscale=sqrt(d_model). This is the first real exercise of the C++ encoder's*sqrt(d_model)branch (pk::Encoder, gated oncfg.xscaling). It worked out of the box — no fix was needed; both transcripts matched NeMo exactly.
| Model | arch | d_model / layers / heads | mels | conv norm | xscaling | vocab | CTC head tensor |
|---|---|---|---|---|---|---|---|
parakeet-ctc-0.6b |
ctc |
1024 / 24 / 8 | 80 | batch_norm | true | 1024 | decoder.decoder_layers.0.* |
parakeet-ctc-1.1b |
ctc |
1024 / 42 / 8 | 80 | batch_norm | true | 1024 | decoder.decoder_layers.0.* |
| Model | NeMo (CTC) | C++ parakeet-cli transcribe --decoder ctc |
WER |
|---|---|---|---|
parakeet-ctc-0.6b |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
0.0 |
parakeet-ctc-1.1b |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
0.0 |
Both are byte-for-byte identical to NeMo (0 edits over 23 reference words).
These standalone CTC models emit lowercase, unpunctuated text (their own
tokenizer/training), distinct from the cased/punctuated hybrid + TDT models — and
the C++ port faithfully reproduces each model's own output. Validated with the
reusable harness scripts/validate_vs_nemo.py:
.venv/bin/python scripts/convert_parakeet_to_gguf.py \
--model nvidia/parakeet-ctc-0.6b --output /tmp/val_ctc06.gguf
./build/examples/cli/parakeet-cli info /tmp/val_ctc06.gguf # arch=ctc, xscaling=true
.venv/bin/python scripts/validate_vs_nemo.py \
--model nvidia/parakeet-ctc-0.6b --gguf /tmp/val_ctc06.gguf \
--audio tests/fixtures/speech.wav --head ctc
# -> MODEL nvidia/parakeet-ctc-0.6b HEAD ctc arch=ctc xscaling=true WER 0.0000 ... PASSThe harness loads the NeMo model, auto-selects the head (CTC model→ctc,
RNNT/TDT→rnnt, hybrid→transducer), gets the NeMo reference via m.transcribe,
shells out to parakeet-cli transcribe (head ctc→--decoder ctc,
rnnt/tdt→--decoder tdt), computes word-level WER, and exits 0 on PASS / 1 on
WER>0 / 77 if NeMo can't be imported.
The pure-RNNT EncDecRNNTBPEModel checkpoints (parakeet-rnnt-*) are the first
end-to-end validation of the standard RNN-Transducer greedy loop — a path no
prior checkpoint exercised. Every transducer validated in Phase 3 was a TDT model
(its joint emits vocab+1+num_durations and decoding is duration-aware). A pure
RNNT model has no duration head: its joint output is exactly vocab+1
(Joint::num_durations()==0, V_plus=vocab+1), and greedy decoding advances time
by exactly one frame on a blank and emits-and-stays on a non-blank (capped by
max_symbols).
pk::rnnt_greedy(src/rnnt.hpp/src/rnnt.cpp) ports NeMoGreedyRNNTInfer._greedy_decode. Versuspk::tdt_greedy: no duration logits (argmax over the fullvocab+1joint output); blank → advancetby 1; non-blank → emit + commit state + stay att(capped bymax_symbols=10).- Routing (
src/parakeet.cpp): when the transducer head is selected, the loader'scfg.tdt_durationsnow decides the loop — non-empty →tdt_greedy(unchanged); empty →rnnt_greedy. This makesarch ∈ {rnnt, hybrid_rnnt_ctc}(and any TDT-less transducer) work; it replaces the old guard that fell back to CTC when no durations were configured (which a pure-RNNT model lacks). TDT and CTC paths are unchanged. Bothparakeet-rnnt-*models also setxscaling=true.
| Model | arch | d_model / layers / heads | mels | conv norm | xscaling | durations | vocab |
|---|---|---|---|---|---|---|---|
parakeet-rnnt-0.6b |
rnnt |
1024 / 24 / 8 | 80 | batch_norm | true | none | 1024 |
parakeet-rnnt-1.1b |
rnnt |
1024 / 42 / 8 | 80 | batch_norm | true | none | 1024 |
| Model | NeMo (RNNT) | C++ parakeet-cli transcribe --decoder tdt |
WER |
|---|---|---|---|
parakeet-rnnt-0.6b |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
0.0 |
parakeet-rnnt-1.1b |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
0.0 |
Both are byte-for-byte identical to NeMo (0 edits over 23 reference words) —
the rnnt_greedy loop matched on the first run with no divergence debugging
needed. The CLI --decoder tdt selects the transducer head and now routes to
rnnt_greedy because the duration table is empty. Validated with the harness:
.venv/bin/python scripts/convert_parakeet_to_gguf.py \
--model nvidia/parakeet-rnnt-0.6b --output /tmp/val_rnnt06.gguf
./build/examples/cli/parakeet-cli info /tmp/val_rnnt06.gguf # arch=rnnt, xscaling=true
.venv/bin/python scripts/validate_vs_nemo.py \
--model nvidia/parakeet-rnnt-0.6b --gguf /tmp/val_rnnt06.gguf \
--audio tests/fixtures/speech.wav --head rnnt
# -> MODEL nvidia/parakeet-rnnt-0.6b HEAD rnnt arch=rnnt xscaling=true WER 0.0000 ... PASSThe remaining published TDT and hybrid-TDT/CTC checkpoints, validated end-to-end
with scripts/validate_vs_nemo.py on tests/fixtures/speech.wav (NeMo 2.7.3,
CPU, batch 1, deterministic greedy). No code change was required — every one
loaded purely from GGUF metadata (the loader already handles 80/128 mel,
1024/8192 vocab, 1/2 LSTM layers, 24/42 layers, xscaling true/false, optional
biases). The 1.1B TDT model is the first checkpoint NeMo restores as a pure
EncDecRNNTBPEModel whose converter label is tdt (it has no aux_ctc), so it
also exercises the standalone-tdt-arch routing — distinct from the -0.6b-v2/-v3
hybrids that NeMo restores as RNNT but the converter labels hybrid_tdt_ctc.
| Model | HF id | arch | d_model / layers / heads | mels | conv norm | xscaling | durations | vocab |
|---|---|---|---|---|---|---|---|---|
| parakeet-tdt-0.6b (v1) | nvidia/parakeet-tdt-0.6b |
— | N/A — not published | — | — | — | — | — |
| parakeet-tdt-1.1b | nvidia/parakeet-tdt-1.1b |
tdt |
1024 / 42 / 8 | 80 | batch_norm | false | [0,1,2,3,4] |
1024 |
| parakeet-tdt_ctc-1.1b | nvidia/parakeet-tdt_ctc-1.1b |
hybrid_tdt_ctc |
1024 / 42 / 8 | 80 | batch_norm | false | [0,1,2,3,4] |
1024 |
The v1 nvidia/parakeet-tdt-0.6b is not published under that id (HF API 401 /
converter 404; the only public nvidia parakeet-tdt repos are -0.6b-v2,
-0.6b-v3, -1.1b, the parakeet-tdt_ctc-* hybrids, and -tdt_ctc-0.6b-ja).
It was superseded by parakeet-tdt-0.6b-v2, already validated at WER 0 in Phase 3.
| Model | head | NeMo | C++ | WER |
|---|---|---|---|---|
parakeet-tdt-1.1b |
TDT (--decoder tdt) |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
0.0 |
parakeet-tdt_ctc-1.1b |
TDT (--decoder tdt) |
Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. |
Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. |
0.0 |
parakeet-tdt_ctc-1.1b |
CTC (--decoder ctc) |
Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. |
Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait. |
0.0 |
All are byte-for-byte identical to NeMo (0 edits over 23 reference words). The
pure-TDT parakeet-tdt-1.1b emits lowercase/unpunctuated text (its own training);
the hybrid parakeet-tdt_ctc-1.1b emits cased/punctuated text on both of its
heads — and the C++ port reproduces each head exactly, including the second head
(CTC) of the hybrid via --decoder ctc. Reproduce:
.venv/bin/python scripts/convert_parakeet_to_gguf.py \
--model nvidia/parakeet-tdt_ctc-1.1b --output /tmp/val_tdtctc11.gguf
./build/examples/cli/parakeet-cli info /tmp/val_tdtctc11.gguf # arch=hybrid_tdt_ctc
.venv/bin/python scripts/validate_vs_nemo.py \
--model nvidia/parakeet-tdt_ctc-1.1b --gguf /tmp/val_tdtctc11.gguf \
--audio tests/fixtures/speech.wav --head rnnt # TDT head; --head ctc for the CTC head
# -> MODEL nvidia/parakeet-tdt_ctc-1.1b HEAD rnnt arch=hybrid_tdt_ctc xscaling=false WER 0.0000 ... PASSctest --test-dir build --output-on-failure (with PARAKEET_TEST_GGUF,
PARAKEET_TEST_BASELINE, PARAKEET_TEST_BASELINE_SPEECH exported): all runnable
tests pass. The full suite includes the 15 Phase 0/1 tests, test_prediction,
test_joint, test_transducer_core, test_prediction_step, test_tdt_greedy,
test_transcribe_tdt, plus three large-model regression tests that skip (exit 77)
when the corresponding GGUF env var is absent:
| Test | Env var | Decoder path locked |
|---|---|---|
test_transcribe_0_6b |
PARAKEET_TEST_GGUF_06B |
TDT greedy (parakeet-tdt-0.6b-v2/-v3) |
test_transcribe_ctc |
PARAKEET_TEST_GGUF_CTC |
Standalone CTC head (parakeet-ctc-0.6b) |
test_transcribe_rnnt |
PARAKEET_TEST_GGUF_RNNT |
RNNT greedy (parakeet-rnnt-0.6b) |
Each asserts the C++ transcript equals the stored NeMo reference word-for-word
(WER 0) on tests/fixtures/speech.wav. All three carry the model label and run
from the project root (WORKING_DIRECTORY).
Model: nvidia/parakeet_realtime_eou_120m-v1 — cache-aware streaming
FastConformer + RNNT (pure RNNT, no TDT duration table).
Offline (limited-context) validated: the full pipeline — mel → encoder with
conv_norm_type=layer_norm, causal depthwise conv, causal subsampling, and
chunked-limited attention mask (att_context_size=[70,1],
att_context_style=chunked_limited) — matches NeMo's offline transcript on
tests/fixtures/speech.wav at WER 0 (byte-for-byte identical), including
the exact token-id sequence (45 tokens, last = 1024 = <EOU>).
| Field | Value |
|---|---|
| arch | rnnt |
| d_model / layers / heads | 512 / 17 / 8 |
| mel | 128 |
| conv_norm_type | layer_norm |
| conv_causal | true |
| causal_downsampling | true |
| att_context_size | [70, 1] |
| att_context_style | chunked_limited |
| xscaling | false |
| vocab_size | 1026 (<EOU>=1024, <EOB>=1025) |
| blank_id | 1026 (V_plus = 1027) |
| tdt_durations | none (pure RNNT → rnnt_greedy) |
| Mode | NeMo (offline RNNT) | C++ parakeet-cli transcribe --decoder tdt |
WER |
|---|---|---|---|
| Offline (limited-context) | well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait<EOU> |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait<EOU> |
0.0 |
The C++ transcript is byte-for-byte identical to NeMo, including the literal
<EOU> token at the end (token id 1024 emitted by the RNNT transducer). The
token-id sequence (45 tokens) is also identical to the NeMo reference stored in
/tmp/baseline_eou.gguf (rnnt_token_ids tensor).
Note: for this offline 5a milestone, <EOU> appears literally in the detokenized
text — it is rendered by the BPE detokenizer as the piece string "<EOU>".
EOU-as-a-separate-event (stripping it from text + exposing a timed EOU signal) is
planned for Task 7 (Part 5c).
tests/test_transcribe_eou.cpp (test_transcribe_eou, label model,
WORKING_DIRECTORY) — skips (exit 77) unless both PARAKEET_TEST_GGUF_EOU and
PARAKEET_TEST_BASELINE_EOU are set. Asserts:
pk::transcribe(eou.gguf, speech.wav, kTDT)equalsbaseline.rnnt_text(incl.<EOU>).- The raw RNNT token-id sequence equals the
rnnt_token_idsint32 tensor from the baseline.
Reproduce:
# Convert (if not already done)
.venv/bin/python scripts/convert_parakeet_to_gguf.py \
--model nvidia/parakeet_realtime_eou_120m-v1 --output /tmp/eou.gguf
# CLI
./build/examples/cli/parakeet-cli transcribe \
--model /tmp/eou.gguf --input tests/fixtures/speech.wav --decoder tdt
# -> well i don't wish to see it any more observed phoebe ... portrait<EOU>
# ctest
PARAKEET_TEST_GGUF_EOU=/tmp/eou.gguf \
PARAKEET_TEST_BASELINE_EOU=/tmp/baseline_eou.gguf \
ctest --test-dir build -R test_transcribe_eou --output-on-failureModel: nvidia/parakeet_realtime_eou_120m-v1 — cache-aware streaming
FastConformer RNN-T. Streaming parameters (from the GGUF parakeet.streaming.*
KV): chunk_size=[9,16], pre_encode_cache_size=[0,9], valid_out_len=2,
last_channel_cache_size=70, drop_extra_pre_encoded=2, att_context=[70,1],
att_context_style=chunked_limited.
pk::StreamingEncoder carries per-layer convolution (cache_last_time) and
attention (cache_last_channel) caches across chunks. The concatenated per-chunk
valid output equals the OFFLINE encoder output over the leading frames
(cache-aware equivalence, max|d| ≈ 2.9e-6 over the leading 92 of 94 frames),
and matches NeMo's cache_aware_stream_step output. The trailing
valid_out_len=2 frames of the final chunk are the streaming tail (incomplete
right context — diverge from offline by design).
pk::StreamingSession drives the streaming encoder + a carried RNN-T greedy
decoder state (RnntDecodeState: prediction-net LSTM h/c, last emitted token,
SOS flag — never reset between chunks; mirrors NeMo partial_hypotheses). For
speech.wav it emits 44 tokens, EXACTLY equal to NeMo's own cache-aware
streaming decode (conformer_stream_step carrying previous_hypotheses) and to
the offline rnnt_token_ids (45) minus the trailing streaming-tail <EOU>=1024.
<EOU>=1024 / <EOB>=1025 are resolved from the tokenizer pieces, STRIPPED from
the running transcript, and surfaced as timed events
(pk::EouEvent{token, is_eob, encoder_frame, time_sec},
time_sec = encoder_frame * hop * subsampling_factor / sample_rate).
| Mode | NeMo streaming transcript | C++ streaming (StreamingSession / C-API / --stream) |
<EOU> |
|---|---|---|---|
Cache-aware streaming (speech.wav) |
well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait |
identical (byte-for-byte) | not emitted in-stream (matches NeMo eou_in_stream=0) |
The streaming transcript is the offline transcript with the trailing <EOU>
dropped — NeMo's cache-aware streaming does NOT emit it because the final-chunk
tail has incomplete right context. finalize() flushes the tail but does not
fabricate an <EOU> NeMo would not emit; for clips where an utterance ends with
full right context, the corresponding <EOU>/<EOB> would surface as an event.
parakeet_stream* parakeet_capi_stream_begin(parakeet_ctx*);
char* parakeet_capi_stream_feed(parakeet_stream*, const float* pcm, int n, int* eou_out);
char* parakeet_capi_stream_finalize(parakeet_stream*);
void parakeet_capi_stream_free(parakeet_stream*);feed accepts 16 kHz mono f32 PCM, buffers it, decodes encoder chunks as audio
arrives (carried caches), and returns the newly-finalized text (malloc'd; "" if
none); *eou_out=1 if an <EOU>/<EOB> event fired this feed. finalize
flushes the tail. Strings are freed with parakeet_capi_free_string. Exported
from libparakeet.so (verified via nm -D).
tests/test_streaming_encoder.cpp(test_streaming_encoder) — streaming encoder == offline + NeMo streaming (PARAKEET_TEST_GGUF_EOU+PARAKEET_TEST_BASELINE_EOU+PARAKEET_TEST_BASELINE_EOU_STREAM).tests/test_streaming_decode.cpp(test_streaming_decode) — streaming tokens == NeMo streaming == offline minus trailing<EOU>.tests/test_streaming_eou_reset.cpp(test_streaming_eou_reset) — multi-utterance streaming. The realtime EOU model emits<EOU>/<EOB>at the end of each utterance; NeMo's reference streaming driver (voice_agentNemoStreamingASRService.transcribe→reset_stateon<EOU>/<EOB>) resets the decoder so the NEXT utterance decodes fresh. Without that reset the prediction net stays conditioned on<EOU>and the stream goes silent after the first utterance (issue #13). The baseline (scripts/gen_stream_reset_baseline.py) builds a two-utterance clip (speech.wav+ silence +speech.wav) so an<EOU>fires mid-stream, runs NeMo's cache-aware streaming loop WITH reset-on-EOU, and stores the token sequence; the test asserts our streamed transcript (non-special tokens) matches it EXACTLY and that the second utterance is recovered after the mid-stream<EOU>. Skips (exit 77) unlessPARAKEET_TEST_GGUF_EOU+PARAKEET_TEST_BASELINE_EOU_RESETare set. NOTE:pk::StreamingSessionresets only the decoder state (LSTM + last token), not the encoder cache — verified byte-identical to NeMo's fullreset_stateon the transcript; the lone difference is a possible trailing end-of-clip<EOU>(the documented streaming-tail artifact), which never changes the transcript.tests/test_capi_stream.cpp(test_capi_stream) — feedsspeech.wavPCM in chunks through the streaming C-API; the concatenated text +finalizeequalsbaseline.stream_textfrom/tmp/baseline_eou_stream.gguf(NeMo streaming). Skips (exit 77) unlessPARAKEET_TEST_GGUF_EOU+PARAKEET_TEST_BASELINE_EOU_STREAMare set.
Reproduce:
# Streaming reference (NeMo cache-aware streaming encode + decode):
.venv/bin/python scripts/gen_stream_baseline.py \
--model nvidia/parakeet_realtime_eou_120m-v1 \
--audio tests/fixtures/speech.wav --output /tmp/baseline_eou_stream.gguf
# Multi-utterance reset-on-EOU reference (issue #13):
.venv/bin/python scripts/gen_stream_reset_baseline.py \
--model nvidia/parakeet_realtime_eou_120m-v1 \
--audio tests/fixtures/speech.wav --output /tmp/baseline_eou_reset.gguf
# CLI streaming:
./build/examples/cli/parakeet-cli transcribe \
--model /tmp/eou.gguf --input tests/fixtures/speech.wav --stream
# -> [stream] well i don't wish to see it ... like the old portrait
# [stream:final] well i don't wish to see it ... like the old portrait
# ctest:
PARAKEET_TEST_GGUF_EOU=/tmp/eou.gguf \
PARAKEET_TEST_BASELINE_EOU=/tmp/baseline_eou.gguf \
PARAKEET_TEST_BASELINE_EOU_STREAM=/tmp/baseline_eou_stream.gguf \
PARAKEET_TEST_BASELINE_EOU_RESET=/tmp/baseline_eou_reset.gguf \
ctest --test-dir build -R "test_capi_stream|test_streaming" --output-on-failurePer-token and per-word timestamps and confidence match NeMo's
transcribe(timestamps=True, return_hypotheses=True) with the decoding config's
confidence method set to max_prob (the rescaled (N·p_max − 1)/(N − 1) form,
α = 1.0, over the same logit slice NeMo log-softmaxes; N = vocab + 1 incl. blank).
Validated on tests/fixtures/speech.wav + nvidia/parakeet-tdt_ctc-110m for
both the TDT and CTC heads, against /tmp/baseline_ts.gguf (dumped by
scripts/gen_nemo_baseline.py).
| Quantity | TDT head | CTC head | Tolerance |
|---|---|---|---|
| per-token id / frame | exact | exact | exact |
| per-token confidence | ≤5e-6 | ≤5e-6 | < 1e-3 |
| word text (all 23 words) | exact | exact | exact |
| word start / end | 0.0 s diff | 0.0 s diff | ≤ frame_sec (1 frame) |
word confidence (min aggregate) |
≤5e-6 | ≤5e-6 | < 1e-3 |
frame_sec = hop_length × subsampling_factor / sample_rate = 0.08 s/frame here.
Word start = first_token.frame × frame_sec; word end =
(last_token.frame + span) × frame_sec (TDT span = predicted duration; CTC span =
the next collapsed token's start − this token's start, i.e. NeMo's run-length
end-offset). Words are grouped on the SentencePiece ▁ (U+2581) word-start
marker with NeMo's punctuation-attachment + _refine_timestamps rules
(pk::group_words, src/transcription.cpp).
pk::Model::transcribe_with_timestamps/transcribe_path_with_timestamps→pk::Transcription{text, words[], tokens[]}.- C-API
parakeet_capi_transcribe_path_json(ctx, wav, decoder)→ malloc'd JSON{"text":..,"words":[{"w","start","end","conf"}],"tokens":[{"id","t","conf"}]}(times %.3f s, conf %.4f; no-throw boundary; freed byparakeet_capi_free_string; exported fromlibparakeet.so, verified vianm -D). - CLI
parakeet-cli transcribe --timestamps(one<start>-<end> <word> (<conf>)line per word) and--json(the JSON above). - Streaming
pk::StreamingSession::drain_words()surfaces finalizedpk::Words as they complete (alongsidedrain_events()); the CLI--stream --timestampsprints per-word times as words finalize.
tests/test_timestamps.cpp(test_timestamps) — token + word timestamps/confidence vs NeMo for both heads (PARAKEET_TEST_GGUF+PARAKEET_TEST_BASELINE_TS).tests/test_capi_timestamps.cpp(test_capi_timestamps) — the C-API JSON:"text"== the NeMo TDT transcript, 23 words,words[0]==Well,with start ≈ 0.48 s and a non-empty"tokens"array. Skips (77) unlessPARAKEET_TEST_GGUFis set.
Reproduce:
# NeMo timestamp + max_prob confidence baseline:
.venv/bin/python scripts/gen_nemo_baseline.py # -> /tmp/baseline_ts.gguf (see script)
# CLI:
./build/examples/cli/parakeet-cli transcribe \
--model /tmp/pk110m.gguf --input tests/fixtures/speech.wav --timestamps
# -> 0.48-0.64 Well, (0.79)
# 0.80-0.88 I (1.00)
# ...
./build/examples/cli/parakeet-cli transcribe \
--model /tmp/pk110m.gguf --input tests/fixtures/speech.wav --json
# -> {"text":"Well, I don't ...","words":[{"w":"Well,","start":0.480,...}],"tokens":[...]}
# ctest:
PARAKEET_TEST_GGUF=/tmp/pk110m.gguf \
PARAKEET_TEST_BASELINE_TS=/tmp/baseline_ts.gguf \
ctest --test-dir build -R "test_timestamps|test_capi_timestamps" --output-on-failure