[WIP] Add NeMo Conformer TDT ASR support #1571

Draft

ysdede wants to merge 42 commits into huggingface:main from ysdede:v4-nemo-conformer-tdt-main-r3

Conversation


@ysdede ysdede commented Mar 9, 2026

Summary

This PR adds NeMo Conformer TDT ASR support to transformers.js, including model execution, feature extraction, decoding, reconstruction, pipeline wiring, registry integration, and NeMo-specific regression coverage.

The NeMo pipeline is aligned with the shared automatic-speech-recognition task contract, while richer direct model.transcribe() outputs remain available for lower-level use.

This branch supersedes my earlier fork-side iteration and is rebased onto the current upstream/main line.

What is included

1. Model and decoder

  • Added the NeMo Conformer TDT model implementation.
  • Implemented greedy token-and-duration transducer decoding.
  • Added model.transcribe() support for text, timestamps, confidences, optional words and tokens, and optional metrics/debug payloads.
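The decoding strategy above can be sketched as follows. This is an illustrative reduction, not the PR's actual implementation: `tokenLogitsAt` and `durationLogitsAt` stand in for the joint-network outputs at a given encoder frame, and `durations` is the model's duration bin table. The `Math.max(1, step)` guard mirrors the fix noted later in the commit log.

```javascript
// Hypothetical sketch of greedy token-and-duration transducer (TDT) decoding.
function argmax(arr) {
  let best = 0;
  for (let i = 1; i < arr.length; ++i) if (arr[i] > arr[best]) best = i;
  return best;
}

function greedyTdtDecode({ numFrames, blankId, durations, tokenLogitsAt, durationLogitsAt }) {
  const tokens = [];
  let t = 0;
  while (t < numFrames) {
    const tokenId = argmax(tokenLogitsAt(t));
    const step = durations[argmax(durationLogitsAt(t))];
    if (tokenId !== blankId) {
      tokens.push({ id: tokenId, frame: t });
    }
    // Always advance by at least one frame so decoding terminates
    // even when the predicted duration bin is 0.
    t += Math.max(1, step);
  }
  return tokens;
}
```

Unlike a classic RNN-T greedy loop, the duration head lets the decoder skip several encoder frames per step, which is what makes TDT decoding fast.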

2. Feature extraction

  • Added Conformer-TDT-specific log-mel feature extraction.
  • Added optional temporal deltas and delta-delta features.
  • Added optional feature-cache utilities with tensor ownership and lifecycle handling.
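The delta features above follow the standard regression-over-window formulation. The sketch below assumes `frames` is an array of equal-length `Float32Array`s and that `window` corresponds to the kind of `delta_window` option the branch validates; the exact formula and layout used in the PR may differ.

```javascript
// Minimal sketch of first-order temporal deltas over per-frame features,
// with the positive-integer window validation the PR describes.
function computeTemporalDeltas(frames, window = 2) {
  if (!Number.isInteger(window) || window < 1) {
    throw new Error(`delta_window must be a positive integer, got ${window}`);
  }
  // Standard delta denominator: 2 * sum(n^2) for n = 1..window.
  const denom =
    2 * Array.from({ length: window }, (_, i) => (i + 1) ** 2).reduce((a, b) => a + b, 0);
  const clamp = (t) => Math.min(frames.length - 1, Math.max(0, t));
  return frames.map((_, t) => {
    const delta = new Float32Array(frames[0].length);
    for (let n = 1; n <= window; ++n) {
      const next = frames[clamp(t + n)];
      const prev = frames[clamp(t - n)];
      for (let d = 0; d < delta.length; ++d) delta[d] += (n * (next[d] - prev[d])) / denom;
    }
    return delta;
  });
}
```

Delta-delta features are the same operation applied to the delta output.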

3. ASR pipeline integration

  • Integrated the Conformer TDT model type into AutomaticSpeechRecognitionPipeline dispatch.
  • Aligned pipeline outputs with the shared ASR task shape:
    • default: { text }
    • return_timestamps: true: { text, chunks } with sentence-like finalized chunks
    • return_timestamps: 'word': { text, chunks } with word-level timestamps
  • Kept richer model-native outputs on direct model.transcribe().
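The task-shape mapping above can be illustrated with a small adapter. `toTaskOutput` and the `native` field names (`sentences`, `words`, `start`, `end`) are illustrative stand-ins for the model-native `transcribe()` result, not the PR's actual internals.

```javascript
// Sketch: reshape a model-native transcribe() result into the shared ASR
// task contract ({ text } or { text, chunks }).
function toTaskOutput(native, returnTimestamps = false) {
  if (!returnTimestamps) return { text: native.text };
  const source = returnTimestamps === 'word' ? native.words : native.sentences;
  return {
    text: native.text,
    chunks: source.map(({ text, start, end }) => ({ text, timestamp: [start, end] })),
  };
}
```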

4. Long-audio handling

  • Added automatic long-audio handling for NeMo pipeline calls above 180 seconds.
  • Replaced the older overlap-oriented long-audio path with sentence-cursor restart logic.
  • Long-audio windowing finalizes stable sentence-like segments, drops the immature trailing segment, and retranscribes from that segment start.
  • chunk_length_s is used as the NeMo window-size override in pipeline mode.
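The sentence-cursor restart logic above can be sketched as a loop: transcribe one window, keep every finalized sentence except the trailing (possibly truncated) one, then restart from that sentence's start. `transcribeWindow` is a hypothetical stand-in for the real model call and is assumed to return sentence segments with absolute start/end times.

```javascript
// Hypothetical sketch of sentence-cursor long-audio windowing.
function transcribeLongAudio({ totalDuration, windowSize, transcribeWindow }) {
  const finalized = [];
  let cursor = 0;
  while (cursor < totalDuration) {
    const end = Math.min(cursor + windowSize, totalDuration);
    const sentences = transcribeWindow(cursor, end);
    if (end >= totalDuration || sentences.length <= 1) {
      finalized.push(...sentences);
      break;
    }
    // Drop the immature trailing segment; it will be retranscribed in full.
    const trailing = sentences.pop();
    finalized.push(...sentences);
    // Guarantee forward progress even if the cursor would not advance.
    cursor = Math.max(trailing.start, cursor + 1);
  }
  return finalized;
}
```

Because the cursor snaps back to a sentence start rather than a fixed overlap, no cross-window merge of partial words is needed.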

5. Word reconstruction and timestamp grouping

  • Reworked word reconstruction to derive boundaries from the final decoded text instead of isolated token decoding alone.
  • Improved segment grouping from timed words so sentence-like chunks are more stable than Whisper-style random splits.
  • Fixed spacing and boundary failures around punctuation-heavy and numeric outputs such as:
    • score.48-year-old
    • with0.5
    • March20th,2021.
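The idea of deriving boundaries from the final decoded text can be sketched as follows. This is a simplified illustration that assumes the concatenation of token texts equals the final text exactly; the PR's actual reconstruction handles tokenizer-specific spacing and the punctuation/numeric edge cases listed above.

```javascript
// Sketch: split the final decoded text on whitespace, then attach each timed
// token to the word whose character span it falls in.
function reconstructWords(finalText, timedTokens) {
  const words = [];
  let tokenIdx = 0;
  let consumed = 0; // characters of finalText covered by attached tokens
  for (const match of finalText.matchAll(/\S+/g)) {
    const wordEnd = match.index + match[0].length;
    let start = Infinity;
    let end = -Infinity;
    while (tokenIdx < timedTokens.length && consumed < wordEnd) {
      const tok = timedTokens[tokenIdx];
      start = Math.min(start, tok.start);
      end = Math.max(end, tok.end);
      consumed += tok.text.length;
      ++tokenIdx;
    }
    words.push({ text: match[0], start, end });
  }
  return words;
}
```

Splitting the final text (rather than decoding tokens in isolation) is what keeps `48-year-old` a single word even when it spans several tokens.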

6. Registry and model file resolution

  • Added model, processor, and feature extractor exports and mappings for Conformer TDT.
  • Added dual-artifact model file handling for encoder_model and decoder_model_merged.

7. Follow-up review fixes included in this branch

  • Moved NeMo adapter-specific assertions out of the shared ASR pipeline test file and into a dedicated NeMo pipeline adapter suite.
  • Fixed pending-prefix preservation when cursor snapping restarts inside the trailing sentence.
  • Hardened vocab handling and validation in word-offset reconstruction.
  • Added cache borrow/release handling so evicted borrowed tensors are disposed only after release.
  • Honored encoder_input_layout for canonical input_features feeds.
  • Raised the auto-window budget to match the minimum guaranteed cursor advance.
  • Kept borrowed cache-entry bytes counted until the final release.
  • Rejected tokenizer-less non-empty word-offset reconstruction instead of silently dropping detail.
  • Derived fallback vocab_size from the maximum tokenizer id so sparse vocabularies do not undersize decoder logits.
  • Kept punctuation-only merge dedupe from collapsing distinct overlapping tokens.
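The sparse-vocab fix above amounts to sizing logits from the maximum token id rather than the entry count. A minimal sketch, assuming the `Map`/plain-object shapes a tokenizer's `get_vocab()` may return (`resolveVocabSize` is an illustrative name, not the PR's exact helper):

```javascript
// Sketch: derive a fallback vocab size that covers non-contiguous token ids,
// e.g. ids {0, 1, 7} need logits of size 8, not 3.
function resolveVocabSize(vocab) {
  const ids = vocab instanceof Map ? [...vocab.values()] : Object.values(vocab);
  if (ids.length === 0) throw new Error('Cannot resolve vocab size from an empty vocab');
  return Math.max(...ids) + 1;
}
```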

Regression coverage

Added or updated tests in:

  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js

Coverage includes:

  • task-shaped pipeline outputs for default, sentence-chunk, and word-chunk modes
  • sentence-cursor long-audio windowing and retranscription
  • timestamp grouping and word-boundary reconstruction for punctuation and numeric tokens
  • encoder input layout handling for canonical feeds
  • sparse vocab fallback sizing
  • punctuation-only merge dedupe behavior
  • cache ownership, eviction, release, and accounting behavior
  • processor tensor lifetime behavior in the NeMo path

Upstream sync included

This branch was synced with upstream/main through commit f65a4c7c via merge commit 49a4af8f.

Included upstream commits in that sync:

  • 2120d13e [deno] Support both wgpu and dawn webgpu backends (#1546)
  • a289e5c3 Add support for new Qwen VL models (Qwen2.5-VL, Qwen3-VL, Qwen3.5, and Qwen3.5 MoE) (#1551)
  • b5b51ca9 [version] Update to 4.0.0-next.5
  • 2a210d3e [deno via CDN] Fix simultaneous multi-session loading (e.g. VLMs) and support usage for image loading (#1556)
  • e60a6ee3 Use ModelRegistry for pipeline file loading (#1555)
  • 4331d723 Support PKV cached generation for Qwen-VL models (#1557)
  • cd155a05 fix: prevent partial file reads during concurrent downloads (#1548)
  • 30773fb7 Fix WASM factory blob URL loading (#1558)
  • f65a4c7c feat: add fast boolean is_cached / is_pipeline_cached (#1559)

NeMo adaptation after upstream sync

Relevant branch commits after that sync include:

  • ee819a1c fix(nemo-tdt): add supports() for ASR model class selection
  • a85dff25 fix(nemo-tdt): address PR #12 reviewer feedback
  • 8dfccddc feat(nemo-tdt): align asr pipeline outputs and long-audio handling
  • 816f5811 chore(tests): drop unrelated parakeet feature extractor coverage
  • f59ba068 feat(nemo-conformer-tdt): add sentence-based ASR pipeline chunking
  • 00b3d934 fix(nemo): scope ASR tests and address review fixes
  • 07118c38 fix(nemo-tdt): address follow-up review threads
  • 341df3d7 chore(asr): restore upstream cast spacing
  • 29f2baaf fix(nemo-tdt): handle sparse vocab and merge dedupe

Validation

Executed for this refresh:

  • node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
  • node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
  • pnpm build
  • pnpm format:check on files changed by this branch: OK

Note: repo-wide pnpm format:check currently reports unrelated formatting issues outside this branch, but the files touched by this PR pass Prettier checks.

A prebuilt demo for this branch is available at https://ysdede.github.io/tdt-webgpu-demo to help review the current behavior, parameters, outputs, and example usage interactively.

Scope boundary

This PR stays focused on NeMo Conformer TDT integration and the follow-up work needed to:

  • align pipeline behavior with the shared ASR contract
  • improve long-audio handling in pipeline mode
  • improve word reconstruction and timestamp grouping
  • address targeted reviewer-reported NeMo correctness issues

Direct model.transcribe() remains the low-level API for advanced app-specific postprocessing.

ysdede added 30 commits March 1, 2026 16:56
…cache helpers

Carry over non-runtime typing fixes from the prior branch while intentionally excluding the WebGPU disable_prepacking workaround in session.js.

- Cast dynamic model.transcribe access for Nemo TDT pipeline method checks/calls.
- Cast Tensor data byteLength access in transducer cache utilities.
- Add explicit tuple/object JSDoc annotations in transducer timestamp builder.

This keeps main-based v4 work clean with latest ORT-Web on origin/main and avoids retaining the temporary encoder prepacking workaround.
- Replace legacy per-feature flags (return_token_timestamps,
  return_word_timestamps, return_utterance_timestamp) with a layered API:
  return_timestamps (utterance-level), return_words, return_tokens
- Merge duplicate outputs: words absorbs word_timestamps,
  tokens absorbs token_timestamps and token_ids
- Add per-token confidence, word-level confidence aggregation,
  utterance_confidence, and confidence_scores summary
- Gate frame confidences behind returnFrameConfidences flag
- Add return_metrics with encode/decode/total timing and RTF
- Add debug flags: returnFrameIndices, returnLogProbs, returnTdtSteps
- Fix vocab Map handling in getIdToTokenMap and _resolveVocabSize
  (tokenizer.get_vocab() returns Map in WASM binding)
- Update ASR pipeline to wire timestamp_granularity to new model flags
- Format all changed files with Prettier per CONTRIBUTING.md
…ipeline

- Add roundTs() for millisecond-precision timestamp rounding at source

- Round all confidence averages to 6 decimal places

- Round per-token and per-word confidence values

- Remove timestamp_granularity and formatting helpers from pipeline

- Pipeline returns model.transcribe() output directly

- Auto-enable return_words and return_metrics when return_timestamps is true
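The source-side rounding described in this commit can be sketched in two one-liners. The helper names mirror the commit message, but the exact implementations are illustrative.

```javascript
// Sketch: millisecond-precision timestamps and 6-decimal confidences,
// rounded once at the source instead of in pipeline formatting helpers.
const roundTs = (seconds) => Math.round(seconds * 1000) / 1000;
const roundConf = (confidence) => Math.round(confidence * 1e6) / 1e6;
```

Rounding at the source keeps every consumer (pipeline output, direct `transcribe()` callers, tests) seeing identical values.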
…imestamps, honor return_metrics kwarg

- modeling_nemo_conformer_tdt: dispose logits and new decoder state tensors
  before throwing when logitsData.length < vocabSize to prevent resource leak
- modeling_nemo_conformer_tdt: move returnFrameConfidences output block outside
  the return_timestamps guard so frame/frame_avg are emitted independently
- automatic-speech-recognition: change return_metrics from hardcoded true to
  kwargs.return_metrics ?? false to respect user intent and avoid overhead
- Accept upstream restructuring: SUPPORTED_TASKS and pipeline imports moved
  from pipelines.js to pipelines/index.js
- Migrate NemoConformerForTDT registration to pipelines/index.js accordingly
- Add MODEL_TYPES.NemoConformerTDT (id=16) to modeling_utils
- Register NemoConformerForTDT in MODEL_TYPE_MAPPING, MODEL_NAME_TO_CLASS_MAPPING,
  and MODEL_CLASS_TO_NAME_MAPPING so the base class from_pretrained, ModelRegistry,
  and is_pipeline_cached all recognise the model correctly
- Add NemoConformerTDT case to get_model_files so progress_callback receives
  accurate file size totals for encoder_model.onnx + decoder_model_merged.onnx
Standardizes internal logging to follow the upstream convention introduced
in ModelRegistry refactor.
- Guard feature extractor against empty/short audio (NaN prevention)

- Move decoder tensor init inside try block for safe disposal

- Add architecture key to MODEL_TYPE_MAPPING

- Add input validation in buildTransducerDetailedOutputs

- Harden audio cache hash against NaN samples

- Add order validation in computeTemporalDeltas

- Restore pipeline: return_timestamps truthy => words + metrics always on
- Remove all timestamp_granularity tests (feature was removed)

- Fix option names: return_tokens, return_words, return_timestamps

- Fix output fields: tokens/words arrays, not token_ids/word_timestamps

- Verify pipeline passes return_words + return_metrics when timestamps on

- Add test: return_timestamps 'word' treated as truthy
Address reviewer findings except the return_metrics policy decision.

- Fix temporal delta concatenation to interleave per frame and add dtype validation.
- Validate preemphasis range and clamp normalization variance in feature extraction.
- Remove unsafe encoder layout inference; require explicit encoder_output_layout.
- Redesign decode loop to read frame data on-demand instead of eager frame materialization.
- Deduplicate word finalization and avoid zero-filling missing word confidences.
- Tighten tests for delta layout/type checks, explicit layout requirement, call counts, and naming accuracy.
Fixes high-impact issues found in PR review validation:

- force NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT in registry overrides
- ensure encoder outputs are disposed when pre-decode validation throws
- remove stride sampling from audio cache key hashing to prevent false cache hits
- use encoder_model selector key in get_model_files for Nemo per-component dtype/device overrides

Also adds targeted regression tests for mapping, disposal behavior, file selection, and cache key correctness.
- Clamp token end timestamps to encoder frame bounds during TDT decoding.
- Validate FeatureLRUCache constructor limits to fail fast on invalid settings.
- Add regression tests for timestamp clamping and cache limit validation.
Dispose intermediate tensors in computeTemporalDeltas concatenate paths and dispose replaced base input features when delta concatenation returns a new tensor.

Add regression tests that assert disposal behavior for delta concatenate flows and feature extractor reassignment.
Dispose non-essential Tensor outputs returned by decoder steps to prevent cumulative memory growth. Keep logits/state tensors alive for decoding/state transitions and dispose extras immediately.

Add regression test to assert auxiliary decoder tensor outputs are disposed each step.
Compute encoder length directly from attention_mask.data instead of attention_mask.tolist() to avoid large transient array allocations in ASR decode hot path.
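This hot-path change can be sketched as summing the mask's typed-array `data` directly. The `[batch, frames]` mask layout of 0/1 values is an assumption for illustration; `encoderLengthFromMask` is a hypothetical name.

```javascript
// Sketch: count valid encoder frames by summing attention_mask.data
// (a typed array) instead of materializing a nested JS array via tolist().
function encoderLengthFromMask(maskData) {
  let length = 0;
  for (let i = 0; i < maskData.length; ++i) length += maskData[i];
  return length;
}
```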
Fail fast when duration logits are required but missing in decoder output, and enforce positive-integer vocab size at runtime config validation.

Validate prepared Nemo pipeline audio for non-empty finite samples before processor/model calls.

Add regression tests for missing duration logits and non-finite audio rejection.
Fix placeholder interpolation in _prepare_model_inputs error text.

Add fail-fast validation for Nemo delta_window and reject duplicate decoder output aliases in transducer io config.

Add regression tests for delta_window validation and duplicate decoder output alias rejection.
Validate transcribe timeOffset as finite and guard encoderOutputs cleanup path to avoid masking primary failures.

Align transducer_text JSDoc token type with runtime shape (include id).

Harden Parakeet feature extractor test by using direct mask data and explicit tensor disposal via try/finally; add timeOffset validation regression test.
- fail fast on missing decoder state outputs and invalid encoder layout enums
- make FeatureLRUCache own cached tensor lifetimes (replace/evict/clear) with deduped disposal and deterministic size fallback
- validate n_fft/win_length in Nemo feature extractor
- align Nemo ASR pipeline docs with actual forwarded options
- add regression coverage for runtime config validation, non-concatenated deltas/cache behavior, missing decoder state outputs, and cache disposal semantics

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
Apply Gemini review nit in Nemo decode loop by replacing a redundant duration expression with Math.max(1, step).

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
Checklist (bot comment IDs):
- [x] 2892132356: guard tokenizer.get_vocab() return type before Object.keys in _resolveVocabSize.
- [x] 2892132367: treat zero cache limits as explicit no-cache mode; do not store/dispose just-produced values.
- [x] 2892132372: dispose processor tensors in Nemo ASR pipeline when cache does not own lifetimes.

Added regression tests for vocab resolution fallback, zero-limit cache semantics, and Nemo pipeline tensor ownership behavior.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
- widen confidenceFromLogits input type to Tensor data arrays

- narrow feature_cache access with explicit typed cast in ASR pipeline
Checklist (bot comment IDs):
- [x] 2892287484: handle array-returning tokenizer vocab in _resolveVocabSize.
- [x] 2892322884: avoid disposing when re-setting the same object for an existing cache key.
- [x] 2892322906: skip caching oversized values to prevent insert-then-dispose of caller-owned tensors.
- [x] 2892322910: guard byteLength type in estimateSizeBytes.

Added regression tests for array vocab sizing, same-object set behavior, oversized value skipping, and non-numeric byteLength handling.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
ysdede added 12 commits March 6, 2026 01:44
Align the Nemo ASR pipeline with the shared task contract by returning
text-only results by default and chunk-based timestamps for segment and
word modes. Add automatic long-audio windowing, decoded-text-driven word
reconstruction, and model-local helpers for window merge and chunk
assembly.

Also add regression coverage for numeric/punctuation word boundaries,
windowed merge behavior, and auto-windowed long-form pipeline decoding.
Remove the standalone parakeet feature extractor test from this branch.
It exercises an existing parakeet_ctc path that is outside the scope of
Conformer TDT integration and makes the PR look broader than it is.
Use conservative sentence boundaries for pipeline timestamps and long-audio cursoring in the NeMo Conformer TDT pipeline. This keeps the HF-style pipeline contract while replacing the old fixed-window merge path with sentence-driven retranscription.

Also remove dead NeMo window-merge helpers, delete the obsolete compatibility barrel, and extend the model and pipeline tests around cache handling, timestamps, and long-audio behavior.
Keep the shared ASR pipeline suite focused on the public Nemo contract and move adapter-specific windowing, retranscription, cache-ownership, and disposal coverage into a dedicated Nemo pipeline test file.

Narrow the source diff by removing explanatory Nemo comments and reverting unrelated upstream-only tweaks, while also fixing the review findings around cursor snap-forward merging, tokenizer vocab-shape handling, empty timestamp validation, and cache borrow/release semantics for active inference.

Verification:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Apply the remaining valid Nemo Conformer TDT review fixes without widening the shared ASR pipeline surface.

- honor encoder_input_layout for canonical input_features feeds
- keep borrowed cache entries counted until they are actually released
- reject tokenizer-less non-empty word-offset reconstruction
- raise the auto-window budget to match the minimum guaranteed cursor advance
- add focused model and pipeline regressions for each fix

Verified with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Restore the original cast spacing in the unrelated moonshine path so the Nemo PR does not carry an extra formatting-only diff in automatic-speech-recognition.js.
Resolve sparse tokenizer vocab fallback by deriving the runtime size from the maximum token id instead of counting entries. This keeps decoder sizing correct when tokenizer ids are non-contiguous.

Tighten merged-word dedupe so punctuation-only overlaps are only collapsed when their raw normalized text also matches, which avoids dropping distinct punctuation tokens across window boundaries.

Add focused Nemo model regressions and verify with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Treat likely domain suffixes as continuations when tokenizer decoding inserts whitespace after a trailing period, so sequences like `LibriVox. org.` reconstruct as `LibriVox.org.` in detailed word offsets.

Add a focused regression covering the split `.org` token pattern and verify with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
@ysdede ysdede marked this pull request as draft March 9, 2026 22:41