[WIP] Add NeMo Conformer TDT ASR support #1571
Draft
ysdede wants to merge 42 commits into huggingface:main
…cache helpers
Carry over non-runtime typing fixes from the prior branch while intentionally excluding the WebGPU disable_prepacking workaround in session.js.

- Cast dynamic model.transcribe access for Nemo TDT pipeline method checks/calls.
- Cast Tensor data byteLength access in transducer cache utilities.
- Add explicit tuple/object JSDoc annotations in transducer timestamp builder.

This keeps main-based v4 work clean with latest ORT-Web on origin/main and avoids retaining the temporary encoder prepacking workaround.
- Replace legacy per-feature flags (return_token_timestamps, return_word_timestamps, return_utterance_timestamp) with a layered API: return_timestamps (utterance-level), return_words, return_tokens
- Merge duplicate outputs: words absorbs word_timestamps, tokens absorbs token_timestamps and token_ids
- Add per-token confidence, word-level confidence aggregation, utterance_confidence, and confidence_scores summary
- Gate frame confidences behind returnFrameConfidences flag
- Add return_metrics with encode/decode/total timing and RTF
- Add debug flags: returnFrameIndices, returnLogProbs, returnTdtSteps
- Fix vocab Map handling in getIdToTokenMap and _resolveVocabSize (tokenizer.get_vocab() returns Map in WASM binding)
- Update ASR pipeline to wire timestamp_granularity to new model flags
- Format all changed files with Prettier per CONTRIBUTING.md
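For reviewers unfamiliar with RTF (real-time factor), the timing metrics mentioned above can be sketched as follows. This is only an illustration: the function name `buildTimingMetrics` and the field names are hypothetical and are not taken from the branch, and the real implementation may account for time outside the encode and decode phases.

```javascript
// Hypothetical sketch: derive timing metrics and RTF from measured phases.
// RTF = processing time / audio duration; values below 1 are faster than real time.
function buildTimingMetrics(encodeMs, decodeMs, numSamples, samplingRate) {
  const totalMs = encodeMs + decodeMs; // assumption: total is just encode + decode
  const audioSeconds = numSamples / samplingRate;
  return {
    encode_ms: encodeMs,
    decode_ms: decodeMs,
    total_ms: totalMs,
    // Guard against empty audio so RTF never becomes Infinity or NaN.
    rtf: audioSeconds > 0 ? totalMs / 1000 / audioSeconds : null,
  };
}
```

With 10 seconds of 16 kHz audio processed in 500 ms overall, the sketch reports an RTF of 0.05.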
…ipeline
- Add roundTs() for millisecond-precision timestamp rounding at source
- Round all confidence averages to 6 decimal places
- Round per-token and per-word confidence values
- Remove timestamp_granularity and formatting helpers from pipeline
- Pipeline returns model.transcribe() output directly
- Auto-enable return_words and return_metrics when return_timestamps is true
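The rounding rules in this commit can be illustrated with a small sketch. Only the name `roundTs` comes from the commit message; its body here, and the helpers `roundConfidence` and `averageConfidence`, are assumptions about one reasonable way to implement millisecond and 6-decimal rounding.

```javascript
// Round a timestamp in seconds to millisecond precision (3 decimals).
function roundTs(seconds) {
  return Math.round(seconds * 1000) / 1000;
}

// Hypothetical helper: round a confidence value to 6 decimal places.
function roundConfidence(value) {
  return Math.round(value * 1e6) / 1e6;
}

// Hypothetical helper: average a list of confidences, rounding once at the end.
// Returns null for an empty list instead of producing NaN.
function averageConfidence(values) {
  if (values.length === 0) return null;
  let sum = 0;
  for (const v of values) sum += v;
  return roundConfidence(sum / values.length);
}
```

Rounding at the source keeps every downstream consumer (words, tokens, chunks) consistent without separate formatting helpers.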
…imestamps, honor return_metrics kwarg
- modeling_nemo_conformer_tdt: dispose logits and new decoder state tensors before throwing when logitsData.length < vocabSize to prevent resource leak
- modeling_nemo_conformer_tdt: move returnFrameConfidences output block outside the return_timestamps guard so frame/frame_avg are emitted independently
- automatic-speech-recognition: change return_metrics from hardcoded true to kwargs.return_metrics ?? false to respect user intent and avoid overhead
- Accept upstream restructuring: SUPPORTED_TASKS and pipeline imports moved from pipelines.js to pipelines/index.js - Migrate NemoConformerForTDT registration to pipelines/index.js accordingly
- Add MODEL_TYPES.NemoConformerTDT (id=16) to modeling_utils
- Register NemoConformerForTDT in MODEL_TYPE_MAPPING, MODEL_NAME_TO_CLASS_MAPPING, and MODEL_CLASS_TO_NAME_MAPPING so the base class from_pretrained, ModelRegistry, and is_pipeline_cached all recognise the model correctly
- Add NemoConformerTDT case to get_model_files so progress_callback receives accurate file size totals for encoder_model.onnx + decoder_model_merged.onnx
Standardizes internal logging to follow the upstream convention introduced in ModelRegistry refactor.
- Guard feature extractor against empty/short audio (NaN prevention)
- Move decoder tensor init inside try block for safe disposal
- Add architecture key to MODEL_TYPE_MAPPING
- Add input validation in buildTransducerDetailedOutputs
- Harden audio cache hash against NaN samples
- Add order validation in computeTemporalDeltas
- Restore pipeline: return_timestamps truthy => words + metrics always on
- Remove all timestamp_granularity tests (feature was removed)
- Fix option names: return_tokens, return_words, return_timestamps
- Fix output fields: tokens/words arrays, not token_ids/word_timestamps
- Verify pipeline passes return_words + return_metrics when timestamps on
- Add test: return_timestamps 'word' treated as truthy
Address reviewer findings except the return_metrics policy decision.
- Fix temporal delta concatenation to interleave per frame and add dtype validation.
- Validate preemphasis range and clamp normalization variance in feature extraction.
- Remove unsafe encoder layout inference; require explicit encoder_output_layout.
- Redesign decode loop to read frame data on-demand instead of eager frame materialization.
- Deduplicate word finalization and avoid zero-filling missing word confidences.
- Tighten tests for delta layout/type checks, explicit layout requirement, call counts, and naming accuracy.
Fixes high-impact issues found in PR review validation:
- force NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT in registry overrides
- ensure encoder outputs are disposed when pre-decode validation throws
- remove stride sampling from audio cache key hashing to prevent false cache hits
- use encoder_model selector key in get_model_files for Nemo per-component dtype/device overrides

Also adds targeted regression tests for mapping, disposal behavior, file selection, and cache key correctness.
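To show why stride sampling can produce false cache hits, here is a minimal sketch of a cache key that hashes every sample. It assumes a 32-bit FNV-1a hash over the raw bytes of a `Float32Array`; the function name `hashAudioSamples` and the key format are hypothetical, and the branch may use a different hash.

```javascript
// Hypothetical sketch: build a cache key from ALL audio samples (no stride).
// A strided hash would skip samples, so two clips differing only at a
// skipped index would map to the same key.
function hashAudioSamples(samples) {
  // View the Float32Array as raw bytes without copying.
  const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < bytes.length; ++i) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept unsigned
  }
  // Include the length so clips of different sizes never share a key prefix.
  return `${samples.length}:${hash.toString(16)}`;
}
```

Hashing the full buffer costs one linear pass over the audio, which is typically small next to running the encoder.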
- Clamp token end timestamps to encoder frame bounds during TDT decoding.
- Validate FeatureLRUCache constructor limits to fail fast on invalid settings.
- Add regression tests for timestamp clamping and cache limit validation.
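The clamping described above can be sketched like this. The function `clampTokenSpan` and its parameters are hypothetical; the sketch assumes a token starts at a given encoder frame, spans a predicted duration in frames, and must not end beyond the last encoder frame.

```javascript
// Hypothetical sketch: convert a token's frame span to seconds, clamping the
// end to the number of encoder frames so timestamps never exceed the audio.
function clampTokenSpan(startFrame, durationFrames, totalFrames, secondsPerFrame) {
  const start = Math.min(Math.max(0, startFrame), totalFrames);
  // Assumption: every emitted token covers at least one frame.
  const end = Math.min(start + Math.max(1, durationFrames), totalFrames);
  return { start: start * secondsPerFrame, end: end * secondsPerFrame };
}
```

For example, a token starting at frame 98 with a predicted duration of 4 frames in a 100-frame utterance ends at frame 100 rather than 102.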
Dispose intermediate tensors in computeTemporalDeltas concatenate paths and dispose replaced base input features when delta concatenation returns a new tensor.

Add regression tests that assert disposal behavior for delta concatenate flows and feature extractor reassignment.
Dispose non-essential Tensor outputs returned by decoder steps to prevent cumulative memory growth. Keep logits/state tensors alive for decoding/state transitions and dispose extras immediately.

Add regression test to assert auxiliary decoder tensor outputs are disposed each step.
Compute encoder length directly from attention_mask.data instead of attention_mask.tolist() to avoid large transient array allocations in ASR decode hot path.
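A minimal sketch of the idea, assuming the mask is exposed as a typed array (for example a `BigInt64Array` for int64 tensors). The helper name `countValidFrames` is hypothetical.

```javascript
// Hypothetical sketch: count non-zero mask entries directly on the typed
// array, avoiding the nested plain-array copy that tolist() would allocate.
function countValidFrames(maskData) {
  let count = 0;
  for (let i = 0; i < maskData.length; ++i) {
    // Number() handles both BigInt (int64 masks) and regular numeric entries.
    if (Number(maskData[i]) !== 0) ++count;
  }
  return count;
}
```

The loop allocates nothing per element, which matters when it runs for every utterance in the decode path.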
Fail fast when duration logits are required but missing in decoder output, and enforce positive-integer vocab size at runtime config validation. Validate prepared Nemo pipeline audio for non-empty finite samples before processor/model calls. Add regression tests for missing duration logits and non-finite audio rejection.
Fix placeholder interpolation in _prepare_model_inputs error text. Add fail-fast validation for Nemo delta_window and reject duplicate decoder output aliases in transducer io config. Add regression tests for delta_window validation and duplicate decoder output alias rejection.
Validate transcribe timeOffset as finite and guard encoderOutputs cleanup path to avoid masking primary failures. Align transducer_text JSDoc token type with runtime shape (include id). Harden Parakeet feature extractor test by using direct mask data and explicit tensor disposal via try/finally; add timeOffset validation regression test.
- fail fast on missing decoder state outputs and invalid encoder layout enums
- make FeatureLRUCache own cached tensor lifetimes (replace/evict/clear) with deduped disposal and deterministic size fallback
- validate n_fft/win_length in Nemo feature extractor
- align Nemo ASR pipeline docs with actual forwarded options
- add regression coverage for runtime config validation, non-concatenated deltas/cache behavior, missing decoder state outputs, and cache disposal semantics

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
Apply Gemini review nit in Nemo decode loop by replacing a redundant duration expression with Math.max(1, step).

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
Checklist (bot comment IDs):
- [x] 2892132356: guard tokenizer.get_vocab() return type before Object.keys in _resolveVocabSize.
- [x] 2892132367: treat zero cache limits as explicit no-cache mode; do not store/dispose just-produced values.
- [x] 2892132372: dispose processor tensors in Nemo ASR pipeline when cache does not own lifetimes.

Added regression tests for vocab resolution fallback, zero-limit cache semantics, and Nemo pipeline tensor ownership behavior.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
- widen confidenceFromLogits input type to Tensor data arrays
- narrow feature_cache access with explicit typed cast in ASR pipeline
Checklist (bot comment IDs):
- [x] 2892287484: handle array-returning tokenizer vocab in _resolveVocabSize.
- [x] 2892322884: avoid disposing when re-setting the same object for an existing cache key.
- [x] 2892322906: skip caching oversized values to prevent insert-then-dispose of caller-owned tensors.
- [x] 2892322910: guard byteLength type in estimateSizeBytes.

Added regression tests for array vocab sizing, same-object set behavior, oversized value skipping, and non-numeric byteLength handling.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
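The cache ownership rules discussed across these commits (fail-fast limits, zero limits as no-cache mode, no disposal on same-object re-set, skipping oversized values, disposal on replace/evict/clear) can be sketched in a small class. `SketchFeatureCache` is a hypothetical stand-in, not the branch's FeatureLRUCache; it assumes cached values expose a `dispose()` method and that the caller supplies the size in bytes.

```javascript
// Hypothetical sketch of an LRU cache that owns the lifetimes of its values.
class SketchFeatureCache {
  constructor({ maxEntries = 8, maxSizeBytes = 1 << 20 } = {}) {
    // Fail fast on invalid limits.
    for (const [name, v] of Object.entries({ maxEntries, maxSizeBytes })) {
      if (!Number.isInteger(v) || v < 0) throw new Error(`${name} must be a non-negative integer`);
    }
    this.maxEntries = maxEntries;
    this.maxSizeBytes = maxSizeBytes;
    this.entries = new Map(); // Map insertion order doubles as recency order
    this.totalBytes = 0;
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    this.entries.delete(key); // re-insert to mark as most recently used
    this.entries.set(key, entry);
    return entry.value;
  }

  // Returns true when the cache took ownership of `value`; on false the
  // caller still owns it and remains responsible for disposing it.
  set(key, value, sizeBytes) {
    if (this.maxEntries === 0 || this.maxSizeBytes === 0) return false; // no-cache mode
    if (sizeBytes > this.maxSizeBytes) return false; // oversized: never insert-then-dispose
    const existing = this.entries.get(key);
    if (existing && existing.value === value) {
      this.get(key); // same object: refresh recency only, do not dispose
      return true;
    }
    if (existing) this._remove(key); // replaced value is disposed
    this.entries.set(key, { value, sizeBytes });
    this.totalBytes += sizeBytes;
    while (this.entries.size > this.maxEntries || this.totalBytes > this.maxSizeBytes) {
      this._remove(this.entries.keys().next().value); // evict least recently used
    }
    return true;
  }

  _remove(key) {
    const entry = this.entries.get(key);
    if (!entry) return;
    this.entries.delete(key);
    this.totalBytes -= entry.sizeBytes;
    entry.value.dispose();
  }

  clear() {
    for (const key of [...this.entries.keys()]) this._remove(key);
  }
}
```

Returning a boolean from `set` is one way to make ownership explicit, so a pipeline can dispose processor tensors itself whenever the cache declines them.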
…o pipelines" This reverts commit b44f7f3.
Align the Nemo ASR pipeline with the shared task contract by returning text-only results by default and chunk-based timestamps for segment and word modes. Add automatic long-audio windowing, decoded-text-driven word reconstruction, and model-local helpers for window merge and chunk assembly. Also add regression coverage for numeric/punctuation word boundaries, windowed merge behavior, and auto-windowed long-form pipeline decoding.
Remove the standalone parakeet feature extractor test from this branch. It exercises an existing parakeet_ctc path that is outside the scope of Conformer TDT integration and makes the PR look broader than it is.
Use conservative sentence boundaries for pipeline timestamps and long-audio cursoring in the NeMo Conformer TDT pipeline. This keeps the HF-style pipeline contract while replacing the old fixed-window merge path with sentence-driven retranscription. Also remove dead NeMo window-merge helpers, delete the obsolete compatibility barrel, and extend the model and pipeline tests around cache handling, timestamps, and long-audio behavior.
Keep the shared ASR pipeline suite focused on the public Nemo contract and move adapter-specific windowing, retranscription, cache-ownership, and disposal coverage into a dedicated Nemo pipeline test file. Narrow the source diff by removing explanatory Nemo comments and reverting unrelated upstream-only tweaks, while also fixing the review findings around cursor snap-forward merging, tokenizer vocab-shape handling, empty timestamp validation, and cache borrow/release semantics for active inference.

Verification:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Apply the remaining valid Nemo Conformer TDT review fixes without widening the shared ASR pipeline surface.
- honor encoder_input_layout for canonical input_features feeds
- keep borrowed cache entries counted until they are actually released
- reject tokenizer-less non-empty word-offset reconstruction
- raise the auto-window budget to match the minimum guaranteed cursor advance
- add focused model and pipeline regressions for each fix

Verified with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Restore the original cast spacing in the unrelated moonshine path so the Nemo PR does not carry an extra formatting-only diff in automatic-speech-recognition.js.
Resolve sparse tokenizer vocab fallback by deriving the runtime size from the maximum token id instead of counting entries. This keeps decoder sizing correct when tokenizer ids are non-contiguous. Tighten merged-word dedupe so punctuation-only overlaps are only collapsed when their raw normalized text also matches, which avoids dropping distinct punctuation tokens across window boundaries.

Add focused Nemo model regressions and verify with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
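The sparse-vocabulary point can be illustrated with a short sketch. The function `resolveVocabSizeSketch` is hypothetical and is not the branch's `_resolveVocabSize`; it assumes the vocabulary arrives as a Map or plain object from token to id, or as an array indexed by id (the array interpretation in particular is an assumption).

```javascript
// Hypothetical sketch: size the vocabulary as (max token id + 1) rather than
// the number of entries, so non-contiguous ids do not undersize the logits.
function resolveVocabSizeSketch(vocab) {
  if (Array.isArray(vocab)) return vocab.length; // assumption: array index is the token id
  let ids;
  if (vocab instanceof Map) ids = vocab.values();
  else if (vocab !== null && typeof vocab === 'object') ids = Object.values(vocab);
  else throw new Error('Unsupported vocab shape');
  let maxId = -1;
  for (const id of ids) {
    const n = Number(id);
    if (Number.isInteger(n) && n > maxId) maxId = n;
  }
  if (maxId < 0) throw new Error('Vocab contains no valid token ids');
  return maxId + 1;
}
```

A vocabulary with ids 0, 1 and 9 has three entries, but the decoder can still emit id 9, so the sketch reports a size of 10.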
Treat likely domain suffixes as continuations when tokenizer decoding inserts whitespace after a trailing period, so sequences like `LibriVox. org.` reconstruct as `LibriVox.org.` in detailed word offsets.

Add a focused regression covering the split `.org` token pattern and verify with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
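A text-level sketch of the heuristic is shown below. It is an assumption-laden illustration only: the suffix list, the function name `joinDomainSuffixes`, and the choice to operate on a string with a regular expression are all hypothetical, whereas the branch works on detailed word offsets.

```javascript
// Hypothetical suffix list; the real heuristic may recognise a different set.
const LIKELY_DOMAIN_SUFFIXES = new Set(['com', 'org', 'net', 'io', 'edu', 'gov']);

// Hypothetical sketch: rejoin "name. org" into "name.org" when the word after
// the period is a lowercase, likely domain suffix. Requiring lowercase avoids
// swallowing ordinary sentence starts such as "The end. Net result".
function joinDomainSuffixes(text) {
  return text.replace(/(\w)\. ([a-z]{2,})(?=[.\s,;:!?]|$)/g, (match, prev, suffix) =>
    LIKELY_DOMAIN_SUFFIXES.has(suffix) ? `${prev}.${suffix}` : match,
  );
}
```

Because this is a heuristic, a lowercase word that happens to match a suffix after a sentence-ending period would still be joined; a conservative suffix list limits that risk.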
This reverts commit 39e5cb1.
Summary
This PR adds NeMo Conformer TDT ASR support to `transformers.js`, including model execution, feature extraction, decoding, reconstruction, pipeline wiring, registry integration, and NeMo-specific regression coverage.

The NeMo pipeline is aligned with the shared `automatic-speech-recognition` task contract, while richer direct `model.transcribe()` outputs remain available for lower-level use.

This branch supersedes my earlier fork-side iteration and is rebased onto the current `upstream/main` line.

What is included
1. Model and decoder
- `model.transcribe()` support for text, timestamps, confidences, optional words and tokens, and optional metrics/debug payloads.

2. Feature extraction

3. ASR pipeline integration
- `AutomaticSpeechRecognitionPipeline` dispatch.
- `{ text }`
- `return_timestamps: true`: `{ text, chunks }` with sentence-like finalized chunks
- `return_timestamps: 'word'`: `{ text, chunks }` with word-level timestamps
- `model.transcribe()`.

4. Long-audio handling
- `chunk_length_s` is used as the NeMo window-size override in pipeline mode.

5. Word reconstruction and timestamp grouping
score.48-year-oldwith0.5March20th,2021.

6. Registry and model file resolution
- `encoder_model` and `decoder_model_merged`.

7. Follow-up review fixes included in this branch
- Honor `encoder_input_layout` for canonical `input_features` feeds.
- Derive `vocab_size` from the maximum tokenizer id so sparse vocabularies do not undersize decoder logits.

Regression coverage
Added or updated tests in:
- packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
- packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
- packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
- packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js

Coverage includes:
Upstream sync included
This branch was synced with `upstream/main` through commit `f65a4c7c` via merge commit `49a4af8f`.

Included upstream commits in that sync:
- 2120d13e [deno] Support both wgpu and dawn webgpu backends (#1546)
- a289e5c3 Add support for new Qwen VL models (Qwen2.5-VL, Qwen3-VL, Qwen3.5, and Qwen3.5 MoE) (#1551)
- b5b51ca9 [version] Update to 4.0.0-next.5
- 2a210d3e [deno via CDN] Fix simultaneous multi-session loading (e.g. VLMs) and support usage for image loading (#1556)
- e60a6ee3 Use ModelRegistry for pipeline file loading (#1555)
- 4331d723 Support PKV cached generation for Qwen-VL models (#1557)
- cd155a05 fix: prevent partial file reads during concurrent downloads (#1548)
- 30773fb7 Fix WASM factory blob URL loading (#1558)
- f65a4c7c feat: add fast boolean is_cached / is_pipeline_cached (#1559)

NeMo adaptation after upstream sync
Relevant branch commits after that sync include:
- ee819a1c fix(nemo-tdt): add supports() for ASR model class selection
- a85dff25 fix(nemo-tdt): address PR #12 reviewer feedback
- 8dfccddc feat(nemo-tdt): align asr pipeline outputs and long-audio handling
- 816f5811 chore(tests): drop unrelated parakeet feature extractor coverage
- f59ba068 feat(nemo-conformer-tdt): add sentence-based ASR pipeline chunking
- 00b3d934 fix(nemo): scope ASR tests and address review fixes
- 07118c38 fix(nemo-tdt): address follow-up review threads
- 341df3d7 chore(asr): restore upstream cast spacing
- 29f2baaf fix(nemo-tdt): handle sparse vocab and merge dedupe

Validation
Executed for this refresh:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
- pnpm build
- pnpm format:check on files changed by this branch: OK

Note: repo-wide `pnpm format:check` currently reports unrelated formatting issues outside this branch, but the files touched by this PR pass Prettier checks.

A prebuilt demo for this branch is available at https://ysdede.github.io/tdt-webgpu-demo to help review the current behavior, parameters, outputs, and example usage interactively.
Scope boundary
This PR stays focused on NeMo Conformer TDT integration and the follow-up work needed to:
Direct `model.transcribe()` remains the low-level API for advanced app-specific postprocessing.