feat: add Nemo Conformer TDT support (pipeline-aligned refresh) #13
Conversation
…cache helpers

Carry over non-runtime typing fixes from the prior branch while intentionally excluding the WebGPU disable_prepacking workaround in session.js.

- Cast dynamic model.transcribe access for Nemo TDT pipeline method checks/calls.
- Cast Tensor data byteLength access in transducer cache utilities.
- Add explicit tuple/object JSDoc annotations in transducer timestamp builder.

This keeps main-based v4 work clean with the latest ORT-Web on origin/main and avoids retaining the temporary encoder prepacking workaround.
- Replace legacy per-feature flags (return_token_timestamps, return_word_timestamps, return_utterance_timestamp) with a layered API: return_timestamps (utterance-level), return_words, return_tokens
- Merge duplicate outputs: words absorbs word_timestamps; tokens absorbs token_timestamps and token_ids
- Add per-token confidence, word-level confidence aggregation, utterance_confidence, and a confidence_scores summary
- Gate frame confidences behind the returnFrameConfidences flag
- Add return_metrics with encode/decode/total timing and RTF
- Add debug flags: returnFrameIndices, returnLogProbs, returnTdtSteps
- Fix vocab Map handling in getIdToTokenMap and _resolveVocabSize (tokenizer.get_vocab() returns a Map in the WASM binding)
- Update the ASR pipeline to wire timestamp_granularity to the new model flags
- Format all changed files with Prettier per CONTRIBUTING.md
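As a rough illustration of the layered contract above, output shaping might look like the following sketch. Option and field names follow this commit message; the actual model.transcribe() internals may differ.

```javascript
// Hedged sketch of the layered output API: each flag gates one layer of
// detail, and the default result is text-only.
function shapeTranscription(raw, options = {}) {
  const {
    return_timestamps = false, // utterance-level start/end + confidence
    return_words = false,      // word objects (absorb word_timestamps)
    return_tokens = false,     // token objects (absorb token_timestamps + ids)
  } = options;

  const out = { text: raw.text };
  if (return_timestamps) {
    out.start = raw.start;
    out.end = raw.end;
    out.utterance_confidence = raw.utterance_confidence;
  }
  if (return_words) out.words = raw.words;
  if (return_tokens) out.tokens = raw.tokens;
  return out;
}
```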
…ipeline

- Add roundTs() for millisecond-precision timestamp rounding at the source
- Round all confidence averages to 6 decimal places
- Round per-token and per-word confidence values
- Remove timestamp_granularity and formatting helpers from the pipeline
- Pipeline returns model.transcribe() output directly
- Auto-enable return_words and return_metrics when return_timestamps is true
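The rounding helpers might look like this minimal sketch (names taken from the commit message; precision values as stated there):

```javascript
// Hedged sketch of the rounding described above: timestamps keep
// millisecond precision, confidences keep 6 decimal places.
const roundTs = (seconds) => Math.round(seconds * 1000) / 1000;
const roundConfidence = (value) => Math.round(value * 1e6) / 1e6;
```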
…imestamps, honor return_metrics kwarg

- modeling_nemo_conformer_tdt: dispose logits and new decoder state tensors before throwing when logitsData.length < vocabSize, to prevent a resource leak
- modeling_nemo_conformer_tdt: move the returnFrameConfidences output block outside the return_timestamps guard so frame/frame_avg are emitted independently
- automatic-speech-recognition: change return_metrics from a hardcoded true to kwargs.return_metrics ?? false to respect user intent and avoid overhead
- Accept upstream restructuring: SUPPORTED_TASKS and pipeline imports moved from pipelines.js to pipelines/index.js
- Migrate the NemoConformerForTDT registration to pipelines/index.js accordingly

- Add MODEL_TYPES.NemoConformerTDT (id=16) to modeling_utils
- Register NemoConformerForTDT in MODEL_TYPE_MAPPING, MODEL_NAME_TO_CLASS_MAPPING, and MODEL_CLASS_TO_NAME_MAPPING so the base class from_pretrained, ModelRegistry, and is_pipeline_cached all recognise the model correctly
- Add a NemoConformerTDT case to get_model_files so progress_callback receives accurate file size totals for encoder_model.onnx + decoder_model_merged.onnx
Standardizes internal logging to follow the upstream convention introduced in ModelRegistry refactor.
- Guard the feature extractor against empty/short audio (NaN prevention)
- Move decoder tensor init inside the try block for safe disposal
- Add the architecture key to MODEL_TYPE_MAPPING
- Add input validation in buildTransducerDetailedOutputs
- Harden the audio cache hash against NaN samples
- Add order validation in computeTemporalDeltas
- Restore pipeline behavior: a truthy return_timestamps always enables words + metrics

- Remove all timestamp_granularity tests (the feature was removed)
- Fix option names: return_tokens, return_words, return_timestamps
- Fix output fields: tokens/words arrays, not token_ids/word_timestamps
- Verify the pipeline passes return_words + return_metrics when timestamps are on
- Add a test: return_timestamps 'word' is treated as truthy

Address reviewer findings except the return_metrics policy decision.

- Fix temporal delta concatenation to interleave per frame and add dtype validation.
- Validate the preemphasis range and clamp normalization variance in feature extraction.
- Remove unsafe encoder layout inference; require an explicit encoder_output_layout.
- Redesign the decode loop to read frame data on demand instead of eagerly materializing frames.
- Deduplicate word finalization and avoid zero-filling missing word confidences.
- Tighten tests for delta layout/type checks, the explicit layout requirement, call counts, and naming accuracy.
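The per-frame interleave fix above can be sketched on plain arrays (a hedged illustration; the real code operates on Tensor data and also validates dtypes):

```javascript
// Hedged sketch of per-frame delta interleaving: for each frame t the
// output row is [base_t..., delta_t...], i.e. concatenation along the
// feature axis, not all base frames followed by all delta frames.
// Plain nested arrays stand in for Tensor storage here.
function interleavePerFrame(base, deltas) {
  if (base.length !== deltas.length) {
    throw new Error('base and delta frame counts must match');
  }
  return base.map((row, t) => [...row, ...deltas[t]]);
}
```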
Fixes high-impact issues found in PR review validation:

- force NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT in registry overrides
- ensure encoder outputs are disposed when pre-decode validation throws
- remove stride sampling from audio cache key hashing to prevent false cache hits
- use the encoder_model selector key in get_model_files for Nemo per-component dtype/device overrides

Also adds targeted regression tests for mapping, disposal behavior, file selection, and cache key correctness.
- Clamp token end timestamps to encoder frame bounds during TDT decoding.
- Validate FeatureLRUCache constructor limits to fail fast on invalid settings.
- Add regression tests for timestamp clamping and cache limit validation.
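The clamp in the first bullet can be sketched as follows (frameDurationS is an assumed parameter name for the per-frame stride in seconds):

```javascript
// Hedged sketch: a token's end timestamp must not run past the audio
// span actually covered by the encoder output.
function clampTokenEnd(endS, frameCount, frameDurationS) {
  return Math.min(endS, frameCount * frameDurationS);
}
```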
Dispose intermediate tensors in the computeTemporalDeltas concatenate paths, and dispose replaced base input features when delta concatenation returns a new tensor.

Add regression tests that assert disposal behavior for delta concatenate flows and feature extractor reassignment.

Dispose non-essential Tensor outputs returned by decoder steps to prevent cumulative memory growth. Keep logits/state tensors alive for decoding/state transitions and dispose extras immediately.

Add a regression test to assert auxiliary decoder tensor outputs are disposed each step.
Compute encoder length directly from attention_mask.data instead of attention_mask.tolist() to avoid large transient array allocations in ASR decode hot path.
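A minimal sketch of that change (names are illustrative; the real code reads the Tensor's backing TypedArray directly):

```javascript
// Count valid (non-padding) positions straight from the mask's typed
// array, avoiding the nested JS arrays that tolist() would allocate.
function encoderLengthFromMask(maskData) {
  let length = 0;
  for (let i = 0; i < maskData.length; ++i) {
    if (maskData[i] > 0) ++length; // mask: 1 = valid frame, 0 = padding
  }
  return length;
}
```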
Fail fast when duration logits are required but missing in decoder output, and enforce positive-integer vocab size at runtime config validation. Validate prepared Nemo pipeline audio for non-empty finite samples before processor/model calls. Add regression tests for missing duration logits and non-finite audio rejection.
Fix placeholder interpolation in _prepare_model_inputs error text. Add fail-fast validation for Nemo delta_window and reject duplicate decoder output aliases in transducer io config. Add regression tests for delta_window validation and duplicate decoder output alias rejection.
Validate transcribe timeOffset as finite and guard encoderOutputs cleanup path to avoid masking primary failures. Align transducer_text JSDoc token type with runtime shape (include id). Harden Parakeet feature extractor test by using direct mask data and explicit tensor disposal via try/finally; add timeOffset validation regression test.
- fail fast on missing decoder state outputs and invalid encoder layout enums
- make FeatureLRUCache own cached tensor lifetimes (replace/evict/clear) with deduped disposal and a deterministic size fallback
- validate n_fft/win_length in the Nemo feature extractor
- align the Nemo ASR pipeline docs with the actually forwarded options
- add regression coverage for runtime config validation, non-concatenated deltas/cache behavior, missing decoder state outputs, and cache disposal semantics

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
Apply a Gemini review nit in the Nemo decode loop by replacing a redundant duration expression with Math.max(1, step).

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
Checklist (bot comment IDs):
- [x] 2892132356: guard the tokenizer.get_vocab() return type before Object.keys in _resolveVocabSize.
- [x] 2892132367: treat zero cache limits as an explicit no-cache mode; do not store/dispose just-produced values.
- [x] 2892132372: dispose processor tensors in the Nemo ASR pipeline when the cache does not own lifetimes.

Added regression tests for vocab resolution fallback, zero-limit cache semantics, and Nemo pipeline tensor ownership behavior.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
- widen the confidenceFromLogits input type to Tensor data arrays
- narrow feature_cache access with an explicit typed cast in the ASR pipeline
Checklist (bot comment IDs):
- [x] 2892287484: handle an array-returning tokenizer vocab in _resolveVocabSize.
- [x] 2892322884: avoid disposing when re-setting the same object for an existing cache key.
- [x] 2892322906: skip caching oversized values to prevent insert-then-dispose of caller-owned tensors.
- [x] 2892322910: guard the byteLength type in estimateSizeBytes.

Added regression tests for array vocab sizing, same-object set behavior, oversized value skipping, and non-numeric byteLength handling.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
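Checklist item 2892322910 might be addressed along these lines (a hedged sketch; the estimateSizeBytes signature and entry shape here are assumptions, not the real API):

```javascript
// Hedged sketch of guarding byteLength: only finite numeric byte counts
// contribute to the size estimate; anything else is skipped.
function estimateSizeBytes(tensors) {
  let total = 0;
  for (const t of Object.values(tensors ?? {})) {
    const bytes = t?.data?.byteLength;
    if (typeof bytes === 'number' && Number.isFinite(bytes)) {
      total += bytes;
    }
  }
  return total;
}
```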
📝 Walkthrough

This PR introduces a complete Nemo Conformer TDT (Token-and-Duration Transducer) ASR model integration, including feature extraction with configurable caching, encoder/decoder orchestration with frame-based decoding, a processing pipeline, and pipeline dispatch. A new model type is registered and the ASR pipeline is extended with Nemo support.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App
    participant Pipeline as ASR Pipeline
    participant Processor
    participant FeatureExtractor
    participant Model as Nemo Conformer TDT
    participant Encoder
    participant Decoder
    participant Tokenizer
    App->>Pipeline: audio + decode_options
    Pipeline->>Processor: audio
    Processor->>FeatureExtractor: audio
    FeatureExtractor->>FeatureExtractor: compute log-Mel spectrogram
    FeatureExtractor->>FeatureExtractor: apply optional deltas
    FeatureExtractor->>FeatureExtractor: frame normalization
    FeatureExtractor-->>Processor: input_features, attention_mask
    Processor-->>Pipeline: processed inputs
    Pipeline->>Model: inputs → transcribe()
    Model->>Encoder: input_features, attention_mask
    Encoder-->>Model: encoder_output, frame_count
    Model->>Model: frame-based greedy decoding loop
    loop Per Frame
        Model->>Decoder: frame_data, decoder_state
        Decoder-->>Model: logits, updated_state
        Model->>Model: argmax token, compute confidence
        Model->>Model: update timestamp, accumulate token
    end
    Model->>Tokenizer: token_ids
    Tokenizer-->>Model: decoded text
    Model->>Model: align tokens to text → word objects
    Model->>Model: (optional) segment words → chunks
    Model-->>Pipeline: transcription result (text, tokens, words, timestamps)
    Pipeline-->>App: final output
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Align the Nemo ASR pipeline with the shared task contract by returning text-only results by default and chunk-based timestamps for segment and word modes. Add automatic long-audio windowing, decoded-text-driven word reconstruction, and model-local helpers for window merge and chunk assembly. Also add regression coverage for numeric/punctuation word boundaries, windowed merge behavior, and auto-windowed long-form pipeline decoding.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands transformers.js by introducing comprehensive, end-to-end support for Nemo Conformer TDT Automatic Speech Recognition models. It enables seamless integration of these models across browser and Node.js environments, providing both a standardized ASR pipeline interface and access to the model's advanced native capabilities. The changes include specialized feature extraction, efficient long-audio processing, and refined word reconstruction, all backed by extensive testing to ensure reliability and accuracy.

Highlights
```javascript
 * `feature_size` (e.g. 80 or 128 mel bins via `preprocessor_config.json`).
 */
export class NemoConformerTDTFeatureExtractor extends FeatureExtractor {
  constructor(config) {
```
MEDIUM: Missing validation for hop_length in constructor. Unlike n_fft and win_length which are validated (lines 20-33), hop_length is used directly in _extract (line 163) without validation. Invalid hop_length (0, negative, or non-integer) would cause runtime errors or incorrect feature dimensions.
Recommended fix: add validation in the constructor:

```javascript
if (!Number.isInteger(this.config.hop_length) || this.config.hop_length <= 0) {
  throw new Error(
    `NemoConformerTDTFeatureExtractor expected \`hop_length\` as a positive integer, got ${this.config.hop_length}.`,
  );
}
```
Code Review Summary

Status: 5 Issues Already Flagged | Recommendation: Address existing issues before merge

Overview

Additional Observations

After thorough analysis of this PR adding Nemo Conformer TDT support, I found the following:

- Test Coverage (Good)
- Security Review
- Performance Review
- Reliability
- One Additional Note (Not in Diff)

Files Reviewed (10+ files)
Code Review
This is an extensive and well-executed pull request that adds end-to-end support for Nemo Conformer TDT models. The changes are well-structured, with clear separation of concerns across new modules for feature extraction, model implementation, pipeline integration, and various transducer-related utilities. The implementation of the TDT decoder, long-audio windowing, and the improved word reconstruction logic is particularly impressive and robust. The addition of comprehensive unit tests, including mocked components for isolated testing, significantly increases confidence in the new functionality.
My review found the code to be of high quality. I have only a couple of minor suggestions regarding error handling style for consistency. Overall, great work on this significant feature addition!
```javascript
if (stride_length_s === null) {
  stride_length_s = chunk_length_s / 6;
} else if (!(stride_length_s >= 0)) {
  throw Error('`stride_length_s` must be non-negative.');
}
```
```javascript
  throw Error('`stride_length_s` must be non-negative.');
}
if (chunk_length_s <= 2 * stride_length_s) {
  throw Error('`chunk_length_s` must be larger than `2 * stride_length_s` for Nemo windowed decoding.');
}
```
For consistency and best practice, it's better to use new Error() to create an error object instance rather than calling Error() as a function.
```diff
- throw Error('`chunk_length_s` must be larger than `2 * stride_length_s` for Nemo windowed decoding.');
+ throw new Error('`chunk_length_s` must be larger than `2 * stride_length_s` for Nemo windowed decoding.');
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8dfccddc4d
```javascript
if (!cacheOwnsTensors) {
  const seen = new Set();
  for (const value of Object.values(inputs ?? {})) {
    if (value instanceof Tensor && !seen.has(value)) {
      value.dispose();
```
Dispose uncached processor tensors when feature cache skips entry
runNemoConformerTDTPipeline decides tensor ownership from cache limits (max_entries/max_size_mb) instead of whether the current feature extraction result was actually cached. FeatureLRUCache.set explicitly skips caching oversized values while leaving ownership with the caller, so this if (!cacheOwnsTensors) guard can skip disposal for large windows and leak input_features/attention_mask tensors across calls when caching is enabled but entries do not fit.
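One way to address this finding is to tie disposal to whether the cache actually stored the entry, as sketched below. This is a hypothetical shape: the cache.set return value and tensor.dispose usage here are assumptions, not the real FeatureLRUCache API.

```javascript
// Hedged sketch: let set() report whether it stored the entry, and have
// the caller dispose tensors the cache declined (e.g. oversized values),
// deduplicating shared tensor references before disposal.
function cacheOrDispose(cache, key, tensors) {
  const cached = cache.set(key, tensors); // assumed: true only when stored
  if (!cached) {
    for (const t of new Set(Object.values(tensors))) t.dispose();
  }
  return cached;
}
```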
```javascript
if (name === 'input_features') {
  feeds[name] = inputFeatures;
  continue;
```
Honor encoder_input_layout for input_features feeds
This branch always forwards input_features as-is and bypasses transducer.encoder_input_layout; transpose handling is only applied in the separate audio_signal branch. That means configs that legitimately specify encoder_input_layout: 'BFT' but use input_features as the encoder input name will silently receive BTF tensors and produce incorrect decoding despite passing config validation.
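A fix in the direction this comment suggests would apply the configured layout regardless of the encoder input name. The sketch below is hypothetical: the real code operates on Tensor objects with a transpose op, while plain nested arrays stand in here.

```javascript
// Hedged sketch: honor encoder_input_layout for every feed name, so a
// config declaring 'BFT' gets transposed features even when the graph's
// input is called input_features.
function prepareEncoderFeed(features, encoder_input_layout) {
  if (encoder_input_layout === 'BFT') {
    return transposeBTFtoBFT(features);
  }
  return features; // 'BTF' passes through unchanged
}

// Naive [B][T][F] -> [B][F][T] transpose, for illustration only.
function transposeBTFtoBFT(batch) {
  return batch.map((tf) => tf[0].map((_, f) => tf.map((row) => row[f])));
}
```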
Remove the standalone parakeet feature extractor test from this branch. It exercises an existing parakeet_ctc path that is outside the scope of Conformer TDT integration and makes the PR look broader than it is.
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 4
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 39b36607-a412-4780-b0dc-0545844a16f2
📒 Files selected for processing (23)
- packages/transformers/src/models/feature_extractors.js
- packages/transformers/src/models/modeling_utils.js
- packages/transformers/src/models/models.js
- packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
- packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
- packages/transformers/src/models/processors.js
- packages/transformers/src/models/registry.js
- packages/transformers/src/pipelines/automatic-speech-recognition.js
- packages/transformers/src/pipelines/index.js
- packages/transformers/src/utils/model_registry/get_model_files.js
- packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
- packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
- packages/transformers/tests/models/parakeet/test_feature_extraction_parakeet.js
- packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
📜 Review details
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Applied to files:
packages/transformers/src/models/feature_extractors.jspackages/transformers/src/models/models.jspackages/transformers/src/models/processors.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_text.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.jspackages/transformers/src/pipelines/index.jspackages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/src/models/registry.jspackages/transformers/src/pipelines/automatic-speech-recognition.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Applied to files:
packages/transformers/src/models/feature_extractors.jspackages/transformers/src/models/models.jspackages/transformers/src/models/processors.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_text.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.jspackages/transformers/src/pipelines/index.jspackages/transformers/tests/models/parakeet/test_feature_extraction_parakeet.jspackages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_cache.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/src/models/registry.jspackages/transformers/src/utils/model_registry/get_model_files.jspackages/transformers/src/pipelines/automatic-speech-recognition.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Applied to files:
packages/transformers/src/models/feature_extractors.jspackages/transformers/src/models/models.jspackages/transformers/src/models/processors.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_text.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.jspackages/transformers/src/pipelines/index.jspackages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_cache.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/src/models/registry.jspackages/transformers/src/utils/model_registry/get_model_files.jspackages/transformers/src/pipelines/automatic-speech-recognition.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.
Applied to files:
packages/transformers/src/models/feature_extractors.jspackages/transformers/src/models/models.jspackages/transformers/src/models/processors.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_text.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/src/pipelines/index.jspackages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_cache.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/src/models/registry.jspackages/transformers/src/utils/model_registry/get_model_files.jspackages/transformers/src/pipelines/automatic-speech-recognition.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T23:43:18.376Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Applied to files:
packages/transformers/src/models/feature_extractors.js
packages/transformers/src/models/models.js
packages/transformers/src/models/processors.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
packages/transformers/src/pipelines/index.js
packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
packages/transformers/src/models/modeling_utils.js
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
packages/transformers/src/models/registry.js
packages/transformers/src/utils/model_registry/get_model_files.js
packages/transformers/src/pipelines/automatic-speech-recognition.js
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Applied to files:
packages/transformers/src/models/feature_extractors.js
packages/transformers/src/models/models.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
packages/transformers/src/pipelines/index.js
packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
packages/transformers/src/models/modeling_utils.js
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
packages/transformers/src/models/registry.js
packages/transformers/src/utils/model_registry/get_model_files.js
packages/transformers/src/pipelines/automatic-speech-recognition.js
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.
Applied to files:
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
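For illustration, a full-iteration FNV-1a key of the kind this learning describes could look like the following sketch. `fnv1aAudioKey` is a hypothetical name for this note only, not the actual `createAudioCacheKey` implementation:

```javascript
// Hypothetical sketch of a full-iteration FNV-1a cache key over an audio
// buffer. Every byte is hashed (O(n)) to minimize cross-audio collisions,
// mirroring the design choice the learning documents.
function fnv1aAudioKey(audio) {
  let hash = 0x811c9dc5; // 32-bit FNV offset basis
  const bytes = new Uint8Array(audio.buffer, audio.byteOffset, audio.byteLength);
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193); // 32-bit FNV prime, overflow-safe multiply
  }
  // Prefix with the sample count so clips of different lengths never share a key.
  return `${audio.length}:${(hash >>> 0).toString(16)}`;
}

const a = new Float32Array([0.1, 0.2, 0.3]);
const c = new Float32Array([0.1, 0.2]);
const keyA = fnv1aAudioKey(a);
const keyC = fnv1aAudioKey(c);
```

The length prefix is an illustrative extra safeguard; the real implementation may encode length differently.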
📚 Learning: 2026-03-03T22:59:55.984Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T22:59:55.984Z
Learning: When a model subclass overrides from_pretrained and bypasses the generic model-type branch, do not introduce or rely on a MODEL_TYPES entry for that subclass in the model registry. For this NemoConformerTDT case, in packages/transformers/src/utils/model_registry/get_model_files.js, avoid adding a NemoConformerTDT entry in the model type map; rely on the override logic in modeling_nemo_conformer_tdt.js. This keeps the registry explicit to the actual file and prevents unintended dispatch through the generic branch.
Applied to files:
packages/transformers/src/utils/model_registry/get_model_files.js
🔇 Additional comments (31)
packages/transformers/tests/models/parakeet/test_feature_extraction_parakeet.js (3)
1-4: Imports are correct. Relative paths are appropriate for the test file location.
5-20: Test setup is well-structured. Config values are consistent with standard Parakeet feature extraction parameters. The `beforeAll` pattern correctly scopes extractor instantiation.
22-47: Test logic is sound with proper resource management. The try/finally pattern ensures tensor disposal regardless of assertion outcome. Shape and validity assertions align with the expected output structure from `ParakeetFeatureExtractor._call`.

One observation on line 35: `attention_mask.data` is a `BigInt64Array`. The `Number(x)` coercion works correctly for mask values (0n/1n), but be aware this pattern would silently lose precision for values exceeding `Number.MAX_SAFE_INTEGER`. Not an issue here given the mask domain.

packages/transformers/src/models/modeling_utils.js (1)

121-121: LGTM: New model type enum value. The `NemoConformerTDT: 16` addition correctly extends the `MODEL_TYPES` enum. Based on learnings, `NemoConformerForTDT` overrides `from_pretrained` and explicitly calls `constructSessions`, so no corresponding `MODEL_TYPE_CONFIG` entry is required.

packages/transformers/src/models/feature_extractors.js (1)
8-8: LGTM: Feature extractor export. Standard barrel re-export following existing alphabetical ordering conventions.

packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js (1)

1-19: LGTM: Processor implementation. The processor correctly delegates audio preprocessing to the feature extractor. The `tokenizer_class` declaration is standard even if unused in `_call`; it ensures the tokenizer is loaded and available for downstream model operations.

packages/transformers/src/models/processors.js (1)

11-11: LGTM: Processor export. Standard barrel re-export following alphabetical ordering.

packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (2)
9-25: LGTM: Cache key generation. The FNV-1a hash implementation is correct. The full sample iteration is intentional per documented design to minimize cache-key collisions.

167-177: Hardcoded field names are tightly coupled to feature extractor output. The `collectCachedTensors` function hardcodes field names (`input_features`, `attention_mask`, `delta_features`, `delta_delta_features`) that must match the feature extractor's return type. This coupling is acceptable given the cache is purpose-built for Nemo Conformer TDT, but any future changes to the feature extractor output schema will require parallel updates here.

packages/transformers/src/models/models.js (1)

106-106: LGTM: Model export. Standard barrel re-export following alphabetical ordering.

packages/transformers/src/utils/model_registry/get_model_files.js (1)
180-182: LGTM: NemoConformerTDT file resolution. The file list correctly specifies `encoder_model` and `decoder_model_merged`, matching the session construction in `NemoConformerForTDT.from_pretrained`. The absence of `generation_config.json` is appropriate for transducer-based models.

packages/transformers/src/pipelines/index.js (2)

33-33: LGTM: Import for pipeline integration. Direct import of `NemoConformerForTDT` from the modeling module is correct for explicit model class registration.

154-154: LGTM: Pipeline model registration. Adding `NemoConformerForTDT` to the `automatic-speech-recognition` model array enables the pipeline to accept and dispatch Nemo Conformer TDT models. The `_call` method in `AutomaticSpeechRecognitionPipeline` already handles the `nemo-conformer-tdt` model_type case via `_call_nemo_conformer_tdt`.

packages/transformers/src/pipelines/automatic-speech-recognition.js (2)
6-8: LGTM - Clean pipeline integration. The Nemo Conformer TDT routing and delegation to `runNemoConformerTDTPipeline` is correctly implemented. The `prepareAudios` function is passed as a reference without binding, matching the expected call signature in the pipeline implementation.

Also applies to: 158-159, 308-324

341-341: Minor formatting adjustment; no functional change.

packages/transformers/src/models/registry.js (1)
44-44: LGTM - Correct registry wiring for dual-artifact model. The encoder-only mapping enables AutoModel lookup, while the CUSTOM_MAPPING override ensures `MODEL_TYPES.NemoConformerTDT` (value 16) is used at runtime for proper two-artifact file resolution (`encoder_model` + `decoder_model_merged`).

Also applies to: 584-586

packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js (2)

9-13: LGTM - Correct tokenizer interface usage. The guard for empty arrays prevents the tokenizer's "must be a non-empty array" error. The `{ skip_special_tokens: true }` option matches the expected signature per `PreTrainedTokenizer.decode()`.

28-31: LGTM - Clean delegation to word offset builder.

packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js (4)
129-180: Unit test mocks are adequate for pipeline behavior testing. The simplified mock processor returning `{ input_features: {} }` is acceptable since `model.transcribe()` is also mocked and doesn't consume the inputs. These tests verify pipeline orchestration, not model inference.

229-315: LGTM - Window merging test with correct offset calculation. The test correctly expects `timeOffset: 1` for the second window: with `chunk_length_s=2` and `stride_length_s=0.5`, `jump = 32000 - 16000 = 16000` samples = 1 second.

396-473: LGTM - Auto-window test with correct offset progression. For 300 s of audio with `NEMO_AUTO_CHUNK_LENGTH_S=90` and `NEMO_AUTO_STRIDE_LENGTH_S=10`: `jump = 1440000 - 320000 = 1120000` samples = 70 s. Four windows at offsets `[0, 70, 140, 210]` is correct.
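As a sanity check, the offset arithmetic these two tests assert can be reproduced with a small sketch. `windowStartOffsets` is a hypothetical helper written for this note, not the pipeline's actual API:

```javascript
// Sketch of the windowing arithmetic verified above: each window spans
// `chunk` samples and the start position advances by jump = chunk - 2 * stride.
function windowStartOffsets(totalSamples, samplingRate, chunkLengthS, strideLengthS) {
  const chunk = Math.floor(samplingRate * chunkLengthS);
  const stride = Math.floor(samplingRate * strideLengthS);
  const jump = chunk - 2 * stride;
  const offsets = [];
  let offset = 0;
  for (;;) {
    offsets.push(offset / samplingRate); // window start time in seconds
    if (offset + chunk >= totalSamples) break; // last window reaches the end
    offset += jump;
  }
  return offsets;
}

// 300 s of 16 kHz audio, 90 s chunks, 10 s stride: jump = 70 s.
const autoOffsets = windowStartOffsets(300 * 16000, 16000, 90, 10);
// 4 s of audio, 2 s chunks, 0.5 s stride: jump = 1 s.
const smallOffsets = windowStartOffsets(4 * 16000, 16000, 2, 0.5);
```

With these numbers the sketch yields the `[0, 70, 140, 210]` progression and the `timeOffset: 1` second window discussed above.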
482-604: LGTM - Tensor lifecycle tests verify disposal semantics. Tests correctly verify disposal behavior across three scenarios: cache disabled (disposes), cache enabled with ownership (no dispose), and cache limits disabling caching (disposes).

packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js (2)

9-70: LGTM - Correct delta coefficient computation. The implementation follows the standard regression-based delta formula with proper boundary clamping. The denominator `2 * sum(n^2)` matches the normalization convention. Disposal of intermediate tensors is handled correctly.

72-91: LGTM - Frame-by-frame interleaving for concatenated output. The interleaving produces the `[batch, T, F * items.length]` shape as expected, with proper validation of input array lengths.

packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js (3)
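For readers unfamiliar with the formula referenced above, here is a plain-array sketch of the boundary-clamped regression delta with the `2 * sum(n^2)` denominator. `computeDeltas` is illustrative only, not the tensor-based implementation in `transducer_deltas.js`:

```javascript
// Standard regression-based delta: d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n])
// divided by 2 * sum_{n=1..N} n^2, with frame indices clamped at the edges.
function computeDeltas(frames, N = 2) {
  const T = frames.length;
  const F = frames[0].length;
  let denom = 0;
  for (let n = 1; n <= N; n++) denom += n * n;
  denom *= 2; // 2 * sum(n^2)
  const deltas = [];
  for (let t = 0; t < T; t++) {
    const row = new Array(F).fill(0);
    for (let n = 1; n <= N; n++) {
      const next = frames[Math.min(t + n, T - 1)]; // clamp upper boundary
      const prev = frames[Math.max(t - n, 0)]; // clamp lower boundary
      for (let f = 0; f < F; f++) row[f] += n * (next[f] - prev[f]);
    }
    for (let f = 0; f < F; f++) row[f] /= denom;
    deltas.push(row);
  }
  return deltas;
}

// A linear ramp has a constant interior slope of 1 per frame.
const ramp = [[0], [1], [2], [3], [4], [5]];
const d = computeDeltas(ramp, 2);
```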
100-131: Test correctly verifies internal disposal behavior. The monkey-patching of `Tensor.prototype.dispose` is appropriate for verifying that the feature extractor properly disposes intermediate tensors when producing concatenated delta outputs. The `finally` block ensures proper restoration.

133-188: LGTM - Cache behavior tests verify tensor identity and lifecycle. Tests correctly verify that cached results return the same tensor references (`===`) and that cache statistics are properly maintained.

190-225: LGTM - Validation tests cover construction and runtime errors. Preemphasis validation at runtime and delta_window/n_fft/win_length validation at construction time are properly tested.

packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js (1)

1-8: LGTM - Clean barrel consolidating Nemo Conformer TDT exports. Re-exports correctly aggregate all internal utilities for backwards-compatible imports.

packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js (3)
11-57: LGTM - Correct window spec generation. The validation chain (`chunk_length_s > 2 * stride_length_s`) and window iteration with `jump = window - 2 * stride` correctly produce overlapping windows for Nemo TDT decoding.
59-113: LGTM - Overlap filtering and deduplication logic. `shouldKeepTimedItem` uses midpoint-based filtering for stride regions. Deduplication prefers longer-duration items on collision, which is a reasonable heuristic for window boundary artifacts.
120-179: LGTM - Window result merging correctly aggregates and deduplicates. The merge function properly filters by stride boundaries, deduplicates, reconstructs text from words (preferred) or tokens (fallback), and extracts utterance timestamps.
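To make the midpoint rule concrete, here is a hedged sketch. The name `keepByMidpoint` and the `isFirst`/`isLast` parameters are illustrative, not the actual `shouldKeepTimedItem` signature:

```javascript
// A timed item survives a window only if its midpoint falls inside the
// window's "owned" region, i.e. outside the overlapping stride margins.
// The first window keeps its leading margin; the last keeps its trailing one.
function keepByMidpoint(item, windowStartS, windowEndS, strideS, isFirst, isLast) {
  const mid = (item.start_time + item.end_time) / 2;
  const lo = isFirst ? windowStartS : windowStartS + strideS;
  const hi = isLast ? windowEndS : windowEndS - strideS;
  return mid >= lo && mid < hi;
}

// A 0-2 s window with a 0.5 s stride: an item centred at 0.25 s belongs to
// the first window but is rejected by an interior window covering the same span.
const item = { start_time: 0.2, end_time: 0.3 };
const keptFirst = keepByMidpoint(item, 0, 2, 0.5, true, false);
const keptInterior = keepByMidpoint(item, 0, 2, 0.5, false, false);
```

Assigning each item to exactly one window by midpoint is what makes the later deduplication a fallback for boundary jitter rather than the primary merge mechanism.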
if (!Number.isInteger(this.config.n_fft) || this.config.n_fft <= 0) {
  throw new Error(
    `NemoConformerTDTFeatureExtractor expected \`n_fft\` as a positive integer, got ${this.config.n_fft}.`,
  );
}
if (
  !Number.isInteger(this.config.win_length) ||
  this.config.win_length <= 0 ||
  this.config.win_length > this.config.n_fft
) {
  throw new Error(
    `NemoConformerTDTFeatureExtractor expected \`win_length\` in [1, n_fft], got win_length=${this.config.win_length}, n_fft=${this.config.n_fft}.`,
  );
}

// Prefer given `mel_filters` from preprocessor_config.json, or calculate them if they don't exist.
this.config.mel_filters ??= mel_filter_bank(
  Math.floor(1 + this.config.n_fft / 2), // num_frequency_bins
  this.config.feature_size, // num_mel_filters
  0.0, // min_frequency
  this.config.sampling_rate / 2, // max_frequency
  this.config.sampling_rate, // sampling_rate
  'slaney', // norm
  'slaney', // mel_scale
);
Fail fast on the remaining spectrogram parameters.
Lines 36-43 and 163 depend on feature_size, sampling_rate, and hop_length, but only n_fft and win_length are validated. A malformed preprocessor_config.json can currently produce NaN mel filters or invalid frame stepping instead of a clear configuration error.
Proposed fix
if (
!Number.isInteger(this.config.win_length) ||
this.config.win_length <= 0 ||
this.config.win_length > this.config.n_fft
) {
throw new Error(
`NemoConformerTDTFeatureExtractor expected \`win_length\` in [1, n_fft], got win_length=${this.config.win_length}, n_fft=${this.config.n_fft}.`,
);
}
+ if (!Number.isInteger(this.config.hop_length) || this.config.hop_length <= 0) {
+ throw new Error(
+ `NemoConformerTDTFeatureExtractor expected \`hop_length\` as a positive integer, got ${this.config.hop_length}.`,
+ );
+ }
+ if (!Number.isFinite(this.config.sampling_rate) || this.config.sampling_rate <= 0) {
+ throw new Error(
+ `NemoConformerTDTFeatureExtractor expected \`sampling_rate\` as a positive number, got ${this.config.sampling_rate}.`,
+ );
+ }
+ if (!Number.isInteger(this.config.feature_size) || this.config.feature_size <= 0) {
+ throw new Error(
+ `NemoConformerTDTFeatureExtractor expected \`feature_size\` as a positive integer, got ${this.config.feature_size}.`,
+ );
+ }

Also applies to: 162-163
const output = await runNemoTranscribe(window.audio, {
  tokenizer,
  return_timestamps: true,
  return_words: true,
  return_tokens: true,
  return_metrics: false,
  timeOffset: window.start_s,
});
windowResults.push({ window, output });
}

const merged = mergeNemoWindowResults(tokenizer, windowResults);
const result = { text: merged.text || windowResults.map((x) => x.output.text ?? '').join(' ').trim() };
if (wantWordTimestamps) {
  result.chunks = buildWordChunks(merged.words);
} else if (wantTimestampChunks) {
  result.chunks = buildNemoSegmentChunks(merged.words, merged.utterance_timestamp, result.text);
}
Do not strip the native Nemo payload out of pipeline responses.
Lines 133 and 154 force return_metrics: false, and Lines 140-145 / 157-162 rebuild the response as { text, chunks? }. That drops is_final, utterance_confidence, utterance_timestamp, confidence_scores, and the metrics payload from the public pipeline result. Preserve the model.transcribe() object in the unchunked path, and aggregate native fields instead of discarding them in the chunked path.
Based on learnings, _call_nemo_conformer_tdt intentionally hardcodes return_metrics: true as part of the Nemo Conformer TDT pipeline contract.
Also applies to: 150-163
_evict() {
  const max_bytes = this.max_size_mb * 1024 * 1024;
  while (this.cache.size > this.max_entries || this.current_size_bytes > max_bytes) {
    const oldest_key = this.cache.keys().next().value;
    if (oldest_key === undefined) break;
    const oldest = this.cache.get(oldest_key);
    this.cache.delete(oldest_key);
    disposeCachedValue(oldest?.value);
    this.current_size_bytes -= oldest?.size_bytes ?? 0;
  }
}
🧹 Nitpick | 🔵 Trivial
Defensive optional chaining is unnecessary but harmless.
Line 133-134 uses oldest?.value and oldest?.size_bytes defensively. Given the while condition guarantees cache.size > 0 when entering the loop body, and oldest_key is checked against undefined at line 130, oldest retrieved at line 131 will always exist. The optional chaining is superfluous but does not cause incorrect behavior.
if (tokenizer?.get_vocab) {
  const vocab = tokenizer.get_vocab();
  const entries = vocab instanceof Map ? vocab.entries() : Object.entries(vocab);
  for (const [token, id] of entries) {
    if (Number.isInteger(id)) {
      cached.set(id, token);
    }
  }
}
TOKEN_ID_TO_TEXT_CACHE.set(tokenizer, cached);
Handle both token→id and id→token vocab shapes.
Line 19 assumes each entry is [token, id], leaving TOKEN_ID_TO_TEXT_CACHE empty when get_vocab() returns id-keyed objects like { 1: "▁score", 2: "." }. This breaks boundary marker detection (▁ / Ġ) in resolveTokenPiece(), degrading word segmentation in production. Normalize vocab orientation by detecting key type, not position.
Proposed fix
cached = new Map();
if (tokenizer?.get_vocab) {
const vocab = tokenizer.get_vocab();
const entries = vocab instanceof Map ? vocab.entries() : Object.entries(vocab);
- for (const [token, id] of entries) {
- if (Number.isInteger(id)) {
- cached.set(id, token);
- }
+ for (const [a, b] of entries) {
+ if (typeof a === 'string' && Number.isInteger(b)) {
+ cached.set(b, a);
+ continue;
+ }
+ const id = typeof a === 'number' ? a : Number(a);
+ if (Number.isInteger(id) && typeof b === 'string') {
+ cached.set(id, b);
+ }
}
}
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 816f581180
const window = Math.floor(sampling_rate * chunk_length_s);
const stride = Math.floor(sampling_rate * stride_length_s);
const jump = window - 2 * stride;
Reject chunk sizes that quantize to zero samples
buildNemoWindowSpecs floors chunk_length_s/stride_length_s to integer samples but never validates the quantized values; when chunk_length_s is very small (e.g. < 1 / sampling_rate), window and jump can become 0, so the while (true) loop never advances offset and the ASR pipeline hangs indefinitely instead of failing fast.
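A fail-fast guard along the lines this comment suggests might look like the following sketch. `validateWindowParams` is a hypothetical helper, not existing code in `transducer_window_merge.js`:

```javascript
// Reject configurations whose quantized window or jump collapses to zero
// samples, so the caller errors out instead of looping forever.
function validateWindowParams(samplingRate, chunkLengthS, strideLengthS) {
  const window = Math.floor(samplingRate * chunkLengthS);
  const stride = Math.floor(samplingRate * strideLengthS);
  const jump = window - 2 * stride;
  if (window <= 0) {
    throw new Error(`chunk_length_s=${chunkLengthS} quantizes to ${window} samples`);
  }
  if (jump <= 0) {
    throw new Error(`chunk_length_s must exceed 2 * stride_length_s after quantization (jump=${jump})`);
  }
  return { window, stride, jump };
}

let threw = false;
try {
  validateWindowParams(16000, 0.00005, 0); // 0.8 samples floors to 0
} catch {
  threw = true;
}
const ok = validateWindowParams(16000, 2, 0.5); // jump = 32000 - 16000
```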
prev &&
prev.text === word.text &&
word.start_time < prev.end_time
) {
Avoid collapsing repeated words during overlap de-dup
dedupeMergedWords drops any adjacent entries that share the same text and overlap in time, which can delete legitimate repetitions (e.g. "no no") near chunk boundaries when timestamp jitter causes slight overlap; this silently corrupts merged transcripts for windowed decoding, not just duplicate overlap artifacts.
Actionable comments posted: 4
♻️ Duplicate comments (1)
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js (1)
16-24: ⚠️ Potential issue | 🟠 Major: Handle both token→id and id→token vocab shapes.

This was flagged in a prior review iteration. The current implementation at line 19 assumes `[token, id]` ordering. When `get_vocab()` returns an id-keyed object (e.g., `{ 1: "▁score", 2: "." }`), `Object.entries` yields `[["1", "▁score"], ...]` where the key is a stringified id. The destructure assigns `token="1"` and `id="▁score"`, causing `Number.isInteger(id)` to fail and leaving the cache empty.

This breaks boundary marker detection (`▁` / `Ġ`) in `resolveTokenPiece()`, degrading word segmentation.

Proposed fix
  cached = new Map();
  if (tokenizer?.get_vocab) {
    const vocab = tokenizer.get_vocab();
-   const entries = vocab instanceof Map ? vocab.entries() : Object.entries(vocab);
-   for (const [token, id] of entries) {
-     if (Number.isInteger(id)) {
-       cached.set(id, token);
-     }
+   if (vocab instanceof Map) {
+     for (const [token, id] of vocab.entries()) {
+       if (Number.isInteger(id)) {
+         cached.set(id, token);
+       }
+     }
+   } else if (Array.isArray(vocab)) {
+     vocab.forEach((token, id) => cached.set(id, token));
+   } else if (vocab && typeof vocab === 'object') {
+     for (const [key, val] of Object.entries(vocab)) {
+       if (typeof key === 'string' && Number.isInteger(val)) {
+         cached.set(val, key);
+       } else {
+         const numKey = Number(key);
+         if (Number.isInteger(numKey) && typeof val === 'string') {
+           cached.set(numKey, val);
+         }
+       }
+     }
    }
  }
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cae8f5ec-d592-49c3-b335-8285e4155a59
📒 Files selected for processing (22)
packages/transformers/src/models/feature_extractors.js
packages/transformers/src/models/modeling_utils.js
packages/transformers/src/models/models.js
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
packages/transformers/src/models/processors.js
packages/transformers/src/models/registry.js
packages/transformers/src/pipelines/automatic-speech-recognition.js
packages/transformers/src/pipelines/index.js
packages/transformers/src/utils/model_registry/get_model_files.js
packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
📜 Review details
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Applied to files:
packages/transformers/src/models/processors.js, packages/transformers/src/models/feature_extractors.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/models/modeling_utils.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/registry.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, packages/transformers/src/models/models.js
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.
Applied to files:
packages/transformers/src/models/processors.js, packages/transformers/src/models/feature_extractors.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/models/modeling_utils.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/registry.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, packages/transformers/src/models/models.js
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Applied to files:
packages/transformers/src/models/processors.js, packages/transformers/src/models/feature_extractors.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/models/modeling_utils.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/registry.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, packages/transformers/src/models/models.js
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Applied to files:
packages/transformers/src/models/processors.js, packages/transformers/src/models/feature_extractors.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/models/modeling_utils.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/utils/model_registry/get_model_files.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/registry.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, packages/transformers/src/models/models.js
📚 Learning: 2026-03-05T23:43:18.376Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Applied to files:
packages/transformers/src/models/processors.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/models/modeling_utils.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/utils/model_registry/get_model_files.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/registry.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/src/models/models.js
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Applied to files:
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/models/modeling_utils.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/utils/model_registry/get_model_files.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/registry.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/src/models/models.js
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.
Applied to files:
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
📚 Learning: 2026-03-03T22:59:55.984Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T22:59:55.984Z
Learning: When a model subclass overrides from_pretrained and bypasses the generic model-type branch, do not introduce or rely on a MODEL_TYPES entry for that subclass in the model registry. For this NemoConformerTDT case, in packages/transformers/src/utils/model_registry/get_model_files.js, avoid adding a NemoConformerTDT entry in the model type map; rely on the override logic in modeling_nemo_conformer_tdt.js. This keeps the registry explicit to the actual file and prevents unintended dispatch through the generic branch.
Applied to files:
packages/transformers/src/utils/model_registry/get_model_files.js
🔇 Additional comments (44)
packages/transformers/src/models/processors.js (1)
11-11: LGTM. Export follows the established alphabetical ordering convention and correctly surfaces the Nemo Conformer TDT processor module.
packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js (2)
1-70: LGTM. The delta computation implementation is correct:
- Input validation is thorough (Tensor type, dims, window, order, dtype checks)
- The finite difference formula with boundary clamping (lines 36-37) is standard
- Memory management properly disposes intermediate tensors when concatenating
- Recursive delta-delta computation via order=2 is clean
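The clamped finite-difference scheme can be illustrated with a minimal sketch over plain arrays (the actual implementation operates on `Tensor` objects; `computeDeltas` and the regression-window normalizer here are assumptions, not the reviewed code):

```javascript
// Sketch: boundary-clamped delta features over a [T][F] frame array.
// d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
// with indices clamped at the edges instead of zero-padded.
function computeDeltas(frames, window = 2) {
  if (frames.length === 0) return [];
  const T = frames.length;
  const F = frames[0].length;
  let denom = 0;
  for (let n = 1; n <= window; ++n) denom += n * n;
  denom *= 2;
  const out = [];
  for (let t = 0; t < T; ++t) {
    const row = new Float32Array(F);
    for (let n = 1; n <= window; ++n) {
      // Clamp indices at the boundaries.
      const prev = frames[Math.max(0, t - n)];
      const next = frames[Math.min(T - 1, t + n)];
      for (let f = 0; f < F; ++f) row[f] += (n * (next[f] - prev[f])) / denom;
    }
    out.push(row);
  }
  return out;
}
```

Delta-deltas follow by applying the same function to its own output, matching the recursive `order = 2` path noted above.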
72-91: LGTM. The `interleaveByFrame` helper correctly validates input lengths and performs frame-wise interleaving with proper offset calculations.
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js (2)
130-149: LGTM. Input validation is thorough: checks for empty inputs and length mismatches between `token_ids`/`token_timestamps`/`token_confidences`, and gracefully returns empty results for an invalid tokenizer or empty arrays. Error messages include actual lengths for debugging.
160-202: LGTM. The main loop correctly:
- Resolves token pieces with boundary detection
- Aligns decoded text against `fullText` for accurate character positions
- Aggregates tokens into words based on boundary markers and whitespace
- Handles optional confidence values with proper null checks and finite validation
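The boundary-marker aggregation described above can be sketched as follows (illustrative names; the reviewed code additionally aligns character positions against `fullText`, which is omitted here):

```javascript
// Sketch: group sentencepiece-style tokens into words on the "▁" boundary
// marker, averaging per-token confidences with finite-value checks.
function groupTokensIntoWords(tokens, confidences = null) {
  const words = [];
  tokens.forEach((tok, i) => {
    const isBoundary = tok.startsWith('\u2581'); // "▁" word-start marker
    const piece = isBoundary ? tok.slice(1) : tok;
    if (isBoundary || words.length === 0) {
      words.push({ text: piece, confs: [] });
    } else {
      words[words.length - 1].text += piece;
    }
    const c = confidences?.[i];
    if (Number.isFinite(c)) words[words.length - 1].confs.push(c);
  });
  return words.map(({ text, confs }) => ({
    text,
    confidence: confs.length ? confs.reduce((a, b) => a + b, 0) / confs.length : null,
  }));
}
```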
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js (2)
9-13: LGTM. Defensive handling: returns an empty string for invalid/empty `token_ids`, falls back to space-joined tokens if the tokenizer is unavailable.
28-31: LGTM. Clean composition that decodes the full text first, then delegates to `buildTransducerWordOffsets` for structured output generation.
packages/transformers/src/utils/model_registry/get_model_files.js (1)
180-182: LGTM. The NemoConformerTDT branch correctly loads `encoder_model` and `decoder_model_merged` files, matching the session keys expected by `NemoConformerForTDT.from_pretrained()`. No `generation_config.json` is needed since TDT uses greedy transducer decoding rather than autoregressive generation.
packages/transformers/src/models/models.js (1)
106-106: LGTM. Export follows alphabetical ordering and correctly surfaces the Nemo Conformer TDT modeling module.
packages/transformers/src/models/feature_extractors.js (1)
8-8: LGTM. Export follows alphabetical ordering and correctly surfaces the Nemo Conformer TDT feature extractor module.
packages/transformers/src/models/modeling_utils.js (2)
121-121: LGTM. `NemoConformerTDT: 16` is added to `MODEL_TYPES`. No corresponding entry in `MODEL_TYPE_CONFIG` is required since `NemoConformerForTDT` overrides `from_pretrained` and bypasses the generic model-type branching logic.
880-882: LGTM. Error message formatting simplified without functional change.
packages/transformers/src/models/registry.js (1)
44-44: Registry wiring is correct. Dual registration in `MODEL_MAPPING_NAMES_ENCODER_ONLY` (for AutoModel lookup) and `CUSTOM_MAPPING` (for the runtime model-type override) matches the documented intent. The approach correctly forces `MODEL_TYPES.NemoConformerTDT` for the two-artifact loading path while preserving standard lookup.
Also applies to: 584-586
packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js (1)
8-18: Processor design is appropriate for the ASR workflow. The `_call` method correctly delegates to `feature_extractor` only. While `tokenizer_class` is declared, ASR tokenization occurs in `model.transcribe()`, not in the processor; the pipeline runner passes the tokenizer separately. This aligns with the ASR contract.
packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js (1)
1-227: Test coverage is comprehensive. Tests validate mel bin configurations, delta feature modes, disposal semantics, cache behavior, and constructor validation. Proper cleanup with `dispose()` and `clear_cache()` prevents resource leaks across tests.
packages/transformers/src/pipelines/automatic-speech-recognition.js (2)
158-159: Nemo Conformer TDT dispatch is correctly implemented. The switch case routes to `_call_nemo_conformer_tdt`, which delegates to `runNemoConformerTDTPipeline` with the correct options object structure. Per the established API contract, advanced decode controls are intentionally exposed only through direct `model.transcribe()` calls.
Also applies to: 315-324
341-341: Trivial formatting change. Type-cast spacing adjusted from `/**@type{Tensor} */ (outputs)` to `/**@type{Tensor} */(outputs)`. Non-functional.
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js (3)
22-28: O(n) audio validation is justified for correctness. While iterating all samples is expensive for long audio, catching non-finite values (NaN, Infinity) before inference prevents silent corruption or cryptic downstream errors. This is acceptable as a safety guardrail.
133-145: Pipeline returns task-shaped output per API contract. Lines 133 and 154 set `return_metrics: false`, and results are rebuilt as `{ text, chunks? }`. Per the established learning, this is intentional: the pipeline surface aligns with the shared ASR task contract, and rich native outputs are available through direct `model.transcribe()`. This is not a defect.
Also applies to: 150-163
90-105: Tensor disposal logic is correct. The `Set`-based deduplication prevents double-dispose when the same Tensor instance appears in multiple input fields. The finally block ensures cleanup even on transcription failure.
packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js (3)
129-180: Mock construction is well-designed. The `makeUnitPipe` helper correctly mocks the model, processor, and tokenizer interfaces required by the pipeline. The `calls` array enables assertion of options passed to `transcribe()`. This pattern cleanly isolates unit tests from real model loading.
229-315: Window merge test validates deduplication and offset handling. The test confirms overlapping windows produce correct merged output with unique words and proper timestamps. The `timeOffset` assertions (lines 306-313) verify windowing parameters are passed correctly.
482-604: Disposal semantics tests are thorough. Three scenarios are covered: no cache (dispose), cache owns tensors (no dispose), and zero-limit cache (dispose). The monkey-patching approach with `trackDispose` correctly verifies disposal behavior without modifying production code.
packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js (4)
10-44: MockNemoConformerForTDT correctly scripts decoder behavior. The mock class overrides `_runEncoder` and `_runDecoder` to return controlled tensors, enabling deterministic testing of the greedy decode loop. The `decoderScript` array provides step-by-step logits for validation.
648-712: Word boundary reconstruction test validates punctuation handling. The test confirms `buildTransducerDetailedOutputs` correctly groups tokens into words based on sentencepiece markers (▁) and preserves punctuation/numeric boundaries. Critical for ASR accuracy.
826-910: LRU cache tests validate eviction, disposal, and shared-reference handling. Tests confirm LRU eviction order, single disposal for replaced entries, and correct handling of shared tensor references across fields. The `finally` blocks ensure cleanup even on assertion failure.
791-810: Remove unused `local_files_only` parameter from test call. The `get_model_files` function does not accept `local_files_only` in its parameter signature and does not forward it to `AutoConfig.from_pretrained`; the parameter is silently ignored. The test works correctly due to the pre-loaded `config` object, which short-circuits any file loading, not because of `local_files_only`. Remove this unused parameter to avoid misleading future readers about the function's capabilities.
> Likely an incorrect or invalid review comment.
packages/transformers/src/pipelines/index.js (1)
33-33: Integration is correct. `NemoConformerForTDT` correctly exports a static `supports()` method at line 246 that returns `model_type === 'nemo-conformer-tdt'`. The import at line 33 and task registration at line 154 follow the established pattern and are properly integrated.
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js (3)
20-44: Fail fast on `hop_length`, `sampling_rate`, and `feature_size` validation. The constructor validates `n_fft` and `win_length` but omits validation for `hop_length`, `sampling_rate`, and `feature_size`, all of which are used in `_extract` (line 163) and `_extract_fbank_features` (lines 109, 40-41). A malformed `preprocessor_config.json` with invalid values (zero, negative, or non-numeric) will produce incorrect frame counts, NaN mel filters, or division errors instead of a clear configuration error.
140-156: Cache key and ownership semantics are correctly implemented. The cache key at line 144 incorporates the audio hash, sampling rate, and delta configuration, ensuring distinct entries for different extraction parameters. The spread return (`{ ...cached }`) at lines 147/152 correctly returns a shallow copy while the cache retains ownership of the tensor instances. This aligns with the documented contract in the JSDoc (lines 137-138).
158-231: Extraction pipeline and tensor lifecycle are correctly managed. The normalization logic (lines 175-199) correctly handles the divide-by-zero case for empty/short audio by checking `features_length > 0`. The delta computation path (lines 212-228) properly disposes the original `input_features` tensor when replaced by the concatenated delta result. The attention mask construction (lines 201-205) correctly uses `BigInt64Array` for the int64 tensor dtype.
packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js (1)
1-8: Barrel file correctly consolidates Nemo Conformer TDT utilities. Re-exports are properly structured for backwards compatibility. No circular-dependency concerns since `transducer_segment_offsets.js` has no imports from this barrel.
packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js (3)
11-57: Window specification logic is correctly guarded against infinite loops. The validation at lines 29-31 ensures `chunk_length_s > 2 * stride_length_s`, which guarantees `jump = window - 2 * stride > 0`. Combined with the `is_last` break condition at line 52, the while loop is guaranteed to terminate. The defensive NaN checks at lines 12 and 26 are appropriate.
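That termination argument can be made concrete with a small sketch (field and function names here are illustrative, not the reviewed API):

```javascript
// Sketch: overlapping window specs for long-audio transcription. Because
// chunk > 2 * stride is enforced, jump = chunk - 2 * stride is strictly
// positive, so `start` advances every iteration and the loop terminates.
function buildWindowSpecs(totalSeconds, chunk, stride) {
  if (!(chunk > 2 * stride)) {
    throw new Error('chunk_length_s must exceed 2 * stride_length_s');
  }
  const jump = chunk - 2 * stride;
  const specs = [];
  let start = 0;
  for (;;) {
    const end = Math.min(start + chunk, totalSeconds);
    const is_last = end >= totalSeconds;
    specs.push({ start, end, is_last });
    if (is_last) break; // reached the end of the audio
    start += jump;
  }
  return specs;
}
```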
70-113: Deduplication logic correctly handles window boundary overlaps. Both `dedupeMergedWords` and `dedupeMergedTokens` use overlap detection (`word.start_time < prev.end_time` / `token.start_time < prev.end_time`) combined with identity checks (`text` / `id` + `raw_token`) to detect duplicates at window boundaries. Keeping the longer-duration item is a reasonable heuristic for windowed ASR merging.
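A minimal sketch of that overlap-plus-identity heuristic (illustrative names; the reviewed code applies the same idea to tokens using `id` + `raw_token` as the identity):

```javascript
// Sketch: drop duplicate words at window boundaries. A word is treated as a
// duplicate of its predecessor when it overlaps in time and has identical
// text; the longer-duration instance is kept.
function dedupeWords(words) {
  const out = [];
  for (const word of words) {
    const prev = out[out.length - 1];
    const overlaps = prev && word.start_time < prev.end_time;
    if (overlaps && word.text === prev.text) {
      // Keep whichever instance spans longer.
      if (word.end_time - word.start_time > prev.end_time - prev.start_time) {
        out[out.length - 1] = word;
      }
      continue;
    }
    out.push(word);
  }
  return out;
}
```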
120-178: Merged result construction handles edge cases appropriately. The text reconstruction at lines 165-170 correctly falls back from words to tokens to an empty string. The `utterance_timestamp` derivation (lines 171-176) gracefully handles empty results. The visibility filtering via `shouldKeepTimedItem` correctly uses midpoint-based inclusion to avoid double-counting items at stride boundaries.
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (2)
9-25: Cache key generation correctly incorporates all audio samples. The FNV-1a hash iterates over all samples with 16-bit quantization, a deliberate design choice to minimize false cache hits. The key format `${sampling_rate}:${audio.length}:${hash}` provides good collision resistance.
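A sketch of such a key function (the exact quantization and hash constants in the reviewed code may differ; this only mirrors the described FNV-1a-over-all-samples design):

```javascript
// Sketch: FNV-1a style cache key over the full audio buffer, quantizing
// each float sample to 16 bits and hashing both bytes.
function audioCacheKey(audio, samplingRate) {
  let hash = 0x811c9dc5; // 32-bit FNV offset basis
  for (let i = 0; i < audio.length; ++i) {
    // Quantize the [-1, 1] float sample to a 16-bit integer.
    const q = Math.max(-32768, Math.min(32767, Math.round(audio[i] * 32767))) & 0xffff;
    hash ^= q & 0xff;
    hash = Math.imul(hash, 0x01000193); // FNV prime
    hash ^= q >>> 8;
    hash = Math.imul(hash, 0x01000193);
  }
  return `${samplingRate}:${audio.length}:${hash >>> 0}`;
}
```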
65-107: Cache `set()` correctly handles edge cases and ownership transfer. The zero-cache-mode check (lines 67-72), oversized-item handling (lines 88-96), and existing-entry replacement (lines 98-102) are all correctly implemented. The size tracking and eviction trigger at line 106 maintain cache invariants.
packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js (2)
8-22: Word joining correctly handles punctuation spacing. The regex at line 15 correctly identifies punctuation-only tokens that should attach without a leading space. The accumulation logic handles the empty-text edge case at line 12.
49-87: Segment chunking logic handles break conditions and edge cases correctly. The break conditions at lines 65-67 correctly combine sentence-ending punctuation detection with an inter-word gap threshold. The gap calculation at line 64 uses `Math.max(0, ...)` to handle overlapping words. Empty-input handling at lines 50-55 provides reasonable fallback behavior.
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js (5)
85-235: Transducer configuration validation is thorough and defensive. The `resolveTransducerConfig` function validates:
- Decoder layer/hidden size (lines 97-104)
- I/O name uniqueness (lines 110-129)
- Session input/output availability (lines 131-171)
- Runtime parameters with appropriate type checks and bounds (lines 173-215)
The error messages are actionable and reference the config path for remediation.
254-317: Session loading with proper error wrapping and model-type validation. The `from_pretrained` implementation correctly validates the model type before session construction, rejects unsupported `model_file_name` overrides, and wraps `constructSessions` errors with helpful context about the expected file structure.
640-823: Decoder loop tensor lifecycle is correctly managed with no infinite loop risk. The try-finally block (lines 640-823) ensures `targetLengthTensor`, `decoderState`, and `encoderOutputs` are disposed even on error paths. The frame advancement logic guarantees termination:

- `step > 0`: advances by `step`
- `step == 0` with non-blank: increments `emittedOnFrame` until `maxSymbolsPerStep`
- Blank or max symbols reached: advances by 1

The decoder output disposal at lines 684-692 correctly handles aliased tensor references via `Set`.
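The advancement rules above can be sketched as a loop whose every branch either advances the frame index or increments a bounded per-frame counter, so it must terminate. This is an illustrative model with a stubbed decoder, not the real loop; the names `step`, `emittedOnFrame`, and `maxSymbolsPerStep` follow the review.

```javascript
// Sketch of the frame-advancement rules summarized above. Every branch either
// advances frameIndex or increments a bounded per-frame counter, so the loop
// terminates. The decoder is stubbed by cycling through `steps`.
function countDecodeSteps(frameCount, steps, maxSymbolsPerStep) {
  let frameIndex = 0;
  let emittedOnFrame = 0;
  let iterations = 0;
  let i = 0;
  while (frameIndex < frameCount) {
    iterations += 1;
    const { step, blank } = steps[i++ % steps.length]; // stubbed decoder output
    if (step > 0) {
      frameIndex += step; // TDT duration head: jump ahead
      emittedOnFrame = 0;
    } else if (!blank && emittedOnFrame < maxSymbolsPerStep - 1) {
      emittedOnFrame += 1; // emit another symbol on the same frame
    } else {
      frameIndex += 1; // blank or symbol budget exhausted: advance by one
      emittedOnFrame = 0;
    }
  }
  return iterations;
}
```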
684-692: Decoder output tensor disposal handles aliased outputs correctly. Using a `Set` to track seen tensors before disposal prevents double-free when the decoder session returns aliased output references. The exclusion of `logits`, `outputState1`, and `outputState2` ensures they remain available for subsequent processing.
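The `Set`-based pattern can be shown in isolation. A minimal sketch, assuming outputs are a name-to-tensor record with a `dispose()` method; the real session output type differs.

```javascript
// Sketch of double-free-safe disposal for possibly-aliased session outputs,
// mirroring the Set-based approach described above. The output shape and
// dispose() method are assumptions.
function disposeOutputs(outputs, keep) {
  const seen = new Set();
  for (const [name, tensor] of Object.entries(outputs)) {
    if (keep.has(name) || seen.has(tensor)) continue; // skip kept tensors and aliases
    seen.add(tensor);
    tensor.dispose();
  }
  return seen.size; // number of unique tensors disposed
}
```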
924-931: Registry mappings correctly register both model type and architecture keys. Registering both `'nemo-conformer-tdt'` (model_type) and `'NemoConformerForTDT'` (architecture) ensures proper resolution via `MODEL_TYPE_MAPPING`. The bidirectional class-name mappings enable dynamic instantiation.
```js
const featureCache = /** @type {{ max_entries: number, max_size_mb: number }|null|undefined} */ (
  /** @type {any} */ (processor.feature_extractor)?.feature_cache
);
const cacheOwnsTensors = !!(
  featureCache &&
  featureCache.max_entries > 0 &&
  featureCache.max_size_mb > 0
);
```
🧹 Nitpick | 🔵 Trivial
Cache ownership detection may produce false negatives.
The check `featureCache.max_entries > 0 && featureCache.max_size_mb > 0` assumes both limits must be positive for caching to be active. If the feature extractor uses a different cache structure or only one limit, this could incorrectly trigger disposal of cache-owned tensors.
Consider defensive access:
Proposed fix

```diff
 const featureCache = /** @type {{ max_entries: number, max_size_mb: number }|null|undefined} */ (
   /** @type {any} */ (processor.feature_extractor)?.feature_cache
 );
-const cacheOwnsTensors = !!(
-  featureCache &&
-  featureCache.max_entries > 0 &&
-  featureCache.max_size_mb > 0
-);
+const cacheOwnsTensors = !!(
+  featureCache &&
+  (featureCache.max_entries ?? 0) > 0 &&
+  (featureCache.max_size_mb ?? 0) > 0
+);
```

```js
function estimateSizeBytes(value) {
  if (value instanceof Tensor) {
    return tensorByteSize(value);
  }
  if (value?.input_features instanceof Tensor) {
    let bytes = tensorByteSize(value.input_features);
    if (value.attention_mask instanceof Tensor) {
      bytes += tensorByteSize(value.attention_mask);
    }
    if (value.delta_features instanceof Tensor) {
      bytes += tensorByteSize(value.delta_features);
    }
    if (value.delta_delta_features instanceof Tensor) {
      bytes += tensorByteSize(value.delta_delta_features);
    }
    return bytes;
  }
  const byteLength = value?.byteLength;
  if (typeof byteLength === 'number' && Number.isFinite(byteLength) && byteLength >= 0) {
    return byteLength;
  }
  return 0;
}
```
🧹 Nitpick | 🔵 Trivial
Document that `estimateSizeBytes` returns 0 for unknown types.
The function returns 0 for values that are neither Tensors, nor objects with recognized tensor properties, nor carriers of a `byteLength` property. While the current usage with feature extraction outputs is safe, this could cause unbounded cache growth if misused with arbitrary objects. Consider adding a JSDoc comment clarifying the expected input types.
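One way to make the contract explicit is a strict variant that refuses unknown values instead of under-counting them. A sketch only; `isTensor`/`byteSize` stand in for the real `Tensor` checks and are assumptions, as is the feature-object shape.

```javascript
// Sketch of a strict size estimator that throws on unrecognized values
// instead of silently returning 0 (which could let the cache grow unbounded).
// isTensor/byteSize are injected stand-ins for the real Tensor utilities.
function estimateSizeBytesStrict(value, { isTensor, byteSize }) {
  if (isTensor(value)) return byteSize(value);
  if (value && isTensor(value.input_features)) {
    let bytes = byteSize(value.input_features);
    for (const key of ["attention_mask", "delta_features", "delta_delta_features"]) {
      if (isTensor(value[key])) bytes += byteSize(value[key]);
    }
    return bytes;
  }
  const len = value?.byteLength;
  if (typeof len === "number" && Number.isFinite(len) && len >= 0) {
    return len;
  }
  throw new TypeError("estimateSizeBytesStrict: unsupported cache value type");
}
```

The design trade-off: a throwing variant surfaces misuse immediately, at the cost of requiring callers to know the accepted shapes up front.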
Documentation suggestion

```diff
+/**
+ * Estimate byte size of a cached value.
+ * Returns 0 for unrecognized types - callers must ensure only Tensor or
+ * Tensor-containing feature objects are cached.
+ * @param {any} value
+ * @returns {number}
+ */
 function estimateSizeBytes(value) {
```

```js
it(
  "disposes replaced base features when concatenated delta output is used",
  async () => {
    const extractor = new NemoConformerTDTFeatureExtractor({
      ...base,
      feature_size: 80,
      delta_order: 1,
      delta_window: 2,
      delta_concatenate: true,
    });

    const originalDispose = Tensor.prototype.dispose;
    let disposeCalls = 0;
    Tensor.prototype.dispose = function () {
      disposeCalls += 1;
      return originalDispose.call(this);
    };

    let input_features;
    try {
      ({ input_features } = await extractor(audio));
      expect(input_features.dims[2]).toBe(80 * 2);
    } finally {
      Tensor.prototype.dispose = originalDispose;
      input_features?.dispose();
    }

    // One dispose from computeTemporalDeltas intermediate tensor, one from replacing base features tensor.
    expect(disposeCalls).toBe(2);
  },
  MAX_TEST_EXECUTION_TIME,
);
```
🧹 Nitpick | 🔵 Trivial
Disposal count assertion may be fragile.
The test asserts exactly 2 dispose calls. If the implementation changes (e.g., adds intermediate tensors), this will break. Consider asserting `disposeCalls >= 2` or documenting the expected disposal sources explicitly in the test comment.
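A lower-bound check keeps the intent while tolerating later intermediates. A sketch of the relaxed assertion as a plain helper (the Jest test itself would use `expect(disposeCalls).toBeGreaterThanOrEqual(2)`):

```javascript
// Sketch: assert a lower bound on dispose calls rather than an exact count,
// so adding intermediate tensors later does not break the test.
function checkDisposeCalls(disposeCalls) {
  // At least: one for the computeTemporalDeltas intermediate, one for the
  // replaced base features tensor (per the test comment above).
  if (!(disposeCalls >= 2)) {
    throw new Error(`expected >= 2 dispose calls, got ${disposeCalls}`);
  }
  return true;
}
```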
```js
it("maps NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT", () => {
  expect(MODEL_TYPE_MAPPING.get("NemoConformerForTDT")).toBe(MODEL_TYPES.NemoConformerTDT);
  expect(MODEL_TYPE_MAPPING.get("nemo-conformer-tdt")).toBe(MODEL_TYPES.NemoConformerTDT);
});
```
Assertion on line 81 will fail; `MODEL_TYPE_MAPPING` is keyed by class name, not model_type.
`MODEL_TYPE_MAPPING` is populated with class names as keys (e.g., `"NemoConformerForTDT"`), not model_type strings (e.g., `"nemo-conformer-tdt"`). Line 81 will return `undefined`.
Proposed fix

```diff
 it("maps NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT", () => {
   expect(MODEL_TYPE_MAPPING.get("NemoConformerForTDT")).toBe(MODEL_TYPES.NemoConformerTDT);
-  expect(MODEL_TYPE_MAPPING.get("nemo-conformer-tdt")).toBe(MODEL_TYPES.NemoConformerTDT);
 });
```

Alternatively, if testing model_type → class resolution is intended, use the appropriate mapping (`MODEL_NAME_TO_CLASS_MAPPING` or similar).
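The failure mode is easy to reproduce with a plain `Map` keyed only by class names. A sketch; the real `MODEL_TYPE_MAPPING` population logic may differ.

```javascript
// Sketch: a registry keyed by class names only. Looking it up with a
// model_type string returns undefined, which is why the second assertion
// in the test above fails.
const MODEL_TYPES = { NemoConformerTDT: 42 }; // placeholder enum value
const mapping = new Map([["NemoConformerForTDT", MODEL_TYPES.NemoConformerTDT]]);

const byClassName = mapping.get("NemoConformerForTDT"); // resolves
const byModelType = mapping.get("nemo-conformer-tdt"); // undefined
```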
Superseded by #14. The replacement PR includes the latest sentence-based pipeline and long-audio fixes on top of the original NeMo Conformer TDT work.
Summary
Supersedes closed PR #12 with the latest follow-up work on the same main-based branch line.

This PR adds end-to-end Conformer TDT ASR support to `transformers.js` (browser + Node.js), including model execution, feature extraction, decoding/reconstruction, pipeline wiring, registry integration, and regression coverage. It also includes the recent pipeline-alignment work so Conformer TDT behaves more like the shared `automatic-speech-recognition` task contract while preserving the richer direct `model.transcribe()` API.

What Is Included
1. Conformer TDT Model + Decoder
- `model.transcribe()` output contract for text, timestamps, confidences, optional words/tokens, and optional metrics/debug payloads.

2. Feature Extraction
3. ASR Pipeline Integration
- `AutomaticSpeechRecognitionPipeline` dispatch.
- Default output: `{ text }`
- `return_timestamps: true`: `{ text, chunks }` with segment-level timestamps
- `return_timestamps: 'word'`: `{ text, chunks }` with word-level timestamps
- Richer output remains available via direct `model.transcribe()`.

4. Long-Audio Handling
- Chunked decoding driven by `chunk_length_s` / `stride_length_s`.
- Per-window timestamps re-based with `timeOffset`.

5. Word Reconstruction / Detokenization Improvements
- Word-level confidence aggregated into a per-word `score`.
- Word boundaries preserved for numeric and punctuation tokens such as `48-year-old`, `0.5`, `March 20th, 2021`.

6. Registry + Model File Resolution
- Transducer session file resolution (e.g., `encoder_model`, `decoder_model_merged`).

7. Regression Coverage
Added or updated tests in:
- `packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js`
- `packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js`
- `packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js`

Coverage now includes:
Upstream Sync Included
This branch was synced with
`upstream/main` through commit `f65a4c7c` (merge commit `49a4af8f`).

Included upstream commits:
- `2120d13e` [deno] Support both wgpu and dawn webgpu backends (#1546)
- `a289e5c3` Add support for new Qwen VL models (#1551)
- `b5b51ca9` [version] Update to 4.0.0-next.5
- `2a210d3e` [deno via CDN] Fix simultaneous multi-session loading (#1556)
- `e60a6ee3` Use ModelRegistry for pipeline file loading (#1555)
- `4331d723` Support PKV cached generation for Qwen-VL models (#1557)
- `cd155a05` fix: prevent partial file reads during concurrent downloads (#1548)
- `30773fb7` Fix WASM factory blob URL loading (#1558)
- `f65a4c7c` feat: add fast boolean is_cached / is_pipeline_cached (#1559)

Conformer TDT adaptation after upstream sync:
- `ee819a1c` fix(nemo-tdt): add supports() for ASR model class selection
- `8dfccddc` feat(nemo-tdt): align asr pipeline outputs and long-audio handling

Validation
Executed during this refresh:
- `pnpm --filter @huggingface/transformers build`
- `node --experimental-vm-modules .\node_modules\jest\bin\jest.js tests/models.test.js -t "keeps word boundaries from the final decoded text for numeric and punctuation tokens|Nemo Conformer TDT" --runInBand`
- `node --experimental-vm-modules .\node_modules\jest\bin\jest.js tests/pipelines.test.js -t "nemo-conformer-tdt \(unit\)" --runInBand`

Scope Boundary
This PR keeps focus on Conformer TDT integration and the follow-up work needed to:
NeMo is still relevant as the source architecture/provenance for this implementation, but the user-facing scope of the PR is Conformer TDT support in
`transformers.js`.

Architectural Review: Conformer TDT ASR Integration
Risk Summary: MEDIUM-HIGH
This PR introduces substantial new functionality (Conformer TDT end-to-end ASR, feature extraction with caching, windowed decoding, word-level timestamp reconstruction) with several correctness and reliability issues that require fixes. The design is generally sound but has critical gaps in validation, error recovery, and edge case handling.
CRITICAL FINDINGS
[High] transducer_word_offsets.js - Silent text-token alignment failures without validation
`consumeAlignedTokenText()` skips whitespace at the cursor, then returns the consumed segment. However, there is no validation that the consumed text from `fullText` actually matches the expected `tokenText`. If alignment fails (e.g., the tokenizer produces a token that doesn't exist at that position in the decoded text), the function silently continues with misaligned boundaries. Callers in `buildTransducerWordOffsets()` invoke `consumeAlignedTokenText(fullText, textCursor, clean)` but never validate that `aligned.text === clean` or that alignment succeeded. If `fullText` diverges from the token sequence, words get wrong timestamps. Fix: in `consumeAlignedTokenText()`, after consuming text from `fullText`, assert that the consumed segment matches `tokenText` (accounting for whitespace normalization). Throw a descriptive error with context (`fullText` slice, `tokenText`, cursor position) if a mismatch is detected. Add debug logging for alignment skips.

[High] modeling_nemo_conformer_tdt.js - Frame index bounds check missing in decoder loop
`_getFrameData(encoderOutput, frameIndex, reusableFrame)` is called without bounds verification. The loop increments `frameIndex` dynamically based on duration steps, but if the duration calculation produces invalid values (e.g., step > remaining frames), the code reads past encoder output bounds. The code does compute `frameCount` correctly but never asserts `frameIndex < frameCount` before accessing frame data. The call `const frameData = this._getFrameData(encoderOutput, frameIndex, reusableFrame);` has no prior bounds check. In `_getFrameData()`, for the 'BDT' layout it computes `frame[d] = data[d * T + frameIndex]` without clamping `frameIndex`. For the 'BTD' layout it uses `data.subarray(offset, offset + D)` where `offset = frameIndex * D`, which can exceed array bounds if `frameIndex >= T`. Fix: (1) in the decoder loop, assert `frameIndex < frameCount`, throwing an error if violated; (2) in `_getFrameData()`, clamp `frameIndex` to `Math.min(frameIndex, T - 1)` at minimum, or throw an error if `frameIndex >= T`; (3) add a runtime assertion that `frameIndex` is a non-negative integer.

[High] transducer_window_merge.js - Deduplication assumes strict equality, creates duplicates
`dedupeMergedTokens()` deduplicates by checking whether tokens have overlapping timestamps: `startTime < otherEndTime && endTime > otherStartTime`. However, if two tokens are identical (same ID and text) but have slightly different timestamps (e.g., [1.0-1.1] vs [1.05-1.15] due to floating-point rounding or frame boundary variance), they are NOT deduplicated because the function only looks for exact timestamp overlap, not timestamp proximity. This causes duplicate words in the output. The checks in `dedupeMergedTokens()` compare only `token.id === prev.id && token.raw_token === prev.raw_token` plus time overlap; there is no tolerance for near-identical timestamps. If two windows decode the same word with slightly offset times (common in overlapping window scenarios), both instances survive.

[Medium] feature_extraction_nemo_conformer_tdt.js - Per-frame normalization doesn't guard against zero variance
Normalization computes `(frame - mean) / (variance + epsilon)`. However, there is no epsilon guard before the division. If all frames are identical (variance = 0), or if the input is very short (a single frame), the denominator can be zero or near-zero, producing Infinity or NaN. The code checks `if (n_frames > 0)`, but there is no check for `variance > epsilon` before the scaling division. For very short audio (< 10 frames), the variance is likely to be small. Fix: use `const scaled = n_frames > 1 ? (frame - mean) / Math.max(variance, 1e-8) : frame`, and log a warning for very short inputs (< 5 frames).

[Medium] transducer_window_merge.js - Window specifications insufficient bounds validation
`buildNemoWindowSpecs()` enforces `chunk_length_s > 2 * stride_length_s` but does not validate that parameters are positive, reasonable, or that the resulting windows actually cover the full audio. If `chunk_length_s = 1.0` and the audio duration is 180s, only ~180 windows are created; if the stride is too large, gaps are possible at the end. The code checks `if (chunk_length_s <= 0)` but does not validate that the final `chunk_length_s` is within safe bounds (e.g., 0.1-300s). There is no post-generation check that `windows[-1].end >= audio.length`.

[Medium] modeling_nemo_conformer_tdt.js - Timestamp validation insufficient downstream
The code validates `timeOffset` upfront: `if (!Number.isFinite(timeOffset))`. However, downstream timestamp arithmetic does not re-validate intermediate results. If encoder outputs contain NaN (due to upstream bugs), or if log-probabilities compute -Infinity, downstream `Math.max()`, `Math.min()`, or arithmetic can silently propagate NaN. Example: `tokenTimestamps.push([roundTs(frameIndex * frameTime + timeOffset), roundTs(endFrame * frameTime + timeOffset)])`. If `frameTime` or `frameIndex` are NaN, the result is NaN, and there is no assertion after the arithmetic. Fix: validate `isFinite(result)`, throwing a descriptive error if not, and add debug logging for NaN/Infinity detection with context.

[Medium] transducer_deltas.js - Delta order validation incomplete
`computeTemporalDeltas()` accepts an `order` parameter (1 or 2). The code throws if order is not 1 or 2, which is good. However, edge cases are not handled: (1) if `input_features` has fewer frames than `window` (e.g., window=3 but the input has 2 frames), delta computation produces incorrect or empty results; (2) concatenation happens without checking whether the output tensor dimensions are sensible. The loop `for (let i = window; i < T; ++i)` never runs if `T < window` and returns an empty delta tensor; concatenation proceeds anyway, resulting in mismatched dimensions.

MEDIUM-SEVERITY FINDINGS (Design & Maintainability)
[Medium] Cache ownership semantics unclear and error-prone
The code uses `cacheOwnsTensors = !!(featureCache && featureCache.max_entries > 0 && featureCache.max_size_mb > 0)` to decide whether to dispose tensors. However, if the cache configuration changes at runtime, or if eviction happens mid-transcription, tensor ownership boundaries become unclear. Also, if cache eviction occurs while a tensor is being used by the model, use-after-free is possible. In `pipeline_nemo_conformer_tdt.js`, tensors are disposed only if `!cacheOwnsTensors`, but `FeatureLRUCache` can evict entries at any time via `_evict()`, potentially disposing tensors that are still referenced.

[Medium] Window merging lacks comprehensive end-to-end test coverage
[Low] Insufficient resource cleanup in exception paths
The `transcribe()` method has a single large try-catch wrapping the entire decoding loop. If an exception occurs mid-loop (e.g., during a decoder session run), tensors allocated in previous iterations (`frameData`, `frameTensor`, `tokenTensor`, etc.) are disposed in the finally block, which is correct. Likewise, if an exception occurs during the encoder run, `encoderOutputs` and intermediate tensors are cleaned up. There are no obvious leaks in the explicit try-finally paths, but implicit assumptions about RAII could fail if new code paths are added.

[Low] Error messages lack context and observability
INTEGRATION FINDINGS
[Medium] Pipeline assumes processor and tokenizer availability without upfront validation
`runNemoConformerTDTPipeline()` checks for missing components mid-execution: `if (!model.transcribe) throw ...`, `if (!processor) throw ...`, `if (!tokenizer) throw ...`. These checks occur after pipeline initialization, meaning errors surface late. In `pipeline_nemo_conformer_tdt.js`, validation happens inside the async function, not at pipeline construction. Fix: validate in `_call_nemo_conformer_tdt()` before calling the pipeline, or in the pipeline constructor.

[Low] Test coverage for error paths and edge cases
POSITIVE FINDINGS (No Issues)
`Math.imul(hash, 16777619)` properly applies the FNV prime in 32-bit space. Hash collision risk is low for typical audio cache use.

MERGE DECISION: NEEDS CHANGES
This PR introduces several critical correctness issues:
These are not style or performance issues—they are bugs with real production impact. Most fixes are 2-5 lines (bounds check, epsilon guard, dedup tolerance).
Blockers (must fix before merge):
Non-blockers (follow-up issues acceptable, but should be tracked):
Do not merge without addressing the HIGH-severity frame indexing and text alignment issues. These are correctness bugs that will cause silent data corruption.
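For scale, the "2-5 line" guards referenced above might look like the following. All names are assumptions drawn from the findings; this is a sketch of the shape of the fixes, not the actual patch.

```javascript
// Sketches of the three small guards called out above (bounds check, epsilon
// guard, dedup tolerance). All names are assumptions based on the findings.

// 1. Frame index bounds check before reading encoder output.
function assertFrameInBounds(frameIndex, frameCount) {
  if (!Number.isInteger(frameIndex) || frameIndex < 0 || frameIndex >= frameCount) {
    throw new RangeError(`frameIndex ${frameIndex} out of bounds [0, ${frameCount})`);
  }
}

// 2. Epsilon guard so zero-variance frames cannot produce Infinity/NaN.
function scaleValue(value, mean, variance, epsilon = 1e-8) {
  return (value - mean) / Math.max(variance, epsilon);
}

// 3. Timestamp tolerance so near-identical tokens from overlapping windows
//    are treated as duplicates.
function isNearDuplicate(a, b, toleranceS = 0.08) {
  return a.id === b.id && a.raw_token === b.raw_token && Math.abs(a.start - b.start) <= toleranceS;
}
```

The tolerance value here is illustrative; in practice it would be derived from the encoder's frame duration.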