feat: add Nemo Conformer TDT support (sentence-based pipeline refresh) #14
Conversation
…cache helpers

Carry over non-runtime typing fixes from the prior branch while intentionally excluding the WebGPU disable_prepacking workaround in session.js.

- Cast dynamic model.transcribe access for Nemo TDT pipeline method checks/calls.
- Cast Tensor data byteLength access in transducer cache utilities.
- Add explicit tuple/object JSDoc annotations in transducer timestamp builder.

This keeps main-based v4 work clean with latest ORT-Web on origin/main and avoids retaining the temporary encoder prepacking workaround.
- Replace legacy per-feature flags (return_token_timestamps, return_word_timestamps, return_utterance_timestamp) with a layered API: return_timestamps (utterance-level), return_words, return_tokens
- Merge duplicate outputs: words absorbs word_timestamps, tokens absorbs token_timestamps and token_ids
- Add per-token confidence, word-level confidence aggregation, utterance_confidence, and confidence_scores summary
- Gate frame confidences behind returnFrameConfidences flag
- Add return_metrics with encode/decode/total timing and RTF
- Add debug flags: returnFrameIndices, returnLogProbs, returnTdtSteps
- Fix vocab Map handling in getIdToTokenMap and _resolveVocabSize (tokenizer.get_vocab() returns Map in WASM binding)
- Update ASR pipeline to wire timestamp_granularity to new model flags
- Format all changed files with Prettier per CONTRIBUTING.md
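For orientation, this is roughly how the reworked surface is exercised from the pipeline API. The model id below is a placeholder, the import path assumes the upstream @huggingface/transformers package name, and the exact checkpoint layout is not part of this commit; per the review learnings later in this thread, the finer-grained flags (return_words, return_tokens, return_metrics) are meant for direct model.transcribe() calls rather than pipeline kwargs.

```js
// Hypothetical usage sketch; the model id is a placeholder, not a real repository.
import { pipeline } from "@huggingface/transformers";

const audio = new Float32Array(16000); // 1 s of silence at 16 kHz, stand-in for real samples
const asr = await pipeline("automatic-speech-recognition", "my-org/nemo-conformer-tdt-onnx");

// Utterance/sentence-level timestamps: { text, chunks }
const sentences = await asr(audio, { return_timestamps: true });

// Word-level timestamps: { text, chunks } with per-word entries
const words = await asr(audio, { return_timestamps: "word" });
```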
…ipeline

- Add roundTs() for millisecond-precision timestamp rounding at source
- Round all confidence averages to 6 decimal places
- Round per-token and per-word confidence values
- Remove timestamp_granularity and formatting helpers from pipeline
- Pipeline returns model.transcribe() output directly
- Auto-enable return_words and return_metrics when return_timestamps is true
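roundTs() itself is not shown in this thread; assuming seconds-based timestamps, a plausible shape for a "millisecond-precision rounding at source" helper is:

```js
// Assumed implementation sketch: round a seconds value to millisecond precision.
const roundTs = (seconds) => Math.round(seconds * 1000) / 1000;

// e.g. roundTs(1.2345678) === 1.235
```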
…imestamps, honor return_metrics kwarg

- modeling_nemo_conformer_tdt: dispose logits and new decoder state tensors before throwing when logitsData.length < vocabSize to prevent resource leak
- modeling_nemo_conformer_tdt: move returnFrameConfidences output block outside the return_timestamps guard so frame/frame_avg are emitted independently
- automatic-speech-recognition: change return_metrics from hardcoded true to kwargs.return_metrics ?? false to respect user intent and avoid overhead
- Accept upstream restructuring: SUPPORTED_TASKS and pipeline imports moved from pipelines.js to pipelines/index.js
- Migrate NemoConformerForTDT registration to pipelines/index.js accordingly
- Add MODEL_TYPES.NemoConformerTDT (id=16) to modeling_utils
- Register NemoConformerForTDT in MODEL_TYPE_MAPPING, MODEL_NAME_TO_CLASS_MAPPING, and MODEL_CLASS_TO_NAME_MAPPING so the base class from_pretrained, ModelRegistry, and is_pipeline_cached all recognise the model correctly
- Add NemoConformerTDT case to get_model_files so progress_callback receives accurate file size totals for encoder_model.onnx + decoder_model_merged.onnx
Standardizes internal logging to follow the upstream convention introduced in the ModelRegistry refactor.
- Guard feature extractor against empty/short audio (NaN prevention)
- Move decoder tensor init inside try block for safe disposal
- Add architecture key to MODEL_TYPE_MAPPING
- Add input validation in buildTransducerDetailedOutputs
- Harden audio cache hash against NaN samples
- Add order validation in computeTemporalDeltas
- Restore pipeline: return_timestamps truthy => words + metrics always on
- Remove all timestamp_granularity tests (feature was removed)
- Fix option names: return_tokens, return_words, return_timestamps
- Fix output fields: tokens/words arrays, not token_ids/word_timestamps
- Verify pipeline passes return_words + return_metrics when timestamps on
- Add test: return_timestamps 'word' treated as truthy
Address reviewer findings except the return_metrics policy decision.

- Fix temporal delta concatenation to interleave per frame and add dtype validation.
- Validate preemphasis range and clamp normalization variance in feature extraction.
- Remove unsafe encoder layout inference; require explicit encoder_output_layout.
- Redesign decode loop to read frame data on-demand instead of eager frame materialization.
- Deduplicate word finalization and avoid zero-filling missing word confidences.
- Tighten tests for delta layout/type checks, explicit layout requirement, call counts, and naming accuracy.
Fixes high-impact issues found in PR review validation:

- force NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT in registry overrides
- ensure encoder outputs are disposed when pre-decode validation throws
- remove stride sampling from audio cache key hashing to prevent false cache hits
- use encoder_model selector key in get_model_files for Nemo per-component dtype/device overrides

Also adds targeted regression tests for mapping, disposal behavior, file selection, and cache key correctness.
- Clamp token end timestamps to encoder frame bounds during TDT decoding.
- Validate FeatureLRUCache constructor limits to fail fast on invalid settings.
- Add regression tests for timestamp clamping and cache limit validation.
Dispose intermediate tensors in computeTemporalDeltas concatenate paths and dispose replaced base input features when delta concatenation returns a new tensor.

Add regression tests that assert disposal behavior for delta concatenate flows and feature extractor reassignment.
Dispose non-essential Tensor outputs returned by decoder steps to prevent cumulative memory growth. Keep logits/state tensors alive for decoding/state transitions and dispose extras immediately.

Add regression test to assert auxiliary decoder tensor outputs are disposed each step.
Compute encoder length directly from attention_mask.data instead of attention_mask.tolist() to avoid large transient array allocations in ASR decode hot path.
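A minimal sketch of the described change, assuming a transformers.js-style Tensor whose .data exposes the raw typed array (the surrounding decode code and the mask dtype are assumptions):

```js
// Before: attention_mask.tolist() materializes nested JS arrays just to count ones.
// After: iterate the typed-array view directly.
function encoderLengthFromMask(attention_mask) {
  const data = attention_mask.data; // e.g. BigInt64Array or Int32Array view
  let length = 0;
  for (let i = 0; i < data.length; ++i) {
    if (Number(data[i]) === 1) ++length;
  }
  return length;
}
```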
Fail fast when duration logits are required but missing in decoder output, and enforce positive-integer vocab size at runtime config validation. Validate prepared Nemo pipeline audio for non-empty finite samples before processor/model calls. Add regression tests for missing duration logits and non-finite audio rejection.
Fix placeholder interpolation in _prepare_model_inputs error text. Add fail-fast validation for Nemo delta_window and reject duplicate decoder output aliases in transducer io config. Add regression tests for delta_window validation and duplicate decoder output alias rejection.
Validate transcribe timeOffset as finite and guard encoderOutputs cleanup path to avoid masking primary failures. Align transducer_text JSDoc token type with runtime shape (include id). Harden Parakeet feature extractor test by using direct mask data and explicit tensor disposal via try/finally; add timeOffset validation regression test.
- fail fast on missing decoder state outputs and invalid encoder layout enums
- make FeatureLRUCache own cached tensor lifetimes (replace/evict/clear) with deduped disposal and deterministic size fallback
- validate n_fft/win_length in Nemo feature extractor
- align Nemo ASR pipeline docs with actual forwarded options
- add regression coverage for runtime config validation, non-concatenated deltas/cache behavior, missing decoder state outputs, and cache disposal semantics

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
Apply Gemini review nit in Nemo decode loop by replacing a redundant duration expression with Math.max(1, step).

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
Checklist (bot comment IDs):
- [x] 2892132356: guard tokenizer.get_vocab() return type before Object.keys in _resolveVocabSize.
- [x] 2892132367: treat zero cache limits as explicit no-cache mode; do not store/dispose just-produced values.
- [x] 2892132372: dispose processor tensors in Nemo ASR pipeline when cache does not own lifetimes.

Added regression tests for vocab resolution fallback, zero-limit cache semantics, and Nemo pipeline tensor ownership behavior.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
- widen confidenceFromLogits input type to Tensor data arrays
- narrow feature_cache access with explicit typed cast in ASR pipeline
Checklist (bot comment IDs):
- [x] 2892287484: handle array-returning tokenizer vocab in _resolveVocabSize.
- [x] 2892322884: avoid disposing when re-setting the same object for an existing cache key.
- [x] 2892322906: skip caching oversized values to prevent insert-then-dispose of caller-owned tensors.
- [x] 2892322910: guard byteLength type in estimateSizeBytes.

Added regression tests for array vocab sizing, same-object set behavior, oversized value skipping, and non-numeric byteLength handling.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
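Taken together, the cache fixes in the last few commits converge on a set() contract that reports ownership back to the caller. The sketch below illustrates that contract only; the class name, thresholds, and the omitted tensor-disposal details are illustrative, not copied from transducer_cache.js.

```js
// Sketch: LRU set() that refuses oversized values and tells the caller who owns the value.
class TinyFeatureLRU {
  constructor(maxEntries = 4, maxSizeBytes = 64 * 1024 * 1024) {
    this.maxEntries = maxEntries;
    this.maxSizeBytes = maxSizeBytes;
    this.map = new Map(); // insertion order doubles as recency order
  }
  /** @returns {boolean} true when the cache now owns the value's lifetime */
  set(key, value, sizeBytes) {
    if (this.maxEntries === 0 || this.maxSizeBytes === 0) return false; // explicit no-cache mode
    if (typeof sizeBytes !== "number" || !Number.isFinite(sizeBytes)) return false; // bad size estimate
    if (sizeBytes > this.maxSizeBytes) return false; // oversized: caller keeps ownership, nothing stored
    const existing = this.map.get(key);
    if (existing && existing.value === value) {
      this.map.delete(key);
      this.map.set(key, existing); // same object re-set: refresh recency, never dispose
      return true;
    }
    this.map.delete(key);
    this.map.set(key, { value, sizeBytes });
    while (this.map.size > this.maxEntries) {
      this.map.delete(this.map.keys().next().value); // evict least-recently used entry
    }
    return true;
  }
}
```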
Code Review
This is an impressive and extensive pull request that adds end-to-end support for Nemo Conformer TDT models. The changes are well-structured, introducing new modules for feature extraction, model implementation, pipeline logic, and various utilities. The code demonstrates a strong focus on robustness, with thorough configuration validation, error handling, and careful memory management of tensors. The addition of comprehensive unit tests for the new components is also a significant strength. My review includes one suggestion to enhance type safety, which will improve the long-term maintainability of this new functionality.
Resolved review thread: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
Actionable comments posted: 4
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: c7096cf9-11c9-489f-b30e-b2fd11b10b91
📒 Files selected for processing (21)
- packages/transformers/src/models/feature_extractors.js
- packages/transformers/src/models/modeling_utils.js
- packages/transformers/src/models/models.js
- packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
- packages/transformers/src/models/processors.js
- packages/transformers/src/models/registry.js
- packages/transformers/src/pipelines/automatic-speech-recognition.js
- packages/transformers/src/pipelines/index.js
- packages/transformers/src/utils/model_registry/get_model_files.js
- packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
- packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
- packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
📜 Review details
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Applied to files:
packages/transformers/src/models/models.jspackages/transformers/src/models/processors.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/feature_extractors.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_text.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.jspackages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/registry.jspackages/transformers/src/pipelines/index.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Applied to files:
packages/transformers/src/models/models.jspackages/transformers/src/models/processors.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/feature_extractors.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_text.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.jspackages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/registry.jspackages/transformers/src/pipelines/index.jspackages/transformers/src/utils/model_registry/get_model_files.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_cache.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.
Applied to files:
packages/transformers/src/models/models.jspackages/transformers/src/models/processors.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/feature_extractors.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_text.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.jspackages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/registry.jspackages/transformers/src/pipelines/index.jspackages/transformers/src/utils/model_registry/get_model_files.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_cache.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-05T23:43:18.376Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Applied to files:
packages/transformers/src/models/models.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/feature_extractors.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_text.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.jspackages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/registry.jspackages/transformers/src/pipelines/index.jspackages/transformers/src/utils/model_registry/get_model_files.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Applied to files:
packages/transformers/src/models/models.jspackages/transformers/src/models/processors.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/feature_extractors.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.jspackages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/registry.jspackages/transformers/src/pipelines/index.jspackages/transformers/src/utils/model_registry/get_model_files.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_cache.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Applied to files:
packages/transformers/src/models/models.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_text.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.jspackages/transformers/src/models/registry.jspackages/transformers/src/pipelines/index.jspackages/transformers/src/utils/model_registry/get_model_files.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.
Applied to files:
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.jspackages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.jspackages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_cache.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T22:59:55.984Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T22:59:55.984Z
Learning: When a model subclass overrides from_pretrained and bypasses the generic model-type branch, do not introduce or rely on a MODEL_TYPES entry for that subclass in the model registry. For this NemoConformerTDT case, in packages/transformers/src/utils/model_registry/get_model_files.js, avoid adding a NemoConformerTDT entry in the model type map; rely on the override logic in modeling_nemo_conformer_tdt.js. This keeps the registry explicit to the actual file and prevents unintended dispatch through the generic branch.
Applied to files:
packages/transformers/src/utils/model_registry/get_model_files.js
🔇 Additional comments (42)
packages/transformers/src/utils/model_registry/get_model_files.js (1)
180-183: LGTM: NemoConformerTDT file resolution is correctly configured. The branch correctly loads `encoder_model` and `decoder_model_merged` artifacts without `generation_config.json`, matching the non-generative ASR model pattern. Placement between Chatterbox and AutoEncoder is logical.
packages/transformers/src/models/modeling_utils.js (2)
121-122: LGTM: MODEL_TYPES enum extension is correct. Value 16 is unique and sequential. No corresponding MODEL_TYPE_CONFIG entry is needed since NemoConformerForTDT overrides `from_pretrained` and handles session construction directly. Based on learnings, this is the intended design.
880-883: LGTM: Error message refactored to template literals. Semantically equivalent change; cleaner string construction.
packages/transformers/src/models/feature_extractors.js (1)
8-8: LGTM: Feature extractor re-export follows established pattern. Alphabetical ordering maintained between moonshine and parakeet.
packages/transformers/src/models/registry.js (2)
44-44: LGTM: Encoder-only mapping entry enables AutoModel discovery.
584-587: LGTM: CUSTOM_MAPPING override correctly sets NemoConformerTDT model type. The dual-registration approach is correct: encoder-only mapping provides AutoModel lookup while CUSTOM_MAPPING ensures the correct model type for two-artifact loading. Comment adequately explains the rationale.
packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js (1)
1-30: LGTM: Word deduplication logic is correct. The normalization handles punctuation stripping and NFKC normalization appropriately. The deduplication correctly identifies overlapping adjacent words by normalized text and retains the longer-duration instance. Non-overlapping repeated words are correctly preserved.
packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js (2)
9-70: LGTM: Delta computation implementation is correct. Input validation is comprehensive. The standard delta formula is correctly implemented with proper edge handling for boundary frames. Memory management is sound: intermediate tensors are properly disposed when concatenating. The recursive call for order=2 delta-delta computation is clean.
72-91: LGTM: Frame interleaving helper is correct. Length validation prevents mismatched arrays. The interleaving logic correctly produces [T, F*N] output by copying each item's frame segment sequentially.
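As a standalone illustration of the [T, F*N] contract described in the comment above (this mirrors the reviewed behavior but is not the PR's code):

```js
// Interleave N flattened [T, F] Float32Array feature buffers so each frame's base
// features, deltas, and delta-deltas end up adjacent in a single [T, F*N] buffer.
function interleaveFrames(items, T, F) {
  if (items.some((a) => a.length !== T * F)) {
    throw new Error("every input must contain T*F values");
  }
  const N = items.length;
  const out = new Float32Array(T * F * N);
  for (let t = 0; t < T; ++t) {
    for (let n = 0; n < N; ++n) {
      out.set(items[n].subarray(t * F, (t + 1) * F), t * F * N + n * F);
    }
  }
  return out;
}
```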
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (3)
9-25: LGTM: Audio cache key generation is deterministic and collision-resistant. FNV-1a hash over quantized 16-bit samples with sampling_rate and length ensures stable cross-runtime keys. Non-finite value handling at line 19 prevents NaN propagation.
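For readers unfamiliar with FNV-1a, a key built this way could look roughly like the sketch below (standard 32-bit FNV constants; the actual key format in transducer_cache.js may differ):

```js
// Sketch: FNV-1a over 16-bit-quantized samples, combined with rate and length.
function audioCacheKeySketch(audio, samplingRate) {
  let hash = 0x811c9dc5; // FNV-1a offset basis
  for (let i = 0; i < audio.length; ++i) {
    const s = Number.isFinite(audio[i]) ? audio[i] : 0; // guard NaN/Infinity samples
    const q = Math.max(-32768, Math.min(32767, Math.round(s * 32767))) & 0xffff;
    hash ^= q & 0xff;
    hash = Math.imul(hash, 0x01000193); // FNV-1a prime
    hash ^= (q >>> 8) & 0xff;
    hash = Math.imul(hash, 0x01000193);
  }
  return `${samplingRate}:${audio.length}:${(hash >>> 0).toString(16)}`;
}
```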
31-139: LGTM: FeatureLRUCache implementation is correct with proper ownership semantics. The cache correctly handles:
- No-cache mode (max_entries=0 or max_size_mb=0)
- Identical value refresh without disposal
- Oversized value rejection
- Existing value replacement with proper disposal
- LRU eviction with accurate size tracking
- Return value indicating retained ownership (line 108)
The `_evict` loop has proper termination via the undefined key check at line 132.
141-208: LGTM: Tensor lifecycle helpers are correctly scoped. `collectCachedTensors` handles the known value structure (input_features, attention_mask, delta_features, delta_delta_features). `tensorByteSize` has appropriate fallbacks when byteLength is unavailable. Size estimation handles both raw Tensors and structured cache values.
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js (5)
27-93: LGTM: Constructor validation and initialization are comprehensive. All configuration parameters are validated with clear error messages. The centered Hann window construction and delta configuration are correctly handled. The warning for non-concatenated delta mode appropriately alerts users to the separate output fields.
100-136: LGTM: Feature bank extraction correctly applies preemphasis and centered STFT. Cloning the waveform at line 108 prevents destructive modification of caller's data. Preemphasis applied in reverse order (lines 110-112) is correct for in-place computation. Spectrogram parameters match NeMo conventions.
150-166: LGTM: Caching logic correctly tracks tensor ownership. Cache key construction includes delta parameters ensuring cache isolation for different configurations. The `cacheOwnsTensors` return value from `feature_cache.set()` correctly drives ownership tagging, allowing pipelines to determine disposal responsibility.
168-241: LGTM: Extraction pipeline is correct with proper memory management. Key observations:
- The `raw_features_length` formula at lines 172-174 intentionally excludes the terminal padded frame (per learnings).
- Per-feature normalization uses sample variance (N-1 divisor) which is appropriate.
- Empty/short audio handling correctly skips normalization to avoid divide-by-zero.
- `unsqueeze_(0)` at line 214 creates the [1, T, F] shape required by `computeTemporalDeltas`.
- When delta concatenation produces a new Tensor, the original `input_features` is properly disposed at line 229.

243-249: LGTM: Cache control methods are correctly implemented. Direct delegation to the underlying cache with null safety.
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js (8)
27-83: Utility functions are correct. The `argmax` implementation handles offset/length properly. The `confidenceFromLogits` uses the numerically stable log-sum-exp trick for softmax computation. The `toInt` handles bigint conversion from ONNX int64 outputs.
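Concretely, the "max softmax probability via log-sum-exp" idea reduces to the following (a generic sketch, not the PR's exact confidenceFromLogits):

```js
// Confidence of the argmax class, computed without exponentiating large logits.
function maxSoftmaxConfidence(logits) {
  let max = -Infinity;
  for (const v of logits) if (v > max) max = v;
  let sumExp = 0;
  for (const v of logits) sumExp += Math.exp(v - max);
  return 1 / sumExp; // exp(max - max) / sum_j exp(x_j - max)
}
```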
85-235: Configuration validation is thorough. The validation covers:
- Decoder config (num_layers, hidden_size) integrity
- I/O name uniqueness constraints
- Session existence and expected I/O names
- Layout and dtype validations
The error messages provide actionable guidance for model exporters. Approved.
296-316: Session construction aligns with registry contract. The explicit keys `encoder_model` and `decoder_model_merged` match the `get_model_files.js` branch for `MODEL_TYPES.NemoConformerTDT`. The error wrapping provides context when session loading fails.
607-616: Encoder feed disposal is correct. The finally block disposes transposed/length tensors created in `_buildEncoderFeeds` regardless of encoder success or failure.
686-729: Decoder output validation with proper disposal on error. The code validates `logits`, `outputState1`, and `outputState2` presence, disposing allocated resources before throwing. The `seenDecoderTensors` set prevents double-dispose of aliased outputs.
781-821: State management and frame advancement logic is correct. When emitting a token (non-blank), the old decoder state is disposed while keeping the new state. When blank, the new state is disposed to reuse the existing state. Frame advancement respects TDT duration semantics: `step > 0` advances by step frames; blank or `emittedOnFrame >= maxSymbolsPerStep` advances by 1 frame.
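Paraphrased as a small function (variable names invented here; the reviewed decode loop is more involved):

```js
// TDT frame advancement: the duration head proposes a skip, with a forced
// single-frame advance on blank or when the per-frame symbol cap is reached.
function nextFrame(frame, { isBlank, step, emittedOnFrame, maxSymbolsPerStep }) {
  if (isBlank || emittedOnFrame >= maxSymbolsPerStep) return frame + 1;
  if (step > 0) return frame + step; // skip `step` encoder frames
  return frame; // non-blank token with zero duration: keep emitting on this frame
}
```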
823-835: Finally block ensures complete resource cleanup. All allocated tensors (`targetLengthTensor`, `decoderState`, `encoderOutputs`) are disposed. The `seen` set in encoder output disposal handles potential tensor aliasing in session outputs.
948-955: Registry mappings are consistent. Both the `model_type` key (nemo-conformer-tdt) and architecture key (NemoConformerForTDT) are registered, aligning with the `CUSTOM_MAPPING` in `registry.js` (see context snippet 1).
packages/transformers/src/models/models.js (1)
106-106: Re-export correctly placed. The export is alphabetically ordered with existing model exports.
packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js (4)
1-22: Sentence boundary constants are well-defined. The regex patterns and non-breaking period set cover common ASR edge cases (acronyms, honorifics, enumerations). The 3-second fallback gap threshold is reasonable for natural speech pauses.
28-42: Word joining handles punctuation correctly. Punctuation-only tokens are appended without space, while other tokens get a leading space. This produces natural text output from ASR word sequences.
81-116: Sentence boundary heuristic is conservative by design. The function favors under-segmentation:
- Strong endings (!?…) always break
- Periods require both non-breaking word exclusion AND capitalized next word
- Large gaps force breaks regardless of punctuation

This prevents false positives on abbreviations and enumerations.
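Stated as code, the break rule reads roughly as follows (helper names and signature are invented for illustration):

```js
// Sketch of the conservative sentence-break decision described above.
function isSentenceBreak(word, nextWord, nonBreakingWords, gapSeconds, maxGapSeconds = 3) {
  const text = word.trim();
  if (/[!?…]$/.test(text)) return true;         // strong endings always break
  if (gapSeconds > maxGapSeconds) return true;  // long pauses force a break
  if (text.endsWith(".")) {
    const bare = text.slice(0, -1).toLowerCase();
    const nextCapitalized = typeof nextWord === "string" && /^[A-Z]/.test(nextWord.trim());
    return !nonBreakingWords.has(bare) && nextCapitalized; // both conditions required
  }
  return false;
}
```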
166-178: Empty words fallback is correct. When `words` is empty but `utteranceTimestamp` exists, a single chunk with the provided text is returned. When both are absent, an empty array is returned. This aligns with call sites in `pipeline_nemo_conformer_tdt.js` where `utteranceTimestamp` is null when words are empty (context snippet 1).
packages/transformers/src/models/processors.js (1)
11-11: Re-export correctly placed.packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js (5)
130-189: Test utilities are well-designed.
`withNemoTensorOwnership` correctly sets the non-enumerable ownership flag used by the pipeline for tensor disposal decisions. `makeUnitPipe` provides a clean mock that simulates the model.transcribe contract with configurable outputs.
191-236: API contract tests verify pipeline behavior. Tests confirm:
- `return_timestamps: false` → `{ text }`
- `return_timestamps: true` → `{ text, chunks }` with sentence segments
- `return_timestamps: 'word'` → `{ text, chunks }` with word-level timestamps

The mock call assertions verify correct option forwarding.
278-382: Windowed transcription test validates timeOffset propagation. The test verifies that explicit
`chunk_length_s` triggers sentence-windowed processing with correct `timeOffset` values at window boundaries. The mock outputs simulate realistic cross-boundary word overlaps.
384-466: Boundary retranscription test validates sentence replacement. When the first window's last sentence is incomplete, the second window provides the complete version. The test verifies the merged output contains the longer, correctly-ended sentence.
817-896: Tensor disposal tests validate ownership semantics. Three scenarios are tested:
- `cacheOwnsTensors=false` → pipeline disposes tensors
- `cacheOwnsTensors=true` → pipeline preserves tensors (cache owns them)
- Cache limits disable caching → pipeline disposes tensors
These tests ensure no memory leaks under different cache configurations.
packages/transformers/src/pipelines/index.js (2)
33-33: Import is correct. The import path matches the module location.
152-159: Model array registration is correct. Position in the array does not affect priority—model selection uses
cls.supports(model_type)matching (per context snippet frompipelines.js:196-211).NemoConformerForTDT.supports()returns true only fornemo-conformer-tdtmodel type.packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js (1)
8-18: Processor override is intentional. The
_callmethod explicitly delegates tofeature_extractoronly, bypassing the base class iteration logic (context snippet:processing_utils.js:117-124). This is correct for audio-only preprocessing where tokenization is handled separately in the transcription flow.packages/transformers/src/pipelines/automatic-speech-recognition.js (3)
6-8: Import is correct. Imports the pipeline runner from the Nemo TDT module.
158-159: Model type routing is correct. Routes
the `nemo-conformer-tdt` model type to the dedicated handler method.
308-324: Pipeline integration is clean. The method delegates to
`runNemoConformerTDTPipeline` with all required context: model, processor, tokenizer, audio, kwargs, and the `prepareAudios` utility. This keeps the pipeline surface aligned with the shared ASR task contract while rich Nemo-specific outputs remain available via direct `model.transcribe()` calls.
Resolved review threads:
- packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js (outdated)
- ...s/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
Force-pushed from 427fb7f to 00b3d93.
Keep the shared ASR pipeline suite focused on the public Nemo contract and move adapter-specific windowing, retranscription, cache-ownership, and disposal coverage into a dedicated Nemo pipeline test file. Narrow the source diff by removing explanatory Nemo comments and reverting unrelated upstream-only tweaks, while also fixing the review findings around cursor snap-forward merging, tokenizer vocab-shape handling, empty timestamp validation, and cache borrow/release semantics for active inference.

Verification:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 3
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1817e370-5686-496d-8150-70b8d6dcce10
📒 Files selected for processing (11)
- packages/transformers/src/models/modeling_utils.js
- packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
- packages/transformers/src/models/registry.js
- packages/transformers/src/pipelines/automatic-speech-recognition.js
- packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
- packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
- packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
📜 Review details
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Applied to files:
packages/transformers/src/models/registry.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Applied to files:
packages/transformers/src/models/registry.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_cache.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T23:43:18.376Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Applied to files:
packages/transformers/src/models/registry.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Applied to files:
packages/transformers/src/models/registry.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Applied to files:
packages/transformers/src/models/registry.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_cache.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.
Applied to files:
packages/transformers/src/models/registry.jspackages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.jspackages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.jspackages/transformers/src/pipelines/automatic-speech-recognition.jspackages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.jspackages/transformers/src/models/modeling_utils.jspackages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.jspackages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.jspackages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.jspackages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-08T20:49:38.204Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.
🔇 Additional comments (4)
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js (1)
12-25: ⚠️ Potential issue | 🟠 Major: Expose a real release API for cache-owned feature outputs.

`_call()` borrows cache entries via `acquire()`, but the returned object only carries the release hook on a hidden symbol. The pipeline knows that contract; direct `NemoConformerTDTFeatureExtractor` callers do not, so cache hits can leave entries permanently borrowed and undisposable.

Proposed fix:

```diff
 function tagNemoFeatureOutputOwnership(value, cacheOwnsTensors, release = null) {
   Object.defineProperty(value, NEMO_FEATURE_OUTPUT_OWNERSHIP, {
     value: cacheOwnsTensors,
     enumerable: false,
     configurable: true,
@@
   if (release) {
     Object.defineProperty(value, NEMO_FEATURE_OUTPUT_RELEASE, {
       value: release,
       enumerable: false,
       configurable: true,
     });
+    Object.defineProperty(value, 'release', {
+      value: release,
+      enumerable: false,
+      configurable: true,
+    });
+    Object.defineProperty(value, 'dispose', {
+      value: release,
+      enumerable: false,
+      configurable: true,
+    });
   }
   return value;
 }
```

Also applies to: 161-179
⛔ Skipped due to learnings
Learnt from: ysdede Repo: ysdede/transformers.js PR: 12 File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135 Timestamp: 2026-03-05T23:43:17.261Z Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.

Learnt from: ysdede Repo: ysdede/transformers.js PR: 8 File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25 Timestamp: 2026-03-03T23:29:56.832Z Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.

Learnt from: ysdede Repo: ysdede/transformers.js PR: 14 File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282 Timestamp: 2026-03-08T20:49:38.204Z Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.

Learnt from: ysdede Repo: ysdede/transformers.js PR: 12 File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356 Timestamp: 2026-03-05T23:43:18.376Z Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.

packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (1)
52-56: ⚠️ Potential issue | 🟠 Major: `get()` returns cache-owned tensors with no lifetime protection.

`get()` hands out the raw cached value without incrementing `borrowers`. The next eviction or `clear()` can dispose that tensor while the caller still holds a reference, which makes this public API unsafe for tensor-backed entries.

Proposed fix:

```diff
 /**
  * @param {string} key
- * @returns {any|null}
+ * @returns {{ value: any, release: () => void } | null}
  */
 get(key) {
-  const entry = this._touch(key);
-  if (!entry) return null;
-  return entry.value;
+  return this.acquire(key);
 }
```

⛔ Skipped due to learnings
Learnt from: ysdede Repo: ysdede/transformers.js PR: 12 File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135 Timestamp: 2026-03-05T23:43:17.261Z Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.

packages/transformers/src/pipelines/automatic-speech-recognition.js (1)
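For readers outside the pipeline code, a rough sketch of the borrow/release contract these cache comments keep referring to; the class and method names are hypothetical, not the actual `FeatureLRUCache` implementation:

```js
// Illustrative borrow/release bookkeeping: an entry cannot be evicted or
// disposed while at least one caller still holds it.
class BorrowableCache {
  constructor() {
    this.entries = new Map(); // key -> { value, borrowers }
  }
  acquire(key) {
    const entry = this.entries.get(key);
    if (!entry) return null;
    entry.borrowers += 1; // counted until the caller releases it
    return {
      value: entry.value,
      release: () => {
        entry.borrowers = Math.max(0, entry.borrowers - 1);
      },
    };
  }
  canEvict(key) {
    const entry = this.entries.get(key);
    return !!entry && entry.borrowers === 0;
  }
}
```

Callers pair every `acquire()` with a `release()` once they are done with the borrowed value.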
6-8: Nemo routing is cleanly isolated. Keeping the switch thin and delegating the model-specific path to `runNemoConformerTDTPipeline` preserves the shared ASR flow without duplicating audio-preparation logic.

Also applies to: 158-159, 308-317
packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js (1)
643-704: The tensor-lifecycle coverage is solid. Exercising both dispose-owned and release-owned paths is the right guardrail for this adapter boundary.
packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 00b3d9346b
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
Actionable comments posted: 3
♻️ Duplicate comments (4)
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (1)
17-24: ⚠️ Potential issue | 🟠 Major: Hash the exact sample buffer, not lossy int16 bins.

This key aliases distinct `Float32Array`/`Float64Array` waveforms that round into the same 16-bit values, so the feature cache can replay stale tensors for different audio. The full scan is fine; the lossy quantization is the bug.

Proposed fix:

```diff
 export function createAudioCacheKey(audio, sampling_rate = 16000) {
-  // FNV-1a 32-bit over quantized values for deterministic cross-runtime keys.
+  // FNV-1a 32-bit over the exact sample bytes.
   let hash = 2166136261;
   hash ^= audio.length;
   hash = Math.imul(hash, 16777619);
   hash ^= sampling_rate;
   hash = Math.imul(hash, 16777619);
-
-  // Hash all quantized samples to minimize false cache hits across waveforms.
-  for (let i = 0; i < audio.length; ++i) {
-    const sample = Number.isFinite(audio[i]) ? audio[i] : 0;
-    const q = Math.max(-32768, Math.min(32767, Math.round(sample * 32768)));
-    hash ^= q;
+  const bytes = new Uint8Array(audio.buffer, audio.byteOffset, audio.byteLength);
+  hash ^= bytes.length;
+  hash = Math.imul(hash, 16777619);
+  for (let i = 0; i < bytes.length; ++i) {
+    hash ^= bytes[i];
     hash = Math.imul(hash, 16777619);
   }
-  return `${sampling_rate}:${audio.length}:${(hash >>> 0).toString(16)}`;
+  return `${sampling_rate}:${audio.constructor.name}:${audio.length}:${(hash >>> 0).toString(16)}`;
 }
```

packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js (1)
168-172: ⚠️ Potential issue | 🟠 Major: Fix the Nemo `returnMetrics` assertions.

These expectations lock in the wrong adapter contract. Once the implementation is corrected, they will fail or pressure the pipeline back to the wrong behavior.

Based on learnings, `packages/transformers/src/pipelines/automatic-speech-recognition.js` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline.

Proposed fix:

```diff
 expect(calls[0]).toMatchObject({
   returnTimestamps: false,
   returnWords: false,
-  returnMetrics: false,
+  returnMetrics: true,
 });
@@
 expect(calls[0]).toMatchObject({
   returnTimestamps: true,
   returnWords: true,
-  returnMetrics: false,
+  returnMetrics: true,
 });
@@
 expect(calls[0]).toMatchObject({
   returnTimestamps: true,
   returnWords: true,
-  returnMetrics: false,
+  returnMetrics: true,
 });
```

Also applies to: 181-185, 197-200
packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js (1)
158-160: ⚠️ Potential issue | 🟠 Major: Update these adapter-call expectations to `returnMetrics: true`.

These assertions are validating the wrong Nemo pipeline contract.

Based on learnings, `packages/transformers/src/pipelines/automatic-speech-recognition.js` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline.

Proposed fix:

```diff
- expect(calls[0]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 0 });
- expect(calls[1]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 19.95 });
- expect(calls[2]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 37.9 });
+ expect(calls[0]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 0 });
+ expect(calls[1]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 19.95 });
+ expect(calls[2]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 37.9 });
@@
 expect(call).toMatchObject({
   returnTimestamps: true,
   returnWords: true,
-  returnMetrics: false,
+  returnMetrics: true,
 });
```

Also applies to: 527-531
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js (1)
173-179: ⚠️ Potential issue | 🟠 Major: Keep `returnMetrics` enabled in both Nemo transcribe paths.

The adapter contract intentionally asks `model.transcribe()` for metrics on every pipeline call. Hardcoding `false` here changes that contract for both short-audio and sentence-windowed runs.

Based on learnings, `packages/transformers/src/pipelines/automatic-speech-recognition.js` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline.

Proposed fix:

```diff
 const output = await runNemoTranscribe(windowAudio, {
   tokenizer,
   returnTimestamps: true,
   returnWords: true,
-  returnMetrics: false,
+  returnMetrics: true,
   timeOffset: start_s,
 });
@@
 const output = await runNemoTranscribe(aud, {
   tokenizer,
   returnTimestamps: wantTimestampChunks,
   returnWords: wantTimestampChunks,
-  returnMetrics: false,
+  returnMetrics: true,
 });
```

Also applies to: 325-330
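A hypothetical sketch of the kwargs mapping that these "adapter contract" threads describe (metrics always requested, word output tied to timestamps); the helper name is illustrative, not the pipeline's actual function:

```js
// Sketch of the contract discussed above: the pipeline adapter always requests
// metrics from model.transcribe() and only asks for words when timestamps are on.
function toTranscribeOptions(kwargs = {}) {
  const returnTimestamps = Boolean(kwargs.return_timestamps);
  return {
    returnTimestamps,
    returnWords: returnTimestamps, // word offsets only make sense with timestamps
    returnMetrics: true,           // pipeline contract: metrics are always requested
  };
}
```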
📒 Files selected for processing (11)
packages/transformers/src/models/modeling_utils.js
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
packages/transformers/src/models/registry.js
packages/transformers/src/pipelines/automatic-speech-recognition.js
packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
📜 Review details
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
📚 Learning: 2026-03-05T23:43:18.376Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
📚 Learning: 2026-03-08T20:49:38.204Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.
🔇 Additional comments (3)
packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js (2)
571-650: The decoder lifetime coverage is strong. These cases pin down encoder failure cleanup and per-step auxiliary-output disposal, which are the easiest tensor-lifetime regressions to miss in `transcribe()`.

981-1074: The cache ownership contract is well covered. Replacement, eviction, borrowed-entry deferral, explicit no-cache mode, and oversize skips are all locked down here. That is the right surface to keep stable for `FeatureLRUCache`.

packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js (1)

696-704: The unique-disposal handling is solid. Deduping auxiliary decoder tensors and final encoder outputs avoids the double-dispose trap while still cleaning up every failure path.
Also applies to: 826-834
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
Apply the remaining valid Nemo Conformer TDT review fixes without widening the shared ASR pipeline surface. - honor encoder_input_layout for canonical input_features feeds - keep borrowed cache entries counted until they are actually released - reject tokenizer-less non-empty word-offset reconstruction - raise the auto-window budget to match the minimum guaranteed cursor advance - add focused model and pipeline regressions for each fix Verified with: - node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt" - node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Code Review Summary

Status: No New Issues Found | Recommendation: Merge

Overview

This PR adds Nemo Conformer TDT ASR support with a comprehensive implementation covering model execution, feature extraction, decoding, pipeline integration, and extensive test coverage. The existing review comments have been thoroughly addressed in multiple fix commits.

Security Review

No concrete security issues identified.

Performance Review

No performance concerns identified.

Reliability Review

Code demonstrates strong reliability patterns.

Test Coverage

Comprehensive test suite added.

Files Reviewed (12 files)

Previous Issues Status

All previously identified issues have been addressed.
Actionable comments posted: 1
♻️ Duplicate comments (1)
packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js (1)
158-160: ⚠️ Potential issue | 🟠 Major: Fix the adapter contract assertions.

Line 158 and Line 527 still pin `returnMetrics: false`, but the Nemo pipeline adapter intentionally forces `returnMetrics: true`. These tests are asserting the wrong contract.

Based on learnings, `packages/transformers/src/pipelines/automatic-speech-recognition.js` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` for the Nemo Conformer TDT pipeline.

Suggested patch:

```diff
- expect(calls[0]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 0 });
- expect(calls[1]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 19.95 });
- expect(calls[2]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 37.9 });
+ expect(calls[0]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 0 });
+ expect(calls[1]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 19.95 });
+ expect(calls[2]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 37.9 });
@@
 expect(call).toMatchObject({
   returnTimestamps: true,
   returnWords: true,
-  returnMetrics: false,
+  returnMetrics: true,
 });
```

Also applies to: 527-531
📒 Files selected for processing (11)
packages/transformers/src/models/modeling_utils.js
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
packages/transformers/src/models/registry.js
packages/transformers/src/pipelines/automatic-speech-recognition.js
packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
📜 Review details
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.
📚 Learning: 2026-03-05T23:43:18.376Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
📚 Learning: 2026-03-08T20:49:38.204Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
📚 Learning: 2026-03-08T21:41:31.847Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:17-24
Timestamp: 2026-03-08T21:41:31.847Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally quantizes each sample to int16 (via `Math.round(sample * 32768)`) before hashing. This is a deliberate design choice for deterministic cross-runtime key stability. The resulting aliasing of distinct waveforms that fall in the same int16 bins is an accepted tradeoff because feature caching is optional and off by default. Do not flag the int16 quantization as a collision risk or suggest hashing raw bytes instead.
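A minimal sketch of the key construction this learning describes, using the FNV-1a constants quoted in the review thread above; treat it as an illustration rather than the shipped helper:

```js
// Sketch: FNV-1a 32-bit hash over int16-quantized samples, seeded with the
// sample count and sampling rate, producing a deterministic cache key string.
function sketchAudioCacheKey(audio, sampling_rate = 16000) {
  let hash = 2166136261; // FNV offset basis
  hash ^= audio.length;
  hash = Math.imul(hash, 16777619); // FNV prime
  hash ^= sampling_rate;
  hash = Math.imul(hash, 16777619);
  for (let i = 0; i < audio.length; ++i) {
    const sample = Number.isFinite(audio[i]) ? audio[i] : 0;
    const q = Math.max(-32768, Math.min(32767, Math.round(sample * 32768)));
    hash ^= q;
    hash = Math.imul(hash, 16777619);
  }
  return `${sampling_rate}:${audio.length}:${(hash >>> 0).toString(16)}`;
}

// Distinct waveforms that quantize to the same int16 bins alias to one key,
// which the learning above records as an accepted tradeoff.
```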
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.
🔇 Additional comments (1)
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js (1)
422-445: Canonical encoder inputs now honor `encoder_input_layout`. This shared normalization path for `input_features` and `audio_signal` closes the layout bypass and keeps the transpose/disposal logic in one place.
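A sketch of what honoring the layout flag could look like, assuming a config value named `encoder_input_layout` and a Tensor with `transpose`/`dispose` helpers; the real normalization path in the PR may differ:

```js
// Illustration only: bring an incoming feature tensor to the layout the encoder
// expects, disposing the intermediate when a transpose is actually performed.
function normalizeEncoderInput(features /* Tensor with dims [B, T, F] */, encoder_input_layout) {
  if (encoder_input_layout === 'BFT') {
    const transposed = features.transpose(0, 2, 1); // [B, T, F] -> [B, F, T]
    features.dispose?.(); // the pre-transpose tensor is no longer needed
    return transposed;
  }
  return features; // already in the expected [B, T, F] layout
}
```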
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
Restore the original cast spacing in the unrelated moonshine path so the Nemo PR does not carry an extra formatting-only diff in automatic-speech-recognition.js.
Resolve sparse tokenizer vocab fallback by deriving the runtime size from the maximum token id instead of counting entries. This keeps decoder sizing correct when tokenizer ids are non-contiguous. Tighten merged-word dedupe so punctuation-only overlaps are only collapsed when their raw normalized text also matches, which avoids dropping distinct punctuation tokens across window boundaries. Add focused Nemo model regressions and verify with: - node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt" - node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
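A minimal sketch of the sizing rule this commit message describes, assuming the vocab arrives as a Map of token string to numeric id; the actual helper in the PR may be named differently:

```js
// Size the runtime vocab from the largest token id rather than the entry count,
// so a sparse (non-contiguous) id space still covers every decoder logit.
function resolveVocabSizeFromMaxId(vocab /* Map<string, number> */) {
  let maxId = -1;
  for (const id of vocab.values()) {
    if (Number.isInteger(id) && id > maxId) maxId = id;
  }
  return maxId + 1; // ids are 0-based, so size is max id + 1
}

// A sparse vocab such as {"a" -> 0, "b" -> 7} yields 8, not 2.
```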
Treat likely domain suffixes as continuations when tokenizer decoding inserts whitespace after a trailing period, so sequences like `LibriVox. org.` reconstruct as `LibriVox.org.` in detailed word offsets. Add a focused regression covering the split `.org` token pattern and verify with: - node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt" - node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
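A rough sketch of the described heuristic, with an illustrative suffix list; the PR's actual word-offset reconstruction logic may differ:

```js
// If decoding split a word across a trailing period ("LibriVox." + "org."),
// glue a likely domain suffix back onto the previous word.
const LIKELY_DOMAIN_SUFFIXES = new Set(['org', 'com', 'net', 'io']); // illustrative list

function mergeDomainSuffixes(words) {
  const merged = [];
  for (const word of words) {
    const prev = merged[merged.length - 1];
    const bare = word.replace(/\.+$/, '').toLowerCase();
    if (prev && prev.endsWith('.') && LIKELY_DOMAIN_SUFFIXES.has(bare)) {
      merged[merged.length - 1] = prev + word; // "LibriVox." + "org." -> "LibriVox.org."
    } else {
      merged.push(word);
    }
  }
  return merged;
}

console.log(mergeDomainSuffixes(['LibriVox.', 'org.'])); // ["LibriVox.org."]
```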
This reverts commit 39e5cb1.
Summary
Supersedes #13 with the current `main`-based Nemo branch line.

This PR adds NeMo Conformer TDT ASR support to transformers.js, including model execution, feature extraction, decoding, reconstruction, pipeline wiring, registry integration, and Nemo-specific regression coverage. The Nemo pipeline is aligned to the shared `automatic-speech-recognition` task contract, while richer direct `model.transcribe()` outputs remain available for lower-level use.

What Is Included
1. Model + Decoder

`model.transcribe()` support for text, timestamps, confidences, optional words and tokens, and optional metrics/debug payloads.

2. Feature Extraction
3. ASR Pipeline Integration

`AutomaticSpeechRecognitionPipeline` dispatch (usage sketch below):

- Default output: `{ text }`
- `return_timestamps: true`: `{ text, chunks }` with sentence-like finalized chunks
- `return_timestamps: 'word'`: `{ text, chunks }` with word-level timestamps
- Richer detail stays available through direct `model.transcribe()`.
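A usage sketch for the output shapes listed above; the import specifier and model id are placeholders rather than values confirmed by this PR:

```js
// Sketch only: how a consumer of this fork might exercise the pipeline contract.
import { pipeline } from '@huggingface/transformers'; // adjust to this fork's published name

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'your-org/your-nemo-conformer-tdt-onnx', // placeholder model id
);

const audio = new Float32Array(16000); // 1 s of 16 kHz samples, stand-in for real audio

// Default: { text }
const plain = await transcriber(audio);

// Sentence-like chunks: { text, chunks }
const sentences = await transcriber(audio, { return_timestamps: true });

// Word-level timestamps, with chunk_length_s as the Nemo window-size override:
const words = await transcriber(audio, { return_timestamps: 'word', chunk_length_s: 20 });
```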
4. Long-Audio Handling

`chunk_length_s` is used as the Nemo window-size override in pipeline mode.

5. Word Reconstruction / Timestamp Grouping

Per-word `score` output, plus grouping of split number/date sequences (e.g. `48-year-old`, `0.5`, `March 20th, 2021`).

6. Registry + Model File Resolution

`encoder_model` and `decoder_model_merged`.

7. Follow-up Review Fixes

- Honor `encoder_input_layout` for canonical `input_features` feeds.
- Derive `vocab_size` from the maximum tokenizer id so sparse vocabs do not undersize decoder logits.

Regression Coverage
Added or updated tests in:
- packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
- packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
- packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
- packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js

Coverage includes:
Upstream Sync Included
This branch was synced with
`upstream/main` through commit `f65a4c7c` (merge commit `49a4af8f`).

Relevant Nemo follow-up commits on top of that sync include:

- `ee819a1c` fix(nemo-tdt): add supports() for ASR model class selection
- `8dfccddc` feat(nemo-tdt): align asr pipeline outputs and long-audio handling
- `f59ba068` feat(nemo-conformer-tdt): add sentence-based ASR pipeline chunking
- `00b3d934` fix(nemo): scope ASR tests and address review fixes
- `07118c38` fix(nemo-tdt): address follow-up review threads
- `29f2baaf` fix(nemo-tdt): handle sparse vocab and merge dedupe
Executed for this refresh:
node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"pnpm buildnpm run test:nemo:scientistsScope Boundary
This PR stays focused on Nemo Conformer TDT integration and the follow-up work needed to:
Direct
model.transcribe()remains the low-level API for advanced app-specific postprocessing.