
feat: add Nemo Conformer TDT support (sentence-based pipeline refresh)#14

Closed
ysdede wants to merge 42 commits into main from
v4-nemo-conformer-tdt-main-r3

Conversation


@ysdede ysdede commented Mar 8, 2026

Summary

Supersedes #13 with the current main-based Nemo branch line.

This PR adds NeMo Conformer TDT ASR support to transformers.js, including model execution, feature extraction, decoding, reconstruction, pipeline wiring, registry integration, and Nemo-specific regression coverage.

The Nemo pipeline is aligned to the shared automatic-speech-recognition task contract, while richer direct model.transcribe() outputs remain available for lower-level use.

What Is Included

1. Model + Decoder

  • Added the NeMo Conformer TDT model implementation.
  • Implemented greedy token-and-duration transducer decoding.
  • Added model.transcribe() support for text, timestamps, confidences, optional words and tokens, and optional metrics/debug payloads.

2. Feature Extraction

  • Added Conformer-TDT-specific log-mel feature extraction.
  • Added optional temporal deltas and delta-delta features.
  • Added optional feature-cache utilities with tensor ownership and lifecycle handling.

3. ASR Pipeline Integration

  • Integrated the Conformer TDT model type into AutomaticSpeechRecognitionPipeline dispatch.
  • Aligned pipeline outputs with the shared ASR task shape:
    • default: { text }
    • return_timestamps: true: { text, chunks } with sentence-like finalized chunks
    • return_timestamps: 'word': { text, chunks } with word-level timestamps
  • Kept richer model-native outputs on direct model.transcribe().
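The task-shaped outputs above can be sketched as follows. This is a minimal illustration of the contract, not the pipeline's actual internals; `toTaskShape` and its input word objects are hypothetical names.

```javascript
// Hypothetical sketch: collapse model-native timed words into the shared
// ASR task shape ({ text } / { text, chunks }). Illustrative only.
function toTaskShape(words, { return_timestamps = false } = {}) {
  const text = words.map((w) => w.word).join(" ");
  if (!return_timestamps) return { text };
  if (return_timestamps === "word") {
    return {
      text,
      chunks: words.map((w) => ({ text: w.word, timestamp: [w.start, w.end] })),
    };
  }
  // return_timestamps === true: one sentence-like chunk per terminal punctuation.
  const chunks = [];
  let current = [];
  const flush = () => {
    if (!current.length) return;
    chunks.push({
      text: current.map((x) => x.word).join(" "),
      timestamp: [current[0].start, current[current.length - 1].end],
    });
    current = [];
  };
  for (const w of words) {
    current.push(w);
    if (/[.!?]$/.test(w.word)) flush();
  }
  flush(); // keep any trailing, unterminated run as a final chunk
  return { text, chunks };
}
```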

4. Long-Audio Handling

  • Added automatic long-audio handling for Nemo pipeline calls above 180 seconds.
  • Replaced the old overlap-oriented long-audio path with sentence-cursor restart logic.
  • Long-audio windowing finalizes stable sentence-like segments, drops the immature trailing segment, and retranscribes from that segment start.
  • chunk_length_s is used as the Nemo window-size override in pipeline mode.
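The sentence-cursor restart logic can be illustrated with a small sketch. `advanceCursor` and its segment shape are hypothetical, assuming each finalized segment carries start/end times:

```javascript
// Illustrative sketch of sentence-cursor windowing: finalize all but the
// last (possibly immature) segment, then resume from that segment's start.
function advanceCursor(segments, windowEnd) {
  if (segments.length <= 1) {
    // Nothing safely finalized; fall back to the window end to guarantee progress.
    return { finalized: [], nextStart: windowEnd };
  }
  const finalized = segments.slice(0, -1);
  const trailing = segments[segments.length - 1];
  // Drop the immature trailing segment and retranscribe from its start.
  return { finalized, nextStart: trailing.start };
}
```

The fallback branch mirrors the "minimum guaranteed cursor advance" concern addressed later in the review fixes: the cursor must always move forward, even when a window yields a single unfinished sentence.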

5. Word Reconstruction / Timestamp Grouping

  • Reworked word reconstruction to derive boundaries from the final decoded text instead of isolated token decoding only.
  • Improved segment grouping from timed words so sentence-like chunks are more stable than the arbitrary splits produced by Whisper-style chunking.
  • Fixed spacing and boundary failures around punctuation-heavy and numeric outputs such as:
    • score.48-year-old
    • with0.5
    • March20th,2021.
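The core idea behind deriving boundaries from the final decoded text can be sketched as below. Instead of concatenating per-token decodes (which is where spacing around punctuation and digits gets lost), each word is located inside the canonical text. `locateWords` is an illustrative name, not the actual reconstruction helper:

```javascript
// Hedged sketch: derive word character offsets by scanning the final decoded
// text in order, so spacing comes from the text itself rather than token joins.
function locateWords(finalText, wordStrings) {
  const located = [];
  let cursor = 0;
  for (const word of wordStrings) {
    const idx = finalText.indexOf(word, cursor);
    if (idx < 0) throw new Error(`word "${word}" not found in decoded text`);
    located.push({ word, start_char: idx, end_char: idx + word.length });
    cursor = idx + word.length;
  }
  return located;
}
```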

6. Registry + Model File Resolution

  • Added model, processor, and feature extractor exports and mappings for Conformer TDT.
  • Added dual-artifact model file handling for encoder_model and decoder_model_merged.

7. Follow-up Review Fixes

  • Moved Nemo adapter-specific assertions out of the shared ASR pipeline test file and into a dedicated Nemo pipeline adapter suite.
  • Fixed pending-prefix preservation when cursor snapping restarts inside the trailing sentence.
  • Hardened vocab handling and validation in word-offset reconstruction.
  • Added cache borrow/release handling so evicted borrowed tensors are disposed only after release.
  • Honored encoder_input_layout for canonical input_features feeds.
  • Raised the auto-window budget to match the minimum guaranteed cursor advance.
  • Kept borrowed cache-entry bytes counted until the final release.
  • Rejected tokenizer-less non-empty word-offset reconstruction instead of silently dropping detail.
  • Derived fallback vocab_size from the maximum tokenizer id so sparse vocabs do not undersize decoder logits.
  • Kept punctuation-only merge dedupe from collapsing distinct overlapping tokens.
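The sparse-vocab fallback can be sketched as follows: derive the size from the maximum tokenizer id + 1 rather than the entry count, so gaps in the id space do not undersize decoder logits. `getVocab` stands in for whatever vocab accessor the tokenizer exposes; this is an assumption, not the library API:

```javascript
// Minimal sketch of fallback vocab_size resolution for sparse vocabularies.
// Handles both Map-returning (WASM binding) and plain-object vocabs.
function resolveVocabSize(getVocab) {
  const vocab = getVocab();
  const ids = vocab instanceof Map ? [...vocab.values()] : Object.values(vocab);
  if (ids.length === 0) throw new Error("empty vocabulary");
  return Math.max(...ids) + 1; // max id, not entry count
}
```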

Regression Coverage

Added or updated tests in:

  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js

Coverage includes:

  • task-shaped pipeline outputs for default, sentence-chunk, and word-chunk modes
  • sentence-cursor long-audio windowing and retranscription
  • timestamp grouping and word-boundary reconstruction for punctuation and numeric tokens
  • encoder input layout handling for canonical feeds
  • sparse vocab fallback sizing
  • punctuation-only merge dedupe behavior
  • cache ownership, eviction, release, and accounting behavior
  • processor tensor lifetime behavior in the Nemo path

Upstream Sync Included

This branch was synced with upstream/main through commit f65a4c7c (merge commit 49a4af8f).

Relevant Nemo follow-up commits on top of that sync include:

  • ee819a1c fix(nemo-tdt): add supports() for ASR model class selection
  • 8dfccddc feat(nemo-tdt): align asr pipeline outputs and long-audio handling
  • f59ba068 feat(nemo-conformer-tdt): add sentence-based ASR pipeline chunking
  • 00b3d934 fix(nemo): scope ASR tests and address review fixes
  • 07118c38 fix(nemo-tdt): address follow-up review threads
  • 29f2baaf fix(nemo-tdt): handle sparse vocab and merge dedupe

Validation

Executed for this refresh:

  • node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
  • node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
  • pnpm build
  • npm run test:nemo:scientists

Scope Boundary

This PR stays focused on Nemo Conformer TDT integration and the follow-up work needed to:

  • align pipeline behavior with the shared ASR contract
  • improve long-audio handling in pipeline mode
  • improve word reconstruction and timestamp grouping
  • address targeted reviewer-reported Nemo correctness issues

Direct model.transcribe() remains the low-level API for advanced app-specific postprocessing.

ysdede added 30 commits March 1, 2026 16:56
…cache helpers

Carry over non-runtime typing fixes from the prior branch while intentionally excluding the WebGPU disable_prepacking workaround in session.js.

- Cast dynamic model.transcribe access for Nemo TDT pipeline method checks/calls.
- Cast Tensor data byteLength access in transducer cache utilities.
- Add explicit tuple/object JSDoc annotations in transducer timestamp builder.

This keeps main-based v4 work clean with latest ORT-Web on origin/main and avoids retaining the temporary encoder prepacking workaround.
- Replace legacy per-feature flags (return_token_timestamps,
  return_word_timestamps, return_utterance_timestamp) with a layered API:
  return_timestamps (utterance-level), return_words, return_tokens
- Merge duplicate outputs: words absorbs word_timestamps,
  tokens absorbs token_timestamps and token_ids
- Add per-token confidence, word-level confidence aggregation,
  utterance_confidence, and confidence_scores summary
- Gate frame confidences behind returnFrameConfidences flag
- Add return_metrics with encode/decode/total timing and RTF
- Add debug flags: returnFrameIndices, returnLogProbs, returnTdtSteps
- Fix vocab Map handling in getIdToTokenMap and _resolveVocabSize
  (tokenizer.get_vocab() returns Map in WASM binding)
- Update ASR pipeline to wire timestamp_granularity to new model flags
- Format all changed files with Prettier per CONTRIBUTING.md
…ipeline

- Add roundTs() for millisecond-precision timestamp rounding at source

- Round all confidence averages to 6 decimal places

- Round per-token and per-word confidence values

- Remove timestamp_granularity and formatting helpers from pipeline

- Pipeline returns model.transcribe() output directly

- Auto-enable return_words and return_metrics when return_timestamps is true
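The rounding described in this commit amounts to fixed-precision rounding at the source. A minimal sketch (helper names are hypothetical):

```javascript
// Millisecond-precision timestamp rounding and 6-decimal confidence rounding.
const roundTs = (seconds) => Math.round(seconds * 1000) / 1000;
const roundConf = (p) => Math.round(p * 1e6) / 1e6;
```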
…imestamps, honor return_metrics kwarg

- modeling_nemo_conformer_tdt: dispose logits and new decoder state tensors
  before throwing when logitsData.length < vocabSize to prevent resource leak
- modeling_nemo_conformer_tdt: move returnFrameConfidences output block outside
  the return_timestamps guard so frame/frame_avg are emitted independently
- automatic-speech-recognition: change return_metrics from hardcoded true to
  kwargs.return_metrics ?? false to respect user intent and avoid overhead
- Accept upstream restructuring: SUPPORTED_TASKS and pipeline imports moved
  from pipelines.js to pipelines/index.js
- Migrate NemoConformerForTDT registration to pipelines/index.js accordingly
- Add MODEL_TYPES.NemoConformerTDT (id=16) to modeling_utils
- Register NemoConformerForTDT in MODEL_TYPE_MAPPING, MODEL_NAME_TO_CLASS_MAPPING,
  and MODEL_CLASS_TO_NAME_MAPPING so the base class from_pretrained, ModelRegistry,
  and is_pipeline_cached all recognise the model correctly
- Add NemoConformerTDT case to get_model_files so progress_callback receives
  accurate file size totals for encoder_model.onnx + decoder_model_merged.onnx
Standardizes internal logging to follow the upstream convention introduced
in ModelRegistry refactor.
- Guard feature extractor against empty/short audio (NaN prevention)

- Move decoder tensor init inside try block for safe disposal

- Add architecture key to MODEL_TYPE_MAPPING

- Add input validation in buildTransducerDetailedOutputs

- Harden audio cache hash against NaN samples

- Add order validation in computeTemporalDeltas

- Restore pipeline: return_timestamps truthy => words + metrics always on
- Remove all timestamp_granularity tests (feature was removed)

- Fix option names: return_tokens, return_words, return_timestamps

- Fix output fields: tokens/words arrays, not token_ids/word_timestamps

- Verify pipeline passes return_words + return_metrics when timestamps on

- Add test: return_timestamps 'word' treated as truthy
Address reviewer findings except the return_metrics policy decision.

- Fix temporal delta concatenation to interleave per frame and add dtype validation.
- Validate preemphasis range and clamp normalization variance in feature extraction.
- Remove unsafe encoder layout inference; require explicit encoder_output_layout.
- Redesign decode loop to read frame data on-demand instead of eager frame materialization.
- Deduplicate word finalization and avoid zero-filling missing word confidences.
- Tighten tests for delta layout/type checks, explicit layout requirement, call counts, and naming accuracy.
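The interleaving fix above can be illustrated with a sketch: delta features must be concatenated feature-wise within each frame ([T, F] + [T, F] → [T, 2F]), not appended as a second [T, F] block after all base frames. A simple first-order difference stands in for the actual delta filter; the function name is illustrative:

```javascript
// Sketch of per-frame delta concatenation over a flat [T * F] feature buffer.
function concatDeltasPerFrame(features, T, F) {
  const out = new Float32Array(T * F * 2);
  for (let t = 0; t < T; t++) {
    const prev = t > 0 ? t - 1 : 0; // clamp at the first frame
    for (let f = 0; f < F; f++) {
      out[t * F * 2 + f] = features[t * F + f]; // base features first
      out[t * F * 2 + F + f] = features[t * F + f] - features[prev * F + f]; // then deltas
    }
  }
  return out;
}
```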
Fixes high-impact issues found in PR review validation:

- force NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT in registry overrides
- ensure encoder outputs are disposed when pre-decode validation throws
- remove stride sampling from audio cache key hashing to prevent false cache hits
- use encoder_model selector key in get_model_files for Nemo per-component dtype/device overrides

Also adds targeted regression tests for mapping, disposal behavior, file selection, and cache key correctness.
- Clamp token end timestamps to encoder frame bounds during TDT decoding.
- Validate FeatureLRUCache constructor limits to fail fast on invalid settings.
- Add regression tests for timestamp clamping and cache limit validation.
Dispose intermediate tensors in computeTemporalDeltas concatenate paths and dispose replaced base input features when delta concatenation returns a new tensor.

Add regression tests that assert disposal behavior for delta concatenate flows and feature extractor reassignment.
Dispose non-essential Tensor outputs returned by decoder steps to prevent cumulative memory growth. Keep logits/state tensors alive for decoding/state transitions and dispose extras immediately.

Add regression test to assert auxiliary decoder tensor outputs are disposed each step.
Compute encoder length directly from attention_mask.data instead of attention_mask.tolist() to avoid large transient array allocations in ASR decode hot path.
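The hot-path change above boils down to summing the mask's typed-array data in place instead of materializing a nested JS array. A minimal sketch (function name hypothetical):

```javascript
// Sum the attention mask's backing typed array directly; a binary mask's
// sum equals the number of valid encoder frames.
function encoderLengthFromMask(maskData) {
  let len = 0;
  for (let i = 0; i < maskData.length; i++) len += maskData[i];
  return len;
}
```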
Fail fast when duration logits are required but missing in decoder output, and enforce positive-integer vocab size at runtime config validation.

Validate prepared Nemo pipeline audio for non-empty finite samples before processor/model calls.

Add regression tests for missing duration logits and non-finite audio rejection.
Fix placeholder interpolation in _prepare_model_inputs error text.

Add fail-fast validation for Nemo delta_window and reject duplicate decoder output aliases in transducer io config.

Add regression tests for delta_window validation and duplicate decoder output alias rejection.
Validate transcribe timeOffset as finite and guard encoderOutputs cleanup path to avoid masking primary failures.

Align transducer_text JSDoc token type with runtime shape (include id).

Harden Parakeet feature extractor test by using direct mask data and explicit tensor disposal via try/finally; add timeOffset validation regression test.
- fail fast on missing decoder state outputs and invalid encoder layout enums
- make FeatureLRUCache own cached tensor lifetimes (replace/evict/clear) with deduped disposal and deterministic size fallback
- validate n_fft/win_length in Nemo feature extractor
- align Nemo ASR pipeline docs with actual forwarded options
- add regression coverage for runtime config validation, non-concatenated deltas/cache behavior, missing decoder state outputs, and cache disposal semantics

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
Apply Gemini review nit in Nemo decode loop by replacing a redundant duration expression with Math.max(1, step).

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
Checklist (bot comment IDs):
- [x] 2892132356: guard tokenizer.get_vocab() return type before Object.keys in _resolveVocabSize.
- [x] 2892132367: treat zero cache limits as explicit no-cache mode; do not store/dispose just-produced values.
- [x] 2892132372: dispose processor tensors in Nemo ASR pipeline when cache does not own lifetimes.

Added regression tests for vocab resolution fallback, zero-limit cache semantics, and Nemo pipeline tensor ownership behavior.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
- widen confidenceFromLogits input type to Tensor data arrays

- narrow feature_cache access with explicit typed cast in ASR pipeline
Checklist (bot comment IDs):
- [x] 2892287484: handle array-returning tokenizer vocab in _resolveVocabSize.
- [x] 2892322884: avoid disposing when re-setting the same object for an existing cache key.
- [x] 2892322906: skip caching oversized values to prevent insert-then-dispose of caller-owned tensors.
- [x] 2892322910: guard byteLength type in estimateSizeBytes.

Added regression tests for array vocab sizing, same-object set behavior, oversized value skipping, and non-numeric byteLength handling.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is an impressive and extensive pull request that adds end-to-end support for Nemo Conformer TDT models. The changes are well-structured, introducing new modules for feature extraction, model implementation, pipeline logic, and various utilities. The code demonstrates a strong focus on robustness, with thorough configuration validation, error handling, and careful memory management of tensors. The addition of comprehensive unit tests for the new components is also a significant strength. My review includes one suggestion to enhance type safety, which will improve the long-term maintainability of this new functionality.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4


📥 Commits

Reviewing files that changed from the base of the PR and between f65a4c7 and f59ba06.

📒 Files selected for processing (21)
  • packages/transformers/src/models/feature_extractors.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/processors.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
📜 Review details
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.

Applied to files:

  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T22:59:55.984Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T22:59:55.984Z
Learning: When a model subclass overrides from_pretrained and bypasses the generic model-type branch, do not introduce or rely on a MODEL_TYPES entry for that subclass in the model registry. For this NemoConformerTDT case, in packages/transformers/src/utils/model_registry/get_model_files.js, avoid adding a NemoConformerTDT entry in the model type map; rely on the override logic in modeling_nemo_conformer_tdt.js. This keeps the registry explicit to the actual file and prevents unintended dispatch through the generic branch.

Applied to files:

  • packages/transformers/src/utils/model_registry/get_model_files.js
🔇 Additional comments (42)
packages/transformers/src/utils/model_registry/get_model_files.js (1)

180-183: LGTM: NemoConformerTDT file resolution is correctly configured.

The branch correctly loads encoder_model and decoder_model_merged artifacts without generation_config.json, matching the non-generative ASR model pattern. Placement between Chatterbox and AutoEncoder is logical.

packages/transformers/src/models/modeling_utils.js (2)

121-122: LGTM: MODEL_TYPES enum extension is correct.

Value 16 is unique and sequential. No corresponding MODEL_TYPE_CONFIG entry is needed since NemoConformerForTDT overrides from_pretrained and handles session construction directly. Based on learnings, this is the intended design.


880-883: LGTM: Error message refactored to template literals.

Semantically equivalent change; cleaner string construction.

packages/transformers/src/models/feature_extractors.js (1)

8-8: LGTM: Feature extractor re-export follows established pattern.

Alphabetical ordering maintained between moonshine and parakeet.

packages/transformers/src/models/registry.js (2)

44-44: LGTM: Encoder-only mapping entry enables AutoModel discovery.


584-587: LGTM: CUSTOM_MAPPING override correctly sets NemoConformerTDT model type.

The dual-registration approach is correct: the encoder-only mapping provides AutoModel lookup while CUSTOM_MAPPING ensures the correct model type for two-artifact loading. The accompanying comment adequately explains the rationale.

packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js (1)

1-30: LGTM: Word deduplication logic is correct.

The normalization handles punctuation stripping and NFKC normalization appropriately. The deduplication correctly identifies overlapping adjacent words by normalized text and retains the longer-duration instance. Non-overlapping repeated words are correctly preserved.
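
The rule described above can be sketched as follows; the function names and the exact normalization regex are illustrative, not the actual exports of `transducer_window_merge.js`:

```javascript
// Hypothetical sketch of the dedup rule: strip punctuation, NFKC-normalize,
// then drop overlapping adjacent duplicates, keeping the longer-duration one.
function normalizeWord(text) {
  return text
    .normalize("NFKC")
    .toLowerCase()
    .replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, "");
}

function dedupeOverlap(words) {
  // words: [{ text, start, end }] sorted by start time.
  const out = [];
  for (const w of words) {
    const prev = out[out.length - 1];
    const overlaps = prev && w.start < prev.end;
    if (overlaps && normalizeWord(prev.text) === normalizeWord(w.text)) {
      // Keep the longer-duration instance of the duplicated word.
      if (w.end - w.start > prev.end - prev.start) out[out.length - 1] = w;
    } else {
      out.push(w); // non-overlapping repeats are preserved
    }
  }
  return out;
}
```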

packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js (2)

9-70: LGTM: Delta computation implementation is correct.

Input validation is comprehensive. The standard delta formula is correctly implemented with proper edge handling for boundary frames. Memory management is sound: intermediate tensors are properly disposed when concatenating. The recursive call for order=2 delta-delta computation is clean.


72-91: LGTM: Frame interleaving helper is correct.

Length validation prevents mismatched arrays. The interleaving logic correctly produces [T, F*N] output by copying each item's frame segment sequentially.
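
A minimal sketch of that interleaving, assuming flat row-major `Float32Array` inputs (names and signature are illustrative):

```javascript
// Per-frame concatenation of N feature blocks of width F into a [T, F*N]
// row-major buffer: frame t of item n lands in columns [n*F, (n+1)*F).
function interleaveFrames(items, T, F) {
  const N = items.length;
  const out = new Float32Array(T * F * N);
  for (let t = 0; t < T; ++t) {
    for (let n = 0; n < N; ++n) {
      out.set(items[n].subarray(t * F, (t + 1) * F), t * F * N + n * F);
    }
  }
  return out;
}
```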

packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (3)

9-25: LGTM: Audio cache key generation is deterministic and collision-resistant.

FNV-1a hash over quantized 16-bit samples with sampling_rate and length ensures stable cross-runtime keys. Non-finite value handling at line 19 prevents NaN propagation.
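
An illustrative sketch of that key scheme follows; the constants are standard 32-bit FNV-1a, but the key layout and function name are assumptions, not the exact `createAudioCacheKey` output:

```javascript
// FNV-1a over 16-bit-quantized samples, with sampling rate and length folded
// into the key. Non-finite samples are coerced to 0 to avoid NaN propagation.
function audioCacheKey(audio, samplingRate) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < audio.length; ++i) {
    const s = Number.isFinite(audio[i]) ? audio[i] : 0;
    const q = Math.max(-32768, Math.min(32767, Math.round(s * 32767))) & 0xffff;
    hash ^= q & 0xff;
    hash = Math.imul(hash, 0x01000193); // FNV prime
    hash ^= (q >>> 8) & 0xff;
    hash = Math.imul(hash, 0x01000193);
  }
  return `${samplingRate}:${audio.length}:${hash >>> 0}`;
}
```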


31-139: LGTM: FeatureLRUCache implementation is correct with proper ownership semantics.

The cache correctly handles:

  • No-cache mode (max_entries=0 or max_size_mb=0)
  • Identical value refresh without disposal
  • Oversized value rejection
  • Existing value replacement with proper disposal
  • LRU eviction with accurate size tracking
  • Return value indicating retained ownership (line 108)

The _evict loop has proper termination via the undefined key check at line 132.


141-208: LGTM: Tensor lifecycle helpers are correctly scoped.

collectCachedTensors handles the known value structure (input_features, attention_mask, delta_features, delta_delta_features). tensorByteSize has appropriate fallbacks when byteLength is unavailable. Size estimation handles both raw Tensors and structured cache values.

packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js (5)

27-93: LGTM: Constructor validation and initialization are comprehensive.

All configuration parameters are validated with clear error messages. The centered Hann window construction and delta configuration are correctly handled. The warning for non-concatenated delta mode appropriately alerts users to the separate output fields.


100-136: LGTM: Feature bank extraction correctly applies preemphasis and centered STFT.

Cloning the waveform at line 108 prevents destructive modification of the caller's data. Preemphasis applied in reverse order (lines 110-112) is correct for in-place computation. Spectrogram parameters match NeMo conventions.
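
Why reverse order matters can be shown in a few lines (a minimal sketch, not the extractor's actual code):

```javascript
// In-place preemphasis: iterating in reverse lets y[t] = x[t] - c * x[t-1]
// overwrite the buffer while x[t-1] still holds its original value.
function preemphasisInPlace(waveform, coeff = 0.97) {
  for (let t = waveform.length - 1; t > 0; --t) {
    waveform[t] -= coeff * waveform[t - 1];
  }
  return waveform; // waveform[0] is left unchanged
}
```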


150-166: LGTM: Caching logic correctly tracks tensor ownership.

Cache key construction includes delta parameters ensuring cache isolation for different configurations. The cacheOwnsTensors return value from feature_cache.set() correctly drives ownership tagging, allowing pipelines to determine disposal responsibility.


168-241: LGTM: Extraction pipeline is correct with proper memory management.

Key observations:

  • The raw_features_length formula at lines 172-174 intentionally excludes the terminal padded frame (per learnings).
  • Per-feature normalization uses sample variance (N-1 divisor), which is appropriate.
  • Empty/short audio handling correctly skips normalization to avoid divide-by-zero.
  • unsqueeze_(0) at line 214 creates the [1, T, F] shape required by computeTemporalDeltas.
  • When delta concatenation produces a new Tensor, the original input_features is properly disposed at line 229.
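
The frame-count convention in the first bullet can be checked numerically; with a centered STFT (pad of `floor(n_fft / 2)` on each side), the extractor's formula yields exactly one frame fewer than the spectrogram's internal count, dropping the terminal fully-padded frame. The values below are illustrative:

```javascript
// Extractor-side frame count (excludes the terminal padded frame).
function rawFeaturesLength(audioLength, n_fft, hop_length) {
  return Math.floor((audioLength + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length);
}

// Spectrogram-internal frame count over the center-padded signal.
function spectrogramFrames(audioLength, n_fft, hop_length) {
  const padded = audioLength + Math.floor(n_fft / 2) * 2;
  return 1 + Math.floor((padded - n_fft) / hop_length);
}
```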

243-249: LGTM: Cache control methods are correctly implemented.

Direct delegation to the underlying cache with null safety.

packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js (8)

27-83: Utility functions are correct.

The argmax implementation handles offset/length properly. The confidenceFromLogits uses the numerically stable log-sum-exp trick for softmax computation. The toInt handles bigint conversion from ONNX int64 outputs.
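
The log-sum-exp trick mentioned above can be sketched as follows (the function name mirrors the review comment, but this body is an illustrative reconstruction, not the module's code):

```javascript
// Softmax probability of the argmax token, computed stably: subtracting the
// max before exponentiating keeps exp() from overflowing, and
// softmax(argmax) = exp(max - max) / sum = 1 / sum of shifted exponentials.
function confidenceFromLogits(logits) {
  let max = -Infinity;
  for (const v of logits) if (v > max) max = v;
  let sumExp = 0;
  for (const v of logits) sumExp += Math.exp(v - max);
  return 1 / sumExp;
}
```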


85-235: Configuration validation is thorough.

The validation covers:

  • Decoder config (num_layers, hidden_size) integrity
  • I/O name uniqueness constraints
  • Session existence and expected I/O names
  • Layout and dtype validations

The error messages provide actionable guidance for model exporters. Approved.


296-316: Session construction aligns with registry contract.

The explicit keys encoder_model and decoder_model_merged match the get_model_files.js branch for MODEL_TYPES.NemoConformerTDT. The error wrapping provides context when session loading fails.


607-616: Encoder feed disposal is correct.

The finally block disposes transposed/length tensors created in _buildEncoderFeeds regardless of encoder success or failure.


686-729: Decoder output validation with proper disposal on error.

The code validates logits, outputState1, and outputState2 presence, disposing allocated resources before throwing. The seenDecoderTensors set prevents double-dispose of aliased outputs.


781-821: State management and frame advancement logic is correct.

When emitting a token (non-blank), the old decoder state is disposed while keeping the new state. When blank, the new state is disposed to reuse the existing state. Frame advancement respects TDT duration semantics: step > 0 advances by step frames; blank or emittedOnFrame >= maxSymbolsPerStep advances by 1 frame.
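
The advancement rule summarized above can be sketched in isolation; the function and field names here are illustrative stand-ins for the decode loop's actual state:

```javascript
// TDT frame advancement: a positive duration-head step jumps that many frames;
// a blank, or hitting the per-frame symbol cap, forces a single-frame advance;
// otherwise the decoder stays to emit another symbol on the same frame.
function nextFrame(frame, { step, isBlank, emittedOnFrame, maxSymbolsPerStep }) {
  if (step > 0) return frame + step;
  if (isBlank || emittedOnFrame >= maxSymbolsPerStep) return frame + 1;
  return frame;
}
```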


823-835: Finally block ensures complete resource cleanup.

All allocated tensors (targetLengthTensor, decoderState, encoderOutputs) are disposed. The seen set in encoder output disposal handles potential tensor aliasing in session outputs.


948-955: Registry mappings are consistent.

Both the model_type key (nemo-conformer-tdt) and architecture key (NemoConformerForTDT) are registered, aligning with the CUSTOM_MAPPING in registry.js (see context snippet 1).

packages/transformers/src/models/models.js (1)

106-106: Re-export correctly placed.

The export is alphabetically ordered with existing model exports.

packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js (4)

1-22: Sentence boundary constants are well-defined.

The regex patterns and non-breaking period set cover common ASR edge cases (acronyms, honorifics, enumerations). The 3-second fallback gap threshold is reasonable for natural speech pauses.


28-42: Word joining handles punctuation correctly.

Punctuation-only tokens are appended without space, while other tokens get a leading space. This produces natural text output from ASR word sequences.
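
A minimal sketch of that joining rule, assuming string tokens (the name is illustrative):

```javascript
// Punctuation-only tokens attach without a leading space; every other token
// after the first gets exactly one.
function joinWords(words) {
  let text = "";
  for (const w of words) {
    const punctOnly = /^\p{P}+$/u.test(w);
    text += punctOnly || text.length === 0 ? w : " " + w;
  }
  return text;
}
```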


81-116: Sentence boundary heuristic is conservative by design.

The function favors under-segmentation:

  1. Strong endings (!?…) always break
  2. Periods require both non-breaking word exclusion AND capitalized next word
  3. Large gaps force breaks regardless of punctuation

This prevents false positives on abbreviations and enumerations.
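
The three rules above can be sketched as a single predicate; the word set, regexes, and 3-second threshold below are simplified stand-ins for the module's actual constants:

```javascript
const STRONG_ENDINGS = /[!?…]$/;
const NON_BREAKING = new Set(["mr.", "mrs.", "dr.", "e.g.", "i.e.", "no."]);
const FALLBACK_GAP_S = 3.0;

function isSentenceBoundary(word, nextWord, gapSeconds) {
  if (gapSeconds >= FALLBACK_GAP_S) return true;             // rule 3: long pause always breaks
  if (STRONG_ENDINGS.test(word)) return true;                // rule 1: ! ? … always break
  if (word.endsWith(".")) {                                  // rule 2: periods are conditional
    if (NON_BREAKING.has(word.toLowerCase())) return false;  // abbreviations never break
    return /^[A-Z]/.test(nextWord);                          // require a capitalized next word
  }
  return false;
}
```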


166-178: Empty words fallback is correct.

When words is empty but utteranceTimestamp exists, a single chunk with the provided text is returned. When both are absent, an empty array is returned. This aligns with call sites in pipeline_nemo_conformer_tdt.js where utteranceTimestamp is null when words are empty (context snippet 1).

packages/transformers/src/models/processors.js (1)

11-11: Re-export correctly placed.

packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js (5)

130-189: Test utilities are well-designed.

withNemoTensorOwnership correctly sets the non-enumerable ownership flag used by the pipeline for tensor disposal decisions. makeUnitPipe provides a clean mock that simulates the model.transcribe contract with configurable outputs.


191-236: API contract tests verify pipeline behavior.

Tests confirm:

  • return_timestamps: false{ text }
  • return_timestamps: true{ text, chunks } with sentence segments
  • return_timestamps: 'word'{ text, chunks } with word-level timestamps

The mock call assertions verify correct option forwarding.


278-382: Windowed transcription test validates timeOffset propagation.

The test verifies that explicit chunk_length_s triggers sentence-windowed processing with correct timeOffset values at window boundaries. The mock outputs simulate realistic cross-boundary word overlaps.


384-466: Boundary retranscription test validates sentence replacement.

When the first window's last sentence is incomplete, the second window provides the complete version. The test verifies the merged output contains the longer, correctly-ended sentence.


817-896: Tensor disposal tests validate ownership semantics.

Three scenarios are tested:

  1. cacheOwnsTensors=false → pipeline disposes tensors
  2. cacheOwnsTensors=true → pipeline preserves tensors (cache owns them)
  3. Cache limits disable caching → pipeline disposes tensors

These tests ensure no memory leaks under different cache configurations.

packages/transformers/src/pipelines/index.js (2)

33-33: Import is correct.

The import path matches the module location.


152-159: Model array registration is correct.

Position in the array does not affect priority—model selection uses cls.supports(model_type) matching (per context snippet from pipelines.js:196-211). NemoConformerForTDT.supports() returns true only for nemo-conformer-tdt model type.

packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js (1)

8-18: Processor override is intentional.

The _call method explicitly delegates to feature_extractor only, bypassing the base class iteration logic (context snippet: processing_utils.js:117-124). This is correct for audio-only preprocessing where tokenization is handled separately in the transcription flow.

packages/transformers/src/pipelines/automatic-speech-recognition.js (3)

6-8: Import is correct.

Imports the pipeline runner from the Nemo TDT module.


158-159: Model type routing is correct.

Routes nemo-conformer-tdt model type to the dedicated handler method.


308-324: Pipeline integration is clean.

The method delegates to runNemoConformerTDTPipeline with all required context: model, processor, tokenizer, audio, kwargs, and the prepareAudios utility. This keeps the pipeline surface aligned with the shared ASR task contract while rich Nemo-specific outputs remain available via direct model.transcribe() calls.

Repository owner deleted a comment from gemini-code-assist bot Mar 8, 2026
@ysdede ysdede force-pushed the v4-nemo-conformer-tdt-main-r3 branch from 427fb7f to 00b3d93 on March 8, 2026 at 20:51
Keep the shared ASR pipeline suite focused on the public Nemo contract and move adapter-specific windowing, retranscription, cache-ownership, and disposal coverage into a dedicated Nemo pipeline test file.

Narrow the source diff by removing explanatory Nemo comments and reverting unrelated upstream-only tweaks, while also fixing the review findings around cursor snap-forward merging, tokenizer vocab-shape handling, empty timestamp validation, and cache borrow/release semantics for active inference.

Verification:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Repository owner deleted a comment from kilo-code-bot bot Mar 8, 2026
@ysdede
Owner Author

ysdede commented Mar 8, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Mar 8, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1817e370-5686-496d-8150-70b8d6dcce10

📥 Commits

Reviewing files that changed from the base of the PR and between f59ba06 and 427fb7f.

📒 Files selected for processing (11)
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
📜 Review details
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.

Applied to files:

  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.

Applied to files:

  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T23:43:18.376Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.

Applied to files:

  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.

Applied to files:

  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.

Applied to files:

  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.

Applied to files:

  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
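A small sketch of the shape bookkeeping the learning refers to (the dims values here are illustrative, not taken from the real extractor): `batch` from the destructure feeds every output dims array, including the stacked delta and delta-delta variants.

```javascript
// Illustrative input shape; real feature dims will differ.
const input_features = { dims: [2, 100, 80] };
const [batch, T, F] = input_features.dims;

const baseDims = [batch, T, F];       // features only
const deltaDims = [batch, T, F * 2];  // features + deltas
const tripleDims = [batch, T, F * 3]; // features + deltas + delta-deltas
console.log(baseDims.join("x"), deltaDims.join("x"), tripleDims.join("x"));
```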
📚 Learning: 2026-03-08T20:49:38.204Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.

Applied to files:

  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
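A hypothetical sketch of the test shape this learning mandates — the `extractor` body below is a stand-in, not the real class. Because range validation runs inside the async call path, the failure surfaces as a rejected promise rather than a constructor throw:

```javascript
// Stand-in for the real extractor: validation deferred to the call path.
const extractor = async (audio) => {
  const preemphasis = 1.5; // out-of-range config value, for illustration
  if (preemphasis < 0 || preemphasis > 1) {
    throw new Error(`preemphasis must be in [0, 1], got ${preemphasis}`);
  }
  return { input_features: null };
};

// Jest form per the learning:
//   await expect(extractor(audio)).rejects.toThrow("preemphasis");
// Plain-async equivalent:
extractor(new Float32Array(160)).catch((err) => {
  console.log(err.message.includes("preemphasis")); // true
});
```

A synchronous `expect(() => new Extractor(...)).toThrow(...)` assertion would pass vacuously here, which is why the learning forbids it.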
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
🔇 Additional comments (4)
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js (1)

12-25: ⚠️ Potential issue | 🟠 Major


Expose a real release API for cache-owned feature outputs.

_call() borrows cache entries via acquire(), but the returned object only carries the release hook on a hidden symbol. The pipeline knows that contract; direct NemoConformerTDTFeatureExtractor callers do not, so cache hits can leave entries permanently borrowed and undisposable.

Proposed fix
 function tagNemoFeatureOutputOwnership(value, cacheOwnsTensors, release = null) {
     Object.defineProperty(value, NEMO_FEATURE_OUTPUT_OWNERSHIP, {
         value: cacheOwnsTensors,
         enumerable: false,
         configurable: true,
@@
     if (release) {
         Object.defineProperty(value, NEMO_FEATURE_OUTPUT_RELEASE, {
             value: release,
             enumerable: false,
             configurable: true,
         });
+        Object.defineProperty(value, 'release', {
+            value: release,
+            enumerable: false,
+            configurable: true,
+        });
+        Object.defineProperty(value, 'dispose', {
+            value: release,
+            enumerable: false,
+            configurable: true,
+        });
     }
     return value;
 }

Also applies to: 161-179
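If the proposed fix lands, direct callers could adopt a plain try/finally pattern without knowing the hidden-symbol contract. A hypothetical sketch (the fake extractor below is a mock for illustration, not the real API):

```javascript
// Mock extractor standing in for NemoConformerTDTFeatureExtractor on a
// cache hit: the output carries a plain `release` property per the fix.
let released = false;
const fakeExtractor = async () => ({
  input_features: { dims: [1, 10, 80] },
  release: () => { released = true; },
});

// Caller-side pattern: use the borrowed features, then return the loan.
async function extractOnce(extractor, audio) {
  const out = await extractor(audio);
  try {
    return out.input_features.dims;
  } finally {
    out.release?.(); // optional chaining: no-op for non-cached outputs
  }
}

extractOnce(fakeExtractor, new Float32Array(160)).then((dims) => {
  console.log(dims.join(","), released); // 1,10,80 true
});
```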

⛔ Skipped due to learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (1)

52-56: ⚠️ Potential issue | 🟠 Major

get() returns cache-owned tensors with no lifetime protection.

get() hands out the raw cached value without incrementing borrowers. The next eviction or clear() can dispose that tensor while the caller still holds a reference, which makes this public API unsafe for tensor-backed entries.

Proposed fix
     /**
      * @param {string} key
-     * @returns {any|null}
+     * @returns {{ value: any, release: () => void } | null}
      */
     get(key) {
-        const entry = this._touch(key);
-        if (!entry) return null;
-        return entry.value;
+        return this.acquire(key);
     }
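The borrower-counting contract the fix routes get() through can be sketched with a minimal mock (this is not the real FeatureLRUCache, just an illustration of the acquire/release semantics):

```javascript
// Minimal mock of acquire-style borrowing: disposal is deferred while an
// entry has outstanding borrowers, so callers never hold a freed tensor.
class MockCache {
  constructor() { this._map = new Map(); }
  set(key, value) {
    this._map.set(key, { value, borrowers: 0, disposed: false });
  }
  get(key) { // acquire-style: returns { value, release } | null
    const entry = this._map.get(key);
    if (!entry) return null;
    entry.borrowers += 1;
    return { value: entry.value, release: () => { entry.borrowers -= 1; } };
  }
  evict(key) { // skip disposal while borrowed; report whether it happened
    const entry = this._map.get(key);
    if (entry && entry.borrowers === 0) {
      entry.disposed = true;
      this._map.delete(key);
    }
    return entry ? entry.disposed : false;
  }
}

const cache = new MockCache();
cache.set("k", [1, 2, 3]);
const handle = cache.get("k");
console.log(cache.evict("k")); // false: still borrowed, not disposed
handle.release();
console.log(cache.evict("k")); // true: safe to dispose now
```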
⛔ Skipped due to learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
packages/transformers/src/pipelines/automatic-speech-recognition.js (1)

6-8: Nemo routing is cleanly isolated.

Keeping the switch thin and delegating the model-specific path to runNemoConformerTDTPipeline preserves the shared ASR flow without duplicating audio-preparation logic.

Also applies to: 158-159, 308-317

packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js (1)

643-704: The tensor-lifecycle coverage is solid.

Exercising both dispose-owned and release-owned paths is the right guardrail for this adapter boundary.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00b3d9346b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (4)
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (1)

17-24: ⚠️ Potential issue | 🟠 Major

Hash the exact sample buffer, not lossy int16 bins.

This key aliases distinct Float32Array/Float64Array waveforms that round into the same 16-bit values, so the feature cache can replay stale tensors for different audio. The full scan is fine; the lossy quantization is the bug.

Proposed fix
 export function createAudioCacheKey(audio, sampling_rate = 16000) {
-    // FNV-1a 32-bit over quantized values for deterministic cross-runtime keys.
+    // FNV-1a 32-bit over the exact sample bytes.
     let hash = 2166136261;
     hash ^= audio.length;
     hash = Math.imul(hash, 16777619);
     hash ^= sampling_rate;
     hash = Math.imul(hash, 16777619);
-
-    // Hash all quantized samples to minimize false cache hits across waveforms.
-    for (let i = 0; i < audio.length; ++i) {
-        const sample = Number.isFinite(audio[i]) ? audio[i] : 0;
-        const q = Math.max(-32768, Math.min(32767, Math.round(sample * 32768)));
-        hash ^= q;
+    const bytes = new Uint8Array(audio.buffer, audio.byteOffset, audio.byteLength);
+    hash ^= bytes.length;
+    hash = Math.imul(hash, 16777619);
+    for (let i = 0; i < bytes.length; ++i) {
+        hash ^= bytes[i];
         hash = Math.imul(hash, 16777619);
     }
-    return `${sampling_rate}:${audio.length}:${(hash >>> 0).toString(16)}`;
+    return `${sampling_rate}:${audio.constructor.name}:${audio.length}:${(hash >>> 0).toString(16)}`;
 }
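The aliasing the comment describes can be demonstrated directly: two distinct Float32 samples that fall into the same int16 bin are indistinguishable to a quantized hash, while a byte-level hash over the raw buffer separates them. A small self-contained check (the sample values are arbitrary, chosen only to land in one bin):

```javascript
// Two distinct float32 waDveform samples, ~1e-6 apart (well above float32
// ulp near 0.4, so they are genuinely different values in memory).
const a = Float32Array.of(0.400001);
const b = Float32Array.of(0.400002);

// The quantization step from the old key: both collapse to the same bin.
const toInt16 = (x) =>
  Math.max(-32768, Math.min(32767, Math.round(x * 32768)));
console.log(toInt16(a[0]) === toInt16(b[0])); // true: same quantized value

// Byte-level view, as in the proposed fix: the buffers differ.
const bytesA = new Uint8Array(a.buffer);
const bytesB = new Uint8Array(b.buffer);
console.log(bytesA.every((v, i) => v === bytesB[i])); // false: bytes differ
```

Any hash over the quantized values returns the same key for both waveforms, which is exactly the stale-tensor replay the comment warns about.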
packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js (1)

168-172: ⚠️ Potential issue | 🟠 Major

Fix the Nemo returnMetrics assertions.

These expectations lock in the wrong adapter contract. Once the implementation is corrected, they will fail or pressure the pipeline back to the wrong behavior.

Proposed fix
         expect(calls[0]).toMatchObject({
           returnTimestamps: false,
           returnWords: false,
-          returnMetrics: false,
+          returnMetrics: true,
         });
@@
         expect(calls[0]).toMatchObject({
           returnTimestamps: true,
           returnWords: true,
-          returnMetrics: false,
+          returnMetrics: true,
         });
@@
         expect(calls[0]).toMatchObject({
           returnTimestamps: true,
           returnWords: true,
-          returnMetrics: false,
+          returnMetrics: true,
         });
Based on learnings, `packages/transformers/src/pipelines/automatic-speech-recognition.js` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline.

Also applies to: 181-185, 197-200

packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js (1)

158-160: ⚠️ Potential issue | 🟠 Major

Update these adapter-call expectations to returnMetrics: true.

These assertions are validating the wrong Nemo pipeline contract.

Proposed fix
-      expect(calls[0]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 0 });
-      expect(calls[1]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 19.95 });
-      expect(calls[2]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 37.9 });
+      expect(calls[0]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 0 });
+      expect(calls[1]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 19.95 });
+      expect(calls[2]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 37.9 });
@@
         expect(call).toMatchObject({
           returnTimestamps: true,
           returnWords: true,
-          returnMetrics: false,
+          returnMetrics: true,
         });
Based on learnings, `packages/transformers/src/pipelines/automatic-speech-recognition.js` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline.

Also applies to: 527-531

packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js (1)

173-179: ⚠️ Potential issue | 🟠 Major

Keep returnMetrics enabled in both Nemo transcribe paths.

The adapter contract intentionally asks model.transcribe() for metrics on every pipeline call. Hardcoding false here changes that contract for both short-audio and sentence-windowed runs.

Proposed fix
         const output = await runNemoTranscribe(windowAudio, {
             tokenizer,
             returnTimestamps: true,
             returnWords: true,
-            returnMetrics: false,
+            returnMetrics: true,
             timeOffset: start_s,
         });
@@
         const output = await runNemoTranscribe(aud, {
             tokenizer,
             returnTimestamps: wantTimestampChunks,
             returnWords: wantTimestampChunks,
-            returnMetrics: false,
+            returnMetrics: true,
         });
Based on learnings, `packages/transformers/src/pipelines/automatic-speech-recognition.js` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline.

Also applies to: 325-330


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cf06aa60-7fe0-40ed-9021-41c25cb59ca6

📥 Commits

Reviewing files that changed from the base of the PR and between 427fb7f and 00b3d93.

📒 Files selected for processing (11)
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
📜 Review details
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
📚 Learning: 2026-03-05T23:43:18.376Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-08T20:49:38.204Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
🔇 Additional comments (3)
packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js (2)

571-650: The decoder lifetime coverage is strong.

These cases pin down encoder failure cleanup and per-step auxiliary-output disposal, which are the easiest tensor-lifetime regressions to miss in transcribe().


981-1074: The cache ownership contract is well covered.

Replacement, eviction, borrowed-entry deferral, explicit no-cache mode, and oversize skips are all locked down here. That is the right surface to keep stable for FeatureLRUCache.

packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js (1)

696-704: The unique-disposal handling is solid.

Deduping auxiliary decoder tensors and final encoder outputs avoids the double-dispose trap while still cleaning up every failure path.

Also applies to: 826-834

Apply the remaining valid Nemo Conformer TDT review fixes without widening the shared ASR pipeline surface.

- honor encoder_input_layout for canonical input_features feeds
- keep borrowed cache entries counted until they are actually released
- reject tokenizer-less non-empty word-offset reconstruction
- raise the auto-window budget to match the minimum guaranteed cursor advance
- add focused model and pipeline regressions for each fix

Verified with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
@kilo-code-bot

kilo-code-bot bot commented Mar 8, 2026

Code Review Summary

Status: No New Issues Found | Recommendation: Merge

Overview

This PR adds Nemo Conformer TDT ASR support with comprehensive implementation including model execution, feature extraction, decoding, pipeline integration, and extensive test coverage. The existing review comments have been thoroughly addressed in multiple fix commits (07118c38, 00b3d934, 29f2baaf).

Security Review

No concrete security issues identified:

  • Input validation is properly implemented (validate_audio_inputs, validateNemoAudio, config validation in feature extractor)
  • No injection risks found in text processing or caching
  • Cache key generation uses safe FNV-1a hashing without exposing sensitive data
  • Tensor disposal is properly managed to prevent resource leaks

Performance Review

No performance concerns identified:

  • Feature LRU cache with proper eviction policy (max_entries, max_size_mb limits)
  • Efficient delta feature computation with O(T·F·window) complexity
  • Sentence-windowing algorithm with bounded iteration (maxWindows limit)
  • Proper tensor lifecycle management prevents memory leaks

Reliability Review

Code demonstrates strong reliability patterns:

  • Comprehensive validation for all config parameters (n_fft, win_length, preemphasis, delta_order, sampling_rate)
  • Proper error handling with descriptive messages
  • Edge case handling in windowing logic (cursor snap, pending word merge)
  • Sentence segmentation handles edge cases (acronyms, enumerations, punctuation)

Test Coverage

Comprehensive test suite added:

  • Feature extraction tests (80/128 mel bins, delta/delta-delta, concatenation modes)
  • Pipeline tests (auto-windowing, sentence reconstruction, word merging)
  • Cache tests (eviction, memory limits, tensor ownership, borrow/release)
  • Integration tests in ASR pipeline

Files Reviewed (12 files)

  • src/models/feature_extractors.js - Registry update
  • src/models/modeling_utils.js - Model type registration (NemoConformerTDT: 16)
  • src/models/models.js - Model export
  • src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js - Feature extractor with caching
  • src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js - ONNX transducer implementation
  • src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js - Pipeline adapter with auto-windowing
  • src/models/nemo_conformer_tdt/transducer_cache.js - LRU cache with ownership semantics
  • src/models/nemo_conformer_tdt/transducer_deltas.js - Delta feature computation
  • src/models/nemo_conformer_tdt/transducer_word_offsets.js - Word/timestamp reconstruction
  • src/models/nemo_conformer_tdt/transducer_segment_offsets.js - Sentence segmentation
  • src/models/nemo_conformer_tdt/transducer_window_merge.js - Word deduplication
  • tests/pipelines/test_pipelines_nemo_conformer_tdt.js - Full pipeline test suite

Previous Issues Status

All previously identified issues have been addressed:

  • ✅ Pipeline issue at line 216 - Fixed in commit 07118c3
  • ✅ Word offset issue at line 38 - Confirmed as correct behavior
  • ✅ Cache borrow/release handling - Fixed
  • ✅ Sparse vocab fallback sizing - Fixed in commit 29f2baa


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js (1)

158-160: ⚠️ Potential issue | 🟠 Major

Fix the adapter contract assertions.

Line 158 and Line 527 still pin returnMetrics: false, but the Nemo pipeline adapter intentionally forces returnMetrics: true. These tests are asserting the wrong contract.

Suggested patch
-      expect(calls[0]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 0 });
-      expect(calls[1]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 19.95 });
-      expect(calls[2]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: false, timeOffset: 37.9 });
+      expect(calls[0]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 0 });
+      expect(calls[1]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 19.95 });
+      expect(calls[2]).toMatchObject({ returnTimestamps: true, returnWords: true, returnMetrics: true, timeOffset: 37.9 });
@@
         expect(call).toMatchObject({
           returnTimestamps: true,
           returnWords: true,
-          returnMetrics: false,
+          returnMetrics: true,
         });
Based on learnings, `packages/transformers/src/pipelines/automatic-speech-recognition.js` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` for the Nemo Conformer TDT pipeline.

Also applies to: 527-531


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: be76d84a-ffb3-4764-90ac-fb5a629b177e

📥 Commits

Reviewing files that changed from the base of the PR and between f59ba06 and 07118c3.

📒 Files selected for processing (11)
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
📜 Review details
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.

Applied to files:

  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.

Applied to files:

  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.

Applied to files:

  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-05T23:43:18.376Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-08T20:49:38.204Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:38.204Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
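The `raw_features_length` formula quoted in the learning above can be made concrete with a small sketch. The function below reproduces the exact expression from the learning (the centered STFT pads `floor(n_fft / 2)` samples on each side; omitting the usual leading `1 +` term drops the terminal fully-padded frame); the standalone function name is illustrative.

```javascript
// Sketch of the frame-count formula described above: count only non-padding
// frames, matching the Parakeet extractor behavior the learning references.
function rawFeaturesLength(audioLength, n_fft, hop_length) {
  // Padded length minus one window, divided by hop; no `1 +` leading term,
  // so the terminal fully-padded centered-STFT frame is excluded.
  return Math.floor((audioLength + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length);
}
```

For 1 second of 16 kHz audio with `n_fft = 512` and `hop_length = 160`, this yields one frame fewer than the spectrogram-internal `1 + floor((length - frame_length) / hop_length)` count over the padded signal.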
📚 Learning: 2026-03-08T21:41:31.847Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:17-24
Timestamp: 2026-03-08T21:41:31.847Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally quantizes each sample to int16 (via `Math.round(sample * 32768)`) before hashing. This is a deliberate design choice for deterministic cross-runtime key stability. The resulting aliasing of distinct waveforms that fall in the same int16 bins is an accepted tradeoff because feature caching is optional and off by default. Do not flag the int16 quantization as a collision risk or suggest hashing raw bytes instead.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.

Applied to files:

  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
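The two cache-key learnings above (int16 quantization for cross-runtime determinism, full-buffer FNV-1a iteration) can be sketched together. The function below is a hypothetical stand-in, not the real `createAudioCacheKey`; it demonstrates the design choices the learnings describe.

```javascript
// Illustrative sketch of the cache-key strategy: quantize each sample to an
// int16 bin, then FNV-1a hash every sample (full O(n) iteration by design).
function audioCacheKeySketch(samples) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  const FNV_PRIME = 0x01000193;
  for (let i = 0; i < samples.length; ++i) {
    // Deterministic int16 quantization bin, per the learning above.
    const q = Math.round(samples[i] * 32768) & 0xffff;
    hash ^= q & 0xff;             // low byte
    hash = Math.imul(hash, FNV_PRIME);
    hash ^= (q >>> 8) & 0xff;     // high byte
    hash = Math.imul(hash, FNV_PRIME);
  }
  return (hash >>> 0).toString(16);
}
```

The accepted tradeoff noted above falls out directly: waveforms whose samples land in the same int16 bins alias to the same key.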
🔇 Additional comments (1)
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js (1)

422-445: Canonical encoder inputs now honor encoder_input_layout.

This shared normalization path for input_features and audio_signal closes the layout bypass and keeps the transpose/disposal logic in one place.

chatgpt-codex-connector[bot]

This comment was marked as resolved.

Restore the original cast spacing in the unrelated moonshine path so the Nemo PR does not carry an extra formatting-only diff in automatic-speech-recognition.js.
Repository owner deleted a comment from chatgpt-codex-connector bot Mar 8, 2026
Resolve sparse tokenizer vocab fallback by deriving the runtime size from the maximum token id instead of counting entries. This keeps decoder sizing correct when tokenizer ids are non-contiguous.

Tighten merged-word dedupe so punctuation-only overlaps are only collapsed when their raw normalized text also matches, which avoids dropping distinct punctuation tokens across window boundaries.

Add focused Nemo model regressions and verify with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
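The sparse-vocab sizing fix described in this commit message can be sketched as follows. The helper name is hypothetical; it shows why sizing by maximum token id differs from counting entries when ids are non-contiguous.

```javascript
// Sketch of the sparse-vocab fallback fix: the decoder's runtime vocab size
// must cover max token id + 1, not the number of vocab entries, because
// tokenizer ids may be non-contiguous.
function runtimeVocabSize(vocab) {
  let maxId = -1;
  for (const id of Object.values(vocab)) {
    if (id > maxId) maxId = id;
  }
  return maxId + 1;
}
```

With a sparse vocab like `{ "<pad>": 0, "hello": 5 }`, entry counting would size the decoder at 2 and reject token id 5; sizing from the maximum id yields 6.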
coderabbitai[bot]

This comment was marked as resolved.

ysdede added 2 commits March 9, 2026 01:49
Treat likely domain suffixes as continuations when tokenizer decoding inserts whitespace after a trailing period, so sequences like `LibriVox. org.` reconstruct as `LibriVox.org.` in detailed word offsets.

Add a focused regression covering the split `.org` token pattern and verify with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
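The domain-suffix continuation rule described in this commit message can be sketched with a minimal merge pass. The suffix list and function name are illustrative, not the real `transducer_word_offsets.js` internals.

```javascript
// Sketch of the continuation rule above: when a word ends with a trailing
// period and the next word is a likely domain suffix, join them without the
// whitespace the tokenizer decode inserted ("LibriVox." + "org." -> "LibriVox.org.").
const DOMAIN_SUFFIXES = new Set(["org", "com", "net", "edu", "gov", "io"]);

function mergeDomainSuffixes(words) {
  const out = [];
  for (const word of words) {
    const prev = out[out.length - 1];
    const bare = word.replace(/\.+$/, "").toLowerCase(); // strip trailing periods
    if (prev && prev.endsWith(".") && DOMAIN_SUFFIXES.has(bare)) {
      out[out.length - 1] = prev + word; // continuation: join without a space
    } else {
      out.push(word);
    }
  }
  return out;
}
```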
Repository owner deleted a comment from chatgpt-codex-connector bot Mar 9, 2026
Repository owner deleted a comment from coderabbitai bot Mar 9, 2026
Repository owner deleted a comment from chatgpt-codex-connector bot Mar 9, 2026
Repository owner deleted a comment from chatgpt-codex-connector bot Mar 9, 2026
Repository owner deleted a comment from coderabbitai bot Mar 9, 2026
@ysdede ysdede closed this Mar 9, 2026
