
feat: add Nemo Conformer TDT support (pipeline-aligned refresh) #13

Closed

ysdede wants to merge 35 commits into main from v4-nemo-conformer-tdt-main-r2

Conversation

@ysdede ysdede (Owner) commented Mar 7, 2026

Summary

Supersedes closed PR #12 with the latest follow-up work on the same main-based branch line.

This PR adds end-to-end Conformer TDT ASR support to transformers.js (browser + Node.js), including model execution, feature extraction, decoding/reconstruction, pipeline wiring, registry integration, and regression coverage.

It also includes the recent pipeline-alignment work so Conformer TDT behaves more like the shared automatic-speech-recognition task contract while preserving the richer direct model.transcribe() API.

What Is Included

1. Conformer TDT Model + Decoder

  • Added Conformer TDT model implementation.
  • Implemented greedy token-and-duration transducer decoding.
  • Added model.transcribe() output contract for text, timestamps, confidences, optional words/tokens, and optional metrics/debug payloads.
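
A hedged usage sketch of the native API described above. The model id is a placeholder, the option names follow the flags referenced elsewhere in this PR (return_timestamps, return_words, return_metrics), and whether AutoModel resolves this architecture is assumed from the registry mappings added here:

```js
import { AutoModel, AutoProcessor } from '@huggingface/transformers';

// Placeholder model id for an ONNX export with encoder_model + decoder_model_merged.
const model = await AutoModel.from_pretrained('your-org/conformer-tdt-onnx');
const processor = await AutoProcessor.from_pretrained('your-org/conformer-tdt-onnx');

const audio = new Float32Array(16000); // placeholder: 1 s of silence at an assumed 16 kHz rate
const inputs = await processor(audio);
const result = await model.transcribe(inputs, {
  return_timestamps: true, // utterance-level timestamps
  return_words: true,      // word-level objects with timing and confidence
  return_metrics: true,    // encode/decode/total timing and RTF
});
console.log(result.text, result.words, result.metrics);
```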

2. Feature Extraction

  • Added Conformer-TDT-specific log-mel feature extraction.
  • Added optional temporal deltas and delta-delta features.
  • Added optional feature cache utilities with ownership/lifecycle safety checks.

3. ASR Pipeline Integration

  • Integrated the Conformer TDT model type into AutomaticSpeechRecognitionPipeline dispatch.
  • Aligned pipeline outputs with the shared ASR task shape:
    • default: { text }
    • return_timestamps: true: { text, chunks } with segment-level timestamps
    • return_timestamps: 'word': { text, chunks } with word-level timestamps
  • Kept the richer model-native outputs on direct model.transcribe().
  • Added tensor lifetime handling for processor outputs in the Conformer TDT path.
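
For illustration, a sketch of the task-shaped outputs listed above. The model id is a placeholder; the chunk shape is the standard transformers.js ASR contract:

```js
import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'your-org/conformer-tdt-onnx');

const audio = new Float32Array(16000); // placeholder: 1 s of silence at an assumed 16 kHz rate
const plain = await transcriber(audio);                                 // { text }
const segments = await transcriber(audio, { return_timestamps: true }); // { text, chunks: [{ text, timestamp: [start, end] }, ...] }
const words = await transcriber(audio, { return_timestamps: 'word' });  // { text, chunks } with word-level timestamps
```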

4. Long-Audio Handling

  • Added manual windowing support through chunk_length_s / stride_length_s.
  • Added automatic long-audio windowing for pipeline calls above 180 seconds.
  • Current automatic policy uses 90 second chunks with 10 second stride.
  • Added window merge helpers that preserve global timestamp offsets through timeOffset.
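
Continuing the pipeline example above, a sketch of manual windowing with the documented options; the 90 s / 10 s values mirror the automatic policy:

```js
// For inputs longer than 180 seconds the pipeline windows automatically;
// these options force windowing explicitly.
const result = await transcriber(longAudio, {
  chunk_length_s: 90,   // window length in seconds
  stride_length_s: 10,  // overlap between adjacent windows
  return_timestamps: true,
});
// Merged chunks carry absolute times because window offsets are preserved via timeOffset.
```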

5. Word Reconstruction / Detokenization Improvements

  • Reworked word reconstruction to derive boundaries from the final decoded text instead of isolated token decoding only.
  • Fixed spacing/boundary failures around numeric and punctuation-heavy outputs such as:
    • score.48-year-old
    • with0.5
    • March20th,2021.
  • Split text/word/segment/window helpers into model-local modules for easier reuse and review.

6. Registry + Model File Resolution

  • Added model/processor/feature extractor exports and mappings for Conformer TDT.
  • Added dual-artifact model file handling (encoder_model, decoder_model_merged).

7. Regression Coverage

Added or updated tests in:

  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js

Coverage now includes:

  • task-shaped pipeline outputs for default / segment / word timestamp modes
  • overlapping window merge behavior
  • automatic long-audio windowing
  • word-boundary reconstruction for punctuation and numeric tokens

Upstream Sync Included

This branch was synced with upstream/main through commit f65a4c7c (merge commit 49a4af8f).

Included upstream commits:

  • 2120d13e [deno] Support both wgpu and dawn webgpu backends (#1546)
  • a289e5c3 Add support for new Qwen VL models (#1551)
  • b5b51ca9 [version] Update to 4.0.0-next.5
  • 2a210d3e [deno via CDN] Fix simultaneous multi-session loading (#1556)
  • e60a6ee3 Use ModelRegistry for pipeline file loading (#1555)
  • 4331d723 Support PKV cached generation for Qwen-VL models (#1557)
  • cd155a05 fix: prevent partial file reads during concurrent downloads (#1548)
  • 30773fb7 Fix WASM factory blob URL loading (#1558)
  • f65a4c7c feat: add fast boolean is_cached / is_pipeline_cached (#1559)

Conformer TDT adaptation after upstream sync:

  • ee819a1c fix(nemo-tdt): add supports() for ASR model class selection
  • 8dfccddc feat(nemo-tdt): align asr pipeline outputs and long-audio handling

Validation

Executed during this refresh:

  • pnpm --filter @huggingface/transformers build
  • node --experimental-vm-modules .\node_modules\jest\bin\jest.js tests/models.test.js -t "keeps word boundaries from the final decoded text for numeric and punctuation tokens|Nemo Conformer TDT" --runInBand
  • node --experimental-vm-modules .\node_modules\jest\bin\jest.js tests/pipelines.test.js -t "nemo-conformer-tdt \(unit\)" --runInBand

Scope Boundary

This PR keeps focus on Conformer TDT integration and the follow-up work needed to:

  • align pipeline behavior with the shared ASR contract
  • improve long-audio handling in pipeline mode
  • improve word reconstruction and timestamp grouping

NeMo is still relevant as the source architecture/provenance for this implementation, but the user-facing scope of the PR is Conformer TDT support in transformers.js.

Architectural Review: Conformer TDT ASR Integration

Risk Summary: MEDIUM-HIGH

This PR introduces substantial new functionality (Conformer TDT end-to-end ASR, feature extraction with caching, windowed decoding, word-level timestamp reconstruction) with several correctness and reliability issues that require fixes. The design is generally sound but has critical gaps in validation, error recovery, and edge case handling.


CRITICAL FINDINGS

[High] transducer_word_offsets.js - Silent text-token alignment failures without validation

  • Problem: consumeAlignedTokenText() skips whitespace at the cursor, then returns the consumed segment. However, there is no validation that the consumed text from fullText actually matches the expected tokenText. If alignment fails (e.g., tokenizer produces a token that doesn't exist at that position in the decoded text), the function silently continues with misaligned boundaries.
  • Evidence: Lines in buildTransducerWordOffsets() call consumeAlignedTokenText(fullText, textCursor, clean) but never validate that aligned.text === clean or that alignment succeeded. If fullText diverges from token sequence, words get wrong timestamps.
  • Impact: Word-level timestamps become incorrect; confidence scores attributed to wrong words; transcripts appear corrupted to users; silent failures are hardest to debug.
  • Recommended Fix: In consumeAlignedTokenText(), after consuming text from fullText, assert that consumed segment matches tokenText (accounting for whitespace normalization). Throw descriptive error with context (fullText slice, tokenText, cursor position) if mismatch detected. Add debug logging for alignment skips.
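
A minimal sketch of the recommended assertion, assuming the local names used in this review (whitespace normalization beyond cursor skipping is omitted):

```js
function consumeAlignedTokenText(fullText, cursor, tokenText) {
  // Skip whitespace at the cursor before matching, as the current code does.
  while (cursor < fullText.length && /\s/.test(fullText[cursor])) ++cursor;

  const candidate = fullText.slice(cursor, cursor + tokenText.length);
  if (candidate !== tokenText) {
    // Fail loudly instead of silently continuing with misaligned boundaries.
    throw new Error(
      `Token/text alignment failed at cursor ${cursor}: expected ${JSON.stringify(tokenText)}, ` +
        `found ${JSON.stringify(candidate)}.`,
    );
  }
  return { text: candidate, nextCursor: cursor + tokenText.length };
}
```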

[High] modeling_nemo_conformer_tdt.js - Frame index bounds check missing in decoder loop

  • Problem: The main decoding loop reads frames via _getFrameData(encoderOutput, frameIndex, reusableFrame) without bounds verification. The loop increments frameIndex dynamically based on duration steps, but if duration calculation produces invalid values (e.g., step > remaining frames), the code reads past encoder output bounds. The code does compute frameCount correctly but never asserts frameIndex < frameCount before accessing frame data.
  • Evidence: Line computing const frameData = this._getFrameData(encoderOutput, frameIndex, reusableFrame); has no prior bounds check. In _getFrameData(), for 'BDT' layout, it computes frame[d] = data[d * T + frameIndex] without clamping frameIndex. For 'BTD' layout, it uses data.subarray(offset, offset + D) where offset = frameIndex * D, which can exceed array bounds if frameIndex >= T.
  • Impact: Out-of-bounds reads produce garbage frame data, decoder produces invalid output, timestamps become NaN, model returns corrupted results silently. This is a critical data integrity issue.
  • Recommended Fix: (1) After each frame index update in the decode loop, assert frameIndex < frameCount, throwing error if violated. (2) In _getFrameData(), clamp frameIndex: frameIndex = Math.min(frameIndex, T - 1) at minimum, or throw error if frameIndex >= T. (3) Add runtime assertion that frameIndex is non-negative and integer.
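
A sketch of the bounds guard; the loop variable names follow this review and are otherwise assumed:

```js
function assertFrameInBounds(frameIndex, frameCount) {
  if (!Number.isInteger(frameIndex) || frameIndex < 0 || frameIndex >= frameCount) {
    throw new Error(`Frame index ${frameIndex} out of bounds [0, ${frameCount}).`);
  }
}

// In the decode loop, after advancing by the duration step:
frameIndex += Math.max(1, step);
if (frameIndex >= frameCount) break;          // normal loop termination
assertFrameInBounds(frameIndex, frameCount);  // catches NaN, negative, or fractional steps
```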

[High] transducer_window_merge.js - Deduplication assumes strict equality, creates duplicates

  • Problem: dedupeMergedTokens() deduplicates by checking whether tokens have overlapping timestamps: startTime < otherEndTime && endTime > otherStartTime. However, if two tokens are identical (same ID and text) but their timestamps are merely adjacent rather than overlapping (e.g., [1.00-1.05] vs [1.05-1.10] due to floating-point rounding or frame boundary variance), they are NOT deduplicated, because the function requires strict timestamp overlap rather than timestamp proximity. This causes duplicate words in output.
  • Evidence: Lines in dedupeMergedTokens() check only token.id === prev.id && token.raw_token === prev.raw_token plus time overlap; no tolerance for near-identical timestamps. If two windows decode the same word with slightly offset times (common in overlapping window scenarios), both instances survive.
  • Impact: Transcripts contain repeated words in overlapping window regions; word-level timestamps have duplicates; confidence averaging becomes unreliable; word counts inflated.
  • Recommended Fix: (1) Add epsilon tolerance parameter (default 0.05s) for timestamp comparison. (2) When merging, treat tokens/words as identical if ID/text match AND start times differ by < epsilon. (3) Keep earliest start_time and latest end_time when merging duplicates.
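
An epsilon-tolerant duplicate check along the lines recommended above; the field names are the ones quoted in this finding:

```js
const DEDUP_EPSILON_S = 0.05; // suggested default tolerance in seconds

function isNearDuplicate(token, prev, epsilon = DEDUP_EPSILON_S) {
  return (
    token.id === prev.id &&
    token.raw_token === prev.raw_token &&
    Math.abs(token.start_time - prev.start_time) < epsilon
  );
}

function mergeNearDuplicate(token, prev) {
  // Keep the earliest start and latest end when collapsing near-identical tokens.
  prev.start_time = Math.min(prev.start_time, token.start_time);
  prev.end_time = Math.max(prev.end_time, token.end_time);
}
```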

[Medium] feature_extraction_nemo_conformer_tdt.js - Per-frame normalization doesn't guard against zero variance

  • Problem: The normalization logic computes mean and variance across frames, then scales each frame as (frame - mean) / variance, with no epsilon guard before the division. If all frames are identical (variance = 0), or if the input is very short (a single frame), the denominator can be zero or near-zero, producing Infinity or NaN.
  • Evidence: Looking at the feature extraction code, while there is a check if (n_frames > 0), there's no check for variance > epsilon before the scaling division. For very short audio (< 10 frames), variance is likely to be small.
  • Impact: NaN in feature output, downstream model receives invalid inputs, produces garbage output or crashes.
  • Recommended Fix: Add epsilon guard in normalization: const scaled = n_frames > 1 ? (frame - mean) / Math.max(variance, 1e-8) : frame. Log warning for very short inputs (< 5 frames).
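
A sketch of the guarded scaling; the variable names are illustrative, not taken from the implementation:

```js
const EPS = 1e-8;
const denom = Math.max(variance, EPS); // variance floor prevents division by 0 or near-0
for (let i = 0; i < frame.length; ++i) {
  frame[i] = (frame[i] - mean) / denom;
}
if (nFrames < 5) {
  console.warn(`NemoConformerTDTFeatureExtractor: very short input (${nFrames} frames).`);
}
```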

[Medium] transducer_window_merge.js - Window specifications insufficient bounds validation

  • Problem: buildNemoWindowSpecs() enforces chunk_length_s > 2 * stride_length_s but does not validate that the parameters are positive and reasonable, or that the resulting windows actually cover the full audio. If chunk_length_s = 1.0 and the audio is 180 s long, ~180 windows are created; if the stride is too large, gaps are possible at the end.
  • Evidence: The function checks if (chunk_length_s <= 0) but does not validate final chunk_length_s is within safe bounds (e.g., 0.1-300s). No post-generation check that windows[-1].end >= audio.length.
  • Impact: Silent window gaps, missed transcription segments at audio boundaries, incomplete output.
  • Recommended Fix: (1) Add parameter validation: chunk_length_s must be in [0.1, 300], stride_length_s in (0, chunk_length_s). (2) After window generation, assert last window covers full audio or extend it. (3) Log warnings if stride > chunk_length_s / 4 (excessive overlap).
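
A sketch of the suggested validation. The helper name is hypothetical, and the bounds are the ones proposed above, not values from the implementation:

```js
function validateWindowSpecs(windows, audioLengthS, chunk_length_s, stride_length_s) {
  if (!(chunk_length_s >= 0.1 && chunk_length_s <= 300)) {
    throw new Error(`\`chunk_length_s\` must be in [0.1, 300], got ${chunk_length_s}.`);
  }
  if (!(stride_length_s > 0 && stride_length_s < chunk_length_s)) {
    throw new Error(`\`stride_length_s\` must be in (0, chunk_length_s), got ${stride_length_s}.`);
  }
  const last = windows[windows.length - 1];
  if (!last || last.end < audioLengthS) {
    throw new Error(`Windows end at ${last ? last.end : 0}s but audio is ${audioLengthS}s; coverage gap.`);
  }
  if (stride_length_s > chunk_length_s / 4) {
    console.warn('Nemo windowing: stride exceeds chunk_length_s / 4 (excessive overlap).');
  }
}
```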

[Medium] modeling_nemo_conformer_tdt.js - Timestamp validation insufficient downstream

  • Problem: The code validates timeOffset upfront: if (!Number.isFinite(timeOffset)). However, downstream timestamp arithmetic does not re-validate intermediate results. If encoder outputs contain NaN (due to upstream bugs), or if log-probabilities compute -Infinity, downstream Math.max(), Math.min(), or arithmetic can silently propagate NaN.
  • Evidence: Lines computing timestamps do not re-validate after arithmetic: tokenTimestamps.push([roundTs(frameIndex * frameTime + timeOffset), roundTs(endFrame * frameTime + timeOffset)]). If frameTime or frameIndex are NaN, result is NaN. No assertion post-arithmetic.
  • Impact: Silent NaN in output timestamps, breaks downstream applications expecting valid numbers, makes debugging harder.
  • Recommended Fix: After every timestamp arithmetic operation, assert isFinite(result), throwing descriptive error if not. Add debug logging for NaN/Infinity detection with context.
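
A sketch of the post-arithmetic guard, reusing the expression quoted in this finding:

```js
function finiteTs(value, context) {
  if (!Number.isFinite(value)) {
    throw new Error(`Non-finite timestamp (${value}) while computing ${context}.`);
  }
  return value;
}

tokenTimestamps.push([
  finiteTs(roundTs(frameIndex * frameTime + timeOffset), 'token start'),
  finiteTs(roundTs(endFrame * frameTime + timeOffset), 'token end'),
]);
```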

[Medium] transducer_deltas.js - Delta order validation incomplete

  • Problem: computeTemporalDeltas() accepts order parameter (1 or 2). The code throws if order is not 1 or 2, which is good. However, edge cases are not handled: (1) if input_features has fewer frames than window (e.g., window=3 but input has 2 frames), delta computation produces incorrect or empty results; (2) concatenation happens without checking if output tensor dimensions are sensible.
  • Evidence: The function computes deltas via for (let i = window; i < T; ++i) but if T < window, the loop never runs and returns empty delta tensor. Concatenation proceeds anyway, resulting in mismatched dimensions.
  • Impact: Unexpected output shapes, feature dimension misalignment, downstream model failures.
  • Recommended Fix: Add check: if input frames < 2 * window, either skip delta computation or use zero-padding. Validate output tensor shape before concatenation.
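
A short-input guard along the recommended lines; `window` and the `[batch, T, F]` layout follow this review:

```js
const T = input_features.dims[1]; // frame count in the [batch, T, F] layout
if (T < 2 * window + 1) {
  // Too few frames for a stable delta window: return the input unchanged rather
  // than emit an empty or misshapen delta tensor (zero-padding is the alternative).
  console.warn(`computeTemporalDeltas: T=${T} is too small for window=${window}; skipping deltas.`);
  return input_features;
}
```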

MEDIUM-SEVERITY FINDINGS (Design & Maintainability)

[Medium] Cache ownership semantics unclear and error-prone

  • Problem: The pipeline sets cacheOwnsTensors = !!(featureCache && featureCache.max_entries > 0 && featureCache.max_size_mb > 0) to decide whether to dispose tensors. However, if cache configuration changes at runtime, or if eviction happens mid-transcription, tensor ownership boundaries become unclear. Also, if cache eviction occurs while a tensor is being used by the model, use-after-free is possible.
  • Evidence: In pipeline_nemo_conformer_tdt.js, tensors are disposed only if !cacheOwnsTensors. But FeatureLRUCache can evict entries at any time via _evict(), potentially disposing tensors that are still referenced.
  • Impact: Potential use-after-free bugs (rare but catastrophic), memory leaks if cache state is misunderstood, unpredictable tensor lifecycle.
  • Recommended Fix: (1) Add reference counting or epoch-based lifecycle management to FeatureLRUCache to prevent eviction of in-use entries. (2) Document clear ownership boundary: "Cache owns tensors after set(); caller must not retain references." (3) Add assertion checks that tensors are not accessed after cache eviction.
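
A pin-count sketch of the proposed lifecycle guard; acquire/release are hypothetical additions to FeatureLRUCache, not existing API:

```js
class PinnedEntry {
  constructor(value) {
    this.value = value;
    this.pins = 0;        // > 0 while a caller is using the cached tensor
    this.evicted = false; // set by _evict() instead of disposing while pinned
  }
  acquire() {
    ++this.pins;
    return this.value;
  }
  release() {
    if (--this.pins === 0 && this.evicted) this.value.dispose();
  }
}
```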

[Medium] Window merging lacks comprehensive end-to-end test coverage

  • Problem: The windowing and merging logic involves multiple state machines (window generation, time-based filtering, deduplication). Current tests do not cover: (1) zero-length audio, (2) single-frame audio, (3) window boundaries at exact segment ends, (4) overlapping windows with conflicting tokens at boundaries, (5) edge cases where stride > chunk (invalid config), (6) audio shorter than a single window.
  • Impact: Silent failures in edge cases, unpredictable behavior in production with unusual audio durations, hard-to-reproduce bugs.
  • Recommended Fix: Add property-based tests: generate random audio lengths, chunk sizes, and strides; verify (1) no gaps in merged output, (2) no unexpected duplicates, (3) total transcribed duration matches input, (4) window boundaries align properly.
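
A property-style Jest sketch of those invariants; buildNemoWindowSpecs and the window field names are assumed from this review:

```js
it('windows cover arbitrary audio without gaps', () => {
  for (let trial = 0; trial < 100; ++trial) {
    const durationS = 0.5 + Math.random() * 600;
    const chunkS = 5 + Math.random() * 85;
    const strideS = Math.random() * (chunkS / 2 - 0.1); // keeps chunk > 2 * stride
    const windows = buildNemoWindowSpecs(durationS, chunkS, strideS);

    expect(windows[0].start).toBe(0);
    expect(windows[windows.length - 1].end).toBeGreaterThanOrEqual(durationS);
    for (let i = 1; i < windows.length; ++i) {
      expect(windows[i].start).toBeLessThanOrEqual(windows[i - 1].end); // no gaps
    }
  }
});
```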

[Low] Insufficient resource cleanup in exception paths

  • Problem: The main transcribe() method has a single large try-catch wrapping the entire decoding loop. If an exception occurs mid-loop (e.g., during decoder session run), tensors allocated in previous iterations (frameData, frameTensor, tokenTensor, etc.) are disposed in finally block, which is correct. However, if an exception occurs during encoder run, encoderOutputs and intermediate tensors are cleaned up, which is also correct. No obvious leaks in the explicit try-finally paths, but implicit assumptions about RAII could fail if new code paths are added.
  • Impact: Potential memory leaks if exception handling is modified without careful review.
  • Recommended Fix: Use try-finally blocks around all tensor allocations. Maintain explicit lists of tensors to be disposed in finally blocks rather than relying on scope cleanup.
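
The explicit-list pattern, as a sketch; the allocation helper is hypothetical:

```js
const owned = [];
try {
  const frameTensor = makeFrameTensor(encoderOutput, frameIndex); // hypothetical helper
  owned.push(frameTensor);
  // ... run a decoder step; push every further intermediate tensor onto `owned` ...
} finally {
  for (const t of owned) t.dispose(); // runs on success and on any thrown error
}
```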

[Low] Error messages lack context and observability

  • Problem: Error messages are inconsistent: some include full context (encoder output dims, available outputs), others are generic ("Expected encoder output"). No structured logging or debug-level instrumentation at validation points.
  • Impact: Difficult to diagnose production issues; unclear what failed and why.
  • Recommended Fix: Create a centralized logging utility with consistent formatting. Include validation context (actual vs. expected values) in all error messages.

INTEGRATION FINDINGS

[Medium] Pipeline assumes processor and tokenizer availability without upfront validation

  • Problem: runNemoConformerTDTPipeline() checks for missing components mid-execution: if (!model.transcribe) throw ..., if (!processor) throw ..., if (!tokenizer) throw .... These checks occur after pipeline initialization, meaning errors surface late.
  • Evidence: In pipeline_nemo_conformer_tdt.js, validation happens inside the async function, not at pipeline construction.
  • Impact: Unclear failure modes, hard to understand at which stage setup failed.
  • Recommended Fix: Add upfront validation in _call_nemo_conformer_tdt() before calling pipeline, or in pipeline constructor.
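
A sketch of the upfront check; the helper name is hypothetical, and the validated components follow this finding:

```js
function assertNemoPipelineComponents({ model, processor, tokenizer }) {
  if (typeof model?.transcribe !== 'function') {
    throw new Error('Nemo Conformer TDT pipeline requires a model exposing `transcribe()`.');
  }
  if (!processor) throw new Error('Nemo Conformer TDT pipeline requires a processor.');
  if (!tokenizer) throw new Error('Nemo Conformer TDT pipeline requires a tokenizer.');
}
```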

[Low] Test coverage for error paths and edge cases

  • The test suite covers happy paths and some error cases (invalid config, missing vocab), but does not cover: (1) very long audio (>10 hours), (2) audio with long silence at edges, (3) audio with extreme dynamic range, (4) tokenizer returning empty vocab, (5) decoder returning NaN or Infinity logits.
  • Recommended Fix: Add regression tests for edge cases above.

POSITIVE FINDINGS (No Issues)

  • FNV-1a hash implementation is correct: Math.imul(hash, 16777619) properly applies the FNV prime in 32-bit space. Hash collision risk is low for typical audio cache use.
  • Tensor disposal in main decode loop is well-handled with explicit try-finally blocks around decoder calls.
  • Configuration validation is comprehensive: decoder IO alias conflicts checked, layout constraints validated, vocab size resolution has fallbacks.
  • Pipeline tensor ownership is mostly clear: explicit checks for cache ownership determine disposal behavior, which is reasonable if ownership boundaries are documented.
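
For reference, the 32-bit FNV-1a form noted in the first bullet above is easy to verify against a standalone sketch:

```js
function fnv1a(bytes) {
  let hash = 0x811c9dc5; // FNV offset basis, 2166136261
  for (let i = 0; i < bytes.length; ++i) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 16777619); // FNV prime, multiplied modulo 2^32
  }
  return hash >>> 0; // unsigned 32-bit result
}
```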

MERGE DECISION: NEEDS CHANGES

This PR introduces several critical correctness issues:

  1. Text-token alignment (silent mismatches → wrong word boundaries)
  2. Frame indexing bounds (out-of-bounds reads → garbage output)
  3. Window deduplication (duplicates in output → inflated word counts)
  4. Normalization robustness (zero variance → NaN in features)
  5. Timestamp validation downstream (silent NaN propagation → invalid results)

These are not style or performance issues—they are bugs with real production impact. Most fixes are 2-5 lines (bounds check, epsilon guard, dedup tolerance).

Blockers (must fix before merge):

  • Add bounds checking for frame index in decoder loop
  • Add validation to text-token alignment (throw on mismatch)
  • Add epsilon-based tolerance to dedup logic
  • Add variance epsilon guard in normalization
  • Add downstream timestamp validation

Non-blockers (follow-up issues acceptable, but should be tracked):

  • Cache ownership lifecycle management
  • Comprehensive window merging test coverage
  • Resource cleanup documentation

Do not merge without addressing the HIGH-severity frame indexing and text alignment issues. These are correctness bugs that will cause silent data corruption.

ysdede added 30 commits March 1, 2026 16:56
…cache helpers

Carry over non-runtime typing fixes from the prior branch while intentionally excluding the WebGPU disable_prepacking workaround in session.js.

- Cast dynamic model.transcribe access for Nemo TDT pipeline method checks/calls.
- Cast Tensor data byteLength access in transducer cache utilities.
- Add explicit tuple/object JSDoc annotations in transducer timestamp builder.

This keeps main-based v4 work clean with latest ORT-Web on origin/main and avoids retaining the temporary encoder prepacking workaround.
- Replace legacy per-feature flags (return_token_timestamps,
  return_word_timestamps, return_utterance_timestamp) with a layered API:
  return_timestamps (utterance-level), return_words, return_tokens
- Merge duplicate outputs: words absorbs word_timestamps,
  tokens absorbs token_timestamps and token_ids
- Add per-token confidence, word-level confidence aggregation,
  utterance_confidence, and confidence_scores summary
- Gate frame confidences behind returnFrameConfidences flag
- Add return_metrics with encode/decode/total timing and RTF
- Add debug flags: returnFrameIndices, returnLogProbs, returnTdtSteps
- Fix vocab Map handling in getIdToTokenMap and _resolveVocabSize
  (tokenizer.get_vocab() returns Map in WASM binding)
- Update ASR pipeline to wire timestamp_granularity to new model flags
- Format all changed files with Prettier per CONTRIBUTING.md
…ipeline

- Add roundTs() for millisecond-precision timestamp rounding at source

- Round all confidence averages to 6 decimal places

- Round per-token and per-word confidence values

- Remove timestamp_granularity and formatting helpers from pipeline

- Pipeline returns model.transcribe() output directly

- Auto-enable return_words and return_metrics when return_timestamps is true
…imestamps, honor return_metrics kwarg

- modeling_nemo_conformer_tdt: dispose logits and new decoder state tensors
  before throwing when logitsData.length < vocabSize to prevent resource leak
- modeling_nemo_conformer_tdt: move returnFrameConfidences output block outside
  the return_timestamps guard so frame/frame_avg are emitted independently
- automatic-speech-recognition: change return_metrics from hardcoded true to
  kwargs.return_metrics ?? false to respect user intent and avoid overhead
- Accept upstream restructuring: SUPPORTED_TASKS and pipeline imports moved
  from pipelines.js to pipelines/index.js
- Migrate NemoConformerForTDT registration to pipelines/index.js accordingly
- Add MODEL_TYPES.NemoConformerTDT (id=16) to modeling_utils
- Register NemoConformerForTDT in MODEL_TYPE_MAPPING, MODEL_NAME_TO_CLASS_MAPPING,
  and MODEL_CLASS_TO_NAME_MAPPING so the base class from_pretrained, ModelRegistry,
  and is_pipeline_cached all recognise the model correctly
- Add NemoConformerTDT case to get_model_files so progress_callback receives
  accurate file size totals for encoder_model.onnx + decoder_model_merged.onnx
Standardizes internal logging to follow the upstream convention introduced
in ModelRegistry refactor.
- Guard feature extractor against empty/short audio (NaN prevention)

- Move decoder tensor init inside try block for safe disposal

- Add architecture key to MODEL_TYPE_MAPPING

- Add input validation in buildTransducerDetailedOutputs

- Harden audio cache hash against NaN samples

- Add order validation in computeTemporalDeltas

- Restore pipeline: return_timestamps truthy => words + metrics always on
- Remove all timestamp_granularity tests (feature was removed)

- Fix option names: return_tokens, return_words, return_timestamps

- Fix output fields: tokens/words arrays, not token_ids/word_timestamps

- Verify pipeline passes return_words + return_metrics when timestamps on

- Add test: return_timestamps 'word' treated as truthy
Address reviewer findings except the return_metrics policy decision.

- Fix temporal delta concatenation to interleave per frame and add dtype validation.
- Validate preemphasis range and clamp normalization variance in feature extraction.
- Remove unsafe encoder layout inference; require explicit encoder_output_layout.
- Redesign decode loop to read frame data on-demand instead of eager frame materialization.
- Deduplicate word finalization and avoid zero-filling missing word confidences.
- Tighten tests for delta layout/type checks, explicit layout requirement, call counts, and naming accuracy.
Fixes high-impact issues found in PR review validation:

- force NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT in registry overrides
- ensure encoder outputs are disposed when pre-decode validation throws
- remove stride sampling from audio cache key hashing to prevent false cache hits
- use encoder_model selector key in get_model_files for Nemo per-component dtype/device overrides

Also adds targeted regression tests for mapping, disposal behavior, file selection, and cache key correctness.
- Clamp token end timestamps to encoder frame bounds during TDT decoding.
- Validate FeatureLRUCache constructor limits to fail fast on invalid settings.
- Add regression tests for timestamp clamping and cache limit validation.
Dispose intermediate tensors in computeTemporalDeltas concatenate paths and dispose replaced base input features when delta concatenation returns a new tensor.

Add regression tests that assert disposal behavior for delta concatenate flows and feature extractor reassignment.
Dispose non-essential Tensor outputs returned by decoder steps to prevent cumulative memory growth. Keep logits/state tensors alive for decoding/state transitions and dispose extras immediately.

Add regression test to assert auxiliary decoder tensor outputs are disposed each step.
Compute encoder length directly from attention_mask.data instead of attention_mask.tolist() to avoid large transient array allocations in ASR decode hot path.
Fail fast when duration logits are required but missing in decoder output, and enforce positive-integer vocab size at runtime config validation.

Validate prepared Nemo pipeline audio for non-empty finite samples before processor/model calls.

Add regression tests for missing duration logits and non-finite audio rejection.
Fix placeholder interpolation in _prepare_model_inputs error text.

Add fail-fast validation for Nemo delta_window and reject duplicate decoder output aliases in transducer io config.

Add regression tests for delta_window validation and duplicate decoder output alias rejection.
Validate transcribe timeOffset as finite and guard encoderOutputs cleanup path to avoid masking primary failures.

Align transducer_text JSDoc token type with runtime shape (include id).

Harden Parakeet feature extractor test by using direct mask data and explicit tensor disposal via try/finally; add timeOffset validation regression test.
- fail fast on missing decoder state outputs and invalid encoder layout enums
- make FeatureLRUCache own cached tensor lifetimes (replace/evict/clear) with deduped disposal and deterministic size fallback
- validate n_fft/win_length in Nemo feature extractor
- align Nemo ASR pipeline docs with actual forwarded options
- add regression coverage for runtime config validation, non-concatenated deltas/cache behavior, missing decoder state outputs, and cache disposal semantics

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
Apply Gemini review nit in Nemo decode loop by replacing a redundant duration expression with Math.max(1, step).

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
Checklist (bot comment IDs):
- [x] 2892132356: guard tokenizer.get_vocab() return type before Object.keys in _resolveVocabSize.
- [x] 2892132367: treat zero cache limits as explicit no-cache mode; do not store/dispose just-produced values.
- [x] 2892132372: dispose processor tensors in Nemo ASR pipeline when cache does not own lifetimes.

Added regression tests for vocab resolution fallback, zero-limit cache semantics, and Nemo pipeline tensor ownership behavior.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
- widen confidenceFromLogits input type to Tensor data arrays

- narrow feature_cache access with explicit typed cast in ASR pipeline
Checklist (bot comment IDs):
- [x] 2892287484: handle array-returning tokenizer vocab in _resolveVocabSize.
- [x] 2892322884: avoid disposing when re-setting the same object for an existing cache key.
- [x] 2892322906: skip caching oversized values to prevent insert-then-dispose of caller-owned tensors.
- [x] 2892322910: guard byteLength type in estimateSizeBytes.

Added regression tests for array vocab sizing, same-object set behavior, oversized value skipping, and non-numeric byteLength handling.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
sourcery-ai[bot] commented (comment marked as resolved)

@coderabbitai coderabbitai bot commented Mar 7, 2026

📝 Walkthrough

This PR introduces a complete Nemo Conformer TDT (Token-and-Duration Transducer) ASR model integration, including feature extraction with configurable caching, encoder/decoder orchestration with frame-based decoding, a processing pipeline, and pipeline dispatch. A new model type is registered and the ASR pipeline is extended with Nemo support.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Model Registry & Type Registration**<br>`packages/transformers/src/models/modeling_utils.js`, `packages/transformers/src/models/registry.js`, `packages/transformers/src/models/models.js` | Added `NemoConformerTDT` model type constant (value 16), registered `NemoConformerForTDT` in encoder-only mappings and the custom model registry, updated model exports. |
| **Feature Extraction**<br>`packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js` | New `NemoConformerTDTFeatureExtractor` class with log-Mel spectral computation, configurable Mel bin counts (80/128), optional delta/delta-delta features, preemphasis support, frame normalization, and LRU cache integration for multi-call optimization. |
| **Model Architecture & Transcription**<br>`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js` | Complete transducer implementation with encoder/decoder orchestration, frame-based greedy decoding loop, timestamp/confidence computation, session lifecycle management (`from_pretrained`, disposal), and extensive runtime validation (vocab size, token IDs, layout constraints, dtype matching). |
| **Processing & Pipeline Integration**<br>`packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js`, `packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js` | New processor delegating to the feature extractor; pipeline orchestrating audio chunking/windowing, cache lifecycle, multi-window result merging, and flexible output formats (text/tokens/words/timestamps). |
| **Public Exports & Integration**<br>`packages/transformers/src/models/feature_extractors.js`, `packages/transformers/src/models/processors.js`, `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `packages/transformers/src/pipelines/index.js` | Extended public API surface with re-exports, added nemo-conformer-tdt dispatch case to the ASR pipeline, integrated the model into the SUPPORTED_TASKS registry. |
| **Utility Modules**<br>`packages/transformers/src/models/nemo_conformer_tdt/transducer_*.js`, `packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js` | Helper modules for audio caching (FNV-1a hashing, LRU eviction with byte-size limits), temporal deltas (delta/delta-delta with optional concatenation), text decoding and word alignment (token-to-text mapping, per-token/per-word confidence), window specifications and result merging (time-bound filtering, deduplication), and segment chunking (Nemo sentence boundaries, gap-based splitting). A barrel file consolidates exports. |
| **Model File Loading**<br>`packages/transformers/src/utils/model_registry/get_model_files.js` | Added conditional branch to load `encoder_model` and `decoder_model_merged` for the NemoConformerTDT type. |
| **Tests**<br>`packages/transformers/tests/models/nemo_conformer_tdt/test_*.js`, `packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js` | Comprehensive test suites validating feature extraction (Mel bins, delta outputs, caching, parameter bounds), model runtime (vocab/token validation, greedy decoding, timestamp computation, tensor lifecycle), and pipeline (audio windowing, result merging, cache ownership semantics). |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant App
    participant Pipeline as ASR Pipeline
    participant Processor
    participant FeatureExtractor
    participant Model as Nemo Conformer TDT
    participant Encoder
    participant Decoder
    participant Tokenizer

    App->>Pipeline: audio + decode_options
    Pipeline->>Processor: audio
    Processor->>FeatureExtractor: audio
    FeatureExtractor->>FeatureExtractor: compute log-Mel spectrogram
    FeatureExtractor->>FeatureExtractor: apply optional deltas
    FeatureExtractor->>FeatureExtractor: frame normalization
    FeatureExtractor-->>Processor: input_features, attention_mask
    Processor-->>Pipeline: processed inputs
    Pipeline->>Model: inputs → transcribe()
    Model->>Encoder: input_features, attention_mask
    Encoder-->>Model: encoder_output, frame_count
    Model->>Model: frame-based greedy decoding loop
    loop Per Frame
        Model->>Decoder: frame_data, decoder_state
        Decoder-->>Model: logits, updated_state
        Model->>Model: argmax token, compute confidence
        Model->>Model: update timestamp, accumulate token
    end
    Model->>Tokenizer: token_ids
    Tokenizer-->>Model: decoded text
    Model->>Model: align tokens to text → word objects
    Model->>Model: (optional) segment words → chunks
    Model-->>Pipeline: transcription result (text, tokens, words, timestamps)
    Pipeline-->>App: final output
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 54.55%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: add Nemo Conformer TDT support (pipeline-aligned refresh)' directly describes the primary changeset: comprehensive end-to-end Nemo Conformer TDT ASR model integration with feature extraction, decoding, pipeline alignment, and related infrastructure. |


Align the Nemo ASR pipeline with the shared task contract by returning
text-only results by default and chunk-based timestamps for segment and
word modes. Add automatic long-audio windowing, decoded-text-driven word
reconstruction, and model-local helpers for window merge and chunk
assembly.

Also add regression coverage for numeric/punctuation word boundaries,
windowed merge behavior, and auto-windowed long-form pipeline decoding.
@gemini-code-assist gemini-code-assist bot commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands transformers.js by introducing comprehensive, end-to-end support for Nemo Conformer TDT Automatic Speech Recognition models. It enables seamless integration of these models across browser and Node.js environments, providing both a standardized ASR pipeline interface and access to the model's advanced native capabilities. The changes include specialized feature extraction, efficient long-audio processing, and refined word reconstruction, all backed by extensive testing to ensure reliability and accuracy.

Highlights

  • Nemo Conformer TDT Integration: Added full support for Nemo Conformer TDT ASR models, including model execution, feature extraction, and greedy token-and-duration transducer decoding.
  • Pipeline Alignment: Integrated Nemo models into the AutomaticSpeechRecognitionPipeline, aligning outputs with the standard ASR task contract while retaining richer native model.transcribe() API capabilities.
  • Enhanced Audio Processing: Implemented robust long-audio handling with manual and automatic windowing, and improved word reconstruction to accurately handle punctuation and numeric tokens.
  • Feature Extraction & Caching: Introduced Nemo-specific log-mel feature extraction, optional temporal delta features, and an LRU cache for extracted features to optimize performance.
  • Registry & Testing: Updated the model registry for dual-artifact loading and added comprehensive regression tests for the new Nemo features and pipeline behavior.
Changelog
  • packages/transformers/src/models/feature_extractors.js
    • Exported the new Nemo Conformer TDT feature extractor.
  • packages/transformers/src/models/modeling_utils.js
    • Updated MODEL_TYPES to include NemoConformerTDT and improved an error message.
  • packages/transformers/src/models/models.js
    • Exported the new Nemo Conformer TDT modeling file.
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
    • Implemented the Nemo Conformer TDT feature extractor, supporting log-mel spectrograms, preemphasis, normalization, and optional temporal delta features with caching.
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
    • Added the core Nemo Conformer TDT model, enabling greedy token-and-duration transducer decoding and managing encoder/decoder sessions.
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
    • Created a pipeline adapter for Nemo Conformer TDT, handling audio validation, windowing, and result merging for ASR task compatibility.
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
    • Introduced the Nemo Conformer TDT processor, integrating feature extraction and tokenization.
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
    • Developed an LRU cache for extracted features, including audio key generation and tensor lifecycle management.
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
    • Added functionality to compute temporal deltas and delta-deltas for input features.
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
    • Provided utilities for joining timed words, building word chunks, and segmenting text based on linguistic and temporal breaks.
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
    • Implemented functions for decoding token IDs into text and generating detailed word/token outputs.
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
    • Added logic for creating and merging windowed audio segments for long-audio processing.
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
    • Implemented utilities for resolving token text, aligning with full text, and building word offsets with confidence.
  • packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
    • Created a consolidated export file for all Nemo Conformer TDT-related utilities.
  • packages/transformers/src/models/processors.js
    • Exported the new Nemo Conformer TDT processor.
  • packages/transformers/src/models/registry.js
    • Configured the model registry to correctly identify and load dual ONNX artifacts for Nemo Conformer TDT models.
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
    • Integrated Nemo Conformer TDT models into the ASR pipeline dispatch, adding a dedicated call handler.
  • packages/transformers/src/pipelines/index.js
    • Registered NemoConformerForTDT as a supported model for the automatic-speech-recognition task.
  • packages/transformers/src/utils/model_registry/get_model_files.js
    • Modified the model file retrieval logic to support the dual encoder_model and decoder_model_merged files required by Nemo Conformer TDT.
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
    • Added new unit tests for the Nemo Conformer TDT feature extractor, verifying mel bin configurations, delta feature computation, and feature caching behavior.
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
    • Added new unit tests for the Nemo Conformer TDT model, covering greedy decoding, timestamp handling, confidence aggregation, and various configuration validations.
  • packages/transformers/tests/models/parakeet/test_feature_extraction_parakeet.js
    • Updated tests for the Parakeet feature extractor.
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
    • Expanded pipeline tests to include Nemo Conformer TDT, validating output formats, windowed processing, and tensor resource management.
Activity
  • This pull request is a refresh, superseding a previous closed PR (feat: add Nemo Conformer TDT support (upstream-synced) #12), indicating continued development and refinement of the feature.
  • The branch was recently synchronized with upstream/main (commit f65a4c7, merge 49a4af8), ensuring it includes the latest core updates and bug fixes.
  • Specific Nemo adaptations were applied post-sync, addressing model selection and pipeline output consistency.
  • Extensive validation was performed, including pnpm build and targeted Jest tests for Nemo models and pipelines, confirming functionality and integration.

Review comment on `feature_extraction_nemo_conformer_tdt.js` (quoted context):

```js
 * `feature_size` (e.g. 80 or 128 mel bins via `preprocessor_config.json`).
 */
export class NemoConformerTDTFeatureExtractor extends FeatureExtractor {
    constructor(config) {
```

MEDIUM: Missing validation for hop_length in constructor. Unlike n_fft and win_length which are validated (lines 20-33), hop_length is used directly in _extract (line 163) without validation. Invalid hop_length (0, negative, or non-integer) would cause runtime errors or incorrect feature dimensions.

Recommended fix: Add validation in constructor:

```js
if (!Number.isInteger(this.config.hop_length) || this.config.hop_length <= 0) {
    throw new Error(
        `NemoConformerTDTFeatureExtractor expected \`hop_length\` as a positive integer, got ${this.config.hop_length}.`
    );
}
```

@kilo-code-bot kilo-code-bot bot commented Mar 7, 2026

Code Review Summary

Status: 5 Issues Already Flagged | Recommendation: Address existing issues before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 0 |
| MEDIUM | 5 |

Additional Observations

After thorough analysis of this PR adding Nemo Conformer TDT support, I found the following:

Test Coverage (Good):

  • Comprehensive unit tests for feature extraction, model inference, pipeline integration
  • Tests for cache behavior, delta computations, error handling
  • Pipeline tests cover chunked decoding, word timestamp reconstruction

Security Review:

  • No concrete security vulnerabilities identified
  • Input validation present for audio samples (finite values check)
  • Tensor disposal properly managed to prevent memory leaks

Performance Review:

  • Audio cache key computation iterates over all samples (transducer_cache.js:18) - acceptable trade-off for cache accuracy
  • Delta computations use efficient nested loops with proper memory management
  • Feature cache LRU eviction implemented correctly

Reliability:

  • Extensive error handling throughout model, pipeline, and feature extraction
  • Config validation at construction time for critical parameters
  • Runtime validation for session I/O names

One Additional Note (Not in Diff):

  • The sampling_rate and feature_size config parameters are used without validation in the feature extractor constructor. While sampling_rate is validated in the pipeline before use, adding validation at construction time would provide earlier failure for misconfiguration.
Files Reviewed (10+ files)
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js - Already flagged
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js - Already flagged
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js - Already flagged
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js - Already flagged
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
  • packages/transformers/src/models/feature_extractors.js (export addition)
  • packages/transformers/src/models/modeling_utils.js (MODEL_TYPES addition)
  • packages/transformers/src/models/models.js (export addition)

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is an extensive and well-executed pull request that adds end-to-end support for Nemo Conformer TDT models. The changes are well-structured, with clear separation of concerns across new modules for feature extraction, model implementation, pipeline integration, and various transducer-related utilities. The implementation of the TDT decoder, long-audio windowing, and the improved word reconstruction logic is particularly impressive and robust. The addition of comprehensive unit tests, including mocked components for isolated testing, significantly increases confidence in the new functionality.

My review found the code to be of high quality. I have only a couple of minor suggestions regarding error handling style for consistency. Overall, great work on this significant feature addition!

```js
if (stride_length_s === null) {
    stride_length_s = chunk_length_s / 6;
} else if (!(stride_length_s >= 0)) {
    throw Error('`stride_length_s` must be non-negative.');
```
medium

For consistency and best practice, it's better to use new Error() to create an error object instance rather than calling Error() as a function.

Suggested change:

```diff
- throw Error('`stride_length_s` must be non-negative.');
+ throw new Error('`stride_length_s` must be non-negative.');
```

```js
    throw Error('`stride_length_s` must be non-negative.');
}
if (chunk_length_s <= 2 * stride_length_s) {
    throw Error('`chunk_length_s` must be larger than `2 * stride_length_s` for Nemo windowed decoding.');
```
medium

For consistency and best practice, it's better to use new Error() to create an error object instance rather than calling Error() as a function.

Suggested change:

```diff
- throw Error('`chunk_length_s` must be larger than `2 * stride_length_s` for Nemo windowed decoding.');
+ throw new Error('`chunk_length_s` must be larger than `2 * stride_length_s` for Nemo windowed decoding.');
```

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8dfccddc4d


Comment on lines +95 to +99
```js
if (!cacheOwnsTensors) {
    const seen = new Set();
    for (const value of Object.values(inputs ?? {})) {
        if (value instanceof Tensor && !seen.has(value)) {
            value.dispose();
```
P2: Dispose uncached processor tensors when the feature cache skips an entry

runNemoConformerTDTPipeline decides tensor ownership from cache limits (max_entries/max_size_mb) instead of whether the current feature extraction result was actually cached. FeatureLRUCache.set explicitly skips caching oversized values while leaving ownership with the caller, so this if (!cacheOwnsTensors) guard can skip disposal for large windows and leak input_features/attention_mask tensors across calls when caching is enabled but entries do not fit.


Comment on lines +429 to +431
```js
if (name === 'input_features') {
    feeds[name] = inputFeatures;
    continue;
```
P2: Honor encoder_input_layout for input_features feeds

This branch always forwards input_features as-is and bypasses transducer.encoder_input_layout; transpose handling is only applied in the separate audio_signal branch. That means configs that legitimately specify encoder_input_layout: 'BFT' but use input_features as the encoder input name will silently receive BTF tensors and produce incorrect decoding despite passing config validation.


Remove the standalone parakeet feature extractor test from this branch.
It exercises an existing parakeet_ctc path that is outside the scope of
Conformer TDT integration and makes the PR look broader than it is.
@ysdede ysdede (Owner, Author) commented Mar 7, 2026

@coderabbitai review

@coderabbitai coderabbitai bot commented Mar 7, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 39b36607-a412-4780-b0dc-0545844a16f2

📥 Commits

Reviewing files that changed from the base of the PR and between f65a4c7 and 8dfccdd.

📒 Files selected for processing (23)
  • packages/transformers/src/models/feature_extractors.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
  • packages/transformers/src/models/processors.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/models/parakeet/test_feature_extraction_parakeet.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
📜 Review details
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/pipelines/automatic-speech-recognition.js:349-356
Timestamp: 2026-03-05T23:43:18.376Z
Learning: In `packages/transformers/src/pipelines/automatic-speech-recognition.js`, `_call_nemo_conformer_tdt` intentionally hardcodes `return_metrics: true` and ties `return_words` to `return_timestamps` as an explicit API contract for the Nemo Conformer TDT pipeline. Advanced decode/debug controls (e.g., return_tokens, return_metrics override) are intentionally exposed only through direct `model.transcribe()` calls, not through pipeline kwargs. Do not flag these as missing forwarding or hardcoding issues.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.

Applied to files:

  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T22:59:55.984Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T22:59:55.984Z
Learning: When a model subclass overrides from_pretrained and bypasses the generic model-type branch, do not introduce or rely on a MODEL_TYPES entry for that subclass in the model registry. For this NemoConformerTDT case, in packages/transformers/src/utils/model_registry/get_model_files.js, avoid adding a NemoConformerTDT entry in the model type map; rely on the override logic in modeling_nemo_conformer_tdt.js. This keeps the registry explicit to the actual file and prevents unintended dispatch through the generic branch.

Applied to files:

  • packages/transformers/src/utils/model_registry/get_model_files.js
🔇 Additional comments (31)
packages/transformers/tests/models/parakeet/test_feature_extraction_parakeet.js (3)

1-4: Imports are correct.

Relative paths are appropriate for the test file location.


5-20: Test setup is well-structured.

Config values are consistent with standard Parakeet feature extraction parameters. The beforeAll pattern correctly scopes extractor instantiation.


22-47: Test logic is sound with proper resource management.

The try/finally pattern ensures tensor disposal regardless of assertion outcome. Shape and validity assertions align with the expected output structure from ParakeetFeatureExtractor._call.

One observation on line 35: attention_mask.data is a BigInt64Array. The Number(x) coercion works correctly for mask values (0n/1n), but be aware this pattern would silently lose precision for values exceeding Number.MAX_SAFE_INTEGER. Not an issue here given the mask domain.
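For illustration, a minimal standalone sketch of that coercion behavior (not taken from the test file):

const mask = new BigInt64Array([0n, 1n, 1n]);
// Exact for the 0n/1n mask domain.
console.log(Array.from(mask, (x) => Number(x))); // [0, 1, 1]
// Silently rounds once values exceed Number.MAX_SAFE_INTEGER (2^53 - 1).
console.log(Number(9007199254740993n)); // 9007199254740992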

packages/transformers/src/models/modeling_utils.js (1)

121-121: LGTM: New model type enum value.

The NemoConformerTDT: 16 addition correctly extends the MODEL_TYPES enum. Based on learnings, NemoConformerForTDT overrides from_pretrained and explicitly calls constructSessions, so no corresponding MODEL_TYPE_CONFIG entry is required.

packages/transformers/src/models/feature_extractors.js (1)

8-8: LGTM: Feature extractor export.

Standard barrel re-export following existing alphabetical ordering conventions.

packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js (1)

1-19: LGTM: Processor implementation.

The processor correctly delegates audio preprocessing to the feature extractor. The tokenizer_class declaration is standard even if unused in _call—it ensures the tokenizer is loaded and available for downstream model operations.

packages/transformers/src/models/processors.js (1)

11-11: LGTM: Processor export.

Standard barrel re-export following alphabetical ordering.

packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (2)

9-25: LGTM: Cache key generation.

FNV-1a hash implementation is correct. The full sample iteration is intentional per documented design to minimize cache-key collisions.
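For readers unfamiliar with the algorithm, a minimal 32-bit FNV-1a sketch over raw audio bytes; this is illustrative only, and createAudioCacheKey may differ in hash width, seeding, or byte order:

function fnv1a32(samples) {
    let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
    const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
    for (let i = 0; i < bytes.length; i++) {
        hash ^= bytes[i];                         // xor the byte in first...
        hash = Math.imul(hash, 0x01000193) >>> 0; // ...then multiply by the FNV prime, kept unsigned 32-bit
    }
    return hash >>> 0;
}

console.log(fnv1a32(new Float32Array([0.1, 0.2, 0.3])));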


167-177: Hardcoded field names are tightly coupled to feature extractor output.

The collectCachedTensors function hardcodes field names (input_features, attention_mask, delta_features, delta_delta_features) that must match the feature extractor's return type. This coupling is acceptable given the cache is purpose-built for Nemo Conformer TDT, but any future changes to the feature extractor output schema will require parallel updates here.

packages/transformers/src/models/models.js (1)

106-106: LGTM: Model export.

Standard barrel re-export following alphabetical ordering.

packages/transformers/src/utils/model_registry/get_model_files.js (1)

180-182: LGTM: NemoConformerTDT file resolution.

The file list correctly specifies encoder_model and decoder_model_merged, matching the session construction in NemoConformerForTDT.from_pretrained. The absence of generation_config.json is appropriate for transducer-based models.

packages/transformers/src/pipelines/index.js (2)

33-33: LGTM: Import for pipeline integration.

Direct import of NemoConformerForTDT from the modeling module is correct for explicit model class registration.


154-154: LGTM: Pipeline model registration.

Adding NemoConformerForTDT to the automatic-speech-recognition model array enables the pipeline to accept and dispatch Nemo Conformer TDT models. The _call method in AutomaticSpeechRecognitionPipeline already handles the nemo-conformer-tdt model_type case via _call_nemo_conformer_tdt.

packages/transformers/src/pipelines/automatic-speech-recognition.js (2)

6-8: LGTM - Clean pipeline integration.

The Nemo Conformer TDT routing and delegation to runNemoConformerTDTPipeline is correctly implemented. The prepareAudios function is passed as a reference without binding, matching the expected call signature in the pipeline implementation.

Also applies to: 158-159, 308-324


341-341: Minor formatting adjustment - no functional change.

packages/transformers/src/models/registry.js (1)

44-44: LGTM - Correct registry wiring for dual-artifact model.

The encoder-only mapping enables AutoModel lookup, while the CUSTOM_MAPPING override ensures MODEL_TYPES.NemoConformerTDT (value 16) is used at runtime for proper two-artifact file resolution (encoder_model + decoder_model_merged).

Also applies to: 584-586

packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js (2)

9-13: LGTM - Correct tokenizer interface usage.

The guard for empty arrays prevents the tokenizer's "must be a non-empty array" error. The { skip_special_tokens: true } option matches the expected signature per PreTrainedTokenizer.decode().


28-31: LGTM - Clean delegation to word offset builder.

packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js (4)

129-180: Unit test mocks are adequate for pipeline behavior testing.

The simplified mock processor returning { input_features: {} } is acceptable since model.transcribe() is also mocked and doesn't consume the inputs. These tests verify pipeline orchestration, not model inference.


229-315: LGTM - Window merging test with correct offset calculation.

The test correctly expects timeOffset: 1 for the second window: with chunk_length_s=2 and stride_length_s=0.5, jump = 32000 - 16000 = 16000 samples = 1 second.


396-473: LGTM - Auto-window test with correct offset progression.

For 300s audio with NEMO_AUTO_CHUNK_LENGTH_S=90 and NEMO_AUTO_STRIDE_LENGTH_S=10: jump = 1440000 - 320000 = 1120000 samples = 70s. Four windows at offsets [0, 70, 140, 210] is correct.
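The arithmetic is easy to reproduce; a sketch of the offset progression under the stated constants (the actual buildNemoWindowSpecs loop may differ in structure):

const sampling_rate = 16000;
const window = 90 * sampling_rate;  // NEMO_AUTO_CHUNK_LENGTH_S -> 1,440,000 samples
const stride = 10 * sampling_rate;  // NEMO_AUTO_STRIDE_LENGTH_S -> 160,000 samples
const jump = window - 2 * stride;   // 1,120,000 samples = 70 s
const total = 300 * sampling_rate;  // 300 s of audio
const offsets = [];
for (let start = 0; ; start += jump) {
    offsets.push(start / sampling_rate);
    if (start + window >= total) break; // the fourth window ends exactly at 300 s
}
console.log(offsets); // [0, 70, 140, 210]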


482-604: LGTM - Tensor lifecycle tests verify disposal semantics.

Tests correctly verify disposal behavior across three scenarios: cache disabled (disposes), cache enabled with ownership (no dispose), and cache limits disabling caching (disposes).

packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js (2)

9-70: LGTM - Correct delta coefficient computation.

The implementation follows the standard regression-based delta formula with proper boundary clamping. Denominator 2 * sum(n^2) matches the normalization convention. Disposal of intermediate tensors is handled correctly.
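As a reference for the formula being checked, a sketch over a single 1-D feature track; the real implementation operates on [batch, T, F] tensors and handles tensor disposal, which this omits:

// d[t] = sum_{n=1..N} n * (x[t+n] - x[t-n]) / (2 * sum_{n=1..N} n^2), indices clamped at the edges.
function computeDeltas(x, N = 2) {
    const T = x.length;
    let denom = 0;
    for (let n = 1; n <= N; n++) denom += n * n;
    denom *= 2; // the 2 * sum(n^2) normalization noted above
    const out = new Array(T);
    for (let t = 0; t < T; t++) {
        let acc = 0;
        for (let n = 1; n <= N; n++) {
            acc += n * (x[Math.min(t + n, T - 1)] - x[Math.max(t - n, 0)]); // boundary clamping
        }
        out[t] = acc / denom;
    }
    return out;
}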


72-91: LGTM - Frame-by-frame interleaving for concatenated output.

The interleaving produces [batch, T, F*items.length] shape as expected, with proper validation of input array lengths.

packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js (3)

100-131: Test correctly verifies internal disposal behavior.

The monkey-patching of Tensor.prototype.dispose is appropriate for verifying that the feature extractor properly disposes intermediate tensors when producing concatenated delta outputs. The finally block ensures proper restoration.


133-188: LGTM - Cache behavior tests verify tensor identity and lifecycle.

Tests correctly verify that cached results return the same tensor references (===) and that cache statistics are properly maintained.


190-225: LGTM - Validation tests cover construction and runtime errors.

Preemphasis validation at runtime and delta_window/n_fft/win_length validation at construction time are properly tested.

packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js (1)

1-8: LGTM - Clean barrel consolidating Nemo Conformer TDT exports.

Re-exports correctly aggregate all internal utilities for backwards-compatible imports.

packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js (3)

11-57: LGTM - Correct window spec generation.

The validation chain (chunk_length_s > 2 * stride_length_s) and window iteration with jump = window - 2 * stride correctly produces overlapping windows for Nemo TDT decoding.


59-113: LGTM - Overlap filtering and deduplication logic.

shouldKeepTimedItem uses midpoint-based filtering for stride regions. Deduplication prefers longer-duration items on collision, which is a reasonable heuristic for window boundary artifacts.
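The midpoint rule can be pictured with a small sketch; the names, window shape, and edge handling here are assumptions rather than the repo's actual signature:

// An item survives if its temporal midpoint falls inside the window's non-stride core.
function keepByMidpoint(item, win) {
    const mid = (item.start_time + item.end_time) / 2;
    const lo = win.isFirst ? win.start_s : win.start_s + win.stride_s;
    const hi = win.isLast ? win.end_s : win.end_s - win.stride_s;
    return mid >= lo && mid < hi;
}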


120-179: LGTM - Window result merging correctly aggregates and deduplicates.

The merge function properly filters by stride boundaries, deduplicates, reconstructs text from words (preferred) or tokens (fallback), and extracts utterance timestamps.

Comment on lines +20 to +44
if (!Number.isInteger(this.config.n_fft) || this.config.n_fft <= 0) {
throw new Error(
`NemoConformerTDTFeatureExtractor expected \`n_fft\` as a positive integer, got ${this.config.n_fft}.`,
);
}
if (
!Number.isInteger(this.config.win_length) ||
this.config.win_length <= 0 ||
this.config.win_length > this.config.n_fft
) {
throw new Error(
`NemoConformerTDTFeatureExtractor expected \`win_length\` in [1, n_fft], got win_length=${this.config.win_length}, n_fft=${this.config.n_fft}.`,
);
}

// Prefer given `mel_filters` from preprocessor_config.json, or calculate them if they don't exist.
this.config.mel_filters ??= mel_filter_bank(
Math.floor(1 + this.config.n_fft / 2), // num_frequency_bins
this.config.feature_size, // num_mel_filters
0.0, // min_frequency
this.config.sampling_rate / 2, // max_frequency
this.config.sampling_rate, // sampling_rate
'slaney', // norm
'slaney', // mel_scale
);

⚠️ Potential issue | 🟠 Major

Fail fast on the remaining spectrogram parameters.

Lines 36-43 and 163 depend on feature_size, sampling_rate, and hop_length, but only n_fft and win_length are validated. A malformed preprocessor_config.json can currently produce NaN mel filters or invalid frame stepping instead of a clear configuration error.

Proposed fix
         if (
             !Number.isInteger(this.config.win_length) ||
             this.config.win_length <= 0 ||
             this.config.win_length > this.config.n_fft
         ) {
             throw new Error(
                 `NemoConformerTDTFeatureExtractor expected \`win_length\` in [1, n_fft], got win_length=${this.config.win_length}, n_fft=${this.config.n_fft}.`,
             );
         }
+        if (!Number.isInteger(this.config.hop_length) || this.config.hop_length <= 0) {
+            throw new Error(
+                `NemoConformerTDTFeatureExtractor expected \`hop_length\` as a positive integer, got ${this.config.hop_length}.`,
+            );
+        }
+        if (!Number.isFinite(this.config.sampling_rate) || this.config.sampling_rate <= 0) {
+            throw new Error(
+                `NemoConformerTDTFeatureExtractor expected \`sampling_rate\` as a positive number, got ${this.config.sampling_rate}.`,
+            );
+        }
+        if (!Number.isInteger(this.config.feature_size) || this.config.feature_size <= 0) {
+            throw new Error(
+                `NemoConformerTDTFeatureExtractor expected \`feature_size\` as a positive integer, got ${this.config.feature_size}.`,
+            );
+        }

Also applies to: 162-163

Comment on lines +128 to +145
const output = await runNemoTranscribe(window.audio, {
tokenizer,
return_timestamps: true,
return_words: true,
return_tokens: true,
return_metrics: false,
timeOffset: window.start_s,
});
windowResults.push({ window, output });
}

const merged = mergeNemoWindowResults(tokenizer, windowResults);
const result = { text: merged.text || windowResults.map((x) => x.output.text ?? '').join(' ').trim() };
if (wantWordTimestamps) {
result.chunks = buildWordChunks(merged.words);
} else if (wantTimestampChunks) {
result.chunks = buildNemoSegmentChunks(merged.words, merged.utterance_timestamp, result.text);
}

⚠️ Potential issue | 🟠 Major

Do not strip the native Nemo payload out of pipeline responses.

Lines 133 and 154 force return_metrics: false, and Lines 140-145 / 157-162 rebuild the response as { text, chunks? }. That drops is_final, utterance_confidence, utterance_timestamp, confidence_scores, and the metrics payload from the public pipeline result. Preserve the model.transcribe() object in the unchunked path, and aggregate native fields instead of discarding them in the chunked path.

Based on learnings, _call_nemo_conformer_tdt intentionally hardcodes return_metrics: true as part of the Nemo Conformer TDT pipeline contract.

Also applies to: 150-163

Comment on lines +126 to +136
_evict() {
const max_bytes = this.max_size_mb * 1024 * 1024;
while (this.cache.size > this.max_entries || this.current_size_bytes > max_bytes) {
const oldest_key = this.cache.keys().next().value;
if (oldest_key === undefined) break;
const oldest = this.cache.get(oldest_key);
this.cache.delete(oldest_key);
disposeCachedValue(oldest?.value);
this.current_size_bytes -= oldest?.size_bytes ?? 0;
}
}

🧹 Nitpick | 🔵 Trivial

Defensive optional chaining is unnecessary but harmless.

Lines 133-134 use oldest?.value and oldest?.size_bytes defensively. Given that the while condition guarantees cache.size > 0 when entering the loop body, and oldest_key is checked against undefined at line 130, oldest retrieved at line 131 will always exist. The optional chaining is superfluous but does not cause incorrect behavior.

Comment on lines +16 to +25
if (tokenizer?.get_vocab) {
const vocab = tokenizer.get_vocab();
const entries = vocab instanceof Map ? vocab.entries() : Object.entries(vocab);
for (const [token, id] of entries) {
if (Number.isInteger(id)) {
cached.set(id, token);
}
}
}
TOKEN_ID_TO_TEXT_CACHE.set(tokenizer, cached);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the file in question and related tokenizer code
find . -name "transducer_word_offsets.js" -o -name "tokenization_utils.js" -o -name "transducer_window_merge.js" | head -20

Repository: ysdede/transformers.js

Length of output: 277


🏁 Script executed:

# Get the full implementation of transducer_word_offsets.js
wc -l packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js

Repository: ysdede/transformers.js

Length of output: 148


🏁 Script executed:

# Read the full file to understand vocab cache logic
cat -n packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js

Repository: ysdede/transformers.js

Length of output: 8896


🏁 Script executed:

# Check tokenization_utils.js for get_vocab() implementation
wc -l packages/transformers/src/tokenization_utils.js

Repository: ysdede/transformers.js

Length of output: 117


🏁 Script executed:

# Find the get_vocab method in tokenization_utils.js
sed -n '1,300p' packages/transformers/src/tokenization_utils.js | cat -n | grep -A 20 "get_vocab"

Repository: ysdede/transformers.js

Length of output: 914


🏁 Script executed:

# Check how vocab data is actually structured in practice
grep -r "get_vocab" packages/transformers/src/ -A 5 -B 2 | head -60

Repository: ysdede/transformers.js

Length of output: 3626


🏁 Script executed:

# Search for tokenizer implementations and their get_vocab returns
find . -name "*.js" -type f | xargs grep -l "get_vocab" | head -15

Repository: ysdede/transformers.js

Length of output: 438


🏁 Script executed:

# Look for test files related to nemo_conformer_tdt
find . -name "*test*" -o -name "*spec*" | grep -i nemo

Repository: ysdede/transformers.js

Length of output: 259


🏁 Script executed:

# Check how tokenizers from huggingface/tokenizers library work
grep -r "get_vocab" packages/transformers/src/ -B 5 -A 10 | grep -A 8 "vocab\s*=" | head -40

Repository: ysdede/transformers.js

Length of output: 3133


🏁 Script executed:

# Look at specific tokenizer implementation classes
find . -path "*/models/*/tokenization*.js" -type f | head -10

Repository: ysdede/transformers.js

Length of output: 739


🏁 Script executed:

# Check if there are any test files that mock get_vocab
grep -r "get_vocab\|TOKEN_ID_TO_TEXT_CACHE" packages/ --include="*.js" | grep -i test

Repository: ysdede/transformers.js

Length of output: 1284


🏁 Script executed:

# Look at the test file in detail to see the tokenizer mocks
wc -l packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js

Repository: ysdede/transformers.js

Length of output: 159


🏁 Script executed:

# Extract the mock tokenizer definitions from test file
sed -n '1,200p' packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js | cat -n

Repository: ysdede/transformers.js

Length of output: 7861


🏁 Script executed:

# Find all the get_vocab mocks in the test file
grep -n "get_vocab" packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js -B 3 -A 3

Repository: ysdede/transformers.js

Length of output: 1762


🏁 Script executed:

# Check what huggingface/tokenizers actually returns from get_vocab
# Search for actual tokenizer library usage
grep -r "from.*tokenizers\|require.*tokenizers" packages/transformers/src/ -A 5 | head -30

Repository: ysdede/transformers.js

Length of output: 2301


🏁 Script executed:

# Look for actual tokenizer instantiation to understand real vocab structure
grep -r "_tokenizer" packages/transformers/src/tokenization_utils.js -B 2 -A 5 | head -50

Repository: ysdede/transformers.js

Length of output: 2146


🏁 Script executed:

# Find the test case with rawById to see the full context
sed -n '650,700p' packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js | cat -n

Repository: ysdede/transformers.js

Length of output: 1534


🏁 Script executed:

# Look for more context around that rawById test
sed -n '630,750p' packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js | cat -n

Repository: ysdede/transformers.js

Length of output: 4074


🏁 Script executed:

# Search the modeling_nemo_conformer_tdt.js file for how get_vocab is used there
grep -n "get_vocab" packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js -B 3 -A 15

Repository: ysdede/transformers.js

Length of output: 793


🏁 Script executed:

# Check what buildTransducerDetailedOutputs does - does it use buildTransducerWordOffsets?
grep -n "buildTransducerDetailedOutputs\|buildTransducerWordOffsets" packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js | head -20

Repository: ysdede/transformers.js

Length of output: 371


🏁 Script executed:

# Look at transducer_text.js to see if it uses transducer_word_offsets
wc -l packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js

Repository: ysdede/transformers.js

Length of output: 139


🏁 Script executed:

# Get the implementation of buildTransducerDetailedOutputs
sed -n '1,150p' packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js | cat -n

Repository: ysdede/transformers.js

Length of output: 1626


🏁 Script executed:

# Look at the actual imports in transducer_text.js
head -30 packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js

Repository: ysdede/transformers.js

Length of output: 1407


🏁 Script executed:

# Now let's verify the exact behavior - check if the test passes with id-keyed vocab
# and whether the cache is actually empty or populated
grep -A 50 "keeps word boundaries from the final decoded text" packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js | head -60

Repository: ysdede/transformers.js

Length of output: 1263


🏁 Script executed:

# Let's trace through what happens when Object.entries is called on rawById
python3 << 'EOF'
# Simulate what happens with id-keyed vocab
rawById = {
    "1": "▁score",
    "2": ".",
    "3": "48",
}

entries = list(rawById.items())
print("Object.entries equivalent:", entries)

# Simulate the loop in transducer_word_offsets.js
cached = {}
for token, id_val in entries:  # destructure as [token, id]
    print(f"token={token}, id={id_val}, isinstance(id, int)={isinstance(id_val, int)}")
    if isinstance(id_val, int):
        cached[id_val] = token
        
print("Cached result:", cached)
print("Cache is empty:", len(cached) == 0)
EOF

Repository: ysdede/transformers.js

Length of output: 303


🏁 Script executed:

# Verify by looking at what the test actually checks for raw_token in the output
sed -n '70,85p' packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js

Repository: ysdede/transformers.js

Length of output: 551


🏁 Script executed:

# Check if test validates raw_token field
grep -n "raw_token" packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js

Repository: ysdede/transformers.js

Length of output: 48


🏁 Script executed:

# Look at the actual raw_token usage in resolveTokenPiece more carefully
sed -n '35,52p' packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js

Repository: ysdede/transformers.js

Length of output: 710


🏁 Script executed:

# Check what happens in resolveTokenPiece when cache is empty - trace the fallback
sed -n '29,52p' packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js | cat -n

Repository: ysdede/transformers.js

Length of output: 1085


🏁 Script executed:

# Verify that the test doesn't actually check raw_token values
sed -n '650,720p' packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js | grep -A 30 "expect(output.tokens"

Repository: ysdede/transformers.js

Length of output: 484


Handle both token→id and id→token vocab shapes.

Line 19 assumes each entry is [token, id], leaving TOKEN_ID_TO_TEXT_CACHE empty when get_vocab() returns id-keyed objects like { 1: "▁score", 2: "." }. This breaks boundary marker detection (▁ / Ġ) in resolveTokenPiece(), degrading word segmentation in production. Normalize vocab orientation by detecting key type, not position.

Proposed fix
    cached = new Map();
    if (tokenizer?.get_vocab) {
        const vocab = tokenizer.get_vocab();
        const entries = vocab instanceof Map ? vocab.entries() : Object.entries(vocab);
-       for (const [token, id] of entries) {
-           if (Number.isInteger(id)) {
-               cached.set(id, token);
-           }
+       for (const [a, b] of entries) {
+           if (typeof a === 'string' && Number.isInteger(b)) {
+               cached.set(b, a);
+               continue;
+           }
+           const id = typeof a === 'number' ? a : Number(a);
+           if (Number.isInteger(id) && typeof b === 'string') {
+               cached.set(id, b);
+           }
        }
    }


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 816f581180

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +33 to +35
const window = Math.floor(sampling_rate * chunk_length_s);
const stride = Math.floor(sampling_rate * stride_length_s);
const jump = window - 2 * stride;

P1: Reject chunk sizes that quantize to zero samples

buildNemoWindowSpecs floors chunk_length_s/stride_length_s to integer samples but never validates the quantized values; when chunk_length_s is very small (e.g. < 1 / sampling_rate), window and jump can become 0, so the while (true) loop never advances offset and the ASR pipeline hangs indefinitely instead of failing fast.
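One possible fail-fast guard, sketched under the snippet's names (illustrative, not the PR's code):

function validateWindowing(sampling_rate, chunk_length_s, stride_length_s) {
    const window = Math.floor(sampling_rate * chunk_length_s);
    const stride = Math.floor(sampling_rate * stride_length_s);
    const jump = window - 2 * stride;
    if (window <= 0 || jump <= 0) {
        // A zero-sample window or non-positive jump would leave the while (true) loop stuck.
        throw new Error(`chunk_length_s/stride_length_s quantize to window=${window}, jump=${jump} samples.`);
    }
    return { window, stride, jump };
}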


Comment on lines +76 to +79
prev &&
prev.text === word.text &&
word.start_time < prev.end_time
) {

P2: Avoid collapsing repeated words during overlap de-dup

dedupeMergedWords drops any adjacent entries that share the same text and overlap in time, which can delete legitimate repetitions (e.g. "no no") near chunk boundaries when timestamp jitter causes slight overlap; this silently corrupts merged transcripts for windowed decoding, not just duplicate overlap artifacts.
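A hedged alternative predicate, with the 0.5 threshold being an assumption: require the overlap to cover most of the shorter word before treating a same-text neighbor as a boundary artifact, so genuine repetitions survive slight timestamp jitter:

function isBoundaryDuplicate(prev, word, minOverlapRatio = 0.5) {
    if (!prev || prev.text !== word.text) return false;
    const overlap = Math.min(prev.end_time, word.end_time) - Math.max(prev.start_time, word.start_time);
    const shorter = Math.min(prev.end_time - prev.start_time, word.end_time - word.start_time);
    // Only flag as a duplicate when the shared span dominates the shorter occurrence.
    return shorter > 0 && overlap / shorter >= minOverlapRatio;
}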



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js (1)

16-24: ⚠️ Potential issue | 🟠 Major

Handle both token→id and id→token vocab shapes.

This was flagged in a prior review iteration. The current implementation at line 19 assumes [token, id] ordering. When get_vocab() returns an id-keyed object (e.g., { 1: "▁score", 2: "." }), Object.entries yields [["1", "▁score"], ...] where the key is a stringified id. The destructure assigns token="1" and id="▁score", causing Number.isInteger(id) to fail and leaving the cache empty.

This breaks boundary marker detection (▁ / Ġ) in resolveTokenPiece(), degrading word segmentation.

Proposed fix
     cached = new Map();
     if (tokenizer?.get_vocab) {
         const vocab = tokenizer.get_vocab();
-        const entries = vocab instanceof Map ? vocab.entries() : Object.entries(vocab);
-        for (const [token, id] of entries) {
-            if (Number.isInteger(id)) {
-                cached.set(id, token);
-            }
+        if (vocab instanceof Map) {
+            for (const [token, id] of vocab.entries()) {
+                if (Number.isInteger(id)) {
+                    cached.set(id, token);
+                }
+            }
+        } else if (Array.isArray(vocab)) {
+            vocab.forEach((token, id) => cached.set(id, token));
+        } else if (vocab && typeof vocab === 'object') {
+            for (const [key, val] of Object.entries(vocab)) {
+                if (typeof key === 'string' && Number.isInteger(val)) {
+                    cached.set(val, key);
+                } else {
+                    const numKey = Number(key);
+                    if (Number.isInteger(numKey) && typeof val === 'string') {
+                        cached.set(numKey, val);
+                    }
+                }
+            }
         }
     }

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cae8f5ec-d592-49c3-b335-8285e4155a59

📥 Commits

Reviewing files that changed from the base of the PR and between f65a4c7 and 816f581.

📒 Files selected for processing (22)
  • packages/transformers/src/models/feature_extractors.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
  • packages/transformers/src/models/processors.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js

Applied to files:

  • packages/transformers/src/models/processors.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/models.js
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.

Applied to files:

  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/models.js
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.

Applied to files:

  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
📚 Learning: 2026-03-03T22:59:55.984Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T22:59:55.984Z
Learning: When a model subclass overrides from_pretrained and bypasses the generic model-type branch, do not introduce or rely on a MODEL_TYPES entry for that subclass in the model registry. For this NemoConformerTDT case, in packages/transformers/src/utils/model_registry/get_model_files.js, avoid adding a NemoConformerTDT entry in the model type map; rely on the override logic in modeling_nemo_conformer_tdt.js. This keeps the registry explicit to the actual file and prevents unintended dispatch through the generic branch.

Applied to files:

  • packages/transformers/src/utils/model_registry/get_model_files.js
🔇 Additional comments (44)
packages/transformers/src/models/processors.js (1)

11-11: LGTM.

Export follows the established alphabetical ordering convention and correctly surfaces the Nemo Conformer TDT processor module.

packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js (2)

1-70: LGTM.

The delta computation implementation is correct:

  • Input validation is thorough (Tensor type, dims, window, order, dtype checks)
  • The finite difference formula with boundary clamping (lines 36-37) is standard; see the sketch after this list
  • Memory management properly disposes intermediate tensors when concatenating
  • Recursive delta-delta computation via order=2 is clean
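
For reference, a minimal sketch of boundary-clamped regression deltas over a flat `[T, F]` buffer; this is the standard formulation, not necessarily the repo's exact code:

```js
// Standard regression deltas with index clamping at the edges (sketch).
function computeDeltasSketch(features, T, F, window = 2) {
    let denom = 0;
    for (let n = 1; n <= window; ++n) denom += n * n;
    denom *= 2; // 2 * sum(n^2)
    const out = new Float32Array(T * F);
    for (let t = 0; t < T; ++t) {
        for (let f = 0; f < F; ++f) {
            let acc = 0;
            for (let n = 1; n <= window; ++n) {
                // Clamp frame indices at the boundaries instead of zero-padding.
                const tPlus = Math.min(T - 1, t + n);
                const tMinus = Math.max(0, t - n);
                acc += n * (features[tPlus * F + f] - features[tMinus * F + f]);
            }
            out[t * F + f] = acc / denom;
        }
    }
    return out; // run again on the output for delta-deltas (order = 2)
}
```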

72-91: LGTM.

The interleaveByFrame helper correctly validates input lengths and performs frame-wise interleaving with proper offset calculations.
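
A minimal sketch of the frame-wise interleaving pattern, assuming flat `[T, F]` Float32Array inputs (the real helper's signature may differ):

```js
// Interleave two [T, F] planes into one [T, 2 * F] buffer, frame by frame.
function interleaveByFrameSketch(a, b, T, F) {
    if (a.length !== T * F || b.length !== T * F) {
        throw new Error(`length mismatch: expected ${T * F}, got ${a.length}/${b.length}`);
    }
    const out = new Float32Array(T * F * 2);
    for (let t = 0; t < T; ++t) {
        out.set(a.subarray(t * F, (t + 1) * F), t * 2 * F);     // frame t of a
        out.set(b.subarray(t * F, (t + 1) * F), t * 2 * F + F); // frame t of b
    }
    return out;
}
```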

packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js (2)

130-149: LGTM.

Input validation is thorough: checks for empty inputs, length mismatches between token_ids/token_timestamps/token_confidences, and gracefully returns empty results for invalid tokenizer or empty arrays. Error messages include actual lengths for debugging.


160-202: LGTM.

The main loop correctly:

  • Resolves token pieces with boundary detection
  • Aligns decoded text against fullText for accurate character positions
  • Aggregates tokens into words based on boundary markers and whitespace
  • Handles optional confidence values with proper null checks and finite validation
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js (2)

9-13: LGTM.

Defensive handling: returns empty string for invalid/empty token_ids, falls back to space-joined tokens if tokenizer is unavailable.


28-31: LGTM.

Clean composition that decodes full text first then delegates to buildTransducerWordOffsets for structured output generation.

packages/transformers/src/utils/model_registry/get_model_files.js (1)

180-182: LGTM.

The NemoConformerTDT branch correctly loads encoder_model and decoder_model_merged files, matching the session keys expected by NemoConformerForTDT.from_pretrained(). No generation_config.json is needed since TDT uses greedy transducer decoding rather than autoregressive generation.

packages/transformers/src/models/models.js (1)

106-106: LGTM.

Export follows alphabetical ordering and correctly surfaces the Nemo Conformer TDT modeling module.

packages/transformers/src/models/feature_extractors.js (1)

8-8: LGTM.

Export follows alphabetical ordering and correctly surfaces the Nemo Conformer TDT feature extractor module.

packages/transformers/src/models/modeling_utils.js (2)

121-121: LGTM.

`NemoConformerTDT: 16` is added to `MODEL_TYPES`. No corresponding entry in `MODEL_TYPE_CONFIG` is required, since `NemoConformerForTDT` overrides `from_pretrained` and bypasses the generic model-type branching logic.


880-882: LGTM.

Error message formatting simplified without functional change.

packages/transformers/src/models/registry.js (1)

44-44: Registry wiring is correct.

Dual registration in MODEL_MAPPING_NAMES_ENCODER_ONLY (for AutoModel lookup) and CUSTOM_MAPPING (for runtime model type override) matches the documented intent. The approach correctly forces MODEL_TYPES.NemoConformerTDT for the two-artifact loading path while preserving standard lookup.

Also applies to: 584-586

packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js (1)

8-18: Processor design is appropriate for ASR workflow.

The _call method correctly delegates to feature_extractor only. While tokenizer_class is declared, ASR tokenization occurs in model.transcribe(), not in the processor. The pipeline runner passes the tokenizer separately. This aligns with the ASR contract.
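
Schematically, the delegation looks like the following sketch (names assumed; the actual class wires in tokenizer_class and config as well):

```js
// Sketch: the processor forwards audio to the feature extractor only.
class ProcessorSketch {
    constructor(feature_extractor) {
        this.feature_extractor = feature_extractor;
    }
    async _call(audio, options = {}) {
        // Features only; tokenization happens later in model.transcribe().
        return await this.feature_extractor(audio, options);
    }
}
```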

packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js (1)

1-227: Test coverage is comprehensive.

Tests validate mel bin configurations, delta feature modes, disposal semantics, cache behavior, and constructor validation. Proper cleanup with dispose() and clear_cache() prevents resource leaks across tests.

packages/transformers/src/pipelines/automatic-speech-recognition.js (2)

158-159: Nemo Conformer TDT dispatch is correctly implemented.

The switch case routes to _call_nemo_conformer_tdt, which delegates to runNemoConformerTDTPipeline with the correct options object structure. Per established API contract, advanced decode controls are intentionally exposed only through direct model.transcribe() calls.

Also applies to: 315-324


341-341: Trivial formatting change.

Type-cast spacing adjustment from `/** @type {Tensor} */ (outputs)` to `/** @type {Tensor} */(outputs)`. Non-functional.

packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js (3)

22-28: O(n) audio validation is justified for correctness.

While iterating all samples is expensive for long audio, catching non-finite values (NaN, Infinity) before inference prevents silent corruption or cryptic downstream errors. This is acceptable as a safety guardrail.


133-145: Pipeline returns task-shaped output per API contract.

Lines 133 and 154 set return_metrics: false, and results are rebuilt as { text, chunks? }. Per the established learning, this is intentional: the pipeline surface aligns with the shared ASR task contract, and rich native outputs are available through direct model.transcribe(). This is not a defect.

Also applies to: 150-163


90-105: Tensor disposal logic is correct.

The Set-based deduplication prevents double-dispose when the same Tensor instance appears in multiple input fields. The finally block ensures cleanup even on transcription failure.
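
The pattern being praised here is simple enough to sketch; `disposeAll` is a hypothetical name, not the pipeline's actual helper:

```js
// Dispose each distinct tensor exactly once, even if the same instance
// appears under multiple input fields.
function disposeAll(tensors) {
    const seen = new Set();
    for (const t of tensors) {
        if (t && typeof t.dispose === 'function' && !seen.has(t)) {
            seen.add(t);
            t.dispose();
        }
    }
}
```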

packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js (3)

129-180: Mock construction is well-designed.

The makeUnitPipe helper correctly mocks the model, processor, and tokenizer interfaces required by the pipeline. The calls array enables assertion of options passed to transcribe(). This pattern cleanly isolates unit tests from real model loading.


229-315: Window merge test validates deduplication and offset handling.

The test confirms overlapping windows produce correct merged output with unique words and proper timestamps. The timeOffset assertions (lines 306-313) verify windowing parameters are passed correctly.


482-604: Disposal semantics tests are thorough.

Three scenarios are covered: no cache (dispose), cache owns tensors (no dispose), and zero-limit cache (dispose). The monkey-patching approach with trackDispose correctly verifies disposal behavior without modifying production code.

packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js (4)

10-44: MockNemoConformerForTDT correctly scripts decoder behavior.

The mock class overrides _runEncoder and _runDecoder to return controlled tensors, enabling deterministic testing of the greedy decode loop. The decoderScript array provides step-by-step logits for validation.


648-712: Word boundary reconstruction test validates punctuation handling.

The test confirms buildTransducerDetailedOutputs correctly groups tokens into words based on SentencePiece boundary markers (▁) and preserves punctuation/numeric boundaries. This is critical for ASR accuracy.


826-910: LRU cache tests validate eviction, disposal, and shared reference handling.

Tests confirm LRU eviction order, single disposal for replaced entries, and correct handling of shared tensor references across fields. The finally blocks ensure cleanup even on assertion failure.


791-810: Remove unused local_files_only parameter from test call.

The get_model_files function does not accept local_files_only in its parameter signature and does not forward it to AutoConfig.from_pretrained. The parameter is silently ignored. The test works correctly due to the pre-loaded config object, which short-circuits any file loading, not because of local_files_only. Remove this unused parameter to avoid misleading future readers about the function's capabilities.

> Likely an incorrect or invalid review comment.
packages/transformers/src/pipelines/index.js (1)

33-33: Integration is correct.

NemoConformerForTDT correctly exports a static supports() method at line 246 that returns model_type === 'nemo-conformer-tdt'. The import at line 33 and task registration at line 154 follow the established pattern and are properly integrated.

packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js (3)

20-44: Fail fast on invalid hop_length, sampling_rate, and feature_size.

Constructor validates n_fft and win_length but omits validation for hop_length, sampling_rate, and feature_size, all of which are used in _extract (line 163) and _extract_fbank_features (lines 109, 40-41). A malformed preprocessor_config.json with invalid values (zero, negative, or non-numeric) will produce incorrect frame counts, NaN mel filters, or division errors instead of a clear configuration error.


140-156: Cache key and ownership semantics are correctly implemented.

The cache key at line 144 incorporates audio hash, sampling rate, and delta configuration, ensuring distinct entries for different extraction parameters. The spread return ({ ...cached }) at lines 147/152 correctly returns a shallow copy while the cache retains ownership of the tensor instances. This aligns with the documented contract in the JSDoc (lines 137-138).
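
The ownership contract can be sketched as follows (a plain Map stands in for the real cache):

```js
// Shallow-copy-on-read: the cache keeps ownership of the Tensor instances.
const cache = new Map(); // key -> { input_features, attention_mask, ... }

function getCachedSketch(key) {
    const cached = cache.get(key);
    if (!cached) return undefined;
    // Caller receives a fresh container object, but the tensors inside are
    // still cache-owned, so the caller must not dispose them.
    return { ...cached };
}
```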


158-231: Extraction pipeline and tensor lifecycle are correctly managed.

The normalization logic (lines 175-199) correctly handles the divide-by-zero case for empty/short audio by checking features_length > 0. The delta computation path (lines 212-228) properly disposes the original input_features tensor when replaced by the concatenated delta result. The attention mask construction (lines 201-205) correctly uses BigInt64Array for int64 tensor dtype.

packages/transformers/src/models/nemo_conformer_tdt/utils_nemo_conformer_tdt.js (1)

1-8: Barrel file correctly consolidates Nemo Conformer TDT utilities.

Re-exports are properly structured for backwards compatibility. No circular dependency concerns since transducer_segment_offsets.js has no imports from this barrel.

packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js (3)

11-57: Window specification logic is correctly guarded against infinite loops.

The validation at lines 29-31 ensures chunk_length_s > 2 * stride_length_s, which guarantees jump = window - 2 * stride > 0. Combined with the is_last break condition at line 52, the while loop is guaranteed to terminate. The defensive NaN checks at lines 12 and 26 are appropriate.
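
A minimal sketch of the termination argument, with assumed names and seconds-based units:

```js
// With chunk > 2 * stride, each window advances by a strictly positive jump.
function* windowSpecsSketch(totalS, chunkS, strideS) {
    if (!(chunkS > 2 * strideS)) {
        throw new Error('chunk_length_s must exceed 2 * stride_length_s');
    }
    const jump = chunkS - 2 * strideS; // > 0 by the guard above
    let start = 0;
    for (;;) {
        const end = Math.min(totalS, start + chunkS);
        const isLast = end >= totalS;
        yield { start, end, isLast };
        if (isLast) break; // mirrors the is_last break condition
        start += jump;     // positive jump => guaranteed termination
    }
}

// e.g. [...windowSpecsSketch(200, 90, 10)] -> windows starting at 0, 70, 140
```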


70-113: Deduplication logic correctly handles window boundary overlaps.

Both dedupeMergedWords and dedupeMergedTokens use overlap detection (word.start_time < prev.end_time / token.start_time < prev.end_time) combined with identity checks (text/id+raw_token) to detect duplicates at window boundaries. Keeping the longer-duration item is a reasonable heuristic for windowed ASR merging.
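
In sketch form (field names taken from the comment above, the rest assumed):

```js
// Drop an overlapping repeat of the same word, keeping the longer instance.
function dedupeMergedWordsSketch(words) {
    const out = [];
    for (const w of words) {
        const prev = out[out.length - 1];
        if (prev && w.start_time < prev.end_time && w.text === prev.text) {
            if (w.end_time - w.start_time > prev.end_time - prev.start_time) {
                out[out.length - 1] = w; // keep the longer-duration item
            }
            continue;
        }
        out.push(w);
    }
    return out;
}
```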


120-178: Merged result construction handles edge cases appropriately.

The text reconstruction at lines 165-170 correctly falls back from words to tokens to empty string. The utterance_timestamp derivation (lines 171-176) gracefully handles empty results. The visibility filtering via shouldKeepTimedItem correctly uses midpoint-based inclusion to avoid double-counting items at stride boundaries.

packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (2)

9-25: Cache key generation correctly incorporates all audio samples.

The FNV-1a hash iterates over all samples with 16-bit quantization, which is a deliberate design choice to minimize false cache hits. The key format `${sampling_rate}:${audio.length}:${hash}` provides good collision resistance.
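
For illustration, a sketch of an FNV-1a-style key over 16-bit-quantized samples; note that canonical FNV-1a is byte-wise, so mixing 16-bit values as below is a simplification, not the repo's exact hash:

```js
function audioCacheKeySketch(audio, sampling_rate) {
    let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
    for (let i = 0; i < audio.length; ++i) {
        // Quantize each float sample to 16 bits before mixing it in.
        const q = Math.max(-32768, Math.min(32767, Math.round(audio[i] * 32767))) & 0xffff;
        hash = Math.imul(hash ^ q, 0x01000193); // multiply by the FNV prime
    }
    return `${sampling_rate}:${audio.length}:${hash >>> 0}`;
}
```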


65-107: Cache set() correctly handles edge cases and ownership transfer.

The zero-cache mode check (lines 67-72), oversized item handling (lines 88-96), and existing entry replacement (lines 98-102) are all correctly implemented. The size tracking and eviction trigger at line 106 maintain cache invariants.

packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js (2)

8-22: Word joining correctly handles punctuation spacing.

The regex at line 15 correctly identifies punctuation-only tokens that should attach without leading space. The accumulation logic handles the empty-text edge case at line 12.
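
A minimal sketch of the joining rule (the character class is an assumption, not the exact regex at line 15):

```js
// Punctuation-only tokens attach to the previous word without a space.
function joinWordsSketch(words) {
    let text = '';
    for (const word of words) {
        const punctOnly = /^[.,!?;:'")\]}%]+$/.test(word);
        text += text === '' || punctOnly ? word : ' ' + word;
    }
    return text;
}

// joinWordsSketch(['March', '20th', ',', '2021', '.']) -> 'March 20th, 2021.'
```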


49-87: Segment chunking logic handles break conditions and edge cases correctly.

The break conditions at lines 65-67 correctly combine sentence-ending punctuation detection and inter-word gap threshold. The gap calculation at line 64 uses Math.max(0, ...) to handle overlapping words. Empty input handling at lines 50-55 provides reasonable fallback behavior.
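
Sketched with an assumed gap threshold (the real value comes from config/constants in the module):

```js
// Start a new segment after sentence-ending punctuation or a long gap.
function chunkSegmentsSketch(words, maxGapS = 0.8) {
    if (!words.length) return [];
    const segments = [[words[0]]];
    for (let i = 1; i < words.length; ++i) {
        const prev = words[i - 1];
        const gap = Math.max(0, words[i].start_time - prev.end_time);
        if (/[.!?]$/.test(prev.text) || gap > maxGapS) segments.push([]);
        segments[segments.length - 1].push(words[i]);
    }
    return segments;
}
```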

packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js (5)

85-235: Transducer configuration validation is thorough and defensive.

The resolveTransducerConfig function validates:

  • Decoder layer/hidden size (lines 97-104)
  • I/O name uniqueness (lines 110-129)
  • Session input/output availability (lines 131-171)
  • Runtime parameters with appropriate type checks and bounds (lines 173-215)

The error messages are actionable and reference the config path for remediation.


254-317: Session loading with proper error wrapping and model type validation.

The from_pretrained implementation correctly validates model type before session construction, rejects unsupported model_file_name overrides, and wraps constructSessions errors with helpful context about expected file structure.


640-823: Decoder loop tensor lifecycle is correctly managed with no infinite loop risk.

The try-finally block (lines 640-823) ensures targetLengthTensor, decoderState, and encoderOutputs are disposed even on error paths. The frame advancement logic guarantees termination (sketched after this comment):

  • step > 0: advances by step
  • step == 0 with non-blank: increments emittedOnFrame until maxSymbolsPerStep
  • Blank or max symbols reached: advances by 1

The decoder output disposal at lines 684-692 correctly handles aliased tensor references via Set.
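
Control-flow sketch of the advancement rule described above; `decodeStep` is a stub standing in for the encoder/decoder/joint evaluation, and all names are assumptions:

```js
function greedyAdvanceSketch(numFrames, maxSymbolsPerStep, decodeStep, blankId) {
    const tokens = [];
    let t = 0, emittedOnFrame = 0;
    while (t < numFrames) {
        const { tokenId, step } = decodeStep(t); // stubbed decode call
        const isBlank = tokenId === blankId;
        if (!isBlank) tokens.push(tokenId);
        if (step > 0) {
            t += step;           // duration head jumps ahead by `step` frames
            emittedOnFrame = 0;
        } else if (!isBlank && emittedOnFrame + 1 < maxSymbolsPerStep) {
            emittedOnFrame += 1; // stay on this frame and emit another symbol
        } else {
            t += 1;              // blank or symbol cap reached: advance by one
            emittedOnFrame = 0;
        }
    }
    return tokens;
}
```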


684-692: Decoder output tensor disposal handles aliased outputs correctly.

Using a Set to track seen tensors before disposal prevents double-free when decoder session returns aliased output references. The exclusion of logits, outputState1, and outputState2 ensures they remain available for subsequent processing.


924-931: Registry mappings correctly register both model type and architecture keys.

Registering both 'nemo-conformer-tdt' (model_type) and 'NemoConformerForTDT' (architecture) ensures proper resolution via MODEL_TYPE_MAPPING. The bidirectional class-name mappings enable dynamic instantiation.

Comment on lines +81 to +88
const featureCache = /** @type {{ max_entries: number, max_size_mb: number }|null|undefined} */ (
    /** @type {any} */ (processor.feature_extractor)?.feature_cache
);
const cacheOwnsTensors = !!(
    featureCache &&
    featureCache.max_entries > 0 &&
    featureCache.max_size_mb > 0
);

🧹 Nitpick | 🔵 Trivial

Cache ownership detection may produce false negatives.

The check featureCache.max_entries > 0 && featureCache.max_size_mb > 0 assumes both limits must be positive for caching to be active. If the feature extractor uses a different cache structure or only one limit, this could incorrectly trigger disposal of cache-owned tensors.

Consider defensive access:

Proposed fix
     const featureCache = /** @type {{ max_entries: number, max_size_mb: number }|null|undefined} */ (
         /** @type {any} */ (processor.feature_extractor)?.feature_cache
     );
-    const cacheOwnsTensors = !!(
-        featureCache &&
-        featureCache.max_entries > 0 &&
-        featureCache.max_size_mb > 0
-    );
+    const cacheOwnsTensors = !!(
+        featureCache &&
+        (featureCache.max_entries ?? 0) > 0 &&
+        (featureCache.max_size_mb ?? 0) > 0
+    );

Comment on lines +185 to +207
function estimateSizeBytes(value) {
    if (value instanceof Tensor) {
        return tensorByteSize(value);
    }
    if (value?.input_features instanceof Tensor) {
        let bytes = tensorByteSize(value.input_features);
        if (value.attention_mask instanceof Tensor) {
            bytes += tensorByteSize(value.attention_mask);
        }
        if (value.delta_features instanceof Tensor) {
            bytes += tensorByteSize(value.delta_features);
        }
        if (value.delta_delta_features instanceof Tensor) {
            bytes += tensorByteSize(value.delta_delta_features);
        }
        return bytes;
    }
    const byteLength = value?.byteLength;
    if (typeof byteLength === 'number' && Number.isFinite(byteLength) && byteLength >= 0) {
        return byteLength;
    }
    return 0;
}

🧹 Nitpick | 🔵 Trivial

Document that estimateSizeBytes returns 0 for unknown types.

The function returns 0 for values that are neither Tensors nor objects with recognized tensor properties nor have a byteLength property. While the current usage with feature extraction outputs is safe, this could cause unbounded cache growth if misused with arbitrary objects. Consider adding a JSDoc comment clarifying the expected input types.

Documentation suggestion
+/**
+ * Estimate byte size of a cached value.
+ * Returns 0 for unrecognized types - callers must ensure only Tensor or
+ * Tensor-containing feature objects are cached.
+ * @param {any} value
+ * @returns {number}
+ */
 function estimateSizeBytes(value) {

Comment on lines +100 to +131
it(
    "disposes replaced base features when concatenated delta output is used",
    async () => {
        const extractor = new NemoConformerTDTFeatureExtractor({
            ...base,
            feature_size: 80,
            delta_order: 1,
            delta_window: 2,
            delta_concatenate: true,
        });

        const originalDispose = Tensor.prototype.dispose;
        let disposeCalls = 0;
        Tensor.prototype.dispose = function () {
            disposeCalls += 1;
            return originalDispose.call(this);
        };

        let input_features;
        try {
            ({ input_features } = await extractor(audio));
            expect(input_features.dims[2]).toBe(80 * 2);
        } finally {
            Tensor.prototype.dispose = originalDispose;
            input_features?.dispose();
        }

        // One dispose from computeTemporalDeltas intermediate tensor, one from replacing base features tensor.
        expect(disposeCalls).toBe(2);
    },
    MAX_TEST_EXECUTION_TIME,
);

🧹 Nitpick | 🔵 Trivial

Disposal count assertion may be fragile.

The test asserts exactly 2 dispose calls. If the implementation changes (e.g., adds intermediate tensors), this will break. Consider asserting disposeCalls >= 2 or documenting the expected disposal sources explicitly in the test comment.

Comment on lines +79 to +82
it("maps NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT", () => {
expect(MODEL_TYPE_MAPPING.get("NemoConformerForTDT")).toBe(MODEL_TYPES.NemoConformerTDT);
expect(MODEL_TYPE_MAPPING.get("nemo-conformer-tdt")).toBe(MODEL_TYPES.NemoConformerTDT);
});

⚠️ Potential issue | 🟡 Minor

Assertion on line 81 will fail; MODEL_TYPE_MAPPING is keyed by class name, not model_type.

MODEL_TYPE_MAPPING is populated with class names as keys (e.g., "NemoConformerForTDT"), not model_type strings (e.g., "nemo-conformer-tdt"). Line 81 will return undefined.

Proposed fix
     it("maps NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT", () => {
       expect(MODEL_TYPE_MAPPING.get("NemoConformerForTDT")).toBe(MODEL_TYPES.NemoConformerTDT);
-      expect(MODEL_TYPE_MAPPING.get("nemo-conformer-tdt")).toBe(MODEL_TYPES.NemoConformerTDT);
     });

Alternatively, if testing model_type → class resolution is intended, use the appropriate mapping (MODEL_NAME_TO_CLASS_MAPPING or similar).

@ysdede
Owner Author

ysdede commented Mar 8, 2026

Superseded by #14.

The replacement PR includes the latest sentence-based pipeline and long-audio fixes on top of the original NeMo Conformer TDT work.

@ysdede ysdede closed this Mar 8, 2026
