
[WIP] Add NeMo Conformer TDT ASR support #15

Open
ysdede wants to merge 42 commits into main from v4-nemo-conformer-tdt-main-r3

Conversation


@ysdede ysdede commented Mar 9, 2026

Summary

This PR adds NeMo Conformer TDT ASR support to transformers.js, including model execution, feature extraction, decoding, reconstruction, pipeline wiring, registry integration, and NeMo-specific regression coverage.

The NeMo pipeline is aligned with the shared automatic-speech-recognition task contract, while richer direct model.transcribe() outputs remain available for lower-level use.

This branch supersedes my earlier fork-side iteration and is rebased onto the current upstream/main line.

What is included

1. Model and decoder

  • Added the NeMo Conformer TDT model implementation.
  • Implemented greedy token-and-duration transducer decoding.
  • Added model.transcribe() support for text, timestamps, confidences, optional words and tokens, and optional metrics/debug payloads.
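The greedy token-and-duration transducer loop can be sketched as follows. This is a hedged illustration of the TDT decoding idea, not the PR's actual implementation: `stepDecoder`, `durations`, and the record shapes are hypothetical stand-ins.

```javascript
function argmax(a) {
  let best = 0;
  for (let i = 1; i < a.length; ++i) if (a[i] > a[best]) best = i;
  return best;
}

// Greedy TDT decode: at each encoder frame, pick the best token and the best
// duration; non-blank tokens are emitted, and the time cursor always advances
// by at least one frame.
function greedyTdtDecode(frames, stepDecoder, { blankId, durations }) {
  const tokens = [];
  let state = null;
  let t = 0;
  while (t < frames.length) {
    const { tokenLogits, durationLogits, nextState } = stepDecoder(frames[t], state);
    const tokenId = argmax(tokenLogits);
    const step = durations[argmax(durationLogits)];
    if (tokenId !== blankId) {
      tokens.push({ id: tokenId, frame: t });
      state = nextState; // prediction-network state advances only on emission
    }
    t += Math.max(1, step); // skip ahead by the predicted duration
  }
  return tokens;
}
```

The duration head is what lets TDT skip frames instead of re-evaluating every frame as a classic RNN-T decoder would.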

2. Feature extraction

  • Added Conformer-TDT-specific log-mel feature extraction.
  • Added optional temporal deltas and delta-delta features.
  • Added optional feature-cache utilities with tensor ownership and lifecycle handling.
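The temporal deltas above follow the classic regression-window formulation used in MFCC-style pipelines; the sketch below assumes that formulation and is illustrative only (the PR's exact window and edge handling may differ). Delta-deltas are simply deltas of the deltas.

```javascript
// Regression-window temporal deltas over per-frame feature vectors.
// Edges are handled by clamping the frame index (replicate padding).
function computeDeltas(frames, N = 2) {
  const denom = 2 * Array.from({ length: N }, (_, i) => (i + 1) ** 2).reduce((a, b) => a + b, 0);
  const T = frames.length;
  const clamp = (t) => Math.min(T - 1, Math.max(0, t));
  return frames.map((_, t) => {
    const dim = frames[t].length;
    const out = new Float32Array(dim);
    for (let n = 1; n <= N; ++n) {
      const plus = frames[clamp(t + n)];
      const minus = frames[clamp(t - n)];
      for (let d = 0; d < dim; ++d) out[d] += (n * (plus[d] - minus[d])) / denom;
    }
    return out;
  });
}

// Delta-deltas: const deltaDeltas = computeDeltas(computeDeltas(frames));
```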

3. ASR pipeline integration

  • Integrated the Conformer TDT model type into AutomaticSpeechRecognitionPipeline dispatch.
  • Aligned pipeline outputs with the shared ASR task shape:
    • default: { text }
    • return_timestamps: true: { text, chunks } with sentence-like finalized chunks
    • return_timestamps: 'word': { text, chunks } with word-level timestamps
  • Kept richer model-native outputs on direct model.transcribe().
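The word-chunk shape above can be sketched as a simple mapping from timed words to the task contract. The `{ word, start, end }` record is a hypothetical intermediate, not the PR's internal type:

```javascript
// Shape timed words into the shared ASR task contract: { text, chunks },
// where each chunk carries a [start, end] timestamp pair.
function toWordChunks(words) {
  return {
    text: words.map((w) => w.word).join(' '),
    chunks: words.map((w) => ({ text: w.word, timestamp: [w.start, w.end] })),
  };
}
```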

4. Long-audio handling

  • Added automatic long-audio handling for NeMo pipeline calls above 180 seconds.
  • Replaced the older overlap-oriented long-audio path with sentence-cursor restart logic.
  • Long-audio windowing finalizes stable sentence-like segments, drops the immature trailing segment, and retranscribes from that segment start.
  • chunk_length_s is used as the NeMo window-size override in pipeline mode.
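The sentence-cursor restart logic above can be sketched as a loop: transcribe a window, finalize all but the trailing sentence, and restart from that sentence's start. `transcribeWindow` and the sentence records are illustrative stand-ins; the real code also guarantees a minimum cursor advance via the window budget.

```javascript
// Sentence-cursor long-audio windowing (hedged sketch).
// transcribeWindow(startS, endS) -> [{ text, start, end }, ...]
function windowedTranscribe(durationS, windowS, transcribeWindow) {
  const chunks = [];
  let cursorS = 0;
  while (cursorS < durationS) {
    const endS = Math.min(cursorS + windowS, durationS);
    const sentences = transcribeWindow(cursorS, endS);
    const trailing = sentences[sentences.length - 1];
    // Final window, no stable prefix, or no forward progress: keep everything.
    if (endS >= durationS || sentences.length <= 1 || trailing.start <= cursorS) {
      chunks.push(...sentences);
      break;
    }
    chunks.push(...sentences.slice(0, -1)); // finalize stable sentences
    cursorS = trailing.start;               // drop + retranscribe the immature trailing sentence
  }
  return chunks;
}
```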

5. Word reconstruction and timestamp grouping

  • Reworked word reconstruction to derive boundaries from the final decoded text instead of isolated token decoding alone.
  • Improved segment grouping from timed words so sentence-like chunks are more stable than Whisper-style random splits.
  • Fixed spacing and boundary failures around punctuation-heavy and numeric outputs such as:
    • score.48-year-old
    • with0.5
    • March20th,2021.
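The boundary derivation can be sketched as cursor-based alignment against the final decoded text: each token piece is anchored at its next occurrence, so spacing around punctuation and numbers follows the tokenizer's final detokenization instead of per-token decoding. This is an illustrative sketch, not the PR's reconstruction code.

```javascript
// Align decoded token pieces into the final text; returns character spans,
// or null when alignment fails so the caller can fall back.
function alignTokens(fullText, tokenTexts) {
  const spans = [];
  let cursor = 0;
  for (const tok of tokenTexts) {
    const piece = tok.trim();
    if (!piece) continue;
    const at = fullText.indexOf(piece, cursor);
    if (at === -1) return null; // token not found past the cursor
    spans.push({ text: piece, start: at, end: at + piece.length });
    cursor = at + piece.length;
  }
  return spans;
}
```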

6. Registry and model file resolution

  • Added model, processor, and feature extractor exports and mappings for Conformer TDT.
  • Added dual-artifact model file handling for encoder_model and decoder_model_merged.

7. Follow-up review fixes included in this branch

  • Moved NeMo adapter-specific assertions out of the shared ASR pipeline test file and into a dedicated NeMo pipeline adapter suite.
  • Fixed pending-prefix preservation when cursor snapping restarts inside the trailing sentence.
  • Hardened vocab handling and validation in word-offset reconstruction.
  • Added cache borrow/release handling so evicted borrowed tensors are disposed only after release.
  • Honored encoder_input_layout for canonical input_features feeds.
  • Raised the auto-window budget to match the minimum guaranteed cursor advance.
  • Kept borrowed cache-entry bytes counted until the final release.
  • Rejected tokenizer-less non-empty word-offset reconstruction instead of silently dropping detail.
  • Derived fallback vocab_size from the maximum tokenizer id so sparse vocabularies do not undersize decoder logits.
  • Kept punctuation-only merge dedupe from collapsing distinct overlapping tokens.
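The sparse-vocab fallback above amounts to sizing from the maximum token id rather than the entry count. A minimal sketch (function name illustrative), assuming the tokenizer vocab is a token-to-id Map:

```javascript
// Derive a fallback vocab_size from the maximum token id so that sparse
// (non-contiguous) vocabularies do not undersize the decoder logits.
function fallbackVocabSize(vocab) {
  let maxId = -1;
  for (const id of vocab.values()) {
    if (Number.isInteger(id) && id > maxId) maxId = id;
  }
  return maxId + 1;
}
```

With ids {0, 5}, counting entries would give 2, while the decoder actually needs logits of length 6.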

Regression coverage

Added or updated tests in:

  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js

Coverage includes:

  • task-shaped pipeline outputs for default, sentence-chunk, and word-chunk modes
  • sentence-cursor long-audio windowing and retranscription
  • timestamp grouping and word-boundary reconstruction for punctuation and numeric tokens
  • encoder input layout handling for canonical feeds
  • sparse vocab fallback sizing
  • punctuation-only merge dedupe behavior
  • cache ownership, eviction, release, and accounting behavior
  • processor tensor lifetime behavior in the NeMo path

Upstream sync included

This branch was synced with upstream/main through commit f65a4c7c via merge commit 49a4af8f.

Included upstream commits in that sync:

  • 2120d13e [deno] Support both wgpu and dawn webgpu backends (#1546)
  • a289e5c3 Add support for new Qwen VL models (Qwen2.5-VL, Qwen3-VL, Qwen3.5, and Qwen3.5 MoE) (#1551)
  • b5b51ca9 [version] Update to 4.0.0-next.5
  • 2a210d3e [deno via CDN] Fix simultaneous multi-session loading (e.g. VLMs) and support usage for image loading (#1556)
  • e60a6ee3 Use ModelRegistry for pipeline file loading (#1555)
  • 4331d723 Support PKV cached generation for Qwen-VL models (#1557)
  • cd155a05 fix: prevent partial file reads during concurrent downloads (#1548)
  • 30773fb7 Fix WASM factory blob URL loading (#1558)
  • f65a4c7c feat: add fast boolean is_cached / is_pipeline_cached (#1559)

NeMo adaptation after upstream sync

Relevant branch commits after that sync include:

  • ee819a1c fix(nemo-tdt): add supports() for ASR model class selection
  • a85dff25 fix(nemo-tdt): address PR #12 reviewer feedback
  • 8dfccddc feat(nemo-tdt): align asr pipeline outputs and long-audio handling
  • 816f5811 chore(tests): drop unrelated parakeet feature extractor coverage
  • f59ba068 feat(nemo-conformer-tdt): add sentence-based ASR pipeline chunking
  • 00b3d934 fix(nemo): scope ASR tests and address review fixes
  • 07118c38 fix(nemo-tdt): address follow-up review threads
  • 341df3d7 chore(asr): restore upstream cast spacing
  • 29f2baaf fix(nemo-tdt): handle sparse vocab and merge dedupe

Validation

Executed for this refresh:

  • node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
  • node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
  • pnpm build
  • pnpm format:check on files changed by this branch: OK

Note: repo-wide pnpm format:check currently reports unrelated formatting issues outside this branch, but the files touched by this PR pass Prettier checks.

A prebuilt demo for this branch is available at https://ysdede.github.io/tdt-webgpu-demo/ for interactive testing of the current behavior, parameters, outputs, and example usage.

Scope boundary

This PR stays focused on NeMo Conformer TDT integration and the follow-up work needed to:

  • align pipeline behavior with the shared ASR contract
  • improve long-audio handling in pipeline mode
  • improve word reconstruction and timestamp grouping
  • address targeted reviewer-reported NeMo correctness issues

Direct model.transcribe() remains the low-level API for advanced app-specific postprocessing.

Architectural Review: NeMo Conformer TDT ASR Integration

Risk Level: HIGH

Critical Issues

[Critical] Memory Leak: Undisposed Frame Tensors in Decoding Loop

File: modeling_nemo_conformer_tdt.js (lines ~450-500)

Problem:
In the transcribe() method's decoding loop, frameTensor is created on each iteration but never disposed. The loop allocates a new Tensor per frame without cleanup, accumulating memory pressure linearly with audio duration.

const frameTensor = this._createFrameTensor(frameData);  // Created each iteration
// ... decoder runs ...
// Only tokenTensor is disposed in finally block; frameTensor is NOT
tokenTensor.dispose();  // Disposes inner tensor but not frameTensor

Impact:
For a 1-hour audio file with 16kHz sampling and 10ms frame shifts (~360k frames), this allocates ~360k Tensors that never release their underlying memory. This will cause OOM crashes on long audio and fails the project's disposal hardening goal stated in PR objectives.

Fix:
Add disposal in the decoding loop's finally block or immediately after decoder runs:

let decoderOutput;
try {
    decoderOutput = await this._runDecoder(decoderFeeds);
} finally {
    tokenTensor.dispose();
    frameTensor.dispose();  // ADD THIS
}

[Critical] Incomplete Decoder Output Tensor Disposal

File: modeling_nemo_conformer_tdt.js (lines ~460-480)

Problem:
The code attempts to track and dispose decoder output tensors via seenDecoderTensors, but the logits tensor is disposed while intermediate state tensors may leak:

const logits = decoderOutput[io.decoder_output];
const outputState1 = decoderOutput[io.decoder_output_state_1];
const outputState2 = decoderOutput[io.decoder_output_state_2];
const seenDecoderTensors = new Set();
for (const value of Object.values(decoderOutput)) {
    if (!(value instanceof Tensor) || seenDecoderTensors.has(value)) continue;
    // ... disposal logic ...
}
logits.dispose();  // Only logits explicitly disposed

If decoderOutput contains additional Tensor values not enumerated in the loop condition, they remain undisposed. The set-based deduplication doesn't guarantee all tensors are cleaned.

Impact:
Accumulation of state tensors in long audio processing, eventually causing OOM.

Fix:
Explicitly dispose all three state tensors or refactor to track all decoder outputs:

outputState1?.dispose?.();
outputState2?.dispose?.();
// Or verify seenDecoderTensors loop actually covers all outputs

[High] Frame Reuse Logic Introduces Non-Deterministic State

File: modeling_nemo_conformer_tdt.js (lines ~335-350)

Problem:
The reusableFrame optimization reuses a Float32Array for BDT layout but creates a new one for BTD layout on every iteration:

const frameData = this._getFrameData(encoderOutput, frameIndex, reusableFrame);
if (this.transducer.encoder_output_layout === 'BDT') {
    reusableFrame = frameData;  // Reuse for next iteration
}
const frameTensor = this._createFrameTensor(frameData);

If _getFrameData returns the previous iteration's reusableFrame AND that data is modified by the encoder/decoder session (unlikely but possible with ONNX backends), the decoder receives corrupted data. More critically, the Array reference becomes ambiguous about ownership—is it still safe to reuse if the session holds a reference?

Impact:
Potential silent data corruption in long audio, producing incorrect transcriptions without error signals.

Fix:
Either always copy the frame data or document strict ownership semantics with the session layer. Remove reuse optimization unless profiling proves measurable benefit.


High-Severity Issues

[High] Numerical Stability in Confidence Calculation

File: modeling_nemo_conformer_tdt.js (lines ~48-70)

Problem:
The confidenceFromLogits() function computes softmax with log-sum-exp but does not validate input bounds:

function confidenceFromLogits(logits, tokenId, vocabSize) {
    let maxLogit = Number.NEGATIVE_INFINITY;
    for (let i = 0; i < vocabSize; ++i) {
        if (logits[i] > maxLogit) maxLogit = logits[i];
    }
    let expSum = 0;
    for (let i = 0; i < vocabSize; ++i) {
        expSum += Math.exp(logits[i] - maxLogit);  // Could underflow to 0 or overflow to Infinity
    }
    const logSumExp = maxLogit + Math.log(expSum);  // log(0) = -Infinity if expSum underflows
    const logProb = logits[tokenId] - logSumExp;    // -Infinity - (-Infinity) = NaN
    return { confidence: Math.exp(logProb), logProb };  // NaN or Infinity result
}

If logits contains extreme values (e.g., all logits < -700, causing exp() underflow to 0), then logSumExp becomes -Infinity, producing NaN confidence values that corrupt metrics and debug output.

Impact:
Silent production of NaN/Infinity in confidence scores; downstream code may accept these invalid values, producing unreliable transcription metrics.

Fix:
Add guards and use stable softmax:

if (!Number.isFinite(maxLogit)) return { confidence: 0, logProb: -Infinity };
if (expSum <= 0) return { confidence: 0, logProb: -Infinity };  // Prevent log(0)
return { confidence: Math.max(0, Math.min(1, Math.exp(logProb))), logProb };
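Putting the guards and the stable softmax together, a complete version might look like the following. This consolidates the fragments above into one function under the same assumed inputs; it is a sketch, not the PR's final code.

```javascript
// Stable softmax confidence with explicit non-finite guards: the maxLogit
// check runs before any exp(), and expSum <= 0 short-circuits before log().
function confidenceFromLogits(logits, tokenId, vocabSize) {
  let maxLogit = Number.NEGATIVE_INFINITY;
  for (let i = 0; i < vocabSize; ++i) {
    if (logits[i] > maxLogit) maxLogit = logits[i];
  }
  if (!Number.isFinite(maxLogit)) return { confidence: 0, logProb: -Infinity };
  let expSum = 0;
  for (let i = 0; i < vocabSize; ++i) {
    expSum += Math.exp(logits[i] - maxLogit); // safe: exponent is <= 0
  }
  if (!(expSum > 0)) return { confidence: 0, logProb: -Infinity };
  const logProb = logits[tokenId] - (maxLogit + Math.log(expSum));
  return { confidence: Math.min(1, Math.max(0, Math.exp(logProb))), logProb };
}
```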

[High] Timestamp Fallback Logic Produces Invalid Ranges

File: modeling_nemo_conformer_tdt.js (lines ~495-510)

Problem:
When token timestamps are empty, the code computes a fallback utterance timestamp:

const utteranceTimestamp =
    tokenTimestamps.length > 0
        ? [tokenTimestamps[0][0], tokenTimestamps[tokenTimestamps.length - 1][1]]
        : [roundTs(timeOffset), roundTs(frameCount * frameTime + timeOffset)];

If frameTime (computed as subsampling_factor * frame_shift_s) is not finite or is zero, then:

  • End timestamp equals start timestamp (invalid range)
  • roundTs() receives Infinity/NaN, producing invalid output

Additionally, the RTF calculation uses Math.max(utteranceDuration, 1e-8) as a guard against division by zero, but if utterance timestamps are both roundTs(timeOffset), the duration becomes 0 before the guard is applied.

Impact:
Invalid timestamp ranges (start >= end) propagate to pipeline output, breaking downstream segment processing and potentially causing index-out-of-bounds in audio windowing logic.

Fix:
Validate frameTime and fallback timestamp computation:

if (!Number.isFinite(frameTime) || frameTime <= 0) {
    throw new Error(`Invalid frame timing: frameTime=${frameTime}`);
}
const utteranceTimestamp = tokenTimestamps.length > 0 ? [...] : [roundTs(timeOffset), roundTs(timeOffset + Math.max(frameCount * frameTime, 1e-8))];

[High] Word-Timestamp Alignment Fragility

File: transducer_word_offsets.js (lines ~50-80)

Problem:
The consumeAlignedTokenText() function uses fullText.indexOf() as a fallback when exact token matching fails:

const next = fullText.indexOf(tokenText, cursor);
if (next !== -1 && /^\s*$/.test(fullText.slice(cursor, next))) {
    return { cursor: next + tokenText.length, text: ..., skippedWhitespace: true };
}

This heuristic can silently match the wrong occurrence when the decoded text contains repeated substrings (e.g., in "the the", each token "the" should bind to a distinct occurrence, but nothing enforces this if tokenization and detokenization disagree). There is no validation that the matched text actually falls on a valid token boundary.


Impact:
Word timestamps become misaligned with actual token positions, corrupting word-level output timestamps. User receives confident but incorrect word timings.

Fix:
Log skipped tokens and add validation; optionally use a more sophisticated alignment algorithm (e.g., longest common subsequence or Levenshtein-based matching for robustness).


[High] Audio Windowing Window Boundary Precision Loss

File: pipeline_nemo_conformer_tdt.js (lines ~100-130)

Problem:
Audio is sliced using floating-point sample indices computed from time values:

const start_sample = Math.floor(start_s * sampling_rate);
const end_sample = Math.ceil(end_s * sampling_rate);
const windowAudio = audio.subarray(start_sample, end_sample);

For a 16kHz audio file with chunk_length_s = 90:

  • 1st window: 0 to 1440000 samples
  • 2nd window: calculated start_s value may differ from 1st end due to floating-point rounding

This can cause gaps or overlaps in coverage, and more critically, cursor relocation logic (relocateNemoCursorToNearbyGap) operates on timestamps rounded to 3 decimals (roundTs), but the actual audio sample positions use floor/ceil, creating misalignment between logical window boundaries and actual sample positions.

Impact:
Long audio transcription produces gaps in coverage or duplicate segments due to window boundary misalignment.

Fix:
Maintain sample-based cursor state (integer indices) and convert to timestamps for external APIs only. Verify that consecutive windows have exact coverage without gaps.
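The suggested fix can be sketched with an integer-sample window generator: consecutive windows tile the audio exactly, and seconds are derived only at the API boundary. Names are illustrative.

```javascript
// Yield exact, gap-free windows over an audio buffer using integer sample
// indices; timestamps in seconds are derived, never the source of truth.
function* sampleWindows(totalSamples, samplingRate, chunkLengthS) {
  const windowSamples = Math.round(chunkLengthS * samplingRate);
  for (let start = 0; start < totalSamples; start += windowSamples) {
    const end = Math.min(start + windowSamples, totalSamples);
    yield { start, end, startS: start / samplingRate, endS: end / samplingRate };
  }
}
```

Because `start` of window n+1 is exactly `end` of window n, floor/ceil rounding mismatches between time-derived boundaries cannot introduce gaps or overlaps.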


[High] Missing Null/Undefined Guards in Confidence Aggregation

File: modeling_nemo_conformer_tdt.js (lines ~505-520)

Problem:
Confidence aggregation assumes arrays are non-empty but doesn't validate:

const utteranceConfidence =
    tokenConfidences && tokenConfidences.length > 0
        ? roundMetric(tokenConfidences.reduce((a, b) => a + b, 0) / tokenConfidences.length, 6)
        : null;

If tokenConfidences contains null/undefined values (from confidenceFromLogits returning NaN), the reduce produces NaN, which roundMetric() converts to 0:

function roundMetric(value, digits = 2) {
    if (!Number.isFinite(value)) return 0;  // Silently converts NaN/Infinity to 0
    // ...
}

This masks bugs: NaN inputs are silently replaced with 0, making it appear that all tokens had zero confidence.

Impact:
Incorrect confidence metrics in transcription results; user cannot detect when confidence calculation failed.

Fix:
Filter out non-finite confidence values before aggregation; log a warning:

const validConfs = tokenConfidences.filter(c => Number.isFinite(c));
const utteranceConfidence = validConfs.length > 0
    ? roundMetric(validConfs.reduce((a, b) => a + b, 0) / validConfs.length, 6)
    : null;

Medium-Severity Issues

[Medium] Feature Cache Ownership Semantics Unclear

File: feature_extraction_nemo_conformer_tdt.js + transducer_cache.js

Problem:
The cache uses symbols (NEMO_FEATURE_OUTPUT_OWNERSHIP, NEMO_FEATURE_OUTPUT_RELEASE) to signal ownership, but:

  1. The symbols are exported from feature_extractor but the cache doesn't export/use them directly
  2. Pipeline code checks inputs?.[NEMO_FEATURE_OUTPUT_RELEASE] but there's no guarantee the feature extractor actually attaches this symbol
  3. If caching is disabled, set() returns false and caller retains ownership, but the code doesn't document this contract clearly

Impact:
Developers may forget to dispose processor outputs if cache is disabled, or cache may not properly release tensors if the release callback isn't invoked by downstream code.

Fix:
Document ownership transfer explicitly; consider returning a wrapper object with guaranteed lifecycle management.


[Medium] Cache Eviction May Discard In-Use Entries

File: transducer_cache.js (lines ~170-190)

Problem:
The LRU eviction logic can evict entries that have pending borrows:

_deleteEntry(key, entry) {
    // ...
    if (entry.borrowers > 0) {
        entry.pendingDispose = true;  // Delays disposal but entry is already removed from cache
    }
}

If a new call to acquire() comes while pendingDispose is true, the entry is not in the cache, so acquire returns null, but a borrowed reference may still exist elsewhere. This creates ambiguity: is the caller responsible for disposal of a borrowed tensor that the cache says doesn't exist?

Impact:
Potential double-disposal or use-after-free if caller tries to dispose a tensor that cache has already marked for pending disposal.

Fix:
Prevent eviction of borrowed entries or document strict ordering: "Once acquired, caller must release before cache can evict."


[Medium] No Validation of Decoder IO Names at Construction

File: modeling_nemo_conformer_tdt.js (lines ~160-200)

Problem:
The resolveTransducerConfig() function is called in the constructor but doesn't validate that the session's actual input/output names match the configured IO mapping. A mismatch is only detected at runtime during the first _runDecoder() call, after initialization succeeds.

Impact:
Misconfiguration silently passes initialization and only fails during inference, wasting resources.

Fix:
Validate decoder session names in resolveTransducerConfig():

const decoderSession = sessions.decoder_model_merged;
const expectedNames = [io.decoder_encoder, io.decoder_token, ...];
const missingOutputs = expectedNames.filter(name => !decoderSession.outputNames.includes(name));
if (missingOutputs.length > 0) throw new Error(...);

Low-Severity Issues

[Low] Hardcoded Magic Numbers in Pipeline

File: pipeline_nemo_conformer_tdt.js (lines ~1-15)

Constants like NEMO_AUTO_WINDOW_THRESHOLD_S = 180, NEMO_AUTO_CHUNK_LENGTH_S = 90 are hardcoded but should be exposed as configurable options or documented in clear comments explaining why these specific values were chosen.

[Low] Test Coverage Gap: Error Paths

Tests for from_pretrained() with missing encoder/decoder files are not visible in the provided test output. This is a critical error path that should be validated.


Merge Decision: NEEDS CHANGES

This PR introduces significant functionality but has critical memory leak issues and numerical stability problems that will cause production failures:

  1. Blocker: Undisposed frame tensors in decoding loop (linear memory leak with audio duration)
  2. Blocker: Incomplete decoder output disposal (state tensor leaks)
  3. Blocker: Invalid timestamp computation when token stream is empty (breaks pipeline)
  4. Blocker: Silent NaN→0 conversion in confidence metrics (masks failures)

Required Actions Before Merge:

  • Fix frame tensor disposal in decoding loop (minimal change, high impact)
  • Audit and explicitly dispose all decoder output tensors
  • Add guards against NaN/Infinity in confidence and timestamp calculations
  • Validate utterance timestamp ranges before returning
  • Test error paths for from_pretrained with missing model files
  • Add integration test covering long audio (>5min) to detect memory leaks
  • Document feature cache ownership semantics clearly

Acceptable with follow-ups: No. Memory leaks are unacceptable in production code, especially in a framework handling user-provided audio.

ysdede added 30 commits March 1, 2026 16:56
…cache helpers

Carry over non-runtime typing fixes from the prior branch while intentionally excluding the WebGPU disable_prepacking workaround in session.js.

- Cast dynamic model.transcribe access for Nemo TDT pipeline method checks/calls.
- Cast Tensor data byteLength access in transducer cache utilities.
- Add explicit tuple/object JSDoc annotations in transducer timestamp builder.

This keeps main-based v4 work clean with latest ORT-Web on origin/main and avoids retaining the temporary encoder prepacking workaround.
- Replace legacy per-feature flags (return_token_timestamps,
  return_word_timestamps, return_utterance_timestamp) with a layered API:
  return_timestamps (utterance-level), return_words, return_tokens
- Merge duplicate outputs: words absorbs word_timestamps,
  tokens absorbs token_timestamps and token_ids
- Add per-token confidence, word-level confidence aggregation,
  utterance_confidence, and confidence_scores summary
- Gate frame confidences behind returnFrameConfidences flag
- Add return_metrics with encode/decode/total timing and RTF
- Add debug flags: returnFrameIndices, returnLogProbs, returnTdtSteps
- Fix vocab Map handling in getIdToTokenMap and _resolveVocabSize
  (tokenizer.get_vocab() returns Map in WASM binding)
- Update ASR pipeline to wire timestamp_granularity to new model flags
- Format all changed files with Prettier per CONTRIBUTING.md
…ipeline

- Add roundTs() for millisecond-precision timestamp rounding at source

- Round all confidence averages to 6 decimal places

- Round per-token and per-word confidence values

- Remove timestamp_granularity and formatting helpers from pipeline

- Pipeline returns model.transcribe() output directly

- Auto-enable return_words and return_metrics when return_timestamps is true
…imestamps, honor return_metrics kwarg

- modeling_nemo_conformer_tdt: dispose logits and new decoder state tensors
  before throwing when logitsData.length < vocabSize to prevent resource leak
- modeling_nemo_conformer_tdt: move returnFrameConfidences output block outside
  the return_timestamps guard so frame/frame_avg are emitted independently
- automatic-speech-recognition: change return_metrics from hardcoded true to
  kwargs.return_metrics ?? false to respect user intent and avoid overhead
- Accept upstream restructuring: SUPPORTED_TASKS and pipeline imports moved
  from pipelines.js to pipelines/index.js
- Migrate NemoConformerForTDT registration to pipelines/index.js accordingly
- Add MODEL_TYPES.NemoConformerTDT (id=16) to modeling_utils
- Register NemoConformerForTDT in MODEL_TYPE_MAPPING, MODEL_NAME_TO_CLASS_MAPPING,
  and MODEL_CLASS_TO_NAME_MAPPING so the base class from_pretrained, ModelRegistry,
  and is_pipeline_cached all recognise the model correctly
- Add NemoConformerTDT case to get_model_files so progress_callback receives
  accurate file size totals for encoder_model.onnx + decoder_model_merged.onnx
Standardizes internal logging to follow the upstream convention introduced
in ModelRegistry refactor.
- Guard feature extractor against empty/short audio (NaN prevention)

- Move decoder tensor init inside try block for safe disposal

- Add architecture key to MODEL_TYPE_MAPPING

- Add input validation in buildTransducerDetailedOutputs

- Harden audio cache hash against NaN samples

- Add order validation in computeTemporalDeltas

- Restore pipeline: return_timestamps truthy => words + metrics always on
- Remove all timestamp_granularity tests (feature was removed)

- Fix option names: return_tokens, return_words, return_timestamps

- Fix output fields: tokens/words arrays, not token_ids/word_timestamps

- Verify pipeline passes return_words + return_metrics when timestamps on

- Add test: return_timestamps 'word' treated as truthy
Address reviewer findings except the return_metrics policy decision.

- Fix temporal delta concatenation to interleave per frame and add dtype validation.
- Validate preemphasis range and clamp normalization variance in feature extraction.
- Remove unsafe encoder layout inference; require explicit encoder_output_layout.
- Redesign decode loop to read frame data on-demand instead of eager frame materialization.
- Deduplicate word finalization and avoid zero-filling missing word confidences.
- Tighten tests for delta layout/type checks, explicit layout requirement, call counts, and naming accuracy.
Fixes high-impact issues found in PR review validation:

- force NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT in registry overrides
- ensure encoder outputs are disposed when pre-decode validation throws
- remove stride sampling from audio cache key hashing to prevent false cache hits
- use encoder_model selector key in get_model_files for Nemo per-component dtype/device overrides

Also adds targeted regression tests for mapping, disposal behavior, file selection, and cache key correctness.
- Clamp token end timestamps to encoder frame bounds during TDT decoding.
- Validate FeatureLRUCache constructor limits to fail fast on invalid settings.
- Add regression tests for timestamp clamping and cache limit validation.
Dispose intermediate tensors in computeTemporalDeltas concatenate paths and dispose replaced base input features when delta concatenation returns a new tensor.

Add regression tests that assert disposal behavior for delta concatenate flows and feature extractor reassignment.
Dispose non-essential Tensor outputs returned by decoder steps to prevent cumulative memory growth. Keep logits/state tensors alive for decoding/state transitions and dispose extras immediately.

Add regression test to assert auxiliary decoder tensor outputs are disposed each step.
Compute encoder length directly from attention_mask.data instead of attention_mask.tolist() to avoid large transient array allocations in ASR decode hot path.
Fail fast when duration logits are required but missing in decoder output, and enforce positive-integer vocab size at runtime config validation.

Validate prepared Nemo pipeline audio for non-empty finite samples before processor/model calls.

Add regression tests for missing duration logits and non-finite audio rejection.
Fix placeholder interpolation in _prepare_model_inputs error text.

Add fail-fast validation for Nemo delta_window and reject duplicate decoder output aliases in transducer io config.

Add regression tests for delta_window validation and duplicate decoder output alias rejection.
Validate transcribe timeOffset as finite and guard encoderOutputs cleanup path to avoid masking primary failures.

Align transducer_text JSDoc token type with runtime shape (include id).

Harden Parakeet feature extractor test by using direct mask data and explicit tensor disposal via try/finally; add timeOffset validation regression test.
- fail fast on missing decoder state outputs and invalid encoder layout enums
- make FeatureLRUCache own cached tensor lifetimes (replace/evict/clear) with deduped disposal and deterministic size fallback
- validate n_fft/win_length in Nemo feature extractor
- align Nemo ASR pipeline docs with actual forwarded options
- add regression coverage for runtime config validation, non-concatenated deltas/cache behavior, missing decoder state outputs, and cache disposal semantics

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
Apply Gemini review nit in Nemo decode loop by replacing a redundant duration expression with Math.max(1, step).

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
Checklist (bot comment IDs):
- [x] 2892132356: guard tokenizer.get_vocab() return type before Object.keys in _resolveVocabSize.
- [x] 2892132367: treat zero cache limits as explicit no-cache mode; do not store/dispose just-produced values.
- [x] 2892132372: dispose processor tensors in Nemo ASR pipeline when cache does not own lifetimes.

Added regression tests for vocab resolution fallback, zero-limit cache semantics, and Nemo pipeline tensor ownership behavior.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
- widen confidenceFromLogits input type to Tensor data arrays

- narrow feature_cache access with explicit typed cast in ASR pipeline
Checklist (bot comment IDs):
- [x] 2892287484: handle array-returning tokenizer vocab in _resolveVocabSize.
- [x] 2892322884: avoid disposing when re-setting the same object for an existing cache key.
- [x] 2892322906: skip caching oversized values to prevent insert-then-dispose of caller-owned tensors.
- [x] 2892322910: guard byteLength type in estimateSizeBytes.

Added regression tests for array vocab sizing, same-object set behavior, oversized value skipping, and non-numeric byteLength handling.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
ysdede added 12 commits March 6, 2026 01:44
Align the Nemo ASR pipeline with the shared task contract by returning
text-only results by default and chunk-based timestamps for segment and
word modes. Add automatic long-audio windowing, decoded-text-driven word
reconstruction, and model-local helpers for window merge and chunk
assembly.

Also add regression coverage for numeric/punctuation word boundaries,
windowed merge behavior, and auto-windowed long-form pipeline decoding.
Remove the standalone parakeet feature extractor test from this branch.
It exercises an existing parakeet_ctc path that is outside the scope of
Conformer TDT integration and makes the PR look broader than it is.
Use conservative sentence boundaries for pipeline timestamps and long-audio cursoring in the NeMo Conformer TDT pipeline. This keeps the HF-style pipeline contract while replacing the old fixed-window merge path with sentence-driven retranscription.

Also remove dead NeMo window-merge helpers, delete the obsolete compatibility barrel, and extend the model and pipeline tests around cache handling, timestamps, and long-audio behavior.
Keep the shared ASR pipeline suite focused on the public Nemo contract and move adapter-specific windowing, retranscription, cache-ownership, and disposal coverage into a dedicated Nemo pipeline test file.

Narrow the source diff by removing explanatory Nemo comments and reverting unrelated upstream-only tweaks, while also fixing the review findings around cursor snap-forward merging, tokenizer vocab-shape handling, empty timestamp validation, and cache borrow/release semantics for active inference.

Verification:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Apply the remaining valid Nemo Conformer TDT review fixes without widening the shared ASR pipeline surface.

- honor encoder_input_layout for canonical input_features feeds
- keep borrowed cache entries counted until they are actually released
- reject tokenizer-less non-empty word-offset reconstruction
- raise the auto-window budget to match the minimum guaranteed cursor advance
- add focused model and pipeline regressions for each fix

Verified with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Restore the original cast spacing in the unrelated moonshine path so the Nemo PR does not carry an extra formatting-only diff in automatic-speech-recognition.js.
Resolve sparse tokenizer vocab fallback by deriving the runtime size from the maximum token id instead of counting entries. This keeps decoder sizing correct when tokenizer ids are non-contiguous.

Tighten merged-word dedupe so punctuation-only overlaps are only collapsed when their raw normalized text also matches, which avoids dropping distinct punctuation tokens across window boundaries.

Add focused Nemo model regressions and verify with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
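The sparse-vocab sizing fix above amounts to taking the maximum token id rather than counting entries; a minimal sketch (not the PR's exact `_resolveVocabSize` implementation, and the id-to-token map shape is an assumption):

```javascript
// Size a decoder vocab from a possibly sparse id -> token map:
// use (max id + 1), not the number of entries, so non-contiguous
// tokenizer ids still produce a correct logits dimension.
function resolveVocabSizeFromIds(idToToken) {
    let maxId = -1;
    for (const key of Object.keys(idToToken)) {
        const id = Number(key);
        if (Number.isInteger(id) && id > maxId) maxId = id;
    }
    return maxId + 1;
}
```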
Treat likely domain suffixes as continuations when tokenizer decoding inserts whitespace after a trailing period, so sequences like `LibriVox. org.` reconstruct as `LibriVox.org.` in detailed word offsets.

Add a focused regression covering the split `.org` token pattern and verify with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
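The domain-suffix continuation described above can be sketched as a small join rule (the suffix list and helper name here are illustrative assumptions, not the PR's exact heuristic):

```javascript
// Treat a likely domain suffix as a continuation of the previous word when
// the previous word ends with a period, e.g. 'LibriVox.' + 'org.' -> 'LibriVox.org.'.
const DOMAIN_SUFFIXES = new Set(['org', 'com', 'net', 'io', 'edu']);

function joinDomainSuffix(prevWord, nextWord) {
    const core = nextWord.toLowerCase().replace(/\.+$/, '');
    if (prevWord.endsWith('.') && DOMAIN_SUFFIXES.has(core)) {
        return prevWord + nextWord;
    }
    return null; // not a domain continuation; keep the words separate
}
```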
@coderabbitai

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

Walkthrough

This pull request introduces comprehensive NeMo Conformer TDT integration, adding a new encoder-only ASR model type with feature extraction, transducer-based decoding, audio caching, delta feature computation, and pipeline support. It also includes the corresponding processor, registry updates, and extensive test coverage.

Changes

Cohort / File(s) Summary
Core Model Implementation
packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
Adds NemoConformerTDTPreTrainedModel and NemoConformerForTDT with frame-by-frame transducer decoding, encoder/decoder session management, duration-based token alignment, and optional timestamp/word extraction. Feature extractor provides log-Mel spectrogram computation with preemphasis, optional caching, delta/delta-delta support, and per-frame normalization. Processor delegates to feature extractor.
Transducer Utility Modules
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, transducer_deltas.js, transducer_word_offsets.js, transducer_text.js, transducer_segment_offsets.js, transducer_window_merge.js
Implements LRU caching with ownership tracking and size enforcement, temporal delta computation with optional concatenation, tokenizer-agnostic word-boundary extraction from token sequences, text decoding and detailed output construction, sentence-boundary heuristics for segmentation, and word deduplication with normalization. Shared infrastructure for transducer pipeline.
Pipeline Integration
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/src/pipelines/index.js
Adds runNemoConformerTDTPipeline with auto-windowing transcription, word merging, segment deduplication, and timestamp alignment. Integrates into AutomaticSpeechRecognitionPipeline via routing case and adds NemoConformerForTDT to supported models list.
Registry and Module Exports
packages/transformers/src/models/modeling_utils.js, feature_extractors.js, models.js, processors.js, registry.js, packages/transformers/src/utils/model_registry/get_model_files.js
Registers the NemoConformerTDT model type (ID 16) in the MODEL_TYPES enumeration. Adds re-exports for feature extractor, model, and processor modules. Updates registry with MODEL_MAPPING_NAMES_ENCODER_ONLY and CUSTOM_MAPPING entries. Extends model file resolution to load encoder_model and decoder_model_merged for this type.
Test Suites
packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, test_modeling_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js, test_pipelines_automatic_speech_recognition.js
Comprehensive test coverage for feature extraction (delta modes, caching, ownership semantics), model decoding with edge cases (missing outputs, vocab resolution, resource disposal), pipeline windowing and merging logic, and integration into ASR pipeline with mock components.

Sequence Diagram(s)

sequenceDiagram
    participant Audio as Audio Input
    participant Processor as NemoConformerTDTProcessor
    participant FeatureExtractor as NemoConformerTDTFeatureExtractor
    participant Model as NemoConformerForTDT
    participant Encoder as ONNX Encoder Session
    participant Decoder as ONNX Decoder Session
    participant Output as Transcription Output

    Audio->>Processor: call(audio)
    Processor->>FeatureExtractor: call(audio)
    FeatureExtractor->>FeatureExtractor: compute spectrogram + preemphasis
    FeatureExtractor->>FeatureExtractor: apply mel-filterbank
    FeatureExtractor->>FeatureExtractor: normalize per-frame
    FeatureExtractor->>FeatureExtractor: optionally cache features
    FeatureExtractor-->>Processor: input_features + attention_mask
    Processor-->>Model: input_features + attention_mask
    
    Model->>Encoder: encode(input_features)
    Encoder-->>Model: encoder_outputs
    
    Model->>Model: initialize decoder state
    
    loop Frame-by-Frame Decoding
        Model->>Decoder: decode(encoder_frame, token, state)
        Decoder-->>Model: logits, output_states
        Model->>Model: argmax logit → token
        Model->>Model: extract duration from logits
        Model->>Model: advance frame by duration
    end
    
    Model->>Model: build word offsets from tokens + timestamps
    Model-->>Output: text + words + timestamps + confidence
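The frame-by-frame loop in the diagram above can be sketched as a stateless greedy TDT decode (shapes, names, and the per-frame logits layout are illustrative assumptions; the real decoder updates its state per emitted token and may emit several tokens per frame):

```javascript
// Greedy token-and-duration transducer sketch: per frame, pick the best
// token and the best duration from a duration head appended after the
// vocab logits, emit non-blank tokens, then jump ahead by the duration.
function tdtGreedyDecode(frameLogits, durations, blankId) {
    const tokens = [];
    const frames = [];
    const vocab = frameLogits[0].length - durations.length;
    let t = 0;
    while (t < frameLogits.length) {
        const logits = frameLogits[t];
        // argmax over token logits
        let best = 0;
        for (let v = 1; v < vocab; ++v) {
            if (logits[v] > logits[best]) best = v;
        }
        // argmax over the duration head
        let bestDur = 0;
        for (let d = 1; d < durations.length; ++d) {
            if (logits[vocab + d] > logits[vocab + bestDur]) bestDur = d;
        }
        if (best !== blankId) {
            tokens.push(best);
            frames.push(t);
        }
        // this stateless sketch forces progress by at least one frame
        t += Math.max(1, durations[bestDur]);
    }
    return { tokens, frames };
}
```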
sequenceDiagram
    participant Audio as Long Audio
    participant Pipeline as runNemoConformerTDTPipeline
    participant Processor as Processor
    participant Model as Model.transcribe()
    participant Segmenter as Segmenter
    participant Merger as dedupeMergedWords
    participant Output as Final Output

    Audio->>Pipeline: audio (streaming windowed)
    Pipeline->>Pipeline: partition into windows (90s default)
    
    loop For Each Window
        Pipeline->>Processor: process_window(audio_chunk)
        Processor-->>Pipeline: features
        
        Pipeline->>Model: transcribe(features)
        Model-->>Pipeline: {text, words[], utteranceTimestamp}
        
        Pipeline->>Merger: dedupe words from merged decoding
        Merger-->>Pipeline: deduplicated words[]
        
        Pipeline->>Segmenter: partition words into sentences
        Segmenter-->>Pipeline: segments[]
        
        Pipeline->>Pipeline: merge with previous segments
        Pipeline->>Pipeline: validate no overlap/gaps
    end
    
    Pipeline-->>Output: {text, chunks[timestamp, text]}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 40.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the title '[WIP] Add NeMo Conformer TDT ASR support' clearly identifies the main feature being added: NeMo Conformer TDT automatic speech recognition support. It accurately describes the primary change across the changeset.


@ysdede
Owner Author

ysdede commented Mar 9, 2026

@coderabbitai review

@coderabbitai

coderabbitai bot commented Mar 9, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Repository owner deleted a comment from gemini-code-assist bot Mar 9, 2026
@ysdede ysdede marked this pull request as ready for review March 9, 2026 22:28
@qodo-code-review

Review Summary by Qodo

Add NeMo Conformer TDT ASR support with pipeline integration and comprehensive testing

✨ Enhancement 🧪 Tests


Walkthroughs

Description
• Adds comprehensive NeMo Conformer TDT ASR model support to transformers.js with full pipeline
  integration
• Implements NemoConformerForTDT model class with greedy token-and-duration transducer decoding
  and transcribe() method supporting timestamps, confidence scores, and detailed outputs
• Adds NemoConformerTDTFeatureExtractor with log-mel spectrogram extraction, temporal
  delta/delta-delta features, and LRU caching with tensor lifecycle management
• Integrates NeMo pipeline into AutomaticSpeechRecognitionPipeline with automatic long-audio
  windowing (180+ seconds) using sentence-cursor restart logic
• Implements word reconstruction from token sequences with boundary detection and punctuation
  handling via buildTransducerWordOffsets()
• Adds sentence-level segmentation heuristic with special handling for abbreviations, acronyms, and
  non-breaking periods
• Registers model, processor, and feature extractor in model registry with dual-artifact file
  resolution (encoder_model and decoder_model_merged)
• Includes comprehensive test coverage: model initialization/validation, TDT decoding, feature
  extraction, pipeline integration, long-audio handling, and tensor disposal
• Fixes from review: pending-prefix preservation, vocab handling validation, cache borrow/release,
  encoder input layout, auto-window budget, and punctuation-only deduplication
Diagram
flowchart LR
  A["Audio Input"] -->|Feature Extraction| B["NemoConformerTDTFeatureExtractor"]
  B -->|Log-mel + Deltas| C["Feature Cache"]
  C -->|Cached Features| D["NemoConformerForTDT Model"]
  D -->|Encoder/Decoder| E["TDT Greedy Decoding"]
  E -->|Token Sequences| F["buildTransducerWordOffsets"]
  F -->|Word Boundaries| G["partitionNemoWordsIntoSegments"]
  G -->|Sentence Chunks| H["ASR Pipeline Output"]
  I["Long Audio 180s+"] -->|Auto-window| J["Sentence-cursor Restart"]
  J -->|Windowed Segments| D


File Changes

1. packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js 🧪 Tests +1215/-0

NemoConformerForTDT model and utility test suite

• Comprehensive test suite for NemoConformerForTDT model with 1215 lines covering model
 initialization, validation, and TDT decoding
• Tests for greedy token-and-duration transducer decoding with timestamp and confidence calculations
• Utility function tests for word reconstruction, temporal deltas, audio caching, and LRU cache
 management
• Regression tests for encoder/decoder output validation, state management, and tensor disposal

packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js


2. packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js ✨ Enhancement +958/-0

NemoConformerForTDT model implementation with TDT decoding

• Implements NemoConformerForTDT class extending NemoConformerTDTPreTrainedModel with TDT greedy
 decoding
• Provides transcribe() method supporting timestamps, word/token-level outputs, confidence scores,
 and metrics
• Handles encoder/decoder session management with configurable I/O names and layout transformations
• Validates runtime configuration including vocab size, blank token ID, and duration indices

packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js


3. packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js ✨ Enhancement +345/-0

NemoConformerForTDT ASR pipeline with long-audio windowing

• Implements runNemoConformerTDTPipeline() for ASR task integration with automatic long-audio
 windowing
• Handles sentence-level segmentation with cursor-based restart logic for audio exceeding 180
 seconds
• Supports word-level and sentence-level timestamp chunks with deduplication and gap detection
• Manages feature cache ownership and tensor lifecycle through pipeline execution

packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js


4. packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js ✨ Enhancement +266/-0

NemoConformerForTDT feature extraction with caching

• Implements NemoConformerTDTFeatureExtractor for log-mel spectrogram extraction with NeMo-style
 preprocessing
• Supports optional temporal delta and delta-delta features with configurable concatenation
• Integrates LRU cache for feature caching with ownership tagging and lifecycle management
• Includes preemphasis, mel-filtering, and feature normalization aligned with Parakeet model
 requirements

packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js


5. packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js 🧪 Tests +77/-1

NemoConformerForTDT ASR pipeline integration tests

• Adds NemoConformerForTDT-specific pipeline tests covering timestamp modes (disabled,
 sentence-level, word-level)
• Tests verify correct delegation to model.transcribe() with appropriate option flags
• Validates output shape for text-only, sentence chunks, and word chunks modes
• Removes unused long audio test data variable

packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js


6. packages/transformers/src/models/modeling_utils.js ⚙️ Configuration changes +1/-0

Add NemoConformerTDT model type constant

• Adds NemoConformerTDT: 16 to MODEL_TYPES enum for model type registration

packages/transformers/src/models/modeling_utils.js


7. packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js 🧪 Tests +793/-0

NeMo Conformer TDT pipeline adapter test suite

• Added comprehensive test suite for NeMo Conformer TDT pipeline adapter with 13 test cases
• Tests cover sentence chunking, long-audio windowing with configurable chunk_length_s,
 boundary-truncated sentence replacement, and pending prefix preservation
• Tests validate word reconstruction from merged words, auto-windowing for 300-second audio, and
 non-finite audio sample rejection
• Tests verify tensor ownership and disposal behavior with feature cache enabled/disabled

packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js


8. packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js 🧪 Tests +311/-0

NeMo Conformer TDT feature extractor test coverage

• Added 14 test cases for NemoConformerTDTFeatureExtractor covering mel-bin configurations (80,
 128)
• Tests validate temporal delta and delta-delta feature computation with concatenation and
 non-concatenation modes
• Tests verify feature cache behavior including cache hits, ownership marking, and eviction policies
• Tests validate configuration constraints for preemphasis, delta_window, n_fft, and win_length
 parameters

packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js


9. packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js ✨ Enhancement +231/-0

Token-to-word boundary reconstruction for transducers

• Implemented buildTransducerWordOffsets() function to reconstruct word boundaries from token
 sequences with timestamps
• Added tokenizer-agnostic token piece resolution with boundary marker detection (▁, Ġ, Ċ)
• Implemented text alignment logic to handle spacing and punctuation around tokens
• Supports optional token confidence aggregation with word-level averaging

packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js


10. packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js ✨ Enhancement +259/-0

LRU cache for audio feature extraction

• Implemented FeatureLRUCache class for caching extracted audio features with configurable
 size/entry limits
• Added createAudioCacheKey() function using FNV-1a hashing for deterministic cache keys
• Implemented borrower tracking and deferred disposal for cached tensors with pending eviction
• Supports cache statistics and explicit cache clearing with proper tensor lifecycle management

packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
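A deterministic audio cache key in the spirit of `createAudioCacheKey()` can be sketched with standard FNV-1a constants (the byte-mixing scheme and key format below are assumptions, not the PR's exact implementation):

```javascript
// FNV-1a 32-bit hash over the raw bytes of the audio buffer, combined with
// sampling rate and length into a deterministic cache key string.
function fnv1aAudioKey(audio, samplingRate) {
    let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
    const view = new DataView(audio.buffer, audio.byteOffset, audio.byteLength);
    for (let i = 0; i < audio.byteLength; ++i) {
        hash ^= view.getUint8(i);
        hash = Math.imul(hash, 0x01000193); // FNV-1a 32-bit prime
    }
    return `${samplingRate}:${audio.length}:${(hash >>> 0).toString(16)}`;
}
```

Hashing the underlying bytes (rather than stringifying floats) keeps the key pass allocation-free and stable across identical buffers.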


11. packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js ✨ Enhancement +178/-0

Sentence-level segmentation from word timestamps

• Implemented conservative sentence segmentation heuristic for ASR word timestamps
• Added partitionNemoWordsIntoSegments() to group words into sentence-like chunks based on
 punctuation and capitalization
• Implemented shouldEndSentenceAfterWord() with special handling for abbreviations, acronyms, and
 non-breaking periods
• Provides buildNemoSegmentChunks() to convert word segments into timestamp-annotated chunks

packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
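The conservative boundary rule behind `shouldEndSentenceAfterWord()` can be sketched as follows (the abbreviation list, acronym pattern, and capitalization check are illustrative assumptions, not the PR's exact rules):

```javascript
// End a sentence on terminal punctuation, unless the word is a known
// abbreviation or looks like a dotted acronym; if a next word exists,
// require it to plausibly start a new sentence.
const NON_BREAKING = new Set(['mr.', 'mrs.', 'dr.', 'st.', 'vs.', 'etc.']);

function shouldEndSentenceAfterWord(word, nextWord) {
    if (!/[.!?]$/.test(word)) return false;
    if (NON_BREAKING.has(word.toLowerCase())) return false;
    if (/^(?:[A-Za-z]\.)+$/.test(word)) return false; // acronym like "U.S."
    return nextWord === undefined || /^[A-Z0-9"']/.test(nextWord);
}
```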


12. packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js ✨ Enhancement +91/-0

Temporal delta feature computation

• Implemented computeTemporalDeltas() function for computing delta and delta-delta features from
 [1, T, F] tensors
• Supports configurable delta order (1 or 2) and window size with proper normalization
• Implements frame-wise interleaving for concatenated delta output
• Properly disposes intermediate tensors when concatenating delta features

packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
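Regression-style temporal deltas over a feature matrix can be sketched like this (the PR's `computeTemporalDeltas()` works on [1, T, F] tensors and may differ in edge handling; this sketch uses the standard HTK-style formula with replicated edges over a flattened [T, F] array):

```javascript
// delta_t = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2),
// with out-of-range frames clamped to the first/last frame.
function temporalDeltas(features, T, F, window = 2) {
    const out = new Float32Array(T * F);
    let denom = 0;
    for (let n = 1; n <= window; ++n) denom += n * n;
    denom *= 2;
    const clamp = (t) => Math.min(Math.max(t, 0), T - 1); // replicate edges
    for (let t = 0; t < T; ++t) {
        for (let f = 0; f < F; ++f) {
            let acc = 0;
            for (let n = 1; n <= window; ++n) {
                acc += n * (features[clamp(t + n) * F + f] - features[clamp(t - n) * F + f]);
            }
            out[t * F + f] = acc / denom;
        }
    }
    return out;
}
```

Delta-delta features are then just this function applied to its own output.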


13. packages/transformers/src/pipelines/automatic-speech-recognition.js ✨ Enhancement +16/-0

NeMo Conformer TDT pipeline integration

• Added dispatch case for nemo-conformer-tdt model type in _call() method
• Integrated runNemoConformerTDTPipeline() as new ASR pipeline handler
• Imported NeMo-specific pipeline implementation from dedicated module

packages/transformers/src/pipelines/automatic-speech-recognition.js


14. packages/transformers/src/utils/model_registry/get_model_files.js ✨ Enhancement +5/-2

Dual-artifact model file resolution for NeMo

• Added model file resolution for NemoConformerTDT type requiring dual artifacts: encoder_model
 and decoder_model_merged
• Fixed indentation inconsistency in warning message formatting

packages/transformers/src/utils/model_registry/get_model_files.js


15. packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js ✨ Enhancement +30/-0

Text decoding and detailed output generation

• Implemented decodeTransducerText() for converting token IDs to final transcription text
• Implemented buildTransducerDetailedOutputs() to generate word/token outputs with optional
 confidence aggregation
• Delegates word offset reconstruction to buildTransducerWordOffsets() utility

packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js


16. packages/transformers/src/models/registry.js ⚙️ Configuration changes +2/-0

NeMo Conformer TDT model registry entries

• Added nemo-conformer-tdt to MODEL_MAPPING_NAMES_ENCODER_ONLY registry with model class
 NemoConformerForTDT
• Added NemoConformerForTDT to CUSTOM_MAPPING with corresponding model files and type

packages/transformers/src/models/registry.js


17. packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js ✨ Enhancement +43/-0

Deduplication of merged window words

• Implemented dedupeMergedWords() function to remove duplicate words from merged window outputs
• Uses normalized text comparison with fallback to raw text for punctuation-only words
• Preserves longer duration word when duplicates overlap in time

packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
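The dedupe behavior described above can be sketched in a few lines (the normalization rule and word shape `{ text, start, end }` are illustrative assumptions, not the PR's exact code):

```javascript
// Drop a word when it overlaps its predecessor in time and its normalized
// text matches, keeping the longer-duration occurrence. Punctuation-only
// words normalize to the empty string, so they fall back to raw-text
// comparison and distinct punctuation tokens are never collapsed.
const normalize = (s) => s.toLowerCase().replace(/[^\p{L}\p{N}]/gu, '');

function dedupeMergedWords(words) {
    const out = [];
    for (const word of words) {
        const prev = out[out.length - 1];
        const overlaps = prev && word.start < prev.end;
        const sameText =
            prev && (normalize(word.text) || word.text) === (normalize(prev.text) || prev.text);
        if (overlaps && sameText) {
            if (word.end - word.start > prev.end - prev.start) out[out.length - 1] = word;
            continue;
        }
        out.push(word);
    }
    return out;
}
```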


18. packages/transformers/src/pipelines/index.js ⚙️ Configuration changes +2/-1

NeMo Conformer TDT pipeline registration

• Added NemoConformerForTDT import from nemo_conformer_tdt module
• Added NemoConformerForTDT to supported models list for automatic-speech-recognition task

packages/transformers/src/pipelines/index.js


19. packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js ✨ Enhancement +19/-0

NeMo Conformer TDT processor implementation

• Implemented NemoConformerTDTProcessor class extending Processor base class
• Configured tokenizer and feature extractor auto-discovery via AutoTokenizer and
 AutoFeatureExtractor
• Implemented _call() method delegating to feature extractor for audio preprocessing

packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js


20. packages/transformers/src/models/feature_extractors.js ⚙️ Configuration changes +1/-0

Feature extractor module exports

• Added export for NeMo Conformer TDT feature extraction module

packages/transformers/src/models/feature_extractors.js


21. packages/transformers/src/models/models.js ⚙️ Configuration changes +1/-0

Model module exports

• Added export for NeMo Conformer TDT modeling module

packages/transformers/src/models/models.js


22. packages/transformers/src/models/processors.js ⚙️ Configuration changes +1/-0

Processor module exports

• Added export for NeMo Conformer TDT processor module

packages/transformers/src/models/processors.js



@qodo-code-review

qodo-code-review bot commented Mar 9, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. Disposed tensors returned 🐞 Bug ⛯ Reliability
Description
NemoConformerTDTFeatureExtractor._call() returns extracted as caller-owned when
FeatureLRUCache.set() returns false, but FeatureLRUCache.set() may already have evicted and
disposed that same value, yielding disposed tensors to downstream code.
Code

packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js[R161-172]

+        if (this.feature_cache) {
+            const key = `${createAudioCacheKey(audio, this.config.sampling_rate)}:${this.delta_order}:${this.delta_window}:${this.delta_concatenate}`;
+            const cached = this.feature_cache.acquire(key);
+            if (cached) {
+                return tagNemoFeatureOutputOwnership({ ...cached.value }, true, cached.release);
+            }
+
+            const extracted = await this._extract(audio);
+            const cacheOwnsTensors = this.feature_cache.set(key, extracted);
+            if (!cacheOwnsTensors) {
+                return tagNemoFeatureOutputOwnership({ ...extracted }, false);
+            }
Evidence
The feature extractor assumes set(...)=false means it should return the just-created tensors as
caller-owned. However, the cache implementation can evict the just-inserted entry during _evict()
and dispose it in _deleteEntry() even though set() returns false; the repo’s own tests
explicitly assert this behavior (set returns false and the inserted tensor is disposed). Because
Tensor.dispose() is unguarded, returning disposed tensors can lead to use-after-dispose or
double-dispose failures downstream (e.g., the pipeline disposes non-cache-owned inputs).

packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js[158-178]
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js[83-128]
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js[145-176]
packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js[1114-1147]
packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js[288-299]
packages/transformers/src/utils/tensor.js[112-115]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`NemoConformerTDTFeatureExtractor` treats `FeatureLRUCache.set(...)=false` as “cache did not retain ownership, so the returned tensors are caller-owned and valid”. But the cache can return `false` after evicting and disposing the inserted value, which can cause downstream use-after-dispose/double-dispose.

### Issue Context
- `FeatureLRUCache.set()` currently calls `_evict()` and may dispose entries via `_deleteEntry()`.
- Repo tests already assert that `set()` can return `false` while disposing the inserted value.
- The feature extractor returns `extracted` when `set()` returns `false`, which becomes unsafe.

### Fix Focus Areas
- packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js[83-176]
- packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js[158-178]
- packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js[1114-1147]




Remediation recommended

2. Redundant audio full scans 🐞 Bug ➹ Performance
Description
The NeMo pipeline performs a full-sample finiteness scan and (when cache is enabled) a full-sample
hash pass to build the cache key, adding avoidable O(n) overhead for long audio.
Code

packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js[R22-37]

+function validateNemoAudio(audio, index) {
+    if (!(audio instanceof Float32Array || audio instanceof Float64Array)) {
+        throw new TypeError(
+            `Nemo Conformer TDT pipeline expected audio at index ${index} to be Float32Array or Float64Array.`,
+        );
+    }
+    if (audio.length === 0) {
+        throw new Error(`Nemo Conformer TDT pipeline expected non-empty audio at index ${index}.`);
+    }
+    for (let i = 0; i < audio.length; ++i) {
+        if (!Number.isFinite(audio[i])) {
+            throw new Error(
+                `Nemo Conformer TDT pipeline expected finite audio samples; found ${audio[i]} at index ${index}:${i}.`,
+            );
+        }
+    }
Evidence
validateNemoAudio iterates over all samples to ensure finiteness, and createAudioCacheKey
iterates over all samples to compute an FNV-1a hash. When feature caching is enabled, both passes
occur in the same pipeline call path (prepare -> validate -> processor -> feature extractor keying).

packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js[22-38]
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js[9-25]
packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js[161-163]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
There are two separate full O(n) loops over the audio samples: one for finiteness validation and one for cache-key hashing. This adds overhead for long audio.

### Issue Context
- `validateNemoAudio()` scans the entire audio array.
- `createAudioCacheKey()` scans the entire audio array.
- With feature cache enabled, both occur in the same request path.

### Fix Focus Areas
- packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js[22-38]
- packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js[9-25]
- packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js[161-163]
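One possible remediation is to fuse the finiteness check and the cache-key hash into a single pass over the samples (the function name, key format, and sample-to-byte projection below are illustrative assumptions; a real fix would hash the same bytes `createAudioCacheKey` does):

```javascript
// Single O(n) pass: validate finiteness and accumulate an FNV-1a-style
// hash at the same time, instead of two separate full scans.
function validateAndKeyAudio(audio, samplingRate) {
    let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
    for (let i = 0; i < audio.length; ++i) {
        const sample = audio[i];
        if (!Number.isFinite(sample)) {
            throw new Error(`non-finite audio sample at index ${i}`);
        }
        // Coarse integer projection of the sample for illustration only.
        hash ^= Math.round(sample * 32767) & 0xff;
        hash = Math.imul(hash, 0x01000193); // FNV-1a 32-bit prime
    }
    return `${samplingRate}:${audio.length}:${(hash >>> 0).toString(16)}`;
}
```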





@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant new functionality by adding support for NeMo Conformer TDT ASR models. The changes are extensive, including new model and feature extractor implementations, pipeline integration, long-audio handling, and various utilities for decoding and post-processing. The code is well-structured and includes comprehensive test coverage. My review focuses on improving the maintainability of some of the larger, more complex methods by suggesting refactoring opportunities.

Comment on lines +197 to +225
// normalize mel features, ignoring padding
const sum = new Float64Array(num_features);
const sum_sq = new Float64Array(num_features);

for (let i = 0; i < features_length; ++i) {
    const offset = i * num_features;
    for (let j = 0; j < num_features; ++j) {
        const val = features_data[offset + j];
        sum[j] += val;
        sum_sq[j] += val * val;
    }
}

// Skip normalization for empty/very short audio to avoid NaN from divide-by-zero.
if (features_length > 0) {
    // Calculate mean and standard deviation, then normalize
    const divisor = features_length > 1 ? features_length - 1 : 1;
    for (let j = 0; j < num_features; ++j) {
        const mean = sum[j] / features_length;
        const variance = (sum_sq[j] - features_length * mean * mean) / divisor;
        const std = Math.sqrt(Math.max(variance, 0)) + EPSILON;
        const inv_std = 1 / std;

        for (let i = 0; i < features_length; ++i) {
            const index = i * num_features + j;
            features_data[index] = (features_data[index] - mean) * inv_std;
        }
    }
}

medium

To improve readability and separation of concerns, consider extracting this feature normalization logic into a separate private helper method. The current implementation is nested within the _extract method and is quite dense. A dedicated function, for example _normalizeFeatures(featuresData, numFeatures, featuresLength), would make the _extract method easier to follow.
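As a sketch of that suggestion (the function name and signature come from the comment above; the EPSILON constant is assumed, and a free function is used instead of a private method so it runs standalone):

```javascript
// Sketch of the suggested helper. The EPSILON value is an assumption,
// not taken from the PR; the math mirrors the inline normalization loop.
const EPSILON = 1e-5;

function normalizeFeatures(featuresData, numFeatures, featuresLength) {
  // Skip normalization for empty input to avoid NaN from divide-by-zero.
  if (featuresLength <= 0) return;

  const sum = new Float64Array(numFeatures);
  const sumSq = new Float64Array(numFeatures);
  for (let i = 0; i < featuresLength; ++i) {
    const offset = i * numFeatures;
    for (let j = 0; j < numFeatures; ++j) {
      const val = featuresData[offset + j];
      sum[j] += val;
      sumSq[j] += val * val;
    }
  }

  // Sample standard deviation (N - 1), matching the original divisor choice.
  const divisor = featuresLength > 1 ? featuresLength - 1 : 1;
  for (let j = 0; j < numFeatures; ++j) {
    const mean = sum[j] / featuresLength;
    const variance = (sumSq[j] - featuresLength * mean * mean) / divisor;
    const invStd = 1 / (Math.sqrt(Math.max(variance, 0)) + EPSILON);
    for (let i = 0; i < featuresLength; ++i) {
      const index = i * numFeatures + j;
      featuresData[index] = (featuresData[index] - mean) * invStd;
    }
  }
}
```

The `_extract` method would then reduce to a single call per feature buffer, and the normalization math becomes testable in isolation.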

Comment on lines +588 to +942
async transcribe(model_inputs, decode_options = {}) {
const {
tokenizer = null,
returnTimestamps: returnTimestampsOption,
return_timestamps: legacyReturnTimestamps,
returnWords: returnWordsOption,
return_words: legacyReturnWords,
returnTokens: returnTokensOption,
return_tokens: legacyReturnTokens,
returnMetrics: returnMetricsOption,
return_metrics: legacyReturnMetrics,
returnFrameConfidences = false,
returnFrameIndices = false,
returnLogProbs = false,
returnTdtSteps = false,
timeOffset = 0,
} = decode_options;
const returnTimestamps = returnTimestampsOption ?? legacyReturnTimestamps ?? true;
const returnWords = returnWordsOption ?? legacyReturnWords ?? false;
const returnTokens = returnTokensOption ?? legacyReturnTokens ?? false;
const returnMetrics = returnMetricsOption ?? legacyReturnMetrics ?? false;

if (!Number.isFinite(timeOffset)) {
throw new Error('NemoConformerForTDT.transcribe expected `timeOffset` to be a finite number.');
}
const totalStart = nowMs();
const io = this.transducer.io;
const vocabSize = this._resolveVocabSize(tokenizer);
this._validateRuntimeConfig(vocabSize);

const { feeds: encoderFeeds, disposables } = this._buildEncoderFeeds(model_inputs);
let encoderOutputs;
const encodeStart = nowMs();
try {
encoderOutputs = await this._runEncoder(encoderFeeds);
} finally {
for (const tensor of disposables) {
tensor.dispose();
}
}
const encodeMs = nowMs() - encodeStart;

let frameCount = 0;
let encoderOutput = null;
const frameTime = this.transducer.subsampling_factor * this.transducer.frame_shift_s;

const numLayers = this.transducer.decoder.num_layers;
const hiddenSize = this.transducer.decoder.hidden_size;
const blankId = this.transducer.blank_token_id;
const maxSymbolsPerStep = this.transducer.max_symbols_per_step;

const needConfidences = !!returnTimestamps;

/** @type {number[]} */
const tokenIds = [];
/** @type {[number, number][]} */
const tokenTimestamps = [];
/** @type {number[] | null} */
const tokenConfidences = needConfidences ? [] : null;
/** @type {Map<number, { sum: number, count: number }> | null} */
const frameConfidenceStats = returnFrameConfidences ? new Map() : null;
/** @type {number[] | null} */
const frameIndices = returnFrameIndices ? [] : null;
/** @type {number[] | null} */
const logProbs = returnLogProbs || needConfidences ? [] : null;
/** @type {number[] | null} */
const tdtSteps = returnTdtSteps ? [] : null;

let decoderState;
let targetLengthTensor;
let reusableFrame = null;

let emittedOnFrame = 0;
const decodeStart = nowMs();

try {
encoderOutput = this._getEncoderOutput(encoderOutputs);
frameCount = this._getEncoderFrameCount(encoderOutput);
decoderState = {
state1: new Tensor('float32', new Float32Array(numLayers * hiddenSize), [numLayers, 1, hiddenSize]),
state2: new Tensor('float32', new Float32Array(numLayers * hiddenSize), [numLayers, 1, hiddenSize]),
};

targetLengthTensor =
this.transducer.decoder_token_length_dtype === 'int64'
? new Tensor('int64', BigInt64Array.from([1n]), [1])
: new Tensor('int32', new Int32Array([1]), [1]);

for (let frameIndex = 0; frameIndex < frameCount; ) {
const frameData = this._getFrameData(encoderOutput, frameIndex, reusableFrame);
if (this.transducer.encoder_output_layout === 'BDT') {
reusableFrame = frameData;
}
const frameTensor = this._createFrameTensor(frameData);
const prevTokenId = tokenIds.length > 0 ? tokenIds[tokenIds.length - 1] : blankId;
const tokenTensor =
this.transducer.decoder_token_dtype === 'int64'
? new Tensor('int64', BigInt64Array.from([BigInt(prevTokenId)]), [1, 1])
: new Tensor('int32', new Int32Array([prevTokenId]), [1, 1]);

const decoderFeeds = {
[io.decoder_encoder]: frameTensor,
[io.decoder_token]: tokenTensor,
[io.decoder_token_length]: targetLengthTensor,
[io.decoder_state_1]: decoderState.state1,
[io.decoder_state_2]: decoderState.state2,
};

let decoderOutput;
try {
decoderOutput = await this._runDecoder(decoderFeeds);
} finally {
tokenTensor.dispose();
frameTensor.dispose();
}

const logits = decoderOutput[io.decoder_output];
const outputState1 = decoderOutput[io.decoder_output_state_1];
const outputState2 = decoderOutput[io.decoder_output_state_2];
const seenDecoderTensors = new Set();
for (const value of Object.values(decoderOutput)) {
if (!(value instanceof Tensor) || seenDecoderTensors.has(value)) continue;
seenDecoderTensors.add(value);
if (value === logits || value === outputState1 || value === outputState2) {
continue;
}
value.dispose();
}
if (!(logits instanceof Tensor)) {
this._disposeDecoderState(
{
state1: outputState1,
state2: outputState2,
},
decoderState,
);
throw new Error(
`Nemo Conformer TDT decoder output "${io.decoder_output}" was not returned by the session.`,
);
}
if (!(outputState1 instanceof Tensor) || !(outputState2 instanceof Tensor)) {
logits.dispose();
this._disposeDecoderState(
{
state1: outputState1,
state2: outputState2,
},
decoderState,
);
throw new Error(
`Nemo Conformer TDT decoder state outputs "${io.decoder_output_state_1}" and "${io.decoder_output_state_2}" were not returned by the session.`,
);
}
const logitsData = logits.data;
if (logitsData.length < vocabSize) {
logits.dispose();
this._disposeDecoderState(
{
state1: outputState1,
state2: outputState2,
},
decoderState,
);
throw new Error(
`Nemo Conformer TDT decoder output is too small (${logitsData.length}) for vocab_size=${vocabSize}.`,
);
}
const tokenId = argmax(logitsData, 0, vocabSize);
const durationStart = this.transducer.duration_start_index ?? vocabSize;
const hasDurationLogits = logitsData.length > durationStart;
if (this.transducer.duration_start_index != null && !hasDurationLogits) {
logits.dispose();
this._disposeDecoderState(
{
state1: outputState1,
state2: outputState2,
},
decoderState,
);
throw new Error(
`Nemo Conformer TDT decoder output is missing duration logits: expected values beyond index ${durationStart - 1}, got length=${logitsData.length}.`,
);
}
const step = hasDurationLogits
? argmax(logitsData, durationStart, logitsData.length - durationStart) - durationStart
: 0;
if (tdtSteps) {
tdtSteps.push(step);
}

const maybeConfidence =
needConfidences || returnLogProbs || returnFrameConfidences
? confidenceFromLogits(logitsData, tokenId, vocabSize)
: null;
if (frameConfidenceStats && maybeConfidence) {
const stats = frameConfidenceStats.get(frameIndex);
if (stats) {
stats.sum += maybeConfidence.confidence;
stats.count += 1;
} else {
frameConfidenceStats.set(frameIndex, { sum: maybeConfidence.confidence, count: 1 });
}
}

const newState = {
state1: outputState1,
state2: outputState2,
};

if (tokenId !== blankId) {
this._disposeDecoderState(decoderState, newState);
decoderState = newState;

tokenIds.push(tokenId);
// TDT duration convention: step=0 means "stay on current frame" (duration index 0 = no advance).
// We still associate the token with this frame, so durationFrames is at least 1.
const durationFrames = Math.max(1, step);
const endFrame = Math.min(frameCount, frameIndex + durationFrames);
tokenTimestamps.push([
roundTs(frameIndex * frameTime + timeOffset),
roundTs(endFrame * frameTime + timeOffset),
]);
if (tokenConfidences && maybeConfidence) {
tokenConfidences.push(maybeConfidence.confidence);
}
if (frameIndices) {
frameIndices.push(frameIndex);
}
if (logProbs && maybeConfidence) {
logProbs.push(maybeConfidence.logProb);
}
emittedOnFrame += 1;
} else {
this._disposeDecoderState(newState, decoderState);
}

logits.dispose();

if (step > 0) {
frameIndex += step;
emittedOnFrame = 0;
} else if (tokenId === blankId || emittedOnFrame >= maxSymbolsPerStep) {
frameIndex += 1;
emittedOnFrame = 0;
}
}
} finally {
if (targetLengthTensor) targetLengthTensor.dispose();
if (decoderState) this._disposeDecoderState(decoderState);
if (encoderOutputs && typeof encoderOutputs === 'object') {
const seen = new Set();
for (const value of Object.values(encoderOutputs)) {
if (value instanceof Tensor && !seen.has(value)) {
value.dispose();
seen.add(value);
}
}
}
}
const decodeMs = nowMs() - decodeStart;

const tokenizeStart = nowMs();
const text = decodeTransducerText(tokenizer, tokenIds);
const needDetailed = returnTimestamps && (returnWords || returnTokens);
const detailed = needDetailed
? buildTransducerDetailedOutputs(tokenizer, tokenIds, tokenTimestamps, tokenConfidences)
: null;
const tokenizeMs = nowMs() - tokenizeStart;

/** @type {any} */
const result = { text, isFinal: true };
const utteranceConfidence =
tokenConfidences && tokenConfidences.length > 0
? roundMetric(tokenConfidences.reduce((a, b) => a + b, 0) / tokenConfidences.length, 6)
: null;
const utteranceTimestamp =
tokenTimestamps.length > 0
? /** @type {[number, number]} */ ([
tokenTimestamps[0][0],
tokenTimestamps[tokenTimestamps.length - 1][1],
])
: /** @type {[number, number]} */ ([roundTs(timeOffset), roundTs(frameCount * frameTime + timeOffset)]);
const averageLogProb =
logProbs && logProbs.length > 0
? roundMetric(logProbs.reduce((a, b) => a + b, 0) / logProbs.length, 6)
: null;

if (returnTimestamps) {
result.utteranceTimestamp = utteranceTimestamp;

if (detailed) {
if (returnWords) result.words = detailed.words;
if (returnTokens) result.tokens = detailed.tokens;
}

result.confidence = {
utterance: utteranceConfidence,
wordAverage: detailed?.wordAverage != null ? roundMetric(detailed.wordAverage, 6) : null,
averageLogProb,
};
}

// Frame confidences are independent of return_timestamps — emit whenever requested.
if (returnFrameConfidences && frameConfidenceStats && frameConfidenceStats.size > 0) {
const frameConfidences = [];
for (const { sum, count } of frameConfidenceStats.values()) {
frameConfidences.push(sum / count);
}
result.confidence = {
...(result.confidence ?? {}),
frames: frameConfidences,
frameAverage: roundMetric(frameConfidences.reduce((a, b) => a + b, 0) / frameConfidences.length, 6),
};
}

if (!returnTimestamps && averageLogProb != null) {
result.confidence = {
...(result.confidence ?? {}),
averageLogProb,
};
}

const debug = {};
if (returnFrameIndices) {
debug.frameIndices = frameIndices;
}
if (returnLogProbs) {
debug.logProbs = logProbs;
}
if (returnTdtSteps) {
debug.tdtSteps = tdtSteps;
}
if (Object.keys(debug).length > 0) {
result.debug = debug;
}

if (returnMetrics) {
const totalMs = nowMs() - totalStart;
const utteranceDuration = result.utteranceTimestamp
? Math.max(result.utteranceTimestamp[1] - result.utteranceTimestamp[0], 1e-8)
: Math.max(frameCount * frameTime, 1e-8);
const rtf = totalMs / 1000 / utteranceDuration;
result.metrics = {
preprocessMs: 0.0,
encodeMs: roundMetric(encodeMs, 2),
decodeMs: roundMetric(decodeMs, 2),
tokenizeMs: roundMetric(tokenizeMs, 2),
totalMs: roundMetric(totalMs, 2),
rtf: roundMetric(rtf, 4),
rtfX: roundMetric(1 / Math.max(rtf, 1e-8), 2),
};
}

return result;
}

medium

The transcribe method is quite long and complex, which could impact readability and future maintenance. I recommend refactoring it by breaking it down into smaller, more focused private methods. For instance, the main decoding loop (lines 676-833) could be extracted into a _greedyDecodeLoop method, and the result construction logic (lines 849-941) could be moved to a _buildTranscribeOutput method. This would make the overall structure of the transcribe method clearer and easier to understand.
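A structural sketch of that split (the method names _greedyDecodeLoop and _buildTranscribeOutput come from the suggestion above; the stage bodies below are placeholder stand-ins, not the real encoder or decoder logic):

```javascript
// Structural sketch only: it shows the proposed three-stage shape of
// transcribe(), with stub stages so the skeleton is runnable. The real
// methods would hold the encoder run, the TDT greedy loop, and the
// result/metrics assembly respectively.
class TranscribeSkeleton {
  async transcribe(modelInputs, decodeOptions = {}) {
    const encoderOutput = await this._runEncoderStage(modelInputs);
    const decoded = await this._greedyDecodeLoop(encoderOutput, decodeOptions);
    return this._buildTranscribeOutput(decoded, decodeOptions);
  }

  // Stand-in: the real stage would build feeds, run the encoder session,
  // and dispose input tensors in a finally block.
  async _runEncoderStage(inputs) {
    return { frames: inputs };
  }

  // Stand-in: the real stage would hold the frame loop, decoder feeds,
  // duration-step handling, and per-token timestamp/confidence tracking.
  async _greedyDecodeLoop(encoderOutput, _options) {
    return { tokenIds: encoderOutput.frames, timestamps: [] };
  }

  // Stand-in: the real stage would assemble text, words/tokens,
  // confidence summaries, debug payloads, and metrics.
  _buildTranscribeOutput(decoded, _options) {
    return { text: decoded.tokenIds.join(' '), isFinal: true };
  }
}
```

Each stage then owns its own tensor-disposal responsibilities, which also shrinks the large try/finally in the current implementation.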

this.cache.delete(key);
this.cache.set(key, existing);
return true;
} else {

CRITICAL: Potential use-after-dispose - when existing?.value === value but existing.size_bytes > max_bytes, the existing entry is deleted and false is returned. Because the entry holds the same object the caller passed in, deleting it can dispose a value the caller still references.

The fix should either:

  1. Store the value in cache regardless and return ownership status appropriately, or
  2. Evict the bookkeeping entry without disposing the value, leaving the caller with a valid reference
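A minimal stand-in showing how set() can decline a value without disposing it (this is not the repo's FeatureLRUCache; the entry shape, the max_bytes handling, and the optional dispose() method are all assumptions made for illustration):

```javascript
// Sketch of the second fix option: when the cache cannot store a value,
// it leaves the caller's reference untouched and reports that the caller
// keeps ownership. Nothing the cache returns false for is ever disposed.
class TinyCache {
  constructor(maxBytes) {
    this.maxBytes = maxBytes;
    this.cache = new Map();
  }

  // Returns true when the cache now owns `value`; false when the caller
  // retains ownership of a still-live value.
  set(key, value, sizeBytes) {
    if (sizeBytes > this.maxBytes) {
      const existing = this.cache.get(key);
      if (existing && existing.value !== value) {
        // Only dispose cached objects the caller does not also reference.
        existing.value.dispose?.();
      }
      this.cache.delete(key);
      return false;
    }
    this.cache.delete(key); // refresh recency via Map insertion order
    this.cache.set(key, { value, sizeBytes });
    return true;
  }
}
```

The key invariant: a false return always means "your reference is still valid and you must dispose it yourself."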

@kilo-code-bot

kilo-code-bot bot commented Mar 9, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity Count
CRITICAL 1
WARNING 0
SUGGESTION 0
Issue Details

CRITICAL

File Line Issue
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js 100 Use-after-dispose in FeatureLRUCache.set() - when existing?.value === value but existing.size_bytes > max_bytes, the existing entry is deleted and false is returned; because the entry holds the same object the caller passed in, _deleteEntry can dispose a value the caller still references.

Detailed Analysis

1. Correctness Review

CRITICAL - Use-After-Dispose in Cache (Line 94-103)

The FeatureLRUCache.set() method has a critical bug in the handling of same-object-reference values:

if (existing?.value === value) {
    if (existing.size_bytes <= max_bytes) {
        // Refresh recency - works correctly
        this.cache.delete(key);
        this.cache.set(key, existing);
        return true;
    } else {
        this._deleteEntry(key, existing);  // Deletes existing
return false;  // Returns false, but the caller's value may already be disposed!
    }
}

When the same object is passed again but the cache entry size exceeds the limit:

  1. The existing entry is deleted via _deleteEntry
  2. false is returned, signalling that the cache no longer owns the value

At first glance this looks safe: existing?.value === value means the caller still holds a reference to the same object, and the deleted entry is only the cache's bookkeeping for it. The comment "Refresh recency for unchanged value without invalidating caller-owned references" suggests the design expects exactly that. The subtle issue is what _deleteEntry does with the value it evicts.

Looking at _deleteEntry:

_deleteEntry(key, entry) {
    const current = this.cache.get(key);
    if (current !== entry) {
        return;
    }
    this.cache.delete(key);
    if (entry.borrowers > 0) {
        entry.pendingDispose = true;
    } else {
        this.current_size_bytes -= entry.size_bytes;
        disposeCachedValue(entry.value);  // This disposes!
    }
}

Because existing?.value === value, the evicted value is the SAME object the caller holds, and _deleteEntry will dispose it (unless there are active borrowers), leaving the caller with a disposed/invalid object.

This is a real bug that needs fixing.
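One way the fix could look (a sketch against a simplified entry shape mirroring the _deleteEntry snippet above; the plain cache object, the sizeBytes field name, and the optional dispose() are assumptions, and the size accounting is simplified):

```javascript
// Sketch: _deleteEntry gains a `sharedWithCaller` guard, so a value the
// caller still references is removed from the map but never disposed.
function deleteEntry(cache, key, entry, sharedWithCaller = false) {
  const current = cache.map.get(key);
  if (current !== entry) return;
  cache.map.delete(key);
  if (sharedWithCaller) {
    // The caller holds this exact object: release the accounting only.
    cache.currentSizeBytes -= entry.sizeBytes;
    return;
  }
  if (entry.borrowers > 0) {
    entry.pendingDispose = true;
  } else {
    cache.currentSizeBytes -= entry.sizeBytes;
    entry.value.dispose?.();
  }
}
```

The set() path that detects existing?.value === value would then pass sharedWithCaller = true, so returning false correctly leaves the caller with a live, caller-owned value.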

2. Security Review

No concrete security issues identified in this diff. The code properly validates audio inputs, uses type checking, and doesn't appear to have injection vulnerabilities.

3. Performance Review

The implementation includes several good performance practices:

  • Feature LRU caching with configurable size limits
  • Object reuse in decoder loop (reusableFrame)
  • Memory-efficient tensor disposal patterns
  • Quantized audio hashing for cache keys

No major performance concerns detected.

4. Reliability Review

  • Comprehensive error handling with informative messages
  • Input validation for audio samples (finite value check)
  • Proper tensor disposal in finally blocks
  • Cache borrow/release pattern for tensor lifetime management

The memory issue noted above is the main reliability concern.

5. Test Coverage

Comprehensive test coverage including:

  • Unit tests for the cache (FeatureLRUCache)
  • Pipeline tests for various scenarios (windowing, chunking, merging)
  • Tests for tensor disposal and ownership transfer
  • Tests for edge cases (empty audio, NaN values, oversized cache entries)

Merge Recommendation

Request Changes - The memory issue in the cache needs to be addressed before merge. The bug could cause callers to hold references to disposed tensors when same-value entries are evicted due to size limits.

Files Reviewed (8 files)
  • packages/transformers/src/models/feature_extractors.js - Export added
  • packages/transformers/src/models/modeling_utils.js - Model type added
  • packages/transformers/src/models/models.js - Model export added
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js - New model
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js - New pipeline
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js - Cache with bug
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js - Feature extractor
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js - Tests

Comment on lines +161 to +172
if (this.feature_cache) {
const key = `${createAudioCacheKey(audio, this.config.sampling_rate)}:${this.delta_order}:${this.delta_window}:${this.delta_concatenate}`;
const cached = this.feature_cache.acquire(key);
if (cached) {
return tagNemoFeatureOutputOwnership({ ...cached.value }, true, cached.release);
}

const extracted = await this._extract(audio);
const cacheOwnsTensors = this.feature_cache.set(key, extracted);
if (!cacheOwnsTensors) {
return tagNemoFeatureOutputOwnership({ ...extracted }, false);
}

Action required

1. Disposed tensors returned 🐞 Bug ⛯ Reliability

NemoConformerTDTFeatureExtractor._call() returns extracted as caller-owned when
FeatureLRUCache.set() returns false, but FeatureLRUCache.set() may already have evicted and
disposed that same value, yielding disposed tensors to downstream code.
Agent Prompt
### Issue description
`NemoConformerTDTFeatureExtractor` treats `FeatureLRUCache.set(...)=false` as “cache did not retain ownership, so the returned tensors are caller-owned and valid”. But the cache can return `false` after evicting and disposing the inserted value, which can cause downstream use-after-dispose/double-dispose.

### Issue Context
- `FeatureLRUCache.set()` currently calls `_evict()` and may dispose entries via `_deleteEntry()`.
- Repo tests already assert that `set()` can return `false` while disposing the inserted value.
- The feature extractor returns `extracted` when `set()` returns `false`, which becomes unsafe.

### Fix Focus Areas
- packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js[83-176]
- packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js[158-178]
- packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js[1114-1147]
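One possible caller-side shape for this fix (all names here, setOrReject and cacheFeatures, are hypothetical, not the repo's API; the idea is a contract where the cache either takes ownership or hands the value back untouched):

```javascript
// Sketch: instead of a boolean, a hypothetical setOrReject() either
// stores the value (cache takes ownership, returns null) or hands the
// value back untouched (caller keeps ownership). It never disposes what
// it returns, so the extractor can safely return rejected tensors.
function cacheFeatures(cache, key, extracted) {
  const rejected = cache.setOrReject(key, extracted);
  if (rejected !== null) {
    // Cache declined: tensors are guaranteed live and caller-owned.
    return { features: rejected, callerOwned: true };
  }
  // Cache accepted: borrow the stored value under the cache's lifecycle.
  const borrowed = cache.acquire(key);
  return { features: borrowed.value, callerOwned: false, release: borrowed.release };
}
```

Under this contract the extractor never has to guess whether set() disposed the value it just produced; ownership is explicit in the return path.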



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2dee9b19-9b6b-4dc5-9cff-42ab817a924e

📥 Commits

Reviewing files that changed from the base of the PR and between f65a4c7 and 495bab5.

📒 Files selected for processing (22)
  • packages/transformers/src/models/feature_extractors.js
  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/processors.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
📜 Review details
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.

Applied to files:

  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/processors.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/src/models/feature_extractors.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.

Applied to files:

  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/feature_extractors.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.

Applied to files:

  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.

Applied to files:

  • packages/transformers/src/models/modeling_utils.js
  • packages/transformers/src/models/processors.js
  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/feature_extractors.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-08T20:49:40.469Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:40.469Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.

Applied to files:

  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/feature_extractors.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.

Applied to files:

  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/feature_extractors.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
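The frame-count relationship this learning describes can be sketched numerically. This is a standalone illustration, not the extractor's actual code; the `n_fft` and `hop_length` values below are assumed for the example.

```javascript
// Hypothetical config values for illustration; real values come from preprocessor_config.json.
const n_fft = 512;
const hop_length = 160;

// Frame count used by the extractor: deliberately excludes the terminal
// fully-padded frame produced by the centered STFT.
function rawFeaturesLength(audioLength) {
  return Math.floor((audioLength + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length);
}

// Frame count the centered STFT itself produces (frame_length === n_fft here),
// i.e. the spectrogram's internal 1 + floor((length - frame_length) / hop_length).
function stftFrameCount(audioLength) {
  const padded = audioLength + Math.floor(n_fft / 2) * 2;
  return 1 + Math.floor((padded - n_fft) / hop_length);
}

// One second of 16 kHz audio: 100 mask-bounded frames vs 101 STFT frames.
console.log(rawFeaturesLength(16000), stftFrameCount(16000)); // 100 101
```

For any audio length the two formulas differ by exactly one frame, which is the terminal padded frame the extractor excludes so that mask and normalization stay bounded to non-padding frames.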
📚 Learning: 2026-03-08T22:17:10.946Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js:176-182
Timestamp: 2026-03-08T22:17:10.946Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js`, both `runNemoTranscribe` call sites intentionally pass `returnMetrics: false`. This is the current pipeline adapter contract; metrics are not collected at the pipeline layer. The dedicated adapter tests in `test_pipelines_nemo_conformer_tdt.js` explicitly match this behavior. Do not flag `returnMetrics: false` as a regression or suggest changing it to `true`.

Applied to files:

  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/src/models/registry.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/pipelines/automatic-speech-recognition.js
  • packages/transformers/src/models/models.js
  • packages/transformers/src/utils/model_registry/get_model_files.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/pipelines/index.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.

Applied to files:

  • packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-08T21:41:36.882Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:17-24
Timestamp: 2026-03-08T21:41:36.882Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally quantizes each sample to int16 (via `Math.round(sample * 32768)`) before hashing. This is a deliberate design choice for deterministic cross-runtime key stability. The resulting aliasing of distinct waveforms that fall in the same int16 bins is an accepted tradeoff because feature caching is optional and off by default. Do not flag the int16 quantization as a collision risk or suggest hashing raw bytes instead.

Applied to files:

  • packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
  • packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
  • packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
  • packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
  • packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
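A minimal sketch of the key derivation these two cache learnings describe, assuming an FNV-1a-style 32-bit hash. The real `createAudioCacheKey` may use a different hash function and key format; the properties the learnings pin down are the full-buffer iteration and the int16 quantization.

```javascript
// Sketch: quantize every sample to an int16 bin, then fold each value into a
// 32-bit FNV-1a hash. Distinct waveforms that land in the same int16 bins
// intentionally alias to the same key (accepted tradeoff for deterministic
// cross-runtime stability).
function audioCacheKeySketch(audio) {
  let hash = 0x811c9dc5; // FNV-1a offset basis
  for (let i = 0; i < audio.length; ++i) {
    // Full iteration over every sample, not a subset, to minimize collisions.
    const q = Math.round(audio[i] * 32768) & 0xffff;
    hash ^= q & 0xff;
    hash = Math.imul(hash, 0x01000193); // FNV-1a prime
    hash ^= (q >>> 8) & 0xff;
    hash = Math.imul(hash, 0x01000193);
  }
  return `${audio.length}-${(hash >>> 0).toString(16)}`;
}
```

Because feature caching is optional and off by default, the O(n) pass and the sub-bin aliasing are both acceptable by design, per the learnings above.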
📚 Learning: 2026-03-03T22:59:55.984Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T22:59:55.984Z
Learning: When a model subclass overrides from_pretrained and bypasses the generic model-type branch, do not introduce or rely on a MODEL_TYPES entry for that subclass in the model registry. For this NemoConformerTDT case, in packages/transformers/src/utils/model_registry/get_model_files.js, avoid adding a NemoConformerTDT entry in the model type map; rely on the override logic in modeling_nemo_conformer_tdt.js. This keeps the registry explicit to the actual file and prevents unintended dispatch through the generic branch.

Applied to files:

  • packages/transformers/src/utils/model_registry/get_model_files.js
🔇 Additional comments (7)
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (1)

52-56: 🛠️ Refactor suggestion | 🟠 Major

get() bypasses the borrow/release lifetime model.

It returns cache-owned tensors without extending their lifetime, so a caller can retain the value across set() or clear() and then touch disposed tensors. The safe public surface here is acquire()/release() only.

⛔ Skipped due to learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:17-24
Timestamp: 2026-03-08T21:41:36.882Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally quantizes each sample to int16 (via `Math.round(sample * 32768)`) before hashing. This is a deliberate design choice for deterministic cross-runtime key stability. The resulting aliasing of distinct waveforms that fall in the same int16 bins is an accepted tradeoff because feature caching is optional and off by default. Do not flag the int16 quantization as a collision risk or suggest hashing raw bytes instead.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:40.469Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
packages/transformers/src/utils/model_registry/get_model_files.js (1)

180-182: Load the canonical NeMo ONNX pair here.

These basenames match the session keys hard-coded in NemoConformerForTDT.from_pretrained(), so this branch will request the exact artifacts the custom loader expects.

packages/transformers/src/models/feature_extractors.js (1)

8-8: Expose the feature extractor through the central barrel.

This keeps the new NeMo extractor discoverable through the same public surface as the other model-specific extractors.

packages/transformers/src/models/processors.js (1)

11-11: Expose the processor through the central barrel.

This is the right public entry point for the new NeMo processor.

packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js (1)

128-203: Good contract coverage for the NeMo ASR adapter.

These cases pin the three public return shapes and the options forwarded into model.transcribe(), which is exactly where this integration can regress.

packages/transformers/src/pipelines/index.js (1)

33-33: Wire the ASR task to the concrete NeMo loader.

Including NemoConformerForTDT here is the right way to let pipeline('automatic-speech-recognition', ...) resolve the custom class without disturbing the existing seq2seq and CTC loaders.

Also applies to: 154-155

packages/transformers/src/pipelines/automatic-speech-recognition.js (1)

6-8: Keep the NeMo path behind a dedicated adapter.

The new switch case and helper isolate the model-specific behavior in runNemoConformerTDTPipeline and leave the Whisper and Wav2Vec2 paths untouched.

Also applies to: 158-159, 308-316

Comment on lines +38 to +63
if (!Number.isInteger(this.config.n_fft) || this.config.n_fft <= 0) {
throw new Error(
`NemoConformerTDTFeatureExtractor expected \`n_fft\` as a positive integer, got ${this.config.n_fft}.`,
);
}
if (
!Number.isInteger(this.config.win_length) ||
this.config.win_length <= 0 ||
this.config.win_length > this.config.n_fft
) {
throw new Error(
`NemoConformerTDTFeatureExtractor expected \`win_length\` in [1, n_fft], got win_length=${this.config.win_length}, n_fft=${this.config.n_fft}.`,
);
}

// Prefer given `mel_filters` from preprocessor_config.json, or calculate them if they don't exist.
this.config.mel_filters ??= mel_filter_bank(
Math.floor(1 + this.config.n_fft / 2), // num_frequency_bins
this.config.feature_size, // num_mel_filters
0.0, // min_frequency
this.config.sampling_rate / 2, // max_frequency
this.config.sampling_rate, // sampling_rate
'slaney', // norm
'slaney', // mel_scale
);


⚠️ Potential issue | 🟠 Major

Reject invalid hop_length, feature_size, and sampling_rate in the constructor.

The constructor only guards n_fft and win_length. Invalid values for these other core fields still flow into mel_filter_bank(), spectrogram(), and frame-length math, so a bad model config fails later and less predictably.

Proposed fix
         if (
             !Number.isInteger(this.config.win_length) ||
             this.config.win_length <= 0 ||
             this.config.win_length > this.config.n_fft
         ) {
             throw new Error(
                 `NemoConformerTDTFeatureExtractor expected \`win_length\` in [1, n_fft], got win_length=${this.config.win_length}, n_fft=${this.config.n_fft}.`,
             );
         }
+        if (!Number.isInteger(this.config.hop_length) || this.config.hop_length <= 0) {
+            throw new Error(
+                `NemoConformerTDTFeatureExtractor expected \`hop_length\` as a positive integer, got ${this.config.hop_length}.`,
+            );
+        }
+        if (!Number.isInteger(this.config.feature_size) || this.config.feature_size <= 0) {
+            throw new Error(
+                `NemoConformerTDTFeatureExtractor expected \`feature_size\` as a positive integer, got ${this.config.feature_size}.`,
+            );
+        }
+        if (!Number.isFinite(this.config.sampling_rate) || this.config.sampling_rate <= 0) {
+            throw new Error(
+                `NemoConformerTDTFeatureExtractor expected \`sampling_rate\` as a positive number, got ${this.config.sampling_rate}.`,
+            );
+        }
 
         // Prefer given `mel_filters` from preprocessor_config.json, or calculate them if they don't exist.
         this.config.mel_filters ??= mel_filter_bank(

Comment on lines +851 to +886
const needDetailed = returnTimestamps && (returnWords || returnTokens);
const detailed = needDetailed
? buildTransducerDetailedOutputs(tokenizer, tokenIds, tokenTimestamps, tokenConfidences)
: null;
const tokenizeMs = nowMs() - tokenizeStart;

/** @type {any} */
const result = { text, isFinal: true };
const utteranceConfidence =
tokenConfidences && tokenConfidences.length > 0
? roundMetric(tokenConfidences.reduce((a, b) => a + b, 0) / tokenConfidences.length, 6)
: null;
const utteranceTimestamp =
tokenTimestamps.length > 0
? /** @type {[number, number]} */ ([
tokenTimestamps[0][0],
tokenTimestamps[tokenTimestamps.length - 1][1],
])
: /** @type {[number, number]} */ ([roundTs(timeOffset), roundTs(frameCount * frameTime + timeOffset)]);
const averageLogProb =
logProbs && logProbs.length > 0
? roundMetric(logProbs.reduce((a, b) => a + b, 0) / logProbs.length, 6)
: null;

if (returnTimestamps) {
result.utteranceTimestamp = utteranceTimestamp;

if (detailed) {
if (returnWords) result.words = detailed.words;
if (returnTokens) result.tokens = detailed.tokens;
}

result.confidence = {
utterance: utteranceConfidence,
wordAverage: detailed?.wordAverage != null ? roundMetric(detailed.wordAverage, 6) : null,
averageLogProb,

⚠️ Potential issue | 🟠 Major

confidence.wordAverage should not depend on returnWords or returnTokens.

Right now wordAverage is only computed when detailed word/token outputs are requested, so transcribe({ returnTimestamps: true, tokenizer }) returns confidence.wordAverage: null even though the tokenizer and per-token confidences are already available.

Proposed fix
-        const needDetailed = returnTimestamps && (returnWords || returnTokens);
-        const detailed = needDetailed
+        const needDetailed = returnTimestamps && (returnWords || returnTokens);
+        const needWordAverage = returnTimestamps && !!tokenizer;
+        const detailed = needDetailed || needWordAverage
             ? buildTransducerDetailedOutputs(tokenizer, tokenIds, tokenTimestamps, tokenConfidences)
             : null;

Comment on lines +12 to +18
/**
* Preprocess raw audio for Nemo Conformer TDT models.
* @param {Float32Array|Float64Array} audio
*/
async _call(audio) {
return await this.feature_extractor(audio);
}

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Base Processor._call contract =="
sed -n '108,126p' packages/transformers/src/processing_utils.js

echo
echo "== NemoConformerTDTProcessor override =="
sed -n '1,24p' packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js

Repository: ysdede/transformers.js

Length of output: 1413


Remove the unnecessary _call override or forward ...args to preserve the processor contract.

The base Processor._call(input, ...args) dispatches to the first available processor component, including feature_extractor. This override duplicates that logic while narrowing the signature and silently discarding per-call extractor options. Either remove it entirely and rely on the base class, or if retained, forward ...args:

Fix
    /**
     * Preprocess raw audio for Nemo Conformer TDT models.
     * @param {Float32Array|Float64Array} audio
     */
-    async _call(audio) {
-        return await this.feature_extractor(audio);
+    async _call(audio, ...args) {
+        return await this.feature_extractor(audio, ...args);
     }

Comment on lines +28 to +39
export function joinTimedWords(words) {
let text = '';
for (const word of words) {
const part = word.text ?? '';
if (!part) continue;
if (!text) {
text = part;
} else if (/^[,.;:!?)}\]]+$/.test(part)) {
text += part;
} else {
text += ` ${part}`;
}

⚠️ Potential issue | 🟠 Major

joinTimedWords() still formats opener and numeric affix tokens incorrectly.

It only suppresses spaces before closing punctuation. Standalone `$`, `%`, or `(` tokens produce `$ 10`, `10 %`, and `( hello`, and that output is propagated into both segment text and pipeline chunks.

Proposed fix
 export function joinTimedWords(words) {
     let text = '';
     for (const word of words) {
         const part = word.text ?? '';
         if (!part) continue;
         if (!text) {
             text = part;
-        } else if (/^[,.;:!?)}\]]+$/.test(part)) {
+        } else if (/^[,.;:!?%)}\]]+$/.test(part) || /[(\["'“‘$£€¥]$/.test(text)) {
             text += part;
         } else {
             text += ` ${part}`;
         }
     }
     return text;
 }

Comment on lines +9 to +12
export function decodeTransducerText(tokenizer, token_ids) {
if (!Array.isArray(token_ids) || token_ids.length === 0) return '';
if (!tokenizer) return token_ids.join(' ');
return tokenizer.decode(token_ids, { skip_special_tokens: true }).trim();

⚠️ Potential issue | 🟠 Major

Require a tokenizer for non-empty text decoding.

Returning token_ids.join(' ') here silently turns a successful decode into token ids instead of a transcription. It also makes text-only mode "work" while the detailed-output path correctly rejects the same missing-tokenizer case.

Proposed fix
 export function decodeTransducerText(tokenizer, token_ids) {
     if (!Array.isArray(token_ids) || token_ids.length === 0) return '';
-    if (!tokenizer) return token_ids.join(' ');
+    if (!tokenizer) {
+        throw new Error('decodeTransducerText requires a tokenizer for non-empty token_ids.');
+    }
     return tokenizer.decode(token_ids, { skip_special_tokens: true }).trim();
 }

Comment on lines +24 to +37
if (
prev &&
prevText === wordText &&
(
prevText.length > 0 ||
normalizeRawMergedWordText(prev.text) === normalizeRawMergedWordText(word.text)
) &&
word.startTime < prev.endTime
) {
const prevDuration = prev.endTime - prev.startTime;
const nextDuration = word.endTime - word.startTime;
if (nextDuration > prevDuration) {
merged[merged.length - 1] = word;
}

⚠️ Potential issue | 🟠 Major

Do not drop timing bounds when deduplicating overlapped words.

This replacement keeps only one copy’s startTime/endTime. For window-overlap duplicates, that can trim the surviving word and then move the restart cursor or final chunk timestamps forward. Keep the preferred payload, but union the time span when collapsing duplicates.

Proposed fix
         if (
             prev &&
             prevText === wordText &&
             (
                 prevText.length > 0 ||
                 normalizeRawMergedWordText(prev.text) === normalizeRawMergedWordText(word.text)
             ) &&
-            word.startTime < prev.endTime
+            word.startTime <= prev.endTime + 1e-6
         ) {
-            const prevDuration = prev.endTime - prev.startTime;
-            const nextDuration = word.endTime - word.startTime;
-            if (nextDuration > prevDuration) {
-                merged[merged.length - 1] = word;
-            }
+            const keep =
+                (word.endTime - word.startTime) > (prev.endTime - prev.startTime) ? word : prev;
+            merged[merged.length - 1] = {
+                ...keep,
+                startTime: Math.min(prev.startTime, word.startTime),
+                endTime: Math.max(prev.endTime, word.endTime),
+            };
             continue;
         }

Comment on lines +48 to +65
function resolveTokenPiece(tokenizer, id) {
const rawToken = getIdToTokenMap(tokenizer).get(id) ?? '';
const decoded = tokenizer.decode([id], {
skip_special_tokens: true,
clean_up_tokenization_spaces: false,
});

const startsWithBoundaryMarker = /^(?:▁|Ġ)+/.test(rawToken);
const startsWithWhitespace = /^\s+/.test(decoded);
const startsNewWord = startsWithBoundaryMarker || startsWithWhitespace;

let clean = decoded.replace(/^\s+/, '');
if (!clean) {
clean = rawToken.replace(/^(?:▁|Ġ|Ċ)+/, '').replace(/^ +/, '');
}

return { raw: rawToken || decoded, clean, startsNewWord };
}

🧹 Nitpick | 🔵 Trivial

Cache resolved token pieces, not just raw vocab text.

resolveTokenPiece still calls tokenizer.decode([id], ...) for every occurrence of a token id. On long transcripts that makes detailed-output generation pay the tokenizer cost repeatedly for identical ids.

Proposed refactor
 const TOKEN_ID_TO_TEXT_CACHE = new WeakMap();
+const TOKEN_PIECE_CACHE = new WeakMap();
 ...
 function resolveTokenPiece(tokenizer, id) {
+    let pieceCache = TOKEN_PIECE_CACHE.get(tokenizer);
+    if (!pieceCache) {
+        pieceCache = new Map();
+        TOKEN_PIECE_CACHE.set(tokenizer, pieceCache);
+    }
+    const cachedPiece = pieceCache.get(id);
+    if (cachedPiece) return cachedPiece;
+
     const rawToken = getIdToTokenMap(tokenizer).get(id) ?? '';
     const decoded = tokenizer.decode([id], {
         skip_special_tokens: true,
         clean_up_tokenization_spaces: false,
     });
@@
-    return { raw: rawToken || decoded, clean, startsNewWord };
+    const piece = { raw: rawToken || decoded, clean, startsNewWord };
+    pieceCache.set(id, piece);
+    return piece;
 }

Comment on lines +44 to 45
['nemo-conformer-tdt', 'NemoConformerForTDT'],
['audio-spectrogram-transformer', 'ASTModel'],

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Explicit ASR task wiring =="
sed -n '148,158p' packages/transformers/src/pipelines/index.js

echo
echo "== NeMo additions in the generic registry/type maps =="
sed -n '12,50p' packages/transformers/src/models/registry.js
sed -n '558,590p' packages/transformers/src/models/registry.js

echo
echo "== Generic consumers of MODEL_MAPPING_NAMES / MODEL_TYPE_MAPPING =="
rg -n -C3 'MODEL_MAPPING_NAMES|MODEL_TYPE_MAPPING' \
  packages/transformers/src/models \
  packages/transformers/src/utils/model_registry

Repository: ysdede/transformers.js

Length of output: 22032


Remove NemoConformerForTDT from MODEL_MAPPING_NAMES_ENCODER_ONLY.

The ASR pipeline already loads NemoConformerForTDT directly without using registry resolution. Adding it to the encoder-only mapping exposes this class through generic resolver paths it does not use. Since the model overrides from_pretrained with specialized session construction, it should not be discoverable via generic model-type routing.

Remove lines 44-45 from packages/transformers/src/models/registry.js. The MODEL_TYPE_MAPPING entries established by direct registration in modeling_nemo_conformer_tdt.js (lines 953-954) are sufficient if get_model_files() requires a fallback for architecture-less configs.
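The distinction this comment relies on, generic-map discoverability versus direct type registration, can be sketched as follows. The map names mirror the review, but the contents here are illustrative, not the actual registry code:

```javascript
// Generic encoder-only mapping: anything listed here is reachable through
// generic resolver paths that construct model classes by type string.
const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
  ['audio-spectrogram-transformer', 'ASTModel'],
  // 'nemo-conformer-tdt' deliberately absent: the ASR pipeline loads
  // NemoConformerForTDT directly, bypassing generic resolution.
]);

// Direct registration: records only the model-type association, so
// config-driven lookups (e.g. a get_model_files() fallback for
// architecture-less configs) still resolve the type string without
// exposing the class to generic construction.
const MODEL_TYPE_MAPPING = new Map();
MODEL_TYPE_MAPPING.set('nemo-conformer-tdt', 'NemoConformerForTDT');
```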

Comment on lines +52 to +69
```js
      it(
        "supports concatenated delta and delta-delta features",
        async () => {
          const extractor = new NemoConformerTDTFeatureExtractor({
            ...base,
            feature_size: 128,
            delta_order: 2,
            delta_window: 2,
            delta_concatenate: true,
          });
          const { input_features } = await extractor(audio);
          try {
            expect(input_features.dims[2]).toBe(128 * 3);
          } finally {
            input_features.dispose();
          }
        },
        MAX_TEST_EXECUTION_TIME,
```

⚠️ Potential issue | 🟡 Minor

Dispose attention_mask in the concatenated-delta tests.

Both blocks only keep input_features, but the extractor also returns an attention_mask. That leaves a caller-owned tensor live in the test path, which is the wrong pattern in a suite that is explicitly checking tensor lifetimes.

Proposed fix

```diff
-        const { input_features } = await extractor(audio);
+        const { input_features, attention_mask } = await extractor(audio);
         try {
           expect(input_features.dims[2]).toBe(128 * 3);
         } finally {
           input_features.dispose();
+          attention_mask.dispose();
         }

-        let input_features;
+        let input_features;
+        let attention_mask;
         try {
-          ({ input_features } = await extractor(audio));
+          ({ input_features, attention_mask } = await extractor(audio));
           expect(input_features.dims[2]).toBe(80 * 2);
         } finally {
           Tensor.prototype.dispose = originalDispose;
           input_features?.dispose();
+          attention_mask?.dispose();
         }
```

Also applies to: 101-130
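The pattern the fix converges on, destructure every caller-owned tensor and release each one in `finally`, can be sketched on its own. `FakeTensor` and `extract` here are hypothetical stand-ins, not the suite's real types:

```javascript
// Hypothetical stand-in for a tensor with an explicit lifetime.
class FakeTensor {
  constructor() { this.disposed = false; }
  dispose() { this.disposed = true; }
}

// Simulated extractor output: every returned tensor is caller-owned.
function extract() {
  return { input_features: new FakeTensor(), attention_mask: new FakeTensor() };
}

// Release ALL returned tensors in finally, so an assertion that throws
// mid-test cannot leak either one.
function checkedExtract() {
  const out = extract();
  try {
    // ... assertions on out.input_features would go here ...
    return out;
  } finally {
    out.input_features.dispose();
    out.attention_mask.dispose();
  }
}
```

The point of the review is simply that destructuring only `input_features` silently drops `attention_mask` on the floor, which defeats a suite whose purpose is to check tensor lifetimes.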

Comment on lines +888 to +920
```js
    it(
      "computes delta and delta-delta features",
      async () => {
        const input = new Tensor(
          "float32",
          Float32Array.from([
            // T=4, F=2
            1, 2, 2, 4, 3, 6, 4, 8,
          ]),
          [1, 4, 2],
        );

        const split = computeTemporalDeltas(input, { order: 2, window: 1, concatenate: false });
        expect(split.delta.dims).toEqual([1, 4, 2]);
        expect(split.delta_delta.dims).toEqual([1, 4, 2]);

        const concatOrder1 = computeTemporalDeltas(input, { order: 1, window: 1, concatenate: true });
        expect(concatOrder1.dims).toEqual([1, 4, 4]);
        expect(Array.from(concatOrder1.data.slice(0, 8))).toEqual([
          1, 2, 0.5, 1, // t0: base + delta
          2, 4, 1, 2,   // t1: base + delta
        ]);

        const concat = computeTemporalDeltas(input, { order: 2, window: 1, concatenate: true });
        expect(concat.dims).toEqual([1, 4, 6]);
      },
      MAX_TEST_EXECUTION_TIME,
```

⚠️ Potential issue | 🟡 Minor

Dispose the tensors allocated by the delta-helper tests.

These two cases leave several caller-owned tensors live after the assertions. In the disposal-counting case, restore Tensor.prototype.dispose first, then release the test tensors in finally so the count stays stable.

Also applies to: 928-948
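The restore-then-release ordering the comment asks for can be sketched with a hypothetical dispose counter (not the suite's real instrumentation):

```javascript
// Hypothetical stand-in for the library's Tensor; only dispose matters here.
class FakeTensor {
  dispose() {}
}

let disposeCount = 0;
const originalDispose = FakeTensor.prototype.dispose;
// Instrument dispose to count releases performed by the code under test.
FakeTensor.prototype.dispose = function () {
  disposeCount += 1;
  return originalDispose.call(this);
};

const callerOwned = new FakeTensor();
try {
  // Code under test releasing an intermediate: this SHOULD be counted.
  const intermediate = new FakeTensor();
  intermediate.dispose();
} finally {
  // Restore the hook FIRST, so the test's own cleanup is not counted ...
  FakeTensor.prototype.dispose = originalDispose;
  // ... THEN release the caller-owned tensor without inflating the count.
  callerOwned.dispose();
}
```

With this ordering, `disposeCount` reflects only the disposals performed inside the `try` block, so the assertion on the count stays stable no matter how many test-owned tensors the `finally` clause cleans up.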
