Conversation
…cache helpers

Carry over non-runtime typing fixes from the prior branch while intentionally excluding the WebGPU disable_prepacking workaround in session.js.

- Cast dynamic model.transcribe access for Nemo TDT pipeline method checks/calls.
- Cast Tensor data byteLength access in transducer cache utilities.
- Add explicit tuple/object JSDoc annotations in transducer timestamp builder.

This keeps main-based v4 work clean with latest ORT-Web on origin/main and avoids retaining the temporary encoder prepacking workaround.
- Replace legacy per-feature flags (return_token_timestamps, return_word_timestamps, return_utterance_timestamp) with a layered API: return_timestamps (utterance-level), return_words, return_tokens
- Merge duplicate outputs: words absorbs word_timestamps, tokens absorbs token_timestamps and token_ids
- Add per-token confidence, word-level confidence aggregation, utterance_confidence, and confidence_scores summary
- Gate frame confidences behind returnFrameConfidences flag
- Add return_metrics with encode/decode/total timing and RTF
- Add debug flags: returnFrameIndices, returnLogProbs, returnTdtSteps
- Fix vocab Map handling in getIdToTokenMap and _resolveVocabSize (tokenizer.get_vocab() returns Map in WASM binding)
- Update ASR pipeline to wire timestamp_granularity to new model flags
- Format all changed files with Prettier per CONTRIBUTING.md
…ipeline

- Add roundTs() for millisecond-precision timestamp rounding at source
- Round all confidence averages to 6 decimal places
- Round per-token and per-word confidence values
- Remove timestamp_granularity and formatting helpers from pipeline
- Pipeline returns model.transcribe() output directly
- Auto-enable return_words and return_metrics when return_timestamps is true
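The rounding helpers named in this commit are not shown in the diff; a minimal sketch of what "rounding at source" could look like (the `roundTs`/`roundMetric` names come from the commit message, the bodies are assumed):

```javascript
// Hypothetical sketch: round a timestamp (in seconds) to millisecond precision,
// so downstream consumers never see float noise like 1.2340000000000002.
function roundTs(seconds) {
  return Math.round(seconds * 1000) / 1000;
}

// Hypothetical sketch: round a metric (e.g. a confidence average) to a fixed
// number of decimal places, matching the "6 decimal places" policy above.
function roundMetric(value, decimals) {
  const factor = 10 ** decimals;
  return Math.round(value * factor) / factor;
}

console.log(roundTs(1.23456789)); // 1.235
console.log(roundMetric(0.123456789, 6)); // 0.123457
```

Rounding once at the point where values are produced means every consumer (pipeline output, tests, JSON serialization) sees the same canonical numbers.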
…imestamps, honor return_metrics kwarg

- modeling_nemo_conformer_tdt: dispose logits and new decoder state tensors before throwing when logitsData.length < vocabSize to prevent resource leak
- modeling_nemo_conformer_tdt: move returnFrameConfidences output block outside the return_timestamps guard so frame/frame_avg are emitted independently
- automatic-speech-recognition: change return_metrics from hardcoded true to kwargs.return_metrics ?? false to respect user intent and avoid overhead
- Accept upstream restructuring: SUPPORTED_TASKS and pipeline imports moved from pipelines.js to pipelines/index.js
- Migrate NemoConformerForTDT registration to pipelines/index.js accordingly
- Add MODEL_TYPES.NemoConformerTDT (id=16) to modeling_utils
- Register NemoConformerForTDT in MODEL_TYPE_MAPPING, MODEL_NAME_TO_CLASS_MAPPING, and MODEL_CLASS_TO_NAME_MAPPING so the base class from_pretrained, ModelRegistry, and is_pipeline_cached all recognise the model correctly
- Add NemoConformerTDT case to get_model_files so progress_callback receives accurate file size totals for encoder_model.onnx + decoder_model_merged.onnx
Standardizes internal logging to follow the upstream convention introduced in ModelRegistry refactor.
- Guard feature extractor against empty/short audio (NaN prevention)
- Move decoder tensor init inside try block for safe disposal
- Add architecture key to MODEL_TYPE_MAPPING
- Add input validation in buildTransducerDetailedOutputs
- Harden audio cache hash against NaN samples
- Add order validation in computeTemporalDeltas
- Restore pipeline: return_timestamps truthy => words + metrics always on
- Remove all timestamp_granularity tests (feature was removed)
- Fix option names: return_tokens, return_words, return_timestamps
- Fix output fields: tokens/words arrays, not token_ids/word_timestamps
- Verify pipeline passes return_words + return_metrics when timestamps on
- Add test: return_timestamps 'word' treated as truthy
Address reviewer findings except the return_metrics policy decision.

- Fix temporal delta concatenation to interleave per frame and add dtype validation.
- Validate preemphasis range and clamp normalization variance in feature extraction.
- Remove unsafe encoder layout inference; require explicit encoder_output_layout.
- Redesign decode loop to read frame data on-demand instead of eager frame materialization.
- Deduplicate word finalization and avoid zero-filling missing word confidences.
- Tighten tests for delta layout/type checks, explicit layout requirement, call counts, and naming accuracy.
Fixes high-impact issues found in PR review validation:

- force NemoConformerForTDT to MODEL_TYPES.NemoConformerTDT in registry overrides
- ensure encoder outputs are disposed when pre-decode validation throws
- remove stride sampling from audio cache key hashing to prevent false cache hits
- use encoder_model selector key in get_model_files for Nemo per-component dtype/device overrides

Also adds targeted regression tests for mapping, disposal behavior, file selection, and cache key correctness.
- Clamp token end timestamps to encoder frame bounds during TDT decoding.
- Validate FeatureLRUCache constructor limits to fail fast on invalid settings.
- Add regression tests for timestamp clamping and cache limit validation.
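A sketch of the clamping idea (the function name and signature here are hypothetical; the real fix lives inside the TDT decode loop):

```javascript
// Hypothetical sketch: a TDT token starting at `startFrame` with a predicted
// duration of `durationFrames` must not end past the encoder's last frame,
// otherwise its end timestamp would point beyond the audio.
function clampTokenTimestamp(startFrame, durationFrames, frameCount, frameTime) {
  // Duration index 0 means "stay on current frame", so occupy at least 1 frame.
  const endFrame = Math.min(frameCount, startFrame + Math.max(1, durationFrames));
  return [startFrame * frameTime, endFrame * frameTime];
}

// A token emitted near the end of a 100-frame utterance cannot end after frame 100.
console.log(clampTokenTimestamp(99, 4, 100, 1)); // [99, 100]
```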
Dispose intermediate tensors in computeTemporalDeltas concatenate paths and dispose replaced base input features when delta concatenation returns a new tensor.

Add regression tests that assert disposal behavior for delta concatenate flows and feature extractor reassignment.
Dispose non-essential Tensor outputs returned by decoder steps to prevent cumulative memory growth. Keep logits/state tensors alive for decoding/state transitions and dispose extras immediately.

Add regression test to assert auxiliary decoder tensor outputs are disposed each step.
Compute encoder length directly from attention_mask.data instead of attention_mask.tolist() to avoid large transient array allocations in ASR decode hot path.
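A sketch of the optimization described above, assuming the mask is exposed as a typed array on `attention_mask.data` (function name hypothetical):

```javascript
// Hypothetical sketch: sum the attention mask's raw typed-array data to get
// the valid encoder length, instead of calling .tolist(), which would allocate
// a large transient nested JS array on every decode call in the hot path.
function encoderLengthFromMask(maskData) {
  let length = 0;
  for (let i = 0; i < maskData.length; ++i) {
    // Mask values may be BigInt64 (int64 tensors) or plain numbers.
    if (Number(maskData[i]) !== 0) ++length;
  }
  return length;
}

// Three valid frames followed by two padding frames.
console.log(encoderLengthFromMask(new BigInt64Array([1n, 1n, 1n, 0n, 0n]))); // 3
```

Reading the typed array directly keeps the decode loop allocation-free for this step, which matters when it runs once per audio chunk.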
Fail fast when duration logits are required but missing in decoder output, and enforce positive-integer vocab size at runtime config validation. Validate prepared Nemo pipeline audio for non-empty finite samples before processor/model calls. Add regression tests for missing duration logits and non-finite audio rejection.
Fix placeholder interpolation in _prepare_model_inputs error text. Add fail-fast validation for Nemo delta_window and reject duplicate decoder output aliases in transducer io config. Add regression tests for delta_window validation and duplicate decoder output alias rejection.
Validate transcribe timeOffset as finite and guard encoderOutputs cleanup path to avoid masking primary failures. Align transducer_text JSDoc token type with runtime shape (include id). Harden Parakeet feature extractor test by using direct mask data and explicit tensor disposal via try/finally; add timeOffset validation regression test.
- fail fast on missing decoder state outputs and invalid encoder layout enums
- make FeatureLRUCache own cached tensor lifetimes (replace/evict/clear) with deduped disposal and deterministic size fallback
- validate n_fft/win_length in Nemo feature extractor
- align Nemo ASR pipeline docs with actual forwarded options
- add regression coverage for runtime config validation, non-concatenated deltas/cache behavior, missing decoder state outputs, and cache disposal semantics

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
Apply Gemini review nit in Nemo decode loop by replacing a redundant duration expression with Math.max(1, step).

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
Checklist (bot comment IDs):
- [x] 2892132356: guard tokenizer.get_vocab() return type before Object.keys in _resolveVocabSize.
- [x] 2892132367: treat zero cache limits as explicit no-cache mode; do not store/dispose just-produced values.
- [x] 2892132372: dispose processor tensors in Nemo ASR pipeline when cache does not own lifetimes.

Added regression tests for vocab resolution fallback, zero-limit cache semantics, and Nemo pipeline tensor ownership behavior.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
- widen confidenceFromLogits input type to Tensor data arrays
- narrow feature_cache access with explicit typed cast in ASR pipeline
Checklist (bot comment IDs):
- [x] 2892287484: handle array-returning tokenizer vocab in _resolveVocabSize.
- [x] 2892322884: avoid disposing when re-setting the same object for an existing cache key.
- [x] 2892322906: skip caching oversized values to prevent insert-then-dispose of caller-owned tensors.
- [x] 2892322910: guard byteLength type in estimateSizeBytes.

Added regression tests for array vocab sizing, same-object set behavior, oversized value skipping, and non-numeric byteLength handling.

Validation:
- pnpm test -- tests/models.test.js --filter nemo_conformer_tdt
- pnpm test -- tests/pipelines.test.js --filter automatic_speech_recognition
…o pipelines"

This reverts commit b44f7f3.
Align the Nemo ASR pipeline with the shared task contract by returning text-only results by default and chunk-based timestamps for segment and word modes. Add automatic long-audio windowing, decoded-text-driven word reconstruction, and model-local helpers for window merge and chunk assembly. Also add regression coverage for numeric/punctuation word boundaries, windowed merge behavior, and auto-windowed long-form pipeline decoding.
Remove the standalone parakeet feature extractor test from this branch. It exercises an existing parakeet_ctc path that is outside the scope of Conformer TDT integration and makes the PR look broader than it is.
Use conservative sentence boundaries for pipeline timestamps and long-audio cursoring in the NeMo Conformer TDT pipeline. This keeps the HF-style pipeline contract while replacing the old fixed-window merge path with sentence-driven retranscription. Also remove dead NeMo window-merge helpers, delete the obsolete compatibility barrel, and extend the model and pipeline tests around cache handling, timestamps, and long-audio behavior.
Keep the shared ASR pipeline suite focused on the public Nemo contract and move adapter-specific windowing, retranscription, cache-ownership, and disposal coverage into a dedicated Nemo pipeline test file. Narrow the source diff by removing explanatory Nemo comments and reverting unrelated upstream-only tweaks, while also fixing the review findings around cursor snap-forward merging, tokenizer vocab-shape handling, empty timestamp validation, and cache borrow/release semantics for active inference.

Verification:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Apply the remaining valid Nemo Conformer TDT review fixes without widening the shared ASR pipeline surface.

- honor encoder_input_layout for canonical input_features feeds
- keep borrowed cache entries counted until they are actually released
- reject tokenizer-less non-empty word-offset reconstruction
- raise the auto-window budget to match the minimum guaranteed cursor advance
- add focused model and pipeline regressions for each fix

Verified with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
Restore the original cast spacing in the unrelated moonshine path so the Nemo PR does not carry an extra formatting-only diff in automatic-speech-recognition.js.
Resolve sparse tokenizer vocab fallback by deriving the runtime size from the maximum token id instead of counting entries. This keeps decoder sizing correct when tokenizer ids are non-contiguous. Tighten merged-word dedupe so punctuation-only overlaps are only collapsed when their raw normalized text also matches, which avoids dropping distinct punctuation tokens across window boundaries.

Add focused Nemo model regressions and verify with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
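The max-id sizing rule can be sketched as follows (a standalone illustration; the actual `_resolveVocabSize` also handles tokenizer objects and other fallbacks not shown here):

```javascript
// Hypothetical sketch: size the decoder vocab from the maximum token id rather
// than the entry count, so a sparse (non-contiguous) tokenizer vocab still maps
// every id to a valid logit index.
function resolveVocabSizeFromEntries(vocab) {
  // `vocab` may be a Map (WASM binding) or a plain token -> id object.
  const ids = vocab instanceof Map ? [...vocab.values()] : Object.values(vocab);
  if (ids.length === 0) return 0;
  return Math.max(...ids) + 1;
}

// Sparse vocab: 3 entries, but ids run up to 9, so the decoder needs 10 slots.
const sparse = new Map([['a', 0], ['b', 5], ['c', 9]]);
console.log(resolveVocabSizeFromEntries(sparse)); // 10
```

Counting entries would return 3 here, making the decoder read only the first 3 logits and mis-decode any token with id >= 3.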
Treat likely domain suffixes as continuations when tokenizer decoding inserts whitespace after a trailing period, so sequences like `LibriVox. org.` reconstruct as `LibriVox.org.` in detailed word offsets.

Add a focused regression covering the split `.org` token pattern and verify with:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
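A hypothetical sketch of the suffix-continuation heuristic (the suffix list and function name are assumptions; the real implementation works on word-offset objects with timestamps, not bare strings):

```javascript
// Assumed list for illustration; the real set of "likely domain suffixes" may differ.
const DOMAIN_SUFFIXES = new Set(['org', 'com', 'net', 'edu']);

// Hypothetical sketch: when a word ends with "." and the next word is a known
// domain suffix, glue them back together ("LibriVox." + "org." -> "LibriVox.org.").
function mergeDomainSuffixes(words) {
  const merged = [];
  for (const word of words) {
    const prev = merged[merged.length - 1];
    const bare = word.replace(/\.$/, '').toLowerCase();
    if (prev && prev.endsWith('.') && DOMAIN_SUFFIXES.has(bare)) {
      merged[merged.length - 1] = prev + word;
    } else {
      merged.push(word);
    }
  }
  return merged;
}

console.log(mergeDomainSuffixes(['LibriVox.', 'org.'])); // [ 'LibriVox.org.' ]
console.log(mergeDomainSuffixes(['End.', 'Next'])); // [ 'End.', 'Next' ]
```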
This reverts commit 39e5cb1.
📝 Walkthrough

This pull request introduces comprehensive Nemo Conformer TDT integration, adding a new encoder-only ASR model type with feature extraction, transducer-based decoding, audio caching, delta feature computation, and pipeline support. Includes corresponding processor, registry updates, and extensive test coverage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Audio as Audio Input
    participant Processor as NemoConformerTDTProcessor
    participant FeatureExtractor as NemoConformerTDTFeatureExtractor
    participant Model as NemoConformerForTDT
    participant Encoder as ONNX Encoder Session
    participant Decoder as ONNX Decoder Session
    participant Output as Transcription Output

    Audio->>Processor: call(audio)
    Processor->>FeatureExtractor: call(audio)
    FeatureExtractor->>FeatureExtractor: compute spectrogram + preemphasis
    FeatureExtractor->>FeatureExtractor: apply mel-filterbank
    FeatureExtractor->>FeatureExtractor: normalize per-frame
    FeatureExtractor->>FeatureExtractor: optionally cache features
    FeatureExtractor-->>Processor: input_features + attention_mask
    Processor-->>Model: input_features + attention_mask
    Model->>Encoder: encode(input_features)
    Encoder-->>Model: encoder_outputs
    Model->>Model: initialize decoder state
    loop Frame-by-Frame Decoding
        Model->>Decoder: decode(encoder_frame, token, state)
        Decoder-->>Model: logits, output_states
        Model->>Model: argmax logit → token
        Model->>Model: extract duration from logits
        Model->>Model: advance frame by duration
    end
    Model->>Model: build word offsets from tokens + timestamps
    Model-->>Output: text + words + timestamps + confidence
```

```mermaid
sequenceDiagram
    participant Audio as Long Audio
    participant Pipeline as runNemoConformerTDTPipeline
    participant Processor as Processor
    participant Model as Model.transcribe()
    participant Segmenter as Segmenter
    participant Merger as dedupeMergedWords
    participant Output as Final Output

    Audio->>Pipeline: audio (streaming windowed)
    Pipeline->>Pipeline: partition into windows (90s default)
    loop For Each Window
        Pipeline->>Processor: process_window(audio_chunk)
        Processor-->>Pipeline: features
        Pipeline->>Model: transcribe(features)
        Model-->>Pipeline: {text, words[], utteranceTimestamp}
        Pipeline->>Merger: dedupe words from merged decoding
        Merger-->>Pipeline: deduplicated words[]
        Pipeline->>Segmenter: partition words into sentences
        Segmenter-->>Pipeline: segments[]
        Pipeline->>Pipeline: merge with previous segments
        Pipeline->>Pipeline: validate no overlap/gaps
    end
    Pipeline-->>Output: {text, chunks[timestamp, text]}
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
@coderabbitai review

✅ Actions performed: Review triggered.
Review Summary by Qodo

Add NeMo Conformer TDT ASR support with pipeline integration and comprehensive testing
Description

- Adds comprehensive NeMo Conformer TDT ASR model support to transformers.js with full pipeline integration
- Implements NemoConformerForTDT model class with greedy token-and-duration transducer decoding and transcribe() method supporting timestamps, confidence scores, and detailed outputs
- Adds NemoConformerTDTFeatureExtractor with log-mel spectrogram extraction, temporal delta/delta-delta features, and LRU caching with tensor lifecycle management
- Integrates NeMo pipeline into AutomaticSpeechRecognitionPipeline with automatic long-audio windowing (180+ seconds) using sentence-cursor restart logic
- Implements word reconstruction from token sequences with boundary detection and punctuation handling via buildTransducerWordOffsets()
- Adds sentence-level segmentation heuristic with special handling for abbreviations, acronyms, and non-breaking periods
- Registers model, processor, and feature extractor in model registry with dual-artifact file resolution (encoder_model and decoder_model_merged)
- Includes comprehensive test coverage: model initialization/validation, TDT decoding, feature extraction, pipeline integration, long-audio handling, and tensor disposal
- Fixes from review: pending-prefix preservation, vocab handling validation, cache borrow/release, encoder input layout, auto-window budget, and punctuation-only deduplication

Diagram

```mermaid
flowchart LR
    A["Audio Input"] -->|Feature Extraction| B["NemoConformerTDTFeatureExtractor"]
    B -->|Log-mel + Deltas| C["Feature Cache"]
    C -->|Cached Features| D["NemoConformerForTDT Model"]
    D -->|Encoder/Decoder| E["TDT Greedy Decoding"]
    E -->|Token Sequences| F["buildTransducerWordOffsets"]
    F -->|Word Boundaries| G["partitionNemoWordsIntoSegments"]
    G -->|Sentence Chunks| H["ASR Pipeline Output"]
    I["Long Audio 180s+"] -->|Auto-window| J["Sentence-cursor Restart"]
    J -->|Windowed Segments| D
```
File Changes

1. packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
Code Review by Qodo
1. Disposed tensors returned
Code Review
This pull request introduces significant new functionality by adding support for NeMo Conformer TDT ASR models. The changes are extensive, including new model and feature extractor implementations, pipeline integration, long-audio handling, and various utilities for decoding and post-processing. The code is well-structured and includes comprehensive test coverage. My review focuses on improving the maintainability of some of the larger, more complex methods by suggesting refactoring opportunities.
```javascript
// normalize mel features, ignoring padding
const sum = new Float64Array(num_features);
const sum_sq = new Float64Array(num_features);

for (let i = 0; i < features_length; ++i) {
  const offset = i * num_features;
  for (let j = 0; j < num_features; ++j) {
    const val = features_data[offset + j];
    sum[j] += val;
    sum_sq[j] += val * val;
  }
}

// Skip normalization for empty/very short audio to avoid NaN from divide-by-zero.
if (features_length > 0) {
  // Calculate mean and standard deviation, then normalize
  const divisor = features_length > 1 ? features_length - 1 : 1;
  for (let j = 0; j < num_features; ++j) {
    const mean = sum[j] / features_length;
    const variance = (sum_sq[j] - features_length * mean * mean) / divisor;
    const std = Math.sqrt(Math.max(variance, 0)) + EPSILON;
    const inv_std = 1 / std;

    for (let i = 0; i < features_length; ++i) {
      const index = i * num_features + j;
      features_data[index] = (features_data[index] - mean) * inv_std;
    }
  }
}
```
To improve readability and separation of concerns, consider extracting this feature normalization logic into a separate private helper method. The current implementation is nested within the _extract method and is quite dense. A dedicated function, for example _normalizeFeatures(featuresData, numFeatures, featuresLength), would make the _extract method easier to follow.
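The suggested extraction could look like this standalone sketch (the `EPSILON` value here is an assumption and must match the extractor's constant; in the real code this would stay a private method such as `_normalizeFeatures`):

```javascript
const EPSILON = 1e-5; // assumed; must match the feature extractor's constant

// Sketch of the suggested helper: per-feature mean/variance normalization over
// valid (non-padding) frames, in place, using a sample-variance divisor (n - 1).
function normalizeFeatures(featuresData, numFeatures, featuresLength) {
  if (featuresLength <= 0) return; // skip empty/short audio to avoid NaN

  const sum = new Float64Array(numFeatures);
  const sumSq = new Float64Array(numFeatures);
  for (let i = 0; i < featuresLength; ++i) {
    const offset = i * numFeatures;
    for (let j = 0; j < numFeatures; ++j) {
      const val = featuresData[offset + j];
      sum[j] += val;
      sumSq[j] += val * val;
    }
  }

  const divisor = featuresLength > 1 ? featuresLength - 1 : 1;
  for (let j = 0; j < numFeatures; ++j) {
    const mean = sum[j] / featuresLength;
    const variance = (sumSq[j] - featuresLength * mean * mean) / divisor;
    const invStd = 1 / (Math.sqrt(Math.max(variance, 0)) + EPSILON);
    for (let i = 0; i < featuresLength; ++i) {
      const index = i * numFeatures + j;
      featuresData[index] = (featuresData[index] - mean) * invStd;
    }
  }
}

// Two frames of a single feature: after normalization the values are centered.
const demo = new Float32Array([0, 2]);
normalizeFeatures(demo, 1, 2);
console.log(demo[0] + demo[1]); // ~0 (mean removed)
```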
| async transcribe(model_inputs, decode_options = {}) { | ||
| const { | ||
| tokenizer = null, | ||
| returnTimestamps: returnTimestampsOption, | ||
| return_timestamps: legacyReturnTimestamps, | ||
| returnWords: returnWordsOption, | ||
| return_words: legacyReturnWords, | ||
| returnTokens: returnTokensOption, | ||
| return_tokens: legacyReturnTokens, | ||
| returnMetrics: returnMetricsOption, | ||
| return_metrics: legacyReturnMetrics, | ||
| returnFrameConfidences = false, | ||
| returnFrameIndices = false, | ||
| returnLogProbs = false, | ||
| returnTdtSteps = false, | ||
| timeOffset = 0, | ||
| } = decode_options; | ||
| const returnTimestamps = returnTimestampsOption ?? legacyReturnTimestamps ?? true; | ||
| const returnWords = returnWordsOption ?? legacyReturnWords ?? false; | ||
| const returnTokens = returnTokensOption ?? legacyReturnTokens ?? false; | ||
| const returnMetrics = returnMetricsOption ?? legacyReturnMetrics ?? false; | ||
|
|
||
| if (!Number.isFinite(timeOffset)) { | ||
| throw new Error('NemoConformerForTDT.transcribe expected `timeOffset` to be a finite number.'); | ||
| } | ||
| const totalStart = nowMs(); | ||
| const io = this.transducer.io; | ||
| const vocabSize = this._resolveVocabSize(tokenizer); | ||
| this._validateRuntimeConfig(vocabSize); | ||
|
|
||
| const { feeds: encoderFeeds, disposables } = this._buildEncoderFeeds(model_inputs); | ||
| let encoderOutputs; | ||
| const encodeStart = nowMs(); | ||
| try { | ||
| encoderOutputs = await this._runEncoder(encoderFeeds); | ||
| } finally { | ||
| for (const tensor of disposables) { | ||
| tensor.dispose(); | ||
| } | ||
| } | ||
| const encodeMs = nowMs() - encodeStart; | ||
|
|
||
| let frameCount = 0; | ||
| let encoderOutput = null; | ||
| const frameTime = this.transducer.subsampling_factor * this.transducer.frame_shift_s; | ||
|
|
||
| const numLayers = this.transducer.decoder.num_layers; | ||
| const hiddenSize = this.transducer.decoder.hidden_size; | ||
| const blankId = this.transducer.blank_token_id; | ||
| const maxSymbolsPerStep = this.transducer.max_symbols_per_step; | ||
|
|
||
| const needConfidences = !!returnTimestamps; | ||
|
|
||
| /** @type {number[]} */ | ||
| const tokenIds = []; | ||
| /** @type {[number, number][]} */ | ||
| const tokenTimestamps = []; | ||
| /** @type {number[] | null} */ | ||
| const tokenConfidences = needConfidences ? [] : null; | ||
| /** @type {Map<number, { sum: number, count: number }> | null} */ | ||
| const frameConfidenceStats = returnFrameConfidences ? new Map() : null; | ||
| /** @type {number[] | null} */ | ||
| const frameIndices = returnFrameIndices ? [] : null; | ||
| /** @type {number[] | null} */ | ||
| const logProbs = returnLogProbs || needConfidences ? [] : null; | ||
| /** @type {number[] | null} */ | ||
| const tdtSteps = returnTdtSteps ? [] : null; | ||
|
|
||
| let decoderState; | ||
| let targetLengthTensor; | ||
| let reusableFrame = null; | ||
|
|
||
| let emittedOnFrame = 0; | ||
| const decodeStart = nowMs(); | ||
|
|
||
| try { | ||
| encoderOutput = this._getEncoderOutput(encoderOutputs); | ||
| frameCount = this._getEncoderFrameCount(encoderOutput); | ||
| decoderState = { | ||
| state1: new Tensor('float32', new Float32Array(numLayers * hiddenSize), [numLayers, 1, hiddenSize]), | ||
| state2: new Tensor('float32', new Float32Array(numLayers * hiddenSize), [numLayers, 1, hiddenSize]), | ||
| }; | ||
|
|
||
| targetLengthTensor = | ||
| this.transducer.decoder_token_length_dtype === 'int64' | ||
| ? new Tensor('int64', BigInt64Array.from([1n]), [1]) | ||
| : new Tensor('int32', new Int32Array([1]), [1]); | ||
|
|
||
| for (let frameIndex = 0; frameIndex < frameCount; ) { | ||
| const frameData = this._getFrameData(encoderOutput, frameIndex, reusableFrame); | ||
| if (this.transducer.encoder_output_layout === 'BDT') { | ||
| reusableFrame = frameData; | ||
| } | ||
| const frameTensor = this._createFrameTensor(frameData); | ||
| const prevTokenId = tokenIds.length > 0 ? tokenIds[tokenIds.length - 1] : blankId; | ||
| const tokenTensor = | ||
| this.transducer.decoder_token_dtype === 'int64' | ||
| ? new Tensor('int64', BigInt64Array.from([BigInt(prevTokenId)]), [1, 1]) | ||
| : new Tensor('int32', new Int32Array([prevTokenId]), [1, 1]); | ||
|
|
||
| const decoderFeeds = { | ||
| [io.decoder_encoder]: frameTensor, | ||
| [io.decoder_token]: tokenTensor, | ||
| [io.decoder_token_length]: targetLengthTensor, | ||
| [io.decoder_state_1]: decoderState.state1, | ||
| [io.decoder_state_2]: decoderState.state2, | ||
| }; | ||
|
|
||
| let decoderOutput; | ||
| try { | ||
| decoderOutput = await this._runDecoder(decoderFeeds); | ||
| } finally { | ||
| tokenTensor.dispose(); | ||
| frameTensor.dispose(); | ||
| } | ||
|
|
||
| const logits = decoderOutput[io.decoder_output]; | ||
| const outputState1 = decoderOutput[io.decoder_output_state_1]; | ||
| const outputState2 = decoderOutput[io.decoder_output_state_2]; | ||
| const seenDecoderTensors = new Set(); | ||
| for (const value of Object.values(decoderOutput)) { | ||
| if (!(value instanceof Tensor) || seenDecoderTensors.has(value)) continue; | ||
| seenDecoderTensors.add(value); | ||
| if (value === logits || value === outputState1 || value === outputState2) { | ||
| continue; | ||
| } | ||
| value.dispose(); | ||
| } | ||
| if (!(logits instanceof Tensor)) { | ||
| this._disposeDecoderState( | ||
| { | ||
| state1: outputState1, | ||
| state2: outputState2, | ||
| }, | ||
| decoderState, | ||
| ); | ||
| throw new Error( | ||
| `Nemo Conformer TDT decoder output "${io.decoder_output}" was not returned by the session.`, | ||
| ); | ||
| } | ||
| if (!(outputState1 instanceof Tensor) || !(outputState2 instanceof Tensor)) { | ||
| logits.dispose(); | ||
| this._disposeDecoderState( | ||
| { | ||
| state1: outputState1, | ||
| state2: outputState2, | ||
| }, | ||
| decoderState, | ||
| ); | ||
| throw new Error( | ||
| `Nemo Conformer TDT decoder state outputs "${io.decoder_output_state_1}" and "${io.decoder_output_state_2}" were not returned by the session.`, | ||
| ); | ||
| } | ||
| const logitsData = logits.data; | ||
| if (logitsData.length < vocabSize) { | ||
| logits.dispose(); | ||
| this._disposeDecoderState( | ||
| { | ||
| state1: outputState1, | ||
| state2: outputState2, | ||
| }, | ||
| decoderState, | ||
| ); | ||
| throw new Error( | ||
| `Nemo Conformer TDT decoder output is too small (${logitsData.length}) for vocab_size=${vocabSize}.`, | ||
| ); | ||
| } | ||
| const tokenId = argmax(logitsData, 0, vocabSize); | ||
| const durationStart = this.transducer.duration_start_index ?? vocabSize; | ||
| const hasDurationLogits = logitsData.length > durationStart; | ||
| if (this.transducer.duration_start_index != null && !hasDurationLogits) { | ||
| logits.dispose(); | ||
| this._disposeDecoderState( | ||
| { | ||
| state1: outputState1, | ||
| state2: outputState2, | ||
| }, | ||
| decoderState, | ||
| ); | ||
| throw new Error( | ||
| `Nemo Conformer TDT decoder output is missing duration logits: expected values beyond index ${durationStart - 1}, got length=${logitsData.length}.`, | ||
| ); | ||
| } | ||
| const step = hasDurationLogits | ||
| ? argmax(logitsData, durationStart, logitsData.length - durationStart) - durationStart | ||
| : 0; | ||
| if (tdtSteps) { | ||
| tdtSteps.push(step); | ||
| } | ||
|
|
||
| const maybeConfidence = | ||
| needConfidences || returnLogProbs || returnFrameConfidences | ||
| ? confidenceFromLogits(logitsData, tokenId, vocabSize) | ||
| : null; | ||
| if (frameConfidenceStats && maybeConfidence) { | ||
| const stats = frameConfidenceStats.get(frameIndex); | ||
| if (stats) { | ||
| stats.sum += maybeConfidence.confidence; | ||
| stats.count += 1; | ||
| } else { | ||
| frameConfidenceStats.set(frameIndex, { sum: maybeConfidence.confidence, count: 1 }); | ||
| } | ||
| } | ||
|
|
||
| const newState = { | ||
| state1: outputState1, | ||
| state2: outputState2, | ||
| }; | ||
|
|
||
| if (tokenId !== blankId) { | ||
| this._disposeDecoderState(decoderState, newState); | ||
| decoderState = newState; | ||
|
|
||
| tokenIds.push(tokenId); | ||
| // TDT duration convention: step=0 means "stay on current frame" (duration index 0 = no advance). | ||
| // We still associate the token with this frame, so durationFrames is at least 1. | ||
| const durationFrames = Math.max(1, step); | ||
| const endFrame = Math.min(frameCount, frameIndex + durationFrames); | ||
| tokenTimestamps.push([ | ||
| roundTs(frameIndex * frameTime + timeOffset), | ||
| roundTs(endFrame * frameTime + timeOffset), | ||
| ]); | ||
| if (tokenConfidences && maybeConfidence) { | ||
| tokenConfidences.push(maybeConfidence.confidence); | ||
| } | ||
| if (frameIndices) { | ||
| frameIndices.push(frameIndex); | ||
| } | ||
| if (logProbs && maybeConfidence) { | ||
| logProbs.push(maybeConfidence.logProb); | ||
| } | ||
| emittedOnFrame += 1; | ||
| } else { | ||
| this._disposeDecoderState(newState, decoderState); | ||
| } | ||
|
|
||
| logits.dispose(); | ||
|
|
||
| if (step > 0) { | ||
| frameIndex += step; | ||
| emittedOnFrame = 0; | ||
| } else if (tokenId === blankId || emittedOnFrame >= maxSymbolsPerStep) { | ||
| frameIndex += 1; | ||
| emittedOnFrame = 0; | ||
| } | ||
| } | ||
| } finally { | ||
```js
      if (targetLengthTensor) targetLengthTensor.dispose();
      if (decoderState) this._disposeDecoderState(decoderState);
      if (encoderOutputs && typeof encoderOutputs === 'object') {
        const seen = new Set();
        for (const value of Object.values(encoderOutputs)) {
          if (value instanceof Tensor && !seen.has(value)) {
            value.dispose();
            seen.add(value);
          }
        }
      }
    }
    const decodeMs = nowMs() - decodeStart;

    const tokenizeStart = nowMs();
    const text = decodeTransducerText(tokenizer, tokenIds);
    const needDetailed = returnTimestamps && (returnWords || returnTokens);
    const detailed = needDetailed
      ? buildTransducerDetailedOutputs(tokenizer, tokenIds, tokenTimestamps, tokenConfidences)
      : null;
    const tokenizeMs = nowMs() - tokenizeStart;

    /** @type {any} */
    const result = { text, isFinal: true };
    const utteranceConfidence =
      tokenConfidences && tokenConfidences.length > 0
        ? roundMetric(tokenConfidences.reduce((a, b) => a + b, 0) / tokenConfidences.length, 6)
        : null;
    const utteranceTimestamp =
      tokenTimestamps.length > 0
        ? /** @type {[number, number]} */ ([
            tokenTimestamps[0][0],
            tokenTimestamps[tokenTimestamps.length - 1][1],
          ])
        : /** @type {[number, number]} */ ([roundTs(timeOffset), roundTs(frameCount * frameTime + timeOffset)]);
    const averageLogProb =
      logProbs && logProbs.length > 0
        ? roundMetric(logProbs.reduce((a, b) => a + b, 0) / logProbs.length, 6)
        : null;

    if (returnTimestamps) {
      result.utteranceTimestamp = utteranceTimestamp;

      if (detailed) {
        if (returnWords) result.words = detailed.words;
        if (returnTokens) result.tokens = detailed.tokens;
      }

      result.confidence = {
        utterance: utteranceConfidence,
        wordAverage: detailed?.wordAverage != null ? roundMetric(detailed.wordAverage, 6) : null,
        averageLogProb,
      };
    }

    // Frame confidences are independent of return_timestamps — emit whenever requested.
    if (returnFrameConfidences && frameConfidenceStats && frameConfidenceStats.size > 0) {
      const frameConfidences = [];
      for (const { sum, count } of frameConfidenceStats.values()) {
        frameConfidences.push(sum / count);
      }
      result.confidence = {
        ...(result.confidence ?? {}),
        frames: frameConfidences,
        frameAverage: roundMetric(frameConfidences.reduce((a, b) => a + b, 0) / frameConfidences.length, 6),
      };
    }

    if (!returnTimestamps && averageLogProb != null) {
      result.confidence = {
        ...(result.confidence ?? {}),
        averageLogProb,
      };
    }

    const debug = {};
    if (returnFrameIndices) {
      debug.frameIndices = frameIndices;
    }
    if (returnLogProbs) {
      debug.logProbs = logProbs;
    }
    if (returnTdtSteps) {
      debug.tdtSteps = tdtSteps;
    }
    if (Object.keys(debug).length > 0) {
      result.debug = debug;
    }

    if (returnMetrics) {
      const totalMs = nowMs() - totalStart;
      const utteranceDuration = result.utteranceTimestamp
        ? Math.max(result.utteranceTimestamp[1] - result.utteranceTimestamp[0], 1e-8)
        : Math.max(frameCount * frameTime, 1e-8);
      const rtf = totalMs / 1000 / utteranceDuration;
      result.metrics = {
        preprocessMs: 0.0,
        encodeMs: roundMetric(encodeMs, 2),
        decodeMs: roundMetric(decodeMs, 2),
        tokenizeMs: roundMetric(tokenizeMs, 2),
        totalMs: roundMetric(totalMs, 2),
        rtf: roundMetric(rtf, 4),
        rtfX: roundMetric(1 / Math.max(rtf, 1e-8), 2),
      };
    }

    return result;
  }
```
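The confidence and metrics math in this block can be sketched in isolation. `roundMetric` is reimplemented here to mirror the snippet above; the input values are made up for illustration:

```javascript
// Minimal re-implementation of the aggregation math in the diff above.
// Names (roundMetric, rtf, rtfX) come from the diff; values are illustrative.
const roundMetric = (value, digits) => {
  const factor = 10 ** digits;
  return Math.round(value * factor) / factor;
};

// Utterance confidence: mean of per-token confidences, rounded to 6 decimals.
const tokenConfidences = [0.91, 0.87, 0.9533333];
const utteranceConfidence = roundMetric(
  tokenConfidences.reduce((a, b) => a + b, 0) / tokenConfidences.length,
  6,
);

// Real-time factor: processing time (seconds) divided by audio duration (seconds).
const totalMs = 250;
const utteranceDuration = 10.0; // seconds of audio
const rtf = totalMs / 1000 / utteranceDuration;
const rtfX = roundMetric(1 / Math.max(rtf, 1e-8), 2); // speed-up factor

console.log(utteranceConfidence, roundMetric(rtf, 4), rtfX); // 0.911111 0.025 40
```

An `rtfX` of 40 means the utterance was processed forty times faster than real time, which is the reading most users want from a metrics summary.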
The transcribe method is quite long and complex, which could impact readability and future maintenance. I recommend refactoring it by breaking it down into smaller, more focused private methods. For instance, the main decoding loop (lines 676-833) could be extracted into a _greedyDecodeLoop method, and the result construction logic (lines 849-941) could be moved to a _buildTranscribeOutput method. This would make the overall structure of the transcribe method clearer and easier to understand.
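The suggested decomposition might look like the following shape sketch. The method names `_greedyDecodeLoop` and `_buildTranscribeOutput` come from the review suggestion; the bodies are stubs, not the real `NemoConformerForTDT` logic:

```javascript
// Shape sketch only: the split mirrors the review suggestion, the internals
// are placeholders so the example is self-contained and runnable.
class TranscriberSketch {
  transcribe(features) {
    const encoded = this._encode(features);          // encoder run
    const decoded = this._greedyDecodeLoop(encoded); // former inline decode loop
    return this._buildTranscribeOutput(decoded);     // former inline result assembly
  }
  _encode(features) {
    return { frames: features.length };
  }
  _greedyDecodeLoop(encoded) {
    return { tokenIds: Array.from({ length: encoded.frames }, (_, i) => i) };
  }
  _buildTranscribeOutput({ tokenIds }) {
    return { text: tokenIds.join(" "), isFinal: true };
  }
}

const out = new TranscriberSketch().transcribe([0.1, 0.2, 0.3]);
console.log(out); // { text: "0 1 2", isFinal: true }
```

With this split, each stage can also be unit-tested in isolation, which is hard to do while the loop and the result assembly live inline in `transcribe`.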
```js
      this.cache.delete(key);
      this.cache.set(key, existing);
      return true;
    } else {
```
CRITICAL: Potential memory leak - when `existing?.value === value` but `existing.size_bytes > max_bytes`, the existing entry is deleted and `false` is returned, but the caller's value is neither cached nor returned to the caller. The caller expects to retain ownership (since `false` is returned), but the value is lost.

The fix should either:

- Store the value in cache regardless and return ownership status appropriately, or
- Return the value to the caller when the cache cannot store it
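A hypothetical sketch of the second option (this is not the real `FeatureLRUCache`; only `max_bytes` and `size_bytes` are taken from the diff): when a same-value entry can no longer stay cached because of the size cap, the entry is forgotten without disposing the value, so `false` reliably means "the caller keeps ownership of a still-live value".

```javascript
// Illustrative ownership contract, NOT the library implementation.
class OwnershipCache {
  constructor(max_bytes) {
    this.max_bytes = max_bytes;
    this.cache = new Map();
  }
  set(key, value, size_bytes) {
    const existing = this.cache.get(key);
    if (existing?.value === value && size_bytes > this.max_bytes) {
      this.cache.delete(key); // drop the index entry only; never dispose `value`
      return false;           // caller retains ownership of a live value
    }
    if (size_bytes > this.max_bytes) return false; // too big to ever store
    this.cache.delete(key);                        // refresh recency
    this.cache.set(key, { value, size_bytes });
    return true;                                   // cache now owns the value
  }
}

const cache = new OwnershipCache(8);
const tensorLike = { disposed: false };
cache.set("features", tensorLike, 4);               // stored, cache owns it
const owns = cache.set("features", tensorLike, 16); // same value, now too big
console.log(owns, tensorLike.disposed, cache.cache.size); // false false 0
```

The key invariant: `set()` returning `false` must never coincide with the cache having disposed the value it was handed.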
Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Issue Details

CRITICAL - Memory Leak in Cache (Line 94-103)

Detailed Analysis

1. Correctness Review

The same-value refresh branch:

```js
if (existing?.value === value) {
  if (existing.size_bytes <= max_bytes) {
    // Refresh recency - works correctly
    this.cache.delete(key);
    this.cache.set(key, existing);
    return true;
  } else {
    this._deleteEntry(key, existing); // Deletes existing
    return false; // Returns false but caller's value is lost!
  }
}
```

When the same object is passed again but the cache entry size exceeds the limit, the existing entry is deleted and `false` is returned, leaving the caller's value neither cached nor safely handed back.

Recommended Fix:

```js
} else {
  // Cannot refresh same-value entry due to size limit
  // Caller retains ownership, value is not lost
  this._deleteEntry(key, existing);
  // Note: caller still has their value reference
  return false;
}
```

Wait - actually, the value IS the same reference as `existing.value`, so the caller still has their reference. The issue is that `_deleteEntry` might dispose the value (since it thinks the cache owns it).

Re-assessment: this might NOT be a bug after all - returning `false` indicates the cache doesn't own the value (since it couldn't store it), while the caller still holds a reference to the same object, and the comment "Refresh recency for unchanged value without invalidating caller-owned references" suggests the design expects exactly that. However, there's still a subtle issue. Looking at `_deleteEntry`:

```js
_deleteEntry(key, entry) {
  const current = this.cache.get(key);
  if (current !== entry) {
    return;
  }
  this.cache.delete(key);
  if (entry.borrowers > 0) {
    entry.pendingDispose = true;
  } else {
    this.current_size_bytes -= entry.size_bytes;
    disposeCachedValue(entry.value); // This disposes!
  }
}
```

Yes, this IS a bug: when `entry.borrowers` is 0, `_deleteEntry` calls `disposeCachedValue(entry.value)`, disposing the very object the caller still references, even though the `false` return tells the caller the value is still theirs to use. This is a real bug that needs fixing.

2. Security Review

No concrete security issues identified in this diff. The code properly validates audio inputs, uses type checking, and doesn't appear to have injection vulnerabilities.

3. Performance Review

The implementation includes several good performance practices. No major performance concerns detected.

4. Reliability Review

The memory issue noted above is the main reliability concern.

5. Test Coverage

Comprehensive test coverage is included.

Merge Recommendation

Request Changes - The memory issue in the cache needs to be addressed before merge. The bug could cause callers to hold references to disposed tensors when same-value entries are evicted due to size limits.

Files Reviewed (8 files)
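The failure mode flagged above can be demonstrated with a minimal mock (this is not the real `FeatureLRUCache`; the mock only reproduces the contract violation): `set()` disposes the inserted value during eviction yet still returns `false`, so the caller wrongly assumes its reference is live.

```javascript
// Minimal mock of the reviewed bug, not library code.
function disposeCachedValue(v) {
  v.disposed = true;
}
class LeakyCacheMock {
  set(key, value) {
    disposeCachedValue(value); // simulate eviction of the just-inserted entry
    return false;              // caller reads this as "I still own a live value"
  }
}

const tensorLike = { disposed: false, data: new Float32Array(4) };
const cacheOwns = new LeakyCacheMock().set("audio-key", tensorLike);
const hazard = !cacheOwns && tensorLike.disposed;
console.log(hazard); // true: caller-owned by contract, but already disposed
```

Any fix needs to break this combination: either the return value signals the disposal, or eviction of a freshly inserted value never disposes it.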
```js
    if (this.feature_cache) {
      const key = `${createAudioCacheKey(audio, this.config.sampling_rate)}:${this.delta_order}:${this.delta_window}:${this.delta_concatenate}`;
      const cached = this.feature_cache.acquire(key);
      if (cached) {
        return tagNemoFeatureOutputOwnership({ ...cached.value }, true, cached.release);
      }

      const extracted = await this._extract(audio);
      const cacheOwnsTensors = this.feature_cache.set(key, extracted);
      if (!cacheOwnsTensors) {
        return tagNemoFeatureOutputOwnership({ ...extracted }, false);
      }
```
1. Disposed tensors returned 🐞 Bug ⛯ Reliability
`NemoConformerTDTFeatureExtractor._call()` returns `extracted` as caller-owned when `FeatureLRUCache.set()` returns false, but `FeatureLRUCache.set()` may already have evicted and disposed that same value, yielding disposed tensors to downstream code.
Agent Prompt
### Issue description
`NemoConformerTDTFeatureExtractor` treats `FeatureLRUCache.set(...)=false` as “cache did not retain ownership, so the returned tensors are caller-owned and valid”. But the cache can return `false` after evicting and disposing the inserted value, which can cause downstream use-after-dispose/double-dispose.
### Issue Context
- `FeatureLRUCache.set()` currently calls `_evict()` and may dispose entries via `_deleteEntry()`.
- Repo tests already assert that `set()` can return `false` while disposing the inserted value.
- The feature extractor returns `extracted` when `set()` returns `false`, which becomes unsafe.
### Fix Focus Areas
- packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js[83-176]
- packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js[158-178]
- packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js[1114-1147]
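One possible shape for a fix to the pattern flagged above is a richer insert result. Here `stored` and `disposedOnInsert` are assumed fields of a hypothetical `set()` return value, NOT the current boolean API, so the extractor can only hand tensors back when they are known to be live:

```javascript
// Hypothetical ownership guard; the cache API shown is an assumption.
function cacheExtractedFeatures(cache, key, extracted) {
  const outcome = cache.set(key, extracted);
  if (outcome.stored) {
    return { value: extracted, callerOwns: false }; // cache manages lifetime
  }
  if (outcome.disposedOnInsert) {
    throw new Error("cache disposed fresh features; caller must re-extract");
  }
  return { value: extracted, callerOwns: true }; // safe: tensors still live
}

// Stub cache that rejects the insert but leaves the value untouched.
const stubCache = { set: () => ({ stored: false, disposedOnInsert: false }) };
const result = cacheExtractedFeatures(stubCache, "key", { data: [1, 2, 3] });
console.log(result.callerOwns); // true
```

The point of the extra field is that "not stored" and "not stored, and already disposed" become distinguishable at the call site, which the current boolean return cannot express.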
Actionable comments posted: 10
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2dee9b19-9b6b-4dc5-9cff-42ab817a924e
📒 Files selected for processing (22)

- packages/transformers/src/models/feature_extractors.js
- packages/transformers/src/models/modeling_utils.js
- packages/transformers/src/models/models.js
- packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js
- packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
- packages/transformers/src/models/processors.js
- packages/transformers/src/models/registry.js
- packages/transformers/src/pipelines/automatic-speech-recognition.js
- packages/transformers/src/pipelines/index.js
- packages/transformers/src/utils/model_registry/get_model_files.js
- packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
- packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
- packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
- packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js
📜 Review details
🧰 Additional context used
🧠 Learnings (11)
📓 Common learnings
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
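The frame-count difference that learning describes can be checked numerically. The parameter values below are illustrative, not taken from any model config:

```javascript
// Numeric check of the off-by-one between the extractor formula and the
// centered-STFT internal frame count. Values are illustrative.
const audioLength = 16000;
const n_fft = 512;
const hop_length = 160;
const pad = Math.floor(n_fft / 2);

// Extractor's formula: excludes the terminal fully-padded frame.
const rawFeaturesLength = Math.floor((audioLength + pad * 2 - n_fft) / hop_length);

// Internal centered-STFT count on the padded signal: 1 + floor((L - frame) / hop).
const paddedLength = audioLength + pad * 2;
const stftFrames = 1 + Math.floor((paddedLength - n_fft) / hop_length);

console.log(rawFeaturesLength, stftFrames); // 100 101: differs by one frame
```

The single dropped frame is exactly the terminal fully-padded frame, keeping masks and normalization bounded to non-padding frames as the learning states.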
📚 Learning: 2026-03-03T23:00:02.309Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T23:00:02.309Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`, `NemoConformerForTDT` overrides `from_pretrained` (line 229) and explicitly calls `constructSessions` with `{ encoder_model: 'encoder_model', decoder_model_merged: 'decoder_model_merged' }`. It does NOT rely on the generic model-type branch logic in `modeling_utils.js`, so no `MODEL_TYPES.NemoConformerTDT` branch is needed there.
Applied to files:
packages/transformers/src/models/modeling_utils.js, packages/transformers/src/models/processors.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/src/models/feature_extractors.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/registry.js, packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/src/models/models.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T23:43:17.261Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 12
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:98-135
Timestamp: 2026-03-05T23:43:17.261Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `FeatureLRUCache` does not share `Tensor` objects across cache keys. Each cache entry owns a distinct tensor instance produced by an independent extraction call. Ref-count tracking across entries is therefore unnecessary and should not be flagged as a missing safety mechanism unless cross-key tensor sharing is explicitly introduced.
Applied to files:
packages/transformers/src/models/modeling_utils.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/feature_extractors.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/registry.js, packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, packages/transformers/src/models/models.js, packages/transformers/src/utils/model_registry/get_model_files.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T22:59:08.656Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js:521-535
Timestamp: 2026-03-03T22:59:08.656Z
Learning: In `NemoConformerForTDT.transcribe` (`packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js`), passing `return_words=true` or `return_tokens=true` without `return_timestamps=true` silently no-ops by design. This permissive behavior is intentional for backward compatibility; do not flag it as a bug.
Applied to files:
packages/transformers/src/models/modeling_utils.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/models/models.js, packages/transformers/src/utils/model_registry/get_model_files.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:30:12.192Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js:26-26
Timestamp: 2026-03-03T23:30:12.192Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js`, the `batch` variable destructured from `input_features.dims` via `const [batch, T, F] = input_features.dims;` is intentionally used in all output tensor shape arrays (e.g., `[batch, T, F]`, `[batch, T, F * 2]`, `[batch, T, F * 3]`). Do not flag it as an unused variable.
Applied to files:
packages/transformers/src/models/modeling_utils.js, packages/transformers/src/models/processors.js, packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/feature_extractors.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/registry.js, packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, packages/transformers/src/models/models.js, packages/transformers/src/utils/model_registry/get_model_files.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-08T20:49:40.469Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js:274-282
Timestamp: 2026-03-08T20:49:40.469Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `preemphasis` range validation is intentionally deferred to the async feature extraction path (`_call`/`_extract`), not the constructor. Tests must use `await expect(extractor(audio)).rejects.toThrow("preemphasis")`, not a synchronous constructor-throw assertion. Do not flag this pattern as incorrect.
Applied to files:
packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/feature_extractors.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, packages/transformers/src/models/models.js, packages/transformers/src/utils/model_registry/get_model_files.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-05T21:16:49.338Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 11
File: packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js:162-167
Timestamp: 2026-03-05T21:16:49.338Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js`, `raw_features_length` is computed as `Math.floor((audio.length + Math.floor(n_fft / 2) * 2 - n_fft) / hop_length)` rather than using the spectrogram's internal `1 + Math.floor((length - frame_length) / hop_length)` formula. This intentional mismatch excludes the terminal fully-padded frame produced by the centered STFT, matching existing Parakeet extractor behavior and keeping mask/normalization bounded to non-padding frames. Do not flag this as a frame-count formula mismatch.
Applied to files:
packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/feature_extractors.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, packages/transformers/src/models/models.js, packages/transformers/src/utils/model_registry/get_model_files.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-08T22:17:10.946Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js:176-182
Timestamp: 2026-03-08T22:17:10.946Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js`, both `runNemoTranscribe` call sites intentionally pass `returnMetrics: false`. This is the current pipeline adapter contract; metrics are not collected at the pipeline layer. The dedicated adapter tests in `test_pipelines_nemo_conformer_tdt.js` explicitly match this behavior. Do not flag `returnMetrics: false` as a regression or suggest changing it to `true`.
Applied to files:
packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/src/models/registry.js, packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/pipelines/automatic-speech-recognition.js, packages/transformers/src/models/models.js, packages/transformers/src/utils/model_registry/get_model_files.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/src/pipelines/index.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-03T23:29:56.832Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:9-25
Timestamp: 2026-03-03T23:29:56.832Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally hashes every sample in the audio buffer (full iteration, not a subset). This is a deliberate design choice to minimize cache-key collisions and prevent cross-audio cache contamination. Do not flag the O(n) iteration as a performance issue or suggest subset-sampling strategies.
Applied to files:
packages/transformers/src/models/nemo_conformer_tdt/processing_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_window_merge.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/modeling_nemo_conformer_tdt.js
📚 Learning: 2026-03-08T21:41:36.882Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 14
File: packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js:17-24
Timestamp: 2026-03-08T21:41:36.882Z
Learning: In `packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js`, `createAudioCacheKey` intentionally quantizes each sample to int16 (via `Math.round(sample * 32768)`) before hashing. This is a deliberate design choice for deterministic cross-runtime key stability. The resulting aliasing of distinct waveforms that fall in the same int16 bins is an accepted tradeoff because feature caching is optional and off by default. Do not flag the int16 quantization as a collision risk or suggest hashing raw bytes instead.
Applied to files:
packages/transformers/src/models/nemo_conformer_tdt/transducer_text.js, packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js, packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/pipeline_nemo_conformer_tdt.js, packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js, packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/feature_extraction_nemo_conformer_tdt.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_segment_offsets.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_deltas.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js, packages/transformers/src/models/nemo_conformer_tdt/transducer_word_offsets.js
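The int16-quantized hashing that learning describes can be sketched as follows. Only the quantization step (`Math.round(sample * 32768)`) comes from the learning; the FNV-1a-style hash and the key layout here are assumptions, not the real `createAudioCacheKey`:

```javascript
// Illustrative cache key over int16-quantized samples (not the library function).
function audioKeySketch(audio, samplingRate) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (const sample of audio) {
    const q = Math.round(sample * 32768) & 0xffff; // deterministic int16 bin
    hash = (hash ^ q) >>> 0;
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, stay unsigned 32-bit
  }
  return `${samplingRate}:${audio.length}:${hash.toString(16)}`;
}

const a = audioKeySketch(new Float32Array([0.1, -0.2, 0.3]), 16000);
const b = audioKeySketch(new Float32Array([0.1, -0.2, 0.3]), 16000);
console.log(a === b); // true: identical audio always maps to the same key
```

Quantizing before hashing trades a small aliasing risk (distinct waveforms in the same int16 bins collide) for keys that are stable across runtimes and float environments, which is the tradeoff the learning records.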
📚 Learning: 2026-03-03T22:59:55.984Z
Learnt from: ysdede
Repo: ysdede/transformers.js PR: 8
File: packages/transformers/src/utils/model_registry/get_model_files.js:169-171
Timestamp: 2026-03-03T22:59:55.984Z
Learning: When a model subclass overrides from_pretrained and bypasses the generic model-type branch, do not introduce or rely on a MODEL_TYPES entry for that subclass in the model registry. For this NemoConformerTDT case, in packages/transformers/src/utils/model_registry/get_model_files.js, avoid adding a NemoConformerTDT entry in the model type map; rely on the override logic in modeling_nemo_conformer_tdt.js. This keeps the registry explicit to the actual file and prevents unintended dispatch through the generic branch.
Applied to files:
packages/transformers/src/utils/model_registry/get_model_files.js
🔇 Additional comments (7)
packages/transformers/src/models/nemo_conformer_tdt/transducer_cache.js (1)
52-56: 🛠️ Refactor suggestion | 🟠 Major
`get()` bypasses the borrow/release lifetime model.

It returns cache-owned tensors without extending their lifetime, so a caller can retain the value across `set()` or `clear()` and then touch disposed tensors. The safe public surface here is `acquire()`/`release()` only.

⛔ Skipped due to learnings
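The `acquire()`/`release()` lifetime model the comment refers to can be sketched with a borrow-counted entry. The field names `borrowers` and `pendingDispose` come from the `_deleteEntry` snippet earlier in this review; everything else is illustrative:

```javascript
// Sketch of deferred disposal via borrow counting (not the real cache entry).
class BorrowedEntry {
  constructor(value) {
    this.value = value;
    this.borrowers = 0;
    this.pendingDispose = false;
  }
  acquire() {
    this.borrowers += 1;
    return { value: this.value, release: () => this._release() };
  }
  _release() {
    this.borrowers -= 1;
    if (this.pendingDispose && this.borrowers === 0) {
      this.value.disposed = true; // stands in for disposeCachedValue()
    }
  }
  evict() {
    if (this.borrowers > 0) this.pendingDispose = true; // defer until released
    else this.value.disposed = true;
  }
}

const entry = new BorrowedEntry({ disposed: false });
const borrowed = entry.acquire();
entry.evict(); // deferred: a borrower is still active
console.log(borrowed.value.disposed); // false: still safe to use
borrowed.release();
console.log(entry.value.disposed); // true: disposed once the last borrower released
```

A raw `get()` skips the `borrowers` increment, which is exactly why eviction can dispose a value the `get()` caller still holds.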
packages/transformers/src/utils/model_registry/get_model_files.js (1)
180-182: Load the canonical NeMo ONNX pair here. These basenames match the session keys hard-coded in NemoConformerForTDT.from_pretrained(), so this branch will request the exact artifacts the custom loader expects.

packages/transformers/src/models/feature_extractors.js (1)
8-8: Expose the feature extractor through the central barrel. This keeps the new NeMo extractor discoverable through the same public surface as the other model-specific extractors.
packages/transformers/src/models/processors.js (1)
11-11: Expose the processor through the central barrel. This is the right public entry point for the new NeMo processor.
packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js (1)
128-203: Good contract coverage for the NeMo ASR adapter. These cases pin the three public return shapes and the options forwarded into model.transcribe(), which is exactly where this integration can regress.

packages/transformers/src/pipelines/index.js (1)
33-33: Wire the ASR task to the concrete NeMo loader. Including NemoConformerForTDT here is the right way to let pipeline('automatic-speech-recognition', ...) resolve the custom class without disturbing the existing seq2seq and CTC loaders. Also applies to: 154-155

packages/transformers/src/pipelines/automatic-speech-recognition.js (1)
packages/transformers/src/pipelines/automatic-speech-recognition.js (1)
6-8: Keep the NeMo path behind a dedicated adapter. The new switch case and helper isolate the model-specific behavior in runNemoConformerTDTPipeline and leave the Whisper and Wav2Vec2 paths untouched. Also applies to: 158-159, 308-316
if (!Number.isInteger(this.config.n_fft) || this.config.n_fft <= 0) {
  throw new Error(
    `NemoConformerTDTFeatureExtractor expected \`n_fft\` as a positive integer, got ${this.config.n_fft}.`,
  );
}
if (
  !Number.isInteger(this.config.win_length) ||
  this.config.win_length <= 0 ||
  this.config.win_length > this.config.n_fft
) {
  throw new Error(
    `NemoConformerTDTFeatureExtractor expected \`win_length\` in [1, n_fft], got win_length=${this.config.win_length}, n_fft=${this.config.n_fft}.`,
  );
}

// Prefer given `mel_filters` from preprocessor_config.json, or calculate them if they don't exist.
this.config.mel_filters ??= mel_filter_bank(
  Math.floor(1 + this.config.n_fft / 2), // num_frequency_bins
  this.config.feature_size, // num_mel_filters
  0.0, // min_frequency
  this.config.sampling_rate / 2, // max_frequency
  this.config.sampling_rate, // sampling_rate
  'slaney', // norm
  'slaney', // mel_scale
);
Reject invalid hop_length, feature_size, and sampling_rate in the constructor.
The constructor only guards n_fft and win_length. Invalid values for these other core fields still flow into mel_filter_bank(), spectrogram(), and frame-length math, so a bad model config fails later and less predictably.
Proposed fix
if (
!Number.isInteger(this.config.win_length) ||
this.config.win_length <= 0 ||
this.config.win_length > this.config.n_fft
) {
throw new Error(
`NemoConformerTDTFeatureExtractor expected \`win_length\` in [1, n_fft], got win_length=${this.config.win_length}, n_fft=${this.config.n_fft}.`,
);
}
+ if (!Number.isInteger(this.config.hop_length) || this.config.hop_length <= 0) {
+ throw new Error(
+ `NemoConformerTDTFeatureExtractor expected \`hop_length\` as a positive integer, got ${this.config.hop_length}.`,
+ );
+ }
+ if (!Number.isInteger(this.config.feature_size) || this.config.feature_size <= 0) {
+ throw new Error(
+ `NemoConformerTDTFeatureExtractor expected \`feature_size\` as a positive integer, got ${this.config.feature_size}.`,
+ );
+ }
+ if (!Number.isFinite(this.config.sampling_rate) || this.config.sampling_rate <= 0) {
+ throw new Error(
+ `NemoConformerTDTFeatureExtractor expected \`sampling_rate\` as a positive number, got ${this.config.sampling_rate}.`,
+ );
+ }
// Prefer given `mel_filters` from preprocessor_config.json, or calculate them if they don't exist.
this.config.mel_filters ??= mel_filter_bank(

const needDetailed = returnTimestamps && (returnWords || returnTokens);
const detailed = needDetailed
  ? buildTransducerDetailedOutputs(tokenizer, tokenIds, tokenTimestamps, tokenConfidences)
  : null;
const tokenizeMs = nowMs() - tokenizeStart;

/** @type {any} */
const result = { text, isFinal: true };
const utteranceConfidence =
  tokenConfidences && tokenConfidences.length > 0
    ? roundMetric(tokenConfidences.reduce((a, b) => a + b, 0) / tokenConfidences.length, 6)
    : null;
const utteranceTimestamp =
  tokenTimestamps.length > 0
    ? /** @type {[number, number]} */ ([
        tokenTimestamps[0][0],
        tokenTimestamps[tokenTimestamps.length - 1][1],
      ])
    : /** @type {[number, number]} */ ([roundTs(timeOffset), roundTs(frameCount * frameTime + timeOffset)]);
const averageLogProb =
  logProbs && logProbs.length > 0
    ? roundMetric(logProbs.reduce((a, b) => a + b, 0) / logProbs.length, 6)
    : null;

if (returnTimestamps) {
  result.utteranceTimestamp = utteranceTimestamp;

  if (detailed) {
    if (returnWords) result.words = detailed.words;
    if (returnTokens) result.tokens = detailed.tokens;
  }

  result.confidence = {
    utterance: utteranceConfidence,
    wordAverage: detailed?.wordAverage != null ? roundMetric(detailed.wordAverage, 6) : null,
    averageLogProb,
confidence.wordAverage should not depend on returnWords or returnTokens.
Right now wordAverage is only computed when detailed word/token outputs are requested, so transcribe({ returnTimestamps: true, tokenizer }) returns confidence.wordAverage: null even though the tokenizer and per-token confidences are already available.
Proposed fix
- const needDetailed = returnTimestamps && (returnWords || returnTokens);
- const detailed = needDetailed
+ const needDetailed = returnTimestamps && (returnWords || returnTokens);
+ const needWordAverage = returnTimestamps && !!tokenizer;
+ const detailed = needDetailed || needWordAverage
? buildTransducerDetailedOutputs(tokenizer, tokenIds, tokenTimestamps, tokenConfidences)
    : null;

/**
 * Preprocess raw audio for Nemo Conformer TDT models.
 * @param {Float32Array|Float64Array} audio
 */
async _call(audio) {
  return await this.feature_extractor(audio);
}
🛠️ Refactor suggestion | 🟠 Major
Remove the unnecessary _call override or forward ...args to preserve the processor contract.
The base Processor._call(input, ...args) dispatches to the first available processor component, including feature_extractor. This override duplicates that logic while narrowing the signature and silently discarding per-call extractor options. Either remove it entirely and rely on the base class, or if retained, forward ...args:
Fix
/**
* Preprocess raw audio for Nemo Conformer TDT models.
 * @param {Float32Array|Float64Array} audio
*/
- async _call(audio) {
- return await this.feature_extractor(audio);
+ async _call(audio, ...args) {
+ return await this.feature_extractor(audio, ...args);
  }
export function joinTimedWords(words) {
  let text = '';
  for (const word of words) {
    const part = word.text ?? '';
    if (!part) continue;
    if (!text) {
      text = part;
    } else if (/^[,.;:!?)}\]]+$/.test(part)) {
      text += part;
    } else {
      text += ` ${part}`;
    }
joinTimedWords() still formats opener and numeric affix tokens incorrectly.
It only suppresses spaces before closing punctuation. Standalone $, %, or ( become $ 10, 10 %, and ( hello, and that output is propagated into both segment text and pipeline chunks.
Proposed fix
export function joinTimedWords(words) {
let text = '';
for (const word of words) {
const part = word.text ?? '';
if (!part) continue;
if (!text) {
text = part;
- } else if (/^[,.;:!?)}\]]+$/.test(part)) {
+ } else if (/^[,.;:!?%)}\]]+$/.test(part) || /[(\["'“‘$£€¥]$/.test(text)) {
text += part;
} else {
text += ` ${part}`;
}
}
return text;
}

export function decodeTransducerText(tokenizer, token_ids) {
  if (!Array.isArray(token_ids) || token_ids.length === 0) return '';
  if (!tokenizer) return token_ids.join(' ');
  return tokenizer.decode(token_ids, { skip_special_tokens: true }).trim();
Require a tokenizer for non-empty text decoding.
Returning token_ids.join(' ') here silently turns a successful decode into token ids instead of a transcription. It also makes text-only mode "work" while the detailed-output path correctly rejects the same missing-tokenizer case.
Proposed fix
export function decodeTransducerText(tokenizer, token_ids) {
if (!Array.isArray(token_ids) || token_ids.length === 0) return '';
- if (!tokenizer) return token_ids.join(' ');
+ if (!tokenizer) {
+ throw new Error('decodeTransducerText requires a tokenizer for non-empty token_ids.');
+ }
return tokenizer.decode(token_ids, { skip_special_tokens: true }).trim();
}
if (
  prev &&
  prevText === wordText &&
  (
    prevText.length > 0 ||
    normalizeRawMergedWordText(prev.text) === normalizeRawMergedWordText(word.text)
  ) &&
  word.startTime < prev.endTime
) {
  const prevDuration = prev.endTime - prev.startTime;
  const nextDuration = word.endTime - word.startTime;
  if (nextDuration > prevDuration) {
    merged[merged.length - 1] = word;
  }
Do not drop timing bounds when deduplicating overlapped words.
This replacement keeps only one copy’s startTime/endTime. For window-overlap duplicates, that can trim the surviving word and then move the restart cursor or final chunk timestamps forward. Keep the preferred payload, but union the time span when collapsing duplicates.
Proposed fix
if (
prev &&
prevText === wordText &&
(
prevText.length > 0 ||
normalizeRawMergedWordText(prev.text) === normalizeRawMergedWordText(word.text)
) &&
- word.startTime < prev.endTime
+ word.startTime <= prev.endTime + 1e-6
) {
- const prevDuration = prev.endTime - prev.startTime;
- const nextDuration = word.endTime - word.startTime;
- if (nextDuration > prevDuration) {
- merged[merged.length - 1] = word;
- }
+ const keep =
+ (word.endTime - word.startTime) > (prev.endTime - prev.startTime) ? word : prev;
+ merged[merged.length - 1] = {
+ ...keep,
+ startTime: Math.min(prev.startTime, word.startTime),
+ endTime: Math.max(prev.endTime, word.endTime),
+ };
continue;
}

function resolveTokenPiece(tokenizer, id) {
  const rawToken = getIdToTokenMap(tokenizer).get(id) ?? '';
  const decoded = tokenizer.decode([id], {
    skip_special_tokens: true,
    clean_up_tokenization_spaces: false,
  });

  const startsWithBoundaryMarker = /^(?:▁|Ġ)+/.test(rawToken);
  const startsWithWhitespace = /^\s+/.test(decoded);
  const startsNewWord = startsWithBoundaryMarker || startsWithWhitespace;

  let clean = decoded.replace(/^\s+/, '');
  if (!clean) {
    clean = rawToken.replace(/^(?:▁|Ġ|Ċ)+/, '').replace(/^ +/, '');
  }

  return { raw: rawToken || decoded, clean, startsNewWord };
}
🧹 Nitpick | 🔵 Trivial
Cache resolved token pieces, not just raw vocab text.
resolveTokenPiece still calls tokenizer.decode([id], ...) for every occurrence of a token id. On long transcripts that makes detailed-output generation pay the tokenizer cost repeatedly for identical ids.
Proposed refactor
const TOKEN_ID_TO_TEXT_CACHE = new WeakMap();
+const TOKEN_PIECE_CACHE = new WeakMap();
...
function resolveTokenPiece(tokenizer, id) {
+ let pieceCache = TOKEN_PIECE_CACHE.get(tokenizer);
+ if (!pieceCache) {
+ pieceCache = new Map();
+ TOKEN_PIECE_CACHE.set(tokenizer, pieceCache);
+ }
+ const cachedPiece = pieceCache.get(id);
+ if (cachedPiece) return cachedPiece;
+
const rawToken = getIdToTokenMap(tokenizer).get(id) ?? '';
const decoded = tokenizer.decode([id], {
skip_special_tokens: true,
clean_up_tokenization_spaces: false,
});
@@
- return { raw: rawToken || decoded, clean, startsNewWord };
+ const piece = { raw: rawToken || decoded, clean, startsNewWord };
+ pieceCache.set(id, piece);
+ return piece;
+}

['nemo-conformer-tdt', 'NemoConformerForTDT'],
['audio-spectrogram-transformer', 'ASTModel'],
Remove NemoConformerForTDT from MODEL_MAPPING_NAMES_ENCODER_ONLY.
The ASR pipeline already loads NemoConformerForTDT directly without using registry resolution. Adding it to the encoder-only mapping exposes this class through generic resolver paths it does not use. Since the model overrides from_pretrained with specialized session construction, it should not be discoverable via generic model-type routing.
Remove line 44-45 from packages/transformers/src/models/registry.js. The MODEL_TYPE_MAPPING entries established by direct registration in modeling_nemo_conformer_tdt.js (lines 953-954) are sufficient if get_model_files() requires a fallback for architecture-less configs.
it(
  "supports concatenated delta and delta-delta features",
  async () => {
    const extractor = new NemoConformerTDTFeatureExtractor({
      ...base,
      feature_size: 128,
      delta_order: 2,
      delta_window: 2,
      delta_concatenate: true,
    });
    const { input_features } = await extractor(audio);
    try {
      expect(input_features.dims[2]).toBe(128 * 3);
    } finally {
      input_features.dispose();
    }
  },
  MAX_TEST_EXECUTION_TIME,
Dispose attention_mask in the concatenated-delta tests.
Both blocks only keep input_features, but the extractor also returns an attention_mask. That leaves a caller-owned tensor live in the test path, which is the wrong pattern in a suite that is explicitly checking tensor lifetimes.
Proposed fix
- const { input_features } = await extractor(audio);
+ const { input_features, attention_mask } = await extractor(audio);
try {
expect(input_features.dims[2]).toBe(128 * 3);
} finally {
input_features.dispose();
+ attention_mask.dispose();
  }

- let input_features;
+ let input_features;
+ let attention_mask;
try {
- ({ input_features } = await extractor(audio));
+ ({ input_features, attention_mask } = await extractor(audio));
expect(input_features.dims[2]).toBe(80 * 2);
} finally {
Tensor.prototype.dispose = originalDispose;
input_features?.dispose();
+ attention_mask?.dispose();
}

Also applies to: 101-130
it(
  "computes delta and delta-delta features",
  async () => {
    const input = new Tensor(
      "float32",
      Float32Array.from([
        // T=4, F=2
        1, 2, 2, 4, 3, 6, 4, 8,
      ]),
      [1, 4, 2],
    );

    const split = computeTemporalDeltas(input, { order: 2, window: 1, concatenate: false });
    expect(split.delta.dims).toEqual([1, 4, 2]);
    expect(split.delta_delta.dims).toEqual([1, 4, 2]);

    const concatOrder1 = computeTemporalDeltas(input, { order: 1, window: 1, concatenate: true });
    expect(concatOrder1.dims).toEqual([1, 4, 4]);
    expect(Array.from(concatOrder1.data.slice(0, 8))).toEqual([
      1, 2, 0.5, 1, // t0: base + delta
      2, 4, 1, 2, // t1: base + delta
    ]);

    const concat = computeTemporalDeltas(input, { order: 2, window: 1, concatenate: true });
    expect(concat.dims).toEqual([1, 4, 6]);
  },
  MAX_TEST_EXECUTION_TIME,
Dispose the tensors allocated by the delta-helper tests.
These two cases leave several caller-owned tensors live after the assertions. In the disposal-counting case, restore Tensor.prototype.dispose first, then release the test tensors in finally so the count stays stable.
Also applies to: 928-948
Summary
This PR adds NeMo Conformer TDT ASR support to transformers.js, including model execution, feature extraction, decoding, reconstruction, pipeline wiring, registry integration, and NeMo-specific regression coverage. The NeMo pipeline is aligned with the shared automatic-speech-recognition task contract, while richer direct model.transcribe() outputs remain available for lower-level use. This branch supersedes my earlier fork-side iteration and is rebased onto the current upstream/main.

What is included
1. Model and decoder
model.transcribe() support for text, timestamps, confidences, optional words and tokens, and optional metrics/debug payloads.

2. Feature extraction

3. ASR pipeline integration
AutomaticSpeechRecognitionPipeline dispatch:
- Default: { text }
- return_timestamps: true: { text, chunks } with sentence-like finalized chunks
- return_timestamps: 'word': { text, chunks } with word-level timestamps
- Richer outputs remain available via direct model.transcribe().

4. Long-audio handling
chunk_length_s is used as the NeMo window-size override in pipeline mode.

5. Word reconstruction and timestamp grouping
Reconstructs pieces such as score., 48-year-old with 0.5, and March 20th, 2021.

6. Registry and model file resolution
Resolves encoder_model and decoder_model_merged.

7. Follow-up review fixes included in this branch
- encoder_input_layout for canonical input_features feeds.
- vocab_size derived from the maximum tokenizer id so sparse vocabularies do not undersize decoder logits.

Regression coverage
Added or updated tests in:
- packages/transformers/tests/models/nemo_conformer_tdt/test_modeling_nemo_conformer_tdt.js
- packages/transformers/tests/models/nemo_conformer_tdt/test_feature_extraction_nemo_conformer_tdt.js
- packages/transformers/tests/pipelines/test_pipelines_automatic_speech_recognition.js
- packages/transformers/tests/pipelines/test_pipelines_nemo_conformer_tdt.js

Coverage includes:
Upstream sync included
This branch was synced with upstream/main through commit f65a4c7c via merge commit 49a4af8f.

Included upstream commits in that sync:
- 2120d13e: [deno] Support both wgpu and dawn webgpu backends (#1546)
- a289e5c3: Add support for new Qwen VL models (Qwen2.5-VL, Qwen3-VL, Qwen3.5, and Qwen3.5 MoE) (#1551)
- b5b51ca9: [version] Update to 4.0.0-next.5
- 2a210d3e: [deno via CDN] Fix simultaneous multi-session loading (e.g. VLMs) and support usage for image loading (#1556)
- e60a6ee3: Use ModelRegistry for pipeline file loading (#1555)
- 4331d723: Support PKV cached generation for Qwen-VL models (#1557)
- cd155a05: fix: prevent partial file reads during concurrent downloads (#1548)
- 30773fb7: Fix WASM factory blob URL loading (#1558)
- f65a4c7c: feat: add fast boolean is_cached / is_pipeline_cached (#1559)

NeMo adaptation after upstream sync
Relevant branch commits after that sync include:
- ee819a1c: fix(nemo-tdt): add supports() for ASR model class selection
- a85dff25: fix(nemo-tdt): address PR #12 reviewer feedback
- 8dfccddc: feat(nemo-tdt): align asr pipeline outputs and long-audio handling
- 816f5811: chore(tests): drop unrelated parakeet feature extractor coverage
- f59ba068: feat(nemo-conformer-tdt): add sentence-based ASR pipeline chunking
- 00b3d934: fix(nemo): scope ASR tests and address review fixes
- 07118c38: fix(nemo-tdt): address follow-up review threads
- 341df3d7: chore(asr): restore upstream cast spacing
- 29f2baaf: fix(nemo-tdt): handle sparse vocab and merge dedupe

Validation
Executed for this refresh:
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/models.test.js -t "nemo_conformer_tdt"
- node --experimental-vm-modules --expose-gc node_modules/jest/bin/jest.js --config jest.config.mjs --runInBand tests/pipelines.test.js -t "Nemo Conformer TDT pipeline adapter|Automatic Speech Recognition"
- pnpm build
- pnpm format:check on files changed by this branch: OK

Note: repo-wide pnpm format:check currently reports unrelated formatting issues outside this branch, but the files touched by this PR pass Prettier checks.

A prebuilt demo for this branch is available at https://ysdede.github.io/tdt-webgpu-demo/ for interactive testing of the current behavior, parameters, outputs, and example usage.
Scope boundary
This PR stays focused on NeMo Conformer TDT integration and the follow-up work needed to:
Direct model.transcribe() remains the low-level API for advanced app-specific postprocessing.

Architectural Review: NeMo Conformer TDT ASR Integration
Risk Level: HIGH
Critical Issues
[Critical] Memory Leak: Undisposed Frame Tensors in Decoding Loop
File: modeling_nemo_conformer_tdt.js (lines ~450-500)

Problem:
In the transcribe() method's decoding loop, frameTensor is created on each iteration but never disposed. The loop allocates a new Tensor per frame without cleanup, accumulating memory pressure linearly with audio duration.

Impact:
For a 1-hour audio file with 16 kHz sampling and 10 ms frame shifts (~360k frames), this allocates ~360k Tensors that never release their underlying memory. This will cause OOM crashes on long audio and fails the project's disposal-hardening goal stated in the PR objectives.

Fix:
Add disposal in the decoding loop's finally block or immediately after decoder runs:
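A minimal sketch of the per-iteration disposal pattern the fix describes; `decodeFrames`, `makeFrameTensor`, and `runDecoderStep` are hypothetical stand-ins for the real model internals, not the repository's actual code:

```javascript
// Hypothetical sketch: dispose each frame tensor even when a decode step throws,
// so memory stays O(1) in the number of frames instead of O(n).
function decodeFrames(frames, runDecoderStep, makeFrameTensor) {
  const out = [];
  for (const frame of frames) {
    const frameTensor = makeFrameTensor(frame);
    try {
      out.push(runDecoderStep(frameTensor));
    } finally {
      // Release the frame's buffer before the next iteration.
      frameTensor.dispose();
    }
  }
  return out;
}
```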
[Critical] Incomplete Decoder Output Tensor Disposal
File: modeling_nemo_conformer_tdt.js (lines ~460-480)

Problem:
The code attempts to track and dispose decoder output tensors via seenDecoderTensors, but the logits tensor is disposed while intermediate state tensors may leak: if decoderOutput contains additional Tensor values not enumerated in the loop condition, they remain undisposed. The set-based deduplication doesn't guarantee all tensors are cleaned.

Impact:
Accumulation of state tensors in long audio processing, eventually causing OOM.

Fix:
Explicitly dispose all three state tensors or refactor to track all decoder outputs:
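One hedged way to make the cleanup exhaustive is to dispose everything a decoder step returned rather than an enumerated subset; `disposeDecoderOutputs` and its duck-typed `dispose` check are illustrative, not the repository's actual helper:

```javascript
// Hypothetical sketch: walk every value the decoder session returned and
// dispose anything tensor-like, except outputs explicitly kept by the caller.
function disposeDecoderOutputs(decoderOutput, keep = new Set()) {
  for (const value of Object.values(decoderOutput)) {
    // Duck-type tensors by their dispose method; skip anything in `keep`.
    if (value && typeof value.dispose === 'function' && !keep.has(value)) {
      value.dispose();
    }
  }
}
```

Enumerating by value rather than by name means newly added decoder outputs cannot silently leak.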
[High] Frame Reuse Logic Introduces Non-Deterministic State
File: modeling_nemo_conformer_tdt.js (lines ~335-350)

Problem:
The reusableFrame optimization reuses a Float32Array for BDT layout but creates a new one for BTD layout on every iteration. If _getFrameData returns the previous iteration's reusableFrame and that data is modified by the encoder/decoder session (unlikely but possible with ONNX backends), the decoder receives corrupted data. More critically, the array reference becomes ambiguous about ownership: is it still safe to reuse if the session holds a reference?

Impact:
Potential silent data corruption in long audio, producing incorrect transcriptions without error signals.

Fix:
Either always copy the frame data or document strict ownership semantics with the session layer. Remove the reuse optimization unless profiling proves a measurable benefit.
High-Severity Issues
[High] Numerical Stability in Confidence Calculation
File: modeling_nemo_conformer_tdt.js (lines ~48-70)

Problem:
The confidenceFromLogits() function computes softmax with log-sum-exp but does not validate input bounds. If logits contains extreme values (e.g., all logits < -700, causing exp() underflow to 0), then logSumExp becomes -Infinity, producing NaN confidence values that corrupt metrics and debug output.

Impact:
Silent production of NaN/Infinity in confidence scores; downstream code may accept these invalid values, producing unreliable transcription metrics.

Fix:
Add guards and use stable softmax:
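A sketch of the max-subtraction softmax form the fix suggests; `stableConfidence` is a hypothetical name, and the real `confidenceFromLogits` may differ:

```javascript
// Hypothetical sketch of a numerically stable argmax-probability:
// subtract the max logit before exponentiating so exp() cannot overflow,
// and return null instead of NaN when the logits are degenerate.
function stableConfidence(logits) {
  let max = -Infinity;
  for (const v of logits) if (v > max) max = v;
  if (!Number.isFinite(max)) return null; // all -Infinity / NaN input
  let sum = 0;
  for (const v of logits) sum += Math.exp(v - max);
  const conf = 1 / sum; // probability of the argmax token: exp(0) / sum
  return Number.isFinite(conf) ? conf : null;
}
```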
[High] Timestamp Fallback Logic Produces Invalid Ranges
File: modeling_nemo_conformer_tdt.js (lines ~495-510)

Problem:
When token timestamps are empty, the code computes a fallback utterance timestamp. If frameTime (computed as subsampling_factor * frame_shift_s) is not finite or is zero, roundTs() receives Infinity/NaN and produces invalid output. Additionally, the RTF calculation uses Math.max(utteranceDuration, 1e-8) as a guard against division by zero, but if both utterance timestamps are roundTs(timeOffset), the duration becomes 0 before the guard is applied.

Impact:
Invalid timestamp ranges (start >= end) propagate to pipeline output, breaking downstream segment processing and potentially causing index-out-of-bounds in audio windowing logic.

Fix:
Validate frameTime and fallback timestamp computation:
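A hedged sketch of the suggested validation; `fallbackUtteranceSpan` is an illustrative helper, not the code under review:

```javascript
// Hypothetical sketch: validate frame timing before computing the fallback
// span, so the result never contains Infinity/NaN or start > end.
function fallbackUtteranceSpan(timeOffset, frameCount, frameTime) {
  if (!Number.isFinite(frameTime) || frameTime <= 0) {
    // Without valid frame timing we cannot place an end time;
    // return a zero-length span at the offset and let callers flag it.
    return [timeOffset, timeOffset];
  }
  const start = timeOffset;
  const end = timeOffset + frameCount * frameTime;
  return end > start ? [start, end] : [start, start];
}
```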
[High] Word-Timestamp Alignment Fragility
File: transducer_word_offsets.js (lines ~50-80)

Problem:
The consumeAlignedTokenText() function uses fullText.indexOf() as a fallback when exact token matching fails. This heuristic can silently match the wrong token occurrence if the decoded text contains repeated substrings (e.g., "the the": the first token "the" matches the first "the" and the second matches the second, but what if tokenization differs?). There is no validation that the matched text is actually a valid token boundary.

Impact:
Word timestamps become misaligned with actual token positions, corrupting word-level output timestamps. The user receives confident but incorrect word timings.

Fix:
Log skipped tokens and add validation; optionally use a more sophisticated alignment algorithm (e.g., longest common subsequence or Levenshtein-based matching for robustness).
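A minimal cursor-based variant of the `indexOf` fallback that cannot match backwards and reports unaligned tokens instead of guessing; `alignTokens` is illustrative only:

```javascript
// Hypothetical sketch: search for each token only at or after a moving cursor,
// so repeated substrings ("the the") cannot re-match an earlier occurrence,
// and collect tokens that fail to align for logging/validation.
function alignTokens(fullText, tokenTexts) {
  const spans = [];
  const skipped = [];
  let cursor = 0;
  for (const tok of tokenTexts) {
    const idx = fullText.indexOf(tok, cursor);
    if (idx === -1) {
      skipped.push(tok);
      continue;
    }
    spans.push([idx, idx + tok.length]);
    cursor = idx + tok.length; // never look backwards again
  }
  return { spans, skipped };
}
```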
[High] Audio Windowing Window Boundary Precision Loss
File: pipeline_nemo_conformer_tdt.js (lines ~100-130)

Problem:
Audio is sliced using floating-point sample indices computed from time values. For a 16 kHz audio file with chunk_length_s = 90, this can cause gaps or overlaps in coverage. More critically, the cursor relocation logic (relocateNemoCursorToNearbyGap) operates on timestamps rounded to 3 decimals (roundTs), while the actual audio sample positions use floor/ceil, creating misalignment between logical window boundaries and actual sample positions.

Impact:
Long-audio transcription produces gaps in coverage or duplicate segments due to window boundary misalignment.

Fix:
Maintain sample-based cursor state (integer indices) and convert to timestamps for external APIs only. Verify that consecutive windows have exact coverage without gaps.
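A sketch of the integer-sample cursor the fix describes, deriving timestamps from sample indices only at the boundary; `sampleWindows` is a hypothetical helper:

```javascript
// Hypothetical sketch: keep the window cursor in integer samples and derive
// timestamps from it, instead of round-tripping through floating-point seconds.
// Consecutive windows share an exact boundary, so coverage has no gaps.
function* sampleWindows(totalSamples, samplingRate, chunkLengthS) {
  const windowSamples = Math.round(chunkLengthS * samplingRate);
  for (let start = 0; start < totalSamples; start += windowSamples) {
    const end = Math.min(start + windowSamples, totalSamples);
    yield {
      start,
      end,
      startTime: start / samplingRate, // timestamps for external APIs only
      endTime: end / samplingRate,
    };
  }
}
```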
[High] Missing Null/Undefined Guards in Confidence Aggregation
File: modeling_nemo_conformer_tdt.js (lines ~505-520)

Problem:
Confidence aggregation assumes arrays are non-empty but doesn't validate their contents. If tokenConfidences contains null/undefined values (from confidenceFromLogits returning NaN), the reduce produces NaN, which roundMetric() converts to 0. This masks bugs: NaN inputs are silently replaced with 0, making it appear that all tokens had zero confidence.

Impact:
Incorrect confidence metrics in transcription results; the user cannot detect when confidence calculation failed.

Fix:
Filter out non-finite confidence values before aggregation; log a warning:
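A hedged sketch of the filtering step; `aggregateConfidence` is an illustrative name, not the repository's API:

```javascript
// Hypothetical sketch: drop non-finite per-token confidences before averaging,
// and surface null plus a count of dropped values instead of masking NaN as 0.
function aggregateConfidence(tokenConfidences) {
  const valid = tokenConfidences.filter((c) => Number.isFinite(c));
  const dropped = tokenConfidences.length - valid.length;
  if (valid.length === 0) return { average: null, dropped };
  const average = valid.reduce((a, b) => a + b, 0) / valid.length;
  return { average, dropped };
}
```

Callers can log a warning whenever `dropped > 0`, making failed confidence computations visible.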
Medium-Severity Issues
[Medium] Feature Cache Ownership Semantics Unclear
File: feature_extraction_nemo_conformer_tdt.js + transducer_cache.js

Problem:
The cache uses symbols (NEMO_FEATURE_OUTPUT_OWNERSHIP, NEMO_FEATURE_OUTPUT_RELEASE) to signal ownership, but:
- The release path checks inputs?.[NEMO_FEATURE_OUTPUT_RELEASE], yet there is no guarantee the feature extractor actually attaches this symbol.
- set() returns false and the caller retains ownership, but the code doesn't document this contract clearly.

Impact:
Developers may forget to dispose processor outputs if the cache is disabled, or the cache may not properly release tensors if the release callback isn't invoked by downstream code.

Fix:
Document ownership transfer explicitly; consider returning a wrapper object with guaranteed lifecycle management.
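A sketch of the suggested wrapper, assuming a cache exposing `acquire`/`release`; `borrow` is a hypothetical helper, not the repository's API:

```javascript
// Hypothetical sketch: wrap cache hits in a handle whose release() is the only
// way to end the borrow, making ownership transfer explicit at the call site.
function borrow(cache, key) {
  const value = cache.acquire(key);
  if (value == null) return null;
  let released = false;
  return {
    value,
    release() {
      if (released) return; // idempotent: double-release is a no-op
      released = true;
      cache.release(key);
    },
  };
}
```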
[Medium] Cache Eviction May Discard In-Use Entries
File: transducer_cache.js (lines ~170-190)

Problem:
The LRU eviction logic can evict entries that have pending borrows. If a new call to acquire() comes while pendingDispose is true, the entry is not in the cache, so acquire returns null, but a borrowed reference may still exist elsewhere. This creates ambiguity: is the caller responsible for disposal of a borrowed tensor that the cache says doesn't exist?

Impact:
Potential double-disposal or use-after-free if a caller tries to dispose a tensor that the cache has already marked for pending disposal.

Fix:
Prevent eviction of borrowed entries or document strict ordering: "Once acquired, caller must release before cache can evict."
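A minimal sketch of an eviction policy that skips borrowed entries, letting the cache temporarily exceed capacity instead of invalidating a live borrow; `BorrowAwareLRU` is illustrative, not the repository's `FeatureLRUCache`:

```javascript
// Hypothetical sketch: an LRU map that refuses to evict entries with live
// borrows. acquire() increments a ref count; release() decrements it.
class BorrowAwareLRU {
  constructor(maxSize) {
    this.maxSize = maxSize;
    this.map = new Map(); // key -> { value, borrows }, insertion-ordered
  }
  acquire(key) {
    const e = this.map.get(key);
    if (!e) return null;
    e.borrows++;
    return e.value;
  }
  release(key) {
    const e = this.map.get(key);
    if (e && e.borrows > 0) e.borrows--;
  }
  set(key, value) {
    this.map.set(key, { value, borrows: 0 });
    // Evict oldest unborrowed entries; borrowed entries are skipped, so the
    // cache may temporarily exceed maxSize rather than free a live tensor.
    for (const [k, e] of this.map) {
      if (this.map.size <= this.maxSize) break;
      if (e.borrows === 0 && k !== key) this.map.delete(k);
    }
  }
}
```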
[Medium] No Validation of Decoder IO Names at Construction
File: modeling_nemo_conformer_tdt.js (lines ~160-200)

Problem:
The resolveTransducerConfig() function is called in the constructor but doesn't validate that the session's actual input/output names match the configured IO mapping. A mismatch is only detected at runtime during the first _runDecoder() call, after initialization succeeds.

Impact:
Misconfiguration silently passes initialization and only fails during inference, wasting resources.

Fix:
Validate decoder session names in resolveTransducerConfig():

Low-Severity Issues
[Low] Hardcoded Magic Numbers in Pipeline
File: pipeline_nemo_conformer_tdt.js (lines ~1-15)

Constants like NEMO_AUTO_WINDOW_THRESHOLD_S = 180 and NEMO_AUTO_CHUNK_LENGTH_S = 90 are hardcoded; they should be exposed as configurable options or documented with clear comments explaining why these specific values were chosen.

[Low] Test Coverage Gap: Error Paths
Tests for from_pretrained() with missing encoder/decoder files are not visible in the provided test output. This is a critical error path that should be validated.

Merge Decision: NEEDS CHANGES
This PR introduces significant functionality but has critical memory leak issues and numerical stability problems that will cause production failures:
Required Actions Before Merge:
Acceptable with follow-ups: No. Memory leaks are unacceptable in production code, especially in a framework handling user-provided audio.