fix: tile subsampling for long audio to avoid ggml 2^31 tensor overflow on GPU #19

Merged
mudler merged 7 commits into master from worktree-long-audio-tiling on Jun 7, 2026

@localai-bot (Collaborator) commented Jun 7, 2026

Problem

Transcribing audio longer than ~44 min on GPU crashed. There were two distinct CUDA limits in the long-audio path, both invisible to CPU tests (CPU has neither limit):

  1. Subsampling relu: int32 element overflow. The first subsampling conv produces a tensor of (n_mels/2)·(T/2)·conv_channels elements. For tdt-0.6b-v3 (n_mels=128, conv_channels=256), a 51-min clip yields 2,521,890,816 elements, which exceeds INT_MAX (2,147,483,647). ggml's CUDA unary (relu) kernel indexes elements with int, so the count wraps negative → invalid configuration argument in ggml_cuda_op_relu. (PyTorch hits the same 2³¹ wall, see pytorch/pytorch#80020, "canUse32BitIndexMath not working properly in Conv2D layer"; NeMo chunks the subsampling conv via subsampling_conv_chunking_factor.)
  2. Banded attention pad: gridDim.y cap. Once past (1), the encoder's banded local attention over-pads K/V to a contiguous axis of length Lk = (C+P-1)·ceil(T'/C) ≈ 77k for T'=38,481. ggml's CUDA pad kernel maps ne1 straight to gridDim.y, which CUDA caps at 65535 → PAD failed / invalid argument.
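The arithmetic behind both limits can be checked with a short sketch. The function names below are hypothetical and are not taken from the repository; only the two formulas come from the description above. The frame count T = 307,848 used in the usage note is back-derived from the quoted element count, not stated in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Element count of the first subsampling conv output, computed in 64-bit.
// Formula from the PR text: (n_mels/2) * (T/2) * conv_channels.
std::int64_t subsampling_elements(std::int64_t n_mels, std::int64_t T,
                                  std::int64_t conv_channels) {
    assert(n_mels > 0 && T >= 0 && conv_channels > 0);
    return (n_mels / 2) * (T / 2) * conv_channels;
}

// True when the count no longer fits in a signed 32-bit element index.
bool overflows_int32(std::int64_t n) {
    return n > std::numeric_limits<std::int32_t>::max();
}

// Padded K/V length from the PR text: Lk = (C + P - 1) * ceil(T' / C).
// C and P are left as parameters; their real values are not given in the PR.
std::int64_t banded_pad_length(std::int64_t C, std::int64_t P,
                               std::int64_t T_sub) {
    assert(C > 0 && P > 0 && T_sub >= 0);
    return (C + P - 1) * ((T_sub + C - 1) / C);
}
```

With n_mels=128 and conv_channels=256, `subsampling_elements(128, 307848, 256)` reproduces 2,521,890,816, and the count first exceeds INT_MAX at T = 262,144 mel frames. Assuming a 10 ms mel hop (not stated in the PR), that is about 43.7 min, in line with the ~44 min threshold. For the pad, illustrative values C = P = 128 (an assumption) give `banded_pad_length(128, 128, 38481)` = 76,755, which is consistent with the quoted ≈ 77k and above the 65535 cap.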

Fix

(1) Tile the subsampling stage over time (Subsampling::forward_tiled, Encoder::forward_batch_tiled) so no conv tensor exceeds 2³¹, then run the unchanged conformer stack on the full sequence. The subsampler's receptive field is ±7 mel frames, so tiling with an 8-frame halo is bit-exact on interior frames. Done in our code (no ggml change) — covers CPU/Metal too and bounds the ~10 GB activation spike. Model::transcribe_* route long audio to the tiled path above a model-derived threshold (safe_mel_window) via one subsampling_tile_for helper; both batched and single-clip (CLI / transcribe_path) paths are wired. PARAKEET_SUBSAMPLING_TILE=<frames> forces it (testing).
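A minimal sketch of how time-axis tiles with a halo could be planned. This is not the code of Subsampling::forward_tiled; the `Tile` struct and `plan_tiles` are hypothetical names, and the sketch only covers the index bookkeeping. It assumes the caller picks a tile size that is a multiple of the subsampler's total time stride, so that each core maps to a whole number of output frames.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One tile: the subsampler reads mel frames [in_begin, in_end), and only the
// outputs that correspond to the core [core_begin, core_end) are kept.
struct Tile {
    std::int64_t in_begin, in_end;
    std::int64_t core_begin, core_end;
};

// Split T mel frames into cores of `tile` frames, each widened by `halo`
// frames on both sides and clamped to the clip boundaries.
std::vector<Tile> plan_tiles(std::int64_t T, std::int64_t tile,
                             std::int64_t halo) {
    assert(T >= 0 && tile > 0 && halo >= 0);
    std::vector<Tile> tiles;
    for (std::int64_t c0 = 0; c0 < T; c0 += tile) {
        const std::int64_t c1 = std::min(c0 + tile, T);
        tiles.push_back({std::max<std::int64_t>(c0 - halo, 0),
                         std::min(c1 + halo, T), c0, c1});
    }
    return tiles;
}
```

With a ±7-frame receptive field, a halo of 8 gives every core frame its full input context, which is why interior frames can match the untiled result exactly. The first and last tiles are clamped to the clip, so they see the same boundary as the untiled path.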

(2) Grid-stride the ggml-cuda pad kernel (third_party/ggml-patches/0004-cuda-pad-grid-stride.patch) so it handles ne1/ne2·ne3 > 65535. A kernel audit confirmed pad is the only op in the long-audio encoder that routes a large dim through the capped gridDim.y/z (softmax/norm use gridDim.x; add/mul auto-fall-back to a flattened x; im2col already grid-strides; cpy/scale/concat are x-only/int64). The fix is perf-neutral (when a dim ≤ 65535 the stride loop runs exactly once → identical launch geometry) and general — it lifts the ceiling to ~23 h (next limit is an unrelated bin_bcast int32 index). This is what PyTorch already does, and it's upstreamable.
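The grid-stride idea can be illustrated with a host-side simulation in plain C++. This is not the ggml patch and not CUDA code: the outer loop stands in for the blocks along gridDim.y, the inner loop is the stride loop each block would run, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::int64_t kMaxGridY = 65535;  // CUDA limit on gridDim.y

// Simulates a launch with gridDim.y clamped to the cap and returns how many
// times each row i1 in [0, ne1) is visited.
std::vector<int> simulate_grid_stride_rows(std::int64_t ne1) {
    assert(ne1 >= 0);
    std::vector<int> visits(static_cast<std::size_t>(ne1), 0);
    const std::int64_t grid_y = std::min(ne1, kMaxGridY);
    for (std::int64_t block_y = 0; block_y < grid_y; ++block_y) {
        // Grid-stride loop: each "block" handles rows block_y, block_y + grid_y, ...
        for (std::int64_t i1 = block_y; i1 < ne1; i1 += grid_y) {
            ++visits[static_cast<std::size_t>(i1)];
        }
    }
    return visits;
}

// Stride-loop iterations performed by the busiest block.
std::int64_t max_iterations_per_block(std::int64_t ne1) {
    assert(ne1 >= 0);
    const std::int64_t grid_y = std::min(ne1, kMaxGridY);
    return grid_y == 0 ? 0 : (ne1 + grid_y - 1) / grid_y;
}
```

Every row is visited exactly once regardless of ne1, and for ne1 ≤ 65535 the clamp is a no-op, so each block runs a single iteration. That matches the perf-neutral claim: below the cap the launch geometry is the same as without the loop.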

Validation

  • test_subsampling_tiling: forward_tiled vs forward. Single-tile is bit-exact; multi-tile worst per-frame rel ~1.8e-5.
  • test_encoder_long: forward_batch_tiled vs forward_batch. Injection layout verified (large-tile worst rel 3.4e-3).
  • test_transcribe_tiled: full pipeline, fused vs forced-tiled, identical non-empty transcripts on batched and single-clip paths.
  • No GPU regression (A/B of the pad grid-stride, patched vs reverted libggml-cuda.so, GB10): 20-min banded clip +0.01%, 60-s clip +0.53% (both within run-to-run noise), proc_ms transcribe-only, output byte-identical. Re-verified end-to-end on the rebased PR head with a 52-min synthetic clip (exit 0, no CUDA error). See comment for the table.
  • GPU end-to-end on dgx.casa (GB10, CUDA 13, sm_121), real tdt-0.6b-v3, the 51-min file: PASS. Was a crash on master; now transcribes in ~16 s (≈192× realtime) to a complete 9,196-word transcript, exit 0, no CUDA error.
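The "worst per-frame rel" figures above can be read as a maximum over frames of a relative error. The PR does not define the metric, so the sketch below assumes a per-frame relative L2 error against the untiled reference; the function name and the [frames][dim] row-major layout are also assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Worst per-frame relative L2 error between `test` and `ref`, both laid out
// as [frames][dim] in row-major order. Assumed metric, not the test's own.
double worst_frame_rel_error(const std::vector<float>& test,
                             const std::vector<float>& ref, std::size_t dim) {
    assert(dim > 0 && test.size() == ref.size() && ref.size() % dim == 0);
    double worst = 0.0;
    for (std::size_t f = 0; f < ref.size() / dim; ++f) {
        double diff2 = 0.0, ref2 = 0.0;
        for (std::size_t d = 0; d < dim; ++d) {
            const double r = ref[f * dim + d];
            const double e = test[f * dim + d] - r;
            diff2 += e * e;
            ref2 += r * r;
        }
        // Guard against division by zero on all-zero reference frames.
        const double rel = std::sqrt(diff2) / std::max(std::sqrt(ref2), 1e-12);
        worst = std::max(worst, rel);
    }
    return worst;
}
```

Under this definition, a bit-exact result returns exactly 0, which is what the single-tile case reports.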

🤖 Generated with Claude Code

@mudler force-pushed the worktree-long-audio-tiling branch from 0cc3194 to 236c688 on June 7, 2026 21:11
@localai-bot (Collaborator, Author) commented:

Post-rebase GPU re-verification + perf regression check

After rebasing onto current master, re-verified on dgx.casa (GB10, CUDA 13, sm_121) at the PR head:

1. End-to-end re-run on the rebased commit (the private 51-min repro file was deleted for privacy, so used a ~52-min synthetic clip = the speech.wav fixture looped, which is longer than the original and triggers both fixes):

  • parakeet-cli transcribe exit 0, no CUDA error, ~16 s, full 9,266-word transcript. The earlier 51-min run and this 52-min run on the rebased head both pass.

2. Performance regression A/B — the only change on the common (short/normal) execution path is the pad grid-stride. A/B'd it by swapping libggml-cuda.so (patched vs the kernel reverted) on the same parakeet-cli, 3 reps each, GB10, proc_ms = transcribe-only (the bench warms up first):

| Clip | Audio | Patched | Unpatched | Delta | Patched RTFx |
|---|---|---|---|---|---|
| 20 min (banded local-attn, where pad runs hardest) | 1197 s | 5404.5 ms | 5404.1 ms | +0.01% | 221x |
| 60 s (full attention) | 59.5 s | 263.5 ms | 262.1 ms | +0.53% | 226x |

Both deltas are within run-to-run noise (rep spreads overlap), and text_identical=True for both clips — the grid-stride pad is perf-neutral and output-identical, as the kernel audit predicted (when a dim <= 65535 the stride loop runs exactly once). The subsampling-tiling change only engages above ~30 min, where the alternative was a crash, so there is no short-audio path to regress there.
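The table's derived columns can be recomputed from the raw timings. This sketch assumes RTFx is audio seconds divided by processing seconds and that delta is taken relative to the unpatched time; neither definition is spelled out in the comment, and the function names are hypothetical.

```cpp
#include <cassert>

// Real-time factor: seconds of audio processed per second of compute.
double rtfx(double audio_s, double proc_ms) {
    assert(proc_ms > 0.0);
    return audio_s / (proc_ms / 1000.0);
}

// Relative change of the patched time versus the unpatched time, in percent.
double delta_percent(double patched_ms, double unpatched_ms) {
    assert(unpatched_ms > 0.0);
    return 100.0 * (patched_ms - unpatched_ms) / unpatched_ms;
}
```

Under these assumptions the 20-min row gives roughly 221x and +0.01%, and the 60-s row roughly 226x and +0.53%, matching the reported values after rounding.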

Conclusion: no performance or accuracy regression; long audio that crashed on master now transcribes.

@mudler mudler merged commit 96b81bb into master Jun 7, 2026
7 checks passed