Overdue Whisper fixes by xenova · Pull Request #1594 · huggingface/transformers.js

xenova · 2026-03-18T02:45:06Z

Did a deep dive recently to uncover and fix issues with Whisper models.

Closes Whisper model word-level timestamps broken #551
Closes Inaccurate Word Timestamps in ASR Transcription #805
Closes whisper-large-v3-turbo_timestamped has broken timestamps #1357 (tested this one specifically and output is much better than described in the issue).
Closes whisper-base_timestamped broken with chunk_length_s=30 #1358
Closes Regression in v4 automatic-speech-recognition #1590

Also add spectrogram unit tests

Closes Missing unit tests and mismatched output in audio_utils (e.g., spectrogram, window_function) vs Python tests #1387

HuggingFaceDocBuilderDev · 2026-03-18T02:47:19Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

xenova · 2026-03-18T03:20:22Z

@jozefchutka would appreciate if you could check this out! 🙏 We'll put out a new release tomorrow, aiming to fit this in.

jozefchutka · 2026-03-18T06:31:50Z

I tested this branch with:

Regression in v4 automatic-speech-recognition #1590 - no difference to 4.0.0-next.7 - not fixed
whisper-large-v3-turbo_timestamped has broken timestamps #1357 - output is different but still broken (notice repeated timestamps 40sec, 41, 42, 40, 41...)

...
60: "39.7 -> 40.1  having"
61: "40.1 -> 40.26  nightmares"
62: "40.26 -> 40.32  that"
63: "40.32 -> 40.52  I'm"
64: "40.52 -> 40.82  being"
65: "40.82 -> 40.98  chased"
66: "40.98 -> 41.1  by"
67: "41.1 -> 41.44  these"
68: "41.44 -> 42.02  giant"
69: "42.02 -> 42.38  robotic"
70: "42.38 -> 42.4  claws."
71: "42.42 -> 42.44  Oh,"
72: "42.84 -> 43.12  whatever,"
73: "43.32 -> 43.4  Tom."
74: "44.44 -> 44.88  We're"
75: "44.88 -> 45.16  done."
76: "40.24 -> 40.34  that"
77: "40.34 -> 40.52  I'm"
78: "40.52 -> 40.8  being"
79: "40.8 -> 40.98  chased"
80: "40.98 -> 41.1  by"
81: "41.1 -> 41.44  this"
82: "41.44 -> 42  giant"
83: "42 -> 42.34  robotic"
...
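The repetition reported above (timestamps returning to 40 s after reaching 45 s) can be spotted automatically. The following is a minimal sketch, assuming the `start -> end  text` line format used in this thread; `parseWordLine` and `findBackwardJumps` are hypothetical helpers for illustration and are not part of transformers.js.

```javascript
// Hypothetical helpers for illustration only (not part of transformers.js).

// Parses a line such as "40.26 -> 40.32  that" into { start, end, text }.
function parseWordLine(line) {
  const match = /^\s*([\d.]+)\s*->\s*([\d.]+)\s+(.*)$/.exec(line);
  if (!match) throw new Error(`Unrecognized line: ${line}`);
  return { start: Number(match[1]), end: Number(match[2]), text: match[3] };
}

// Returns the indices of words that start before the previous word ended.
// `tolerance` (in seconds) allows for small overlaps between neighbours.
function findBackwardJumps(words, tolerance = 0) {
  const jumps = [];
  for (let i = 1; i < words.length; ++i) {
    if (words[i].start + tolerance < words[i - 1].end) jumps.push(i);
  }
  return jumps;
}

// Four lines taken from the output above, around the point of repetition.
const lines = [
  "44.44 -> 44.88  We're",
  "44.88 -> 45.16  done.",
  "40.24 -> 40.34  that",
  "40.34 -> 40.52  I'm",
];
const jumps = findBackwardJumps(lines.map(parseWordLine));
console.log(jumps); // [ 2 ]
```

A check like this could serve as a cheap assertion in an integration test, since it does not depend on the exact transcription text.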

whisper-base_timestamped broken with chunk_length_s=30 #1358 - output is better than with 3.8.1, but it ends at the 26th second while the speech continues to 30. Using chunk_length_s: 29 gives better results.

...
65:"25.52 -> 25.68  I"
66:"25.68 -> 25.84  said,"
67:"25.9 -> 26.34  we're"
68:"26.34 -> 26.58  going"
69:"26.58 -> 26.8  through"
70:"26.82 -> 27.04  that."
END

Please consider adding these three cases to the integration tests. I provided minimal repro code and a .pcm file in the first post of each issue.

xenova · 2026-03-18T15:41:02Z

Thanks so much for testing! Strangely, I tested your exact inputs from those issues and got much better results... Let me revisit.

xenova · 2026-03-18T18:04:46Z

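The Python version can also run into some weird issues, like

from transformers import pipeline
import numpy as np

pipe = pipeline("automatic-speech-recognition", "openai/whisper-base")

# src.pcm holds raw float32 samples; a NumPy array passed to the pipeline is
# taken to already be at the model's sampling rate (16 kHz for Whisper).
audio = np.fromfile("src.pcm", dtype=np.float32)

print(audio.shape)
# Greedy decoding with phrase-level timestamps
result = pipe(audio, return_timestamps=True, num_beams=1, do_sample=False)
print(result)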

producing

{'text': " everyday style. True classic delivers premium essentials built for real life. Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today. Now before we go, just want to give a big shout out to the CEO and founder Ryan brother for coming on our show and just showing some love. Now let's get back to the episode. I mean, like I said, we're going through that. We're losing stars. And then we kind of...", 'chunks': [{'timestamp': (0.0, 4.8), 'text': ' everyday style. True classic delivers premium essentials built for real life.'}, {'timestamp': (5.36, 14.0), 'text': ' Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today.'}, {'timestamp': (14.0, 18.56), 'text': ' Now before we go, just want to give a big shout out to the CEO and founder Ryan'}, {'timestamp': (18.56, 23.6), 'text': " brother for coming on our show and just showing some love. Now let's get back to the episode."}, {'timestamp': (24.24, 23.6), 'text': ''}, {'timestamp': (27.02, 29.18), 'text': " I mean, like I said, we're going through that. We're losing stars."}, {'timestamp': (29.18, 30.54), 'text': ' And then we kind of...'}]}

notice

{'timestamp': (24.24, 23.6), 'text': ''}

(empty chunk + backwards timestamp)
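Both symptoms noted above can be checked mechanically on a `chunks` array. This is a minimal sketch, assuming chunks shaped like `{ timestamp: [start, end], text }` as in the outputs in this thread; `findInvalidChunks` is a hypothetical name, and skipping a `null` end (an open-ended chunk) is an assumption about how unfinished chunks are represented.

```javascript
// Hypothetical validator for illustration only (not part of transformers.js).
// Flags chunks with empty text or with an end timestamp before the start.
function findInvalidChunks(chunks) {
  const problems = [];
  chunks.forEach((chunk, index) => {
    const [start, end] = chunk.timestamp;
    if (chunk.text.trim().length === 0) {
      problems.push({ index, reason: "empty text" });
    }
    // Assumption: an end of null marks an open-ended chunk and is not an error.
    if (end !== null && end < start) {
      problems.push({ index, reason: "end before start" });
    }
  });
  return problems;
}

// Three chunks from the Python output above, rewritten as JS arrays.
const problems = findInvalidChunks([
  { timestamp: [18.56, 23.6], text: " brother for coming on our show and just showing some love. Now let's get back to the episode." },
  { timestamp: [24.24, 23.6], text: "" },
  { timestamp: [29.18, 30.54], text: " And then we kind of..." },
]);
console.log(problems);
// [ { index: 1, reason: 'empty text' }, { index: 1, reason: 'end before start' } ]
```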

The current version I've got in my dev branch, after making a few fixes, looks like

{
  text: " everyday style. True classic delivers premium essentials built for real life. Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today. Now before we go, just want to give a big shout out to the CEO and founder Ryan brother for coming on our show and just showing some love. Now let's get back to the episode. I mean, like I said, we're going through that. We're losing stars. And then we kind of...",
  chunks: [
    { timestamp: [0, 4.8], text: " everyday style. True classic delivers premium essentials built for real life." },
    { timestamp: [5.36, 14], text: " Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today." },
    { timestamp: [14, 18.56], text: " Now before we go, just want to give a big shout out to the CEO and founder Ryan" },
    { timestamp: [18.56, 23.6], text: " brother for coming on our show and just showing some love. Now let's get back to the episode." },
    { timestamp: [23.6, 27.02], text: " I mean, like I said, we're going through that." },
    { timestamp: [27.02, 29.18], text: " We're losing stars." },
    { timestamp: [29.18, 30], text: " And then we kind of..." },
  ],
};

Would you say that's a suitable transcription?

Fixes #1357

xenova · 2026-03-18T19:26:54Z

Here's the current output for #1590

[
  '2.54 -> 5.14  We',
  '5.14 -> 5.44  have',
  '5.44 -> 5.76  main',
  '5.76 -> 6.22  engine',
  '6.22 -> 6.54  start.',
  '7.2 -> 7.42  Four,',
  '7.78 -> 8.14  three,',
  '8.58 -> 9.12  two,',
  '9.54 -> 9.6  one.',
  "23.2 -> 23.54  You're",
  '23.54 -> 23.94  a',
  '23.94 -> 24.12  jerk,',
  '24.24 -> 24.32  Tom.',
  '24.66 -> 24.94  Look,',
  '25.06 -> 25.38  Celia,',
  '25.44 -> 25.64  we',
  '25.64 -> 25.78  have',
  '25.78 -> 26.08  to',
  '26.08 -> 26.28  follow',
  '26.28 -> 26.82  our',
  '26.82 -> 26.82  passions.',
  '27.44 -> 27.58  You',
  '27.58 -> 27.7  have',
  '27.7 -> 28.18  your',
  '28.18 -> 28.36  robotics,',
  '28.36 -> 28.44  and',
  '28.44 -> 28.58  I',
  '28.58 -> 28.74  just',
  '28.74 -> 28.82  want',
  '28.82 -> 28.94  to',
  '28.94 -> 29.44  be',
  '29.44 -> 29.62  awesome',
  '29.62 -> 29.94  in',
  '29.94 -> 30.02  space.',
  '30.82 -> 30.98  Why',
  "30.98 -> 31.18  don't",
  '31.18 -> 31.4  you',
  '31.4 -> 31.78  just',
  '31.78 -> 31.98  admit',
  '31.98 -> 32.14  that',
  "32.14 -> 32.92  you're",
  '32.92 -> 33.1  freaked',
  '33.1 -> 33.3  out',
  '33.3 -> 33.48  by',
  '33.48 -> 33.96  my',
  '33.96 -> 34.28  robot',
  '34.28 -> 34.32  hand?',
  "34.8 -> 35.02  I'm",
  '35.02 -> 35.24  not',
  '35.24 -> 35.44  freaked',
  '35.44 -> 35.64  out',
  '35.64 -> 35.8  by',
  '35.8 -> 35.86  it.',
  '37.72 -> 37.9  All',
  '37.9 -> 38.04  right,',
  '38.28 -> 38.36  fine.',
  "38.72 -> 39.06  I'm",
  '39.06 -> 39.36  freaked',
  '39.36 -> 39.4  out.',
  "39.5 -> 39.7  I'm",
  '39.7 -> 40.1  having',
  '40.1 -> 40.26  nightmares',
  '40.26 -> 40.32  that',
  "40.32 -> 40.52  I'm",
  '40.52 -> 40.82  being',
  '40.82 -> 40.98  chased',
  '40.98 -> 41.1  by',
  '41.1 -> 41.44  these',
  '41.44 -> 42.02  giant',
  '42.02 -> 42.38  robotic',
  '42.38 -> 42.4  claws.',
  '42.42 -> 42.44  Oh,',
  '42.84 -> 43.12  whatever,',
  '43.32 -> 43.4  Tom.',
  "44.44 -> 44.88  We're",
  '44.88 -> 45.16  done.',
  "50.34 -> 50.96  Robot's",
  '50.96 -> 51.44  memory',
  '51.44 -> 52.6  synced',
  '52.6 -> 53.54  and',
  '53.54 -> 53.8  locked.',
  '60.3 -> 60.34  Oh,',
  '60.44 -> 60.52  no.'
]

(the "Oh, no" is a hallucination which also appears in the Python version)

xenova · 2026-03-18T19:56:08Z

And here are the outputs for src.pcm, for both phrase-level and word-level timestamps:

phrase-level:

{
  text: " everyday style. True classic delivers premium essentials built for real life. Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today. Now before we go, just want to give a big shout out to the CEO and founder Ryan brother for coming on our show and just showing some love. Now let's get back to the episode. I mean, like I said, we're going through that. We're losing stars. And then we kind of...",
  chunks: [
    {
      timestamp: [ 0, 4.8 ],
      text: ' everyday style. True classic delivers premium essentials built for real life.'
    },
    {
      timestamp: [ 5.36, 14 ],
      text: ' Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today.'
    },
    {
      timestamp: [ 14, 18.56 ],
      text: ' Now before we go, just want to give a big shout out to the CEO and founder Ryan'
    },
    {
      timestamp: [ 18.56, 23.6 ],
      text: " brother for coming on our show and just showing some love. Now let's get back to the episode."
    },
    {
      timestamp: [ 23.6, 27.02 ],
      text: " I mean, like I said, we're going through that."
    },
    { timestamp: [ 27.02, 29.18 ], text: " We're losing stars." },
    { timestamp: [ 29.18, 30 ], text: ' And then we kind of...' }
  ]
}

word-level:

{
  text: " everyday style. True classic delivers premium essentials built for real life. Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today. Now before we go, just want to give a big shout out to the CEO and founder Ryan brother for coming on our show and just showing some love. Now let's get back to the episode. I mean, like I said, we're going through that. We're losing stars. And then we kind of...",
  chunks: [
    { text: ' everyday', timestamp: [ 0.42, 0.82 ] },
    { text: ' style.', timestamp: [ 0.82, 1.3 ] },
    { text: ' True', timestamp: [ 1.56, 1.9 ] },
    { text: ' classic', timestamp: [ 1.9, 2.5 ] },
    { text: ' delivers', timestamp: [ 2.5, 3.02 ] },
    { text: ' premium', timestamp: [ 3.02, 3.54 ] },
    { text: ' essentials', timestamp: [ 3.54, 3.84 ] },
    { text: ' built', timestamp: [ 3.84, 4.08 ] },
    { text: ' for', timestamp: [ 4.08, 4.4 ] },
    { text: ' real', timestamp: [ 4.4, 4.8 ] },
    { text: ' life.', timestamp: [ 4.92, 5.46 ] },
    { text: ' Grab', timestamp: [ 5.64, 6.04 ] },
    { text: ' yours', timestamp: [ 6.04, 6.36 ] },
    { text: ' at', timestamp: [ 6.36, 6.84 ] },
    { text: ' Target,', timestamp: [ 6.84, 7.2 ] },
    { text: ' Costco,', timestamp: [ 7.7, 8.28 ] },
    { text: ' or', timestamp: [ 8.3, 8.5 ] },
    { text: ' head', timestamp: [ 8.5, 8.7 ] },
    { text: ' to', timestamp: [ 8.7, 8.92 ] },
    { text: ' TrueClassic', timestamp: [ 8.92, 9.68 ] },
    { text: '.com', timestamp: [ 9.68, 10.78 ] },
    { text: ' slash', timestamp: [ 10.78, 11.3 ] },
    { text: ' P4P.', timestamp: [ 11.3, 12.86 ] },
    { text: ' Get', timestamp: [ 13.06, 13.3 ] },
    { text: ' hooked', timestamp: [ 13.3, 13.52 ] },
    { text: ' up', timestamp: [ 13.52, 13.92 ] },
    { text: ' today.', timestamp: [ 13.92, 14 ] },
    { text: ' Now', timestamp: [ 14.24, 14.46 ] },
    { text: ' before', timestamp: [ 14.46, 14.62 ] },
    { text: ' we', timestamp: [ 14.62, 14.82 ] },
    { text: ' go,', timestamp: [ 14.82, 15.06 ] },
    { text: ' just', timestamp: [ 15.24, 15.42 ] },
    { text: ' want', timestamp: [ 15.42, 15.48 ] },
    { text: ' to', timestamp: [ 15.48, 15.56 ] },
    { text: ' give', timestamp: [ 15.56, 15.68 ] },
    { text: ' a', timestamp: [ 15.68, 15.86 ] },
    { text: ' big', timestamp: [ 15.86, 16.1 ] },
    { text: ' shout', timestamp: [ 16.1, 16.28 ] },
    { text: ' out', timestamp: [ 16.28, 16.74 ] },
    { text: ' to', timestamp: [ 16.74, 16.9 ] },
    { text: ' the', timestamp: [ 16.9, 17.52 ] },
    { text: ' CEO', timestamp: [ 17.52, 17.86 ] },
    { text: ' and', timestamp: [ 17.86, 18.3 ] },
    { text: ' founder', timestamp: [ 18.3, 18.56 ] },
    { text: ' Ryan', timestamp: [ 18.6, 18.72 ] },
    { text: ' brother', timestamp: [ 18.86, 19.06 ] },
    { text: ' for', timestamp: [ 19.06, 19.2 ] },
    { text: ' coming', timestamp: [ 19.2, 19.36 ] },
    { text: ' on', timestamp: [ 19.36, 19.5 ] },
    { text: ' our', timestamp: [ 19.5, 19.74 ] },
    { text: ' show', timestamp: [ 19.74, 20.4 ] },
    { text: ' and', timestamp: [ 20.4, 20.6 ] },
    { text: ' just', timestamp: [ 20.6, 20.86 ] },
    { text: ' showing', timestamp: [ 20.86, 21.08 ] },
    { text: ' some', timestamp: [ 21.08, 21.26 ] },
    { text: ' love.', timestamp: [ 21.26, 21.48 ] },
    { text: ' Now', timestamp: [ 21.6, 22.3 ] },
    { text: " let's", timestamp: [ 22.3, 22.46 ] },
    { text: ' get', timestamp: [ 22.46, 22.68 ] },
    { text: ' back', timestamp: [ 22.68, 23.06 ] },
    { text: ' to', timestamp: [ 23.06, 23.2 ] },
    { text: ' the', timestamp: [ 23.2, 23.56 ] },
    { text: ' episode.', timestamp: [ 23.56, 23.6 ] },
    { text: ' I', timestamp: [ 24.42, 24.6 ] },
    { text: ' mean,', timestamp: [ 24.6, 25.1 ] },
    { text: ' like', timestamp: [ 25.42, 25.54 ] },
    { text: ' I', timestamp: [ 25.54, 25.68 ] },
    { text: ' said,', timestamp: [ 25.68, 25.9 ] },
    { text: " we're", timestamp: [ 25.9, 26.34 ] },
    { text: ' going', timestamp: [ 26.34, 26.58 ] },
    { text: ' through', timestamp: [ 26.58, 26.84 ] },
    { text: ' that.', timestamp: [ 26.84, 27 ] },
    { text: " We're", timestamp: [ 27.24, 27.54 ] },
    { text: ' losing', timestamp: [ 27.54, 28.04 ] },
    { text: ' stars.', timestamp: [ 28.04, 28.96 ] },
    { text: ' And', timestamp: [ 29.3, 29.38 ] },
    { text: ' then', timestamp: [ 29.38, 29.58 ] },
    { text: ' we', timestamp: [ 29.58, 29.84 ] },
    { text: ' kind', timestamp: [ 29.84, 29.98 ] },
    { text: ' of', timestamp: [ 29.98, 29.98 ] },
    { text: '...', timestamp: [ 29.98, 29.98 ] }
  ]
}

@jozefchutka let me know if that looks good now! :) And if any other test cases of yours are still having issues.

jozefchutka · 2026-03-19T07:27:41Z

Thanks for putting so much effort into this, @xenova. I re-tested the latest commit:

whisper-large-v3-turbo_timestamped has broken timestamps #1357 - fixed
whisper-base_timestamped broken with chunk_length_s=30 #1358 - fixed
Regression in v4 automatic-speech-recognition #1590 - using the reported setup (and ch.pcm), I am seeing no difference, still getting just a single entry: ['0 -> 29.98 Chocolate Rain']

Do you think #1590 can be fixed somehow?

xenova added 6 commits March 17, 2026 21:19

Fix whisper word-level timestamps

a684670

add whisper tiny tests

af01d89

Cap word end timestamps to the chunk's end timestamp

8669e4d

Fix WhisperTimeStampLogitsProcessor

d9af246

Update whisper unit tests

90467ea

Formatting

6145a6f
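The idea named in the commit "Cap word end timestamps to the chunk's end timestamp" can be illustrated with a small sketch. This is not the code from commit 8669e4d, which may differ; `capWordTimestamps` is a hypothetical helper, and the input values below are made up for the example.

```javascript
// Hypothetical illustration only; the actual change in the PR may differ.
// Clamps every word's start and end so that neither exceeds `chunkEnd`.
function capWordTimestamps(words, chunkEnd) {
  return words.map(({ text, timestamp: [start, end] }) => ({
    text,
    timestamp: [Math.min(start, chunkEnd), Math.min(end, chunkEnd)],
  }));
}

// Made-up word timestamps that run past the end of a 29.98 s chunk.
const capped = capWordTimestamps(
  [
    { text: " kind", timestamp: [29.84, 29.98] },
    { text: " of", timestamp: [29.98, 30.12] },
    { text: "...", timestamp: [30.12, 30.54] },
  ],
  29.98,
);
console.log(capped.map((word) => word.timestamp));
// [ [ 29.84, 29.98 ], [ 29.98, 29.98 ], [ 29.98, 29.98 ] ]
```

Clamping keeps timestamps within the audio that was actually decoded, at the cost of zero-length words at the chunk boundary.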

xenova added 3 commits March 17, 2026 23:10

Add spectrogram unit tests

d713736

Fix spectrogram center padding

86dd62e

window_function should zero-pad the window when frame_length is provided

04a08a1
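As a rough illustration of the last commit above, the sketch below builds a periodic Hann window and zero-pads it to a longer frame. It follows my understanding of how `frame_length` behaves in the Python `window_function` (centre the window inside a zero-filled frame); `hannWindow` and `padWindow` are hypothetical names and not the functions changed in this PR.

```javascript
// Hypothetical sketch for illustration only; not the PR's implementation.

// Hann window; `periodic` selects the DFT-friendly variant with period N.
function hannWindow(windowLength, periodic = true) {
  const denominator = periodic ? windowLength : windowLength - 1;
  const window = new Float64Array(windowLength);
  for (let n = 0; n < windowLength; ++n) {
    window[n] = 0.5 - 0.5 * Math.cos((2 * Math.PI * n) / denominator);
  }
  return window;
}

// Places `window` inside a zero-filled frame of `frameLength` samples,
// centred by default (left-aligned when `center` is false).
function padWindow(window, frameLength, center = true) {
  if (window.length > frameLength) {
    throw new Error("window length may not exceed frame length");
  }
  const padded = new Float64Array(frameLength); // initialised to zeros
  const offset = center ? Math.floor((frameLength - window.length) / 2) : 0;
  padded.set(window, offset);
  return padded;
}

// A 4-sample window in an 8-sample frame: two zeros on each side.
const padded = padWindow(hannWindow(4), 8);
console.log(Array.from(padded, (value) => Number(value.toFixed(3))));
```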

This was referenced Mar 18, 2026

Fix word-level timestamp overflow in Whisper chunked transcription #1483

Closed

Fix word-level timestamp overflow in Whisper chunked transcription #1486

Closed

xenova added 2 commits March 18, 2026 13:44

Add support for SuppressTokensLogitsProcessor

2d37278

Add whisper base unit test

5c34c73
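For readers unfamiliar with token suppression: a processor of this kind sets the logits of selected token IDs to negative infinity so they can never be picked during decoding. The sketch below shows the idea on a plain `Float32Array`; `suppressTokens` and `argmax` are hypothetical stand-alone functions, whereas the class added in the PR presumably operates on the library's own tensor types.

```javascript
// Hypothetical stand-alone sketch; not the class added in this PR.

// Returns a copy of `logits` with the given token ids masked out.
function suppressTokens(logits, suppressTokenIds) {
  const masked = Float32Array.from(logits);
  for (const id of suppressTokenIds) {
    if (id >= 0 && id < masked.length) masked[id] = -Infinity;
  }
  return masked;
}

// Index of the largest value (greedy decoding picks this token).
function argmax(values) {
  let best = 0;
  for (let i = 1; i < values.length; ++i) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

const logits = new Float32Array([0.1, 2.5, -1.0, 3.0]);
const masked = suppressTokens(logits, [1, 3]);
console.log(argmax(logits)); // 3
console.log(argmax(masked)); // 0
```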

xenova added 6 commits March 18, 2026 14:16

Update audio test cache

1f58985

Update test_pipelines_automatic_speech_recognition.js

c859021

Add timestamp merge tolerance

5b766d7

Fixes #1357

Implement _generate_with_seek for whisper

93c940d

Add another _decode_asr test

017d677

Fix backwards-jumping timestamps

27b7918

xenova added 3 commits March 18, 2026 15:44

Update test_pipelines_automatic_speech_recognition.js

b036ce3

final fix

ca96392

fix build

8c53aba

set max test execution time

44469fc

xenova wants to merge 21 commits into main from whisper-fixes

3 participants
