Overdue Whisper fixes #1594

Open
xenova wants to merge 21 commits into main from whisper-fixes

@xenova
Collaborator

xenova commented Mar 18, 2026

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova
Collaborator Author

xenova commented Mar 18, 2026

@jozefchutka would appreciate if you could check this out! 🙏 We'll put out a new release tomorrow, aiming to fit this in.

@jozefchutka

jozefchutka commented Mar 18, 2026

I tested this branch with:

...
60: "39.7 -> 40.1  having"
61: "40.1 -> 40.26  nightmares"
62: "40.26 -> 40.32  that"
63: "40.32 -> 40.52  I'm"
64: "40.52 -> 40.82  being"
65: "40.82 -> 40.98  chased"
66: "40.98 -> 41.1  by"
67: "41.1 -> 41.44  these"
68: "41.44 -> 42.02  giant"
69: "42.02 -> 42.38  robotic"
70: "42.38 -> 42.4  claws."
71: "42.42 -> 42.44  Oh,"
72: "42.84 -> 43.12  whatever,"
73: "43.32 -> 43.4  Tom."
74: "44.44 -> 44.88  We're"
75: "44.88 -> 45.16  done."
76: "40.24 -> 40.34  that"
77: "40.34 -> 40.52  I'm"
78: "40.52 -> 40.8  being"
79: "40.8 -> 40.98  chased"
80: "40.98 -> 41.1  by"
81: "41.1 -> 41.44  this"
82: "41.44 -> 42  giant"
83: "42 -> 42.34  robotic"
...
...
65:"25.52 -> 25.68  I"
66:"25.68 -> 25.84  said,"
67:"25.9 -> 26.34  we're"
68:"26.34 -> 26.58  going"
69:"26.58 -> 26.8  through"
70:"26.82 -> 27.04  that."
END

Please consider adding these three cases to the integration tests. I provided minimal repro code and a .pcm file in the first post of each issue.
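
A minimal sketch of what such a regression check might look like, assuming the word lines keep the `start -> end  word` format shown in the output above. It reports every index where a word starts earlier than the previous word ended, which is the pattern visible at index 76 (restarting at 40.24 after index 75 ended at 45.16). The names `parse_word_line` and `find_time_regressions` are hypothetical and not part of transformers.js or its test suite.

```python
import re

# Matches lines such as "40.1 -> 40.26  nightmares" (format copied from the output above).
WORD_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*->\s*(\d+(?:\.\d+)?)\s+(.*)$")


def parse_word_line(line):
    """Split one 'start -> end  word' line into (start, end, word)."""
    match = WORD_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognised word line: {line!r}")
    return float(match.group(1)), float(match.group(2)), match.group(3)


def find_time_regressions(lines, tolerance=0.0):
    """Return indices of words that start before the previous word ended."""
    regressions = []
    prev_end = None
    for index, line in enumerate(lines):
        start, end, _word = parse_word_line(line)
        if prev_end is not None and start < prev_end - tolerance:
            regressions.append(index)
        prev_end = end
    return regressions
```

Called on the three lines `"44.88 -> 45.16  done."`, `"40.24 -> 40.34  that"`, `"40.34 -> 40.52  I'm"`, this returns `[1]`, flagging the word that jumps back in time. An integration test could then assert that the list is empty for each repro file.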

@xenova
Collaborator Author

xenova commented Mar 18, 2026

Thanks so much for testing! Strangely, I tested your exact inputs from those issues and got much better results... Let me revisit.

@xenova
Collaborator Author

xenova commented Mar 18, 2026

The Python version can also run into some weird issues, like

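from transformers import pipeline
import numpy as np

pipe = pipeline("automatic-speech-recognition", "openai/whisper-base")

# Raw 32-bit float PCM samples. When given a NumPy array, the pipeline assumes
# the audio is already at the model's sampling rate (16 kHz for Whisper).
audio = np.fromfile("src.pcm", dtype=np.float32)

print(audio.shape)
# Greedy decoding (single beam, no sampling) with phrase-level timestamps.
result = pipe(audio, return_timestamps=True, num_beams=1, do_sample=False)
print(result)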

producing

{'text': " everyday style. True classic delivers premium essentials built for real life. Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today. Now before we go, just want to give a big shout out to the CEO and founder Ryan brother for coming on our show and just showing some love. Now let's get back to the episode. I mean, like I said, we're going through that. We're losing stars. And then we kind of...", 'chunks': [{'timestamp': (0.0, 4.8), 'text': ' everyday style. True classic delivers premium essentials built for real life.'}, {'timestamp': (5.36, 14.0), 'text': ' Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today.'}, {'timestamp': (14.0, 18.56), 'text': ' Now before we go, just want to give a big shout out to the CEO and founder Ryan'}, {'timestamp': (18.56, 23.6), 'text': " brother for coming on our show and just showing some love. Now let's get back to the episode."}, {'timestamp': (24.24, 23.6), 'text': ''}, {'timestamp': (27.02, 29.18), 'text': " I mean, like I said, we're going through that. We're losing stars."}, {'timestamp': (29.18, 30.54), 'text': ' And then we kind of...'}]}

notice

{'timestamp': (24.24, 23.6), 'text': ''}

(empty chunk + backwards timestamp)
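
Both symptoms can be caught mechanically. The following is a minimal sketch, assuming chunks shaped like the pipeline output above (a `timestamp` pair plus `text`). The helper name `find_bad_chunks` is hypothetical, and the `None` handling is a defensive assumption for chunks whose end time is missing.

```python
def find_bad_chunks(chunks):
    """Return (index, reason) pairs for chunks that look malformed."""
    problems = []
    for index, chunk in enumerate(chunks):
        start, end = chunk["timestamp"]
        # A chunk with no visible text carries no transcription.
        if not chunk["text"].strip():
            problems.append((index, "empty text"))
        # An end time earlier than the start time is a backwards timestamp.
        if start is not None and end is not None and end < start:
            problems.append((index, "end before start"))
    return problems


# Three consecutive chunks copied from the Python output above.
sample = [
    {"timestamp": (18.56, 23.6), "text": " brother for coming on our show and just showing some love. Now let's get back to the episode."},
    {"timestamp": (24.24, 23.6), "text": ""},
    {"timestamp": (27.02, 29.18), "text": " I mean, like I said, we're going through that. We're losing stars."},
]
print(find_bad_chunks(sample))
```

For this sample the check prints `[(1, 'empty text'), (1, 'end before start')]`, flagging the middle chunk for both reasons.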


The current version I've got in my dev branch, after making a few fixes, looks like

{
  text: " everyday style. True classic delivers premium essentials built for real life. Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today. Now before we go, just want to give a big shout out to the CEO and founder Ryan brother for coming on our show and just showing some love. Now let's get back to the episode. I mean, like I said, we're going through that. We're losing stars. And then we kind of...",
  chunks: [
    { timestamp: [0, 4.8], text: " everyday style. True classic delivers premium essentials built for real life." },
    { timestamp: [5.36, 14], text: " Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today." },
    { timestamp: [14, 18.56], text: " Now before we go, just want to give a big shout out to the CEO and founder Ryan" },
    { timestamp: [18.56, 23.6], text: " brother for coming on our show and just showing some love. Now let's get back to the episode." },
    { timestamp: [23.6, 27.02], text: " I mean, like I said, we're going through that." },
    { timestamp: [27.02, 29.18], text: " We're losing stars." },
    { timestamp: [29.18, 30], text: " And then we kind of..." },
  ],
};

would you say that's a suitable transcription?

@xenova
Collaborator Author

xenova commented Mar 18, 2026

here's the current version for #1590

[
  '2.54 -> 5.14  We',
  '5.14 -> 5.44  have',
  '5.44 -> 5.76  main',
  '5.76 -> 6.22  engine',
  '6.22 -> 6.54  start.',
  '7.2 -> 7.42  Four,',
  '7.78 -> 8.14  three,',
  '8.58 -> 9.12  two,',
  '9.54 -> 9.6  one.',
  "23.2 -> 23.54  You're",
  '23.54 -> 23.94  a',
  '23.94 -> 24.12  jerk,',
  '24.24 -> 24.32  Tom.',
  '24.66 -> 24.94  Look,',
  '25.06 -> 25.38  Celia,',
  '25.44 -> 25.64  we',
  '25.64 -> 25.78  have',
  '25.78 -> 26.08  to',
  '26.08 -> 26.28  follow',
  '26.28 -> 26.82  our',
  '26.82 -> 26.82  passions.',
  '27.44 -> 27.58  You',
  '27.58 -> 27.7  have',
  '27.7 -> 28.18  your',
  '28.18 -> 28.36  robotics,',
  '28.36 -> 28.44  and',
  '28.44 -> 28.58  I',
  '28.58 -> 28.74  just',
  '28.74 -> 28.82  want',
  '28.82 -> 28.94  to',
  '28.94 -> 29.44  be',
  '29.44 -> 29.62  awesome',
  '29.62 -> 29.94  in',
  '29.94 -> 30.02  space.',
  '30.82 -> 30.98  Why',
  "30.98 -> 31.18  don't",
  '31.18 -> 31.4  you',
  '31.4 -> 31.78  just',
  '31.78 -> 31.98  admit',
  '31.98 -> 32.14  that',
  "32.14 -> 32.92  you're",
  '32.92 -> 33.1  freaked',
  '33.1 -> 33.3  out',
  '33.3 -> 33.48  by',
  '33.48 -> 33.96  my',
  '33.96 -> 34.28  robot',
  '34.28 -> 34.32  hand?',
  "34.8 -> 35.02  I'm",
  '35.02 -> 35.24  not',
  '35.24 -> 35.44  freaked',
  '35.44 -> 35.64  out',
  '35.64 -> 35.8  by',
  '35.8 -> 35.86  it.',
  '37.72 -> 37.9  All',
  '37.9 -> 38.04  right,',
  '38.28 -> 38.36  fine.',
  "38.72 -> 39.06  I'm",
  '39.06 -> 39.36  freaked',
  '39.36 -> 39.4  out.',
  "39.5 -> 39.7  I'm",
  '39.7 -> 40.1  having',
  '40.1 -> 40.26  nightmares',
  '40.26 -> 40.32  that',
  "40.32 -> 40.52  I'm",
  '40.52 -> 40.82  being',
  '40.82 -> 40.98  chased',
  '40.98 -> 41.1  by',
  '41.1 -> 41.44  these',
  '41.44 -> 42.02  giant',
  '42.02 -> 42.38  robotic',
  '42.38 -> 42.4  claws.',
  '42.42 -> 42.44  Oh,',
  '42.84 -> 43.12  whatever,',
  '43.32 -> 43.4  Tom.',
  "44.44 -> 44.88  We're",
  '44.88 -> 45.16  done.',
  "50.34 -> 50.96  Robot's",
  '50.96 -> 51.44  memory',
  '51.44 -> 52.6  synced',
  '52.6 -> 53.54  and',
  '53.54 -> 53.8  locked.',
  '60.3 -> 60.34  Oh,',
  '60.44 -> 60.52  no.'
]

(the "Oh, no" is a hallucination which also appears in the Python version)

@xenova
Collaborator Author

xenova commented Mar 18, 2026

and here are the outputs for src.pcm, for both phrase-level and word-level timestamps:

phrase-level:

{
  text: " everyday style. True classic delivers premium essentials built for real life. Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today. Now before we go, just want to give a big shout out to the CEO and founder Ryan brother for coming on our show and just showing some love. Now let's get back to the episode. I mean, like I said, we're going through that. We're losing stars. And then we kind of...",
  chunks: [
    {
      timestamp: [ 0, 4.8 ],
      text: ' everyday style. True classic delivers premium essentials built for real life.'
    },
    {
      timestamp: [ 5.36, 14 ],
      text: ' Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today.'
    },
    {
      timestamp: [ 14, 18.56 ],
      text: ' Now before we go, just want to give a big shout out to the CEO and founder Ryan'
    },
    {
      timestamp: [ 18.56, 23.6 ],
      text: " brother for coming on our show and just showing some love. Now let's get back to the episode."
    },
    {
      timestamp: [ 23.6, 27.02 ],
      text: " I mean, like I said, we're going through that."
    },
    { timestamp: [ 27.02, 29.18 ], text: " We're losing stars." },
    { timestamp: [ 29.18, 30 ], text: ' And then we kind of...' }
  ]
}

word-level:

{
  text: " everyday style. True classic delivers premium essentials built for real life. Grab yours at Target, Costco, or head to TrueClassic.com slash P4P. Get hooked up today. Now before we go, just want to give a big shout out to the CEO and founder Ryan brother for coming on our show and just showing some love. Now let's get back to the episode. I mean, like I said, we're going through that. We're losing stars. And then we kind of...",
  chunks: [
    { text: ' everyday', timestamp: [ 0.42, 0.82 ] },
    { text: ' style.', timestamp: [ 0.82, 1.3 ] },
    { text: ' True', timestamp: [ 1.56, 1.9 ] },
    { text: ' classic', timestamp: [ 1.9, 2.5 ] },
    { text: ' delivers', timestamp: [ 2.5, 3.02 ] },
    { text: ' premium', timestamp: [ 3.02, 3.54 ] },
    { text: ' essentials', timestamp: [ 3.54, 3.84 ] },
    { text: ' built', timestamp: [ 3.84, 4.08 ] },
    { text: ' for', timestamp: [ 4.08, 4.4 ] },
    { text: ' real', timestamp: [ 4.4, 4.8 ] },
    { text: ' life.', timestamp: [ 4.92, 5.46 ] },
    { text: ' Grab', timestamp: [ 5.64, 6.04 ] },
    { text: ' yours', timestamp: [ 6.04, 6.36 ] },
    { text: ' at', timestamp: [ 6.36, 6.84 ] },
    { text: ' Target,', timestamp: [ 6.84, 7.2 ] },
    { text: ' Costco,', timestamp: [ 7.7, 8.28 ] },
    { text: ' or', timestamp: [ 8.3, 8.5 ] },
    { text: ' head', timestamp: [ 8.5, 8.7 ] },
    { text: ' to', timestamp: [ 8.7, 8.92 ] },
    { text: ' TrueClassic', timestamp: [ 8.92, 9.68 ] },
    { text: '.com', timestamp: [ 9.68, 10.78 ] },
    { text: ' slash', timestamp: [ 10.78, 11.3 ] },
    { text: ' P4P.', timestamp: [ 11.3, 12.86 ] },
    { text: ' Get', timestamp: [ 13.06, 13.3 ] },
    { text: ' hooked', timestamp: [ 13.3, 13.52 ] },
    { text: ' up', timestamp: [ 13.52, 13.92 ] },
    { text: ' today.', timestamp: [ 13.92, 14 ] },
    { text: ' Now', timestamp: [ 14.24, 14.46 ] },
    { text: ' before', timestamp: [ 14.46, 14.62 ] },
    { text: ' we', timestamp: [ 14.62, 14.82 ] },
    { text: ' go,', timestamp: [ 14.82, 15.06 ] },
    { text: ' just', timestamp: [ 15.24, 15.42 ] },
    { text: ' want', timestamp: [ 15.42, 15.48 ] },
    { text: ' to', timestamp: [ 15.48, 15.56 ] },
    { text: ' give', timestamp: [ 15.56, 15.68 ] },
    { text: ' a', timestamp: [ 15.68, 15.86 ] },
    { text: ' big', timestamp: [ 15.86, 16.1 ] },
    { text: ' shout', timestamp: [ 16.1, 16.28 ] },
    { text: ' out', timestamp: [ 16.28, 16.74 ] },
    { text: ' to', timestamp: [ 16.74, 16.9 ] },
    { text: ' the', timestamp: [ 16.9, 17.52 ] },
    { text: ' CEO', timestamp: [ 17.52, 17.86 ] },
    { text: ' and', timestamp: [ 17.86, 18.3 ] },
    { text: ' founder', timestamp: [ 18.3, 18.56 ] },
    { text: ' Ryan', timestamp: [ 18.6, 18.72 ] },
    { text: ' brother', timestamp: [ 18.86, 19.06 ] },
    { text: ' for', timestamp: [ 19.06, 19.2 ] },
    { text: ' coming', timestamp: [ 19.2, 19.36 ] },
    { text: ' on', timestamp: [ 19.36, 19.5 ] },
    { text: ' our', timestamp: [ 19.5, 19.74 ] },
    { text: ' show', timestamp: [ 19.74, 20.4 ] },
    { text: ' and', timestamp: [ 20.4, 20.6 ] },
    { text: ' just', timestamp: [ 20.6, 20.86 ] },
    { text: ' showing', timestamp: [ 20.86, 21.08 ] },
    { text: ' some', timestamp: [ 21.08, 21.26 ] },
    { text: ' love.', timestamp: [ 21.26, 21.48 ] },
    { text: ' Now', timestamp: [ 21.6, 22.3 ] },
    { text: " let's", timestamp: [ 22.3, 22.46 ] },
    { text: ' get', timestamp: [ 22.46, 22.68 ] },
    { text: ' back', timestamp: [ 22.68, 23.06 ] },
    { text: ' to', timestamp: [ 23.06, 23.2 ] },
    { text: ' the', timestamp: [ 23.2, 23.56 ] },
    { text: ' episode.', timestamp: [ 23.56, 23.6 ] },
    { text: ' I', timestamp: [ 24.42, 24.6 ] },
    { text: ' mean,', timestamp: [ 24.6, 25.1 ] },
    { text: ' like', timestamp: [ 25.42, 25.54 ] },
    { text: ' I', timestamp: [ 25.54, 25.68 ] },
    { text: ' said,', timestamp: [ 25.68, 25.9 ] },
    { text: " we're", timestamp: [ 25.9, 26.34 ] },
    { text: ' going', timestamp: [ 26.34, 26.58 ] },
    { text: ' through', timestamp: [ 26.58, 26.84 ] },
    { text: ' that.', timestamp: [ 26.84, 27 ] },
    { text: " We're", timestamp: [ 27.24, 27.54 ] },
    { text: ' losing', timestamp: [ 27.54, 28.04 ] },
    { text: ' stars.', timestamp: [ 28.04, 28.96 ] },
    { text: ' And', timestamp: [ 29.3, 29.38 ] },
    { text: ' then', timestamp: [ 29.38, 29.58 ] },
    { text: ' we', timestamp: [ 29.58, 29.84 ] },
    { text: ' kind', timestamp: [ 29.84, 29.98 ] },
    { text: ' of', timestamp: [ 29.98, 29.98 ] },
    { text: '...', timestamp: [ 29.98, 29.98 ] }
  ]
}
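
The last two word-level chunks above (` of` and `...`) have identical start and end times. Whether that is acceptable at the clipped end of the audio is a judgment call, but a reviewer could surface such cases with a sketch like the one below. The helper name `find_zero_duration_words` and the `min_duration` threshold are hypothetical.

```python
def find_zero_duration_words(chunks, min_duration=0.0):
    """Return the text of word chunks whose duration is at most min_duration seconds."""
    flagged = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if end - start <= min_duration:
            flagged.append(chunk["text"])
    return flagged


# The final four word-level chunks copied from the output above.
tail = [
    {"text": " we", "timestamp": [29.58, 29.84]},
    {"text": " kind", "timestamp": [29.84, 29.98]},
    {"text": " of", "timestamp": [29.98, 29.98]},
    {"text": "...", "timestamp": [29.98, 29.98]},
]
print(find_zero_duration_words(tail))
```

For these four chunks the check prints `[' of', '...']`.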

@jozefchutka let me know if that looks good now! :) And if any other test cases of yours are still having issues.

@jozefchutka

jozefchutka commented Mar 19, 2026

Thanks for putting so much effort into this, @xenova. I re-tested the latest commit

Do you think #1590 can be fixed somehow?