
fix: speech to text live transcription #816

Open
IgorSwat wants to merge 18 commits into main from
@is/speech-to-text

Conversation

Contributor

@IgorSwat commented Feb 17, 2026

Description

Various improvements & adjustments in Speech-to-Text module. The list of changes includes:

  • Adjusting the native implementation to the new Whisper model format (single file, bundled encode & decode methods)
  • Refactoring the native implementation to support multiple STT models in the future
  • Fixing improper Whisper streaming behavior

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

You can run the tests defined for the Speech-to-Text module, as well as test it manually with the 'speech' demo app (SpeechToText screen).

Screenshots

Related issues

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

@msluszniak added the bug fix (PRs that are fixing bugs) label on Feb 20, 2026
@msluszniak linked an issue on Feb 20, 2026 that may be closed by this pull request
@msluszniak changed the title from @is/speech to text to fix: speech to text live transcription on Feb 20, 2026
@IgorSwat force-pushed the @is/speech-to-text branch from 7b1e6ff to 2ee6d1d on March 2, 2026 09:21
Member

@msluszniak left a comment

Some comments are not needed imo

Collaborator

@chmjkb left a comment

Overall solid work, thanks 👏🏻
Left a couple of nits

    this->decoder->unload();

    : callInvoker_(std::move(callInvoker)) {

    // Switch between the ASR implementations based on model name
    if (modelName == "whisper") {
Collaborator

food for thought: as we discussed a few days back, think about how we can make it work so that the native side doesn't need the model name, but accepts a bunch of configurable pipeline steps. no need to do this now IMO, but just a note.

Maybe we can have different ASR implementations based on whether the model does support timestamps or not?
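A rough sketch of that idea (all names here are hypothetical, not from this PR): instead of branching on a model name, the native side could accept an ordered list of configurable pipeline stages, and a timestamp-capable model would simply register an extra stage.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical sketch, not the PR's actual API: each ASR stage is a named
// transform over a float buffer (e.g. preprocess, encode, decode, timestamps).
struct PipelineStage {
  std::string name;
  std::function<std::vector<float>(const std::vector<float>&)> run;
};

class AsrPipeline {
public:
  void addStage(PipelineStage stage) { stages_.push_back(std::move(stage)); }

  // Feed the input through every registered stage in order.
  std::vector<float> process(std::vector<float> input) const {
    for (const auto& stage : stages_) {
      input = stage.run(input);
    }
    return input;
  }

private:
  std::vector<PipelineStage> stages_;
};
```

With this shape, the JS side would describe the pipeline and the native side would stay model-agnostic; whether that is worth the extra configuration surface is exactly the open question in this thread.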

    std::shared_ptr<OwningArrayBuffer>
    SpeechToText::encode(std::span<float> waveform) const {
      std::vector<float> encoderOutput = this->asr->encode(waveform);
      std::vector<float> encoderOutput = transcriber_->encode(waveform);
Collaborator

I'm thinking whether we need to return std::vector from the encoder? Maybe we would just return a span. We wrap this in OwningArrayBuffer, which copies the data.

Contributor Author

I changed the encode & decode methods to return ExecuTorch Tensor - so no redundant copies now I guess.

Collaborator

@chmjkb left a comment

Two more things:

  1. I wasn't able to compile the app for Android (due to Norbert bumping minSdkVersion in RNET). You have to bump the minSdkVersion in the example app.
  2. Once compiled, it doesn't ask for mic permissions (I'm using a Pixel 10) and silently fails.
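One possible cause of the silent mic failure (a guess, not a confirmed diagnosis for this PR): the RECORD_AUDIO permission may be missing from the example app's manifest. It must be declared there, and on Android 6+ also requested at runtime; otherwise no permission prompt appears and recording typically fails quietly.

```xml
<!-- AndroidManifest.xml: declares the permission. The app must still request
     it at runtime on Android 6.0+, or capture yields no audio and no prompt. -->
<uses-permission android:name="android.permission.RECORD_AUDIO" />
```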

@IgorSwat force-pushed the @is/speech-to-text branch from ef854bc to d253381 on March 5, 2026 11:25
@IgorSwat force-pushed the @is/speech-to-text branch from a140cfb to bf69627 on March 6, 2026 12:33
@IgorSwat force-pushed the @is/speech-to-text branch from 816c75a to ae017ef on March 6, 2026 16:12
Comment on lines +44 to +49
    if (
      !modelSources ||
      !tokenizerSources ||
      !modelSources[0] ||
      !tokenizerSources[0]
    ) {
Collaborator

Suggested change

    if (
      !modelSources ||
      !tokenizerSources ||
      !modelSources[0] ||
      !tokenizerSources[0]
    ) {

    if (
      !modelSources?.[0] ||
      !tokenizerSources?.[0]
    ) {

Collaborator

@chmjkb left a comment

I think you should change the TS side as you're returning a different thing from C++, for example:

  public async encode(waveform: Float32Array): Promise<Float32Array> {
    return new Float32Array(await this.nativeModule.encode(waveform));
  }            

Also, why did you switch back to type in SpeechToTextModelConfig?


Labels

bug fix PRs that are fixing bugs


Development

Successfully merging this pull request may close these issues.

Fix Speech to Text streaming mode

3 participants