fix(kokoro): voice loading, Synth method selection, padding fixes #943

yocontra wants to merge 2 commits into software-mansion:main from
Conversation
Thanks for the contribution, @yocontra! We'll be taking a closer look at this soon. In the meantime, if you're open to sharing your experience using this or any other APIs in the library, we'd love to hear your feedback!
Right now I'm trying to get TTS working well. I was using an ONNX-based approach and got the quality pretty high (after writing my own phonemizer, because nobody else had theirs quite right). I switched to this library since I need several kinds of models now, and I've been fixing issues to bring the quality back to where that other library was. I'm also sending some PRs to the phonemizer this library uses, since it mispronounces some words; that should get it back to perfect :)
I'm very eager for #936, as I'm trying to build a fully local replacement for Gemini Live/ChatGPT Realtime by chaining Speech-to-Text -> Llama 3.2 -> Kokoro TTS.
Voice loading:
- Read all rows from the voice file instead of truncating to kMaxInputTokens. Voice files (hexgrad/Kokoro-82M) have 510 rows; upstream discards 382.
- Changed voice_ from a fixed std::array to std::vector, sized from the file.
- voiceID: three-way min(phonemes-1, dpTokens-1, voice_.size()-1) to prevent OOB. Upstream had a latent OOB with voiceID=noTokens on a 128-element array.

Synthesizer method selection:
- Discover forward_N methods at construction, the same pattern DurationPredictor already uses. Falls back to "forward" for older/single-method models.
- Use execute() instead of forward() for named method dispatch.

Padding fixes:
- Pad indices to inputDurationLimit before the Synthesizer to prevent XNNPACK shape mismatch on repeated calls with varying duration predictions.
- When DP and Synth use the same token count (the common case), pass the DP tensor directly to Synth instead of copying (~320 KB saved).

Perf:
- Use resize() for silence padding instead of allocating temp vectors.
- Move-capture audio in the streaming callback instead of copying.
Force-pushed from 241cae9 to b5ed922.
The Synthesizer's attention drifts on longer sequences (60+ tokens), causing later phonemes to be spoken progressively faster. Cap inputTokensLimit to 60 so the Partitioner splits text into shorter chunks that stay faithful to the Duration Predictor's timing. Also switch tokenize()'s std::partition to std::stable_partition so phoneme token order is preserved when invalid tokens are filtered out.
Fixes for Kokoro TTS native code. Addresses voice data truncation, missing Synthesizer method selection, XNNPACK shape mismatches on repeated inference, progressive speed-up on longer inputs, and phoneme token reordering.
### Voice loading reads only 128 of 510 rows

`voice_` was a fixed `std::array<..., kMaxInputTokens>` (128 elements), but `hexgrad/Kokoro-82M` voice files contain 510 rows. The remaining 382 rows were silently dropped. Changed `voice_` to a `std::vector`, sized dynamically from the file.

Also fixed an OOB in `voiceID`: upstream used `std::min(phonemes.size() - 1, noTokens)`, where `noTokens` could equal 128, indexing past the end of a 128-element array. Now uses a three-way `std::min({phonemes.size() - 1, dpTokens - 1, voice_.size() - 1})`.
### Synthesizer doesn't do method selection

`DurationPredictor` discovers and selects from `forward_8`/`forward_32`/`forward_64`/`forward_128` based on input size, but `Synthesizer` only knew about `forward`. Added the same discovery and selection logic. Falls back to `"forward"` if no `forward_N` methods exist, so older models still work.
### `indices` tensor changes size between calls

`DurationPredictor::generate()` returns an `indices` vector whose size depends on the predicted durations, so it differs per input. XNNPACK caches the execution plan from the first call and errors on shape mismatches. Fixed by padding `indices` to `context_.inputDurationLimit` before passing it to the Synthesizer.
### Audio progressively speeds up on longer inputs

The Synthesizer's attention mechanism drifts on longer input sequences (60+ tokens), causing later phonemes to be spoken progressively faster than the Duration Predictor intended. The DP's timing predictions are correct, but the Synthesizer compresses later phonemes into fewer samples.
Fixed by capping `inputTokensLimit` to 60, which forces the Partitioner to split text into shorter chunks that the Synthesizer can render faithfully. Each chunk is roughly one sentence (~15-20 words).

### `tokenize()` scrambles phoneme order on invalid tokens

`std::partition` was used to filter out invalid (unrecognized) phoneme tokens, but `partition` does not preserve relative order. When any phonemes fall outside the vocabulary, the remaining valid tokens could be reordered, producing garbled audio. Changed to
`std::stable_partition`, which preserves relative order.
### Misc perf

- Skip the `durPadded` heap alloc (up to 320 KB) when DP and Synth use the same token count, which is the common case.
- Build silence padding with `resize()` directly on the output instead of allocating temporary vectors.
### Changes

- `Kokoro.h`: `voice_` from fixed array to vector
- `Kokoro.cpp`: `loadVoice()`, `synthesize()`, `generate()`, `stream()`, constructor token limit cap
- `DurationPredictor.h`: `getMethodTokenCount()`
- `Synthesizer.h`: `forwardMethods_` member, `getMethodTokenCount()`
- `Synthesizer.cpp`: method discovery and selection
- `Utils.cpp`: `stable_partition` in `tokenize()`