Enhance speech tokenizer with 48kHz version #253
takuma104 wants to merge 5 commits into jamiepine:main from
Conversation
📝 Walkthrough

Adds a persisted application setting and UI toggle to enable a 48kHz speech tokenizer for the PyTorch TTS backend. Introduces GET/PATCH /settings endpoints, JSON-backed persistence, backend logic to apply the 48k tokenizer on model load, and frontend API + Advanced Settings UI to read/update the flag.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Frontend as Frontend (ModelManagement)
    participant ApiClient as API Client
    participant Backend as Backend (FastAPI)
    participant Storage as Settings Storage
    User->>Frontend: Toggle 48kHz checkbox
    Frontend->>ApiClient: updateSettings({use_48k_speech_tokenizer: true})
    ApiClient->>Backend: PATCH /settings
    Backend->>Storage: save_app_settings(data)
    Storage-->>Backend: persisted
    Backend-->>ApiClient: AppSettings (updated)
    ApiClient-->>Frontend: success + cache update
    Frontend->>User: show toast (enabled)
    Note over Backend,Storage: On next model generation
    Backend->>Storage: load_app_settings()
    Storage-->>Backend: {use_48k_speech_tokenizer: true}
    Backend->>Backend: load/reload TTS model
    Backend->>Backend: replace speech_tokenizer with 48kHz variant
    Backend->>Backend: generate speech
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/src/components/ServerSettings/ModelManagement.tsx`:
- Around line 238-262: The 48kHz tokenizer toggle (Checkbox id "use48kTokenizer"
and its bound value appSettings?.use_48k_speech_tokenizer with
settingsMutation.mutate) is rendered for all users but only works on the PyTorch
backend; update the component to check backend capability (e.g.,
appSettings.backend === 'pytorch' or a server capability flag like
supports48kTokenizer) before rendering the Checkbox and its label, and if
unsupported either hide the control or render it disabled with an explanatory
message so changes aren't persisted as a no-op; ensure the gating logic is
applied where the Checkbox and the surrounding markup (the div containing the
label and description) are currently returned.
- Around line 33-44: The settingsMutation currently only defines onSuccess and
silently fails on network/patch errors; add an onError handler to the
useMutation call (the same object where mutationFn and onSuccess live) that
calls toast with a clear error title and description (including the error
message if available) and optionally reverts any optimistic UI changes;
reference the existing settingsMutation/useMutation, mutationFn ->
apiClient.updateSettings, onSuccess -> queryClient.setQueryData, and the toast
utility when implementing the onError handler.
In `@backend/backends/pytorch_backend.py`:
- Around line 201-207: The toggle is only applied in _load_model_sync but
load_model_async may return early if the same model size is loaded, so changes
to settings.use_48k_speech_tokenizer won't take effect; modify load_model_async
to read AppSettings(**config.load_app_settings()) and compare the desired
tokenizer mode to the current state (e.g., track a flag like
self._using_48k_tokenizer or inspect self.model.model.speech_tokenizer), and if
they differ call the same tokenizer-replacement logic (the block that sets
self.model.model.speech_tokenizer = Qwen3TTSTokenizer.from_pretrained(...)) or
trigger a reload path so the tokenizer is swapped without requiring a full
unload/reload; ensure the check references load_model_async, _load_model_sync,
settings.use_48k_speech_tokenizer and self.model.model.speech_tokenizer.
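The early-return guard the comment describes can be sketched as follows. This is a minimal illustration, not the PR's actual code: `TTSBackendSketch` and its `loads`/`swaps` counters are hypothetical names, and the real backend would perform an actual tokenizer replacement where the sketch just flips a flag.

```python
import asyncio

class TTSBackendSketch:
    """Hypothetical backend illustrating the suggestion: re-check the
    tokenizer flag even when the requested model size is already loaded."""

    def __init__(self):
        self.model = None
        self.model_size = None
        self._using_48k_tokenizer = False
        self.loads = 0   # full (re)loads performed
        self.swaps = 0   # tokenizer-only swaps performed

    async def load_model_async(self, model_size: str, use_48k: bool) -> None:
        if self.model is not None and self.model_size == model_size:
            if self._using_48k_tokenizer == use_48k:
                return  # nothing changed; keep the loaded model
            # Only the tokenizer preference changed: swap it in place
            # instead of a full unload/reload.
            self._using_48k_tokenizer = use_48k
            self.swaps += 1
            return
        # Different size (or first load): full load path.
        self.model = object()
        self.model_size = model_size
        self._using_48k_tokenizer = use_48k
        self.loads += 1

async def demo():
    b = TTSBackendSketch()
    await b.load_model_async("1.7B", use_48k=False)  # full load
    await b.load_model_async("1.7B", use_48k=False)  # no-op
    await b.load_model_async("1.7B", use_48k=True)   # tokenizer swap only
    return b.loads, b.swaps

print(asyncio.run(demo()))  # (1, 1)
```

The key point is that the size-match early return no longer hides a tokenizer change; only a genuinely identical configuration short-circuits.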
In `@backend/config.py`:
- Around line 75-88: The load_app_settings function should not swallow
parse/read errors into an empty dict: catch json.JSONDecodeError and OSError
separately when reading (in load_app_settings), rename or move the corrupt file
(e.g., to path.with_suffix('.corrupt.<timestamp>')) and surface/raise the error
(or at least log it) instead of returning {} so callers can decide; only return
{} when the file truly doesn't exist. In save_app_settings ensure the parent
directory exists before writing by calling path.parent.mkdir(parents=True,
exist_ok=True) and propagate IO errors instead of letting them fail silently;
reference get_settings_path, load_app_settings, and save_app_settings when
making these changes.
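A sketch of how the hardened helpers might look. The explicit `path` parameter is for illustration only; per the comment, the real functions resolve it via `get_settings_path()`, and the exact quarantine naming is an assumption.

```python
import json
import time
from pathlib import Path

def load_app_settings(path: Path) -> dict:
    """Return persisted settings; quarantine a corrupt file and surface
    the error instead of silently returning an empty dict."""
    if not path.exists():
        return {}  # only a truly missing file maps to empty settings
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        # Keep the corrupt file for inspection, then raise so callers
        # can decide (rather than overwriting it with defaults later).
        path.rename(path.with_suffix(f".corrupt.{int(time.time())}"))
        raise RuntimeError(f"settings file was corrupt: {exc}") from exc

def save_app_settings(path: Path, data: dict) -> None:
    # Ensure the parent directory exists before writing; IO errors propagate.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
```

With this shape, a read failure is loud and leaves evidence behind, while a first-run missing file still behaves as before.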
In `@backend/main.py`:
- Around line 1773-1781: The handler update_settings currently constructs a
models.AppSettings from config.load_app_settings() which strips unknown keys and
causes silent resets; instead, load the raw settings dict (call it raw =
config.load_app_settings()), do not convert it to models.AppSettings first,
merge the incoming patch (patch = update.model_dump(exclude_none=True)) into raw
(raw.update(patch)), then validate the merged result by constructing
models.AppSettings(**merged) to ensure correctness, and finally call
config.save_app_settings(merged) to persist — also add a guard so that if
config.load_app_settings() returns an empty dict due to a read/parse failure you
do not overwrite the file (return an error/abort the PATCH) rather than saving
defaults.
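The merge-then-validate flow can be sketched like this. `AppSettingsStub` is a stand-in for the real Pydantic `models.AppSettings`; the point is that unknown keys in the on-disk dict survive the round trip instead of being stripped by model construction.

```python
from dataclasses import dataclass, fields

@dataclass
class AppSettingsStub:
    # Stand-in for models.AppSettings (the real one is a Pydantic model).
    use_48k_speech_tokenizer: bool = False

def apply_settings_patch(raw: dict, patch: dict) -> dict:
    """Merge a PATCH into the raw on-disk dict, validating the result
    without dropping keys the current model doesn't define."""
    merged = {**raw, **patch}
    known = {f.name for f in fields(AppSettingsStub)}
    # Validation only: construct the model from known fields, but persist
    # the full merged dict so forward-compatible keys are preserved.
    AppSettingsStub(**{k: v for k, v in merged.items() if k in known})
    return merged

raw = {"use_48k_speech_tokenizer": False, "future_key": 123}
merged = apply_settings_patch(raw, {"use_48k_speech_tokenizer": True})
print(merged)  # {'use_48k_speech_tokenizer': True, 'future_key': 123}
```

The handler would then call `config.save_app_settings(merged)`, and should abort the PATCH entirely if the initial read failed, so defaults never clobber the file.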
In `@backend/README.md`:
- Around line 233-257: Update the README settings section to clearly state that
the "use_48k_speech_tokenizer" flag only affects the PyTorch backend: mention
that the GET/PATCH /settings endpoints persist and return this field but it is
consumed only by backend/backends/pytorch_backend.py (e.g., in the code paths
that select the speech tokenizer) and has no effect on other backends; reference
the setting name "use_48k_speech_tokenizer" and the PyTorch-specific module to
make the scope explicit for MLX users.
In `@CHANGELOG.md`:
- Line 61: The "48kHz speech tokenizer opt-in setting" bullet is currently
placed above the Makefile sub-list, causing the indented Lines 62-65 to render
as children of the tokenizer entry; either move the entire "48kHz speech
tokenizer opt-in setting" bullet to appear below the Makefile sub-bullets or
adjust the indentation of the Makefile block so its sub-items are properly
nested under the Makefile header (ensure the "Makefile" item and its sub-bullets
remain a distinct list and the "48kHz speech tokenizer opt-in setting" bullet is
a separate top-level entry).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: beb68b8e-c436-46a1-82cd-0844efe8d932
📒 Files selected for processing (10)
- CHANGELOG.md
- app/src/components/ServerSettings/ModelManagement.tsx
- app/src/lib/api/client.ts
- app/src/lib/api/types.ts
- backend/README.md
- backend/backends/pytorch_backend.py
- backend/config.py
- backend/main.py
- backend/models.py
- data/settings.json
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/backends/pytorch_backend.py`:
- Around line 128-143: Read app settings once and use that single snapshot for
the entire load flow: in load_model_async() capture
AppSettings(**config.load_app_settings()) (or at least the computed
requested_48k) and pass that snapshot/value into _load_model_sync (update
_load_model_sync signature to accept the settings or tokenizer flag) so
_load_model_sync does not re-read config; also apply the same change to the
other load call region (lines ~208-215) so both code paths use the same settings
snapshot when deciding and applying the 48k speech tokenizer.
- Around line 138-143: The reload path currently calls unload_model() and
_load_model_sync while other coroutines may be using self.model (e.g.,
create_voice_clone_prompt() and generate_voice_clone()), so add serialization:
introduce an instance-level asyncio.Lock or a read-write pattern (e.g., an
asyncio.Lock named model_swap_lock or an active_ops counter with an
asyncio.Event) and use it to prevent unloading during in-flight inference;
acquire the lock (or increment active_ops) at the start of
create_voice_clone_prompt() and generate_voice_clone() and release (or
decrement) when done, and in the reload branch (the code that calls
unload_model() and await asyncio.to_thread(self._load_model_sync, ...)) acquire
the swap lock or wait for active_ops==0 before calling unload_model() so model
teardown is deferred until no active operations are using self.model.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fee69089-f41a-4dff-96fe-2648f82b7b07
📒 Files selected for processing (7)
- CHANGELOG.md
- app/src/components/ServerSettings/ModelManagement.tsx
- app/src/lib/api/types.ts
- backend/README.md
- backend/backends/pytorch_backend.py
- backend/config.py
- backend/main.py
🚧 Files skipped from review as they are similar to previous changes (6)
- app/src/components/ServerSettings/ModelManagement.tsx
- backend/config.py
- CHANGELOG.md
- backend/main.py
- backend/README.md
- app/src/lib/api/types.ts
backend/backends/pytorch_backend.py (outdated)

```python
# Unload existing model if reload is needed
if self.model is not None:
    self.unload_model()

# Run blocking load in thread pool
await asyncio.to_thread(self._load_model_sync, model_size)
```
Guard reloads against in-flight inference.
This new reload path can call unload_model() while another request is still inside create_voice_clone_prompt() or generate_voice_clone() on the same self.model. Toggling the checkbox during concurrent requests can therefore tear down the live model underneath an active generation and produce nondeterministic failures. Please serialize model unload/reload against prompt creation and generation, or defer the swap until no active operations are using the model.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/backends/pytorch_backend.py` around lines 138 - 143, The reload path
currently calls unload_model() and _load_model_sync while other coroutines may
be using self.model (e.g., create_voice_clone_prompt() and
generate_voice_clone()), so add serialization: introduce an instance-level
asyncio.Lock or a read-write pattern (e.g., an asyncio.Lock named
model_swap_lock or an active_ops counter with an asyncio.Event) and use it to
prevent unloading during in-flight inference; acquire the lock (or increment
active_ops) at the start of create_voice_clone_prompt() and
generate_voice_clone() and release (or decrement) when done, and in the reload
branch (the code that calls unload_model() and await
asyncio.to_thread(self._load_model_sync, ...)) acquire the swap lock or wait for
active_ops==0 before calling unload_model() so model teardown is deferred until
no active operations are using self.model.
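The active-operations pattern the prompt describes can be sketched as follows. All names here (`ModelGuard`, `_swap_lock`, `_idle`) are illustrative, not from the PR, and a production version would also need to block new operations from starting between the drain and the swap.

```python
import asyncio

class ModelGuard:
    """Minimal active-operations guard: inference increments a counter via
    the async context manager; a reload waits until the counter drains."""

    def __init__(self):
        self._active = 0
        self._idle = asyncio.Event()
        self._idle.set()
        self._swap_lock = asyncio.Lock()

    async def __aenter__(self):
        self._active += 1
        self._idle.clear()
        return self

    async def __aexit__(self, *exc):
        self._active -= 1
        if self._active == 0:
            self._idle.set()

    async def swap(self, reload_fn):
        async with self._swap_lock:   # one swap at a time
            await self._idle.wait()   # defer until no in-flight inference
            reload_fn()

async def demo():
    guard = ModelGuard()
    order = []

    async def inference():
        async with guard:             # would wrap generate_voice_clone()
            order.append("infer-start")
            await asyncio.sleep(0.05)
            order.append("infer-end")

    async def reload():
        await asyncio.sleep(0.01)     # toggle arrives mid-inference
        await guard.swap(lambda: order.append("swap"))

    await asyncio.gather(inference(), reload())
    return order

print(asyncio.run(demo()))  # ['infer-start', 'infer-end', 'swap']
```

In the real backend, `create_voice_clone_prompt()` and `generate_voice_clone()` would enter the guard, and the reload branch in `load_model_async()` would call `swap()` around `unload_model()` plus `_load_model_sync`.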
The same issue exists in the original code and is unrelated to this PR.
`_generate_sync` directly references `self.model` (L376), which is the core of the problem.

Race condition scenario (pre-existing):

- Request A calls `generate()` → `load_model_async()` sees the model is loaded, passes through → starts `await asyncio.to_thread(_generate_sync)`
- The event loop switches to Request B (e.g., a model size change) → `load_model_async()` detects the size mismatch → `unload_model()` sets `self.model = None` / `del self.model` → begins loading the new model
- Request A's `_generate_sync` thread resumes and calls `self.model.generate_voice_clone(...)` → crashes because `self.model` is now `None`
Why this rarely occurs in practice:
- In typical usage, a model size change and an in-flight generation request almost never happen concurrently
- FastAPI route handlers may also serialize requests in practice
In short, this is a pre-existing latent design issue that predates this PR. The 48kHz tokenizer toggle makes it marginally more likely to occur (since toggling the setting while a generation is running is a new possible trigger), but addressing it is out of scope for this PR.
The PR description image shows a dark-themed "Advanced Settings" UI panel with a gold/tan checkmark checkbox for "Use 48kHz speech tokenizer (experimental)", accompanied by the description "Enables a higher quality 48kHz speech tokenizer. Changes take effect from your next generation."
@takuma104, you're right — the race condition is a pre-existing latent issue rooted in _generate_sync holding a direct reference to self.model without any lock, and this PR doesn't change that fundamental design. The 48kHz toggle is a marginal incremental exposure at most. I'll withdraw the critical flag for this PR.
That said, the issue is worth tracking so it doesn't get forgotten. Would you like me to open a follow-up GitHub issue to capture the model_swap_lock / active-ops guard work for a future PR?
✏️ Learnings added
Learnt from: takuma104
Repo: jamiepine/voicebox PR: 253
File: backend/backends/pytorch_backend.py:0-0
Timestamp: 2026-03-12T18:05:24.715Z
Learning: In `backend/backends/pytorch_backend.py` (PyTorchTTSBackend), the model reload/unload race condition (unloading self.model while _generate_sync or _create_prompt_sync threads hold a direct reference to it) is a pre-existing latent design issue that predates the 48kHz tokenizer PR. It is not introduced by the 48kHz toggle changes. Fixing it requires an asyncio.Lock or active-ops counter to serialize model swap against in-flight inference, and is tracked as a future follow-up.
Actionable comments posted: 1
♻️ Duplicate comments (1)
backend/backends/pytorch_backend.py (1)
138-143: ⚠️ Potential issue | 🔴 Critical: Serialize tokenizer-triggered reloads with active inference.

`load_model_async()` still calls `unload_model()` immediately here. `create_voice_prompt()` and `generate()` both use `self.model` from worker threads afterward, so a concurrent toggle can tear down the live model under an in-flight `create_voice_clone_prompt()` / `generate_voice_clone()` call. Please gate unload/reload behind a swap lock or an active-operations counter so the model is only replaced once no request is using it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/pytorch_backend.py` around lines 138 - 143, The reload path in load_model_async() unconditionally calls unload_model() and races with in-flight inference calls (create_voice_prompt(), generate(), create_voice_clone_prompt(), generate_voice_clone()) that read self.model; protect model swaps by introducing a synchronization primitive (e.g., an asyncio.Lock named something like self._model_swap_lock or an active-operations counter guarded by a lock) so unload_model()/_load_model_sync are only executed after waiting for active operations to drain or by acquiring the swap lock; update generate() and other inference entry points to increment/decrement the active counter or acquire/release the swap lock around usage of self.model so reloads are serialized and no thread tears down the model while in use.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/backends/pytorch_backend.py`:
- Around line 208-214: The prompt cache is keyed only by (audio_path,
reference_text) in create_voice_prompt(), which leads to reuse of prompts across
speech_tokenizer variants; update the cache key to include the tokenizer mode
(e.g., self._use_48k_speech_tokenizer or a tokenizer identifier such as
model.model.speech_tokenizer.__class__.__name__ / tokenizer ID) or
clear/invalidate relevant entries when self._use_48k_speech_tokenizer is changed
(where you set it in the model load block), so create_voice_prompt() returns
prompts generated with the currently active tokenizer.
---
Duplicate comments:
In `@backend/backends/pytorch_backend.py`:
- Around line 138-143: The reload path in load_model_async() unconditionally
calls unload_model() and races with in-flight inference calls
(create_voice_prompt(), generate(), create_voice_clone_prompt(),
generate_voice_clone()) that read self.model; protect model swaps by introducing
a synchronization primitive (e.g., an asyncio.Lock named something like
self._model_swap_lock or an active-operations counter guarded by a lock) so
unload_model()/_load_model_sync are only executed after waiting for active
operations to drain or by acquiring the swap lock; update generate() and other
inference entry points to increment/decrement the active counter or
acquire/release the swap lock around usage of self.model so reloads are
serialized and no thread tears down the model while in use.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2685269a-a06c-4f8e-9b5f-0f5b0f62273d
📒 Files selected for processing (1)
backend/backends/pytorch_backend.py
```python
# optionally replace speech tokenizer with 48kHz version for better audio quality
if use_48k_speech_tokenizer:
    from qwen_tts import Qwen3TTSTokenizer

    self.model.model.speech_tokenizer = Qwen3TTSTokenizer.from_pretrained(
        "takuma104/Qwen3-TTS-Tokenizer-12Hz-48kHz"
    )
self._use_48k_speech_tokenizer = use_48k_speech_tokenizer
```
Namespace cached voice prompts by tokenizer mode.
This adds a second speech-tokenizer variant, but create_voice_prompt() still caches prompts by only (audio_path, reference_text) at Line 280. After switching this flag, the model reloads and can then immediately reuse a prompt generated under the old tokenizer, so the “next generation” does not actually reflect the new mode and the cached tensors may not match the new tokenizer. Include use_48k_speech_tokenizer (or a tokenizer ID) in the cache key, or invalidate prompt cache entries when the mode changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/backends/pytorch_backend.py` around lines 208 - 214, The prompt cache
is keyed only by (audio_path, reference_text) in create_voice_prompt(), which
leads to reuse of prompts across speech_tokenizer variants; update the cache key
to include the tokenizer mode (e.g., self._use_48k_speech_tokenizer or a
tokenizer identifier such as model.model.speech_tokenizer.__class__.__name__ /
tokenizer ID) or clear/invalidate relevant entries when
self._use_48k_speech_tokenizer is changed (where you set it in the model load
block), so create_voice_prompt() returns prompts generated with the currently
active tokenizer.
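A small sketch of the tokenizer-aware cache key. `_prompt_cache`, `_use_48k_speech_tokenizer`, and the string "prompt" values are illustrative stand-ins for the backend's real cached prompt tensors:

```python
class PromptCacheSketch:
    """Hypothetical prompt cache keyed by (audio, text, tokenizer mode),
    so prompts built under one tokenizer are not reused under the other."""

    def __init__(self):
        self._prompt_cache = {}
        self._use_48k_speech_tokenizer = False
        self.builds = 0  # how many prompts were actually (re)built

    def create_voice_prompt(self, audio_path: str, reference_text: str):
        # Including the mode in the key makes a toggle an automatic miss.
        key = (audio_path, reference_text, self._use_48k_speech_tokenizer)
        if key not in self._prompt_cache:
            self.builds += 1
            self._prompt_cache[key] = f"prompt:{self.builds}"
        return self._prompt_cache[key]

c = PromptCacheSketch()
c.create_voice_prompt("a.wav", "hello")
c.create_voice_prompt("a.wav", "hello")   # cache hit
c._use_48k_speech_tokenizer = True        # toggle flips the key
c.create_voice_prompt("a.wav", "hello")   # rebuilt under the new tokenizer
print(c.builds)  # 2
```

The alternative mentioned in the comment, clearing the cache when the flag changes, trades memory for losing prompts that would still be valid after toggling back.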
add hindi language

I created a fine-tuned version of the Qwen3-TTS Speech Tokenizer. This model modifies the decoder block configuration to output 48kHz sampling instead of the original 24kHz, which should improve audio quality. For details about the model, please see the model card.
The integration into the backend was only a few lines of code, but I felt users should be able to opt in rather than having it applied automatically, so I also built a UI for it. A checkbox has been added to the Models tab as shown in the attached screenshot. I also added a /settings API endpoint to persist this setting.

I imagine you may have your own preferences around UI/UX direction, so please feel free to let me know if you'd like any changes.
Summary by CodeRabbit

- New Features
- Documentation
- UI