
Enhance speech tokenizer with 48kHz version#253

Open
takuma104 wants to merge 5 commits into jamiepine:main from takuma104:speech_tokenizer_48k_ui

Conversation

@takuma104

@takuma104 takuma104 commented Mar 12, 2026

I created a fine-tuned version of the Qwen3-TTS Speech Tokenizer. This model modifies the decoder block configuration to output audio at 48kHz instead of the original 24kHz, which should improve audio quality. For details about the model, please see the model card.

The integration into the backend was only a few lines of code, but I felt users should be able to opt in rather than having it applied automatically, so I also built a UI for it. A checkbox has been added to the Models tab as shown in the attached screenshot. I also added a /settings API endpoint to persist this setting.
Screenshot 2026-03-13 0 20 45

I imagine you may have your own preferences around UI/UX direction, so please feel free to let me know if you'd like any changes.

Summary by CodeRabbit

  • New Features

    • Added an Advanced Settings toggle (Models tab) to opt in to a higher-quality 48kHz speech tokenizer for PyTorch backends (disabled by default; takes effect on the next generation). Settings persist across restarts via the server settings endpoints.
  • Documentation

    • Updated backend docs with details on the new settings, the GET/PATCH settings endpoints, and persistence behavior.
  • UI

    • Added immediate UI feedback when toggling the setting (success/error notifications).

@coderabbitai

coderabbitai bot commented Mar 12, 2026

📝 Walkthrough


Adds a persisted application setting and UI toggle to enable a 48kHz speech tokenizer for the PyTorch TTS backend. Introduces GET/PATCH /settings endpoints, JSON-backed persistence, backend logic to apply the 48k tokenizer on model load, and frontend API + Advanced Settings UI to read/update the flag.

Changes

  • Changelog & Docs (CHANGELOG.md, backend/README.md): Documented the new setting and added GET/PATCH /settings API examples and persistence notes.
  • Backend API & Models (backend/main.py, backend/models.py): Added GET /settings and PATCH /settings endpoints; new Pydantic models AppSettings and AppSettingsUpdate.
  • Backend Persistence (backend/config.py, data/settings.json): Added get_settings_path, load_app_settings, and save_app_settings to persist settings to data/settings.json; added a default settings file.
  • PyTorch TTS Backend (backend/backends/pytorch_backend.py): Reads app settings and optionally replaces model.speech_tokenizer with the 48kHz tokenizer; tracks _use_48k_speech_tokenizer and handles reload/unload when the flag or model size changes.
  • Frontend API Types & Client (app/src/lib/api/types.ts, app/src/lib/api/client.ts): Added the AppSettings type and ApiClient.getSettings / ApiClient.updateSettings methods for the /settings endpoints.
  • Frontend UI (app/src/components/ServerSettings/ModelManagement.tsx): Added an Advanced Settings section gated by the PyTorch backend: a checkbox for use_48k_speech_tokenizer, mutation handling, loading state, and toasts on success/error.

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Frontend as Frontend (ModelManagement)
    participant ApiClient as API Client
    participant Backend as Backend (FastAPI)
    participant Storage as Settings Storage

    User->>Frontend: Toggle 48kHz checkbox
    Frontend->>ApiClient: updateSettings({use_48k_speech_tokenizer: true})
    ApiClient->>Backend: PATCH /settings
    Backend->>Storage: save_app_settings(data)
    Storage-->>Backend: persisted
    Backend-->>ApiClient: AppSettings (updated)
    ApiClient-->>Frontend: success + cache update
    Frontend->>User: show toast (enabled)

    Note over Backend,Storage: On next model generation
    Backend->>Storage: load_app_settings()
    Storage-->>Backend: {use_48k_speech_tokenizer: true}
    Backend->>Backend: load/reload TTS model
    Backend->>Backend: replace speech_tokenizer with 48kHz variant
    Backend->>Backend: generate speech
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped to the settings, a bright little thrill,

Forty‑eight kHz now waits at my will,
Toggle in Advanced, save with a tap,
Tokens crisp as a carrot, no noise on the map,
Hooray — clearer hops in every vocal trill! 🎶

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The pull request title clearly summarizes the main change: adding a 48kHz speech tokenizer enhancement, which is the primary focus across backend settings APIs, frontend UI integration, and model loading logic.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 84.62%, which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@app/src/components/ServerSettings/ModelManagement.tsx`:
- Around line 238-262: The 48kHz tokenizer toggle (Checkbox id "use48kTokenizer"
and its bound value appSettings?.use_48k_speech_tokenizer with
settingsMutation.mutate) is rendered for all users but only works on the PyTorch
backend; update the component to check backend capability (e.g.,
appSettings.backend === 'pytorch' or a server capability flag like
supports48kTokenizer) before rendering the Checkbox and its label, and if
unsupported either hide the control or render it disabled with an explanatory
message so changes aren't persisted as a no-op; ensure the gating logic is
applied where the Checkbox and the surrounding markup (the div containing the
label and description) are currently returned.
- Around line 33-44: The settingsMutation currently only defines onSuccess and
silently fails on network/patch errors; add an onError handler to the
useMutation call (the same object where mutationFn and onSuccess live) that
calls toast with a clear error title and description (including the error
message if available) and optionally reverts any optimistic UI changes;
reference the existing settingsMutation/useMutation, mutationFn ->
apiClient.updateSettings, onSuccess -> queryClient.setQueryData, and the toast
utility when implementing the onError handler.

In `@backend/backends/pytorch_backend.py`:
- Around line 201-207: The toggle is only applied in _load_model_sync but
load_model_async may return early if the same model size is loaded, so changes
to settings.use_48k_speech_tokenizer won't take effect; modify load_model_async
to read AppSettings(**config.load_app_settings()) and compare the desired
tokenizer mode to the current state (e.g., track a flag like
self._using_48k_tokenizer or inspect self.model.model.speech_tokenizer), and if
they differ call the same tokenizer-replacement logic (the block that sets
self.model.model.speech_tokenizer = Qwen3TTSTokenizer.from_pretrained(...)) or
trigger a reload path so the tokenizer is swapped without requiring a full
unload/reload; ensure the check references load_model_async, _load_model_sync,
settings.use_48k_speech_tokenizer and self.model.model.speech_tokenizer.
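The size-or-tokenizer reload check described above can be reduced to a small pure predicate. This is a hypothetical sketch of that decision, not the project's actual load_model_async logic:

```python
# Hypothetical helper: a reload is needed when either the requested model size
# or the tokenizer mode differs from the currently loaded state. Names here
# are illustrative and do not come from the actual backend source.
def needs_reload(loaded_size, requested_size: str,
                 loaded_48k: bool, requested_48k: bool) -> bool:
    """Return True when load_model_async should tear down and reload."""
    if loaded_size is None:  # nothing loaded yet
        return True
    return loaded_size != requested_size or loaded_48k != requested_48k
```

Tracking a flag like self._using_48k_tokenizer makes the second comparison cheap without inspecting the loaded tokenizer object.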

In `@backend/config.py`:
- Around line 75-88: The load_app_settings function should not swallow
parse/read errors into an empty dict: catch json.JSONDecodeError and OSError
separately when reading (in load_app_settings), rename or move the corrupt file
(e.g., to path.with_suffix('.corrupt.<timestamp>')) and surface/raise the error
(or at least log it) instead of returning {} so callers can decide; only return
{} when the file truly doesn't exist. In save_app_settings ensure the parent
directory exists before writing by calling path.parent.mkdir(parents=True,
exist_ok=True) and propagate IO errors instead of letting them fail silently;
reference get_settings_path, load_app_settings, and save_app_settings when
making these changes.
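A minimal sketch of the hardened persistence helpers the comment asks for, assuming the data/settings.json layout the review describes; the internals are illustrative, not the repository's config.py:

```python
# Sketch of hardened settings persistence: only a truly missing file yields
# defaults; a corrupt file is moved aside and the error is surfaced.
import json
import time
from pathlib import Path

DEFAULT_SETTINGS_PATH = Path("data/settings.json")  # assumed location

def load_app_settings(path: Path = DEFAULT_SETTINGS_PATH) -> dict:
    if not path.exists():
        return {}
    try:
        return json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        # Preserve the corrupt file for inspection instead of silently
        # resetting to {} so callers can decide how to react.
        path.rename(path.with_suffix(f".corrupt.{int(time.time())}"))
        raise RuntimeError(f"settings file was corrupt, moved aside: {exc}") from exc

def save_app_settings(data: dict, path: Path = DEFAULT_SETTINGS_PATH) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)  # ensure data/ exists
    path.write_text(json.dumps(data, indent=2))
```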

In `@backend/main.py`:
- Around line 1773-1781: The handler update_settings currently constructs a
models.AppSettings from config.load_app_settings() which strips unknown keys and
causes silent resets; instead, load the raw settings dict (call it raw =
config.load_app_settings()), do not convert it to models.AppSettings first,
merge the incoming patch (patch = update.model_dump(exclude_none=True)) into raw
(raw.update(patch)), then validate the merged result by constructing
models.AppSettings(**merged) to ensure correctness, and finally call
config.save_app_settings(merged) to persist — also add a guard so that if
config.load_app_settings() returns an empty dict due to a read/parse failure you
do not overwrite the file (return an error/abort the PATCH) rather than saving
defaults.
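The merge-then-validate flow above, sketched with plain dicts; the real endpoint uses Pydantic's AppSettings / AppSettingsUpdate, so the validation here is deliberately simplified and the field list is an assumption:

```python
# Illustrative merge-then-validate for PATCH /settings: merge the patch into
# the raw on-disk dict so unknown keys survive the round trip, then validate
# known fields on the merged result before persisting.
KNOWN_FIELDS = {"use_48k_speech_tokenizer": bool}

def apply_settings_patch(raw: dict, patch: dict) -> dict:
    # Only explicitly provided values are applied (mirrors exclude_none=True).
    merged = {**raw, **{k: v for k, v in patch.items() if v is not None}}
    for name, typ in KNOWN_FIELDS.items():
        if name in merged and not isinstance(merged[name], typ):
            raise TypeError(f"{name} must be {typ.__name__}")
    return merged
```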

In `@backend/README.md`:
- Around line 233-257: Update the README settings section to clearly state that
the "use_48k_speech_tokenizer" flag only affects the PyTorch backend: mention
that the GET/PATCH /settings endpoints persist and return this field but it is
consumed only by backend/backends/pytorch_backend.py (e.g., in the code paths
that select the speech tokenizer) and has no effect on other backends; reference
the setting name "use_48k_speech_tokenizer" and the PyTorch-specific module to
make the scope explicit for MLX users.

In `@CHANGELOG.md`:
- Line 61: The "48kHz speech tokenizer opt-in setting" bullet is currently
placed above the Makefile sub-list, causing the indented Lines 62-65 to render
as children of the tokenizer entry; either move the entire "48kHz speech
tokenizer opt-in setting" bullet to appear below the Makefile sub-bullets or
adjust the indentation of the Makefile block so its sub-items are properly
nested under the Makefile header (ensure the "Makefile" item and its sub-bullets
remain a distinct list and the "48kHz speech tokenizer opt-in setting" bullet is
a separate top-level entry).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: beb68b8e-c436-46a1-82cd-0844efe8d932

📥 Commits

Reviewing files that changed from the base of the PR and between 38bf96f and 5e917d9.

📒 Files selected for processing (10)
  • CHANGELOG.md
  • app/src/components/ServerSettings/ModelManagement.tsx
  • app/src/lib/api/client.ts
  • app/src/lib/api/types.ts
  • backend/README.md
  • backend/backends/pytorch_backend.py
  • backend/config.py
  • backend/main.py
  • backend/models.py
  • data/settings.json


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/backends/pytorch_backend.py`:
- Around line 128-143: Read app settings once and use that single snapshot for
the entire load flow: in load_model_async() capture
AppSettings(**config.load_app_settings()) (or at least the computed
requested_48k) and pass that snapshot/value into _load_model_sync (update
_load_model_sync signature to accept the settings or tokenizer flag) so
_load_model_sync does not re-read config; also apply the same change to the
other load call region (lines ~208-215) so both code paths use the same settings
snapshot when deciding and applying the 48k speech tokenizer.
- Around line 138-143: The reload path currently calls unload_model() and
_load_model_sync while other coroutines may be using self.model (e.g.,
create_voice_clone_prompt() and generate_voice_clone()), so add serialization:
introduce an instance-level asyncio.Lock or a read-write pattern (e.g., an
asyncio.Lock named model_swap_lock or an active_ops counter with an
asyncio.Event) and use it to prevent unloading during in-flight inference;
acquire the lock (or increment active_ops) at the start of
create_voice_clone_prompt() and generate_voice_clone() and release (or
decrement) when done, and in the reload branch (the code that calls
unload_model() and await asyncio.to_thread(self._load_model_sync, ...)) acquire
the swap lock or wait for active_ops==0 before calling unload_model() so model
teardown is deferred until no active operations are using self.model.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fee69089-f41a-4dff-96fe-2648f82b7b07

📥 Commits

Reviewing files that changed from the base of the PR and between 5e917d9 and 721c126.

📒 Files selected for processing (7)
  • CHANGELOG.md
  • app/src/components/ServerSettings/ModelManagement.tsx
  • app/src/lib/api/types.ts
  • backend/README.md
  • backend/backends/pytorch_backend.py
  • backend/config.py
  • backend/main.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • app/src/components/ServerSettings/ModelManagement.tsx
  • backend/config.py
  • CHANGELOG.md
  • backend/main.py
  • backend/README.md
  • app/src/lib/api/types.ts

Comment on lines 138 to 143
```python
# Unload existing model if reload is needed
if self.model is not None:
    self.unload_model()

# Run blocking load in thread pool
await asyncio.to_thread(self._load_model_sync, model_size)
```

@coderabbitai coderabbitai bot Mar 12, 2026


⚠️ Potential issue | 🔴 Critical

Guard reloads against in-flight inference.

This new reload path can call unload_model() while another request is still inside create_voice_clone_prompt() or generate_voice_clone() on the same self.model. Toggling the checkbox during concurrent requests can therefore tear down the live model underneath an active generation and produce nondeterministic failures. Please serialize model unload/reload against prompt creation and generation, or defer the swap until no active operations are using the model.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/backends/pytorch_backend.py` around lines 138 - 143, The reload path
currently calls unload_model() and _load_model_sync while other coroutines may
be using self.model (e.g., create_voice_clone_prompt() and
generate_voice_clone()), so add serialization: introduce an instance-level
asyncio.Lock or a read-write pattern (e.g., an asyncio.Lock named
model_swap_lock or an active_ops counter with an asyncio.Event) and use it to
prevent unloading during in-flight inference; acquire the lock (or increment
active_ops) at the start of create_voice_clone_prompt() and
generate_voice_clone() and release (or decrement) when done, and in the reload
branch (the code that calls unload_model() and await
asyncio.to_thread(self._load_model_sync, ...)) acquire the swap lock or wait for
active_ops==0 before calling unload_model() so model teardown is deferred until
no active operations are using self.model.
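A minimal, runnable sketch of the suggested guard, assuming inference and reload are coordinated on one event loop; the class and method names are stand-ins for the backend's real methods, not actual project code:

```python
# Sketch of an active-operations guard: generations register themselves, and
# a reload waits until all registered work has drained before swapping models.
import asyncio

class Backend:
    def __init__(self) -> None:
        self.model = "model-24k"
        self._active_ops = 0
        self._idle = asyncio.Event()
        self._idle.set()
        self._swap_lock = asyncio.Lock()

    def _begin_op(self) -> None:
        self._active_ops += 1
        self._idle.clear()

    def _end_op(self) -> None:
        self._active_ops -= 1
        if self._active_ops == 0:
            self._idle.set()

    async def generate(self):
        async with self._swap_lock:  # don't begin work while a swap is underway
            self._begin_op()
        try:
            await asyncio.sleep(0.05)  # stand-in for asyncio.to_thread(_generate_sync)
            return self.model
        finally:
            self._end_op()

    async def reload(self, new_model: str) -> None:
        async with self._swap_lock:  # block new operations from starting
            await self._idle.wait()  # drain in-flight operations first
            self.model = None        # unload_model()
            await asyncio.sleep(0)   # stand-in for _load_model_sync
            self.model = new_model
```

Holding the swap lock only while registering an operation keeps generations concurrent with one another, while reload() drains in-flight work before touching self.model.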



The same issue exists in the original code and is unrelated to this PR.

_generate_sync directly references self.model (L376), which is the core of the problem.

Race condition scenario (pre-existing):

  1. Request A calls generate() → load_model_async() sees the model is loaded, passes through → starts await asyncio.to_thread(_generate_sync)
  2. The event loop switches to Request B (e.g., model size change) → load_model_async() detects size mismatch → unload_model() sets self.model = None / del self.model → begins loading the new model
  3. Request A's _generate_sync thread resumes and calls self.model.generate_voice_clone(...) → crashes because self.model is now None

Why this rarely occurs in practice:

  • In typical usage, a model size change and an in-flight generation request almost never happen concurrently
  • FastAPI route handlers may also serialize requests in practice

In short, this is a pre-existing latent design issue that predates this PR. The 48kHz tokenizer toggle makes it marginally more likely to occur (since toggling the setting while a generation is running is a new possible trigger), but addressing it is out of scope for this PR.



The PR description image shows a dark-themed "Advanced Settings" UI panel with a gold/tan checkmark checkbox for "Use 48kHz speech tokenizer (experimental)", accompanied by the description "Enables a higher quality 48kHz speech tokenizer. Changes take effect from your next generation."


@takuma104, you're right — the race condition is a pre-existing latent issue rooted in _generate_sync holding a direct reference to self.model without any lock, and this PR doesn't change that fundamental design. The 48kHz toggle is a marginal incremental exposure at most. I'll withdraw the critical flag for this PR.

That said, the issue is worth tracking so it doesn't get forgotten. Would you like me to open a follow-up GitHub issue to capture the model_swap_lock / active-ops guard work for a future PR?


✏️ Learnings added
Learnt from: takuma104
Repo: jamiepine/voicebox PR: 253
File: backend/backends/pytorch_backend.py:0-0
Timestamp: 2026-03-12T18:05:24.715Z
Learning: In `backend/backends/pytorch_backend.py` (PyTorchTTSBackend), the model reload/unload race condition (unloading self.model while _generate_sync or _create_prompt_sync threads hold a direct reference to it) is a pre-existing latent design issue that predates the 48kHz tokenizer PR. It is not introduced by the 48kHz toggle changes. Fixing it requires an asyncio.Lock or active-ops counter to serialize model swap against in-flight inference, and is tracked as a future follow-up.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/backends/pytorch_backend.py (1)

138-143: ⚠️ Potential issue | 🔴 Critical

Serialize tokenizer-triggered reloads with active inference.

load_model_async() still calls unload_model() immediately here. create_voice_prompt() and generate() both use self.model from worker threads afterward, so a concurrent toggle can tear down the live model under an in-flight create_voice_clone_prompt() / generate_voice_clone() call. Please gate unload/reload behind a swap lock or an active-operations counter so the model is only replaced once no request is using it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/backends/pytorch_backend.py` around lines 138 - 143, The reload path
in load_model_async() unconditionally calls unload_model() and races with
in-flight inference calls (create_voice_prompt(), generate(),
create_voice_clone_prompt(), generate_voice_clone()) that read self.model;
protect model swaps by introducing a synchronization primitive (e.g., an
asyncio.Lock named something like self._model_swap_lock or an active-operations
counter guarded by a lock) so unload_model()/_load_model_sync are only executed
after waiting for active operations to drain or by acquiring the swap lock;
update generate() and other inference entry points to increment/decrement the
active counter or acquire/release the swap lock around usage of self.model so
reloads are serialized and no thread tears down the model while in use.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/backends/pytorch_backend.py`:
- Around line 208-214: The prompt cache is keyed only by (audio_path,
reference_text) in create_voice_prompt(), which leads to reuse of prompts across
speech_tokenizer variants; update the cache key to include the tokenizer mode
(e.g., self._use_48k_speech_tokenizer or a tokenizer identifier such as
model.model.speech_tokenizer.__class__.__name__ / tokenizer ID) or
clear/invalidate relevant entries when self._use_48k_speech_tokenizer is changed
(where you set it in the model load block), so create_voice_prompt() returns
prompts generated with the currently active tokenizer.

---

Duplicate comments:
In `@backend/backends/pytorch_backend.py`:
- Around line 138-143: The reload path in load_model_async() unconditionally
calls unload_model() and races with in-flight inference calls
(create_voice_prompt(), generate(), create_voice_clone_prompt(),
generate_voice_clone()) that read self.model; protect model swaps by introducing
a synchronization primitive (e.g., an asyncio.Lock named something like
self._model_swap_lock or an active-operations counter guarded by a lock) so
unload_model()/_load_model_sync are only executed after waiting for active
operations to drain or by acquiring the swap lock; update generate() and other
inference entry points to increment/decrement the active counter or
acquire/release the swap lock around usage of self.model so reloads are
serialized and no thread tears down the model while in use.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2685269a-a06c-4f8e-9b5f-0f5b0f62273d

📥 Commits

Reviewing files that changed from the base of the PR and between 721c126 and b515263.

📒 Files selected for processing (1)
  • backend/backends/pytorch_backend.py

Comment on lines +208 to +214
```python
# optionally replace speech tokenizer with 48kHz version for better audio quality
if use_48k_speech_tokenizer:
    from qwen_tts import Qwen3TTSTokenizer
    self.model.model.speech_tokenizer = Qwen3TTSTokenizer.from_pretrained(
        "takuma104/Qwen3-TTS-Tokenizer-12Hz-48kHz"
    )
self._use_48k_speech_tokenizer = use_48k_speech_tokenizer
```


⚠️ Potential issue | 🟠 Major

Namespace cached voice prompts by tokenizer mode.

This adds a second speech-tokenizer variant, but create_voice_prompt() still caches prompts by only (audio_path, reference_text) at Line 280. After switching this flag, the model reloads and can then immediately reuse a prompt generated under the old tokenizer, so the “next generation” does not actually reflect the new mode and the cached tensors may not match the new tokenizer. Include use_48k_speech_tokenizer (or a tokenizer ID) in the cache key, or invalidate prompt cache entries when the mode changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/backends/pytorch_backend.py` around lines 208 - 214, The prompt cache
is keyed only by (audio_path, reference_text) in create_voice_prompt(), which
leads to reuse of prompts across speech_tokenizer variants; update the cache key
to include the tokenizer mode (e.g., self._use_48k_speech_tokenizer or a
tokenizer identifier such as model.model.speech_tokenizer.__class__.__name__ /
tokenizer ID) or clear/invalidate relevant entries when
self._use_48k_speech_tokenizer is changed (where you set it in the model load
block), so create_voice_prompt() returns prompts generated with the currently
active tokenizer.
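A toy sketch of namespacing the cache key by tokenizer mode, as suggested above; the cache structure and helper names are hypothetical, not the project's code:

```python
# Toy prompt cache keyed by (audio_path, reference_text, tokenizer mode), so a
# prompt built under the 24kHz tokenizer is never reused after toggling to 48kHz.
class PromptCache:
    def __init__(self) -> None:
        self._cache: dict = {}
        self.builds = 0  # counts how often a prompt is actually (re)built

    def get_or_create(self, audio_path: str, reference_text: str,
                      use_48k: bool, build):
        key = (audio_path, reference_text, use_48k)
        if key not in self._cache:
            self.builds += 1
            self._cache[key] = build()
        return self._cache[key]
```

Alternatively, the cache could be cleared wherever self._use_48k_speech_tokenizer changes; the keyed approach avoids needless invalidation of the other mode's prompts.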

@Vishaldadlani321

add hindi language

@Vishaldadlani321

add hindi language

