feat: paralinguistic tag autocomplete for Chatterbox Turbo #265
Conversation
The previous approach of patching `librosa.load` didn't work because `melspectrogram` itself performs float64 math (numpy dot, `signal.lfilter`) regardless of input dtype. The actual mismatch happens when `pack()` creates a float64 tensor from the mel arrays and passes it into the float32 LSTM weights in `VoiceEncoder.forward()`. Fix by monkey-patching `VoiceEncoder.forward()` to call `mels.float()` before the LSTM, ensuring the input always matches the model dtype.
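The instance-level monkey-patch pattern described above can be sketched without torch installed. `DummyEncoder` and the dict-based `"dtype"` field are stand-ins for the real `VoiceEncoder` and tensor; the rebinding mechanics (`__func__` plus `types.MethodType`) mirror the actual fix:

```python
import types

# Stand-in for VoiceEncoder: its "LSTM" only accepts float32 input,
# mimicking the dtype mismatch the real patch works around.
class DummyEncoder:
    def forward(self, mels):
        assert mels["dtype"] == "float32", "dtype mismatch"
        return "embedding"

encoder = DummyEncoder()
_orig_forward = encoder.forward.__func__  # the unbound original function

def _f32_forward(self, mels):
    # cast before delegating, mirroring mels.float() in the real patch
    mels = {**mels, "dtype": "float32"}
    return _orig_forward(self, mels)

# Rebind as an instance attribute so only this instance is patched,
# leaving the class untouched.
encoder.forward = types.MethodType(_f32_forward, encoder)

result = encoder.forward({"dtype": "float64", "data": [0.1, 0.2]})
```

Patching the bound method on the instance (rather than the class) keeps the change scoped to the loaded model object and avoids affecting other code that imports the upstream class.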
- POST /models/{model_name}/unload — unloads a specific model from
memory without deleting from disk, supports all engine types
- Frontend: Unload button in model detail dialog when model is loaded
- Delete button remains disabled while loaded (unload first)
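The unload endpoint's dispatch logic can be sketched as plain Python (this is a hedged illustration, not the actual `backend/main.py` code; `FakeBackend` and `backends` are hypothetical stand-ins for the engine backends):

```python
# Minimal sketch: resolve the model name to a backend, check loaded state,
# then unload or report "not loaded" without touching files on disk.
class FakeBackend:
    def __init__(self, loaded):
        self._loaded = loaded

    def is_loaded(self):
        return self._loaded

    def unload_model(self):
        # frees in-memory weights; the on-disk checkpoint is untouched
        self._loaded = False

def unload_model_endpoint(model_name, backends):
    backend = backends.get(model_name)
    if backend is None:
        raise KeyError(f"Unknown model: {model_name}")
    if not backend.is_loaded():
        return {"message": f"Model {model_name} is not loaded"}
    backend.unload_model()
    return {"message": f"Model {model_name} unloaded"}

backends = {"chatterbox_turbo": FakeBackend(loaded=True)}
first = unload_model_endpoint("chatterbox_turbo", backends)   # unloads
second = unload_model_endpoint("chatterbox_turbo", backends)  # already unloaded
```

A second call on an already-unloaded model returns the "not loaded" message rather than raising, matching the endpoint behavior shown in the sequence diagram below.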
The actual dtype mismatch was in `S3Tokenizer.log_mel_spectrogram`, not `VoiceEncoder.forward`. `librosa.load` returns float64 numpy, which `torch.from_numpy` preserves as double. The STFT output (double) then hits `_mel_filters` (float32) in a matmul at s3tokenizer.py:163. Now patching both entry points after model load:

1. `S3Tokenizer.log_mel_spectrogram` — cast audio to float32 before STFT
2. `VoiceEncoder.forward` — cast mels to float32 before LSTM

Remove debug traceback logging (no longer needed).
Type `/` in the text input when using Chatterbox Turbo to open an autocomplete dropdown with 9 supported paralinguistic tags ([laugh], [chuckle], [gasp], [cough], [sigh], [groan], [sniff], [shush], [clear throat]).

- contentEditable div replaces textarea for Turbo engine only
- Tags render as inline styled badges
- Pasting text with [tag] patterns auto-converts to badges
- Badges serialize back to plain [tag] text for the API
- Dropdown portalled to body, opens above caret to avoid overflow
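The badge round-trip (paste-parsing into segments, then serializing back to plain `[tag]` text for the API) can be sketched as follows. The real component is TypeScript; this Python sketch only illustrates the parsing logic, and `parse_segments`/`serialize` are hypothetical names:

```python
import re

# The 9 tags listed above; unknown [brackets] stay as plain text.
SUPPORTED_TAGS = {
    "laugh", "chuckle", "gasp", "cough", "sigh",
    "groan", "sniff", "shush", "clear throat",
}
TAG_RE = re.compile(r"\[([^\[\]]+)\]")

def parse_segments(text):
    """Split pasted text into ('text', ...) and ('badge', ...) segments."""
    segments, pos = [], 0
    for m in TAG_RE.finditer(text):
        name = m.group(1)
        if name not in SUPPORTED_TAGS:
            continue  # unsupported tags are left inside the plain text
        if m.start() > pos:
            segments.append(("text", text[pos:m.start()]))
        segments.append(("badge", name))
        pos = m.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments

def serialize(segments):
    """Join segments back into the plain [tag] form sent to the API."""
    return "".join(s if kind == "text" else f"[{s}]" for kind, s in segments)

segs = parse_segments("Hello [laugh] world [unknown]")
```

Because serialization is the exact inverse of parsing for supported tags, pasted text round-trips losslessly through the badge representation.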
⚠️ Review failed: pull request was closed or merged during review.

📝 Walkthrough

Adds a ParalinguisticInput rich-text editor used when engine === "chatterbox_turbo", a model unload API + client/UI integration, and runtime dtype-casting monkey-patches in the chatterbox and chatterbox_turbo backends to coerce audio tensors to float32.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Frontend as Frontend<br/>(ModelManagement UI)
    participant ApiClient
    participant Backend as Backend<br/>(main.py)
    participant Model as ModelManager
    User->>Frontend: Click "Unload" for model
    Frontend->>ApiClient: call unloadModel(modelName)
    ApiClient->>Backend: POST /models/{model_name}/unload
    Backend->>Backend: resolve model_name -> (type, size)
    Backend->>Model: query is_loaded / backend-specific state
    Model-->>Backend: loaded status
    alt model is loaded
        Backend->>Model: perform backend-specific unload
        Model-->>Backend: unload success
        Backend-->>ApiClient: { message: "unloaded" }
    else not loaded
        Backend-->>ApiClient: { message: "not loaded" }
    end
    ApiClient-->>Frontend: response
    Frontend->>Frontend: show toast, invalidate queries, update UI
    Frontend-->>User: display result
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed (3 checks)
Actionable comments posted: 5
🧹 Nitpick comments (2)
backend/backends/chatterbox_turbo_backend.py (1)
181-213: LGTM — consistent with the chatterbox_backend.py implementation.

The dtype patching is identical to chatterbox_backend.py, ensuring consistent behavior across both Chatterbox variants.

💡 Optional: Consider extracting shared patching logic

Since both backends apply identical patches, you could extract this to a shared utility function in a common module (e.g., backend/backends/chatterbox_utils.py). This would reduce duplication and ensure both backends stay in sync if the upstream library changes.

```python
# backend/backends/chatterbox_utils.py

def apply_dtype_patches(model):
    """Patch float64 → float32 dtype mismatches in upstream chatterbox."""
    import types

    _tokzr = model.s3gen.tokenizer
    _orig_log_mel = _tokzr.log_mel_spectrogram.__func__

    def _f32_log_mel(self_tokzr, audio, padding=0):
        import torch as _torch
        if _torch.is_tensor(audio):
            audio = audio.float()
        return _orig_log_mel(self_tokzr, audio, padding)

    _tokzr.log_mel_spectrogram = types.MethodType(_f32_log_mel, _tokzr)

    _ve = model.ve
    _orig_ve_forward = _ve.forward.__func__

    def _f32_ve_forward(self_ve, mels):
        return _orig_ve_forward(self_ve, mels.float())

    _ve.forward = types.MethodType(_f32_ve_forward, _ve)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/backends/chatterbox_turbo_backend.py` around lines 181 - 213: Extract the duplicated dtype-patching code into a shared function (e.g., apply_dtype_patches(model)) and call it from both chatterbox_turbo_backend and chatterbox_backend; specifically move the logic that accesses model.s3gen.tokenizer and its log_mel_spectrogram original (__func__), and model.ve and its forward original (__func__), into the new utility, preserve the MethodType wrapping for _f32_log_mel and _f32_ve_forward, and then replace the inline patching in both backends with a single call to apply_dtype_patches(self.model) so updates remain in one place.

backend/main.py (1)
1519-1539: Consider moving the repeated import outside the conditionals.

The `get_tts_backend_for_engine` import is repeated in each branch. Moving it to the top of the try block reduces duplication.

♻️ Proposed refactor

```diff
     try:
+        from .backends import get_tts_backend_for_engine
+
         if model_type == "tts":
             tts_model = tts.get_tts_model()
             if tts_model.is_loaded() and tts_model.model_size == model_size:
                 tts.unload_tts_model()
             else:
                 return {"message": f"Model {model_name} is not loaded"}
         elif model_type == "luxtts":
-            from .backends import get_tts_backend_for_engine
             backend = get_tts_backend_for_engine("luxtts")
             ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/main.py` around lines 1519 - 1539, Move the repeated import of get_tts_backend_for_engine out of each model_type branch: import get_tts_backend_for_engine once at the start of the try block, then inside the branches call get_tts_backend_for_engine with the appropriate engine string (e.g., "luxtts", "chatterbox", "chatterbox_turbo"); keep the existing logic that checks backend.is_loaded() and calls backend.unload_model() or returns the not-loaded message, but remove the duplicate from each elif branch so only the single import remains.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/src/components/Generation/ParalinguisticInput.tsx`:
- Around line 357-406: The portalled autocomplete can remain open after focus
leaves the editor; add an outside-blur handler to close it by setting showMenu
to false: create refs for the editor input (e.g., editorRef) and the menu
container (menuRef used around the motion.div), then in a useEffect attach a
document 'mousedown' or 'focusin' listener that checks if the event target is
not inside editorRef.current nor menuRef.current and if so calls
setShowMenu(false); make sure to cleanup the listener on unmount and preserve
existing behavior for isComposingRef, handleInput, insertTag, and menuIndex when
closing the menu.
- Around line 139-160: lastSerializedRef is initialized to value which causes
the initial useEffect sync to mistakenly believe the editor is already hydrated;
change initialization and update logic so the editor always hydrates on mount:
initialize lastSerializedRef with an empty string (useRef<string>('')) instead
of value, and in the useEffect that writes el.innerHTML, after setting
el.innerHTML assign lastSerializedRef.current = value ?? '' (so future updates
still compare correctly). This touches the lastSerializedRef declaration and the
useEffect block that reads/writes editorRef.current.innerHTML.
- Around line 220-227: The ArrowUp/ArrowDown handlers in ParalinguisticInput.tsx
use modulo with filteredTags.length which becomes 0 and yields NaN; guard these
branches by checking filteredTags.length > 0 before calling setMenuIndex (or
early-return from the key handler when showMenu is true but filteredTags.length
=== 0) so menuIndex is only updated when there are results; update the
ArrowDown/ArrowUp blocks that call setMenuIndex to run only when
filteredTags.length > 0.
- Around line 333-356: The div retains interactive semantics when disabled and
still receives clicks/focus; update the JSX to make it non-focusable and inert
when disabled by: keep contentEditable={!disabled} and aria-disabled, but set
tabIndex={disabled ? -1 : 0} (or omit tabIndex when you prefer default), and
only attach handlers (onInput, onKeyDown, onPaste, onClick, onFocus) when
!disabled (e.g. onInput={!disabled ? handleInput : undefined}, etc.); this
ensures editorRef-backed element, handlers (handleInput, handleKeyDown,
handlePaste, onClick, onFocus) and keyboard interactions are disabled while
preserving accessible ARIA state.
In `@backend/main.py`:
- Around line 1548-1549: The except block currently re-raises an HTTPException
without chaining the original exception; modify the exception raise in the
except handler so the HTTPException is raised from the caught exception (use
"raise HTTPException(status_code=500, detail=str(e)) from e") to preserve the
original traceback—update the except block where HTTPException is raised in
backend/main.py (the handler catching "Exception as e") to use exception
chaining.
---
Nitpick comments:
In `@backend/backends/chatterbox_turbo_backend.py`:
- Around line 181-213: Extract the duplicated dtype-patching code into a shared
function (e.g., apply_dtype_patches(model)) and call it from both
chatterbox_turbo_backend and chatterbox_backend; specifically move the logic
that accesses model.s3gen.tokenizer and its log_mel_spectrogram original
(__func__), and model.ve and its forward original (__func__), into the new
utility, preserve the MethodType wrapping for _f32_log_mel and _f32_ve_forward,
and then replace the inline patching in both backends with a single call to
apply_dtype_patches(self.model) so updates remain in one place.
In `@backend/main.py`:
- Around line 1519-1539: Move the repeated import of get_tts_backend_for_engine
out of each model_type branch: import get_tts_backend_for_engine once at the
start of the try block, then inside the branches call get_tts_backend_for_engine
with the appropriate engine string (e.g., "luxtts", "chatterbox",
"chatterbox_turbo"); keep the existing logic that checks backend.is_loaded() and
calls backend.unload_model() or returns the not-loaded message, but remove the
duplicate from each elif branch so only the single import remains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 238e008c-5f68-4f6a-9527-09f4761ed161
📒 Files selected for processing (8)
- app/src/components/Generation/FloatingGenerateBox.tsx
- app/src/components/Generation/GenerationForm.tsx
- app/src/components/Generation/ParalinguisticInput.tsx
- app/src/components/ServerSettings/ModelManagement.tsx
- app/src/lib/api/client.ts
- backend/backends/chatterbox_backend.py
- backend/backends/chatterbox_turbo_backend.py
- backend/main.py
```tsx
  const lastSerializedRef = useRef<string>(value ?? '');
  const isComposingRef = useRef(false);

  useImperativeHandle(ref, () => ({
    focus: () => editorRef.current?.focus(),
    element: editorRef.current,
  }));

  // Filtered tag list for the autocomplete menu
  const filteredTags = PARALINGUISTIC_TAGS.filter((t) =>
    t.label.toLowerCase().includes(menuFilter.toLowerCase()),
  );

  // ── Sync external value → editor ──────────────────────────────
  useEffect(() => {
    const el = editorRef.current;
    if (!el) return;
    // Only update DOM if the external value differs from what we last emitted
    if (value !== undefined && value !== lastSerializedRef.current) {
      lastSerializedRef.current = value;
      el.innerHTML = value ? textToHTML(value) : '';
    }
```
Initial value can fail to render in the editor on first mount.

`lastSerializedRef` starts with `value`, so the first sync can skip `innerHTML` hydration when `value` is already non-empty.
💡 Proposed fix

```diff
-  const lastSerializedRef = useRef<string>(value ?? '');
+  const lastSerializedRef = useRef<string>('');

   // ── Sync external value → editor ──────────────────────────────
   useEffect(() => {
     const el = editorRef.current;
     if (!el) return;
-    // Only update DOM if the external value differs from what we last emitted
-    if (value !== undefined && value !== lastSerializedRef.current) {
-      lastSerializedRef.current = value;
-      el.innerHTML = value ? textToHTML(value) : '';
-    }
+    const next = value ?? '';
+    if (htmlToText(el) !== next) {
+      el.innerHTML = next ? textToHTML(next) : '';
+    }
+    lastSerializedRef.current = next;
   }, [value]);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/src/components/Generation/ParalinguisticInput.tsx` around lines 139 -
160, lastSerializedRef is initialized to value which causes the initial
useEffect sync to mistakenly believe the editor is already hydrated; change
initialization and update logic so the editor always hydrates on mount:
initialize lastSerializedRef with an empty string (useRef<string>('')) instead
of value, and in the useEffect that writes el.innerHTML, after setting
el.innerHTML assign lastSerializedRef.current = value ?? '' (so future updates
still compare correctly). This touches the lastSerializedRef declaration and the
useEffect block that reads/writes editorRef.current.innerHTML.
- Initialize lastSerializedRef to empty string so first-mount hydration always runs (fixes initial value not rendering)
- Guard arrow-key menu nav against empty filteredTags (avoids NaN index)
- Disable ARIA role/multiline and detach event handlers when disabled
- Add onBlur to close autocomplete dropdown when editor loses focus
- Chain exception with 'from e' in unload endpoint for better tracebacks
Summary
Adds inline tag autocomplete for Chatterbox Turbo's 9 paralinguistic sound effects. Only appears when the engine is set to Chatterbox Turbo.
How it works
- Type `/` in the text input to open a dropdown with all 9 tags
- Typing filters the list (`/la` shows laugh)
- Pasting text with `[laugh]`, `[sigh]` etc. auto-converts to badges
- Badges serialize back to plain `[tag]` text for the API

Supported tags

`[laugh]` `[chuckle]` `[gasp]` `[cough]` `[sigh]` `[groan]` `[sniff]` `[shush]` `[clear throat]`

Implementation

- New `ParalinguisticInput` component using a `contentEditable` div
- Replaces the `Textarea` only when engine is `chatterbox_turbo`
- Dropdown is portalled to `document.body` and positioned above the caret (since the generate box sits at the bottom of the screen)
- Wired into `FloatingGenerateBox` and `GenerationForm`