117 changes: 117 additions & 0 deletions README.md
@@ -64,6 +64,14 @@ This server is based on the architecture and UI of our [Dia-TTS-Server](https://
- `--upgrade` to update code + dependencies.
- `--reinstall` for a clean reinstall when environments get messy.

### 🧠 LLM-based Preprocessing (OpenAI Endpoint)

- Added **LLM-based preprocessing** for the OpenAI-compatible `/v1/audio/speech` endpoint.
- Send natural language instructions like `"speak excitedly: Hello!"` and the LLM extracts TTS parameters automatically.
- Uses **[litellm](https://docs.litellm.ai/)** for unified access to 100+ LLM providers (Ollama, OpenAI, Anthropic, etc.).
- Extracts parameters: `temperature`, `exaggeration`, `cfg_weight`, `split_text`, `chunk_size`, `language`.
- Configurable prompt, timeout, and fallback behavior via `config.yaml`.

---

## 🗣️ Overview: Enhanced Chatterbox TTS Generation
@@ -154,6 +162,7 @@ This server application enhances the underlying `chatterbox-tts` engine with the
* **Advanced Generation Features:**
* 🔁 **Hot-Swappable Engines:** Switch between Original Chatterbox and Chatterbox‑Turbo directly in the Web UI.
* 🎭 **Paralinguistic Tags (Turbo):** Native support for `[laugh]`, `[cough]`, `[chuckle]` and other expressive tags.
* 🧠 **LLM Preprocessing (OpenAI Endpoint):** Send natural language instructions through the OpenAI-compatible endpoint. An LLM extracts TTS parameters from instructions like "speak excitedly" or "say slowly and calmly."
* 📚 **Large Text Handling:** Intelligently splits long plain text inputs into chunks based on sentences, generates audio for each, and concatenates the results seamlessly. Configurable via `split_text` and `chunk_size`.
* 📖 **Audiobook Creation:** Perfect for generating complete audiobooks from full-length texts with consistent voice quality and automatic chapter handling.
* 🎤 **Predefined Voices:** Select from curated synthetic voices in the `./voices` directory.
@@ -573,6 +582,7 @@ The server relies exclusively on `config.yaml` for runtime configuration.
* `ui_state`: Stores the last used text, voice mode, file selections, etc., for UI persistence.
* `ui`: `title`, `show_language_select`, `max_predefined_voices_in_dropdown`.
* `debug`: `save_intermediate_audio`.
* `llm_preprocessing`: LLM preprocessing settings (`enabled`, `model`, `api_base`, `api_key`, `timeout_seconds`, `fallback_on_error`, `prompt`).

⭐ **Remember:** Changes made to `server`, `model`, `tts_engine`, or `paths` sections in `config.yaml` (or via the UI's Server Configuration section) **require a server restart** to take effect. Changes to `generation_defaults` or `ui_state` are applied dynamically or on the next page load.

@@ -851,6 +861,113 @@ One moment… [cough] sorry about that. Let's get this fixed.

Turbo supports native tags like `[laugh]`, `[cough]`, and `[chuckle]` for more realistic, expressive speech. These tags are ignored when using Original Chatterbox.

### 🧠 LLM Preprocessing (OpenAI Endpoint)

The LLM preprocessing feature allows you to send natural language instructions through the OpenAI-compatible `/v1/audio/speech` endpoint. Instead of manually specifying TTS parameters, simply describe how you want the text spoken and an LLM will extract the appropriate settings.

#### How It Works

When enabled, the server sends your input text to a configured LLM, which extracts:
- **text**: The cleaned text to speak (instructions removed)
- **temperature**: Controls randomness (0.0-1.5)
- **exaggeration**: Controls expressiveness (0.25-2.0)
- **cfg_weight**: Classifier-Free Guidance weight (0.2-1.0)
- **split_text**: Whether to chunk long text
- **chunk_size**: Target characters per chunk (50-500)
- **language**: Language code (e.g., "en", "es")

The extracted parameters are then used to generate speech with the appropriate settings.
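For example, the extraction for one of the instructions below might look like this. This is a hypothetical illustration of the field list above, not captured server output, and how unset values interact with server defaults is an assumption here:

```python
# Hypothetical extraction result for "say calmly and slowly: Take a breath".
# Fields mirror the list above; parameters the instruction didn't mention
# are left as None (JSON null), so presumably they don't override defaults.
extracted = {
    "text": "Take a breath",  # instruction prefix stripped
    "temperature": None,
    "exaggeration": 0.3,      # "calmly and slowly" -> low expressiveness
    "cfg_weight": None,
    "split_text": None,
    "chunk_size": None,
    "language": None,
}

# Keep only the parameters the LLM actually set.
overrides = {k: v for k, v in extracted.items() if k != "text" and v is not None}
print(overrides)  # {'exaggeration': 0.3}
```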

#### Example Instructions

| Input | Extracted Parameters |
|-------|---------------------|
| `"speak excitedly: Hello world!"` | text="Hello world!", exaggeration=1.5 |
| `"say calmly and slowly: Take a breath"` | text="Take a breath", exaggeration=0.3 |
| `"whisper this in Spanish: Buenos días"` | text="Buenos días", exaggeration=0.3, language="es" |
| `"read this enthusiastically with high energy: Welcome everyone!"` | text="Welcome everyone!", exaggeration=1.8, temperature=0.8 |

#### Configuration

Enable and configure LLM preprocessing in `config.yaml`:

```yaml
llm_preprocessing:
  enabled: true                  # Master toggle
  model: "ollama/qwen2.5:1.5b"   # litellm model string
  api_base: null                 # Optional: URL to litellm proxy
  api_key: null                  # Optional: API key for provider
  timeout_seconds: 30            # Timeout for LLM requests
  fallback_on_error: true        # Fall back to original text on errors
  prompt: |                      # System prompt for extraction
    Extract TTS parameters from the user's input...
```
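To sanity-check the structure of this section before restarting the server, you can parse a trimmed copy of it locally (assuming PyYAML is installed; this is just a check, not part of the server):

```python
import yaml

# A trimmed copy of the config section above.
snippet = """
llm_preprocessing:
  enabled: true
  model: "ollama/qwen2.5:1.5b"
  api_base: null
  timeout_seconds: 30
  fallback_on_error: true
"""

cfg = yaml.safe_load(snippet)["llm_preprocessing"]
assert cfg["enabled"] is True                   # YAML true -> Python True
assert cfg["api_base"] is None                  # YAML null -> Python None
assert isinstance(cfg["timeout_seconds"], int)  # plain number, not a string
print(cfg["model"])  # ollama/qwen2.5:1.5b
```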

#### LLM Provider Setup

The feature uses [litellm](https://docs.litellm.ai/) which supports 100+ LLM providers. Common configurations:

**Ollama (Local, Recommended for Privacy):**
```yaml
llm_preprocessing:
  enabled: true
  model: "ollama/qwen2.5:1.5b"  # Or llama3.2:1b, phi3:mini, etc.
  api_base: null                # Uses the default http://localhost:11434
```

**OpenAI:**
```yaml
llm_preprocessing:
  enabled: true
  model: "gpt-4o-mini"
  api_key: "sk-..."  # Or set the OPENAI_API_KEY env var
```

**Anthropic:**
```yaml
llm_preprocessing:
  enabled: true
  model: "claude-3-haiku-20240307"
  api_key: "sk-ant-..."  # Or set the ANTHROPIC_API_KEY env var
```

**LiteLLM Proxy:**
```yaml
llm_preprocessing:
  enabled: true
  model: "openai/your-model-alias"   # Use the openai/ prefix + model alias from the proxy
  api_base: "http://localhost:4000"
  api_key: "not-needed"              # Required placeholder (even if the proxy has no auth)
```

> **Note:** When using a LiteLLM proxy:
> - Use the `openai/` prefix (e.g., `openai/gpt-4o-mini`) because the proxy exposes an OpenAI-compatible API
> - The model name after the prefix should match your proxy's model alias
> - You must provide an `api_key` value (use `"not-needed"` for unauthenticated proxies)

#### API Usage Example

```bash
# With LLM preprocessing enabled, natural language instructions work:
curl -X POST "http://localhost:8004/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{
        "input": "speak with excitement and energy: Welcome to our show!",
        "voice": "female_voice.wav"
      }' \
  --output excited_welcome.wav

# The LLM extracts: text="Welcome to our show!", exaggeration=1.5+
# Then generates speech with those parameters
```
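The same call can be made from Python using only the standard library. This is a client-side sketch mirroring the curl example; it assumes the server is running on `localhost:8004`:

```python
import json
import urllib.request

# Payload mirrors the curl example above; the natural-language
# instruction travels in the standard OpenAI-style "input" field.
payload = {
    "input": "speak with excitement and energy: Welcome to our show!",
    "voice": "female_voice.wav",
}

def synthesize(base_url: str = "http://localhost:8004") -> bytes:
    """POST the payload and return raw audio bytes (requires a running server)."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    with open("excited_welcome.wav", "wb") as f:
        f.write(synthesize())
```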

#### Error Handling

- `fallback_on_error: true` (default): if the LLM fails, the original text is used as-is with default TTS parameters.
- `fallback_on_error: false`: LLM errors return HTTP 500 with error details.

Errors are logged with full details for debugging. Check server logs if preprocessing isn't working as expected.
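The two modes can be pictured with a simplified sketch (illustrative only, not the server's actual implementation):

```python
def preprocess(text, llm_call, fallback_on_error=True):
    # llm_call stands in for the configured LLM; it returns extracted
    # parameters or raises (timeout, bad JSON, provider error).
    try:
        return llm_call(text)
    except Exception:
        if fallback_on_error:
            # Degrade gracefully: speak the raw input with default params.
            return {"text": text}
        raise  # surfaces as HTTP 500 in the endpoint

def broken_llm(text):
    raise TimeoutError("LLM unreachable")

print(preprocess("Hello world", broken_llm))  # {'text': 'Hello world'}
```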

### API Endpoints (`/docs` for interactive details)

The primary endpoint for TTS generation is `/tts`, which offers detailed control over the synthesis process.
73 changes: 73 additions & 0 deletions config.py
@@ -104,6 +104,28 @@
    "debug": {  # Settings for debugging purposes
        "save_intermediate_audio": False  # If true, save intermediate audio files for debugging
    },
    "llm_preprocessing": {  # LLM-based preprocessing for OpenAI speech endpoint
        "enabled": False,  # Master toggle for LLM preprocessing
        "model": "ollama/qwen2.5:1.5b",  # litellm model string
        "api_base": None,  # Optional: URL to litellm proxy (e.g., "http://localhost:4000")
        "api_key": None,  # Optional: API key for proxy or provider
        "timeout_seconds": 30,  # Timeout for LLM requests
        "fallback_on_error": True,  # If True, fall back to original text on LLM errors
        "prompt": """Extract TTS parameters from the user's input. Return JSON with:
- text: The cleaned text to speak (remove any instructions)
- temperature: 0.0-1.5 (higher = more random)
- exaggeration: 0.25-2.0 (higher = more expressive)
- cfg_weight: 0.2-1.0 (guidance weight)
- split_text: true/false for chunking long text
- chunk_size: 50-500 chars per chunk
- language: language code like "en", "es"

Only include fields explicitly requested. Use null for unspecified params.

Examples:
"speak excitedly: Hello!" → {"text": "Hello!", "exaggeration": 1.5}
"say calmly and slowly: Take a breath" → {"text": "Take a breath", "exaggeration": 0.3}""",
    },
}


@@ -897,4 +919,55 @@ def get_full_config_for_template() -> Dict[str, Any]:
    return config_manager._prepare_config_for_saving(config_snapshot)


# LLM Preprocessing Settings Accessors
def get_llm_preprocessing_enabled() -> bool:
    """Returns whether LLM preprocessing is enabled for the OpenAI speech endpoint."""
    return config_manager.get_bool(
        "llm_preprocessing.enabled",
        _get_default_from_structure("llm_preprocessing.enabled"),
    )


def get_llm_preprocessing_model() -> str:
    """Returns the litellm model string for LLM preprocessing."""
    return config_manager.get_string(
        "llm_preprocessing.model",
        _get_default_from_structure("llm_preprocessing.model"),
    )


def get_llm_preprocessing_api_base() -> Optional[str]:
    """Returns the optional API base URL for a litellm proxy."""
    return config_manager.get("llm_preprocessing.api_base", None)


def get_llm_preprocessing_api_key() -> Optional[str]:
    """Returns the optional API key for the LLM provider or proxy."""
    return config_manager.get("llm_preprocessing.api_key", None)


def get_llm_preprocessing_timeout() -> int:
    """Returns the timeout in seconds for LLM requests."""
    return config_manager.get_int(
        "llm_preprocessing.timeout_seconds",
        _get_default_from_structure("llm_preprocessing.timeout_seconds"),
    )


def get_llm_preprocessing_fallback_on_error() -> bool:
    """Returns whether to fall back to the original text on LLM errors."""
    return config_manager.get_bool(
        "llm_preprocessing.fallback_on_error",
        _get_default_from_structure("llm_preprocessing.fallback_on_error"),
    )


def get_llm_preprocessing_prompt() -> str:
    """Returns the system prompt for LLM preprocessing."""
    return config_manager.get_string(
        "llm_preprocessing.prompt",
        _get_default_from_structure("llm_preprocessing.prompt"),
    )


# --- End File: config.py ---
125 changes: 125 additions & 0 deletions llm_preprocessor.py
@@ -0,0 +1,125 @@
# File: llm_preprocessor.py
# LLM-based preprocessing for extracting TTS parameters from natural language input.
# Uses litellm for unified access to multiple LLM providers.

import logging
from typing import Optional

import litellm
from litellm import acompletion
from pydantic import BaseModel, Field

from config import (
    config_manager,
    get_llm_preprocessing_enabled,
    get_llm_preprocessing_model,
    get_llm_preprocessing_api_base,
    get_llm_preprocessing_api_key,
    get_llm_preprocessing_timeout,
    get_llm_preprocessing_fallback_on_error,
    get_llm_preprocessing_prompt,
)

logger = logging.getLogger(__name__)

# Enable schema validation for local models that may not natively support JSON schema
litellm.enable_json_schema_validation = True


class TTSParamsExtraction(BaseModel):
    """
    Strongly-typed extraction result for TTS parameters.
    Fields match CustomTTSRequest parameters that can be extracted from natural language.
    """

    text: str = Field(
        description="The cleaned text to synthesize, with instructions removed"
    )
    temperature: Optional[float] = Field(
        None, ge=0.0, le=1.5, description="Controls randomness (0.0-1.5)"
    )
    exaggeration: Optional[float] = Field(
        None, ge=0.25, le=2.0, description="Controls expressiveness (0.25-2.0)"
    )
    cfg_weight: Optional[float] = Field(
        None, ge=0.2, le=1.0, description="Classifier-Free Guidance weight (0.2-1.0)"
    )
    split_text: Optional[bool] = Field(
        None, description="Whether to split long text into chunks"
    )
    chunk_size: Optional[int] = Field(
        None, ge=50, le=500, description="Target chunk size in characters (50-500)"
    )
    language: Optional[str] = Field(
        None, description="Language code (e.g., 'en', 'es', 'fr')"
    )


async def preprocess_speech_input(input_text: str) -> TTSParamsExtraction:
    """
    Extract TTS parameters from natural language input using a configured LLM.

    Args:
        input_text: The raw input text that may contain natural language instructions
            for TTS generation (e.g., "speak excitedly: Hello world!")

    Returns:
        TTSParamsExtraction with cleaned text and any extracted parameters.

    Raises:
        Exception: If the LLM call fails and fallback_on_error is False.
    """
    if not get_llm_preprocessing_enabled():
        logger.debug("LLM preprocessing is disabled, returning original text")
        return TTSParamsExtraction(text=input_text)

    model = get_llm_preprocessing_model()
    prompt = get_llm_preprocessing_prompt()
    api_base = get_llm_preprocessing_api_base()
    api_key = get_llm_preprocessing_api_key()
    timeout = get_llm_preprocessing_timeout()
    fallback_on_error = get_llm_preprocessing_fallback_on_error()

    logger.info(f"Preprocessing input with LLM model: {model}")
    logger.debug(f"Input text: {input_text[:100]}...")

    try:
        # Build kwargs conditionally to avoid passing None values
        kwargs = {
            "model": model,
            "messages": [
                {"role": "system", "content": prompt},
                {"role": "user", "content": input_text},
            ],
            "response_format": TTSParamsExtraction,
            "timeout": timeout,
        }
        if api_base:
            kwargs["api_base"] = api_base
        if api_key:
            kwargs["api_key"] = api_key

        response = await acompletion(**kwargs)

        # Parse the response content into our Pydantic model
        content = response.choices[0].message.content
        result = TTSParamsExtraction.model_validate_json(content)

        logger.info(
            f"LLM extracted params: text='{result.text[:50]}...', "
            f"temperature={result.temperature}, exaggeration={result.exaggeration}, "
            f"cfg_weight={result.cfg_weight}, split_text={result.split_text}, "
            f"chunk_size={result.chunk_size}, language={result.language}"
        )

        return result

    except Exception as e:
        logger.error(f"LLM preprocessing failed: {e}", exc_info=True)

        if fallback_on_error:
            logger.warning("Falling back to original text due to LLM error")
            return TTSParamsExtraction(text=input_text)
        else:
            raise


# --- End File: llm_preprocessor.py ---
3 changes: 3 additions & 0 deletions requirements-nvidia-cu128.txt
@@ -56,6 +56,9 @@ inflect
tqdm
hf_transfer

# --- LLM Integration (Optional) ---
litellm>=1.0.0 # Unified LLM API for preprocessing

# --- Audio Post-processing ---
pydub
audiotsm
3 changes: 3 additions & 0 deletions requirements-nvidia.txt
@@ -38,6 +38,9 @@ inflect
tqdm
hf_transfer # Speed up file transfers

# LLM Integration (Optional)
litellm>=1.0.0 # Unified LLM API for preprocessing

# Audio Post-processing
pydub
audiotsm
3 changes: 3 additions & 0 deletions requirements-rocm.txt
@@ -31,3 +31,6 @@ pydub
praat-parselmouth # For unvoiced segment removal
librosa # for changes to sampling
hf-transfer

# LLM Integration (Optional)
litellm>=1.0.0 # Unified LLM API for preprocessing
5 changes: 4 additions & 1 deletion requirements.txt
@@ -42,7 +42,10 @@ python-multipart # Form data parsing for FastAPI
requests # HTTP client library
Jinja2 # Template engine
aiofiles # Async file operations
hf_transfer # Speed up file transfers with the Hugging Face Hub.
hf_transfer # Speed up file transfers with the Hugging Face Hub

# --- LLM Integration (Optional) ---
litellm>=1.0.0 # Unified LLM API for preprocessing (supports 100+ providers)


# --- Configuration & Data Processing ---