diff --git a/README.md b/README.md index 492a9d3..cac8f2b 100644 --- a/README.md +++ b/README.md @@ -64,6 +64,14 @@ This server is based on the architecture and UI of our [Dia-TTS-Server](https:// - `--upgrade` to update code + dependencies. - `--reinstall` for a clean reinstall when environments get messy. +### 🧠 LLM-based Preprocessing (OpenAI Endpoint) + +- Added **LLM-based preprocessing** for the OpenAI-compatible `/v1/audio/speech` endpoint. +- Send natural language instructions like `"speak excitedly: Hello!"` and the LLM extracts TTS parameters automatically. +- Uses **[litellm](https://docs.litellm.ai/)** for unified access to 100+ LLM providers (Ollama, OpenAI, Anthropic, etc.). +- Extracts parameters: `temperature`, `exaggeration`, `cfg_weight`, `split_text`, `chunk_size`, `language`. +- Configurable prompt, timeout, and fallback behavior via `config.yaml`. + --- ## 🗣️ Overview: Enhanced Chatterbox TTS Generation @@ -154,6 +162,7 @@ This server application enhances the underlying `chatterbox-tts` engine with the * **Advanced Generation Features:** * 🔁 **Hot-Swappable Engines:** Switch between Original Chatterbox and Chatterbox‑Turbo directly in the Web UI. * 🎭 **Paralinguistic Tags (Turbo):** Native support for `[laugh]`, `[cough]`, `[chuckle]` and other expressive tags. + * 🧠 **LLM Preprocessing (OpenAI Endpoint):** Send natural language instructions through the OpenAI-compatible endpoint. An LLM extracts TTS parameters from instructions like "speak excitedly" or "say slowly and calmly." * 📚 **Large Text Handling:** Intelligently splits long plain text inputs into chunks based on sentences, generates audio for each, and concatenates the results seamlessly. Configurable via `split_text` and `chunk_size`. * 📖 **Audiobook Creation:** Perfect for generating complete audiobooks from full-length texts with consistent voice quality and automatic chapter handling. * 🎤 **Predefined Voices:** Select from curated synthetic voices in the `./voices` directory. 
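The sentence-based chunking described under Large Text Handling can be sketched as a greedy splitter — a simplified illustration only, assuming nothing about the server's actual implementation or defaults:

```python
import re

def chunk_sentences(text: str, chunk_size: int = 120) -> list[str]:
    """Greedy sentence-based splitter, a simplified sketch of the
    split_text/chunk_size behavior (not the server's actual code)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding the next sentence would exceed the target size
        if current and len(current) + len(sentence) + 1 > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_sentences("First sentence. Second one! A third? Done.", 20))
# → ['First sentence.', 'Second one! A third?', 'Done.']
```

Each chunk is then synthesized separately and the audio segments are concatenated, which keeps per-generation inputs short without cutting sentences in half.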
@@ -573,6 +582,7 @@ The server relies exclusively on `config.yaml` for runtime configuration. * `ui_state`: Stores the last used text, voice mode, file selections, etc., for UI persistence. * `ui`: `title`, `show_language_select`, `max_predefined_voices_in_dropdown`. * `debug`: `save_intermediate_audio`. +* `llm_preprocessing`: LLM preprocessing settings (`enabled`, `model`, `api_base`, `api_key`, `timeout_seconds`, `fallback_on_error`, `prompt`). ⭐ **Remember:** Changes made to `server`, `model`, `tts_engine`, or `paths` sections in `config.yaml` (or via the UI's Server Configuration section) **require a server restart** to take effect. Changes to `generation_defaults` or `ui_state` are applied dynamically or on the next page load. @@ -851,6 +861,113 @@ One moment… [cough] sorry about that. Let's get this fixed. Turbo supports native tags like `[laugh]`, `[cough]`, and `[chuckle]` for more realistic, expressive speech. These tags are ignored when using Original Chatterbox. +### 🧠 LLM Preprocessing (OpenAI Endpoint) + +The LLM preprocessing feature allows you to send natural language instructions through the OpenAI-compatible `/v1/audio/speech` endpoint. Instead of manually specifying TTS parameters, simply describe how you want the text spoken and an LLM will extract the appropriate settings. + +#### How It Works + +When enabled, the server sends your input text to a configured LLM, which extracts: +- **text**: The cleaned text to speak (instructions removed) +- **temperature**: Controls randomness (0.0-1.5) +- **exaggeration**: Controls expressiveness (0.25-2.0) +- **cfg_weight**: Classifier-Free Guidance weight (0.2-1.0) +- **split_text**: Whether to chunk long text +- **chunk_size**: Target characters per chunk (50-500) +- **language**: Language code (e.g., "en", "es") + +The extracted parameters are then used to generate speech with the appropriate settings. 
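As a rough illustration of the merge step, null or missing fields in the LLM's JSON fall back to the server's defaults. The default values below are placeholders for this sketch, not the server's actual `generation_defaults`:

```python
import json

# Hypothetical LLM response for: "speak excitedly: Hello world!"
raw = '{"text": "Hello world!", "exaggeration": 1.5, "temperature": null}'

# Placeholder defaults; the real values come from generation_defaults in config.yaml
defaults = {"temperature": 0.8, "exaggeration": 0.5, "cfg_weight": 0.5}

extracted = json.loads(raw)
# Keep extracted values; fall back to defaults where the LLM returned null or omitted the key
params = {
    key: extracted.get(key) if extracted.get(key) is not None else default
    for key, default in defaults.items()
}
print(extracted["text"], params)
# → Hello world! {'temperature': 0.8, 'exaggeration': 1.5, 'cfg_weight': 0.5}
```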
+ +#### Example Instructions + +| Input | Extracted Parameters | +|-------|---------------------| +| `"speak excitedly: Hello world!"` | text="Hello world!", exaggeration=1.5 | +| `"say calmly and slowly: Take a breath"` | text="Take a breath", exaggeration=0.3 | +| `"whisper this in Spanish: Buenos días"` | text="Buenos días", exaggeration=0.3, language="es" | +| `"read this enthusiastically with high energy: Welcome everyone!"` | text="Welcome everyone!", exaggeration=1.8, temperature=0.8 | + +#### Configuration + +Enable and configure LLM preprocessing in `config.yaml`: + +```yaml +llm_preprocessing: + enabled: true # Master toggle + model: "ollama/qwen2.5:1.5b" # litellm model string + api_base: null # Optional: URL to litellm proxy + api_key: null # Optional: API key for provider + timeout_seconds: 30 # Timeout for LLM requests + fallback_on_error: true # Fall back to original text on errors + prompt: | # System prompt for extraction + Extract TTS parameters from the user's input... +``` + +#### LLM Provider Setup + +The feature uses [litellm](https://docs.litellm.ai/) which supports 100+ LLM providers. Common configurations: + +**Ollama (Local, Recommended for Privacy):** +```yaml +llm_preprocessing: + enabled: true + model: "ollama/qwen2.5:1.5b" # Or llama3.2:1b, phi3:mini, etc. + api_base: null # Uses default http://localhost:11434 +``` + +**OpenAI:** +```yaml +llm_preprocessing: + enabled: true + model: "gpt-4o-mini" + api_key: "sk-..." # Or set OPENAI_API_KEY env var +``` + +**Anthropic:** +```yaml +llm_preprocessing: + enabled: true + model: "claude-3-haiku-20240307" + api_key: "sk-ant-..." 
# Or set ANTHROPIC_API_KEY env var +``` + +**LiteLLM Proxy:** +```yaml +llm_preprocessing: + enabled: true + model: "openai/your-model-alias" # Use openai/ prefix + model alias from proxy + api_base: "http://localhost:4000" + api_key: "not-needed" # Required placeholder (even if proxy has no auth) +``` + +> **Note:** When using a LiteLLM proxy: +> - Use the `openai/` prefix (e.g., `openai/gpt-4o-mini`) because the proxy exposes an OpenAI-compatible API +> - The model name after the prefix should match your proxy's model alias +> - You must provide an `api_key` value (use `"not-needed"` for unauthenticated proxies) + +#### API Usage Example + +```bash +# With LLM preprocessing enabled, natural language instructions work: +curl -X POST "http://localhost:8004/v1/audio/speech" \ + -H "Content-Type: application/json" \ + -d '{ + "input": "speak with excitement and energy: Welcome to our show!", + "voice": "female_voice.wav" + }' \ + --output excited_welcome.wav + +# The LLM extracts: text="Welcome to our show!", exaggeration=1.5+ +# Then generates speech with those parameters +``` + +#### Error Handling + +- **fallback_on_error: true** (default): If the LLM fails, the original text is used as-is with default TTS parameters. +- **fallback_on_error: false**: LLM errors return HTTP 500 with error details. + +Errors are logged with full details for debugging. Check server logs if preprocessing isn't working as expected. + ### API Endpoints (`/docs` for interactive details) The primary endpoint for TTS generation is `/tts`, which offers detailed control over the synthesis process. 
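The curl example above can also be issued from Python with only the standard library. Host, port, and the voice filename here are examples; adjust them to your setup:

```python
import json
import urllib.request

payload = {
    "input": "say calmly and slowly: Take a breath",
    "voice": "female_voice.wav",  # example voice file
    "response_format": "wav",
}
req = urllib.request.Request(
    "http://localhost:8004/v1/audio/speech",  # default port from the examples above
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Requires a running server; uncomment to fetch and save the audio:
# with urllib.request.urlopen(req) as resp:
#     open("calm_breath.wav", "wb").write(resp.read())
```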
diff --git a/config.py b/config.py index d9efc1c..fccfb7e 100644 --- a/config.py +++ b/config.py @@ -104,6 +104,28 @@ "debug": { # Settings for debugging purposes "save_intermediate_audio": False # If true, save intermediate audio files for debugging }, + "llm_preprocessing": { # LLM-based preprocessing for OpenAI speech endpoint + "enabled": False, # Master toggle for LLM preprocessing + "model": "ollama/qwen2.5:1.5b", # litellm model string + "api_base": None, # Optional: URL to litellm proxy (e.g., "http://localhost:4000") + "api_key": None, # Optional: API key for proxy or provider + "timeout_seconds": 30, # Timeout for LLM requests + "fallback_on_error": True, # If True, fall back to original text on LLM errors + "prompt": """Extract TTS parameters from the user's input. Return JSON with: +- text: The cleaned text to speak (remove any instructions) +- temperature: 0.0-1.5 (higher = more random) +- exaggeration: 0.25-2.0 (higher = more expressive) +- cfg_weight: 0.2-1.0 (guidance weight) +- split_text: true/false for chunking long text +- chunk_size: 50-500 chars per chunk +- language: language code like "en", "es" + +Set any parameter the user did not explicitly request to null. + +Examples: +"speak excitedly: Hello!" 
→ {"text": "Hello!", "exaggeration": 1.5} +"say calmly and slowly: Take a breath" → {"text": "Take a breath", "exaggeration": 0.3}""", + }, } @@ -897,4 +919,55 @@ def get_full_config_for_template() -> Dict[str, Any]: return config_manager._prepare_config_for_saving(config_snapshot) +# LLM Preprocessing Settings Accessors +def get_llm_preprocessing_enabled() -> bool: + """Returns whether LLM preprocessing is enabled for the OpenAI speech endpoint.""" + return config_manager.get_bool( + "llm_preprocessing.enabled", + _get_default_from_structure("llm_preprocessing.enabled"), + ) + + +def get_llm_preprocessing_model() -> str: + """Returns the litellm model string for LLM preprocessing.""" + return config_manager.get_string( + "llm_preprocessing.model", + _get_default_from_structure("llm_preprocessing.model"), + ) + + +def get_llm_preprocessing_api_base() -> Optional[str]: + """Returns the optional API base URL for litellm proxy.""" + return config_manager.get("llm_preprocessing.api_base", None) + + +def get_llm_preprocessing_api_key() -> Optional[str]: + """Returns the optional API key for LLM provider or proxy.""" + return config_manager.get("llm_preprocessing.api_key", None) + + +def get_llm_preprocessing_timeout() -> int: + """Returns the timeout in seconds for LLM requests.""" + return config_manager.get_int( + "llm_preprocessing.timeout_seconds", + _get_default_from_structure("llm_preprocessing.timeout_seconds"), + ) + + +def get_llm_preprocessing_fallback_on_error() -> bool: + """Returns whether to fall back to original text on LLM errors.""" + return config_manager.get_bool( + "llm_preprocessing.fallback_on_error", + _get_default_from_structure("llm_preprocessing.fallback_on_error"), + ) + + +def get_llm_preprocessing_prompt() -> str: + """Returns the system prompt for LLM preprocessing.""" + return config_manager.get_string( + "llm_preprocessing.prompt", + _get_default_from_structure("llm_preprocessing.prompt"), + ) + + # --- End File: config.py --- diff --git 
a/llm_preprocessor.py b/llm_preprocessor.py new file mode 100644 index 0000000..d9b4fbe --- /dev/null +++ b/llm_preprocessor.py @@ -0,0 +1,125 @@ +# File: llm_preprocessor.py +# LLM-based preprocessing for extracting TTS parameters from natural language input. +# Uses litellm for unified access to multiple LLM providers. + +import logging +from typing import Optional + +import litellm +from litellm import acompletion +from pydantic import BaseModel, Field + +from config import ( + config_manager, + get_llm_preprocessing_enabled, + get_llm_preprocessing_model, + get_llm_preprocessing_api_base, + get_llm_preprocessing_api_key, + get_llm_preprocessing_timeout, + get_llm_preprocessing_fallback_on_error, + get_llm_preprocessing_prompt, +) + +logger = logging.getLogger(__name__) + +# Enable schema validation for local models that may not natively support JSON schema +litellm.enable_json_schema_validation = True + + +class TTSParamsExtraction(BaseModel): + """ + Strongly-typed extraction result for TTS parameters. + Fields match CustomTTSRequest parameters that can be extracted from natural language. 
+ """ + + text: str = Field( + description="The cleaned text to synthesize, with instructions removed" + ) + temperature: Optional[float] = Field( + None, ge=0.0, le=1.5, description="Controls randomness (0.0-1.5)" + ) + exaggeration: Optional[float] = Field( + None, ge=0.25, le=2.0, description="Controls expressiveness (0.25-2.0)" + ) + cfg_weight: Optional[float] = Field( + None, ge=0.2, le=1.0, description="Classifier-Free Guidance weight (0.2-1.0)" + ) + split_text: Optional[bool] = Field( + None, description="Whether to split long text into chunks" + ) + chunk_size: Optional[int] = Field( + None, ge=50, le=500, description="Target chunk size in characters (50-500)" + ) + language: Optional[str] = Field( + None, description="Language code (e.g., 'en', 'es', 'fr')" + ) + + +async def preprocess_speech_input(input_text: str) -> TTSParamsExtraction: + """ + Extract TTS parameters from natural language input using a configured LLM. + + Args: + input_text: The raw input text that may contain natural language instructions + for TTS generation (e.g., "speak excitedly: Hello world!") + + Returns: + TTSParamsExtraction with cleaned text and any extracted parameters. + + Raises: + Exception: If LLM call fails and fallback_on_error is False. 
+ """ + if not get_llm_preprocessing_enabled(): + logger.debug("LLM preprocessing is disabled, returning original text") + return TTSParamsExtraction(text=input_text) + + model = get_llm_preprocessing_model() + prompt = get_llm_preprocessing_prompt() + api_base = get_llm_preprocessing_api_base() + api_key = get_llm_preprocessing_api_key() + timeout = get_llm_preprocessing_timeout() + fallback_on_error = get_llm_preprocessing_fallback_on_error() + + logger.info(f"Preprocessing input with LLM model: {model}") + logger.debug(f"Input text: {input_text[:100]}...") + + try: + # Build kwargs conditionally to avoid passing None values + kwargs = { + "model": model, + "messages": [ + {"role": "system", "content": prompt}, + {"role": "user", "content": input_text}, + ], + "response_format": TTSParamsExtraction, + "timeout": timeout, + } + if api_base: + kwargs["api_base"] = api_base + if api_key: + kwargs["api_key"] = api_key + + response = await acompletion(**kwargs) + + # Parse the response content into our Pydantic model + content = response.choices[0].message.content + result = TTSParamsExtraction.model_validate_json(content) + + logger.info(f"LLM extracted params: text='{result.text[:50]}...', " + f"temperature={result.temperature}, exaggeration={result.exaggeration}, " + f"cfg_weight={result.cfg_weight}, split_text={result.split_text}, " + f"chunk_size={result.chunk_size}, language={result.language}") + + return result + + except Exception as e: + logger.error(f"LLM preprocessing failed: {e}", exc_info=True) + + if fallback_on_error: + logger.warning("Falling back to original text due to LLM error") + return TTSParamsExtraction(text=input_text) + else: + raise + + +# --- End File: llm_preprocessor.py --- diff --git a/requirements-nvidia-cu128.txt b/requirements-nvidia-cu128.txt index 0b6d97d..fc4cdd7 100644 --- a/requirements-nvidia-cu128.txt +++ b/requirements-nvidia-cu128.txt @@ -56,6 +56,9 @@ inflect tqdm hf_transfer +# --- LLM Integration (Optional) --- 
+litellm>=1.0.0 # Unified LLM API for preprocessing + # --- Audio Post-processing --- pydub audiotsm diff --git a/requirements-nvidia.txt b/requirements-nvidia.txt index e4d3058..22363df 100644 --- a/requirements-nvidia.txt +++ b/requirements-nvidia.txt @@ -38,6 +38,9 @@ inflect tqdm hf_transfer # Speed up file transfers +# LLM Integration (Optional) +litellm>=1.0.0 # Unified LLM API for preprocessing + # Audio Post-processing pydub audiotsm diff --git a/requirements-rocm.txt b/requirements-rocm.txt index 6ecabc7..127aceb 100644 --- a/requirements-rocm.txt +++ b/requirements-rocm.txt @@ -31,3 +31,6 @@ pydub praat-parselmouth # For unvoiced segment removal librosa # for changes to sampling hf-transfer + +# LLM Integration (Optional) +litellm>=1.0.0 # Unified LLM API for preprocessing diff --git a/requirements.txt b/requirements.txt index 20af33f..3c03e7f 100644 --- a/requirements.txt +++ b/requirements.txt @@ -42,7 +42,10 @@ python-multipart # Form data parsing for FastAPI requests # HTTP client library Jinja2 # Template engine aiofiles # Async file operations -hf_transfer # Speed up file transfers with the Hugging Face Hub. 
+hf_transfer # Speed up file transfers with the Hugging Face Hub + +# --- LLM Integration (Optional) --- +litellm>=1.0.0 # Unified LLM API for preprocessing (supports 100+ providers) # --- Configuration & Data Processing --- diff --git a/server.py b/server.py index 761b512..b10fc69 100644 --- a/server.py +++ b/server.py @@ -57,6 +57,8 @@ get_audio_sample_rate, get_full_config_for_template, get_audio_output_format, + get_llm_preprocessing_enabled, + get_llm_preprocessing_fallback_on_error, ) import engine # TTS Engine interface @@ -66,6 +68,7 @@ UpdateStatusResponse, ) import utils # Utility functions +from llm_preprocessor import preprocess_speech_input, TTSParamsExtraction from pydantic import BaseModel, Field @@ -1220,7 +1223,82 @@ async def custom_tts_endpoint( @app.post("/v1/audio/speech", tags=["OpenAI Compatible"]) -async def openai_speech_endpoint(request: OpenAISpeechRequest): +async def openai_speech_endpoint( + request: OpenAISpeechRequest, background_tasks: BackgroundTasks +): + """ + OpenAI-compatible speech endpoint with optional LLM preprocessing. + + When LLM preprocessing is enabled, natural language instructions in the input + are parsed to extract TTS parameters (e.g., "speak excitedly: Hello!" extracts + exaggeration=1.5 and text="Hello!"). The request is then delegated to the + custom_tts_endpoint for full parameter support. + + When disabled, behaves as a standard OpenAI-compatible endpoint. 
+ """ + # --- LLM Preprocessing Path --- + if get_llm_preprocessing_enabled(): + try: + logger.info("LLM preprocessing enabled, extracting TTS parameters...") + extracted = await preprocess_speech_input(request.input_) + + # Determine voice mode by checking both paths (same logic as standard behavior) + predefined_voices_path = get_predefined_voices_path(ensure_absolute=True) + reference_audio_path = get_reference_audio_path(ensure_absolute=True) + voice_path_predefined = predefined_voices_path / request.voice + voice_path_reference = reference_audio_path / request.voice + + if voice_path_predefined.is_file(): + voice_mode = "predefined" + predefined_voice_id = request.voice + reference_audio_filename = None + elif voice_path_reference.is_file(): + voice_mode = "clone" + predefined_voice_id = None + reference_audio_filename = request.voice + else: + raise HTTPException( + status_code=404, detail=f"Voice file '{request.voice}' not found." + ) + + # Build CustomTTSRequest from extracted params + # Override with OpenAI request params (voice, speed, seed, output_format) + custom_request = CustomTTSRequest( + text=extracted.text, + voice_mode=voice_mode, + predefined_voice_id=predefined_voice_id, + reference_audio_filename=reference_audio_filename, + output_format=request.response_format, + speed_factor=request.speed, + seed=request.seed, + # Use extracted params if present, otherwise None (will use defaults) + temperature=extracted.temperature, + exaggeration=extracted.exaggeration, + cfg_weight=extracted.cfg_weight, + split_text=extracted.split_text, + chunk_size=extracted.chunk_size, + language=extracted.language, + ) + + logger.info( + f"Delegating to custom_tts_endpoint with extracted params: " + f"text='{custom_request.text[:50]}...', temp={custom_request.temperature}, " + f"exag={custom_request.exaggeration}, cfg={custom_request.cfg_weight}" + ) + + # Delegate to custom_tts_endpoint for full processing + return await custom_tts_endpoint(custom_request, 
background_tasks) + + except HTTPException: + # Re-raise HTTP errors (e.g., 404 for an unknown voice) unchanged, + # so they are not swallowed by the generic fallback handler below + raise + except Exception as e: + logger.error(f"LLM preprocessing failed: {e}", exc_info=True) + if not get_llm_preprocessing_fallback_on_error(): + raise HTTPException( + status_code=500, detail=f"LLM preprocessing failed: {e}" + ) + logger.warning("Falling back to standard OpenAI endpoint behavior") + # Fall through to standard behavior below + + # --- Standard OpenAI Endpoint Behavior --- # Determine the audio prompt path based on the voice parameter predefined_voices_path = get_predefined_voices_path(ensure_absolute=True) reference_audio_path = get_reference_audio_path(ensure_absolute=True)