117 changes: 117 additions & 0 deletions README.md
@@ -64,6 +64,14 @@ This server is based on the architecture and UI of our [Dia-TTS-Server](https://
- `--upgrade` to update code + dependencies.
- `--reinstall` for a clean reinstall when environments get messy.

### 🧠 LLM-based Preprocessing (OpenAI Endpoint)

- Added **LLM-based preprocessing** for the OpenAI-compatible `/v1/audio/speech` endpoint.
- Send natural language instructions like `"speak excitedly: Hello!"` and the LLM extracts TTS parameters automatically.
- Uses **[litellm](https://docs.litellm.ai/)** for unified access to 100+ LLM providers (Ollama, OpenAI, Anthropic, etc.).
- Extracts parameters: `temperature`, `exaggeration`, `cfg_weight`, `split_text`, `chunk_size`, `language`.
- Configurable prompt, timeout, and fallback behavior via `config.yaml`.

---

## 🗣️ Overview: Enhanced Chatterbox TTS Generation
@@ -154,6 +162,7 @@ This server application enhances the underlying `chatterbox-tts` engine with the
* **Advanced Generation Features:**
* 🔁 **Hot-Swappable Engines:** Switch between Original Chatterbox and Chatterbox‑Turbo directly in the Web UI.
* 🎭 **Paralinguistic Tags (Turbo):** Native support for `[laugh]`, `[cough]`, `[chuckle]` and other expressive tags.
* 🧠 **LLM Preprocessing (OpenAI Endpoint):** Send natural language instructions through the OpenAI-compatible endpoint. An LLM extracts TTS parameters from instructions like "speak excitedly" or "say slowly and calmly."
* 📚 **Large Text Handling:** Intelligently splits long plain text inputs into chunks based on sentences, generates audio for each, and concatenates the results seamlessly. Configurable via `split_text` and `chunk_size`.
* 📖 **Audiobook Creation:** Perfect for generating complete audiobooks from full-length texts with consistent voice quality and automatic chapter handling.
* 🎤 **Predefined Voices:** Select from curated synthetic voices in the `./voices` directory.
@@ -573,6 +582,7 @@ The server relies exclusively on `config.yaml` for runtime configuration.
* `ui_state`: Stores the last used text, voice mode, file selections, etc., for UI persistence.
* `ui`: `title`, `show_language_select`, `max_predefined_voices_in_dropdown`.
* `debug`: `save_intermediate_audio`.
* `llm_preprocessing`: LLM preprocessing settings (`enabled`, `model`, `api_base`, `api_key`, `timeout_seconds`, `fallback_on_error`, `prompt`).

⭐ **Remember:** Changes made to `server`, `model`, `tts_engine`, or `paths` sections in `config.yaml` (or via the UI's Server Configuration section) **require a server restart** to take effect. Changes to `generation_defaults` or `ui_state` are applied dynamically or on the next page load.

@@ -851,6 +861,113 @@ One moment… [cough] sorry about that. Let's get this fixed.

Turbo supports native tags like `[laugh]`, `[cough]`, and `[chuckle]` for more realistic, expressive speech. These tags are ignored when using Original Chatterbox.

### 🧠 LLM Preprocessing (OpenAI Endpoint)

The LLM preprocessing feature allows you to send natural language instructions through the OpenAI-compatible `/v1/audio/speech` endpoint. Instead of manually specifying TTS parameters, simply describe how you want the text spoken and an LLM will extract the appropriate settings.

#### How It Works

When enabled, the server sends your input text to a configured LLM, which extracts:
- **text**: The cleaned text to speak (instructions removed)
- **temperature**: Controls randomness (0.0-1.5)
- **exaggeration**: Controls expressiveness (0.25-2.0)
- **cfg_weight**: Classifier-Free Guidance weight (0.2-1.0)
- **split_text**: Whether to chunk long text
- **chunk_size**: Target characters per chunk (50-500)
- **language**: Language code (e.g., "en", "es")

The extracted parameters are then used to generate speech with the appropriate settings.
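For example, the extraction for one of the instructions below might look like this. This is a hypothetical illustration of the field list above, not captured server output, and how unset values interact with server defaults is an assumption here:

```python
# Hypothetical extraction result for "say calmly and slowly: Take a breath".
# Fields mirror the list above; parameters the instruction didn't mention
# are left as None (JSON null), so presumably they don't override defaults.
extracted = {
    "text": "Take a breath",  # instruction prefix stripped
    "temperature": None,
    "exaggeration": 0.3,      # "calmly and slowly" -> low expressiveness
    "cfg_weight": None,
    "split_text": None,
    "chunk_size": None,
    "language": None,
}

# Keep only the parameters the LLM actually set.
overrides = {k: v for k, v in extracted.items() if k != "text" and v is not None}
print(overrides)  # {'exaggeration': 0.3}
```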

#### Example Instructions

| Input | Extracted Parameters |
|-------|---------------------|
| `"speak excitedly: Hello world!"` | text="Hello world!", exaggeration=1.5 |
| `"say calmly and slowly: Take a breath"` | text="Take a breath", exaggeration=0.3 |
| `"whisper this in Spanish: Buenos días"` | text="Buenos días", exaggeration=0.3, language="es" |
| `"read this enthusiastically with high energy: Welcome everyone!"` | text="Welcome everyone!", exaggeration=1.8, temperature=0.8 |

#### Configuration

Enable and configure LLM preprocessing in `config.yaml`:

```yaml
llm_preprocessing:
  enabled: true                  # Master toggle
  model: "ollama/qwen2.5:1.5b"   # litellm model string
  api_base: null                 # Optional: URL to litellm proxy
  api_key: null                  # Optional: API key for provider
  timeout_seconds: 30            # Timeout for LLM requests
  fallback_on_error: true        # Fall back to original text on errors
  prompt: |                      # System prompt for extraction
    Extract TTS parameters from the user's input...
```
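To sanity-check the structure of this section before restarting the server, you can parse a trimmed copy of it locally (assuming PyYAML is installed; this is just a check, not part of the server):

```python
import yaml

# A trimmed copy of the config section above.
snippet = """
llm_preprocessing:
  enabled: true
  model: "ollama/qwen2.5:1.5b"
  api_base: null
  timeout_seconds: 30
  fallback_on_error: true
"""

cfg = yaml.safe_load(snippet)["llm_preprocessing"]
assert cfg["enabled"] is True                   # YAML true -> Python True
assert cfg["api_base"] is None                  # YAML null -> Python None
assert isinstance(cfg["timeout_seconds"], int)  # plain number, not a string
print(cfg["model"])  # ollama/qwen2.5:1.5b
```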

#### LLM Provider Setup

The feature uses [litellm](https://docs.litellm.ai/) which supports 100+ LLM providers. Common configurations:

**Ollama (Local, Recommended for Privacy):**
```yaml
llm_preprocessing:
  enabled: true
  model: "ollama/qwen2.5:1.5b"  # Or llama3.2:1b, phi3:mini, etc.
  api_base: null                # Uses the default http://localhost:11434
```

**OpenAI:**
```yaml
llm_preprocessing:
  enabled: true
  model: "gpt-4o-mini"
  api_key: "sk-..."  # Or set the OPENAI_API_KEY env var
```

**Anthropic:**
```yaml
llm_preprocessing:
  enabled: true
  model: "claude-3-haiku-20240307"
  api_key: "sk-ant-..."  # Or set the ANTHROPIC_API_KEY env var
```

**LiteLLM Proxy:**
```yaml
llm_preprocessing:
  enabled: true
  model: "openai/your-model-alias"   # Use the openai/ prefix + model alias from the proxy
  api_base: "http://localhost:4000"
  api_key: "not-needed"              # Required placeholder (even if the proxy has no auth)
```

> **Note:** When using a LiteLLM proxy:
> - Use the `openai/` prefix (e.g., `openai/gpt-4o-mini`) because the proxy exposes an OpenAI-compatible API
> - The model name after the prefix should match your proxy's model alias
> - You must provide an `api_key` value (use `"not-needed"` for unauthenticated proxies)

#### API Usage Example

```bash
# With LLM preprocessing enabled, natural language instructions work:
curl -X POST "http://localhost:8004/v1/audio/speech" \
  -H "Content-Type: application/json" \
  -d '{
        "input": "speak with excitement and energy: Welcome to our show!",
        "voice": "female_voice.wav"
      }' \
  --output excited_welcome.wav

# The LLM extracts: text="Welcome to our show!", exaggeration=1.5+
# Then generates speech with those parameters
```
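The same call can be made from Python using only the standard library. This is a client-side sketch mirroring the curl example; it assumes the server is running on `localhost:8004`:

```python
import json
import urllib.request

# Payload mirrors the curl example above; the natural-language
# instruction travels in the standard OpenAI-style "input" field.
payload = {
    "input": "speak with excitement and energy: Welcome to our show!",
    "voice": "female_voice.wav",
}

def synthesize(base_url: str = "http://localhost:8004") -> bytes:
    """POST the payload and return raw audio bytes (requires a running server)."""
    req = urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    with open("excited_welcome.wav", "wb") as f:
        f.write(synthesize())
```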

#### Error Handling

- `fallback_on_error: true` (default): if the LLM fails, the original text is used as-is with default TTS parameters.
- `fallback_on_error: false`: LLM errors return HTTP 500 with error details.

Errors are logged with full details for debugging. Check server logs if preprocessing isn't working as expected.
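The two modes can be pictured with a simplified sketch (illustrative only, not the server's actual implementation):

```python
def preprocess(text, llm_call, fallback_on_error=True):
    # llm_call stands in for the configured LLM; it returns extracted
    # parameters or raises (timeout, bad JSON, provider error).
    try:
        return llm_call(text)
    except Exception:
        if fallback_on_error:
            # Degrade gracefully: speak the raw input with default params.
            return {"text": text}
        raise  # surfaces as HTTP 500 in the endpoint

def broken_llm(text):
    raise TimeoutError("LLM unreachable")

print(preprocess("Hello world", broken_llm))  # {'text': 'Hello world'}
```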

### API Endpoints (`/docs` for interactive details)

The primary endpoint for TTS generation is `/tts`, which offers detailed control over the synthesis process.
73 changes: 73 additions & 0 deletions config.py
@@ -104,6 +104,28 @@
    "debug": {  # Settings for debugging purposes
        "save_intermediate_audio": False  # If true, save intermediate audio files for debugging
    },
    "llm_preprocessing": {  # LLM-based preprocessing for OpenAI speech endpoint
        "enabled": False,  # Master toggle for LLM preprocessing
        "model": "ollama/qwen2.5:1.5b",  # litellm model string
        "api_base": None,  # Optional: URL to litellm proxy (e.g., "http://localhost:4000")
        "api_key": None,  # Optional: API key for proxy or provider
        "timeout_seconds": 30,  # Timeout for LLM requests
        "fallback_on_error": True,  # If True, fall back to original text on LLM errors
        "prompt": """Extract TTS parameters from the user's input. Return JSON with:
- text: The cleaned text to speak (remove any instructions)
- temperature: 0.0-1.5 (higher = more random)
- exaggeration: 0.25-2.0 (higher = more expressive)
- cfg_weight: 0.2-1.0 (guidance weight)
- split_text: true/false for chunking long text
- chunk_size: 50-500 chars per chunk
- language: language code like "en", "es"

Only include fields explicitly requested. Use null for unspecified params.

Examples:
"speak excitedly: Hello!" → {"text": "Hello!", "exaggeration": 1.5}
"say calmly and slowly: Take a breath" → {"text": "Take a breath", "exaggeration": 0.3}""",
    },
}


@@ -897,4 +919,55 @@ def get_full_config_for_template() -> Dict[str, Any]:
    return config_manager._prepare_config_for_saving(config_snapshot)


# LLM Preprocessing Settings Accessors
def get_llm_preprocessing_enabled() -> bool:
    """Returns whether LLM preprocessing is enabled for the OpenAI speech endpoint."""
    return config_manager.get_bool(
        "llm_preprocessing.enabled",
        _get_default_from_structure("llm_preprocessing.enabled"),
    )


def get_llm_preprocessing_model() -> str:
    """Returns the litellm model string for LLM preprocessing."""
    return config_manager.get_string(
        "llm_preprocessing.model",
        _get_default_from_structure("llm_preprocessing.model"),
    )


def get_llm_preprocessing_api_base() -> Optional[str]:
    """Returns the optional API base URL for a litellm proxy."""
    return config_manager.get("llm_preprocessing.api_base", None)


def get_llm_preprocessing_api_key() -> Optional[str]:
    """Returns the optional API key for the LLM provider or proxy."""
    return config_manager.get("llm_preprocessing.api_key", None)


def get_llm_preprocessing_timeout() -> int:
    """Returns the timeout in seconds for LLM requests."""
    return config_manager.get_int(
        "llm_preprocessing.timeout_seconds",
        _get_default_from_structure("llm_preprocessing.timeout_seconds"),
    )


def get_llm_preprocessing_fallback_on_error() -> bool:
    """Returns whether to fall back to the original text on LLM errors."""
    return config_manager.get_bool(
        "llm_preprocessing.fallback_on_error",
        _get_default_from_structure("llm_preprocessing.fallback_on_error"),
    )


def get_llm_preprocessing_prompt() -> str:
    """Returns the system prompt for LLM preprocessing."""
    return config_manager.get_string(
        "llm_preprocessing.prompt",
        _get_default_from_structure("llm_preprocessing.prompt"),
    )


# --- End File: config.py ---
125 changes: 125 additions & 0 deletions llm_preprocessor.py
@@ -0,0 +1,125 @@
# File: llm_preprocessor.py
# LLM-based preprocessing for extracting TTS parameters from natural language input.
# Uses litellm for unified access to multiple LLM providers.

import logging
from typing import Optional

import litellm
from litellm import acompletion
from pydantic import BaseModel, Field

from config import (
    config_manager,
    get_llm_preprocessing_enabled,
    get_llm_preprocessing_model,
    get_llm_preprocessing_api_base,
    get_llm_preprocessing_api_key,
    get_llm_preprocessing_timeout,
    get_llm_preprocessing_fallback_on_error,
    get_llm_preprocessing_prompt,
)

logger = logging.getLogger(__name__)

# Enable schema validation for local models that may not natively support JSON schema
litellm.enable_json_schema_validation = True


class TTSParamsExtraction(BaseModel):
    """
    Strongly-typed extraction result for TTS parameters.
    Fields match CustomTTSRequest parameters that can be extracted from natural language.
    """

    text: str = Field(
        description="The cleaned text to synthesize, with instructions removed"
    )
    temperature: Optional[float] = Field(
        None, ge=0.0, le=1.5, description="Controls randomness (0.0-1.5)"
    )
    exaggeration: Optional[float] = Field(
        None, ge=0.25, le=2.0, description="Controls expressiveness (0.25-2.0)"
    )
    cfg_weight: Optional[float] = Field(
        None, ge=0.2, le=1.0, description="Classifier-Free Guidance weight (0.2-1.0)"
    )
    split_text: Optional[bool] = Field(
        None, description="Whether to split long text into chunks"
    )
    chunk_size: Optional[int] = Field(
        None, ge=50, le=500, description="Target chunk size in characters (50-500)"
    )
    language: Optional[str] = Field(
        None, description="Language code (e.g., 'en', 'es', 'fr')"
    )


async def preprocess_speech_input(input_text: str) -> TTSParamsExtraction:
    """
    Extract TTS parameters from natural language input using a configured LLM.

    Args:
        input_text: The raw input text that may contain natural language instructions
            for TTS generation (e.g., "speak excitedly: Hello world!")

    Returns:
        TTSParamsExtraction with cleaned text and any extracted parameters.

    Raises:
        Exception: If the LLM call fails and fallback_on_error is False.
    """
    if not get_llm_preprocessing_enabled():
        logger.debug("LLM preprocessing is disabled, returning original text")
        return TTSParamsExtraction(text=input_text)

    model = get_llm_preprocessing_model()
    prompt = get_llm_preprocessing_prompt()
    api_base = get_llm_preprocessing_api_base()
    api_key = get_llm_preprocessing_api_key()
    timeout = get_llm_preprocessing_timeout()
    fallback_on_error = get_llm_preprocessing_fallback_on_error()

    logger.info(f"Preprocessing input with LLM model: {model}")
    logger.debug(f"Input text: {input_text[:100]}...")

    try:
        # Build kwargs conditionally to avoid passing None values
        kwargs = {
            "model": model,
            "messages": [
                {"role": "system", "content": prompt},
                {"role": "user", "content": input_text},
            ],
            "response_format": TTSParamsExtraction,
            "timeout": timeout,
        }
        if api_base:
            kwargs["api_base"] = api_base
        if api_key:
            kwargs["api_key"] = api_key

        response = await acompletion(**kwargs)

        # Parse the response content into our Pydantic model
        content = response.choices[0].message.content
        result = TTSParamsExtraction.model_validate_json(content)

        logger.info(
            f"LLM extracted params: text='{result.text[:50]}...', "
            f"temperature={result.temperature}, exaggeration={result.exaggeration}, "
            f"cfg_weight={result.cfg_weight}, split_text={result.split_text}, "
            f"chunk_size={result.chunk_size}, language={result.language}"
        )

        return result

    except Exception as e:
        logger.error(f"LLM preprocessing failed: {e}", exc_info=True)

        if fallback_on_error:
            logger.warning("Falling back to original text due to LLM error")
            return TTSParamsExtraction(text=input_text)
        else:
            raise


# --- End File: llm_preprocessor.py ---
3 changes: 3 additions & 0 deletions requirements-nvidia-cu128.txt
@@ -56,6 +56,9 @@ inflect
tqdm
hf_transfer

# --- LLM Integration (Optional) ---
litellm>=1.0.0 # Unified LLM API for preprocessing

# --- Audio Post-processing ---
pydub
audiotsm
3 changes: 3 additions & 0 deletions requirements-nvidia.txt
@@ -38,6 +38,9 @@ inflect
tqdm
hf_transfer # Speed up file transfers

# LLM Integration (Optional)
litellm>=1.0.0 # Unified LLM API for preprocessing

# Audio Post-processing
pydub
audiotsm
3 changes: 3 additions & 0 deletions requirements-rocm.txt
@@ -31,3 +31,6 @@ pydub
praat-parselmouth # For unvoiced segment removal
librosa # for changes to sampling
hf-transfer

# LLM Integration (Optional)
litellm>=1.0.0 # Unified LLM API for preprocessing
5 changes: 4 additions & 1 deletion requirements.txt
@@ -42,7 +42,10 @@ python-multipart # Form data parsing for FastAPI
requests # HTTP client library
Jinja2 # Template engine
aiofiles # Async file operations
hf_transfer # Speed up file transfers with the Hugging Face Hub.
hf_transfer # Speed up file transfers with the Hugging Face Hub

# --- LLM Integration (Optional) ---
litellm>=1.0.0 # Unified LLM API for preprocessing (supports 100+ providers)


# --- Configuration & Data Processing ---