camucamulemon7/MioTTS-OpenAI-Adapter

MioTTS-OpenAI-Adapter

Run Aratako/MioTTS-Inference and expose an OpenAI-compatible Text-to-Speech API for tools such as OpenWebUI.

This project starts three processes inside one container:

  • vLLM serving a MioTTS-compatible model
  • MioTTS-Inference REST API
  • an OpenAI-compatible adapter exposing /v1/audio/speech

What This Project Does

MioTTS-Inference does not expose an OpenAI-compatible TTS API out of the box. This repository adds a thin compatibility layer so OpenAI-style clients can call:

  • POST /v1/audio/speech
  • GET /v1/models
  • GET /health
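As a rough illustration of the read-only endpoints, the sketch below queries GET /v1/models and extracts the model ids. It assumes the adapter is reachable at http://localhost:8005 (the host port used in the Quick Start) and that the response follows the usual OpenAI list shape with a "data" array of objects carrying an "id"; neither the function names nor the response shape are confirmed by this repository.

```python
import json
import urllib.request

# Assumption: host port taken from the Quick Start port mapping (8005:8080).
BASE_URL = "http://localhost:8005"


def parse_model_ids(body: bytes) -> list[str]:
    """Extract model ids from an OpenAI-style model list response."""
    payload = json.loads(body)
    # Assumed shape: {"object": "list", "data": [{"id": "..."}, ...]}
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = BASE_URL, timeout: float = 10.0) -> list[str]:
    """Call GET /v1/models on a running adapter and return the model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return parse_model_ids(resp.read())
```

Calling list_models() requires a running container; parse_model_ids() can be exercised on its own with a captured response body.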

The adapter also supports:

  • OpenAI-style voice
  • response_format
  • output_format
  • speed

Requirements

  • Docker
  • NVIDIA GPU support for Docker
  • enough VRAM for the model you plan to run
  • a Hugging Face token if the model requires gated access

Quick Start

1. Build the image

Place the required NLTK data archives under local/nltk_data/packages before building:

  • local/nltk_data/packages/tokenizers/punkt.zip
  • local/nltk_data/packages/tokenizers/punkt_tab.zip
  • local/nltk_data/packages/taggers/averaged_perceptron_tagger.zip
  • local/nltk_data/packages/corpora/cmudict.zip

cd MioTTS-openai
docker build -t miotts-openai-adapter:latest .

If you build behind an HTTP/HTTPS proxy, pass the proxy environment variables to docker build, for example:

docker build \
  --build-arg HTTP_PROXY="$HTTP_PROXY" \
  --build-arg HTTPS_PROXY="$HTTPS_PROXY" \
  --build-arg NO_PROXY="$NO_PROXY" \
  -t miotts-openai-adapter:latest .

2. Run the container

mkdir -p ./huggingface ./presets

docker run -d \
  --name vllm-miotts \
  --gpus all \
  --restart unless-stopped \
  -p 8005:8080 \
  -e HF_TOKEN="${HF_TOKEN:-}" \
  -e MIOTTS_MODEL="your-model-id" \
  -v "$PWD/huggingface:/home/app/.cache/huggingface" \
  -v "$PWD/presets:/opt/MioTTS-Inference/presets" \
  miotts-openai-adapter:latest
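Because the container starts several services and may need time to load the model, a client might want to wait before sending requests. The following is a minimal sketch that polls GET /health until it answers. It assumes the endpoint returns HTTP 200 once the stack is ready, which this README does not state explicitly; the function names, attempt count, and delay are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True when the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all land here.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8005/health",  # host port from the docker run example
    attempts: int = 60,
    delay: float = 5.0,
    probe: Callable[[str], bool] = http_probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint; return True as soon as one probe succeeds."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

The probe and sleep parameters are injectable so the loop can be tested without a running container.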

You can also pass optional environment variables to docker run, such as:

-e VLLM_GPU_MEMORY_UTILIZATION="0.50" \
-e VLLM_MAX_MODEL_LEN="1024" \
-e VLLM_TENSOR_PARALLEL_SIZE="1" \
-e OPENAI_TTS_MODEL_NAME="miotts" \
-e OPENAI_TTS_DEFAULT_VOICE="default" \
-e OPENAI_TTS_DEFAULT_RESPONSE_FORMAT="mp3"
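If you launch the container from a script, the optional variables can be kept in a dictionary and expanded into -e flags. The sketch below only assembles an argument list and does not start anything; the helper name env_flags is hypothetical, and the volume mounts and model selection from the docker run command above are omitted for brevity.

```python
import shlex


def env_flags(env: dict[str, str]) -> list[str]:
    """Turn a mapping into a flat list of `-e KEY=VALUE` docker arguments."""
    flags: list[str] = []
    for key, value in env.items():
        flags += ["-e", f"{key}={value}"]
    return flags


# Variable names and values copied from the list of optional settings above.
optional_env = {
    "VLLM_GPU_MEMORY_UTILIZATION": "0.50",
    "VLLM_MAX_MODEL_LEN": "1024",
    "OPENAI_TTS_DEFAULT_RESPONSE_FORMAT": "mp3",
}

docker_cmd = [
    "docker", "run", "-d",
    "--name", "vllm-miotts",
    "--gpus", "all",
    "-p", "8005:8080",
    *env_flags(optional_env),
    "miotts-openai-adapter:latest",
]

# Print a copy-pasteable shell command instead of executing it.
print(shlex.join(docker_cmd))
```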

By default, the image does not install flash-attn, because building it can be resource-intensive and unstable in some environments. To try it and avoid the slower SDPA fallback in miocodec, enable it explicitly at build time:

docker build \
  --build-arg INSTALL_FLASH_ATTN=1 \
  --build-arg FLASH_ATTN_MAX_JOBS=4 \
  -t miotts-openai-adapter:latest .

If you prefer a helper script for your local environment:

cd MioTTS-openai
chmod +x run.sh
./run.sh

OpenWebUI Configuration

Use the OpenAI TTS engine and point it to this adapter:

AUDIO_TTS_ENGINE=openai
AUDIO_TTS_OPENAI_API_BASE_URL=http://host.docker.internal:8005/v1
AUDIO_TTS_OPENAI_API_KEY=dummy
AUDIO_TTS_MODEL=miotts
AUDIO_TTS_VOICE=default

You can also pass additional OpenAI-style parameters from OpenWebUI, for example:

{"speed": 1.2}

Custom Presets

Preset availability and naming depend on the upstream MioTTS-Inference setup. For the latest built-in presets and reference behavior, check the upstream project documentation.

The bundled presets may not be suitable for commercial use. For production or commercial deployments, generate your own presets from audio you are legally allowed to use.

MioTTS-Inference includes a preset generator:

python3 /opt/MioTTS-Inference/scripts/generate_preset.py \
  --audio /path/to/reference.wav \
  --preset-id myvoice \
  --output-dir /opt/MioTTS-Inference/presets
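When several reference recordings are available, the same command can be repeated per file. The sketch below wraps the invocation shown above using only the flags that appear there. Using the file stem as the preset id and scanning a directory for .wav files are illustrative choices, and the function names are hypothetical. It would need to run where the script path exists, for example inside the container.

```python
import subprocess
from pathlib import Path

SCRIPT = "/opt/MioTTS-Inference/scripts/generate_preset.py"
PRESET_DIR = "/opt/MioTTS-Inference/presets"


def preset_command(audio: Path, preset_id: str, output_dir: str = PRESET_DIR) -> list[str]:
    """Build the generate_preset.py command line for one reference file."""
    return [
        "python3", SCRIPT,
        "--audio", str(audio),
        "--preset-id", preset_id,
        "--output-dir", output_dir,
    ]


def generate_presets(reference_dir: Path) -> None:
    """Generate one preset per .wav file, using the file stem as the preset id."""
    for wav in sorted(reference_dir.glob("*.wav")):
        # check=True stops at the first failing file instead of continuing silently.
        subprocess.run(preset_command(wav, wav.stem), check=True)
```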

On startup, this project automatically seeds the mounted presets/ directory with the default presets if that directory is empty.
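The seed-if-empty behavior can be pictured with a few lines of Python. This is only a sketch of the idea, not the logic in entrypoint.sh: the function name and the location of the default presets are assumptions.

```python
import shutil
from pathlib import Path


def seed_presets(default_dir: Path, mounted_dir: Path) -> bool:
    """Copy default presets into the mounted directory only if it is empty.

    Returns True when files were copied, False when existing content was kept.
    """
    mounted_dir.mkdir(parents=True, exist_ok=True)
    if any(mounted_dir.iterdir()):
        # The user already has presets; never overwrite them.
        return False
    shutil.copytree(default_dir, mounted_dir, dirs_exist_ok=True)
    return True
```

Checking for any existing entry, rather than comparing file names, keeps user-generated presets safe even when they share a name with a bundled one.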

API Example

curl -X POST http://localhost:8005/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "miotts",
    "input": "Hello from MioTTS.",
    "voice": "default",
    "response_format": "mp3",
    "speed": 1.0
  }' \
  --output sample.mp3
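For clients that prefer Python over curl, the same request can be sent with the standard library. The sketch below mirrors the curl example field for field; the function names are hypothetical, and the 120-second timeout is an arbitrary guess, since synthesis time depends on the model and hardware.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    base_url: str = "http://localhost:8005",
    model: str = "miotts",
    voice: str = "default",
    response_format: str = "mp3",
    speed: float = 1.0,
) -> urllib.request.Request:
    """Build the POST /v1/audio/speech request without sending it."""
    body = json.dumps({
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "sample.mp3", timeout: float = 120.0) -> None:
    """Send the request to a running adapter and write the audio bytes to disk."""
    with urllib.request.urlopen(build_speech_request(text), timeout=timeout) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

With the container running, synthesize("Hello from MioTTS.") should produce the same sample.mp3 as the curl command.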

Notes

  • speed is applied as a post-processing tempo adjustment in the adapter.
  • The adapter keeps generated audio in memory and returns it directly in the HTTP response.
  • Model and cache files are stored under the mounted Hugging Face cache directory.
  • Required NLTK data is expected under local/nltk_data, so the image build does not need to download it from blocked external URLs.
  • If you see a warning about FlashAttention not being installed, the stack can still work, but performance may be lower.
  • Model selection, presets, and quality characteristics ultimately depend on the upstream MioTTS-Inference stack and the model you choose.
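To give a feel for what a post-processing tempo adjustment can look like, the sketch below builds an ffmpeg atempo filter string for a given speed. This README does not say how the adapter implements speed, so the use of ffmpeg here is purely an assumption for illustration. Factors are chained so each stage stays between 0.5 and 2.0, a conservative range for the atempo filter.

```python
def atempo_chain(speed: float) -> str:
    """Build an ffmpeg audio filter string that changes tempo by `speed`.

    Illustrative only: the adapter's real implementation may differ.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    factors: list[float] = []
    remaining = speed
    # Split large speed-ups into repeated 2.0x stages.
    while remaining > 2.0:
        factors.append(2.0)
        remaining /= 2.0
    # Split large slow-downs into repeated 0.5x stages.
    while remaining < 0.5:
        factors.append(0.5)
        remaining /= 0.5
    factors.append(remaining)
    return ",".join(f"atempo={factor:g}" for factor in factors)


def ffmpeg_tempo_args(speed: float) -> list[str]:
    """Argument list for an in-memory MP3 tempo change via stdin and stdout."""
    return [
        "ffmpeg", "-i", "pipe:0",
        "-filter:a", atempo_chain(speed),
        "-f", "mp3", "pipe:1",
    ]
```

Reading from pipe:0 and writing to pipe:1 keeps the audio off disk, which matches the note that generated audio is held in memory.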

Repository Layout

  • Dockerfile: container build definition
  • entrypoint.sh: startup orchestration for all three services
  • openai_tts_adapter.py: OpenAI-compatible TTS adapter
  • run.sh: convenience launcher

License and Upstream

Please review the licenses and usage terms of:

  • MioTTS-Inference
  • MioTTS models
  • MioCodec
  • any presets or reference audio you use

This repository is an adapter layer and does not change the original licensing terms of upstream models or voice assets.
