ValyrianTech/VibeVoice_server

VibeVoice Server

A FastAPI-based TTS server using Microsoft's VibeVoice model for high-quality text-to-speech synthesis with voice cloning capabilities.

Features

  • High-quality text-to-speech synthesis
  • Voice cloning from reference audio
  • Support for multiple voice presets
  • Voice conversion (change voice of existing audio)
  • Docker support with RunPod compatibility
  • RTX 5090 (Blackwell) GPU support

API Endpoints

Method | Endpoint | Description
GET | /base_tts/ | TTS with default voice
GET | /synthesize_speech/ | TTS with custom voice
POST | /upload_audio/ | Upload reference audio for voice cloning
POST | /change_voice/ | Voice conversion on existing audio

Endpoint Details

GET /base_tts/

GET /base_tts/?text=Hello%20world&speed=1.0
  • text (required): Text to synthesize
  • speed (optional, default=1.0): Speech speed; values are clamped to the 0.8-1.2 range

GET /synthesize_speech/

GET /synthesize_speech/?text=Hello%20world&voice=my_voice&speed=1.0
  • text (required): Text to synthesize
  • voice (required): Voice label (must match uploaded audio)
  • speed (optional, default=1.0): Speech speed; values are clamped to the 0.8-1.2 range
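
The two GET endpoints differ only in the path and the extra voice parameter, so their request URLs can be built by one small helper. The following is a minimal sketch, not part of the server: build_tts_url and BASE_URL are hypothetical names, and the host and port are assumptions taken from the Docker example (port 7860 published on localhost).

```python
from urllib.parse import quote, urlencode

# Assumption: the server is reachable on the port published in the Docker example.
BASE_URL = "http://localhost:7860"


def build_tts_url(text, voice=None, speed=1.0, base_url=BASE_URL):
    """Build the GET URL for /base_tts/ (no voice) or /synthesize_speech/ (with voice)."""
    params = {"text": text}
    if voice is not None:
        # The voice label must match a label used with /upload_audio/.
        params["voice"] = voice
    params["speed"] = speed
    endpoint = "/base_tts/" if voice is None else "/synthesize_speech/"
    # quote_via=quote encodes spaces as %20, matching the example requests.
    return f"{base_url}{endpoint}?{urlencode(params, quote_via=quote)}"


print(build_tts_url("Hello world"))
print(build_tts_url("Hello world", voice="my_voice"))
```

Fetching the resulting URL (for example with urllib.request.urlopen) and writing the response body to a file should save the audio, assuming the server returns the audio bytes directly in the response.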

POST /upload_audio/

Upload a reference audio file for voice cloning.

  • audio_file_label (form): Label for the voice
  • file (file): Audio file (wav, mp3, flac, ogg, max 5MB)
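
Checking a file against these limits before uploading avoids a wasted request. This is a client-side sketch only; check_reference_audio is a hypothetical helper, and reading "5MB" as 5 * 1024 * 1024 bytes is an assumption (the server may count differently).

```python
import os

# Extensions listed for /upload_audio/.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}
# Assumption: "5MB" is interpreted as 5 MiB.
MAX_UPLOAD_BYTES = 5 * 1024 * 1024


def check_reference_audio(filename, size_bytes):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

For a file on disk, the size can be obtained with os.path.getsize(path).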

POST /change_voice/

Convert the voice of an existing audio file.

  • reference_speaker (form): Voice label to use
  • file (file): Audio file to convert
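
Both POST endpoints take one form field plus one file, which corresponds to a multipart/form-data request. The sketch below builds such a request with the standard library only. encode_multipart and build_change_voice_request are hypothetical helpers, the base URL is an assumption taken from the Docker example, and the generic application/octet-stream content type for the file part is also an assumption.

```python
import uuid
from urllib.request import Request


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode form fields plus one file; return (body_bytes, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        parts.append(f"{value}\r\n".encode())
    parts.append(f"--{boundary}\r\n".encode())
    parts.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    parts.append(b"Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_change_voice_request(reference_speaker, filename, file_bytes,
                               base_url="http://localhost:7860"):
    """Prepare (but do not send) a POST request for /change_voice/."""
    body, content_type = encode_multipart(
        {"reference_speaker": reference_speaker}, "file", filename, file_bytes
    )
    return Request(
        f"{base_url}/change_voice/",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request. The same encoder should work for /upload_audio/ by using audio_file_label as the form field instead of reference_speaker.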

Installation

Option 1: Docker (Recommended)

Run the pre-built image:

docker run -p 7860:7860 \
  -v /path/to/models:/workspace/models/vibevoice \
  --gpus all \
  valyriantech/vibevoice_server:latest

Models are automatically downloaded on first start. To persist models across container restarts, mount a volume to /workspace/models/vibevoice.

Building from source (optional)

docker build -t vibevoice_server .

Option 2: Local Installation

  1. Install dependencies:
pip install -r requirements.txt
  2. Clone and install VibeVoice:
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice
pip install -e .
  3. Download models:
./install_models.sh
  4. Set environment variables:
export VIBEVOICE_MODEL_PATH=/path/to/VibeVoice-Large
export VIBEVOICE_TOKENIZER_PATH=/path/to/tokenizer
  5. Run the server:
./start.sh

Environment Variables

Variable | Default | Description
VIBEVOICE_MODEL_PATH | /workspace/models/vibevoice/VibeVoice-Large | Path to VibeVoice model
VIBEVOICE_TOKENIZER_PATH | /workspace/models/vibevoice/tokenizer | Path to Qwen tokenizer
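
To illustrate how these variables and their defaults interact, the sketch below resolves both paths the way a startup script might. It is an illustration only and not the server's actual code; resolve_paths is a hypothetical helper.

```python
import os

# Defaults as documented in the table above.
DEFAULTS = {
    "VIBEVOICE_MODEL_PATH": "/workspace/models/vibevoice/VibeVoice-Large",
    "VIBEVOICE_TOKENIZER_PATH": "/workspace/models/vibevoice/tokenizer",
}


def resolve_paths(environ=None):
    """Return each variable's value, falling back to the documented default."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


for name, path in resolve_paths().items():
    # Flag missing directories early so a wrong path is easy to spot.
    status = "found" if os.path.isdir(path) else "missing"
    print(f"{name}={path} ({status})")
```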

Model Requirements

  • VibeVoice-Large: ~18.7GB, requires ~20GB VRAM
  • Tokenizer: Qwen2.5-1.5B tokenizer

For lower VRAM, consider using quantized models:

  • VibeVoice-Large-Q8: ~12GB VRAM
  • VibeVoice-Large-Q4: ~8GB VRAM
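
The VRAM figures above amount to a simple selection rule: take the largest variant that fits. A minimal sketch, using the approximate figures listed here as thresholds (pick_variant is a hypothetical helper, and actual memory use may vary with input length and runtime overhead):

```python
# (variant name, approximate VRAM needed in GB), largest first.
VARIANTS = [
    ("VibeVoice-Large", 20),
    ("VibeVoice-Large-Q8", 12),
    ("VibeVoice-Large-Q4", 8),
]


def pick_variant(available_vram_gb):
    """Return the largest variant that fits, or None if none of them fit."""
    for name, required_gb in VARIANTS:
        if available_vram_gb >= required_gb:
            return name
    return None
```

For example, a 16GB card would select VibeVoice-Large-Q8 under this rule.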

Voice Cloning Tips

  • Use clear audio with minimal background noise
  • Recommended: 10-30 seconds of speech
  • Audio is automatically resampled to 24kHz
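
The recommended length can be checked before uploading a reference clip. The sketch below handles uncompressed WAV files only, using the standard library wave module; other formats would need a separate audio library. Both function names are hypothetical.

```python
import wave


def wav_duration_seconds(path_or_file):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path_or_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def in_recommended_range(seconds, low=10.0, high=30.0):
    """True if the clip length falls within the recommended 10-30 seconds."""
    return low <= seconds <= high
```

Since the server resamples to 24kHz automatically, the sample rate of the source file does not need to be checked here.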

Notes

  • Speed parameter: Clamped to 0.8-1.2 range
  • Voice cloning: Uses audio prefill for natural voice reproduction
  • Voice conversion: Uses Whisper for transcription (installed by default)
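
Clamping means out-of-range speed values are not rejected but pulled to the nearest bound. The following sketch shows the behavior a client can expect; it illustrates the rule and is not the server's own code (clamp_speed is a hypothetical name).

```python
def clamp_speed(speed, low=0.8, high=1.2):
    """Limit speed to the supported range by snapping to the nearest bound."""
    return max(low, min(high, speed))


for requested in (0.5, 1.0, 2.0):
    print(requested, "->", clamp_speed(requested))
```

A request with speed=2.0 would therefore be synthesized at 1.2, and one with speed=0.5 at 0.8.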

License

MIT License (same as VibeVoice)
