VibeVoice Server

A FastAPI-based TTS server using Microsoft's VibeVoice model for high-quality text-to-speech synthesis with voice cloning capabilities.

Features

High-quality text-to-speech synthesis
Voice cloning from reference audio
Support for multiple voice presets
Voice conversion (change the voice of existing audio)
Docker support with RunPod compatibility
RTX 5090 (Blackwell) GPU support

API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/base_tts/` | TTS with default voice |
| `GET` | `/synthesize_speech/` | TTS with custom voice |
| `POST` | `/upload_audio/` | Upload reference audio for voice cloning |
| `POST` | `/change_voice/` | Voice conversion on existing audio |

Endpoint Details

GET /base_tts/

GET /base_tts/?text=Hello%20world&speed=1.0

text (required): Text to synthesize
speed (optional, default=1.0): Speech speed; values outside 0.8-1.2 are clamped to that range

GET /synthesize_speech/

GET /synthesize_speech/?text=Hello%20world&voice=my_voice&speed=1.0

text (required): Text to synthesize
voice (required): Voice label (must match the label of previously uploaded reference audio)
speed (optional, default=1.0): Speech speed; values outside 0.8-1.2 are clamped to that range
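
The two GET endpoints above differ only in path and in the extra voice parameter, so a client can build both URLs with one helper. The following is a minimal sketch, not part of the server: the function name `tts_url` and the base URL `http://localhost:7860` (taken from the port in the Docker example) are assumptions. It only builds the URL; the response format is not documented here, so fetching and saving the audio is left out.

```python
from urllib.parse import quote, urlencode

# Assumption: the server is reachable on the port used in the Docker example.
BASE_URL = "http://localhost:7860"


def tts_url(text, voice=None, speed=1.0, base_url=BASE_URL):
    """Build a request URL for /base_tts/ or, if a voice is given, /synthesize_speech/."""
    params = {"text": text}
    if voice is None:
        path = "/base_tts/"
    else:
        path = "/synthesize_speech/"
        params["voice"] = voice
    params["speed"] = speed
    # quote_via=quote encodes spaces as %20, matching the examples above.
    return f"{base_url}{path}?{urlencode(params, quote_via=quote)}"


print(tts_url("Hello world"))
print(tts_url("Hello world", voice="my_voice"))
```

The resulting string can be passed to any HTTP client, for example `urllib.request.urlopen`.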

POST /upload_audio/

Upload a reference audio file for voice cloning.

audio_file_label (form): Label for the voice
file (file): Audio file (wav, mp3, flac, ogg, max 5MB)
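
As an illustration of the two form fields above, the sketch below assembles a multipart/form-data body using only the Python standard library and checks the extension and size limits before anything is sent. The helper name `build_upload_body` is hypothetical, and reading "5MB" as 5 * 1024 * 1024 bytes is an assumption; the server may count differently.

```python
import uuid
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}
MAX_BYTES = 5 * 1024 * 1024  # assumption: "5MB" interpreted as 5 MiB


def build_upload_body(label: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Return (body, content_type) for a POST to /upload_audio/."""
    name = Path(filename).name
    if Path(name).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {name}")
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds the 5 MB limit")

    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="audio_file_label"\r\n'
        "\r\n"
        f"{label}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{name}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"
```

To send it, pass the body as `data` and the content type as the `Content-Type` header of a `urllib.request.Request` with `method="POST"`.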

POST /change_voice/

Convert the voice of an existing audio file.

reference_speaker (form): Voice label to use
file (file): Audio file to convert

Installation

Option 1: Docker (Recommended)

Run the pre-built image:

docker run -p 7860:7860 \
  -v /path/to/models:/workspace/models/vibevoice \
  --gpus all \
  valyriantech/vibevoice_server:latest

Models are downloaded automatically on first start. To persist them across container restarts, mount a volume at /workspace/models/vibevoice, as the -v flag in the command above does.

Building from source (optional)

docker build -t vibevoice_server .

Option 2: Local Installation

Install dependencies:

pip install -r requirements.txt

Clone and install VibeVoice:

git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice
pip install -e .

Download models:

./install_models.sh

Set environment variables:

export VIBEVOICE_MODEL_PATH=/path/to/VibeVoice-Large
export VIBEVOICE_TOKENIZER_PATH=/path/to/tokenizer

Run the server:

./start.sh

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `VIBEVOICE_MODEL_PATH` | `/workspace/models/vibevoice/VibeVoice-Large` | Path to VibeVoice model |
| `VIBEVOICE_TOKENIZER_PATH` | `/workspace/models/vibevoice/tokenizer` | Path to Qwen tokenizer |
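
The fallback behaviour described by the table can be sketched as follows. This is an illustration of reading the two variables with their documented defaults, not a copy of what server.py does; the function name `resolve_paths` is hypothetical.

```python
import os

# Defaults as listed in the table above.
DEFAULTS = {
    "VIBEVOICE_MODEL_PATH": "/workspace/models/vibevoice/VibeVoice-Large",
    "VIBEVOICE_TOKENIZER_PATH": "/workspace/models/vibevoice/tokenizer",
}


def resolve_paths(env=None):
    """Return the model and tokenizer paths, falling back to the defaults."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


for name, path in resolve_paths().items():
    print(f"{name}={path}")
```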

Model Requirements

VibeVoice-Large: ~18.7GB, requires ~20GB VRAM
Tokenizer: Qwen2.5-1.5B tokenizer

On GPUs with less VRAM, consider a quantized model:

VibeVoice-Large-Q8: ~12GB VRAM
VibeVoice-Large-Q4: ~8GB VRAM
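
The VRAM figures above suggest a simple selection rule: pick the largest variant that fits. A minimal sketch, assuming the approximate numbers listed here are used as hard thresholds (they are rough estimates, so leave headroom in practice); `pick_variant` is a hypothetical helper:

```python
# Approximate VRAM needs in GB, taken from the list above.
VARIANTS = [
    ("VibeVoice-Large", 20.0),
    ("VibeVoice-Large-Q8", 12.0),
    ("VibeVoice-Large-Q4", 8.0),
]


def pick_variant(available_vram_gb):
    """Return the largest variant that fits, or None if none does."""
    for name, needed_gb in VARIANTS:
        if available_vram_gb >= needed_gb:
            return name
    return None


print(pick_variant(16))  # VibeVoice-Large-Q8
```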

Voice Cloning Tips

Use clear audio with minimal background noise
Recommended: 10-30 seconds of speech
Audio is automatically resampled to 24kHz
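
The length recommendation above can be checked before uploading. The sketch below uses the standard-library `wave` module, so it only handles WAV files; other formats would need a different reader. Both function names are hypothetical.

```python
import wave


def wav_duration_seconds(source):
    """Return the duration of a WAV file (path or binary file object) in seconds."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_recommended_length(seconds, low=10.0, high=30.0):
    """True if the clip falls in the recommended 10-30 second range."""
    return low <= seconds <= high
```

For example, `is_recommended_length(wav_duration_seconds("reference.wav"))` tells you whether a local file (the filename is a placeholder) is within the suggested range.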

Notes

Speed parameter: Clamped to 0.8-1.2 range
Voice cloning: Uses audio prefill for natural voice reproduction
Voice conversion: Uses Whisper for transcription (installed by default)
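
To make the speed note concrete: clamping means out-of-range values are pulled to the nearest bound rather than rejected. The sketch below shows that behaviour in isolation; it is an illustration, not the server's actual code, and `clamp_speed` is a hypothetical name.

```python
def clamp_speed(speed, low=0.8, high=1.2):
    """Pull speed into the [low, high] range."""
    return max(low, min(high, speed))


print(clamp_speed(2.0))  # 1.2
print(clamp_speed(0.5))  # 0.8
print(clamp_speed(1.0))  # 1.0
```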

License

MIT License (same as VibeVoice)
