VoiceGenHub

Simple, user-friendly Text-to-Speech (TTS) library with CLI and Python API. Supports multiple free and commercial TTS providers.

Installation

pip install voicegenhub
# or
poetry add voicegenhub

Optional Dependencies

Microsoft Edge TTS (free, cloud-based)
Kokoro TTS (Apache 2.0 licensed, self-hosted lightweight TTS)
Bark TTS (MIT licensed, self-hosted high-naturalness TTS with prosody control)
Chatterbox TTS (MIT licensed, multilingual with emotion control) - Works on CPU or GPU
Qwen 3 TTS (Apache 2.0 licensed, multilingual with voice design and cloning) - State-of-the-art quality
ElevenLabs TTS (commercial, high-quality voices)

Voice Cloning Support

For voice cloning features with Chatterbox TTS:

pip install voicegenhub[voice-cloning]
# or
poetry install -E voice-cloning

Voice cloning requirements:

FFmpeg (manual installation required)
PyTorch (standard version)

On Windows: Download the "full-shared" FFmpeg build from ffmpeg.org and add the bin directory to your system PATH.

Note: VoiceGenHub includes a compatibility layer that keeps execution stable on CPU-only systems and prevents common import-time crashes caused by experimental dependencies such as TorchCodec. Standard TTS and voice cloning automatically fall back to supported audio loaders when needed.

Usage

Chatterbox TTS

poetry run voicegenhub synthesize "Hello, world!" --provider chatterbox --voice chatterbox-default --output hello.wav

Chatterbox features:

Model selection via voice: Choose between standard, turbo, or multilingual models using the --voice flag
Emotion/intensity control with exaggeration parameter (0.0-1.0)
Zero-shot voice cloning from audio samples
MIT License - fully commercial compatible
State-of-the-art quality (competitive with ElevenLabs)
Built-in Perth watermarking for responsible AI

Chatterbox voices:

chatterbox-default: Standard English model with emotion control
chatterbox-turbo: Turbo English model (faster generation, English only)
chatterbox-<lang>: Multilingual model for specific languages (e.g., chatterbox-es for Spanish)

Chatterbox parameters:

--exaggeration: Emotion intensity (0.0-1.0, default 0.5). Higher values = more dramatic/emotional.
--cfg-weight: Classifier-free guidance weight (0.0-1.0, default 0.5). Controls the influence of the text prompt.
--audio-prompt: Path to reference audio for voice cloning (optional).
temperature, max_new_tokens, repetition_penalty, min_p, top_p: Advanced generation parameters (available in Python API).
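
When scripting the CLI, the flags above can be assembled and range-checked before the command runs. The following is a minimal sketch: build_chatterbox_command is a hypothetical helper, not part of VoiceGenHub, and it assumes the voicegenhub executable is on your PATH. It uses only the flags documented above.

```python
from typing import List, Optional


def build_chatterbox_command(
    text: str,
    output: str,
    voice: str = "chatterbox-default",
    exaggeration: float = 0.5,
    cfg_weight: float = 0.5,
    audio_prompt: Optional[str] = None,
) -> List[str]:
    """Return an argv list that uses the documented Chatterbox CLI flags."""
    # Both values are documented as 0.0-1.0, so reject anything outside that range.
    for name, value in (("exaggeration", exaggeration), ("cfg_weight", cfg_weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")

    command = [
        "voicegenhub", "synthesize", text,
        "--provider", "chatterbox",
        "--voice", voice,
        "--output", output,
        "--exaggeration", str(exaggeration),
        "--cfg-weight", str(cfg_weight),
    ]
    if audio_prompt is not None:
        # Optional reference audio for voice cloning.
        command += ["--audio-prompt", audio_prompt]
    return command
```

The returned list can be passed to subprocess.run(command, check=True); prefix it with "poetry", "run" when working inside a Poetry environment.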

Multilingual Support: Chatterbox supports 23 languages. Use the appropriate voice for the target language:

poetry run voicegenhub synthesize "Hola, esto es una prueba de voz en español." --provider chatterbox --voice chatterbox-es --output spanish.wav

Chatterbox supported languages: ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh
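
The chatterbox-<lang> naming pattern can be turned into a small lookup. This sketch is illustrative only: chatterbox_voice_for is a hypothetical helper, and mapping "en" to chatterbox-default is an assumption based on the voice list above rather than confirmed behavior.

```python
# Language codes copied from the supported-languages list above.
CHATTERBOX_LANGUAGES = frozenset(
    "ar da de el en es fi fr he hi it ja ko ms nl no pl pt ru sv sw tr zh".split()
)


def chatterbox_voice_for(language: str) -> str:
    """Map a language code to a Chatterbox voice name."""
    code = language.strip().lower()
    if code not in CHATTERBOX_LANGUAGES:
        raise ValueError(f"Unsupported Chatterbox language: {language!r}")
    # Assumption: English uses the standard model rather than a "chatterbox-en" voice.
    if code == "en":
        return "chatterbox-default"
    return f"chatterbox-{code}"
```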

Chatterbox Installation Requirements:

TorchCodec (optional): Required for voice cloning features. Install with pip install torchcodec or poetry install -E voice-cloning.
FFmpeg: Required when TorchCodec is installed for voice cloning. On Windows, install the "full-shared" build from ffmpeg.org and ensure FFmpeg's bin directory is in your system PATH.
PyTorch Compatibility: TorchCodec 0.9.1 requires PyTorch ≤ 2.4.x. If you have a newer PyTorch version, voice cloning will be automatically disabled with a fallback to standard TTS.
Without TorchCodec/FFmpeg, basic TTS will work but voice cloning (--audio-prompt) will gracefully fall back to standard TTS without cloning.
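
Before relying on --audio-prompt, it can help to check whether the two prerequisites are visible to Python. The sketch below uses only the standard library; it is not VoiceGenHub's own detection logic, and voice_cloning_prerequisites is a hypothetical name.

```python
import importlib.util
import shutil
from typing import Dict


def voice_cloning_prerequisites() -> Dict[str, bool]:
    """Report whether TorchCodec and FFmpeg appear to be installed."""
    return {
        # find_spec returns None when the package cannot be found.
        "torchcodec": importlib.util.find_spec("torchcodec") is not None,
        # shutil.which returns None when ffmpeg is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    status = voice_cloning_prerequisites()
    for name, found in status.items():
        print(f"{name}: {'found' if found else 'missing'}")
    if not all(status.values()):
        print("--audio-prompt will likely fall back to standard TTS.")
```

This only confirms that the packages can be located; it does not verify the PyTorch version constraint described above.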

Qwen 3 TTS

poetry run voicegenhub synthesize "Hello, world!" --provider qwen --voice Ryan --output hello.wav

Qwen 3 TTS features:

Three generation modes: CustomVoice (predefined speakers), VoiceDesign (natural language voice description), VoiceClone (reference audio-based)
10 languages: Chinese, English, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish
Native speakers: Automatic selection of native speakers per language for natural, accent-free speech
Voice control via natural language: Use instruct parameter to control emotion, tone, speaking rate, and style
Ultra-low latency: Streaming generation with <100ms first-token latency
Apache 2.0 License: Fully commercial compatible
State-of-the-art quality: Competitive with ElevenLabs, developed by Alibaba's Qwen team

Mode 1: CustomVoice (Predefined Speakers)

Use predefined premium speakers with optional emotion/style control:

# Basic usage with auto-selected native speaker
poetry run voicegenhub synthesize "Hello, this is a test." --provider qwen --language en --output output.wav

# Explicit speaker selection
poetry run voicegenhub synthesize "Hello, this is a test." --provider qwen --language en --voice Ryan --output output.wav

# With emotion instruction
poetry run voicegenhub synthesize "I'm so excited about this news!" --provider qwen --language en --voice Ryan --instruct "Speak with excitement and joy" --output happy.wav

Available speakers and their native languages:

| Speaker | Description | Native Language | Best For |
| --- | --- | --- | --- |
| Ryan | Dynamic male voice with strong rhythmic drive | English | English content, presentations |
| Aiden | Sunny American male voice with clear midrange | English | English content, narration |
| Vivian | Bright, slightly edgy young female voice | Chinese | Mandarin content, audiobooks |
| Serena | Warm, gentle young female voice | Chinese | Mandarin content, customer service |
| Uncle_Fu | Seasoned male voice with low, mellow timbre | Chinese | Mandarin narration, mature content |
| Dylan | Youthful Beijing male voice, natural timbre | Chinese (Beijing) | Beijing dialect content |
| Eric | Lively Chengdu male voice, slightly husky | Chinese (Sichuan) | Sichuan dialect content |
| Ono_Anna | Playful Japanese female, light and nimble | Japanese | Japanese content, anime |
| Sohee | Warm Korean female with rich emotion | Korean | Korean content, storytelling |

Auto-speaker selection: If no speaker is specified, Qwen 3 TTS automatically selects a native speaker based on the target language (e.g., Ryan for English, Serena for Chinese).
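
The speaker table can be expressed as a lookup for choosing a speaker explicitly. This is a sketch that mirrors the table only; the provider's actual auto-selection logic is internal and may differ, and speakers_for is a hypothetical helper.

```python
from typing import Dict, List

# Speaker -> native language, copied from the table above.
NATIVE_LANGUAGE: Dict[str, str] = {
    "Ryan": "English",
    "Aiden": "English",
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing)",
    "Eric": "Chinese (Sichuan)",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def speakers_for(language: str) -> List[str]:
    """Return the speakers whose native language matches, in table order."""
    wanted = language.strip().lower()
    return [
        speaker
        for speaker, native in NATIVE_LANGUAGE.items()
        # Dialect entries such as "Chinese (Beijing)" also count as Chinese.
        if native.lower().split(" (")[0] == wanted
    ]
```

Languages with no native speaker in the table (for example French) return an empty list, in which case any listed speaker has to be chosen by hand.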

Emotion and style control: Use the --instruct parameter with natural language to control voice characteristics:

"Speak with excitement and joy"
"Very angry tone"
"Whisper gently"
"Speak slowly and calmly"
"Energetic and enthusiastic"

Mode 2: VoiceDesign (Natural Language Voice Description)

Design custom voices using natural language instructions (requires Qwen3-TTS-VoiceDesign model):

import asyncio

from voicegenhub.providers.factory import provider_factory
from voicegenhub.providers.base import TTSRequest

config = {
    "model_name_or_path": "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "generation_mode": "voice_design",
}


async def main():
    # Discover the Qwen provider, then create it with the VoiceDesign config.
    await provider_factory.discover_provider("qwen")
    provider = await provider_factory.create_provider("qwen", config=config)

    request = TTSRequest(
        text="Welcome to our demonstration.",
        language="en",
        voice_id="default",
        extra_params={
            # Natural-language description of the voice to design.
            "instruct": "Male, 30 years old, confident and professional tone, deep voice with clear articulation"
        },
    )
    return await provider.synthesize(request)


# The provider API is async, so run it inside an event loop.
response = asyncio.run(main())

VoiceDesign instruction examples:

"Female, 25 years old, cheerful and energetic, slightly high-pitched with playful intonation"
"Male, 17 years old, gaining confidence, deeper breath support, vowels tighten when nervous"
"Elderly male, 70 years old, wise and gentle, slightly raspy with warm timbre"

Mode 3: VoiceClone (Reference Audio-Based)

Clone voices from 3-second audio samples (requires Qwen3-TTS-Base model):

import asyncio

from voicegenhub.providers.factory import provider_factory
from voicegenhub.providers.base import TTSRequest

config = {
    "model_name_or_path": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "generation_mode": "voice_clone",
}


async def main():
    # Discover the Qwen provider, then create it with the voice-clone config.
    await provider_factory.discover_provider("qwen")
    provider = await provider_factory.create_provider("qwen", config=config)

    request = TTSRequest(
        text="This is synthesized using the cloned voice.",
        language="en",
        voice_id="default",
        extra_params={
            "ref_audio": "path/to/reference.wav",  # Can be local path, URL, or numpy array
            "ref_text": "Transcript of the reference audio",  # Required for best quality
            "x_vector_only_mode": False,  # Set True to skip ref_text (lower quality)
        },
    )
    return await provider.synthesize(request)


# The provider API is async, so run it inside an event loop.
response = asyncio.run(main())

Voice cloning tips:

Use clear, noise-free reference audio (3-10 seconds)
Provide accurate transcript (ref_text) for best cloning quality
Supports multilingual cloning (clone any language, synthesize in any language)
Combine with VoiceDesign to create reusable custom voices
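
The relationship between ref_text and x_vector_only_mode can be captured in a small builder for extra_params. This is a sketch under the assumption that the three keys shown in the example above are the only ones needed; build_clone_params is a hypothetical helper, not a VoiceGenHub function.

```python
from typing import Any, Dict, Optional


def build_clone_params(ref_audio: str, ref_text: Optional[str] = None) -> Dict[str, Any]:
    """Build extra_params for voice cloning from the keys shown above."""
    if not ref_audio:
        raise ValueError("ref_audio is required for voice cloning")

    params: Dict[str, Any] = {"ref_audio": ref_audio}
    if ref_text and ref_text.strip():
        # With a transcript, use the full (higher quality) cloning path.
        params["ref_text"] = ref_text.strip()
        params["x_vector_only_mode"] = False
    else:
        # Without a transcript, fall back to x-vector-only mode (lower quality).
        params["x_vector_only_mode"] = True
    return params
```

The result can be passed as extra_params when constructing a TTSRequest.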

Word Emphasis and Pause Control

Note: Qwen 3 TTS does not support explicit word-level emphasis markup (like SSML tags) or pause control. Instead, the model intelligently interprets text and applies natural prosody based on:

Context understanding: The model reads the entire sentence and applies appropriate emphasis to important words automatically
Natural language instructions: Use the instruct parameter to guide overall tone and pacing:
- "Speak slowly with emphasis on key words"
- "Pause dramatically between sentences"
- "Fast-paced and energetic delivery"
Punctuation: The model respects punctuation for natural pauses (commas, periods, ellipses, em-dashes)

Example:

# The model will naturally emphasize "incredible results" due to context
poetry run voicegenhub synthesize "We achieved incredible results!" --provider qwen --voice Ryan --instruct "Speak with excitement and emphasis" --output emphasized.wav

Model Selection

Qwen 3 TTS offers multiple models optimized for different use cases:

| Model | Size | Best For | Streaming | GPU Recommended | Supports |
| --- | --- | --- | --- | --- | --- |
| `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | 600M | Default, fast generation, predefined speakers | ✅ | Optional | CustomVoice |
| `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | 1.7B | Higher quality, predefined speakers | ✅ | Yes | CustomVoice |
| `Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign` | 1.7B | Custom voice design via natural language | ✅ | Yes | VoiceDesign |
| `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | 1.7B | Voice cloning from audio samples | ✅ | Yes | VoiceClone |
| `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | 600M | Voice cloning, faster generation | ✅ | Optional | VoiceClone |
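
Choosing a model ID from the table can be reduced to a lookup keyed by generation mode and size. The sketch below is one way to do it; pick_qwen_model is a hypothetical helper, and the "0.6B"/"1.7B" size labels are taken from the model names.

```python
from typing import Dict, Tuple

# (generation mode, size) -> model ID, copied from the table above.
QWEN_MODELS: Dict[Tuple[str, str], str] = {
    ("custom_voice", "0.6B"): "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    ("custom_voice", "1.7B"): "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    ("voice_design", "1.7B"): "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    ("voice_clone", "1.7B"): "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    ("voice_clone", "0.6B"): "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
}


def pick_qwen_model(mode: str, prefer_small: bool = False) -> str:
    """Return a model ID for the mode, preferring the requested size."""
    size = "0.6B" if prefer_small else "1.7B"
    if (mode, size) in QWEN_MODELS:
        return QWEN_MODELS[(mode, size)]
    # The preferred size does not exist for this mode (e.g. VoiceDesign has
    # no 0.6B variant), so fall back to whichever size the table lists.
    for (listed_mode, _), model in QWEN_MODELS.items():
        if listed_mode == mode:
            return model
    raise ValueError(f"Unknown generation mode: {mode!r}")
```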

Installation:

pip install voicegenhub[qwen]
# or
poetry install --with qwen

Qwen 3 TTS parameters (Python API):

model_name_or_path: Model to use (see table above)
device: "cuda", "cpu", or "auto" (default: auto)
dtype: "float32", "float16", "bfloat16" (default: bfloat16)
attn_implementation: "eager", "sdpa", "flash_attention_2" (default: eager)
generation_mode: "custom_voice", "voice_design", "voice_clone"
speaker: Speaker name for CustomVoice mode
instruct: Emotion/style instruction (for CustomVoice) or voice description (for VoiceDesign)
temperature, top_p, top_k, repetition_penalty, max_new_tokens: Advanced sampling parameters
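
A config dictionary built from these parameters can be validated before it is handed to the provider. This is a minimal sketch: build_qwen_config is a hypothetical helper, and writing the documented defaults into the dictionary explicitly is a design choice of the sketch, not something the provider requires.

```python
from typing import Any, Dict

# Allowed values and defaults as listed above.
ALLOWED_VALUES = {
    "device": {"cuda", "cpu", "auto"},
    "dtype": {"float32", "float16", "bfloat16"},
    "attn_implementation": {"eager", "sdpa", "flash_attention_2"},
    "generation_mode": {"custom_voice", "voice_design", "voice_clone"},
}
DEFAULTS = {"device": "auto", "dtype": "bfloat16", "attn_implementation": "eager"}


def build_qwen_config(model_name_or_path: str, generation_mode: str, **overrides: Any) -> Dict[str, Any]:
    """Merge defaults with overrides and reject undocumented option values."""
    config: Dict[str, Any] = {
        "model_name_or_path": model_name_or_path,
        "generation_mode": generation_mode,
        **DEFAULTS,
        **overrides,  # e.g. speaker, instruct, temperature
    }
    for key, allowed in ALLOWED_VALUES.items():
        if config[key] not in allowed:
            raise ValueError(f"{key}={config[key]!r} is not one of {sorted(allowed)}")
    return config
```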

Bark

poetry run voicegenhub synthesize "Hello, world!" --provider bark --voice bark-en_speaker_0 --output hello.wav

Bark features:

Highest naturalness among open-source TTS
Prosody markers for emotional expression: [laughs], [sighs], [pause], [whisper]
100+ speaker presets
Sound effects generation

Bark supported voices: Use preset names like bark-en_speaker_0, bark-en_speaker_1, etc.
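
Since prosody markers are plain bracketed words in the input text, a quick check can catch typos before a long synthesis run. The sketch below compares markers against the four listed above; Bark may accept other markers as well, so treat the result as a hint rather than an error. undocumented_markers is a hypothetical helper.

```python
import re
from typing import List

# Markers listed in the Bark features above.
DOCUMENTED_MARKERS = {"laughs", "sighs", "pause", "whisper"}


def undocumented_markers(text: str) -> List[str]:
    """Return bracketed markers in the text that are not in the documented set."""
    found = re.findall(r"\[([^\[\]]+)\]", text)
    return [marker for marker in found if marker.strip().lower() not in DOCUMENTED_MARKERS]
```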

Edge TTS

poetry run voicegenhub synthesize "Hello, world!" --provider edge --voice en-US-AriaNeural --output hello.mp3

Edge TTS supported voices: list them with poetry run voicegenhub voices --language en --provider edge.

Kokoro TTS

poetry run voicegenhub synthesize "Hello, world!" --provider kokoro --voice kokoro-af_alloy --output hello.wav

Kokoro supported voices: list them with poetry run voicegenhub voices --language en --provider kokoro.

ElevenLabs

poetry run voicegenhub synthesize "Hello, world!" --provider elevenlabs --voice elevenlabs-EXAVITQu4vr4xnSDxMaL --output hello.mp3

Set your API key in config/elevenlabs-api-key.json (the key should be stored as the value for "ELEVENLABS_API_KEY" in the JSON file).
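
The key file is a single JSON object with one entry. The following sketch writes it with the standard library; write_elevenlabs_key is a hypothetical helper, and the default path is interpreted relative to the current working directory, which is an assumption.

```python
import json
from pathlib import Path


def write_elevenlabs_key(api_key: str, path: str = "config/elevenlabs-api-key.json") -> Path:
    """Write {"ELEVENLABS_API_KEY": "<key>"} to the given JSON file."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    target = Path(path)
    # Create the config directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(
        json.dumps({"ELEVENLABS_API_KEY": api_key}, indent=2) + "\n",
        encoding="utf-8",
    )
    return target
```

Keep this file out of version control, since it contains a secret.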

ElevenLabs supported voices: list them with poetry run voicegenhub voices --language en --provider elevenlabs.

Print all available voices per provider

poetry run voicegenhub voices --language en --provider chatterbox
poetry run voicegenhub voices --language en --provider bark
poetry run voicegenhub voices --language en --provider edge
poetry run voicegenhub voices --language en --provider kokoro
poetry run voicegenhub voices --language en --provider elevenlabs

Batch Processing with Concurrency Control

Process multiple texts concurrently with automatic provider-specific resource management:

# Process multiple texts (auto-numbered output files)
poetry run voicegenhub synthesize "First text" "Second text" "Third text" --provider edge --output batch_output

# Control concurrency (auto-configured per provider if not specified)
poetry run voicegenhub synthesize "Text 1" "Text 2" --provider bark --max-concurrent 2 --output output

Provider Concurrency Limits (automatic):

Fast providers (Edge, Kokoro, ElevenLabs): use all CPU cores
Heavy providers: Bark runs 2 concurrent jobs, Chatterbox runs 1

Benefits:

Model instances are shared across concurrent jobs (no reloading)
Automatic resource management prevents system overload
Progress tracking for each job
Failed jobs don't stop the batch
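
The general pattern behind this kind of batch processing is a semaphore that caps in-flight jobs, combined with a gather that collects failures instead of aborting. The sketch below illustrates that pattern with plain asyncio; it is not VoiceGenHub's implementation, and run_batch, default_limit, and the synthesize callable are hypothetical names.

```python
import asyncio
import os
from typing import Any, Awaitable, Callable, List, Optional

# Limits from the list above; other providers default to the CPU core count.
HEAVY_PROVIDER_LIMITS = {"bark": 2, "chatterbox": 1}


def default_limit(provider: str) -> int:
    """Return the concurrency limit for a provider."""
    return HEAVY_PROVIDER_LIMITS.get(provider, os.cpu_count() or 1)


async def run_batch(
    texts: List[str],
    synthesize: Callable[[str], Awaitable[Any]],
    provider: str,
    max_concurrent: Optional[int] = None,
) -> List[Any]:
    """Run synthesize() for every text with at most N jobs in flight."""
    semaphore = asyncio.Semaphore(max_concurrent or default_limit(provider))

    async def guarded(text: str) -> Any:
        async with semaphore:
            return await synthesize(text)

    # return_exceptions=True keeps one failed job from stopping the batch;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(*(guarded(t) for t in texts), return_exceptions=True)
```

Results are returned in input order, so each entry can be matched to its text and checked with isinstance(result, Exception).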

Voice Cloning with Kokoro and Chatterbox

VoiceGenHub supports zero-shot voice cloning by combining Kokoro's lightweight voices with Chatterbox's advanced cloning capabilities. This allows you to create custom voices that sound like Kokoro but with Chatterbox's superior quality and emotion control.

Step-by-Step Guide

Step 1: Generate a Kokoro voice sample (modify as desired or keep undistorted):

# Undistorted voice
poetry run voicegenhub synthesize "Sample text for cloning." --provider kokoro --voice kokoro-am_michael --output reference.wav --format wav

# Or with effects (e.g., horror/distortion)
poetry run voicegenhub synthesize "Sample text for cloning." --provider kokoro --voice kokoro-am_adam --output reference.wav --format wav --pitch-shift -2 --distortion 0.02 --lowpass 2000 --normalize

Step 2: Clone the voice with Chatterbox:

poetry run voicegenhub synthesize "Your longer text here." --provider chatterbox --voice chatterbox-default --output cloned_voice.wav --audio-prompt reference.wav

Step 3 (optional): Adjust emotion and style:

poetry run voicegenhub synthesize "Your text." --provider chatterbox --voice chatterbox-default --output cloned_voice.wav --audio-prompt reference.wav --exaggeration 0.8 --cfg-weight 0.7

Tips:

Use short, clear reference audio (5-10 seconds) for best cloning results
Combine multiple Kokoro samples with FFmpeg for richer voice profiles
Experiment with Kokoro effects to create unique voice characteristics before cloning
Chatterbox supports multilingual cloning from any language reference audio
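
The 5-10 second guideline for reference audio can be checked before cloning. This sketch uses the standard-library wave module, which reads uncompressed PCM WAV files only; it assumes the reference file is in that format. Both function names are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_good_reference_length(path: str, minimum: float = 5.0, maximum: float = 10.0) -> bool:
    """Check the clip against the 5-10 second guideline from the tips above."""
    return minimum <= wav_duration_seconds(path) <= maximum
```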

Concurrency and Memory Management

Async Concurrency (Recommended):

Use the synthesize command with multiple texts for safe concurrent processing within a single process
Models are loaded once and shared across concurrent jobs
Prevents out-of-memory (OOM) errors from duplicate model loading
Automatic provider-specific limits ensure stability

Multiprocessing Risks:

Running multiple CLI processes simultaneously (e.g., via scripts or parallel jobs) loads separate model instances
Heavy models like Chatterbox (3.7GB) and Bark (4GB) can cause OOM when duplicated across processes
Recommendation: Use async batch processing instead of multiprocessing for heavy providers
For light providers (Edge, Kokoro), multiprocessing is safer due to minimal memory footprint
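
The multiprocessing risk is simple arithmetic: each process loads its own copy of the model. The sketch below gives a rough lower bound using the sizes quoted above; it ignores activations, audio buffers, and Python overhead, so real usage will be higher. estimated_model_memory_gb is a hypothetical helper.

```python
# Model sizes in GB, as quoted above.
MODEL_SIZE_GB = {"chatterbox": 3.7, "bark": 4.0}


def estimated_model_memory_gb(provider: str, processes: int) -> float:
    """Estimate memory held by model weights when each process loads its own copy."""
    if processes < 1:
        raise ValueError("processes must be at least 1")
    return MODEL_SIZE_GB[provider] * processes
```

For example, three separate Chatterbox processes need roughly 3 x 3.7 = 11.1 GB for the weights alone, while a single process with async batching stays near 3.7 GB.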

Performance Comparison: All TTS Providers

Here's how all providers compare in terms of speed and quality:

| Provider | Quality (MOS) | Startup Time | Sequential (per request) | Async (3x parallel) | Model Size | Commercial License |
| --- | --- | --- | --- | --- | --- | --- |
| Edge TTS | 3.8/5 | 4.9s | 3.2s | 2.5s | 0MB (cloud) | ✅ Free |
| Kokoro | 3.5/5 | 94s | 14.2s | 2.5s | 625MB | ✅ Apache 2.0 |
| Bark | 4.2/5 | 180s | 25-40s | 8-12s | 4GB | ✅ MIT |
| Chatterbox | 4.3/5 | 120s | 15-30s | 5-15s | 3.7GB | ✅ MIT |
| ElevenLabs | 4.5/5* | 2s | 3-5s | 2-3s | 0MB (cloud) | ⚠️ Paid API |

*ElevenLabs quality estimate based on provider reputation; not yet tested with API key.

Key Findings:

Chatterbox: Excellent quality with emotion control and multilingual support; MIT licensed, works on CPU
Bark: Highest naturalness for premium narration; MIT licensed (full commercial freedom)
Kokoro: Best balance of quality vs speed for offline use; Apache 2.0 licensed
Edge TTS: Best for real-time, low-latency applications; cloud-based (Microsoft)
ElevenLabs: Highest quality but requires paid API and credit card
For commercial purposes: Use Bark (MIT), Chatterbox (MIT), or Kokoro (Apache 2.0)

Chatterbox Concurrency Analysis

Memory Safety: Chatterbox uses a single shared model instance (about 3.7GB) across all threads, so the model is not duplicated. Running 2-8 concurrent threads is safe with no OOM risk from duplicate model loading.

Performance: ~2.8x speedup at 4 threads on CPU. Optimal thread count: 2-4 threads.

View Interactive Performance Analysis - Shows speedup curves, memory usage, and timing breakdowns.

Commercial Licensing

✅ Commercially Safe Models:

Bark (MIT License) - Commercial use allowed; keep the MIT copyright and license notice with any copies ⭐
Chatterbox (MIT License) - Commercial use allowed; keep the MIT copyright and license notice with any copies
Qwen 3 TTS (Apache 2.0) - Commercial use allowed, attribution required
Kokoro (Apache 2.0) - Commercial use allowed, attribution required
Edge TTS (Microsoft) - Commercial use allowed
ElevenLabs (Paid API) - Commercial use with valid subscription

Provider Licenses

For transparency and compliance, here are direct links to the official license terms for each supported TTS provider:

Edge TTS (Microsoft): Microsoft Terms of Use
Kokoro TTS: Apache License 2.0
ElevenLabs TTS: ElevenLabs Terms of Service
Bark TTS: MIT License
Chatterbox TTS: MIT License
Qwen 3 TTS: Apache License 2.0

Optional Dependencies

Install optional TTS providers:

# Install Kokoro TTS (self-hosted lightweight TTS)
pip install voicegenhub[kokoro]

# Install Bark (self-hosted high-naturalness TTS)
pip install voicegenhub[bark]

# Install Chatterbox TTS (MIT licensed, multilingual with emotion control)
pip install chatterbox-tts

# Install Qwen 3 TTS (Apache 2.0 licensed, state-of-the-art multilingual TTS)
pip install voicegenhub[qwen]

Kokoro TTS Installation

Kokoro TTS requires Python 3.11 or higher.

Windows & Python 3.13+ Build Limitation

Important: On Windows with Python 3.13+, Kokoro TTS (via curated-tokenizers) may require compiling native code if pre-built wheels are not available. This requires Microsoft Visual C++ Build Tools.

If you see errors about missing C++ compilers or build failures when installing Kokoro, follow these steps:

Download and install Microsoft Visual C++ Build Tools.
During installation, select "Desktop development with C++" workload.
After installation, restart your terminal and retry installation:

poetry install --with kokoro
# or
pip install voicegenhub[kokoro]

If you still see build errors, check for available wheels for curated-tokenizers on PyPI. If no wheel is available for your Python version, you must build from source (requires Visual C++).

Recommendation: For easiest installation, use Python 3.11 or 3.12 on Windows until wheels for Python 3.13+ are published.
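
The version guidance above can be checked from Python before attempting the install. This is a sketch using the standard library only; kokoro_install_advice is a hypothetical helper and the returned messages are illustrative.

```python
import platform
import sys
from typing import Optional, Tuple


def kokoro_install_advice(
    version: Optional[Tuple[int, int]] = None,
    system: Optional[str] = None,
) -> str:
    """Classify the current (or given) environment for a Kokoro install."""
    version = version or tuple(sys.version_info[:2])
    system = system or platform.system()
    if version < (3, 11):
        return "unsupported: Kokoro TTS requires Python 3.11 or higher"
    if system == "Windows" and version >= (3, 13):
        # Pre-built wheels may be missing, so a C++ compiler may be required.
        return "may need build tools: install Microsoft Visual C++ Build Tools"
    return "ok"


if __name__ == "__main__":
    print(kokoro_install_advice())
```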

Installation

# Using Poetry (recommended):
poetry add voicegenhub[kokoro]
# or:
poetry install --with kokoro
