Note: This project is fully vibecoded by Codex, so parts may or may not work as-is in your environment.
Two command-line tools for Qwen3 text-to-speech, plus an interactive wrapper:
- qwen3_tts_cli.py: text-to-speech with voice-design and custom-voice
- qwen3_clone_cli.py: voice cloning from a reference WAV
- qwen3_tui.py: interactive text UI wizard that wraps both scripts
Included showcase sample:
prospector_cartoon2.wav
Use Python 3.11+ (3.12 recommended):
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
The first run downloads model assets from Hugging Face.
In some environments, qwen-tts dependencies can fail with:
RuntimeError: cannot cache function '__o_fold'
If you see that, prefix commands with:
NUMBA_CACHE_DIR=/tmp/numba-cache
If your machine runs without this, skip it.
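If you launch the scripts from your own Python code rather than a shell, the same workaround can be applied to the child process only. This is a minimal sketch, assuming the environment variable above is all the workaround needs; `numba_safe_env` is a hypothetical helper and not part of this repo.

```python
import os


def numba_safe_env(cache_dir="/tmp/numba-cache", base_env=None):
    """Return a copy of the environment with NUMBA_CACHE_DIR set.

    Hypothetical helper: mirrors the shell prefix shown above without
    touching the parent process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Keep any value the user has already exported.
    env.setdefault("NUMBA_CACHE_DIR", cache_dir)
    return env
```

A call such as `subprocess.run([sys.executable, "qwen3_tts_cli.py", "--help"], env=numba_safe_env())` would then behave like the prefixed shell command.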
Help:
python qwen3_tts_cli.py --help
- text (optional): text to synthesize
  - If omitted, the tool reads from piped stdin
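The stdin fallback for the positional text can be sketched roughly as follows. This is an illustration of the described behavior, not the script's actual code; `resolve_text` and its error messages are hypothetical.

```python
import sys


def resolve_text(text_arg, stdin=None):
    """Pick the text to synthesize: the argument if given, else piped stdin.

    Hypothetical sketch of the behavior described above.
    """
    stdin = sys.stdin if stdin is None else stdin
    if text_arg:
        return text_arg
    # An interactive terminal means nothing was piped in.
    if stdin.isatty():
        raise SystemExit("error: provide text as an argument or pipe it via stdin")
    text = stdin.read().strip()
    if not text:
        raise SystemExit("error: stdin was empty")
    return text
```

With this shape, `echo "Hello" | python qwen3_tts_cli.py ...` and passing the text as an argument would lead to the same synthesis input.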
- -m, --model {voice-design,custom-voice}: select model family
- --model-id: override model ID/path
- -l, --language: target language (default Auto)
- -i, --instruct: style instruction (required for voice-design)
- -s, --speaker: speaker name (required for custom-voice)
- --list-speakers: print predefined speakers and exit
- -o, --output: output WAV path (default output.wav)

- --top-k (default 50)
- --top-p (default 1.0)
- --temperature (default 0.9)
- --repetition-penalty (default 1.05)
- --max-new-tokens (default 2048)

- --device: auto|cuda:0|mps|cpu
- --dtype: auto|bfloat16|float16|float32
- --attn-impl: attention implementation override
- -v, --verbose: print load/generation timing
- With the default output output.wav, the tool avoids overwriting by creating output-2.wav, output-3.wav, etc.
- For voice-design, --instruct is mandatory.
- For custom-voice, --speaker is mandatory.
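The overwrite-avoidance rule can be pictured with a short sketch. It assumes the numbering starts at 2 and uses a `stem-N.suffix` pattern, as the file names above suggest; `next_free_path` is a hypothetical name and the script's real logic may differ.

```python
from pathlib import Path


def next_free_path(path):
    """Return `path` if it is unused, else the first free `stem-N.suffix` with N >= 2.

    Hypothetical sketch of the naming scheme described above.
    """
    path = Path(path)
    if not path.exists():
        return path
    n = 2
    while True:
        candidate = path.with_name(f"{path.stem}-{n}{path.suffix}")
        if not candidate.exists():
            return candidate
        n += 1
```

Passing an explicit --output is the simplest way to control the file name yourself.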
Voice-design:
python qwen3_tts_cli.py \
--model voice-design \
--language English \
--instruct "Warm natural narration, mid-pitch, clear pacing." \
--output output.wav \
"Hello from Qwen3 voice design."
Custom-voice:
python qwen3_tts_cli.py \
--model custom-voice \
--speaker Vivian \
--output custom_voice.wav \
"This is a custom speaker test."
Help:
python qwen3_clone_cli.py --help
- text (required): text to synthesize in the cloned voice
- -r, --ref-audio (required): reference WAV path/URL/base64 supported by qwen-tts
- -t, --ref-text: transcript for reference audio
- --x-vector-only: embedding-only mode; allows omitting --ref-text
- -l, --language: target language (default Auto)
- -o, --output: output WAV path (default clone.wav)
- --model: model ID/path (default Qwen/Qwen3-TTS-12Hz-1.7B-Base)
Supports the same --top-k, --top-p, --temperature, --repetition-penalty, --max-new-tokens, --device, --dtype, --attn-impl, and -v flags as qwen3_tts_cli.py.
- --ref-text is required unless --x-vector-only is set.
- Standard cloning quality is usually better with an accurate --ref-text.
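A conditional requirement like this is commonly enforced after parsing, since argparse has no built-in "required unless" option. The sketch below shows one way to do it with only the flags relevant to the rule; `parse_clone_args` is a hypothetical function and is not taken from qwen3_clone_cli.py.

```python
import argparse


def parse_clone_args(argv):
    """Parse a reduced set of clone flags and enforce the --ref-text rule.

    Hypothetical sketch: only the flags needed to show the check are declared.
    """
    parser = argparse.ArgumentParser(prog="qwen3_clone_cli.py")
    parser.add_argument("text")
    parser.add_argument("-r", "--ref-audio", required=True)
    parser.add_argument("-t", "--ref-text")
    parser.add_argument("--x-vector-only", action="store_true")
    args = parser.parse_args(argv)
    # A transcript is only optional in embedding-only mode.
    if not args.x_vector_only and not args.ref_text:
        parser.error("--ref-text is required unless --x-vector-only is set")
    return args
```

`parser.error` prints the usage line and exits with status 2, which matches how argparse reports other missing arguments.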
Transcript-guided clone:
python qwen3_clone_cli.py \
--ref-audio your_reference.wav \
--ref-text "Transcript of the reference audio." \
--output cloned.wav \
"New sentence in the cloned voice."
Embedding-only clone:
python qwen3_clone_cli.py \
--ref-audio your_reference.wav \
--x-vector-only \
--output cloned_xvector.wav \
"Quick cloned sample without ref text."
Run the interactive wizard:
python qwen3_tui.py
What it does:
- Presents a menu for voice-design, custom-voice, or voice clone
- Prompts for required fields (text, output path, instruct/speaker/ref-audio/ref-text)
- Optionally prompts for advanced generation/runtime settings
- Shows the final command before execution and asks for confirmation
- Runs the underlying scripts (qwen3_tts_cli.py / qwen3_clone_cli.py)
- After successful generation, optionally plays audio via macOS afplay
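The "show the command, then confirm, then run" step might look something like the sketch below. It is an illustration of the flow only; `build_tts_command` and `confirm_and_run` are hypothetical names, and the wizard's real prompts and options are not reproduced here.

```python
import shlex
import subprocess
import sys


def build_tts_command(text, output, instruct):
    """Assemble a voice-design invocation as an argument list (hypothetical helper)."""
    return [
        sys.executable, "qwen3_tts_cli.py",
        "--model", "voice-design",
        "--instruct", instruct,
        "--output", output,
        text,
    ]


def confirm_and_run(cmd, ask=input, run=subprocess.run):
    """Print a shell-quoted preview, ask for confirmation, then run the command.

    Returns the exit code, or None if the user declined.
    """
    print("Command:", shlex.join(cmd))
    if ask("Run it? [y/N] ").strip().lower() not in {"y", "yes"}:
        return None
    return run(cmd, check=False).returncode
```

Keeping the command as a list and quoting it only for display avoids shell-escaping problems with instructions that contain quotes or exclamation marks.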
By default, it sets NUMBA_CACHE_DIR=/tmp/numba-cache for launched commands.
Disable that behavior:
python qwen3_tui.py --no-numba-cache
Based on Qwen3-TTS guidance from the official release/docs:
- Separate speaker identity from speaking style:
  - Identity: age range, gender presentation, accent, timbre
  - Style: mood, pacing, emphasis, rhythm, energy
- Be explicit and concrete:
  - Better: elderly prospector, bright nasal twang, frontier drawl, energetic exclamations
  - Worse: funny voice
- Include delivery constraints:
  - Examples: clear articulation, stable pacing, avoid distortion, natural pauses
- Avoid contradictory instructions:
  - very slow and very fast in one prompt hurts consistency
- Iterate with small changes:
  - Change one trait at a time (pitch, speed, emotion) to steer output predictably
- For multilingual or mixed-language text:
  - Keep language expectations explicit in both --language and --instruct
Reusable template:
[persona + age/context], [timbre + pitch + accent], [emotion + pacing + rhythm], [constraints for clarity/stability]
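If you generate many prompts, the template can be filled programmatically so that each slot stays separate and can be changed one at a time. A small sketch, assuming the four slots map to four strings; `build_instruct` is a hypothetical helper, not part of the scripts.

```python
def build_instruct(persona, voice, delivery, constraints):
    """Join the four template slots into a single --instruct string.

    Hypothetical helper: empty slots are skipped and trailing
    punctuation is normalized so the result ends with one period.
    """
    parts = [persona, voice, delivery, constraints]
    cleaned = [p.strip().rstrip(".,") for p in parts if p and p.strip()]
    if not cleaned:
        raise ValueError("at least one template slot must be filled")
    return ", ".join(cleaned) + "."
```

The returned string can be passed directly as the value of --instruct.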
Command used to generate prospector_cartoon2.wav:
python qwen3_tts_cli.py \
--model voice-design \
--language English \
--output prospector_cartoon2.wav \
--instruct "Cartoon old-timey Wild West prospector, wiry elderly male voice, bright nasal twang with exaggerated frontier drawl, high-energy comedic shouting, big pitch jumps on excited words, dramatic pauses and gleeful yelps, crisp intelligible words without distortion." \
"There's gold in these hills! Gold!! Yippeee!! Ho-ho! Strike up the banjo, we're rich by sunset!"
If needed:
NUMBA_CACHE_DIR=/tmp/numba-cache python qwen3_tts_cli.py ...
Inspect the generated files:
ls -lh *.wav
file *.wav
- Official blog: https://qwen.ai/blog?id=qwen3tts-0115
- Qwen3-TTS model usage docs and examples: https://www.alibabacloud.com/help/en/model-studio/qwen-tts
- Qwen3-TTS release overview mirror: https://www.alibabacloud.com/blog/602401