The fastest open-source TTS engine with voice cloning that runs entirely on CPU.
Single-file C++ inference runtime for Pocket TTS, Kyutai's lightweight text-to-speech model. Runs via ONNX Runtime with zero-shot voice cloning from short audio samples.
One file (pocket_tts.cpp), no frameworks, no Python dependency at runtime.
9.2x realtime (RTFx) with 30ms time-to-first-audio on a Ryzen 7 3800X, INT8 precision.
- Single-file implementation — all inference logic in one C++ source file
- CLI, HTTP server, and shared library — built from the same source
- Pipelined streaming — latent generation and audio decoding run in parallel for low latency (~30ms first chunk)
- Voice cloning — clone any voice from a short audio sample (WAV, MP3, FLAC)
- Two-layer disk cache — voice embeddings (`.emb`) and transformer KV state (`.kv`) are cached to disk, making repeated use of the same voice near-instant
- INT8 / FP32 precision — INT8 by default for ~4x smaller models at comparable quality
- Built-in profiler — `--profile` flag for per-operation timing
- OpenAI-compatible API — drop-in replacement for the `/v1/audio/speech` endpoint
- CMake 3.28+
- C++17 compiler (GCC, Clang)
- Linux, macOS, or Windows
All dependencies (ONNX Runtime, SentencePiece, dr_wav) are fetched automatically by CMake.
The included export_onnx.py script exports, quantizes, and validates all ONNX models from the upstream Pocket TTS weights. Requires uv:
```
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cpu
uv pip install "pocket-tts @ git+https://github.com/kyutai-labs/pocket-tts.git"
uv pip install onnx onnxruntime
python export_onnx.py
```

This downloads the weights automatically and produces the following in `models/`:
```
models/
├── flow_lm_flow_int8.onnx
├── flow_lm_flow.onnx
├── flow_lm_main_int8.onnx
├── flow_lm_main.onnx
├── mimi_decoder_int8.onnx
├── mimi_decoder.onnx
├── mimi_encoder.onnx
├── text_conditioner.onnx
└── tokenizer.model
```
```
PocketTTS.cpp/
├── CMakeLists.txt
├── pocket_tts.cpp
├── export_onnx.py
├── models/          ← generated by export_onnx.py
└── voices/
    └── YourVoice.wav
```
Place at least one .wav voice sample in voices/. For INT8 inference (default), you need the _int8 variants plus mimi_encoder.onnx and text_conditioner.onnx.
```
cmake -B .build -DCMAKE_BUILD_TYPE=Release
cmake --build .build -j$(nproc)
```

This produces the `pocket-tts` CLI executable. To also build the shared library for FFI:

```
cmake -B .build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIB=ON
cmake --build .build -j$(nproc)
```

```
# Generate speech to a WAV file
./pocket-tts "Hello, world." voice.wav output.wav

# Pipe raw PCM to another process (e.g. aplay, ffplay, sox)
./pocket-tts --stdout "Hello, world." voice.wav | aplay -f FLOAT_LE -r 24000 -c 1

# INT8 (default) or FP32
./pocket-tts --precision fp32 "Hello." voice.wav output.wav

# Adjust generation parameters
./pocket-tts --temperature 0.5 --lsd-steps 10 "Hello." voice.wav output.wav
```

The voice argument can be a filename in the `voices/` directory (e.g. `voice.wav`) or an absolute path to any WAV, MP3, or FLAC file.
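The raw stream from `--stdout` is headerless 24 kHz mono float32 PCM, so it is easy to inspect from any language. A minimal Python sketch (the helper names are illustrative, not part of PocketTTS):

```python
import struct

SAMPLE_RATE = 24000  # --stdout emits headerless f32le mono PCM at 24 kHz

def pcm_duration_seconds(raw: bytes, rate: int = SAMPLE_RATE) -> float:
    """Duration of a raw f32le mono PCM buffer, in seconds."""
    return len(raw) / (4 * rate)  # 4 bytes per float32 sample

def peak_amplitude(raw: bytes) -> float:
    """Largest absolute sample value; a quick clipping sanity check."""
    samples = struct.unpack(f"<{len(raw) // 4}f", raw)
    return max((abs(s) for s in samples), default=0.0)
```

Fed from a pipe (`./pocket-tts --stdout "Hello." voice.wav | python3 your_script.py`, reading `sys.stdin.buffer`), these helpers report how much audio was generated and whether any sample clips.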
```
./pocket-tts --server --port 8080
```

Endpoints:

- `POST /v1/audio/speech` — OpenAI-compatible TTS (JSON body: `{"input": "...", "voice": "..."}`)
- `POST /tts` — streaming TTS (JSON body: `{"text": "...", "voice": "..."}`)
- `GET /health` — health check
The /v1/audio/speech endpoint is compatible with the OpenAI TTS API. Any client that supports OpenAI's TTS (SillyTavern, Open WebUI, etc.) can use PocketTTS.cpp as a drop-in replacement by pointing the base URL to http://localhost:8080. The model and speed fields are accepted but ignored. Supported response_format values are wav (default) and pcm.
```
curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello world!", "voice": "voice", "response_format": "wav"}' \
  --output speech.wav
```

The `/tts` endpoint streams raw chunked PCM (`audio/pcm;rate=24000;encoding=float;bits=32`) for low-latency applications.
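Because the endpoint speaks plain JSON over HTTP, no SDK is required. A sketch using only the Python standard library (the function name is illustrative, not part of PocketTTS):

```python
import json
import urllib.request

def build_speech_request(text: str, voice: str,
                         base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/audio/speech endpoint."""
    body = json.dumps({
        "model": "tts-1",          # accepted but ignored by the server
        "input": text,
        "voice": voice,
        "response_format": "wav",  # "wav" (default) or "pcm"
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With a server running:
# with urllib.request.urlopen(build_speech_request("Hello world!", "voice")) as r:
#     open("speech.wav", "wb").write(r.read())
```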
Build with `-DBUILD_SHARED_LIB=ON` to produce `libpocket_tts.so`. The C API:

```c
void*  ptt_create(const char* models_dir, const char* voices_dir,
                  const char* tokenizer_path, const char* precision,
                  float temperature, int lsd_steps, int num_threads);
double ptt_warmup(void* handle);
void   ptt_free_audio(float* samples);
void   ptt_destroy(void* handle);

// Streaming
void*  ptt_stream_start(void* handle, const char* text, const char* voice);
int    ptt_stream_read(void* stream_ctx, float** out_samples, int* out_len);
void   ptt_stream_end(void* stream_ctx);
```

PocketTTS uses two layers of disk caching, both stored under `voices/.cache/`:
- Voice embeddings (`.emb`) — the output of the Mimi encoder for each voice sample. Avoids re-encoding the same WAV file on every run. Generated automatically on first use.
- KV state snapshots (`.kv`) — the transformer's internal KV-cache state after voice conditioning. This is the expensive part: on a cold start, voice conditioning takes hundreds of milliseconds, while a cached `.kv` file restores in ~4 ms. For multi-sentence input, the KV snapshot is also held in memory, so only the first sentence pays the disk-load cost.
Cache files are invalidated automatically when the source WAV is modified. To clear all caches:
```
rm -rf voices/.cache/
```

To disable caching entirely, pass `--no-cache`.
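The mtime-based invalidation can be illustrated with a small sketch. This is a hypothetical scheme for exposition only — the actual key derivation in `pocket_tts.cpp` may differ:

```python
import hashlib
import os

def cache_key(wav_path: str) -> str:
    """Hypothetical cache key: hash of the voice file's path, size, and mtime.

    Any edit to the WAV bumps its mtime, so stale .emb/.kv entries are never
    reused. Illustrates the invalidation behaviour, not the implementation.
    """
    st = os.stat(wav_path)
    tag = f"{os.path.abspath(wav_path)}:{st.st_size}:{st.st_mtime_ns}"
    return hashlib.sha1(tag.encode()).hexdigest()[:16]
```

A `.emb` or `.kv` file named after this key is simply never found again once the source WAV changes, which is equivalent to invalidating it.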
| Flag | Default | Description |
|---|---|---|
| `-h, --help` | — | Show usage and all options |
| `--precision` | `int8` | Model precision (`int8` or `fp32`) |
| `--temperature` | `0.7` | Sampling temperature |
| `--lsd-steps` | `1` | Flow matching ODE solver steps |
| `--eos-threshold` | `-4.0` | EOS detection threshold (lower = later cutoff) |
| `--eos-extra` | `-1` | Extra frames after EOS (`-1` = auto from text length) |
| `--noise-clamp` | `0` | Clamp noise magnitude (`0` = disabled, matches upstream) |
| `--threads` | `0` | Total thread budget (`0` = half of available cores) |
| `--models-dir` | `models` | Path to ONNX model directory |
| `--voices-dir` | `voices` | Path to voice samples directory |
| `--tokenizer` | `models/tokenizer.model` | Path to SentencePiece tokenizer |
| `--no-cache` | — | Disable all disk caching (`.emb` and `.kv` files) |
| `--stdout` | — | Output raw f32le PCM to stdout |
| `--verbose` | — | Enable verbose output |
| `--profile` | — | Print per-operation timing report after generation |
| `--server` | — | Start HTTP server mode |
| `--port` | `8080` | Server port |
- Kyutai Labs — Pocket TTS model and original Python implementation (MIT)
- Verylicious/pocket-tts-ungated — Ungated model weights and tokenizer (CC-BY-4.0)
MIT — see LICENSE.