PocketTTS.cpp

The fastest open-source TTS engine with voice cloning that runs entirely on CPU.

Single-file C++ inference runtime for Pocket TTS, Kyutai's lightweight text-to-speech model. Runs via ONNX Runtime with zero-shot voice cloning from short audio samples.

One file (pocket_tts.cpp), no frameworks, no Python dependency at runtime.

Benchmark: 9.2x realtime (RTFx) with about 30 ms time-to-first-audio on a Ryzen 7 3800X at INT8 precision.

Features

Single-file implementation — all inference logic in one C++ source file
CLI, HTTP server, and shared library — built from the same source
Pipelined streaming — latent generation and audio decoding run in parallel for low latency (~30ms first chunk)
Voice cloning — clone any voice from a short audio sample (WAV, MP3, FLAC)
Two-layer disk cache — voice embeddings (.emb) and transformer KV state (.kv) are cached to disk, making repeated use of the same voice near-instant
INT8 / FP32 precision — INT8 by default for ~4x smaller models at comparable quality
Built-in profiler — --profile flag for per-operation timing
OpenAI-compatible API — drop-in replacement for /v1/audio/speech endpoint
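
The pipelined streaming feature above can be pictured as a two-stage producer/consumer pipeline. The following is a minimal sketch of that general pattern, not the code in pocket_tts.cpp: `Channel`, `Latent`, `Audio`, and `run_pipeline` are hypothetical names, and the generate/decode callbacks stand in for the real model calls.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal thread-safe queue; close() signals that no more items will arrive.
template <class T>
class Channel {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(mu_); items_.push(std::move(item)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(mu_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until an item is available; returns nullopt once closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] { return !items_.empty() || closed_; });
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<T> items_;
    bool closed_ = false;
};

using Latent = std::vector<float>;  // hypothetical stand-in for one latent frame
using Audio  = std::vector<float>;  // decoded PCM samples

// Generates latents on a worker thread while the calling thread decodes them,
// so the first audio chunk is ready before generation has finished.
std::vector<float> run_pipeline(int frames,
                                const std::function<Latent(int)>& generate,
                                const std::function<Audio(const Latent&)>& decode) {
    Channel<Latent> channel;
    std::thread producer([&] {
        for (int i = 0; i < frames; ++i) channel.push(generate(i));
        channel.close();
    });
    std::vector<float> pcm;
    while (auto latent = channel.pop()) {
        Audio chunk = decode(*latent);
        pcm.insert(pcm.end(), chunk.begin(), chunk.end());
    }
    producer.join();
    return pcm;
}
```

The point of the design is that decoding of frame N overlaps with generation of frame N+1, so latency to the first chunk depends on one frame rather than the whole utterance.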

Requirements

CMake 3.28+
C++17 compiler (GCC, Clang)
Linux, macOS, or Windows

All dependencies (ONNX Runtime, SentencePiece, dr_wav) are fetched automatically by CMake.

Setup

Export ONNX Models

The included export_onnx.py script exports, quantizes, and validates all ONNX models from the upstream Pocket TTS weights. Requires uv:

uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cpu
uv pip install "pocket-tts @ git+https://github.com/kyutai-labs/pocket-tts.git"
uv pip install onnx onnxruntime
python export_onnx.py

This downloads the weights automatically and produces the following in models/:

models/
├── flow_lm_flow_int8.onnx
├── flow_lm_flow.onnx
├── flow_lm_main_int8.onnx
├── flow_lm_main.onnx
├── mimi_decoder_int8.onnx
├── mimi_decoder.onnx
├── mimi_encoder.onnx
├── text_conditioner.onnx
└── tokenizer.model

Directory Structure

PocketTTS.cpp/
├── CMakeLists.txt
├── pocket_tts.cpp
├── export_onnx.py
├── models/          ← generated by export_onnx.py
└── voices/
    └── YourVoice.wav

Place at least one .wav voice sample in voices/. For INT8 inference (the default), you need the _int8 variants of flow_lm_flow, flow_lm_main, and mimi_decoder, plus mimi_encoder.onnx and text_conditioner.onnx, which have no INT8 variant.
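
As a quick pre-flight check, the file requirements above could be verified with a sketch like the following. The INT8 list follows the sentence above; the FP32 list is an assumption (the same names without the `_int8` suffix), and `required_models` and `missing_models` are illustrative helpers, not functions from pocket_tts.cpp.

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Model files expected for a given precision.
// ASSUMPTION: fp32 uses the un-suffixed variants of the three quantized models.
std::vector<std::string> required_models(const std::string& precision) {
    const std::string suffix = (precision == "int8") ? "_int8" : "";
    return {
        "flow_lm_flow" + suffix + ".onnx",
        "flow_lm_main" + suffix + ".onnx",
        "mimi_decoder" + suffix + ".onnx",
        "mimi_encoder.onnx",      // no quantized variant is exported
        "text_conditioner.onnx",  // no quantized variant is exported
    };
}

// Returns the names of required files that are absent from models_dir.
std::vector<std::string> missing_models(const std::filesystem::path& models_dir,
                                        const std::string& precision) {
    std::vector<std::string> missing;
    for (const auto& name : required_models(precision)) {
        if (!std::filesystem::exists(models_dir / name)) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

An empty result from `missing_models("models", "int8")` would indicate that the export step produced everything the default configuration needs.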

Build

cmake -B .build -DCMAKE_BUILD_TYPE=Release
cmake --build .build -j$(nproc)

This produces the pocket-tts CLI executable. To also build the shared library for FFI:

cmake -B .build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIB=ON
cmake --build .build -j$(nproc)

Usage

CLI

# Generate speech to a WAV file
./pocket-tts "Hello, world." voice.wav output.wav

# Pipe raw PCM to another process (e.g. aplay, ffplay, sox)
./pocket-tts --stdout "Hello, world." voice.wav | aplay -f FLOAT_LE -r 24000 -c 1

# INT8 (default) or FP32
./pocket-tts --precision fp32 "Hello." voice.wav output.wav

# Adjust generation parameters
./pocket-tts --temperature 0.5 --lsd-steps 10 "Hello." voice.wav output.wav

The voice argument can be a filename in the voices/ directory (e.g. voice.wav) or an absolute path to any WAV, MP3, or FLAC file.
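
A consumer of the `--stdout` stream often needs 16-bit samples rather than floats. The sketch below shows one way to convert, assuming the stream is mono 32-bit float PCM at 24 kHz (as the aplay invocation above suggests) with samples nominally in [-1, 1]; the value range is an assumption, and both helper names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts float PCM to signed 16-bit PCM.
// ASSUMPTION: samples are nominally in [-1, 1]; out-of-range values are clamped.
std::vector<std::int16_t> f32_to_s16(const std::vector<float>& samples) {
    std::vector<std::int16_t> out;
    out.reserve(samples.size());
    for (float s : samples) {
        const float clamped = std::clamp(s, -1.0f, 1.0f);
        out.push_back(static_cast<std::int16_t>(std::lround(clamped * 32767.0f)));
    }
    return out;
}

// Duration of a mono clip at the given sample rate (24 kHz by default).
double duration_seconds(std::size_t sample_count, int sample_rate = 24000) {
    return static_cast<double>(sample_count) / sample_rate;
}
```

For example, 48000 samples read from the pipe correspond to two seconds of audio at 24 kHz.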

HTTP Server

./pocket-tts --server --port 8080

Endpoints:

POST /v1/audio/speech — OpenAI-compatible TTS (JSON body: {"input": "...", "voice": "..."})
POST /tts — streaming TTS (JSON body: {"text": "...", "voice": "..."})
GET /health — health check

The /v1/audio/speech endpoint is compatible with the OpenAI TTS API. Any client that supports OpenAI's TTS (SillyTavern, Open WebUI, etc.) can use PocketTTS.cpp as a drop-in replacement by pointing the base URL to http://localhost:8080. The model and speed fields are accepted but ignored. Supported response_format values are wav (default) and pcm.

curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello world!", "voice": "voice", "response_format": "wav"}' \
  --output speech.wav
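
When calling the endpoint from C++ rather than curl, the request body has to be valid JSON even if the text contains quotes or newlines. The following sketch builds such a body using only the fields named above; `json_escape` and `speech_request_body` are hypothetical helpers, and sending the request is left to whichever HTTP client you use.

```cpp
#include <cstdio>
#include <string>

// Escapes a string for use inside a JSON string literal.
std::string json_escape(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters use the \u00XX form.
                    char buf[7];
                    std::snprintf(buf, sizeof buf, "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Builds the JSON body for POST /v1/audio/speech.
std::string speech_request_body(const std::string& input,
                                const std::string& voice,
                                const std::string& response_format = "wav") {
    return "{\"input\":\"" + json_escape(input) +
           "\",\"voice\":\"" + json_escape(voice) +
           "\",\"response_format\":\"" + json_escape(response_format) + "\"}";
}
```

The `model` and `speed` fields are omitted here because the server ignores them; clients that require them can add them without changing the result.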

The /tts endpoint streams raw chunked PCM (audio/pcm;rate=24000;encoding=float;bits=32) for low-latency applications.
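
Because /tts returns headerless samples, a client that wants to save the stream as a playable file has to add a container itself. The sketch below builds a conventional 44-byte WAV header for mono 32-bit float PCM, matching the content type above; it is a generic WAV layout rather than anything taken from pocket_tts.cpp, and `float_wav_header` is a hypothetical helper. Some strict parsers expect an extra `fact` chunk for float WAV files, which this sketch omits.

```cpp
#include <cstdint>
#include <vector>

// Appends an integer in little-endian byte order.
static void put_le(std::vector<std::uint8_t>& out, std::uint32_t value, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        out.push_back(static_cast<std::uint8_t>((value >> (8 * i)) & 0xFF));
    }
}

// Appends a four-character chunk tag such as "RIFF".
static void put_tag(std::vector<std::uint8_t>& out, const char* tag) {
    out.insert(out.end(), tag, tag + 4);
}

// Builds a 44-byte WAV header for mono 32-bit float PCM.
std::vector<std::uint8_t> float_wav_header(std::uint32_t sample_count,
                                           std::uint32_t sample_rate = 24000) {
    const std::uint32_t channels = 1;
    const std::uint32_t bytes_per_sample = 4;
    const std::uint32_t data_bytes = sample_count * channels * bytes_per_sample;

    std::vector<std::uint8_t> h;
    put_tag(h, "RIFF");
    put_le(h, 36 + data_bytes, 4);                            // RIFF chunk size
    put_tag(h, "WAVE");
    put_tag(h, "fmt ");
    put_le(h, 16, 4);                                         // fmt chunk size
    put_le(h, 3, 2);                                          // 3 = IEEE float
    put_le(h, channels, 2);
    put_le(h, sample_rate, 4);
    put_le(h, sample_rate * channels * bytes_per_sample, 4);  // byte rate
    put_le(h, channels * bytes_per_sample, 2);                // block align
    put_le(h, 8 * bytes_per_sample, 2);                       // bits per sample
    put_tag(h, "data");
    put_le(h, data_bytes, 4);
    return h;
}
```

Since the sample count is only known once the stream ends, a client would typically buffer the samples (or rewrite the header afterwards) and then write the header followed by the raw float data.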

Shared Library (FFI)

Build with -DBUILD_SHARED_LIB=ON to produce libpocket_tts.so. The C API:

void*  ptt_create(const char* models_dir, const char* voices_dir,
                  const char* tokenizer_path, const char* precision,
                  float temperature, int lsd_steps, int num_threads);
double ptt_warmup(void* handle);
void   ptt_free_audio(float* samples);
void   ptt_destroy(void* handle);

// Streaming
void*  ptt_stream_start(void* handle, const char* text, const char* voice);
int    ptt_stream_read(void* stream_ctx, float** out_samples, int* out_len);
void   ptt_stream_end(void* stream_ctx);
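
The declarations above do not document return values or ownership, so the following is only a sketch of how a caller might drain a stream. It assumes that `ptt_stream_read` returns a positive value while it delivers a chunk and a non-positive value when the stream is finished, and that each chunk must be released with `ptt_free_audio`; both points are assumptions to verify against pocket_tts.cpp. The read and free functions are passed in as parameters so the loop can be exercised without the library, and `drain_stream` is a hypothetical helper.

```cpp
#include <vector>

// Collects every chunk from a stream into one buffer.
// ASSUMPTION: read_chunk returns > 0 while a chunk was delivered and <= 0 at
// end of stream; each delivered chunk is owned by the caller and must be freed.
template <class ReadFn, class FreeFn>
std::vector<float> drain_stream(void* stream_ctx, ReadFn read_chunk, FreeFn free_audio) {
    std::vector<float> pcm;
    float* chunk = nullptr;
    int len = 0;
    while (read_chunk(stream_ctx, &chunk, &len) > 0) {
        if (chunk != nullptr && len > 0) {
            pcm.insert(pcm.end(), chunk, chunk + len);
        }
        if (chunk != nullptr) {
            free_audio(chunk);
        }
        chunk = nullptr;
        len = 0;
    }
    return pcm;
}
```

With the real library linked in, the call would look like `drain_stream(ctx, ptt_stream_read, ptt_free_audio)`, placed between `ptt_stream_start` and `ptt_stream_end`.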

Caching

PocketTTS.cpp uses two layers of disk caching, both stored under voices/.cache/:

Voice embeddings (.emb) — The output of the Mimi encoder for each voice sample. Avoids re-encoding the same WAV file on every run. Generated automatically on first use.

KV state snapshots (.kv) — The transformer's internal KV cache state after voice conditioning. This is the expensive part — on a cold start, voice conditioning takes hundreds of milliseconds. A cached .kv file restores in ~4ms. For multi-sentence input, the KV snapshot is also held in memory so only the first sentence pays the disk load cost.

Cache files are invalidated automatically when the source WAV is modified. To clear all caches:

rm -rf voices/.cache/

To disable caching entirely, pass --no-cache.

Options

| Flag | Default | Description |
| --- | --- | --- |
| `-h`, `--help` | — | Show usage and all options |
| `--precision` | `int8` | Model precision (`int8` or `fp32`) |
| `--temperature` | `0.7` | Sampling temperature |
| `--lsd-steps` | `1` | Flow matching ODE solver steps |
| `--eos-threshold` | `-4.0` | EOS detection threshold (lower = later cutoff) |
| `--eos-extra` | `-1` | Extra frames after EOS (`-1` = auto from text length) |
| `--noise-clamp` | `0` | Clamp noise magnitude (`0` = disabled, matches upstream) |
| `--threads` | `0` | Total thread budget (`0` = half of available cores) |
| `--models-dir` | `models` | Path to ONNX model directory |
| `--voices-dir` | `voices` | Path to voice samples directory |
| `--tokenizer` | `models/tokenizer.model` | Path to SentencePiece tokenizer |
| `--no-cache` | — | Disable all disk caching (`.emb` and `.kv` files) |
| `--stdout` | — | Output raw f32le PCM to stdout |
| `--verbose` | — | Enable verbose output |
| `--profile` | — | Print per-operation timing report after generation |
| `--server` | — | Start HTTP server mode |
| `--port` | `8080` | Server port |
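
The `--threads` default can be illustrated with a short sketch. It assumes that "half of available cores" is computed from `std::thread::hardware_concurrency()` with integer division and a floor of one thread; the rounding and the fallback are assumptions, and `resolve_thread_budget` is a hypothetical name.

```cpp
#include <algorithm>
#include <thread>

// Resolves the effective thread budget for a --threads value.
// ASSUMPTION: 0 maps to half of the reported cores, never less than one.
unsigned resolve_thread_budget(unsigned requested,
                               unsigned available = std::thread::hardware_concurrency()) {
    if (requested > 0) {
        return requested;
    }
    // hardware_concurrency() may return 0 when the core count is unknown.
    return std::max(1u, available / 2);
}
```

On a 16-thread machine such as the Ryzen 7 3800X, this would give a default budget of 8 threads.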

Acknowledgments

Kyutai Labs — Pocket TTS model and original Python implementation (MIT)
Verylicious/pocket-tts-ungated — Ungated model weights and tokenizer (CC-BY-4.0)

License

MIT — see LICENSE.
