Skip to content

VolgaGerm/PocketTTS.cpp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PocketTTS.cpp

The fastest open-source TTS engine with voice cloning that runs entirely on CPU.

Single-file C++ inference runtime for Pocket TTS, Kyutai's lightweight text-to-speech model. Runs via ONNX Runtime with zero-shot voice cloning from short audio samples.

One file (pocket_tts.cpp), no frameworks, no Python dependency at runtime.

9.2x realtime (RTFx) with 30ms time-to-first-audio on a Ryzen 7 3800X, INT8 precision.

Features

  • Single-file implementation — all inference logic in one C++ source file
  • CLI, HTTP server, and shared library — built from the same source
  • Pipelined streaming — latent generation and audio decoding run in parallel for low latency (~30ms first chunk)
  • Voice cloning — clone any voice from a short audio sample (WAV, MP3, FLAC)
  • Two-layer disk cache — voice embeddings (.emb) and transformer KV state (.kv) are cached to disk, making repeated use of the same voice near-instant
  • INT8 / FP32 precision — INT8 by default for ~4x smaller models at comparable quality
  • Built-in profiler--profile flag for per-operation timing
  • OpenAI-compatible API — drop-in replacement for /v1/audio/speech endpoint

Requirements

  • CMake 3.28+
  • C++17 compiler (GCC, Clang)
  • Linux, macOS, or Windows

All dependencies (ONNX Runtime, SentencePiece, dr_wav) are fetched automatically by CMake.

Setup

Export ONNX Models

The included export_onnx.py script exports, quantizes, and validates all ONNX models from the upstream Pocket TTS weights. Requires uv:

uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cpu
uv pip install "pocket-tts @ git+https://github.com/kyutai-labs/pocket-tts.git"
uv pip install onnx onnxruntime
python export_onnx.py

This downloads the weights automatically and produces the following in models/:

models/
├── flow_lm_flow_int8.onnx
├── flow_lm_flow.onnx
├── flow_lm_main_int8.onnx
├── flow_lm_main.onnx
├── mimi_decoder_int8.onnx
├── mimi_decoder.onnx
├── mimi_encoder.onnx
├── text_conditioner.onnx
└── tokenizer.model

Directory Structure

PocketTTS.cpp/
├── CMakeLists.txt
├── pocket_tts.cpp
├── export_onnx.py
├── models/          ← generated by export_onnx.py
└── voices/
    └── YourVoice.wav

Place at least one .wav voice sample in voices/. For INT8 inference (default), you need the _int8 variants plus mimi_encoder.onnx and text_conditioner.onnx.

Build

cmake -B .build -DCMAKE_BUILD_TYPE=Release
cmake --build .build -j$(nproc)

This produces the pocket-tts CLI executable. To also build the shared library for FFI:

cmake -B .build -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIB=ON
cmake --build .build -j$(nproc)

Usage

CLI

# Generate speech to a WAV file
./pocket-tts "Hello, world." voice.wav output.wav

# Pipe raw PCM to another process (e.g. aplay, ffplay, sox)
./pocket-tts --stdout "Hello, world." voice.wav | aplay -f FLOAT_LE -r 24000 -c 1

# INT8 (default) or FP32
./pocket-tts --precision fp32 "Hello." voice.wav output.wav

# Adjust generation parameters
./pocket-tts --temperature 0.5 --lsd-steps 10 "Hello." voice.wav output.wav

The voice argument can be a filename in the voices/ directory (e.g. voice.wav) or an absolute path to any WAV, MP3, or FLAC file.

HTTP Server

./pocket-tts --server --port 8080

Endpoints:

  • POST /v1/audio/speech — OpenAI-compatible TTS (JSON body: {"input": "...", "voice": "..."})
  • POST /tts — streaming TTS (JSON body: {"text": "...", "voice": "..."})
  • GET /health — health check

The /v1/audio/speech endpoint is compatible with the OpenAI TTS API. Any client that supports OpenAI's TTS (SillyTavern, Open WebUI, etc.) can use PocketTTS.cpp as a drop-in replacement by pointing the base URL to http://localhost:8080. The model and speed fields are accepted but ignored. Supported response_format values are wav (default) and pcm.

curl -X POST http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello world!", "voice": "voice", "response_format": "wav"}' \
  --output speech.wav

The /tts endpoint streams raw chunked PCM (audio/pcm;rate=24000;encoding=float;bits=32) for low-latency applications.

Shared Library (FFI)

Build with -DBUILD_SHARED_LIB=ON to produce libpocket_tts.so. The C API:

void*  ptt_create(const char* models_dir, const char* voices_dir,
                  const char* tokenizer_path, const char* precision,
                  float temperature, int lsd_steps, int num_threads);
double ptt_warmup(void* handle);
void   ptt_free_audio(float* samples);
void   ptt_destroy(void* handle);

// Streaming
void*  ptt_stream_start(void* handle, const char* text, const char* voice);
int    ptt_stream_read(void* stream_ctx, float** out_samples, int* out_len);
void   ptt_stream_end(void* stream_ctx);

Caching

PocketTTS uses two layers of disk caching, both stored under voices/.cache/:

Voice embeddings (.emb) — The output of the Mimi encoder for each voice sample. Avoids re-encoding the same WAV file on every run. Generated automatically on first use.

KV state snapshots (.kv) — The transformer's internal KV cache state after voice conditioning. This is the expensive part — on a cold start, voice conditioning takes hundreds of milliseconds. A cached .kv file restores in ~4ms. For multi-sentence input, the KV snapshot is also held in memory so only the first sentence pays the disk load cost.

Cache files are invalidated automatically when the source WAV is modified. To clear all caches:

rm -rf voices/.cache/

To disable caching entirely, pass --no-cache.

Options

Flag Default Description
-h, --help Show usage and all options
--precision int8 Model precision (int8 or fp32)
--temperature 0.7 Sampling temperature
--lsd-steps 1 Flow matching ODE solver steps
--eos-threshold -4.0 EOS detection threshold (lower = later cutoff)
--eos-extra -1 Extra frames after EOS (-1 = auto from text length)
--noise-clamp 0 Clamp noise magnitude (0 = disabled, matches upstream)
--threads 0 Total thread budget (0 = half of available cores)
--models-dir models Path to ONNX model directory
--voices-dir voices Path to voice samples directory
--tokenizer models/tokenizer.model Path to SentencePiece tokenizer
--no-cache Disable all disk caching (.emb and .kv files)
--stdout Output raw f32le PCM to stdout
--verbose Enable verbose output
--profile Print per-operation timing report after generation
--server Start HTTP server mode
--port 8080 Server port

Acknowledgments

License

MIT — see LICENSE.

About

Single-file C++ TTS runtime for Pocket TTS with ONNX Runtime — voice cloning, streaming, HTTP server, FFI C API

Topics

Resources

License

Stars

Watchers

Forks