Fast, local, multi-protocol Text-to-Speech server powered by Kyutai's Pocket TTS.
- Runs at 1.5x real-time on older CPUs (tested on Haswell)
- Voice Cloning support - use your own `.wav` files
- Optimized Loading - converts voices to `.safetensors` for instant startup
- Audio Caching - instant response for repeated phrases
- Streaming Support - real-time audio generation
- Stuttering Protection - runs with High Priority to prevent choppiness under load
- Multi-protocol: OpenAI standard, XTTS-compatible, WebSocket, and GET streaming
- SillyTavern ready - works with XTTSv2 and OpenAI Compatible TTS providers
```shell
git clone https://github.com/IceFog72/pocket-tts-openapi
cd pocket-tts-openapi
```

**Windows**
- Run `install.bat` - sets up Python venv and installs dependencies
- Run `start.bat` - starts the server (automatically sets High Priority)

**Linux**
- Run `chmod +x install.sh start.sh update.sh` (first time only)
- Run `./install.sh` - sets up Python venv and installs dependencies
- Run `./start.sh` - starts the server
To get the latest version of the project:
- Windows: Run `update.bat`
- Linux: Run `./update.sh`
Server: http://localhost:8005
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Server status |
| POST | `/v1/audio/speech` | Generate audio (OpenAI standard) |
| GET | `/tts_stream?text=...&voice=...` | Stream audio via GET |
| POST | `/tts_to_audio/` | Generate audio (XTTS format) |
| GET | `/v1/voices` | Voice list (OpenAI format) |
| GET | `/speakers` | Voice list (XTTS format) |
| WS | `/v1/audio/stream` | WebSocket streaming |
```shell
curl http://localhost:8005/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello!", "voice": "nova", "response_format": "mp3"}' \
  --output hello.mp3
```

```shell
curl "http://localhost:8005/tts_stream?text=Hello&voice=nova&format=mp3" \
  --output hello.mp3
```

```python
import asyncio, json, websockets

async def stream_tts():
    async with websockets.connect("ws://localhost:8005/v1/audio/stream") as ws:
        await ws.send(json.dumps({"text": "Hello", "voice": "nova", "format": "mp3"}))
        while True:
            msg = await ws.recv()
            if isinstance(msg, bytes):
                # Binary frames carry audio chunks; append them as they arrive
                with open("out.mp3", "ab") as f:
                    f.write(msg)
            elif json.loads(msg).get("status") == "done":
                break

asyncio.run(stream_tts())
```

The server works with SillyTavern in three ways:
The SillyTavern-PocketTTS-WebSocket extension provides the best experience:
- Persistent WebSocket connection: no reconnect overhead per sentence
- Sentence-split generation: each sentence gets exact audio duration, no gaps
- Built-in TTS playback bar with seek, volume, speed controls
- Model selection (CPU/GPU, fast/quality)
- Voice auto-discovery
- Streaming response via async generator: audio plays while the server generates
Use SillyTavern's built-in XTTSv2 provider and set the endpoint to http://host:8005.

Use SillyTavern's built-in OpenAI Compatible provider and set the endpoint to http://host:8005/v1/audio/speech.
| Provider | Set endpoint to | Voices auto-discovered | Sentence streaming |
|---|---|---|---|
| PocketTTS Extension | `http://host:8005` | Yes | Yes |
| XTTSv2 | `http://host:8005` | Yes | No |
| OpenAI Compatible | `http://host:8005/v1/audio/speech` | Yes | No |
For a desktop experience and AI integration, use the Ice Open TTS Proxy.
- Desktop GUI: Text input, voice selection, playback controls.
- Live Mode: Speaks as you type with real-time setting sync.
- AI Agent Bridge: OpenAI-compatible API server on port 8181.

To launch it:
- Ensure the main TTS server is running (Step 2 above).
- Go to the `ice-open-tts-test-proxy/` directory.
- Windows: Run `start_ice_gui.bat`
- Linux: Run `./start_ice_gui.sh`
See AGENTS.md for detailed AI Agent integration.
- Pocket TTS: `alba`, `marius`, `javert`, `jean`, `fantine`, `cosette`, `eponine`, `azelma`
- OpenAI aliases: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
- Custom: place `.wav` files in `voices/` (auto-converted to `.safetensors`)
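The server presumably resolves the OpenAI alias names to its native Pocket TTS voices internally. A minimal sketch of such a lookup follows; the specific alias-to-voice pairing below is illustrative only, not the server's actual table:

```python
# Hypothetical alias table; the real server's mapping may differ.
ALIAS_TO_VOICE = {
    "alloy": "alba",
    "echo": "marius",
    "fable": "javert",
    "onyx": "jean",
    "nova": "fantine",
    "shimmer": "cosette",
}

def resolve_voice(name: str) -> str:
    """Return the underlying voice for an OpenAI alias,
    or the name unchanged if it is already a native voice."""
    return ALIAS_TO_VOICE.get(name, name)

print(resolve_voice("nova"))  # resolves via the illustrative table
print(resolve_voice("alba"))  # native names pass through unchanged
```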
- `voices/`: Place your source `.wav` files here (~10 seconds for best results).
- `embeddings/`: Optimized `.safetensors` files are stored here for instant loading.
- Accept the license at https://huggingface.co/kyutai/pocket-tts
- Login: `huggingface-cli login`
- Restart the server
- High Priority Mode: Auto-runs as High Priority on Windows.
- Quality Parameters: `temperature` (0.0-2.0), `lsd_decode_steps` (1-50).
- Large Block Handling: Auto-splits long text into sentences.
- Model Tiers: `tts-1` (fast), `tts-1-hd` (quality), `tts-1-cuda`, `tts-1-hd-cuda`.
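The large-block handling above can be sketched as a simple sentence splitter. This is an illustrative regex approach, not necessarily the server's implementation:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after ., !, or ? followed by whitespace; the punctuation
    # stays attached to its sentence via the lookbehind.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

chunks = split_sentences("Hello there! How are you? I am fine.")
print(chunks)  # ['Hello there!', 'How are you?', 'I am fine.']
```

Each chunk can then be synthesized and streamed independently, which is what keeps long inputs responsive.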
- Auto-caches generated files (default: 10).
- Cache includes voice, text, and quality parameters.
- Cache hit = instant response.
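Since a cache entry depends on voice, text, and quality parameters, a lookup key can be derived by hashing those fields together. A hedged sketch follows; the field names and hashing scheme are assumptions, not the server's exact implementation:

```python
import hashlib
import json

def cache_key(text: str, voice: str, params: dict) -> str:
    # Serialize deterministically (sorted keys) so identical requests
    # always hash to the same key.
    payload = json.dumps(
        {"text": text, "voice": voice, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = cache_key("Hello!", "nova", {"temperature": 0.7})
k2 = cache_key("Hello!", "nova", {"temperature": 0.7})
k3 = cache_key("Hello!", "nova", {"temperature": 0.9})
print(k1 == k2, k1 == k3)  # True False
```

This is why changing any quality parameter produces a cache miss even for identical text.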
- 401 Unauthorized: run `huggingface-cli login`
- Port conflict: the server auto-selects the next free port
- Slow first run: downloads a ~236MB model
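The port-conflict behaviour can be reproduced with a small scan: try binding each port in turn and take the first one that succeeds. A sketch under that assumption, not the server's exact code:

```python
import socket

def find_free_port(start: int = 8005, limit: int = 50) -> int:
    """Return the first TCP port >= start that can be bound locally."""
    for port in range(start, start + limit):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port  # bind succeeded, so the port is free
            except OSError:
                continue  # port in use; try the next one
    raise RuntimeError("no free port found")

print(find_free_port())
```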
- Platform: Windows and Linux
- Dependencies: Python 3.10+, FFmpeg (for MP3/AAC/etc.)
- Cache: `./audio_cache/`
- Model cache: `~/.cache/huggingface`
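To reclaim disk space, the audio cache directory can simply be deleted; this sketch assumes the default `./audio_cache/` path above and that the server recreates the directory on demand (not confirmed by the source):

```python
import shutil
from pathlib import Path

def clear_cache(cache_dir: str = "./audio_cache") -> int:
    """Delete the audio cache; return how many files were removed."""
    path = Path(cache_dir)
    if not path.exists():
        return 0
    count = sum(1 for p in path.rglob("*") if p.is_file())
    shutil.rmtree(path)
    return count

# Usage: clear_cache() removes all cached audio for the default path.
```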
Discord: https://discord.gg/2tJcWeMjFQ • SillyTavern Discord
Inspired by kyutai-tts-openai-api