Heads up: this repo was vibe-coded, but tested in production. Six audiobooks (~50 hours of audio) synthesized end-to-end across multiple Vast.ai pods, bugs surfaced and fixed in the wild, and ~$12 of GPU time burned validating the orchestrator and its chunks-only pipeline. Use it, fork it, file issues.
Convert PDF/EPUB books to MP3 audiobooks. Two backends:
- Kokoro — fast, free, runs on CPU/MPS/CUDA, no voice cloning.
- Chatterbox — slower, GPU-friendly, clones any voice from a 10-30s reference clip. Cloud GPU orchestrator included for batch runs on Vast.ai or RunPod.
Output MP3s are tagged with title and artist from EPUB metadata, named Title - Author.mp3.
git clone https://github.com/heitzlki/book2audio
cd book2audio
pip install -e .[kokoro] # ~2 min, no GPU needed
book2audio mybook.epub --out-dir out/
# → out/<Title> - <Author>.mp3

Works on macOS (M1/M2/M3 via MPS), Linux, Windows. Real-time factor on M1 Max ≈ 0.4 (a 10-hour book takes ~4 hours of CPU).
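To turn that real-time factor into a planning number, a rough estimate is audio duration × RTF. The ~150 words-per-minute narration pace below is my assumption, not a project constant:

```python
def estimate_cpu_hours(word_count: int, rtf: float = 0.4, words_per_min: int = 150) -> float:
    """Rough synthesis time: narrated audio duration × real-time factor."""
    audio_hours = word_count / words_per_min / 60
    return audio_hours * rtf

# A ~90k-word novel ≈ 10 hours of audio → ~4 hours of CPU at RTF 0.4
print(f"{estimate_cpu_hours(90_000):.1f} h")
```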
# Kokoro voice options (run with -v <voice>):
book2audio mybook.epub --out-dir out/ -v af_bella # American female
book2audio mybook.epub --out-dir out/ -v am_adam # American male
book2audio mybook.epub --out-dir out/ -v bf_emma # British female
# Full list: https://github.com/hexgrad/kokoro

You'll need a voice reference: a 10-30s clean audio clip of the voice you want, in WAV format. You provide your own — this repo ships no voice samples. Cloning a real person's voice without consent is your responsibility, not the library's.
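To sanity-check a reference clip before burning GPU time, a small pydub snippet (pydub needs ffmpeg on PATH; the filenames here are placeholders) can verify the 10-30s guideline and convert to WAV:

```python
from pydub import AudioSegment

clip = AudioSegment.from_file("my_voice.m4a")  # any format ffmpeg can decode
seconds = len(clip) / 1000                     # pydub lengths are in milliseconds
if not 10 <= seconds <= 30:
    print(f"warning: clip is {seconds:.1f}s; 10-30s of clean speech works best")
clip.export("my_voice.wav", format="wav")      # pass this path to --voice-ref
```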
pip install -e .[chatterbox]
book2audio mybook.epub --out-dir out/ -e chatterbox \
--voice-ref ./my_voice.wav
# Auto-detects CUDA → MPS → CPU. Force with --device cuda.

Local Chatterbox on a 4090 ≈ ~3-4× real-time. On MPS (Apple Silicon) it's ~10× slower than CUDA — usable for short books, painful for full novels.
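The CUDA → MPS → CPU fallback is the standard PyTorch probe; a sketch of what it usually looks like (not necessarily the project's exact code):

```python
import torch

def pick_device(forced: str | None = None) -> str:
    """Best available device, mirroring the CUDA → MPS → CPU fallback."""
    if forced:                                 # e.g. --device cuda from the CLI
        return forced
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():      # Apple Silicon
        return "mps"
    return "cpu"
```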
For multi-book runs, rent GPUs. The orchestrator at scripts/remote_run.py handles provisioning, sync, the actual run, fetching MP3s back, and tearing down. A three-layer cost-safety chain (graceful self-destruct, deadman timer, provider credit cap) caps your worst-case spend even if your laptop sleeps mid-run.
# 1. Set up Vast.ai
export VAST_API_KEY=...
vastai set api-key $VAST_API_KEY # one-time, ~/.vast_api_key
# 2. Run
python scripts/remote_run.py all \
--provider vast --gpu rtx5090 --max-price 0.45 \
--books books/ \
--voice-ref ./voice.wav \
--output out/
# Provisions pod → rsyncs project + books + voice → bootstraps Chatterbox →
# synthesizes → rsyncs MP3s back → destroys pod.

Cost on a single RTX 5090 (~$0.34/hr): a 600-page book runs ~3-4 hours wall-clock for ~$5-7.
For batch runs, parallel-all provisions one pod per book (books over --max-chars-per-pod are auto-split across extra pods and reassembled via ffmpeg at the end; the split arithmetic is sketched below):
caffeinate -i nohup python scripts/remote_run.py parallel-all \
--provider vast --gpu rtx5090 --max-price 0.45 \
--min-cpu-cores 16 --min-dlperf 200 --min-pcie-gen 4 \
--books books/ \
--voice-ref ./voice.wav \
--output out/ \
--max-chars-per-pod 600000 \
--checkpoint-interval 60 \
--log-dir /tmp/pod_logs \
--yes > /tmp/orchestrator.log 2>&1 &
# Live dashboard in another terminal:
python scripts/live_dash.py out/

Wall-clock ≈ the time for the slowest pod's book (not the sum). Per-pod chunk-level checkpoints are synced back to local every minute, so a dead pod can be recovered without losing work.
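For intuition on the auto-split: a book's character count divided by --max-chars-per-pod gives the pod count, and each pod gets one --chunk M-of-N slice. A toy sketch (the helper name is mine, not the repo's):

```python
import math

def plan_pods(total_chars: int, max_chars_per_pod: int = 600_000) -> list[str]:
    """How many pods a book needs and the --chunk slice each one synthesizes."""
    n = max(1, math.ceil(total_chars / max_chars_per_pod))
    return [f"--chunk {i}-of-{n}" for i in range(1, n + 1)]

print(plan_pods(1_500_000))  # ['--chunk 1-of-3', '--chunk 2-of-3', '--chunk 3-of-3']
```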
Vast.ai hosts vary wildly even at the same --max-price. Without filters, you can land an 8-core CPU + PCIe 4 + dlperf 155 host that's 2.5× slower than a 32-core PCIe 5 dlperf 200 peer at the same hourly rate. Recommended:
--min-cpu-cores 16 # Chatterbox is bottlenecked by CPU↔GPU transfers
--min-dlperf 200 # Vast's synthetic GPU benchmark
--min-pcie-gen 4 # 5 is rare; 4 is the safe floor
--min-inet-up 500 # Pod upload Mbps; this is what rsync_down speed bottlenecks on
RunPod has no equivalent host-quality filter — its pricing tiers reflect quality. Use Secure tier for predictable throughput.
Measured on Vast.ai RTX 5090 with Chatterbox 0.1.x:
| Hardware tier | Throughput (chunks/min) | $/1k chunks |
|---|---|---|
| 32-core PCIe 5, dlperf 200+ | 17–19 | $0.32 |
| 24-core PCIe 5, dlperf 190 | 14–16 | $0.40 |
| 16-core PCIe 4, dlperf 200 | 9–11 | $0.55 |
| 8-core PCIe 4, dlperf 155 | 7 | $0.80 |
Each chunk is ~300 chars. A typical novel is ~30k chunks → $10-25 with the right host. Apply the recommended --min-* filters to avoid the bottom row.
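To turn the table into a dollar estimate for your own book, the arithmetic is just chunks ÷ throughput × hourly price; for example:

```python
def estimate_cost(n_chunks: int, chunks_per_min: float, usd_per_hour: float) -> tuple[float, float]:
    """Return (gpu_hours, usd) for one book at a given host throughput."""
    gpu_hours = n_chunks / chunks_per_min / 60
    return gpu_hours, gpu_hours * usd_per_hour

# ~30k chunks on a top-tier host at ~$0.34/hr:
hours, usd = estimate_cost(30_000, 18, 0.34)
print(f"{hours:.1f} GPU-hours, ${usd:.2f}")  # ≈ 27.8 GPU-hours, ≈ $9.44
```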
Pipeline:
- extract — EPUB via ebooklib + BeautifulSoup; PDF via pymupdf4llm; plain text passed through.
- chunk — sentence-aware split, ~300 chars per chunk (Chatterbox limit). `--chunk M-of-N` splits a book across pods.
- synthesize — `TTSAdapter.synthesize(text)` returns float32 samples; each chunk is persisted to `<book>.chunks/NNNNN.npy` for resume (see the sketch after this list).
- stitch — concatenate chunks with 0.8s silence; pydub → ffmpeg → MP3 with `title`/`artist` ID3 tags.
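The resume behavior falls out of the per-chunk .npy layout: skip anything already on disk, synthesize the rest. A minimal sketch (the adapter and chunk list are stand-ins):

```python
from pathlib import Path
import numpy as np

def synthesize_book(chunks: list[str], adapter, chunk_dir: Path) -> None:
    """One float32 .npy per chunk; finished chunks are skipped, so an
    interrupted run picks up exactly where it left off."""
    chunk_dir.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(chunks):
        out = chunk_dir / f"{i:05d}.npy"       # <book>.chunks/NNNNN.npy
        if out.exists():                       # resume: skip completed work
            continue
        samples = adapter.synthesize(text)     # float32 waveform samples
        np.save(out, np.asarray(samples, dtype=np.float32))
```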
For cloud runs, scripts/remote_run.py wraps the pipeline in a provision → rsync → bootstrap → fetch → destroy lifecycle and parallelizes across pods. Pods run book2audio with --no-mp3 (chunks only); the orchestrator rsyncs chunks home and stitches locally. Two consequences: GPU pods don't waste cycles on CPU-bound stitching, and a dropped pod never costs more than the chunks already on disk.
Three layers, defense in depth:
- Graceful self-destruct — bootstrap's last act on success is to write `0` to `/workspace/deadline.epoch`, triggering the deadman watcher to destroy within 60s. So pods die as soon as their work is done, not when the orchestrator notices.
- Deadman timer — every pod runs a background watcher that polls `/workspace/deadline.epoch` (default: 7h from boot). When the deadline passes, the watcher calls Vast's destroy API on its own instance ID. This caps the worst case regardless of laptop state. Extend with `python scripts/remote_run.py extend-deadman --output out/ --hours 3`. (A sketch of the watcher follows below.)
- Provider credit cap — load Vast/RunPod with a fixed amount, no auto-billing. When credit hits zero, the provider destroys all your instances. Hard ceiling.
You shouldn't have to rely on layer 3. But it's there.
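A stripped-down sketch of what such a watcher can look like, assuming the vastai CLI and API key are present on the pod and the instance ID is exported by the bootstrap (both assumptions, not verified against the repo's script):

```python
import os
import subprocess
import time
from pathlib import Path

DEADLINE_FILE = Path("/workspace/deadline.epoch")
INSTANCE_ID = os.environ["VAST_INSTANCE_ID"]  # assumed to be set by bootstrap

def watch(poll_seconds: int = 60) -> None:
    """Destroy this pod once the deadline passes; a written 0 means 'done, die now'."""
    while True:
        deadline = float(DEADLINE_FILE.read_text().strip())
        if time.time() >= deadline:
            # Self-destruct via the provider CLI; nothing survives the deadline.
            subprocess.run(["vastai", "destroy", "instance", INSTANCE_ID], check=False)
            return
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch()
```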
- Vast SSH-key propagation race: occasionally a pod is TCP-reachable but `authorized_keys` hasn't been baked yet. We probe SSH+auth before continuing (`wait_ssh` does this), but worst case the pod fails to provision and we re-run that single book.
- RunPod has no host-quality filter: their `gpu_type` is the only selector. Use Secure tier.
- Chatterbox 0.1.x has a 300-char per-chunk limit. We respect it; the chunker splits at sentence boundaries to keep seams clean (see the sketch after this list).
- Voice cloning consent is your responsibility. This repo ships no voice samples. Cloning the voice of a real person without their consent may be illegal in your jurisdiction.
- One TTS engine = one Docker image for cloud runs. The pre-baked image is Chatterbox-only. To use Kokoro on cloud, build your own image (see `docker/`).
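A minimal sketch of sentence-boundary chunking under the 300-char cap (a plain regex split, not necessarily the splitter the repo uses; sentences longer than the cap pass through unsplit here):

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks that stay under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```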
The pre-baked ghcr.io/heitzlki/book2audio-chatterbox:latest is what the orchestrator pulls by default. To use your own (a fork, a different TTS engine, a CUDA version pin):
# Build + push (one time)
cd docker/
./build_and_push.sh ghcr.io/<you>/book2audio-chatterbox:latest
# Tell the orchestrator to use it
export BOOK2AUDIO_IMAGE=ghcr.io/<you>/book2audio-chatterbox:latest
python scripts/remote_run.py parallel-all ...

remote_run.py provision # one pod, no work
remote_run.py wait # block until SSH ready
remote_run.py sync-up # rsync project + books + voice up
remote_run.py run # bash /workspace/remote_bootstrap.sh
remote_run.py sync-down # rsync MP3s down
remote_run.py destroy # tear it down
remote_run.py all # all of the above, single pod
remote_run.py parallel-all # all of the above, N pods in parallel
remote_run.py monitor # poll a running pod (rsync + log)
remote_run.py status # summarize a parallel-all run state
remote_run.py recover # re-sync + destroy after orchestrator crash
remote_run.py extend-deadman # push the deadman deadline back
book2audio sample.epub --out-dir out/ # default: kokoro
book2audio sample.epub --out-dir out/ -e chatterbox --voice-ref voice.wav
book2audio --help

uv sync --extra dev
pre-commit install
pytest tests/

`pre-commit install` wires the hooks. Every git commit then runs ruff (lint + format), mypy, and the standard hygiene checks.
Issues + PRs welcome. No CLA.
MIT — see LICENSE.