Heads up: this repo was vibe-coded, but tested in production. Six audiobooks (~50 hours of audio) synthesized end-to-end across multiple Vast.ai pods, bugs surfaced and fixed in the wild, and ~$12 of GPU time burned validating the orchestrator and its chunks-only pipeline. Use it, fork it, file issues.
Convert PDF/EPUB books to MP3 audiobooks. Two backends:
- Kokoro — fast, free, runs on CPU/MPS/CUDA, no voice cloning.
- Chatterbox — slower, GPU-friendly, clones any voice from a 10-30s reference clip. Cloud GPU orchestrator included for batch runs on Vast.ai or RunPod.
Output MP3s are tagged with title and artist from EPUB metadata, named Title - Author.mp3.
git clone https://github.com/heitzlki/book2audio
cd book2audio
pip install -e .[kokoro] # ~2 min, no GPU needed
book2audio mybook.epub --out-dir out/
# → out/<Title> - <Author>.mp3

Works on macOS (M1/M2/M3 via MPS), Linux, Windows. Real-time factor on M1 Max ≈ 0.4 (a 10-hour book takes ~4 hours of CPU).
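To turn that real-time factor into a planning number, a rough estimate is audio duration × RTF. The ~150 words-per-minute narration pace below is my assumption, not a project constant:

```python
def estimate_cpu_hours(word_count: int, rtf: float = 0.4, words_per_min: int = 150) -> float:
    """Rough synthesis time: narrated audio duration × real-time factor."""
    audio_hours = word_count / words_per_min / 60
    return audio_hours * rtf

# A ~90k-word novel ≈ 10 hours of audio → ~4 hours of CPU at RTF 0.4
print(f"{estimate_cpu_hours(90_000):.1f} h")
```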
# Kokoro voice options (run with -v <voice>):
book2audio mybook.epub --out-dir out/ -v af_bella # American female
book2audio mybook.epub --out-dir out/ -v am_adam # American male
book2audio mybook.epub --out-dir out/ -v bf_emma # British female
# Full list: https://github.com/hexgrad/kokoro

You'll need a voice reference: a 10-30s clean audio clip of the voice you want, in WAV format. You provide your own — this repo ships no voice samples. Cloning a real person's voice without consent is your responsibility, not the library's.
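To sanity-check a reference clip before burning GPU time, a small pydub snippet (pydub needs ffmpeg on PATH; the filenames here are placeholders) can verify the 10-30s guideline and convert to WAV:

```python
from pydub import AudioSegment

clip = AudioSegment.from_file("my_voice.m4a")  # any format ffmpeg can decode
seconds = len(clip) / 1000                     # pydub lengths are in milliseconds
if not 10 <= seconds <= 30:
    print(f"warning: clip is {seconds:.1f}s; 10-30s of clean speech works best")
clip.export("my_voice.wav", format="wav")      # pass this path to --voice-ref
```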
pip install -e .[chatterbox]
book2audio mybook.epub --out-dir out/ -e chatterbox \
--voice-ref ./my_voice.wav
# Auto-detects CUDA → MPS → CPU. Force with --device cuda.

Local Chatterbox on a 4090 ≈ ~3-4× real-time. On MPS (Apple Silicon) it's ~10× slower than CUDA — usable for short books, painful for full novels.
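The CUDA → MPS → CPU fallback is the standard PyTorch probe; a sketch of what it usually looks like (not necessarily the project's exact code):

```python
import torch

def pick_device(forced: str | None = None) -> str:
    """Best available device, mirroring the CUDA → MPS → CPU fallback."""
    if forced:                                 # e.g. --device cuda from the CLI
        return forced
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():      # Apple Silicon
        return "mps"
    return "cpu"
```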
For multi-book runs, rent GPUs. The orchestrator at scripts/remote_run.py handles provisioning, sync, the actual run, fetching MP3s back, and tearing down. A three-layer cost-safety chain (graceful self-destruct, deadman timer, provider credit cap) caps your worst-case spend even if your laptop sleeps mid-run.
# 1. Set up Vast.ai
export VAST_API_KEY=...
vastai set api-key $VAST_API_KEY # one-time, ~/.vast_api_key
# 2. Run
python scripts/remote_run.py all \
--provider vast --gpu rtx5090 --max-price 0.45 \
--books books/ \
--voice-ref ./voice.wav \
--output out/
# Provisions pod → rsyncs project + books + voice → bootstraps Chatterbox →
# synthesizes → rsyncs MP3s back → destroys pod.

Cost on a single RTX 5090 (~$0.34/hr): a 600-page book runs ~3-4 hours wall-clock for ~$5-7.
For batch runs, parallel-all provisions one pod per book (books over --max-chars-per-pod are auto-split across extra pods and reassembled via ffmpeg at the end; the split arithmetic is sketched below):
caffeinate -i nohup python scripts/remote_run.py parallel-all \
--provider vast --gpu rtx5090 --max-price 0.45 \
--min-cpu-cores 16 --min-dlperf 200 --min-pcie-gen 4 \
--books books/ \
--voice-ref ./voice.wav \
--output out/ \
--max-chars-per-pod 600000 \
--checkpoint-interval 60 \
--log-dir /tmp/pod_logs \
--yes > /tmp/orchestrator.log 2>&1 &
# Live dashboard in another terminal:
python scripts/live_dash.py out/

Wall-clock ≈ the time for the slowest pod's book (not the sum). Per-pod chunk-level checkpoints are synced back to local every minute, so a dead pod can be recovered without losing work.
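For intuition on the auto-split: a book's character count divided by --max-chars-per-pod gives the pod count, and each pod gets one --chunk M-of-N slice. A toy sketch (the helper name is mine, not the repo's):

```python
import math

def plan_pods(total_chars: int, max_chars_per_pod: int = 600_000) -> list[str]:
    """How many pods a book needs and the --chunk slice each one synthesizes."""
    n = max(1, math.ceil(total_chars / max_chars_per_pod))
    return [f"--chunk {i}-of-{n}" for i in range(1, n + 1)]

print(plan_pods(1_500_000))  # ['--chunk 1-of-3', '--chunk 2-of-3', '--chunk 3-of-3']
```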
Vast.ai hosts vary wildly even at the same --max-price. Without filters, you can land an 8-core CPU + PCIe 4 + dlperf 155 host that's 2.5× slower than a 32-core PCIe 5 dlperf 200 peer at the same hourly rate. Recommended:
--min-cpu-cores 16 # Chatterbox is bottlenecked by CPU↔GPU transfers
--min-dlperf 200 # Vast's synthetic GPU benchmark
--min-pcie-gen 4 # 5 is rare; 4 is the safe floor
--min-inet-up 500 # Pod upload Mbps; this is what rsync_down speed bottlenecks on
RunPod has no equivalent host-quality filter — its pricing tiers reflect quality. Use Secure tier for predictable throughput.
Measured on Vast.ai RTX 5090 with Chatterbox 0.1.x:
| Hardware tier | Throughput (chunks/min) | $/1k chunks |
|---|---|---|
| 32-core PCIe 5, dlperf 200+ | 17–19 | $0.32 |
| 24-core PCIe 5, dlperf 190 | 14–16 | $0.40 |
| 16-core PCIe 4, dlperf 200 | 9–11 | $0.55 |
| 8-core PCIe 4, dlperf 155 | 7 | $0.80 |
Each chunk is ~300 chars. A typical novel is ~30k chunks → $10-25 with the right host. Apply the recommended --min-* filters to avoid the bottom row.
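To turn the table into a dollar estimate for your own book, the arithmetic is just chunks ÷ throughput × hourly price; for example:

```python
def estimate_cost(n_chunks: int, chunks_per_min: float, usd_per_hour: float) -> tuple[float, float]:
    """Return (gpu_hours, usd) for one book at a given host throughput."""
    gpu_hours = n_chunks / chunks_per_min / 60
    return gpu_hours, gpu_hours * usd_per_hour

# ~30k chunks on a top-tier host at ~$0.34/hr:
hours, usd = estimate_cost(30_000, 18, 0.34)
print(f"{hours:.1f} GPU-hours, ${usd:.2f}")  # ≈ 27.8 GPU-hours, ≈ $9.44
```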
Pipeline:
- extract — EPUB via ebooklib + BeautifulSoup; PDF via pymupdf4llm; plain text passed through.
- chunk — sentence-aware split, ~300 chars per chunk (Chatterbox limit). `--chunk M-of-N` splits a book across pods.
- synthesize — `TTSAdapter.synthesize(text)` returns float32 samples; each chunk is persisted to `<book>.chunks/NNNNN.npy` for resume (see the sketch after this list).
- stitch — concatenate chunks with 0.8s silence; pydub → ffmpeg → MP3 with `title`/`artist` ID3 tags.
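The resume behavior falls out of the per-chunk .npy layout: skip anything already on disk, synthesize the rest. A minimal sketch (the adapter and chunk list are stand-ins):

```python
from pathlib import Path
import numpy as np

def synthesize_book(chunks: list[str], adapter, chunk_dir: Path) -> None:
    """One float32 .npy per chunk; finished chunks are skipped, so an
    interrupted run picks up exactly where it left off."""
    chunk_dir.mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(chunks):
        out = chunk_dir / f"{i:05d}.npy"       # <book>.chunks/NNNNN.npy
        if out.exists():                       # resume: skip completed work
            continue
        samples = adapter.synthesize(text)     # float32 waveform samples
        np.save(out, np.asarray(samples, dtype=np.float32))
```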
For cloud runs, scripts/remote_run.py wraps the pipeline in a provision → rsync → bootstrap → fetch → destroy lifecycle and parallelizes across pods. Pods run book2audio with --no-mp3 (chunks only); the orchestrator rsyncs chunks home and stitches locally. Two consequences: GPU pods don't waste cycles on CPU-bound stitching, and a dropped pod never costs more than the chunks already on disk.
Three layers, defense in depth:
- Graceful self-destruct — bootstrap's last act on success is to write `0` to `/workspace/deadline.epoch`, triggering the deadman watcher to destroy within 60s. So pods die as soon as their work is done, not when the orchestrator notices.
- Deadman timer — every pod runs a background watcher that polls `/workspace/deadline.epoch` (default: 7h from boot). When the deadline passes, the watcher calls Vast's destroy API on its own instance ID. This caps the worst case regardless of laptop state. Extend with `python scripts/remote_run.py extend-deadman --output out/ --hours 3`. (A sketch of the watcher follows below.)
- Provider credit cap — load Vast/RunPod with a fixed amount, no auto-billing. When credit hits zero, the provider destroys all your instances. Hard ceiling.
You shouldn't have to rely on layer 3. But it's there.
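A stripped-down sketch of what such a watcher can look like, assuming the vastai CLI and API key are present on the pod and the instance ID is exported by the bootstrap (both assumptions, not verified against the repo's script):

```python
import os
import subprocess
import time
from pathlib import Path

DEADLINE_FILE = Path("/workspace/deadline.epoch")
INSTANCE_ID = os.environ["VAST_INSTANCE_ID"]  # assumed to be set by bootstrap

def watch(poll_seconds: int = 60) -> None:
    """Destroy this pod once the deadline passes; a written 0 means 'done, die now'."""
    while True:
        deadline = float(DEADLINE_FILE.read_text().strip())
        if time.time() >= deadline:
            # Self-destruct via the provider CLI; nothing survives the deadline.
            subprocess.run(["vastai", "destroy", "instance", INSTANCE_ID], check=False)
            return
        time.sleep(poll_seconds)

if __name__ == "__main__":
    watch()
```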
- Vast SSH-key propagation race: occasionally a pod is TCP-reachable but `authorized_keys` hasn't been baked yet. We probe SSH+auth before continuing (`wait_ssh` does this), but worst case the pod fails to provision and we re-run that single book.
- RunPod has no host-quality filter: their `gpu_type` is the only selector. Use Secure tier.
- Chatterbox 0.1.x has a 300-char per-chunk limit. We respect it; the chunker splits at sentence boundaries to keep seams clean (see the sketch after this list).
- Voice cloning consent is your responsibility. This repo ships no voice samples. Cloning the voice of a real person without their consent may be illegal in your jurisdiction.
- One TTS engine = one Docker image for cloud runs. The pre-baked image is Chatterbox-only. To use Kokoro on cloud, build your own image (see `docker/`).
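A minimal sketch of sentence-boundary chunking under the 300-char cap (a plain regex split, not necessarily the splitter the repo uses; sentences longer than the cap pass through unsplit here):

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks that stay under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```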
The pre-baked ghcr.io/heitzlki/book2audio-chatterbox:latest is what the orchestrator pulls by default. To use your own (a fork, a different TTS engine, a CUDA version pin):
# Build + push (one time)
cd docker/
./build_and_push.sh ghcr.io/<you>/book2audio-chatterbox:latest
# Tell the orchestrator to use it
export BOOK2AUDIO_IMAGE=ghcr.io/<you>/book2audio-chatterbox:latest
python scripts/remote_run.py parallel-all ...

remote_run.py provision # one pod, no work
remote_run.py wait # block until SSH ready
remote_run.py sync-up # rsync project + books + voice up
remote_run.py run # bash /workspace/remote_bootstrap.sh
remote_run.py sync-down # rsync MP3s down
remote_run.py destroy # tear it down
remote_run.py all # all of the above, single pod
remote_run.py parallel-all # all of the above, N pods in parallel
remote_run.py monitor # poll a running pod (rsync + log)
remote_run.py status # summarize a parallel-all run state
remote_run.py recover # re-sync + destroy after orchestrator crash
remote_run.py extend-deadman # push the deadman deadline back
book2audio sample.epub --out-dir out/ # default: kokoro
book2audio sample.epub --out-dir out/ -e chatterbox --voice-ref voice.wav
book2audio --help

uv sync --extra dev
pre-commit install
pytest tests/

`pre-commit install` wires the hooks. Every git commit then runs ruff (lint + format), mypy, and the standard hygiene checks.
Issues + PRs welcome. No CLA.
MIT — see LICENSE.