heitzlki/book2audio

book2audio


Heads up: this repo was vibe-coded — but tested in production. 6 audiobooks (~50 hours of audio) synthesized end-to-end across multiple Vast.ai pods, bugs surfaced and fixed in the wild, ~$12 of GPU time burned validating the orchestrator + chunks-only pipeline. Use it, fork it, file issues.

Convert PDF/EPUB books to MP3 audiobooks. Two backends:

  • Kokoro — fast, free, runs on CPU/MPS/CUDA, no voice cloning.
  • Chatterbox — slower, GPU-friendly, clones any voice from a 10-30s reference clip. Cloud GPU orchestrator included for batch runs on Vast.ai or RunPod.

Output MP3s are tagged with title and artist from EPUB metadata, named Title - Author.mp3.
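The naming convention can be sketched as a small helper. `mp3_filename` is hypothetical — it is not a function this repo exports — and the character-stripping rule is an assumption about how path-hostile titles would need to be handled:

```python
import re

def mp3_filename(title: str, author: str) -> str:
    """Build the output name 'Title - Author.mp3', replacing
    filesystem-hostile characters with underscores."""
    name = f"{title} - {author}.mp3"
    return re.sub(r'[\\/:*?"<>|]', "_", name)

print(mp3_filename("Dune", "Frank Herbert"))  # → Dune - Frank Herbert.mp3
```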


Quick start (local, free)

git clone https://github.com/heitzlki/book2audio
cd book2audio
pip install -e .[kokoro]            # ~2 min, no GPU needed
book2audio mybook.epub --out-dir out/
# → out/<Title> - <Author>.mp3

Works on macOS (M1/M2/M3 via MPS), Linux, Windows. Real-time factor on M1 Max ≈ 0.4 (a 10-hour book takes ~4 hours of CPU).
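The real-time-factor arithmetic works out as follows (`synthesis_hours` is an illustrative helper, not part of the CLI):

```python
def synthesis_hours(audio_hours: float, rtf: float) -> float:
    """Wall-clock hours to synthesize a book, given the real-time factor.

    rtf < 1 means synthesis is slower than playback would suggest is free:
    an RTF of 0.4 means each hour of audio takes 0.4 hours to generate.
    """
    return audio_hours * rtf

# A 10-hour audiobook at Kokoro's ~0.4 RTF on an M1 Max:
print(synthesis_hours(10, 0.4))  # → 4.0
```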

# Kokoro voice options (run with -v <voice>):
book2audio mybook.epub --out-dir out/ -v af_bella    # American female
book2audio mybook.epub --out-dir out/ -v am_adam     # American male
book2audio mybook.epub --out-dir out/ -v bf_emma     # British female
# Full list: https://github.com/hexgrad/kokoro

Voice cloning (Chatterbox, local GPU or cloud)

You'll need a voice reference: a 10-30s clean audio clip of the voice you want, in WAV format. You provide your own — this repo ships no voice samples. Cloning a real person's voice without consent is your responsibility, not the library's.
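A quick sanity check on a reference clip can be done with the stdlib `wave` module. This is a sketch, not the repo's validation logic (the 10-30s window comes from the guidance above; `check_voice_ref` is a hypothetical name):

```python
import wave

def check_voice_ref(path: str, min_s: float = 10.0, max_s: float = 30.0) -> float:
    """Return the WAV clip's duration in seconds; raise if it falls
    outside the recommended 10-30s window."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
    if not (min_s <= duration <= max_s):
        raise ValueError(f"voice ref is {duration:.1f}s; want {min_s}-{max_s}s")
    return duration
```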

pip install -e .[chatterbox]
book2audio mybook.epub --out-dir out/ -e chatterbox \
    --voice-ref ./my_voice.wav
# Auto-detects CUDA → MPS → CPU. Force with --device cuda.

Local Chatterbox on a 4090 runs at ~3-4× real-time. On MPS (Apple Silicon) it's ~10× slower than CUDA — usable for short books, painful for full novels.


Cloud GPU batch (Vast.ai or RunPod)

For multi-book runs, rent GPUs. The orchestrator at scripts/remote_run.py handles provisioning, sync, the actual run, fetching MP3s back, and tearing down. Three-layer cost safety chain (graceful self-destruct, deadman timer, provider credit cap) caps your worst-case spend even if your laptop sleeps mid-run.

One book, one pod

# 1. Set up Vast.ai
export VAST_API_KEY=...
vastai set api-key $VAST_API_KEY                 # one-time, ~/.vast_api_key

# 2. Run
python scripts/remote_run.py all \
    --provider vast --gpu rtx5090 --max-price 0.45 \
    --books books/ \
    --voice-ref ./voice.wav \
    --output out/
# Provisions pod → rsyncs project + books + voice → bootstraps Chatterbox →
# synthesizes → rsyncs MP3s back → destroys pod.

Cost on a single RTX 5090 (~$0.34/hr): a 600-page book → ~$5–7, ~3-4 hours wall.

Many books, parallel pods

For batch runs, parallel-all provisions one pod per book (books larger than --max-chars-per-pod are auto-split into per-pod chunks and reassembled via ffmpeg at the end):

caffeinate -i nohup python scripts/remote_run.py parallel-all \
    --provider vast --gpu rtx5090 --max-price 0.45 \
    --min-cpu-cores 16 --min-dlperf 200 --min-pcie-gen 4 \
    --books books/ \
    --voice-ref ./voice.wav \
    --output out/ \
    --max-chars-per-pod 600000 \
    --checkpoint-interval 60 \
    --log-dir /tmp/pod_logs \
    --yes > /tmp/orchestrator.log 2>&1 &

# Live dashboard in another terminal:
python scripts/live_dash.py out/

Wall-clock ≈ time for the slowest pod's book (not the sum). Per-pod chunk-level checkpoints back up to local every minute, so a dead pod can be recovered without losing work.
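The resume logic implied by chunk-level checkpoints can be sketched as follows — skip any `NNNNN.npy` already on disk and synthesize only the rest. `pending_chunks` is a hypothetical helper, not the repo's actual function:

```python
from pathlib import Path

def pending_chunks(chunk_dir: str, total: int) -> list[int]:
    """Chunk indices still to synthesize: any NNNNN.npy already
    checkpointed to disk is skipped on the next run."""
    done = {int(p.stem) for p in Path(chunk_dir).glob("[0-9]" * 5 + ".npy")}
    return [i for i in range(total) if i not in done]
```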

Hardware filters (Vast.ai only)

Vast.ai hosts vary wildly even at the same --max-price. Without filters, you can land an 8-core CPU + PCIe 4 + dlperf 155 host that's 2.5× slower than a 32-core PCIe 5 dlperf 200 peer at the same hourly rate. Recommended:

--min-cpu-cores 16   # Chatterbox is bottlenecked by CPU↔GPU transfers
--min-dlperf 200     # Vast's synthetic GPU benchmark
--min-pcie-gen 4     # 5 is rare; 4 is the safe floor
--min-inet-up 500    # Pod upload Mbps; this is what rsync_down speed bottlenecks on

RunPod has no equivalent host-quality filter — its pricing tiers reflect quality. Use Secure tier for predictable throughput.
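The recommended floors amount to a simple predicate over a host offer. The dict keys here are illustrative — the real Vast API uses its own schema — but the thresholds are the ones above:

```python
def acceptable_host(offer: dict) -> bool:
    """Apply the recommended --min-* floors to a candidate host offer."""
    return (
        offer.get("cpu_cores", 0) >= 16      # --min-cpu-cores
        and offer.get("dlperf", 0) >= 200    # --min-dlperf
        and offer.get("pcie_gen", 0) >= 4    # --min-pcie-gen
        and offer.get("inet_up_mbps", 0) >= 500  # --min-inet-up
    )
```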


Cost budgeting

Measured on Vast.ai RTX 5090 with Chatterbox 0.1.x:

| Hardware tier               | Throughput (chunks/min) | $/1k chunks |
|-----------------------------|-------------------------|-------------|
| 32-core PCIe 5, dlperf 200+ | 17–19                   | $0.32       |
| 24-core PCIe 5, dlperf 190  | 14–16                   | $0.40       |
| 16-core PCIe 4, dlperf 200  | 9–11                    | $0.55       |
| 8-core PCIe 4, dlperf 155   | 7                       | $0.80       |

Each chunk is ~300 chars. A typical novel is ~30k chunks → $10-25 with the right host. Apply the recommended --min-* filters to avoid the bottom row.
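The $10-25 range follows directly from the table's per-chunk rates (a trivial arithmetic sketch, not repo code):

```python
def book_cost(chunks: int, dollars_per_1k: float) -> float:
    """Dollars to synthesize a book at a given $/1k-chunks rate."""
    return chunks / 1000 * dollars_per_1k

# A ~30k-chunk novel across the table's best and worst tiers:
print(round(book_cost(30_000, 0.32), 2))  # → 9.6
print(round(book_cost(30_000, 0.80), 2))  # → 24.0
```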


Architecture

Pipeline:

  • extract — EPUB via ebooklib + BeautifulSoup; PDF via pymupdf4llm; plain text passed through.
  • chunk — sentence-aware split, ~300 chars per chunk (Chatterbox limit). --chunk M-of-N splits a book across pods.
  • synthesize — TTSAdapter.synthesize(text) returns float32 samples; each chunk is persisted to <book>.chunks/NNNNN.npy for resume.
  • stitch — concatenate chunks with 0.8s silence; pydub → ffmpeg → MP3 with title/artist ID3 tags.
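The chunking step above can be sketched as a sentence-aware packer. This is a simplified sketch, assuming a naive sentence regex — the repo's chunker also handles abbreviations and --chunk M-of-N splitting:

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars,
    splitting only at sentence boundaries to keep seams clean."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}" if current else s
    if current:
        chunks.append(current)
    return chunks
```

(A lone sentence longer than max_chars is kept whole here; a real chunker would have to split it further.)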

For cloud runs, scripts/remote_run.py wraps the pipeline in a provision → rsync → bootstrap → fetch → destroy lifecycle and parallelizes across pods. Pods run book2audio with --no-mp3 (chunks only); the orchestrator rsyncs chunks home and stitches locally. Two consequences: GPU pods don't waste cycles on CPU-bound stitching, and a dropped pod never costs more than the chunks already on disk.


Cost safety chain (cloud runs)

Three layers, defense in depth:

  1. Graceful self-destruct — bootstrap's last act on success is to write 0 to /workspace/deadline.epoch, triggering the deadman watcher to destroy within 60s. So pods die as soon as their work is done, not when the orchestrator notices.
  2. Deadman timer — every pod runs a background watcher that polls /workspace/deadline.epoch (default: 7h from boot). When the deadline passes, the watcher calls Vast's destroy API on its own instance ID. This caps the worst case regardless of laptop state. Extend with python scripts/remote_run.py extend-deadman --output out/ --hours 3.
  3. Provider credit cap — load Vast/RunPod with a fixed amount, no auto-billing. When credit hits zero, the provider destroys all your instances. Hard ceiling.

You shouldn't have to rely on layer 3. But it's there.
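The watcher's core decision reduces to one check. A minimal sketch, assuming the deadline file holds a Unix epoch as described above (`deadman_expired` is a hypothetical name, not the repo's):

```python
import time

def deadman_expired(deadline_path, now=None):
    """True when the pod should self-destruct. The file holds a Unix-epoch
    deadline; bootstrap writes 0 on success, so the next poll fires."""
    try:
        with open(deadline_path) as f:
            deadline = float(f.read().strip())
    except (OSError, ValueError):
        return False  # missing/garbled file: keep running, rely on layer 3
    return (now if now is not None else time.time()) >= deadline
```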


Known limitations

  • Vast SSH-key propagation race: occasionally a pod is TCP-reachable but authorized_keys hasn't been baked yet. We probe SSH+auth before continuing (wait_ssh does this), but worst case the pod fails to provision and we re-run that single book.
  • RunPod has no host-quality filter: their gpu_type is the only selector. Use Secure tier.
  • Chatterbox 0.1.x has a 300-char per-chunk limit. We respect it; the chunker splits at sentence boundaries to keep seams clean.
  • Voice cloning consent is your responsibility. This repo ships no voice samples. Cloning the voice of a real person without their consent may be illegal in your jurisdiction.
  • One TTS engine = one Docker image for cloud runs. The pre-baked image is Chatterbox-only. To use Kokoro on cloud, build your own image (see docker/).

Configuring your own Docker image

The pre-baked ghcr.io/heitzlki/book2audio-chatterbox:latest is what the orchestrator pulls by default. To use your own (a fork, a different TTS engine, a CUDA version pin):

# Build + push (one time)
cd docker/
./build_and_push.sh ghcr.io/<you>/book2audio-chatterbox:latest

# Tell the orchestrator to use it
export BOOK2AUDIO_IMAGE=ghcr.io/<you>/book2audio-chatterbox:latest
python scripts/remote_run.py parallel-all ...

Subcommands reference

remote_run.py provision     # one pod, no work
remote_run.py wait          # block until SSH ready
remote_run.py sync-up       # rsync project + books + voice up
remote_run.py run           # bash /workspace/remote_bootstrap.sh
remote_run.py sync-down     # rsync MP3s down
remote_run.py destroy       # tear it down
remote_run.py all           # all of the above, single pod
remote_run.py parallel-all  # all of the above, N pods in parallel
remote_run.py monitor       # poll a running pod (rsync + log)
remote_run.py status        # summarize a parallel-all run state
remote_run.py recover       # re-sync + destroy after orchestrator crash
remote_run.py extend-deadman  # push the deadman deadline back
book2audio sample.epub --out-dir out/                # default: kokoro
book2audio sample.epub --out-dir out/ -e chatterbox --voice-ref voice.wav
book2audio --help

Development

uv sync --extra dev
pre-commit install
pytest tests/

pre-commit install wires the hooks. Every git commit then runs ruff (lint + format), mypy, and the standard hygiene checks.


Contributing

Issues + PRs welcome. No CLA.


License

MIT — see LICENSE.
