Extract structured data from any epub — books, cookbooks, manuals, papers.
```
pip install ebooklib beautifulsoup4 lxml pyyaml
```

Scan an epub to see what CSS classes it uses:

```
python -m epubkit discover book.epub
```

This generates a `book.yaml` config listing every CSS class with examples and occurrence counts. You see the examples; you label the class.
Open the YAML and label each class with a role:
```yaml
classes:
  # (267x) Examples:
  #   The network model
  #   The relational model
  #   Comparison to document databases
  sect3: section_header

  # (513x) Examples:
  #   Mathematics has been going on for a long time...
  #   We humans always try to make things better...
  p: text

  # (12x) Examples:
  #   Chapter 1. Reliable, Scalable, and Maintainable Applications
  #   Chapter 2. Data Models and Query Languages
  chapter: chapter_title
```

Two structural roles control grouping:
- `chapter_title` groups entries into chapters
- `title` starts a new entry within a chapter
All other roles (`text`, `section_header`, `note`, `headnote`, or any custom name) are stored as content blocks on the current entry.
**Prose mode:** if no class maps to `title`, each `chapter_title` becomes an entry. This is the natural mode for novels, textbooks, and papers.

**Item mode:** if a class maps to `title`, entries are sub-items within chapters. This is for cookbooks (recipes), anthologies (poems), and reference books (entries).
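The two modes can be pictured with a small sketch of the grouping logic. This is a hypothetical illustration, not epubkit's actual internals; it assumes labeled blocks arrive as `(role, text)` pairs in document order.

```python
def group_entries(blocks, has_title_class):
    """Group labeled (role, text) blocks into chapters and entries (sketch)."""
    chapters = []  # each: {"title": str, "entries": [...]}
    for role, text in blocks:
        if role == "chapter_title":
            chapters.append({"title": text, "entries": []})
            if not has_title_class:
                # Prose mode: the chapter itself is the entry.
                chapters[-1]["entries"].append({"title": text, "content": []})
        elif role == "title" and chapters:
            # Item mode: start a new sub-entry within the current chapter.
            chapters[-1]["entries"].append({"title": text, "content": []})
        elif chapters and chapters[-1]["entries"]:
            # Any other role becomes a content block on the current entry.
            chapters[-1]["entries"][-1]["content"].append((role, text))
    return chapters
```

In item mode a `chapter_title` only opens a container and each `title` starts an entry; in prose mode every `chapter_title` is itself one entry that collects all following blocks.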
Any role name works; custom roles render as `<p class="role">`.
Add tags with `+`: `"new-recipe": title+new` marks entries with a `new` tag.
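Parsing that `role+tag` syntax is straightforward; a minimal sketch (not epubkit's actual parser) that splits a config value into a role and its tags:

```python
def parse_role(value):
    """Split a config value like 'title+new' into (role, [tags]) (sketch)."""
    role, *tags = value.split("+")
    return role, tags
```

Multiple tags would compose naturally, e.g. `title+new+vegetarian`.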
```
python -m epubkit parse book.epub book.yaml                # → HTML
python -m epubkit parse book.epub book.yaml --format=json  # → JSON
```

The same workflow is available from Python:

```python
from epubkit import discover, parse, to_html, to_json

# Discover
result = discover("book.epub")
result.save("book.yaml")

# Parse
result = parse("book.epub", "book.yaml")

# Search and filter
hits = result.search("network")
ch2 = result.filter(chapter="Chapter 2")
new = result.filter(tags=["new"])

# Access entries
for entry in result:
    print(entry.title)
    print(entry.texts())                  # all text blocks
    print(entry.texts("section_header"))  # just headers
    print(entry.to_dict())                # export as dict

# Export
to_html(result, "output/")
to_json(result, "output/items.json")
```

In your YAML config:
```yaml
options:
  title_case: true       # convert ALL CAPS titles to Title Case
  exclude_images:        # glob patterns for images to skip
    - "inline-*"
    - "ch*-map*"
  extract_tables: true   # include <table> data in entries
```

epubkit resolves internal cross-references automatically:
```yaml
linking:
  anchor_patterns:
    - "dsh*"             # primary entry anchors
  sub_anchor_patterns:
    - "page_*"           # secondary anchors (page refs)
  resolve_here: true     # resolve "here" links by matching nearby titles
  here_words: ["here"]
```

Books from commercial publishers often have dozens of CSS classes that map directly to semantic roles. The discover → configure → parse workflow handles these well.
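The anchor patterns in the `linking` section are shell-style globs. A minimal sketch of how such matching could work, using Python's `fnmatch` (an assumption about the implementation, not epubkit's actual code):

```python
from fnmatch import fnmatch

# Patterns as they would appear in the linking config.
anchor_patterns = ["dsh*"]
sub_anchor_patterns = ["page_*"]

def classify_anchor(anchor_id):
    """Classify an anchor id as primary, secondary, or unmatched (sketch)."""
    if any(fnmatch(anchor_id, p) for p in anchor_patterns):
        return "primary"
    if any(fnmatch(anchor_id, p) for p in sub_anchor_patterns):
        return "secondary"
    return None
```

Primary anchors would become link targets for entries; secondary anchors (like page references) would resolve to the nearest enclosing entry.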
Many Project Gutenberg epubs use raw HTML tags with no CSS classes. Use `EpubReader` directly:
```python
from epubkit import EpubReader, make_id

reader = EpubReader("book.epub")
for doc in reader.documents():
    soup = doc["soup"]
    for h4 in soup.find_all("h4"):
        title = h4.get_text().strip()
        # Walk siblings until next heading...
```

See `demos/mrs-beeton/` for a full example.
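The elided sibling walk might look like this. A self-contained sketch using BeautifulSoup (one of the installed dependencies) on a small HTML fragment; the actual demo code may differ:

```python
from bs4 import BeautifulSoup

html = """
<h4>OXTAIL SOUP</h4>
<p>Take one oxtail.</p>
<p>Simmer for three hours.</p>
<h4>PEA SOUP</h4>
<p>Soak the peas overnight.</p>
"""

soup = BeautifulSoup(html, "html.parser")
recipes = {}
for h4 in soup.find_all("h4"):
    title = h4.get_text().strip()
    body = []
    # Collect following sibling tags until the next h4 heading.
    for sib in h4.find_next_siblings():
        if sib.name == "h4":
            break
        body.append(sib.get_text().strip())
    recipes[title] = body
```

Each heading becomes an entry title and the intervening paragraphs become its content blocks.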
| Class / Function | Description |
|---|---|
| `discover(path)` | Scan epub, return `Discovery` with classes, images, anchors |
| `parse(path, config)` | Parse epub with config, return `ParseResult` |
| `to_html(result, dir)` | Write browsable HTML to directory |
| `to_json(result, path)` | Write entries as JSON |
| `EpubReader(path)` | Low-level: iterate documents and images |
| `LinkEngine(config)` | Cross-reference resolution engine |
| `Entry` | Dataclass: `.title`, `.content`, `.texts()`, `.images`, `.to_dict()` |
| `clean_text(html)` | Strip tags, decode entities, normalize whitespace |
| `make_id(text)` | Text → URL-safe slug |
| `title_case(text)` | ALL CAPS → Title Case |
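To make the text helpers concrete, here are hypothetical sketches of `clean_text` and `make_id` matching their descriptions above; the shipped implementations may differ:

```python
import html
import re

def clean_text(raw):
    """Strip tags, decode entities, normalize whitespace (sketch)."""
    text = re.sub(r"<[^>]+>", " ", raw)   # strip tags
    text = html.unescape(text)            # decode entities
    return re.sub(r"\s+", " ", text).strip()

def make_id(text):
    """Text → URL-safe slug (sketch)."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return slug.strip("-")
```

For example, `make_id("Oxtail Soup!")` would yield a slug usable as an HTML anchor or filename.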
- `demos/origin-of-species/`: Darwin's *On the Origin of Species*. 14 CSS classes, tables, 17 chapters via YAML config.
- `demos/mrs-beeton/`: Mrs Beeton's Book of Household Management (1861). No CSS classes; 642 recipes via custom `EpubReader` parsing.