Extract structured data from any epub — books, cookbooks, manuals, papers.
```
pip install ebooklib beautifulsoup4 lxml pyyaml
```

Scan an epub to see what CSS classes it uses:

```
python -m epubkit discover book.epub
```

This generates a `book.yaml` config listing every CSS class with examples and occurrence counts. You see the examples; you label the class.
Open the YAML and label each class with a role:
```yaml
classes:
  # (267x) Examples:
  #   The network model
  #   The relational model
  #   Comparison to document databases
  sect3: section_header

  # (513x) Examples:
  #   Mathematics has been going on for a long time...
  #   We humans always try to make things better...
  p: text

  # (12x) Examples:
  #   Chapter 1. Reliable, Scalable, and Maintainable Applications
  #   Chapter 2. Data Models and Query Languages
  chapter: chapter_title
```

Two structural roles control grouping:
- `chapter_title` groups entries into chapters
- `title` starts a new entry within a chapter
All other roles (`text`, `section_header`, `note`, `headnote`, or any custom name) are stored as content blocks on the current entry.
**Prose mode:** if no class maps to `title`, each `chapter_title` becomes an entry. This is the natural mode for novels, textbooks, and papers.

**Item mode:** if a class maps to `title`, entries are sub-items within chapters. This is for cookbooks (recipes), anthologies (poems), and reference books (entries).
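The two modes can be pictured with a small sketch of the grouping logic. This is a hypothetical illustration, not epubkit's actual internals; it assumes labeled blocks arrive as `(role, text)` pairs in document order.

```python
def group_entries(blocks, has_title_class):
    """Group labeled (role, text) blocks into chapters and entries (sketch)."""
    chapters = []  # each: {"title": str, "entries": [...]}
    for role, text in blocks:
        if role == "chapter_title":
            chapters.append({"title": text, "entries": []})
            if not has_title_class:
                # Prose mode: the chapter itself is the entry.
                chapters[-1]["entries"].append({"title": text, "content": []})
        elif role == "title" and chapters:
            # Item mode: start a new sub-entry within the current chapter.
            chapters[-1]["entries"].append({"title": text, "content": []})
        elif chapters and chapters[-1]["entries"]:
            # Any other role becomes a content block on the current entry.
            chapters[-1]["entries"][-1]["content"].append((role, text))
    return chapters
```

In item mode a `chapter_title` only opens a container and each `title` starts an entry; in prose mode every `chapter_title` is itself one entry that collects all following blocks.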
Any role name works; custom roles render as `<p class="role">`.
Add tags with `+`: `"new-recipe": title+new` marks entries with a `new` tag.
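Parsing that `role+tag` syntax is straightforward; a minimal sketch (not epubkit's actual parser) that splits a config value into a role and its tags:

```python
def parse_role(value):
    """Split a config value like 'title+new' into (role, [tags]) (sketch)."""
    role, *tags = value.split("+")
    return role, tags
```

Multiple tags would compose naturally, e.g. `title+new+vegetarian`.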
```
python -m epubkit parse book.epub book.yaml                # → HTML
python -m epubkit parse book.epub book.yaml --format=json  # → JSON
```

The same workflow is available from Python:

```python
from epubkit import discover, parse, to_html, to_json

# Discover
result = discover("book.epub")
result.save("book.yaml")

# Parse
result = parse("book.epub", "book.yaml")

# Search and filter
hits = result.search("network")
ch2 = result.filter(chapter="Chapter 2")
new = result.filter(tags=["new"])

# Access entries
for entry in result:
    print(entry.title)
    print(entry.texts())                  # all text blocks
    print(entry.texts("section_header"))  # just headers
    print(entry.to_dict())                # export as dict

# Export
to_html(result, "output/")
to_json(result, "output/items.json")
```

In your YAML config:
```yaml
options:
  title_case: true       # convert ALL CAPS titles to Title Case
  exclude_images:        # glob patterns for images to skip
    - "inline-*"
    - "ch*-map*"
  extract_tables: true   # include <table> data in entries
```

epubkit resolves internal cross-references automatically:
```yaml
linking:
  anchor_patterns:
    - "dsh*"             # primary entry anchors
  sub_anchor_patterns:
    - "page_*"           # secondary anchors (page refs)
  resolve_here: true     # resolve "here" links by matching nearby titles
  here_words: ["here"]
```

Books from commercial publishers often have dozens of CSS classes that map directly to semantic roles. The discover → configure → parse workflow handles these well.
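The anchor patterns in the `linking` section are shell-style globs. A minimal sketch of how such matching could work, using Python's `fnmatch` (an assumption about the implementation, not epubkit's actual code):

```python
from fnmatch import fnmatch

# Patterns as they would appear in the linking config.
anchor_patterns = ["dsh*"]
sub_anchor_patterns = ["page_*"]

def classify_anchor(anchor_id):
    """Classify an anchor id as primary, secondary, or unmatched (sketch)."""
    if any(fnmatch(anchor_id, p) for p in anchor_patterns):
        return "primary"
    if any(fnmatch(anchor_id, p) for p in sub_anchor_patterns):
        return "secondary"
    return None
```

Primary anchors would become link targets for entries; secondary anchors (like page references) would resolve to the nearest enclosing entry.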
Many Project Gutenberg epubs use raw HTML tags with no CSS classes. Use `EpubReader` directly:
```python
from epubkit import EpubReader, make_id

reader = EpubReader("book.epub")
for doc in reader.documents():
    soup = doc["soup"]
    for h4 in soup.find_all("h4"):
        title = h4.get_text().strip()
        # Walk siblings until next heading...
```

See `demos/mrs-beeton/` for a full example.
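The elided sibling walk might look like this. A self-contained sketch using BeautifulSoup (one of the installed dependencies) on a small HTML fragment; the actual demo code may differ:

```python
from bs4 import BeautifulSoup

html = """
<h4>OXTAIL SOUP</h4>
<p>Take one oxtail.</p>
<p>Simmer for three hours.</p>
<h4>PEA SOUP</h4>
<p>Soak the peas overnight.</p>
"""

soup = BeautifulSoup(html, "html.parser")
recipes = {}
for h4 in soup.find_all("h4"):
    title = h4.get_text().strip()
    body = []
    # Collect following sibling tags until the next h4 heading.
    for sib in h4.find_next_siblings():
        if sib.name == "h4":
            break
        body.append(sib.get_text().strip())
    recipes[title] = body
```

Each heading becomes an entry title and the intervening paragraphs become its content blocks.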
| Class / Function | Description |
|---|---|
| `discover(path)` | Scan epub, return `Discovery` with classes, images, anchors |
| `parse(path, config)` | Parse epub with config, return `ParseResult` |
| `to_html(result, dir)` | Write browsable HTML to directory |
| `to_json(result, path)` | Write entries as JSON |
| `EpubReader(path)` | Low-level: iterate documents and images |
| `LinkEngine(config)` | Cross-reference resolution engine |
| `Entry` | Dataclass: `.title`, `.content`, `.texts()`, `.images`, `.to_dict()` |
| `clean_text(html)` | Strip tags, decode entities, normalize whitespace |
| `make_id(text)` | Text → URL-safe slug |
| `title_case(text)` | ALL CAPS → Title Case |
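To make the text helpers concrete, here are hypothetical sketches of `clean_text` and `make_id` matching their descriptions above; the shipped implementations may differ:

```python
import html
import re

def clean_text(raw):
    """Strip tags, decode entities, normalize whitespace (sketch)."""
    text = re.sub(r"<[^>]+>", " ", raw)   # strip tags
    text = html.unescape(text)            # decode entities
    return re.sub(r"\s+", " ", text).strip()

def make_id(text):
    """Text → URL-safe slug (sketch)."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return slug.strip("-")
```

For example, `make_id("Oxtail Soup!")` would yield a slug usable as an HTML anchor or filename.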
- `demos/origin-of-species/`: Darwin's *On the Origin of Species*. 14 CSS classes, tables, 17 chapters via YAML config.
- `demos/mrs-beeton/`: Mrs Beeton's Book of Household Management (1861). No CSS classes; 642 recipes via custom `EpubReader` parsing.