bramheerink/epubkit
epubkit

Extract structured data from any epub — books, cookbooks, manuals, papers.

Install

pip install ebooklib beautifulsoup4 lxml pyyaml

Quick start

1. Discover

Scan an epub to see what CSS classes it uses:

python -m epubkit discover book.epub

This generates a book.yaml config listing every CSS class with examples and occurrence counts. You see the examples, you label the class.

2. Configure

Open the YAML and label each class with a role:

classes:
  # (267x) Examples:
  #   The network model
  #   The relational model
  #   Comparison to document databases
  sect3: section_header

  # (513x) Examples:
  #   Mathematics has been going on for a long time...
  #   We humans always try to make things better...
  p: text

  # (12x) Examples:
  #   Chapter 1. Reliable, Scalable, and Maintainable Applications
  #   Chapter 2. Data Models and Query Languages
  chapter: chapter_title

Two structural roles control grouping:

  • chapter_title — groups entries into chapters
  • title — starts a new entry within a chapter

All other roles (text, section_header, note, headnote, or any custom name) are stored as content blocks on the current entry.

Prose mode: if no class maps to title, each chapter_title becomes an entry. This is the natural mode for novels, textbooks, and papers.

Item mode: if a class maps to title, entries are sub-items within chapters. This is for cookbooks (recipes), anthologies (poems), reference books (entries).

Any role name works — custom roles render as <p class="role">.

Add tags with +: "new-recipe": title+new marks entries with a new tag.
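The grouping rules above can be sketched in plain Python. This is a simplified model of the behavior described, not epubkit's actual implementation; the `(role, text)` pair representation is an assumption:

```python
def group(blocks):
    """blocks: list of (role, text) pairs in document order.

    chapter_title opens a new chapter, title opens a new entry within
    it, and every other role is appended to the current entry as a
    content block. With no title blocks (prose mode), each chapter
    holds one implicit entry keyed by the chapter title itself.
    """
    chapters = []
    for role, text in blocks:
        if role == "chapter_title":
            chapters.append({"title": text, "entries": []})
        elif role == "title":
            chapters[-1]["entries"].append({"title": text, "content": []})
        else:
            if not chapters[-1]["entries"]:
                # Prose mode: create the implicit entry on first content
                chapters[-1]["entries"].append(
                    {"title": chapters[-1]["title"], "content": []})
            chapters[-1]["entries"][-1]["content"].append((role, text))
    return chapters
```

Feeding it a prose-mode stream yields one entry per chapter; adding `title` blocks switches the same loop into item mode.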

3. Parse

python -m epubkit parse book.epub book.yaml           # → HTML
python -m epubkit parse book.epub book.yaml --format=json  # → JSON

4. Use as a library

from epubkit import discover, parse, to_html, to_json

# Discover
result = discover("book.epub")
result.save("book.yaml")

# Parse
result = parse("book.epub", "book.yaml")

# Search and filter
hits = result.search("network")
ch2 = result.filter(chapter="Chapter 2")
new = result.filter(tags=["new"])

# Access entries
for entry in result:
    print(entry.title)
    print(entry.texts())                    # all text blocks
    print(entry.texts("section_header"))    # just headers
    print(entry.to_dict())                  # export as dict

# Export
to_html(result, "output/")
to_json(result, "output/items.json")

Options

In your YAML config:

options:
  title_case: true          # convert ALL CAPS titles to Title Case
  exclude_images:           # glob patterns for images to skip
    - "inline-*"
    - "ch*-map*"
  extract_tables: true      # include <table> data in entries
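If the exclude_images patterns are shell-style globs (an assumption; the matching semantics aren't documented here), the filtering amounts to a check like this:

```python
from fnmatch import fnmatch

def excluded(image_name, patterns):
    """True if an image filename matches any exclusion glob."""
    return any(fnmatch(image_name, pat) for pat in patterns)

# With the config above, inline decorations and chapter maps are skipped:
# excluded("inline-3.png", ["inline-*", "ch*-map*"]) -> True
# excluded("cover.jpg",    ["inline-*", "ch*-map*"]) -> False
```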

Linking

epubkit resolves internal cross-references automatically:

linking:
  anchor_patterns:
    - "dsh*"                # primary entry anchors
  sub_anchor_patterns:
    - "page_*"              # secondary anchors (page refs)
  resolve_here: true        # resolve "here" links by matching nearby titles
  here_words: ["here"]
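Assuming the anchor patterns are also shell-style globs (again an assumption on my part), sorting anchors into primary and secondary buckets would look roughly like:

```python
from fnmatch import fnmatch

def classify(anchor_id, primary_patterns, secondary_patterns):
    """Return 'primary', 'secondary', or None for an anchor id."""
    if any(fnmatch(anchor_id, p) for p in primary_patterns):
        return "primary"
    if any(fnmatch(anchor_id, p) for p in secondary_patterns):
        return "secondary"
    return None
```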

Handling complex epubs

Rich CSS classes (ideal case)

Books from commercial publishers often have dozens of CSS classes that map directly to semantic roles. The discover → configure → parse workflow handles these well.

Minimal CSS (Project Gutenberg)

Many Project Gutenberg epubs use raw HTML tags with no CSS classes. Use EpubReader directly:

from epubkit import EpubReader, make_id

reader = EpubReader("book.epub")

for doc in reader.documents():
    soup = doc["soup"]
    for h4 in soup.find_all("h4"):
        title = h4.get_text().strip()
        entry_id = make_id(title)
        # Walk siblings until the next heading to collect this entry's body
        body = []
        for sib in h4.find_next_siblings():
            if sib.name == "h4":
                break
            body.append(sib.get_text().strip())

See demos/mrs-beeton/ for a full example.

API reference

| Class / Function | Description |
| --- | --- |
| discover(path) | Scan epub, return Discovery with classes, images, anchors |
| parse(path, config) | Parse epub with config, return ParseResult |
| to_html(result, dir) | Write browsable HTML to directory |
| to_json(result, path) | Write entries as JSON |
| EpubReader(path) | Low-level: iterate documents and images |
| LinkEngine(config) | Cross-reference resolution engine |
| Entry | Dataclass: .title, .content, .texts(), .images, .to_dict() |
| clean_text(html) | Strip tags, decode entities, normalize whitespace |
| make_id(text) | Text → URL-safe slug |
| title_case(text) | ALL CAPS → Title Case |
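The three text helpers could plausibly behave like the reimplementations below. These are illustrations of the documented behavior only, not epubkit's code:

```python
import html
import re

def clean_text(raw):
    """Strip tags, decode entities, normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop HTML tags
    text = html.unescape(text)            # decode &amp;, &nbsp;, ...
    return re.sub(r"\s+", " ", text).strip()

def make_id(text):
    """Text -> URL-safe slug."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return slug.strip("-")

def title_case(text):
    """ALL CAPS -> Title Case (naive: capitalizes every word)."""
    return text.title() if text.isupper() else text
```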

Demos

  • demos/origin-of-species/ — Darwin's On the Origin of Species. 14 CSS classes, tables, 17 chapters via YAML config.
  • demos/mrs-beeton/ — Mrs Beeton's Book of Household Management (1861). No CSS classes, 642 recipes via custom EpubReader parsing.

About

Extract structured data from any epub. Discover CSS classes, configure via YAML, parse into items. Works as CLI and Python library.
