Description
Currently, the _epub_converter.py in MarkItDown has two significant limitations when handling EPUB files:
- Noise in output: XHTML files often include XML declarations and <style> blocks. If the BeautifulSoup search for <body> fails (common with namespaces), the entire raw file content is included in the Markdown output.
- Missing content: The simplistic path-joining logic f"{base_path}/{manifest[item_id]}" fails to correctly resolve relative paths (e.g., ../Text/...) used in many commercial EPUB manifests.
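
To illustrate the second point, here is a minimal sketch of how a manifest href could be resolved against the location of the OPF file. The function name `resolve_manifest_href` and the sample paths are hypothetical and not taken from MarkItDown. It assumes hrefs are relative to the OPF file's directory and may be percent-encoded.

```python
import posixpath
from urllib.parse import unquote


def resolve_manifest_href(opf_path: str, href: str) -> str:
    """Resolve a manifest href relative to the directory of the OPF file.

    Hypothetical helper. ZIP member names always use forward slashes,
    so posixpath is used instead of os.path.
    """
    base_dir = posixpath.dirname(opf_path)
    # Drop any fragment and decode percent-escapes such as "%20".
    href = unquote(href.split("#", 1)[0])
    # normpath collapses "." and ".." segments.
    return posixpath.normpath(posixpath.join(base_dir, href))


# Naive joining keeps the ".." segment, so the result is unlikely to
# match any member name inside the archive.
base_path = "OEBPS/content"
href = "../Text/chapter1.xhtml"
print(f"{base_path}/{href}")
print(resolve_manifest_href("OEBPS/content/package.opf", href))
```

The resolved name can then be looked up directly in the archive's member list. Normalizing with `posixpath.normpath` rather than `os.path.normpath` keeps the behavior identical on Windows.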
Potential Solution
A more robust manifest parser that handles relative paths and a more aggressive XHTML cleaner would resolve these issues. I have implemented a workaround for a private project and would be happy to contribute a PR if interested.
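
As a rough idea of what the cleaner could do, here is a dependency-free sketch that strips the XML declaration and <style>/<script> blocks before extracting the body, so that a failed <body> lookup no longer leaks that noise. The name `clean_xhtml` is hypothetical, and the regular expressions are an assumption made for illustration; an actual change to the converter would more likely do the same steps with BeautifulSoup.

```python
import re

# Leading XML declaration, e.g. <?xml version="1.0" encoding="utf-8"?>
_XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>", re.IGNORECASE)
# Whole <style>...</style> and <script>...</script> blocks.
_STYLE_OR_SCRIPT = re.compile(
    r"<(style|script)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# <body>...</body>, optionally with a namespace prefix such as <html:body>.
_BODY = re.compile(
    r"<(?:\w+:)?body\b[^>]*>(.*)</(?:\w+:)?body\s*>", re.IGNORECASE | re.DOTALL
)


def clean_xhtml(raw: str) -> str:
    """Return the body markup of an XHTML document with common noise removed.

    Hypothetical helper. If no body element is found, the stripped text is
    returned as a fallback, so declarations and styles are still excluded.
    """
    text = raw.lstrip("\ufeff")  # drop a byte order mark if present
    text = _XML_DECL.sub("", text)
    text = _STYLE_OR_SCRIPT.sub("", text)
    match = _BODY.search(text)
    return (match.group(1) if match else text).strip()


sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    "<style>p { color: red; }</style></head>"
    "<body><p>Hello</p></body></html>"
)
print(clean_xhtml(sample))
```

Removing the noise before looking for the body is the important ordering: even when the lookup fails, the fallback text is already free of declarations and style rules.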