newjordan/PDF-Chapter-Chunker

PDF Chapter Chunker 📚

An intelligent PDF splitting tool that can automatically detect chapters from a table of contents and split PDFs into digestible, topic-focused chunks.

✨ Features

  • 🔍 Smart Chapter Detection: Automatically extracts the table of contents and splits by chapter
  • 📄 Page-Based Chunking: Falls back to fixed-size page chunks when no TOC is found
  • 📁 Organized Output: Creates clean folder structures with descriptive filenames
  • 🏷️ Rich Metadata: Adds bookmarks and metadata to generated PDFs
  • 🛡️ Error Handling: Robust error handling and graceful fallbacks
  • 📋 Multiple TOC Formats: Supports various table of contents patterns
  • ⚡ Fast Processing: Efficient PDF processing with minimal memory usage

🚀 Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager

Install Dependencies

pip install pypdf

Download the Script

# Clone the repository
git clone https://github.com/newjordan/PDF-Chapter-Chunker.git
cd PDF-Chapter-Chunker

# Or download directly
wget https://raw.githubusercontent.com/newjordan/PDF-Chapter-Chunker/main/pdf_chapter_chunker.py

📖 Usage

Basic Usage

# Split by chapters (default mode)
python pdf_chapter_chunker.py book.pdf

# Split by chapters explicitly
python pdf_chapter_chunker.py book.pdf --mode chapters

# Split by pages (99 pages per chunk)
python pdf_chapter_chunker.py book.pdf --mode pages

# Split by pages with custom chunk size
python pdf_chapter_chunker.py book.pdf --mode pages --size 50

Advanced Options

# Custom output directory
python pdf_chapter_chunker.py book.pdf --output ./my_chunks

# Quiet mode (minimal output)
python pdf_chapter_chunker.py book.pdf --quiet

# Get help
python pdf_chapter_chunker.py --help

🎯 Examples

Chapter Mode (Recommended)

python pdf_chapter_chunker.py "Technical_Manual.pdf"

Output:

Processing Technical_Manual.pdf (450 pages)
Extracting table of contents from first chunk...
Found 23 chapter entries:
  - Introduction (page 5)
  - Chapter 1: Getting Started (page 12)
  - Chapter 2: Advanced Topics (page 45)
  ...

Created: 001_Introduction.pdf (7 pages)
Created: 002_Chapter 1_ Getting Started.pdf (33 pages)
Created: 003_Chapter 2_ Advanced Topics.pdf (28 pages)
...

✅ Successfully split PDF into 23 chunks!
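
The page counts in the sample output are consistent with each chapter running from its TOC page up to the page before the next entry (12 - 5 = 7, 45 - 12 = 33). A minimal sketch of that calculation follows; `chapter_ranges` is a hypothetical helper written for illustration and is not taken from the script.

```python
from typing import List, Tuple


def chapter_ranges(
    starts: List[Tuple[str, int]], total_pages: int
) -> List[Tuple[str, int, int]]:
    """Turn (title, start_page) entries into (title, first, last) ranges.

    Pages are 1-based and inclusive. The last chapter is assumed to run
    to the end of the document.
    """
    ordered = sorted(starts, key=lambda entry: entry[1])
    ranges = []
    for i, (title, first) in enumerate(ordered):
        if i + 1 < len(ordered):
            last = ordered[i + 1][1] - 1  # page before the next chapter
        else:
            last = total_pages
        ranges.append((title, first, last))
    return ranges


toc = [
    ("Introduction", 5),
    ("Chapter 1: Getting Started", 12),
    ("Chapter 2: Advanced Topics", 45),
]
for title, first, last in chapter_ranges(toc, total_pages=450):
    print(f"{title}: pages {first}-{last} ({last - first + 1} pages)")
```

Only the first three TOC entries are listed here, so the last range extends to page 450 instead of stopping at the fourth chapter.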

Page Mode

python pdf_chapter_chunker.py "Large_Document.pdf" --mode pages --size 25

Output:

Processing Large_Document.pdf (300 pages)
Creating 12 chunks of up to 25 pages each

Created: Large_Document_chunk_001.pdf (25 pages)
Created: Large_Document_chunk_002.pdf (25 pages)
...

✅ Successfully split PDF into 12 chunks!
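
The chunk count above is the page total divided by the chunk size, rounded up (300 / 25 = 12). The sketch below shows one way to derive the page ranges and filenames; `page_chunks` and `chunk_name` are hypothetical names, and the filename pattern is inferred from the sample output, not from the script itself.

```python
import math
from typing import List, Tuple


def page_chunks(total_pages: int, size: int = 99) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges of at most `size` pages."""
    if size < 1:
        raise ValueError("size must be at least 1")
    count = math.ceil(total_pages / size)
    return [
        (i * size + 1, min((i + 1) * size, total_pages))
        for i in range(count)
    ]


def chunk_name(stem: str, number: int) -> str:
    """Build a name such as Large_Document_chunk_001.pdf (assumed pattern)."""
    return f"{stem}_chunk_{number:03d}.pdf"


chunks = page_chunks(300, 25)
print(len(chunks))                       # 12
print(chunks[0], chunks[-1])             # (1, 25) (276, 300)
print(chunk_name("Large_Document", 1))   # Large_Document_chunk_001.pdf
```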

📂 Output Structure

Chapter Mode

book_name_chapters/
├── 001_Introduction.pdf
├── 002_Chapter 1_ Getting Started.pdf
├── 003_Chapter 2_ Advanced Concepts.pdf
└── ...

Page Mode

book_name_pages/
├── book_name_chunk_001.pdf
├── book_name_chunk_002.pdf
├── book_name_chunk_003.pdf
└── ...

🔧 How It Works

Chapter Detection

The tool scans the first 25 pages of a PDF for a table of contents, matching lines against multiple patterns:

  • Chapter Title ... Page Number
  • 1.1 Section Title ... 45
  • Chapter 1: Title 23
  • And more formats...

Smart Filename Generation

  • Replaces invalid characters (<>:"/\|?*) with underscores
  • Limits filename length for compatibility
  • Preserves meaningful chapter titles
  • Sequential numbering for organization
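
The filename rules above can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: `chunk_filename` and the 80-character limit are hypothetical, while the underscore replacement and three-digit prefix are inferred from sample names such as `002_Chapter 1_ Getting Started.pdf`.

```python
import re

# Characters that are invalid in filenames on common platforms (from the list above).
INVALID_CHARS = r'[<>:"/\\|?*]'
MAX_TITLE_LENGTH = 80  # assumed limit; the README does not state the real value


def chunk_filename(index: int, title: str, max_len: int = MAX_TITLE_LENGTH) -> str:
    """Build a sequentially numbered, filesystem-safe PDF filename."""
    cleaned = re.sub(INVALID_CHARS, "_", title).strip()
    # Truncate long titles; fall back to a placeholder if nothing is left.
    cleaned = cleaned[:max_len].rstrip() or "untitled"
    return f"{index:03d}_{cleaned}.pdf"


print(chunk_filename(1, "Introduction"))
# 001_Introduction.pdf
print(chunk_filename(2, "Chapter 1: Getting Started"))
# 002_Chapter 1_ Getting Started.pdf
```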

Fallback Strategy

If no table of contents is detected:

  1. Warns the user
  2. Automatically switches to page-based chunking
  3. Uses sensible defaults (99 pages per chunk)
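
A minimal sketch of the fallback decision described above. `choose_strategy` is a hypothetical function written for illustration; only the warning step, the switch to page mode, and the default of 99 pages come from this README.

```python
import warnings
from typing import Optional, Sequence, Tuple

DEFAULT_CHUNK_SIZE = 99  # default pages per chunk, as documented above


def choose_strategy(
    toc_entries: Sequence,
    requested_mode: str = "chapters",
    size: int = DEFAULT_CHUNK_SIZE,
) -> Tuple[str, Optional[int]]:
    """Return (mode, pages_per_chunk); pages_per_chunk is None in chapter mode."""
    if requested_mode == "chapters":
        if toc_entries:
            return "chapters", None
        # Step 1: warn the user. Steps 2 and 3: switch to page mode with the default size.
        warnings.warn("No table of contents found; falling back to page-based chunking")
        return "pages", size
    return "pages", size


print(choose_strategy([("Introduction", 5)]))  # ('chapters', None)
print(choose_strategy([]))                     # ('pages', 99), after a warning
```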

🎨 Supported TOC Formats

The tool recognizes various table of contents patterns:

✅ Chapter 1: Introduction .............. 15
✅ 1.1 Getting Started .................. 23
✅ Section 2.3 Advanced Topics .......... 67
✅ Appendix A 145
✅ Bibliography ......................... 200
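
As a rough illustration, two regular expressions are enough to match the example lines above. The patterns and the names `EXAMPLE_TOC_PATTERNS` and `parse_toc_line` are assumptions made for this sketch; the script's own `toc_patterns` list may differ.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns, tried in order: dot leaders first, then plain whitespace.
EXAMPLE_TOC_PATTERNS = [
    re.compile(r"^(?P<title>\S.*?)\s*\.{2,}\s*(?P<page>\d{1,4})\s*$"),
    re.compile(r"^(?P<title>\S.*?)\s+(?P<page>\d{1,4})\s*$"),
]


def parse_toc_line(line: str) -> Optional[Tuple[str, int]]:
    """Return (title, page) if the line looks like a TOC entry, else None."""
    text = line.strip()
    for pattern in EXAMPLE_TOC_PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("title"), int(match.group("page"))
    return None


samples = [
    "Chapter 1: Introduction .............. 15",
    "1.1 Getting Started .................. 23  ",
    "Section 2.3 Advanced Topics .......... 67",
    "Appendix A 145",
    "Bibliography ......................... 200",
]
for sample in samples:
    print(parse_toc_line(sample))
```

Patterns this loose will also match ordinary sentences that happen to end in a number, so a real implementation would likely need extra checks, such as requiring page numbers to increase.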

⚙️ Command Line Options

Option         Description                        Default
input_pdf      Path to input PDF file             Required
--mode         Chunking mode: chapters or pages   chapters
--output, -o   Output directory                   Same as input
--size, -s     Pages per chunk (page mode only)   99
--quiet, -q    Suppress verbose output            False
--help         Show help message                  -
--version      Show version                       -
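
For readers who want to see how such an interface is typically declared, here is an `argparse` sketch that mirrors the documented options and defaults. It is not the script's actual parser; `build_parser`, the help strings, and the version string are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options documented above (sketch, not the real parser)."""
    parser = argparse.ArgumentParser(
        prog="pdf_chapter_chunker.py",
        description="Split a PDF by chapters or by fixed page counts.",
    )
    parser.add_argument("input_pdf", help="Path to input PDF file")
    parser.add_argument("--mode", choices=["chapters", "pages"], default="chapters",
                        help="Chunking mode")
    parser.add_argument("--output", "-o", default=None,
                        help="Output directory (None here stands for 'same as input')")
    parser.add_argument("--size", "-s", type=int, default=99,
                        help="Pages per chunk (page mode only)")
    parser.add_argument("--quiet", "-q", action="store_true",
                        help="Suppress verbose output")
    # The version string is a placeholder, not the tool's real version.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    return parser


args = build_parser().parse_args(["book.pdf", "--mode", "pages", "-s", "50"])
print(args.input_pdf, args.mode, args.size, args.quiet)  # book.pdf pages 50 False
```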

🐛 Troubleshooting

Common Issues

"No table of contents found"

  • The PDF may not have a standard TOC format
  • The tool will automatically fall back to page-based chunking
  • Try adjusting the search depth or the TOC patterns in the script (see Adding New TOC Patterns)

"Error reading PDF"

  • Ensure the PDF file is not corrupted
  • Check file permissions
  • Try with a different PDF

Memory issues with large PDFs

  • The tool processes pages efficiently
  • For extremely large files (>1000 pages), consider splitting first

Debug Mode

Verbose output is the default. Run without --quiet to see detailed processing:

python pdf_chapter_chunker.py book.pdf  # Verbose by default

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

git clone https://github.com/newjordan/PDF-Chapter-Chunker.git
cd PDF-Chapter-Chunker
pip install -r requirements.txt

Adding New TOC Patterns

Edit the toc_patterns list in the PDFChunker class to support new formats.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with pypdf for PDF processing
  • Inspired by the need for better technical document organization
  • Thanks to the open source community for feedback and contributions

📊 Performance

  • Speed: Processes ~100 pages per second on modern hardware
  • Memory: Minimal RAM usage, suitable for large documents
  • Compatibility: Works with most PDF versions and formats

Made with ❤️ for better document organization

About

Break a PDF into chapters with metadata. Creates a folder with the chapters properly labeled. Do the Truffle Shuffle!
