xgen-doc2chunk

xgen-doc2chunk is a document processing library that converts raw documents into AI-understandable context. It analyzes, restructures, and normalizes content so that language models can reason over documents with higher accuracy and consistency.

Features

Multi-format Support: Process a wide variety of document formats including:
- PDF (with table detection, OCR fallback, and complex layout handling)
- Microsoft Office: DOCX, DOC, PPTX, PPT, XLSX, XLS
- Korean documents: HWP, HWPX (Hangul Word Processor)
- Text formats: TXT, MD, RTF, CSV, HTML
- Code files: Python, JavaScript, TypeScript, and 20+ languages
Intelligent Text Extraction:
- Preserves document structure (headings, paragraphs, lists)
- Extracts tables as HTML with proper rowspan/colspan handling
- Handles merged cells and complex table layouts
- Extracts and processes inline images
OCR Integration:
- Pluggable OCR engine architecture
- Supports OpenAI, Anthropic, Google Gemini, and vLLM backends
- Automatic OCR fallback for scanned documents or image-based PDFs
Smart Chunking:
- Semantic text chunking with configurable size and overlap
- Table-aware chunking that preserves table integrity
- Protected regions for code blocks and special content
Metadata Extraction:
- Extracts document metadata (title, author, creation date, etc.)
- Formats metadata in a structured, parseable format

Installation

pip install xgen-doc2chunk

Or using uv:

uv add xgen-doc2chunk

Quick Start

Basic Usage

from xgen_doc2chunk import DocumentProcessor

# Create processor instance
processor = DocumentProcessor()

# Extract text from a document
text = processor.extract_text("document.pdf")
print(text)

# Extract text and chunk in one step
result = processor.extract_chunks(
    "document.pdf",
    chunk_size=1000,
    chunk_overlap=200
)

# Access chunks
for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i + 1}: {chunk[:100]}...")

# Save chunks to markdown file
result.save_to_md("output/chunks.md")

With OCR Processing

from xgen_doc2chunk import DocumentProcessor
from xgen_doc2chunk.ocr.ocr_engine.openai_ocr import OpenAIOCREngine

# Initialize OCR engine
ocr_engine = OpenAIOCREngine(api_key="sk-...", model="gpt-4o")

# Create processor with OCR
processor = DocumentProcessor(ocr_engine=ocr_engine)

# Extract text with OCR processing enabled
text = processor.extract_text(
    "scanned_document.pdf",
    ocr_processing=True
)

Supported Formats

Category	Extensions
Documents	`.pdf`, `.docx`, `.doc`, `.pptx`, `.ppt`, `.hwp`, `.hwpx`
Spreadsheets	`.xlsx`, `.xls`, `.csv`, `.tsv`
Text	`.txt`, `.md`, `.rtf`
Web	`.html`, `.htm`, `.xml`
Code	`.py`, `.js`, `.ts`, `.java`, `.cpp`, `.c`, `.go`, `.rs`, and more
Config	`.json`, `.yaml`, `.yml`, `.toml`, `.ini`, `.env`

Architecture

libs/
├── core/
│   ├── document_processor.py    # Main entry point
│   ├── processor/               # Format-specific handlers
│   │   ├── pdf_handler.py       # PDF processing with V4 engine
│   │   ├── docx_handler.py      # DOCX processing
│   │   ├── ppt_handler.py       # PowerPoint processing
│   │   ├── excel_handler.py     # Excel processing
│   │   ├── hwp_processor.py     # HWP 5.0 OLE processing
│   │   ├── hwpx_processor.py    # HWPX (ZIP/XML) processing
│   │   └── ...
│   └── functions/
│       └── img_processor.py     # Image handling utilities
├── chunking/
│   ├── chunking.py              # Main chunking interface
│   ├── text_chunker.py          # Text-based chunking
│   ├── table_chunker.py         # Table-aware chunking
│   └── page_chunker.py          # Page-based chunking
└── ocr/
    ├── base.py                  # OCR base class
    ├── ocr_processor.py         # OCR processing utilities
    └── ocr_engine/              # OCR engine implementations
        ├── openai_ocr.py
        ├── anthropic_ocr.py
        ├── gemini_ocr.py
        └── vllm_ocr.py

Requirements

Python 3.12+
Required dependencies are automatically installed (see pyproject.toml)

System Dependencies

For full functionality, you may need:

Tesseract OCR: For local OCR fallback
LibreOffice: For DOC/RTF conversion (optional)
Poppler: For PDF image extraction

Configuration

# Custom configuration
config = {
    "pdf": {
        "extract_images": True,
        "ocr_fallback": True,
    },
    "chunking": {
        "default_size": 1000,
        "default_overlap": 200,
    }
}

processor = DocumentProcessor(config=config)

License

Apache License 2.0 - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
.github/workflows		.github/workflows
xgen_doc2chunk		xgen_doc2chunk
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Process Logic.md		Process Logic.md
QUICKSTART.md		QUICKSTART.md
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

xgen-doc2chunk

Features

Installation

Quick Start

Basic Usage

With OCR Processing

Supported Formats

Architecture

Requirements

System Dependencies

Configuration

License

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

xgen-doc2chunk

Features

Installation

Quick Start

Basic Usage

With OCR Processing

Supported Formats

Architecture

Requirements

System Dependencies

Configuration

License

Contributing

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages