Agentic Document Extraction

A Python pipeline for extracting structured data from PDF documents, designed to be modular, adaptable, and extensible.

Overview

Agentic Document Extraction handles both digital (text-based) PDFs and scanned (image-based) PDFs. The pipeline performs:

AI-driven layout analysis
Text content extraction (for both digital and scanned PDFs via OCR)
Table extraction and structuring
Image and chart extraction with analysis
Context-aware reasoning and validation
Structured output in multiple formats (JSON, CSV)

The system can run as a standalone CLI tool or be deployed via API for real-time use.
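The stages listed above can be pictured as a chain of functions that each enrich a shared result, with the final result serialized to JSON or CSV. The sketch below is illustrative only: the stage names, the `STAGES` registry, and `run_pipeline` are hypothetical and are not taken from this repository's `src` package, and the stage bodies are placeholders standing in for real layout models, OCR, and table detection.

```python
import csv
import io
import json
from typing import Callable, Dict, List

# Hypothetical stage signature: each stage reads and enriches a shared dict.
Stage = Callable[[dict], dict]


def analyze_layout(doc: dict) -> dict:
    # Placeholder: a real stage would call a layout model here.
    doc["layout"] = [{"type": "paragraph", "page": 1}]
    return doc


def extract_text(doc: dict) -> dict:
    # Placeholder: a real stage would read embedded text or run OCR.
    doc["text"] = doc.get("raw_text", "").strip()
    return doc


def extract_tables(doc: dict) -> dict:
    # Placeholder: treat comma-separated lines as rows of a single table.
    lines = doc.get("text", "").splitlines()
    rows = [line.split(",") for line in lines if "," in line]
    doc["tables"] = [rows] if rows else []
    return doc


# Assumed registry so that stages can be selected by name from configuration.
STAGES: Dict[str, Stage] = {
    "layout": analyze_layout,
    "text": extract_text,
    "tables": extract_tables,
}


def run_pipeline(doc: dict, stage_names: List[str]) -> dict:
    """Apply the named stages in order and return the enriched document."""
    for name in stage_names:
        doc = STAGES[name](doc)
    return doc


def to_json(doc: dict) -> str:
    """Serialize the whole result as JSON."""
    return json.dumps(doc, indent=2, sort_keys=True)


def table_to_csv(rows: List[List[str]]) -> str:
    """Serialize one extracted table as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

For example, `run_pipeline({"raw_text": "name,total\nwidget,3\n"}, ["text", "tables"])` runs only the text and table stages, and `table_to_csv` can then write the single extracted table. Keeping stages behind a name-to-function registry is one simple way to let a configuration file decide which components run.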

Features

Dynamic Document Type Handling: Automatically detects and processes both text-based and scanned PDFs
Layout Analysis: Uses AI models (LayoutLM, LayoutParser) to understand document structure
Multi-modal Extraction: Extracts text, tables, images, and charts with specialized components
Content Reasoning: Applies NLP and validation to understand context and relationships between extracted elements
Configurable Pipeline: Mix and match components via configuration
Multiple Output Formats: JSON, CSV, and optional database integration
API Deployment: Can be deployed as a REST API service
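As an illustration of the Dynamic Document Type Handling feature above, one common heuristic for telling text-based PDFs from scanned ones is to check how much embedded text each page yields. This README does not describe how the project actually makes that decision, so the function name, thresholds, and return values below are assumptions. The input is a list of per-page strings, as any PDF text extractor would produce.

```python
from typing import List


def classify_pdf(
    page_texts: List[str],
    min_chars_per_page: int = 20,   # assumed threshold, tune per corpus
    min_text_ratio: float = 0.5,    # assumed share of pages that must carry text
) -> str:
    """Return "digital" if enough pages have embedded text, else "scanned"."""
    if not page_texts:
        # No pages or no extractable text at all: fall back to the OCR path.
        return "scanned"
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    ratio = pages_with_text / len(page_texts)
    return "digital" if ratio >= min_text_ratio else "scanned"
```

A document classified as "scanned" would be routed through OCR before text extraction, while a "digital" one can use its embedded text directly. Mixed documents, such as a digital report with a few scanned appendix pages, may be better served by applying the same check page by page.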

Installation

# Clone the repository
git clone https://github.com/yagna-1/agentic-document-extractor.git
cd agentic-document-extractor

# Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download any required models (optional)
python -m src.utils download_models
