A Python-based pipeline for extracting structured data from PDF documents, with an emphasis on modularity, adaptability, and extensibility.
Agentic Document Extraction handles both digital (text-based) PDFs and scanned (image-only) PDFs. The system performs:
- AI-driven layout analysis
- Text content extraction (directly from digital PDFs, via OCR from scanned PDFs)
- Table extraction and structuring
- Image and chart extraction with analysis
- Context-aware reasoning and validation
- Structured output in multiple formats (JSON, CSV)
The system can run as a standalone CLI tool or be deployed via API for real-time use.
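The JSON and CSV outputs listed above can be illustrated with a minimal standard-library sketch. The record fields (`page`, `type`, `content`) and the helper names `to_json` and `to_csv` are assumptions made for illustration; they are not the project's actual schema or API.

```python
import csv
import io
import json

# Hypothetical extraction result; the field names are illustrative only.
records = [
    {"page": 1, "type": "text", "content": "Invoice #1042"},
    {"page": 1, "type": "table", "content": "2 rows x 3 columns"},
]


def to_json(rows):
    """Serialize a list of flat dicts to a JSON string."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_csv(rows):
    """Serialize a list of flat dicts to CSV text.

    The header is taken from the keys of the first row.
    """
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(records)
csv_text = to_csv(records)
```

Flat records map cleanly onto both formats; nested structures such as table cells would serialize directly to JSON but would need flattening before they could be written as CSV.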
- Dynamic Document Type Handling: Automatically detects and processes both text-based and scanned PDFs
- Layout Analysis: Uses AI models (LayoutLM, LayoutParser) to understand document structure
- Multi-modal Extraction: Extracts text, tables, images, and charts with specialized components
- Content Reasoning: Applies NLP and validation to understand context and relationships between extracted elements
- Configurable Pipeline: Mix and match components via configuration
- Multiple Output Formats: JSON, CSV, and optional database integration
- API Deployment: Can be deployed as a REST API service
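One way the configurable pipeline and document type handling described above could fit together is a registry of named stages that a configuration list selects and orders. The sketch below is an assumption about the general shape, not the project's code: `STAGES`, `register`, `run_pipeline`, the stage names, and the `min_chars` threshold are all hypothetical, and the OCR step is only a placeholder.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a stage name to the function that runs it.
Stage = Callable[[dict], dict]
STAGES: Dict[str, Stage] = {}


def register(name: str) -> Callable[[Stage], Stage]:
    """Decorator that adds a stage function to the registry under `name`."""
    def decorator(func: Stage) -> Stage:
        STAGES[name] = func
        return func
    return decorator


@register("detect_type")
def detect_type(doc: dict) -> dict:
    # Assumed heuristic: a PDF whose pages carry almost no embedded text
    # is treated as scanned. The threshold of 20 characters is arbitrary.
    chars = sum(len(text.strip()) for text in doc["page_texts"])
    doc["kind"] = "digital" if chars >= doc.get("min_chars", 20) else "scanned"
    return doc


@register("extract_text")
def extract_text(doc: dict) -> dict:
    if doc["kind"] == "digital":
        doc["text"] = "\n".join(doc["page_texts"])
    else:
        doc["text"] = ""  # placeholder: a real stage would run OCR here
    return doc


def run_pipeline(doc: dict, config: List[str]) -> dict:
    """Run the stages named in `config`, in order, on `doc`."""
    for name in config:
        doc = STAGES[name](doc)
    return doc


config = ["detect_type", "extract_text"]
digital = run_pipeline({"page_texts": ["Quarterly report", "Revenue grew 4%"]}, config)
scanned = run_pipeline({"page_texts": ["", " "]}, config)
```

With this shape, swapping or omitting a component only requires changing the `config` list, which could equally be loaded from a YAML or JSON file.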
# Clone the repository
git clone https://github.com/yagna-1/agentic-document-extractor.git
cd agentic-document-extractor
# Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Download any required models (optional)
python -m src.utils download_models