Agentic Document Extraction

A Python pipeline for extracting structured data from PDF documents, designed to be modular, adaptable, and extensible.

Overview

Agentic Document Extraction handles both digital text PDFs and scanned image PDFs. The system performs:

  • AI-driven layout analysis
  • Text content extraction (for both digital and scanned PDFs via OCR)
  • Table extraction and structuring
  • Image and chart extraction with analysis
  • Context-aware reasoning and validation
  • Structured output in multiple formats (JSON, CSV)

The system can run as a standalone CLI tool or be deployed via API for real-time use.
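
To illustrate how digital and scanned PDFs might be told apart, here is a minimal sketch. It is not the project's actual detection logic: it assumes the text layer of each page has already been extracted into a list of strings (one per page), and the names `route_pages`, `classify_document`, and the `MIN_TEXT_CHARS` threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold (an assumption, not a project setting): a page with
# fewer extractable characters than this is treated as a scanned image.
MIN_TEXT_CHARS = 20


@dataclass
class PageRoute:
    page_number: int  # 1-based page index
    needs_ocr: bool   # True when the embedded text layer looks unusable


def route_pages(page_texts: List[Optional[str]],
                min_chars: int = MIN_TEXT_CHARS) -> List[PageRoute]:
    """Decide, page by page, whether to use the text layer or fall back to OCR."""
    routes = []
    for number, text in enumerate(page_texts, start=1):
        usable_chars = len((text or "").strip())
        routes.append(PageRoute(number, needs_ocr=usable_chars < min_chars))
    return routes


def classify_document(page_texts: List[Optional[str]],
                      min_chars: int = MIN_TEXT_CHARS) -> str:
    """Summarise the per-page routes as 'digital', 'scanned', 'mixed', or 'empty'."""
    routes = route_pages(page_texts, min_chars)
    if not routes:
        return "empty"
    ocr_pages = sum(route.needs_ocr for route in routes)
    if ocr_pages == 0:
        return "digital"
    if ocr_pages == len(routes):
        return "scanned"
    return "mixed"
```

Routing per page rather than per file lets a document that mixes digital pages with scanned inserts send only the scanned pages through OCR.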

Features

  • Dynamic Document Type Handling: Automatically detects and processes both text-based and scanned PDFs
  • Layout Analysis: Uses AI models (LayoutLM, LayoutParser) to understand document structure
  • Multi-modal Extraction: Extracts text, tables, images, and charts with specialized components
  • Content Reasoning: Applies NLP and validation to understand context and relationships between extracted elements
  • Configurable Pipeline: Mix and match components via configuration
  • Multiple Output Formats: JSON, CSV, and optional database integration
  • API Deployment: Can be deployed as a REST API service
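
The "mix and match components via configuration" idea can be sketched with a small stage registry. This is an illustrative pattern only, not the repository's configuration format: the `STAGES` registry, the `register` decorator, the `build_pipeline` helper, the stage names, and the placeholder stage bodies are all assumptions.

```python
from typing import Callable, Dict, List

# A stage takes the document state (a dict) and returns an updated copy.
Stage = Callable[[dict], dict]

# Hypothetical registry mapping configuration names to stage functions.
STAGES: Dict[str, Stage] = {}


def register(name: str) -> Callable[[Stage], Stage]:
    """Register a stage function under the name used in the configuration."""
    def decorator(func: Stage) -> Stage:
        STAGES[name] = func
        return func
    return decorator


@register("layout")
def analyze_layout(doc: dict) -> dict:
    # Placeholder: a real stage would call a layout model here.
    return {**doc, "layout": ["paragraph"]}


@register("text")
def extract_text(doc: dict) -> dict:
    # Placeholder: a real stage would read the text layer or run OCR.
    return {**doc, "text": doc.get("raw", "").strip()}


@register("tables")
def extract_tables(doc: dict) -> dict:
    # Placeholder: a real stage would detect and structure tables.
    return {**doc, "tables": []}


def build_pipeline(config: dict) -> Stage:
    """Build a callable that runs the configured stages in order."""
    names: List[str] = config["stages"]
    unknown = [name for name in names if name not in STAGES]
    if unknown:
        raise KeyError(f"Unknown stages: {unknown}")
    stages = [STAGES[name] for name in names]

    def run(doc: dict) -> dict:
        for stage in stages:
            doc = stage(doc)
        return doc

    return run


# Only the stages listed in the configuration are executed.
config = {"stages": ["layout", "text"]}
pipeline = build_pipeline(config)
result = pipeline({"raw": "  Invoice 42  "})
```

With this layout, enabling table extraction would be a matter of adding `"tables"` to the `stages` list, and an unknown stage name fails early when the pipeline is built instead of midway through a document.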

Installation

# Clone the repository
git clone https://github.com/yagna-1/agentic-document-extractor.git
cd agentic-document-extractor

# Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download the models (optional step)
python -m src.utils download_models
