OCR Processor Enterprise - Complete Documentation

Table of Contents

Overview
Features
Architecture
Installation
Configuration
Usage
API Reference
Development
Deployment
Administration
Troubleshooting
Security

Overview

OCR Processor Enterprise is a PDF OCR (Optical Character Recognition) toolkit for document digitization and analysis. Built on top of OCRmyPDF, it exposes the same processing engine through a REST API, a command-line tool, and a GUI, and offers three processing modes (`cli`, `force`, and `visual`) that range from quick text-layer enhancement to full re-OCR with layout visualization.

Key Capabilities

Multi-format Input: Process individual PDF files or entire directories
Multiple Interfaces: REST API, Command Line Interface, and Graphical User Interface
Multi-language Support: Hebrew, English, and other languages via Tesseract
Visual Analysis: Bounding box visualization and HOCR output
Enterprise Features: Job queuing, progress tracking, notifications, and audit logging

Version Information

Current Version: 2.0.1
Python Support: 3.11+
License: Part of the VirtualBox Technologies toolkit

Features

Processing Modes

Mode	Description	Use Case
`cli`	Fast processing that preserves existing text	Quick OCR enhancement
`force`	Complete OCR with visual highlights and compression	Full text replacement
`visual`	Processing with bounding box overlays	Layout analysis
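
How each mode maps onto OCRmyPDF is not spelled out in this guide. As a rough sketch, assuming `cli` corresponds to OCRmyPDF's `--skip-text` option and `force` to `--force-ocr` (both are real OCRmyPDF options, but the mapping, the treatment of `visual`, and the `build_ocrmypdf_command` helper are illustrative assumptions rather than the toolkit's actual code), a command line could be assembled like this:

```python
from pathlib import Path

# Hypothetical mapping from this toolkit's modes to OCRmyPDF options.
# --skip-text leaves pages that already contain text untouched;
# --force-ocr rasterizes every page and runs OCR on it again.
MODE_FLAGS = {
    "cli": ["--skip-text"],
    "force": ["--force-ocr"],
    "visual": ["--force-ocr"],  # assumption: overlays are drawn in a later step
}


def build_ocrmypdf_command(src: Path, dst: Path, mode: str = "cli",
                           lang: str = "heb+eng") -> list[str]:
    """Return the argv list for an ocrmypdf run; nothing is executed."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    sidecar = dst.with_suffix(".txt")  # plain-text output next to the PDF
    return [
        "ocrmypdf",
        "-l", lang,
        "--output-type", "pdfa",
        "--sidecar", str(sidecar),
        *MODE_FLAGS[mode],
        str(src),
        str(dst),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` once the flags have been checked against the installed OCRmyPDF version.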

Input Processing

Single File: Process individual PDF documents
Directory Processing: Batch process entire folders
Recursive Search: Automatically find PDFs in subdirectories
Smart Filtering: Only processes PDF files
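
The input rules above (single file or directory, optional recursion, PDFs only) can be sketched with `pathlib`. The function name `find_pdfs` and the case-insensitive extension check are assumptions made for illustration; the real script may filter differently.

```python
from pathlib import Path


def find_pdfs(input_path: Path, recursive: bool = True) -> list[Path]:
    """Return the PDF files to process for a file or directory input."""
    if input_path.is_file():
        # A single file is accepted only if it has a .pdf extension.
        return [input_path] if input_path.suffix.lower() == ".pdf" else []
    if not input_path.is_dir():
        raise FileNotFoundError(input_path)
    # rglob walks subdirectories; iterdir stays at the top level.
    candidates = input_path.rglob("*") if recursive else input_path.iterdir()
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```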

Visual Analysis Features

HOCR Generation: Extract spatial layout information
Bounding Box Visualization: Generate highlighted page images
Sidecar Text Output: Plain text extraction alongside PDF processing

Output Management

PDF/A Format: Standards-compliant archival output
Timestamped Folders: Organized results with unique timestamps
Comprehensive Logging: Detailed processing logs per file
Optional Archiving: Backup original files before processing
ZIP Compression: Automatic packaging for force mode

Multi-language Support

Hebrew + English (default): heb+eng
English Only: eng
Custom Languages: Support for any Tesseract language pack

Architecture

Project Structure

ocr-processor/
├── src/                    # Enterprise application code
│   ├── api_server.py      # REST API server (FastAPI)
│   ├── config.py           # Configuration management
│   ├── database_manager.py # Database operations (SQLAlchemy)
│   ├── error_handler.py    # Error handling and recovery
│   ├── logger.py           # Structured logging (structlog)
│   ├── notification_manager.py # Notifications and alerts
│   ├── progress_tracker.py # Job progress tracking
│   ├── security_validator.py # Input validation and security
│   └── ocr_utils.py        # Core OCR processing utilities
├── cli/                    # Unified CLI and GUI tools
│   ├── ocr_combined.py     # Unified OCR processing script (all modes)
│   ├── pdf_ocr_gui.py      # Full graphical user interface
│   ├── pdf_ocr_gui_simple.py # Simple GUI (CLI mode only)
│   └── PDF/                # Test PDF files
├── docker/                 # Docker configuration
│   ├── Dockerfile
│   ├── docker-compose.yml
│   ├── nginx.conf
│   ├── init.sql
│   ├── filebeat.yml
│   └── ssl/
├── docs/                   # Documentation
│   ├── COMPLETE_DOCUMENTATION.md
│   ├── ADMIN_GUIDE.md
│   └── DEPLOYMENT.md
├── tests/                  # Test files
│   └── test_ocr_utils.py
└── requirements.txt        # Python dependencies

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        OCR Processor                             │
├─────────────────────────────────────────────────────────────────┤
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐    │
│  │   REST    │  │    CLI    │  │    GUI    │  │  Worker   │    │
│  │   API     │  │  Tools    │  │  Interface│  │  Process  │    │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘    │
│        │              │              │              │          │
│        └──────────────┴──────────────┴──────────────┘          │
│                            │                                     │
│                    ┌───────▼───────┐                            │
│                    │  Core Engine  │                            │
│                    │ (OCRmyPDF)    │                            │
│                    └───────┬───────┘                            │
│                            │                                     │
│    ┌───────────────────────┼───────────────────────┐            │
│    │                       │                       │            │
│ ┌───▼───┐            ┌─────▼─────┐           ┌─────▼─────┐     │
│ │ Logger│            │  Database │           │ Notifier  │     │
│ └───────┘            │  (SQL)    │           └───────────┘     │
│                      └───────────┘                              │
└─────────────────────────────────────────────────────────────────┘

Component Details

API Server (`api_server.py`)

The REST API server provides HTTP endpoints for:

Job creation and management
System status monitoring
File upload/download
Batch processing

Key features:

FastAPI-based REST API
JWT/Bearer token authentication
CORS support
Background task processing

Configuration (`config.py`)

Centralized configuration management with:

Environment variable support
JSON config file loading
Validation and defaults
OCR settings per mode

Database Manager (`database_manager.py`)

SQLAlchemy-based database operations:

Job tracking and history
File processing records
Audit logging
Performance metrics

Error Handler (`error_handler.py`)

Comprehensive error handling:

Error classification and categorization
Exponential backoff retry mechanism
Circuit breaker pattern
Notification on critical errors
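
To illustrate the retry mechanism, here is a minimal sketch of exponential backoff with jitter. The helper name `retry_with_backoff`, its default delays, and the 10% jitter are assumptions; they are not taken from `error_handler.py`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from workers that failed together.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. A production version would normally retry only on errors classified as transient.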

Logger (`logger.py`)

Structured logging with:

JSON and console output
Log rotation
Remote logging support
Performance metrics tracking
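
The project uses structlog for this; the sketch below shows the same idea of one JSON object per log line using only the standard library, so it runs without extra dependencies. `JsonFormatter` and `get_json_logger` are illustrative names, not part of `logger.py`.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def get_json_logger(name: str = "ocr") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

For rotation, the stream handler could be swapped for `logging.handlers.RotatingFileHandler` with the same formatter.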

Notification Manager (`notification_manager.py`)

Multi-channel notifications:

Email (SMTP)
Webhooks
Slack integration
Scheduled notifications

Progress Tracker (`progress_tracker.py`)

Real-time progress tracking:

Job status updates
Performance metrics collection
Queue management
Callback system

Security Validator (`security_validator.py`)

Input validation and security:

Path traversal prevention
File type validation
Suspicious pattern detection
Quarantine system
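
Path traversal prevention usually comes down to resolving the requested path and checking that it still lies under an allowed base directory. The following is a minimal sketch under that assumption; `resolve_inside` is a hypothetical helper, not the validator's real interface.

```python
from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting escapes such as '../'."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    # After resolution (symlinks and '..' removed) the path must still be
    # located under the allowed base directory.
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes {base}: {user_path!r}")
    return candidate
```

An absolute `user_path` replaces the base when joined, so it is rejected by the same check unless it already points inside `base_dir`.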

Installation

System Requirements

Requirement	Minimum	Recommended
OS	Linux/macOS	Ubuntu 20.04+
RAM	4 GB	8 GB
Storage	10 GB	50 GB SSD
CPU	2 cores	4+ cores

Dependencies

System Dependencies

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-heb \
    qpdf \
    poppler-utils

macOS:

brew install tesseract tesseract-lang qpdf poppler

Python Dependencies

pip install -r requirements.txt

Docker Installation (Recommended)

# Clone the repository
git clone <repository-url>
cd ocr-processor

# Copy environment file
cp .env.example .env

# Start all services
docker-compose up -d

# Verify deployment
docker-compose ps
curl http://localhost:8000/health

Manual Installation

# Create virtual environment
python -m venv ocr-env
source ocr-env/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install system dependencies (Ubuntu/Debian)
sudo apt-get install -y tesseract-ocr tesseract-ocr-heb qpdf poppler-utils

# Start the API server
python -m src.api_server

Configuration

Environment Variables

Variable	Description	Default	Required
`OCR_LOG_LEVEL`	Logging level	INFO	No
`OCR_DEFAULT_LANGUAGE`	Default OCR language	heb+eng	No
`OCR_DEFAULT_MODE`	Default processing mode	cli	No
`OCR_MAX_CONCURRENT_JOBS`	Max parallel jobs	4	No
`OCR_MAX_FILE_SIZE`	Max file size (bytes)	104857600	No
`OCR_TIMEOUT_PER_FILE`	Timeout per file (seconds)	300	No
`OCR_ENABLE_API`	Enable REST API	true	No
`OCR_API_HOST`	API host	0.0.0.0	No
`OCR_API_PORT`	API port	8000	No
`OCR_DATABASE_URL`	Database connection	-	No*
`OCR_ENABLE_NOTIFICATIONS`	Enable notifications	false	No
`OCR_NOTIFICATION_EMAIL`	Admin email	-	No
`OCR_SMTP_SERVER`	SMTP server	-	No
`OCR_WEBHOOK_URL`	Webhook URL	-	No

*Required for database features

Configuration File

Create ocr_config.json in the project root:

{
  "default_language": "heb+eng",
  "default_mode": "cli",
  "max_concurrent_jobs": 4,
  "max_file_size": 104857600,
  "timeout_per_file": 300,
  "archive_originals": true,
  "create_zip_archives": true,
  "enable_database": false,
  "database_url": "postgresql://user:pass@localhost/ocr_db",
  "enable_notifications": false,
  "notification_email": "",
  "smtp_server": "",
  "smtp_port": 587,
  "webhook_url": "",
  "enable_api": true,
  "api_host": "0.0.0.0",
  "api_port": 8000,
  "log_level": "INFO",
  "log_to_file": true,
  "log_directory": "logs"
}

Configuration Precedence

Environment variables (highest priority)
Configuration file (ocr_config.json)
Default values (lowest priority)
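
The precedence order can be sketched as three successive merges, with later sources overwriting earlier ones. Only a few settings are shown, and `load_config`, `DEFAULTS`, and `ENV_MAP` are illustrative names rather than the contents of `config.py`.

```python
import json
import os
from pathlib import Path
from typing import Any, Mapping

DEFAULTS: dict[str, Any] = {
    "default_language": "heb+eng",
    "default_mode": "cli",
    "max_concurrent_jobs": 4,
    "api_port": 8000,
}

# Environment variable name -> (config key, converter)
ENV_MAP = {
    "OCR_DEFAULT_LANGUAGE": ("default_language", str),
    "OCR_DEFAULT_MODE": ("default_mode", str),
    "OCR_MAX_CONCURRENT_JOBS": ("max_concurrent_jobs", int),
    "OCR_API_PORT": ("api_port", int),
}


def load_config(config_file: Path = Path("ocr_config.json"),
                env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Merge defaults, then the JSON file, then environment variables."""
    config = dict(DEFAULTS)  # lowest priority
    if config_file.is_file():
        config.update(json.loads(config_file.read_text(encoding="utf-8")))
    for var, (key, convert) in ENV_MAP.items():
        if var in env:  # highest priority
            config[key] = convert(env[var])
    return config
```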

Usage

Command Line Interface

Basic Usage

# Process a single PDF (CLI mode - default)
python cli/ocr_combined.py document.pdf

# Process with force mode (complete OCR)
python cli/ocr_combined.py --mode force document.pdf

# Process with visual mode (bounding boxes)
python cli/ocr_combined.py --mode visual document.pdf

Directory Processing

# Process all PDFs in a directory recursively
python cli/ocr_combined.py --mode force documents/

# Process only top-level PDFs (non-recursive)
python cli/ocr_combined.py --mode force --no-recursive documents/

# Process with archiving
python cli/ocr_combined.py --mode force --archive-dir ./backup documents/

Language Selection

# English only
python cli/ocr_combined.py --lang eng document.pdf

# Hebrew and English (default)
python cli/ocr_combined.py --lang heb+eng document.pdf

# Multiple languages
python cli/ocr_combined.py --lang eng+fra+deu document.pdf

Command Line Options

Option	Description	Default
`input_path`	PDF file or directory	Required
`--mode`	Processing mode	cli
`--lang`	Language(s) for OCR	heb+eng
`--archive-dir`	Directory to backup originals	None
`--no-recursive`	Disable recursive search	False
`--log-file`	Main log file path	ocr_combined.log

Graphical User Interface

Launch the GUI application:

python cli/pdf_ocr_gui.py

The GUI provides:

File/directory selection
Mode and language dropdowns
Progress tracking
Log viewing
Archive configuration

REST API

Base URL

http://localhost:8000/api/v1

Authentication

All API endpoints require Bearer token authentication:

curl -H "Authorization: Bearer your-api-key" http://localhost:8000/health

Endpoints

Health Check:

GET /health

Create Job:

POST /jobs
Content-Type: application/json

{
  "input_path": "/path/to/document.pdf",
  "mode": "cli",
  "language": "heb+eng",
  "priority": "normal",
  "recursive": true,
  "webhook_url": "https://your-webhook.com/callback"
}
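
From Python, the same request could be built with the standard library. This sketch assumes the job endpoints live under the base URL shown above and that the API key is sent as a Bearer token; `build_create_job_request` is a hypothetical helper, and the request is only constructed, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def build_create_job_request(api_key: str, input_path: str,
                             mode: str = "cli",
                             language: str = "heb+eng") -> urllib.request.Request:
    """Build (but do not send) the POST /jobs request."""
    body = {"input_path": input_path, "mode": mode, "language": language}
    return urllib.request.Request(
        url=f"{BASE_URL}/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending it requires a running server:
# with urllib.request.urlopen(request, timeout=10) as response:
#     job = json.load(response)
```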

Get Job Status:

GET /jobs/{job_id}

List Jobs:

GET /jobs?limit=50&offset=0

Cancel Job:

DELETE /jobs/{job_id}

Upload File:

POST /upload
Content-Type: multipart/form-data

file: document.pdf
mode: cli
language: heb+eng
priority: normal

API Reference

Request Models

OCRJobCreate

Field	Type	Required	Description
`input_path`	string	Yes	Path to PDF file or directory
`mode`	string	No	Processing mode (cli, force, visual)
`language`	string	No	Language for OCR
`priority`	string	No	Job priority (low, normal, high, urgent)
`recursive`	boolean	No	Process directories recursively
`archive_originals`	boolean	No	Archive original files
`webhook_url`	string	No	Webhook URL for notifications
`metadata`	object	No	Additional metadata

OCRJobResponse

Field	Type	Description
`job_id`	string	Unique job identifier
`status`	string	Job status (pending, running, completed, failed, cancelled)
`progress`	float	Progress percentage (0-100)
`input_path`	string	Original input path
`mode`	string	Processing mode
`language`	string	OCR language
`created_at`	datetime	Job creation timestamp
`started_at`	datetime	Job start timestamp
`completed_at`	datetime	Job completion timestamp
`total_files`	int	Total files to process
`processed_files`	int	Successfully processed files
`failed_files`	int	Failed files
`output_path`	string	Output directory path
`error_message`	string	Error message if failed
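
A client typically polls `GET /jobs/{job_id}` until `status` reaches one of the terminal values listed above. The sketch below takes the HTTP call as a parameter so it stays independent of any particular client library; `wait_for_job` and its default interval and timeout are assumptions.

```python
import time
from typing import Any, Callable

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status: Callable[[str], dict[str, Any]],
                 job_id: str,
                 poll_interval: float = 2.0,
                 timeout: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict[str, Any]:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)  # expected to return an OCRJobResponse dict
        if job["status"] in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']}")
        sleep(poll_interval)
```

For long batches, supplying `webhook_url` when the job is created avoids polling altogether.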

Response Models

HealthCheck

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00Z",
  "services": {
    "database": "connected",
    "storage": "writable"
  },
  "database": "healthy",
  "storage": "healthy"
}

SystemStatus

{
  "status": "running",
  "version": "2.0.1",
  "uptime_seconds": 3600,
  "active_jobs": 2,
  "total_jobs": 150,
  "database_enabled": true,
  "notifications_enabled": true
}

Error Responses

{
  "detail": "Error message description"
}

Common HTTP Status Codes:

Code	Description
200	Success
201	Created
400	Bad Request
401	Unauthorized
404	Not Found
500	Internal Server Error

Development

Project Setup

# Clone repository
git clone <repository-url>
cd ocr-processor

# Create virtual environment
python -m venv dev-env
source dev-env/bin/activate

# Install development dependencies
pip install -r requirements.txt
pip install pytest pytest-asyncio pytest-cov

# Install pre-commit hooks
pre-commit install

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run a specific test file
pytest tests/test_ocr_utils.py -v

Code Style

Follow PEP 8 guidelines
Use type hints for all functions
Write docstrings for public functions
Run linting: flake8 src/

Adding New Features

Create feature branch
Implement changes
Add tests
Update documentation
Submit pull request

Debugging

# Enable debug logging
export OCR_LOG_LEVEL=DEBUG

# Run with the Python debugger (as a module, matching how the server is started)
python -m pdb -m src.api_server

Deployment

Docker Deployment

Production Setup

# Create environment file
cp .env.example .env
# Edit .env with production values

# Start services
docker-compose up -d

# Scale workers
docker-compose up -d --scale ocr-worker=3

Docker Compose Services

services:
  ocr-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OCR_DATABASE_URL=postgresql://ocr_user:ocr_password@postgres/ocr_db
    volumes:
      - ./data:/app/data
      - ./logs:/app/logs

  ocr-worker:
    build: .
    scale: 3
    environment:
      - OCR_DATABASE_URL=postgresql://ocr_user:ocr_password@postgres/ocr_db

  postgres:
    image: postgres:13
    environment:
      - POSTGRES_USER=ocr_user
      - POSTGRES_PASSWORD=ocr_password
      - POSTGRES_DB=ocr_db
    volumes:
      - postgres_data:/var/lib/postgresql/data

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./docker/nginx.conf:/etc/nginx/nginx.conf
      - ./docker/ssl:/etc/nginx/ssl

Manual Deployment

System Requirements

Python 3.11+
PostgreSQL 13+
Redis (optional, for caching)
Nginx (reverse proxy)

Steps

# Create user and directories
sudo useradd -r -s /sbin/nologin ocr
sudo mkdir -p /var/log/ocr /var/ocr/{input,output,archive}
sudo chown -R ocr:ocr /var/ocr /var/log/ocr

# Copy application files
sudo mkdir -p /opt/ocr-processor
sudo cp -r . /opt/ocr-processor/
cd /opt/ocr-processor

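# Hand the tree to the service account, then create the virtual environment as that user
sudo chown -R ocr:ocr /opt/ocr-processor
sudo -u ocr python -m venv venv
sudo -u ocr ./venv/bin/pip install -r requirements.txt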

# Create systemd service
sudo cp deploy/ocr-api.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable ocr-api
sudo systemctl start ocr-api

SSL/TLS Configuration

For production HTTPS:

server {
    listen 443 ssl http2;
    server_name ocr.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ocr.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ocr.yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Administration

System Monitoring

Health Checks

# API health
curl http://localhost:8000/health

# Database health
docker-compose exec postgres pg_isready -U ocr_user -d ocr_db

# Service status
docker-compose ps

Log Monitoring

# Real-time logs
docker-compose logs -f ocr-api

# Search for errors
docker-compose logs ocr-api | grep ERROR

# View log files
tail -f /var/log/ocr/ocr_processor.log

Routine Maintenance

Daily Tasks

Verify API health
Check disk space
Review error logs

Weekly Tasks

Analyze performance metrics
Review job success rates
Clean up old logs
Verify backup integrity

Monthly Tasks

Capacity planning analysis
Database maintenance
Security audit
Performance optimization

Backup and Recovery

Database Backup

# Create backup
docker-compose exec postgres pg_dump -U ocr_user ocr_db > backup.sql

# Restore from backup
docker-compose exec -T postgres psql -U ocr_user -d ocr_db < backup.sql

File Backup

# Backup output files
tar -czf ocr_files_backup.tar.gz /var/ocr/output/

# Backup with rsync
rsync -avz /var/ocr/output/ backup:/path/to/backup/

Scaling Guidelines

Load Level	Workers	CPU	RAM	Storage
Light (<10 files/day)	1	2	4GB	50GB
Medium (10-100 files/day)	2-3	4	8GB	200GB
Heavy (100+ files/day)	4+	8	16GB	500GB+

Troubleshooting

Common Issues

Tesseract Not Found

# Verify installation
tesseract --version

# Ubuntu/Debian
sudo apt-get install tesseract-ocr tesseract-ocr-heb

# macOS
brew install tesseract tesseract-lang

Permission Errors

# Check file permissions
ls -la input.pdf

# Fix permissions
chmod 644 input.pdf

# Check directory access
ls -la /path/to/documents/

High Memory Usage

# Reduce concurrent jobs
export OCR_MAX_CONCURRENT_JOBS=2

# Monitor memory usage
docker stats

Database Connection Errors

# Check database logs
docker-compose logs postgres

# Check that the database module imports inside the container (this alone does not open a connection)
docker-compose exec ocr-api python -c "from database_manager import get_database_manager; print('DB OK')"

OCR Processing Failures

# Check Tesseract installation
tesseract --version

# Verify file permissions
ls -la /var/ocr/input/

# Check OCR logs
tail -f /var/log/ocr/ocr_errors.log

Performance Tuning

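# Match concurrent jobs to the number of CPU cores (raises memory use accordingly)
export OCR_MAX_CONCURRENT_JOBS=$(nproc)

# Raise the per-file size limit to 2 GiB (default: 100 MiB)
export OCR_MAX_FILE_SIZE=2147483648

# Parallel processing
export OCR_JOBS=0  # Use all available cores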

Getting Help

Check logs: /var/log/ocr/ocr_errors.log
Health check: curl http://localhost:8000/health
Database status: docker-compose exec postgres pg_isready -U ocr_user -d ocr_db
Service status: docker-compose ps

Security

Authentication

API Key Generation

# Generate secure API key
python -c "import secrets; print(secrets.token_urlsafe(32))"

Setting API Key

export OCR_API_KEY="your-generated-api-key"

Input Validation

The system validates all inputs:

Path traversal prevention
File type verification
Suspicious pattern detection
Size limits enforcement

File Security

Suspicious files are quarantined
File permissions are validated
MIME type detection with magic numbers
SHA256 checksum calculation
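
The last two checks can be combined in a single pass over the file. This is a minimal sketch assuming a strict check for the `%PDF-` signature at offset 0; `inspect_pdf` is an illustrative name, and the real validator may be more lenient or use a MIME-detection library.

```python
import hashlib
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def inspect_pdf(path: Path, chunk_size: int = 1 << 20) -> tuple[bool, str]:
    """Return (looks_like_pdf, sha256_hex) for the file at path."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        header = handle.read(len(PDF_MAGIC))
        digest.update(header)
        # Hash the remainder in chunks so large files are not loaded at once.
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return header == PDF_MAGIC, digest.hexdigest()
```

A file that fails the signature check despite a `.pdf` extension is a natural candidate for quarantine, and the checksum can be stored with the job record for later integrity checks.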

Security Best Practices

Keep software updated: Regularly update Tesseract and dependencies
Use HTTPS: Always use SSL/TLS in production
Limit permissions: Use least-privilege user accounts
Monitor logs: Review security logs regularly
Backup data: Maintain regular backups
Access control: Restrict API access to authorized users

Security Headers

add_header X-Frame-Options DENY;
add_header X-Content-Type-Options nosniff;
add_header X-XSS-Protection "1; mode=block";
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains";

Contributing

Reporting Issues

Check existing issues first
Provide detailed reproduction steps
Include logs and error messages
Specify environment details

Pull Requests

Fork the repository
Create feature branch
Make changes with tests
Update documentation
Submit pull request

Code Style

Follow PEP 8
Use type hints
Write docstrings
Add unit tests
Update changelog

License

This OCR processing suite is part of the VirtualBox Technologies toolkit and is available for use in document processing workflows.

Support

Getting Help

Documentation: Check this guide and README.md
API Docs: Visit http://localhost:8000/docs
Issues: Create GitHub issue
Email: Contact support team

Emergency Contacts

Primary Administrator: admin@yourcompany.com
Development Team: dev-team@yourcompany.com
On-call Support: +1-234-567-8900

Changelog

Version 2.0.1

CLI Consolidation: Unified all CLI scripts into ocr_combined.py
Removed Legacy Scripts: Eliminated duplicate and outdated CLI tools
Simplified Architecture: Single source of truth for CLI processing
Maintained Functionality: All processing modes preserved in unified script

Version 2.0.0

New REST API with FastAPI
Database integration for job tracking
Multi-channel notifications
Enhanced security validation
Structured logging with structlog
Progress tracking and metrics
Docker Compose deployment
GUI application

Version 1.x

Initial CLI tools
Basic OCR processing
Multiple processing modes
Multi-language support

Last updated: 2024-01-15
Documentation version: 2.0.1
