- Overview
- Features
- Architecture
- Installation
- Configuration
- Usage
- API Reference
- Development
- Deployment
- Administration
- Troubleshooting
- Security
OCR Processor Enterprise is a PDF OCR (Optical Character Recognition) toolkit for document digitization and analysis. Built on top of OCRmyPDF, it offers three processing modes (cli, force, and visual) through a REST API, a command-line interface, and a graphical interface.
- Multi-format Input: Process individual PDF files or entire directories
- Multiple Interfaces: REST API, Command Line Interface, and Graphical User Interface
- Multi-language Support: Hebrew, English, and other languages via Tesseract
- Visual Analysis: Bounding box visualization and HOCR output
- Enterprise Features: Job queuing, progress tracking, notifications, and audit logging
- Current Version: 2.0.1
- Python Support: 3.11+
- License: Part of the VirtualBox Technologies toolkit
| Mode | Description | Use Case |
|---|---|---|
| cli | Fast processing that preserves existing text | Quick OCR enhancement |
| force | Complete OCR with visual highlights and compression | Full text replacement |
| visual | Processing with bounding box overlays | Layout analysis |
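Because the toolkit is built on OCRmyPDF, each mode presumably translates into a set of OCRmyPDF options. The sketch below shows one way that translation might look. It is an assumption, not the toolkit's actual code: `build_ocrmypdf_args` and `MODE_FLAGS` are hypothetical names, and pairing `cli` with `--skip-text` and `force`/`visual` with `--force-ocr` is inferred from the descriptions in the table.

```python
from typing import List

# Assumed mapping from toolkit mode to OCRmyPDF flags (not confirmed by the source).
MODE_FLAGS = {
    "cli": ["--skip-text"],     # leave pages that already contain text untouched
    "force": ["--force-ocr"],   # rasterize and re-OCR every page
    "visual": ["--force-ocr"],  # overlays would be drawn in a separate step
}


def build_ocrmypdf_args(input_pdf: str, output_pdf: str,
                        mode: str = "cli", lang: str = "heb+eng") -> List[str]:
    """Return an argv list for the `ocrmypdf` command-line tool."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return ["ocrmypdf", "--language", lang, *MODE_FLAGS[mode], input_pdf, output_pdf]
```

The returned list can be passed to `subprocess.run(args, check=True)` on a machine where OCRmyPDF is installed.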
- Single File: Process individual PDF documents
- Directory Processing: Batch process entire folders
- Recursive Search: Automatically find PDFs in subdirectories
- Smart Filtering: Only processes PDF files
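The input handling described above (single file, directory, recursive search, PDF-only filtering) can be sketched with `pathlib`. This is a minimal illustration under the assumption that filtering is done by file extension; `find_pdfs` is a hypothetical helper, not a function from the toolkit.

```python
from pathlib import Path
from typing import List


def find_pdfs(input_path: str, recursive: bool = True) -> List[Path]:
    """Return the PDF files named by input_path, sorted for stable ordering."""
    path = Path(input_path)
    if path.is_file():
        # A single file is accepted only if it has a .pdf extension.
        return [path] if path.suffix.lower() == ".pdf" else []
    # rglob descends into subdirectories; glob stays at the top level.
    candidates = path.rglob("*") if recursive else path.glob("*")
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")
```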
- HOCR Generation: Extract spatial layout information
- Bounding Box Visualization: Generate highlighted page images
- Sidecar Text Output: Plain text extraction alongside PDF processing
- PDF/A Format: Standards-compliant archival output
- Timestamped Folders: Organized results with unique timestamps
- Comprehensive Logging: Detailed processing logs per file
- Optional Archiving: Backup original files before processing
- ZIP Compression: Automatic packaging for force mode
- Hebrew + English (default): heb+eng
- English Only: eng
- Custom Languages: Support for any Tesseract language pack
ocr-processor/
├── src/ # Enterprise application code
│ ├── api_server.py # REST API server (FastAPI)
│ ├── config.py # Configuration management
│ ├── database_manager.py # Database operations (SQLAlchemy)
│ ├── error_handler.py # Error handling and recovery
│ ├── logger.py # Structured logging (structlog)
│ ├── notification_manager.py # Notifications and alerts
│ ├── progress_tracker.py # Job progress tracking
│ ├── security_validator.py # Input validation and security
│ └── ocr_utils.py # Core OCR processing utilities
├── cli/ # Unified CLI and GUI tools
│ ├── ocr_combined.py # Unified OCR processing script (all modes)
│ ├── pdf_ocr_gui.py # Full graphical user interface
│ ├── pdf_ocr_gui_simple.py # Simple GUI (CLI mode only)
│ └── PDF/ # Test PDF files
├── docker/ # Docker configuration
│ ├── Dockerfile
│ ├── docker-compose.yml
│ ├── nginx.conf
│ ├── init.sql
│ ├── filebeat.yml
│ └── ssl/
├── docs/ # Documentation
│ ├── COMPLETE_DOCUMENTATION.md
│ ├── ADMIN_GUIDE.md
│ └── DEPLOYMENT.md
├── tests/ # Test files
│ └── test_ocr_utils.py
└── requirements.txt # Python dependencies
┌─────────────────────────────────────────────────────────────────┐
│                          OCR Processor                          │
├─────────────────────────────────────────────────────────────────┤
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐     │
│  │   REST    │  │    CLI    │  │    GUI    │  │  Worker   │     │
│  │    API    │  │   Tools   │  │ Interface │  │  Process  │     │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘     │
│        │              │              │              │           │
│        └──────────────┴──────────────┴──────────────┘           │
│                              │                                  │
│                      ┌───────▼───────┐                          │
│                      │  Core Engine  │                          │
│                      │  (OCRmyPDF)   │                          │
│                      └───────┬───────┘                          │
│                              │                                  │
│      ┌───────────────────────┼───────────────────────┐          │
│      │                       │                       │          │
│  ┌───▼───┐             ┌─────▼─────┐           ┌─────▼─────┐    │
│  │ Logger│             │ Database  │           │ Notifier  │    │
│  └───────┘             │   (SQL)   │           └───────────┘    │
│                        └───────────┘                            │
└─────────────────────────────────────────────────────────────────┘
API Server (api_server.py)
The REST API server provides HTTP endpoints for:
- Job creation and management
- System status monitoring
- File upload/download
- Batch processing
Key features:
- FastAPI-based REST API
- JWT/Bearer token authentication
- CORS support
- Background task processing
Configuration (config.py)
Centralized configuration management with:
- Environment variable support
- JSON config file loading
- Validation and defaults
- OCR settings per mode
Database Manager (database_manager.py)
SQLAlchemy-based database operations:
- Job tracking and history
- File processing records
- Audit logging
- Performance metrics
Error Handler (error_handler.py)
Comprehensive error handling:
- Error classification and categorization
- Exponential backoff retry mechanism
- Circuit breaker pattern
- Notification on critical errors
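To make the retry behaviour concrete, here is a minimal sketch of exponential backoff with jitter. It does not reproduce `error_handler.py`; the function name, default delays, and the set of retried exceptions are all assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

The `sleep` parameter is injectable so the function can be exercised in tests without real waiting.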
Logger (logger.py)
Structured logging with:
- JSON and console output
- Log rotation
- Remote logging support
- Performance metrics tracking
Notification Manager (notification_manager.py)
Multi-channel notifications:
- Email (SMTP)
- Webhooks
- Slack integration
- Scheduled notifications
Progress Tracker (progress_tracker.py)
Real-time progress tracking:
- Job status updates
- Performance metrics collection
- Queue management
- Callback system
Security Validator (security_validator.py)
Input validation and security:
- Path traversal prevention
- File type validation
- Suspicious pattern detection
- Quarantine system
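Path traversal prevention usually comes down to resolving the requested path and checking that it still lies under an allowed base directory. The following is a sketch of that check, not the code in `security_validator.py`; `resolve_inside` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # resolve() collapses ".." and follows symlinks, so the containment check
    # runs on the real location rather than on the string the caller sent.
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes {base}: {user_path!r}")
    return candidate
```

An absolute `user_path` replaces the base when joined, so it is rejected by the same check unless it already points inside `base_dir`.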
| Requirement | Minimum | Recommended |
|---|---|---|
| OS | Linux/macOS | Ubuntu 20.04+ |
| RAM | 4 GB | 8 GB |
| Storage | 10 GB | 50 GB SSD |
| CPU | 2 cores | 4+ cores |
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y \
tesseract-ocr \
tesseract-ocr-heb \
qpdf \
poppler-utils
macOS:
brew install tesseract qpdf poppler
pip install -r requirements.txt
# Clone the repository
git clone <repository-url>
cd ocr-processor
# Copy environment file
cp .env.example .env
# Start all services
docker-compose up -d
# Verify deployment
docker-compose ps
curl http://localhost:8000/health
# Create virtual environment
python -m venv ocr-env
source ocr-env/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-heb
# Start the API server
python -m src.api_server
| Variable | Description | Default | Required |
|---|---|---|---|
| OCR_LOG_LEVEL | Logging level | INFO | No |
| OCR_DEFAULT_LANGUAGE | Default OCR language | heb+eng | No |
| OCR_DEFAULT_MODE | Default processing mode | cli | No |
| OCR_MAX_CONCURRENT_JOBS | Max parallel jobs | 4 | No |
| OCR_MAX_FILE_SIZE | Max file size (bytes) | 104857600 | No |
| OCR_TIMEOUT_PER_FILE | Timeout per file (seconds) | 300 | No |
| OCR_ENABLE_API | Enable REST API | true | No |
| OCR_API_HOST | API host | 0.0.0.0 | No |
| OCR_API_PORT | API port | 8000 | No |
| OCR_DATABASE_URL | Database connection | - | No* |
| OCR_ENABLE_NOTIFICATIONS | Enable notifications | false | No |
| OCR_NOTIFICATION_EMAIL | Admin email | - | No |
| OCR_SMTP_SERVER | SMTP server | - | No |
| OCR_WEBHOOK_URL | Webhook URL | - | No |
*Required for database features
Create ocr_config.json in the project root:
{
"default_language": "heb+eng",
"default_mode": "cli",
"max_concurrent_jobs": 4,
"max_file_size": 104857600,
"timeout_per_file": 300,
"archive_originals": true,
"create_zip_archives": true,
"enable_database": false,
"database_url": "postgresql://user:pass@localhost/ocr_db",
"enable_notifications": false,
"notification_email": "",
"smtp_server": "",
"smtp_port": 587,
"webhook_url": "",
"enable_api": true,
"api_host": "0.0.0.0",
"api_port": 8000,
"log_level": "INFO",
"log_to_file": true,
"log_directory": "logs"
}
- Environment variables (highest priority)
- Configuration file (ocr_config.json)
- Default values (lowest priority)
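The precedence order above can be sketched as a three-step merge: start from defaults, overlay the JSON file, then overlay `OCR_*` environment variables. This is an illustration only and does not reproduce `config.py`; `load_config` and `DEFAULTS` are hypothetical names, and only a handful of the documented settings are included.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict, Mapping, Optional

# Subset of the documented settings, used here as the lowest-priority layer.
DEFAULTS: Dict[str, Any] = {
    "default_language": "heb+eng",
    "default_mode": "cli",
    "max_concurrent_jobs": 4,
    "api_port": 8000,
    "log_level": "INFO",
}


def load_config(config_file: str = "ocr_config.json",
                environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Merge defaults, then the JSON file, then OCR_* environment variables."""
    env = os.environ if environ is None else environ
    config = dict(DEFAULTS)

    path = Path(config_file)
    if path.is_file():
        config.update(json.loads(path.read_text(encoding="utf-8")))

    for key, default in DEFAULTS.items():
        raw = env.get(f"OCR_{key.upper()}")
        if raw is None:
            continue
        # Coerce the string from the environment to the type of the default.
        config[key] = int(raw) if isinstance(default, int) else raw
    return config
```

With `"api_port": 9000` in the file and `OCR_API_PORT=8080` in the environment, the merged value is `8080`, because the environment layer is applied last.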
# Process a single PDF (CLI mode - default)
python cli/ocr_combined.py document.pdf
# Process with force mode (complete OCR)
python cli/ocr_combined.py --mode force document.pdf
# Process with visual mode (bounding boxes)
python cli/ocr_combined.py --mode visual document.pdf
# Process all PDFs in a directory recursively
python cli/ocr_combined.py --mode force documents/
# Process only top-level PDFs (non-recursive)
python cli/ocr_combined.py --mode force --no-recursive documents/
# Process with archiving
python cli/ocr_combined.py --mode force --archive-dir ./backup documents/
# English only
python cli/ocr_combined.py --lang eng document.pdf
# Hebrew and English (default)
python cli/ocr_combined.py --lang heb+eng document.pdf
# Multiple languages
python cli/ocr_combined.py --lang eng+fra+deu document.pdf
| Option | Description | Default |
|---|---|---|
| input_path | PDF file or directory | Required |
| --mode | Processing mode | cli |
| --lang | Language(s) for OCR | heb+eng |
| --archive-dir | Directory to backup originals | None |
| --no-recursive | Disable recursive search | False |
| --log-file | Main log file path | ocr_combined.log |
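For readers who want to see how these options fit together, the table can be expressed as an `argparse` parser. This is a sketch that mirrors the documented options and defaults; it is not the parser from `ocr_combined.py`, and the help strings are illustrative.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented CLI options."""
    parser = argparse.ArgumentParser(description="OCR one PDF or a directory of PDFs")
    parser.add_argument("input_path", help="PDF file or directory")
    parser.add_argument("--mode", choices=["cli", "force", "visual"], default="cli")
    parser.add_argument("--lang", default="heb+eng", help="Tesseract language(s), e.g. eng+fra")
    parser.add_argument("--archive-dir", default=None, help="directory to back up originals")
    parser.add_argument("--no-recursive", action="store_true", help="disable recursive search")
    parser.add_argument("--log-file", default="ocr_combined.log", help="main log file path")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)
```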
Launch the GUI application:
python cli/pdf_ocr_gui.py
The GUI provides:
- File/directory selection
- Mode and language dropdowns
- Progress tracking
- Log viewing
- Archive configuration
http://localhost:8000/api/v1
All API endpoints require Bearer token authentication:
curl -H "Authorization: Bearer your-api-key" http://localhost:8000/health
Health Check:
GET /health
Create Job:
POST /jobs
Content-Type: application/json
{
"input_path": "/path/to/document.pdf",
"mode": "cli",
"language": "heb+eng",
"priority": "normal",
"recursive": true,
"webhook_url": "https://your-webhook.com/callback"
}
Get Job Status:
GET /jobs/{job_id}
List Jobs:
GET /jobs?limit=50&offset=0
Cancel Job:
DELETE /jobs/{job_id}
Upload File:
POST /upload
Content-Type: multipart/form-data
file: document.pdf
mode: cli
language: heb+eng
priority: normal
| Field | Type | Required | Description |
|---|---|---|---|
| input_path | string | Yes | Path to PDF file or directory |
| mode | string | No | Processing mode (cli, force, visual) |
| language | string | No | Language for OCR |
| priority | string | No | Job priority (low, normal, high, urgent) |
| recursive | boolean | No | Process directories recursively |
| archive_originals | boolean | No | Archive original files |
| webhook_url | string | No | Webhook URL for notifications |
| metadata | object | No | Additional metadata |
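A client can assemble the job-creation request from these fields with the standard library alone. The sketch below builds the request without sending it. It assumes the `/jobs` endpoint lives under the base URL `http://localhost:8000/api/v1` given in this guide; `build_create_job_request` is a hypothetical helper name.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8000/api/v1"  # assumed prefix for the /jobs endpoint


def build_create_job_request(api_key: str, input_path: str, mode: str = "cli",
                             language: str = "heb+eng", priority: str = "normal",
                             recursive: bool = True,
                             webhook_url: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the POST /jobs request."""
    payload: Dict[str, Any] = {
        "input_path": input_path,
        "mode": mode,
        "language": language,
        "priority": priority,
        "recursive": recursive,
    }
    if webhook_url:
        payload["webhook_url"] = webhook_url  # optional field, omitted when unset
    return urllib.request.Request(
        f"{BASE_URL}/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it to a running server.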
| Field | Type | Description |
|---|---|---|
| job_id | string | Unique job identifier |
| status | string | Job status (pending, running, completed, failed, cancelled) |
| progress | float | Progress percentage (0-100) |
| input_path | string | Original input path |
| mode | string | Processing mode |
| language | string | OCR language |
| created_at | datetime | Job creation timestamp |
| started_at | datetime | Job start timestamp |
| completed_at | datetime | Job completion timestamp |
| total_files | int | Total files to process |
| processed_files | int | Successfully processed files |
| failed_files | int | Failed files |
| output_path | string | Output directory path |
| error_message | string | Error message if failed |
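Since `status` moves from pending through running to one of three terminal values, a client typically polls until it sees completed, failed, or cancelled. The following is a hedged sketch of such a loop; `wait_for_job` is a hypothetical helper, and `fetch_status` stands in for whatever function performs `GET /jobs/{job_id}` and returns the decoded JSON.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status: Callable[[], Dict[str, Any]],
                 interval: float = 2.0, timeout: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll fetch_status() until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job.get('job_id')} still {job['status']}")
        sleep(interval)
```

If a `webhook_url` was supplied when the job was created, polling may be unnecessary.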
{
"status": "healthy",
"timestamp": "2024-01-15T10:30:00Z",
"services": {
"database": "connected",
"storage": "writable"
},
"database": "healthy",
"storage": "healthy"
}
{
"status": "running",
"version": "2.0.0",
"uptime_seconds": 3600,
"active_jobs": 2,
"total_jobs": 150,
"database_enabled": true,
"notifications_enabled": true
}
{
"detail": "Error message description"
}
Common HTTP Status Codes:
| Code | Description |
|---|---|
| 200 | Success |
| 201 | Created |
| 400 | Bad Request |
| 401 | Unauthorized |
| 404 | Not Found |
| 500 | Internal Server Error |
# Clone repository
git clone <repository-url>
cd ocr-processor
# Create virtual environment
python -m venv dev-env
source dev-env/bin/activate
# Install development dependencies
pip install -r requirements.txt
pip install pytest pytest-asyncio pytest-cov
# Install pre-commit hooks
pre-commit install
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test
pytest tests/test_api_server.py -v
- Follow PEP 8 guidelines
- Use type hints for all functions
- Write docstrings for public functions
- Run linting: flake8 src/
- Create feature branch
- Implement changes
- Add tests
- Update documentation
- Submit pull request
# Enable debug logging
export OCR_LOG_LEVEL=DEBUG
# Run with Python debugger
python -m pdb src/api_server.py
# Create environment file
cp .env.example .env
# Edit .env with production values
# Start services
docker-compose up -d
# Scale workers
docker-compose up -d --scale ocr-worker=3
services:
  ocr-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      # Credentials must match POSTGRES_USER / POSTGRES_PASSWORD below
      - OCR_DATABASE_URL=postgresql://ocr_user:ocr_password@postgres/ocr_db
    volumes:
      - ./data:/app/data
      - ./logs:/app/logs
  ocr-worker:
    build: .
    scale: 3
    environment:
      - OCR_DATABASE_URL=postgresql://ocr_user:ocr_password@postgres/ocr_db
  postgres:
    image: postgres:13
    environment:
      - POSTGRES_USER=ocr_user
      - POSTGRES_PASSWORD=ocr_password
      - POSTGRES_DB=ocr_db
    volumes:
      - postgres_data:/var/lib/postgresql/data
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./docker/nginx.conf:/etc/nginx/nginx.conf
      - ./docker/ssl:/etc/nginx/ssl
# Named volume referenced by the postgres service
volumes:
  postgres_data:
- Python 3.11+
- PostgreSQL 13+
- Redis (optional, for caching)
- Nginx (reverse proxy)
# Create user and directories
sudo useradd -r -s /sbin/nologin ocr
sudo mkdir -p /var/log/ocr /var/ocr/{input,output,archive}
sudo chown -R ocr:ocr /var/ocr /var/log/ocr
# Copy application files
sudo mkdir -p /opt/ocr-processor
sudo cp -r . /opt/ocr-processor/
cd /opt/ocr-processor
# Create virtual environment
python -m venv venv
./venv/bin/pip install -r requirements.txt
# Create systemd service
sudo cp deploy/ocr-api.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable ocr-api
sudo systemctl start ocr-api
For production HTTPS:
server {
    listen 443 ssl http2;
    server_name ocr.yourdomain.com;
    ssl_certificate /etc/letsencrypt/live/ocr.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ocr.yourdomain.com/privkey.pem;
    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
# API health
curl http://localhost:8000/health
# Database health
docker-compose exec postgres pg_isready -U ocr_user -d ocr_db
# Service status
docker-compose ps
# Real-time logs
docker-compose logs -f ocr-api
# Search for errors
docker-compose logs ocr-api | grep ERROR
# View log files
tail -f /var/log/ocr/ocr_processor.log
- Verify API health
- Check disk space
- Review error logs
- Analyze performance metrics
- Review job success rates
- Clean up old logs
- Verify backup integrity
- Capacity planning analysis
- Database maintenance
- Security audit
- Performance optimization
# Create backup
docker-compose exec postgres pg_dump -U ocr_user ocr_db > backup.sql
# Restore from backup
docker-compose exec -T postgres psql -U ocr_user -d ocr_db < backup.sql
# Backup output files
tar -czf ocr_files_backup.tar.gz /var/ocr/output/
# Backup with rsync
rsync -avz /var/ocr/output/ backup:/path/to/backup/
| Load Level | Workers | CPU (cores) | RAM | Storage |
|---|---|---|---|---|
| Light (<10 files/day) | 1 | 2 | 4GB | 50GB |
| Medium (10-100 files/day) | 2-3 | 4 | 8GB | 200GB |
| Heavy (100+ files/day) | 4+ | 8 | 16GB | 500GB+ |
# Verify installation
tesseract --version
# Ubuntu/Debian
sudo apt-get install tesseract-ocr tesseract-ocr-heb
# macOS
brew install tesseract tesseract-lang
# Check file permissions
ls -la input.pdf
# Fix permissions
chmod 644 input.pdf
# Check directory access
ls -la /path/to/documents/
# Reduce concurrent jobs
export OCR_MAX_CONCURRENT_JOBS=2
# Monitor memory usage
docker stats
# Check database logs
docker-compose logs postgres
# Test connection
docker-compose exec ocr-api python -c "from database_manager import get_database_manager; print('DB OK')"
# Check Tesseract installation
tesseract --version
# Verify file permissions
ls -la /var/ocr/input/
# Check OCR logs
tail -f /var/log/ocr/ocr_errors.log
# CPU optimization
export OCR_MAX_CONCURRENT_JOBS=$(nproc)
# Memory optimization
export OCR_MAX_FILE_SIZE=2147483648
# Parallel processing
export OCR_JOBS=0 # Use all available cores
- Check logs: /var/log/ocr/ocr_errors.log
- Health check: curl http://localhost:8000/health
- Database status: docker-compose exec postgres pg_isready
- Service status: docker-compose ps
# Generate secure API key
python -c "import secrets; print(secrets.token_urlsafe(32))"
export OCR_API_KEY="your-generated-api-key"
The system validates all inputs:
- Path traversal prevention
- File type verification
- Suspicious pattern detection
- Size limits enforcement
- Suspicious files are quarantined
- File permissions are validated
- MIME type detection with magic numbers
- SHA256 checksum calculation
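Two of the checks above, magic-number detection and SHA256 checksums, are easy to illustrate with the standard library. This is a simplified sketch rather than the toolkit's validator: the helper names are hypothetical, and the signature check only accepts files that begin with `%PDF-` at offset 0, which is stricter than some PDF readers.

```python
import hashlib
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # signature at the start of a well-formed PDF


def looks_like_pdf(path: Path) -> bool:
    """True if the file starts with the PDF signature."""
    with path.open("rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of the file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Checking content rather than the file extension means a renamed executable is not mistaken for a PDF.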
- Keep software updated: Regularly update Tesseract and dependencies
- Use HTTPS: Always use SSL/TLS in production
- Limit permissions: Use least-privilege user accounts
- Monitor logs: Review security logs regularly
- Backup data: Maintain regular backups
- Access control: Restrict API access to authorized users
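On the server side, restricting access means comparing the bearer token on each request with the key exported as `OCR_API_KEY`. A minimal sketch of that comparison follows. It is not taken from `api_server.py`; `is_authorized` is a hypothetical helper, and the constant-time comparison is a common precaution rather than something the source states.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_key: str) -> bool:
    """Check an `Authorization: Bearer <key>` header against the configured key."""
    if not authorization_header or not expected_key:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest runs in time independent of where the strings differ,
    # which avoids leaking the key through response timing.
    return hmac.compare_digest(token.strip().encode("utf-8"), expected_key.encode("utf-8"))
```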
add_header X-Frame-Options DENY;
add_header X-Content-Type-Options nosniff;
add_header X-XSS-Protection "1; mode=block";
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains";
- Check existing issues first
- Provide detailed reproduction steps
- Include logs and error messages
- Specify environment details
- Fork the repository
- Create feature branch
- Make changes with tests
- Update documentation
- Submit pull request
- Follow PEP 8
- Use type hints
- Write docstrings
- Add unit tests
- Update changelog
This OCR processing suite is part of the VirtualBox Technologies toolkit and is available for use in document processing workflows.
- Documentation: Check this guide and README.md
- API Docs: Visit http://localhost:8000/docs
- Issues: Create GitHub issue
- Email: Contact support team
- Primary Administrator: admin@yourcompany.com
- Development Team: dev-team@yourcompany.com
- On-call Support: +1-234-567-8900
- CLI Consolidation: Unified all CLI scripts into ocr_combined.py
- Removed Legacy Scripts: Eliminated duplicate and outdated CLI tools
- Simplified Architecture: Single source of truth for CLI processing
- Maintained Functionality: All processing modes preserved in unified script
- New REST API with FastAPI
- Database integration for job tracking
- Multi-channel notifications
- Enhanced security validation
- Structured logging with structlog
- Progress tracking and metrics
- Docker Compose deployment
- GUI application
- Initial CLI tools
- Basic OCR processing
- Multiple processing modes
- Multi-language support
Last updated: 2024-01-15
Documentation version: 2.0.1