Every organization faces the same frustrating reality: stacks of scanned documents piling up, valuable information trapped in images, and teams spending hours manually retyping text. Sound familiar?
You're drowning in PDFs that look good but can't be searched. Your team wastes 30+ minutes per document rekeying content. Critical data slips through the cracks because finding information in scanned files is like finding a needle in a haystack.
What if you could unlock that trapped text in seconds—not hours?
We're not just another OCR tool. We've built a complete document processing ecosystem that transforms how organizations handle scanned documents.
| What You Get | Why It Matters |
|---|---|
| One-Click Processing | Drop a file, get searchable text. No expertise required. |
| Multi-Language Magic | Hebrew, English, French, German, Spanish—handle mixed-language documents effortlessly |
| Three Processing Modes | Choose fast (cli), thorough (force), or visual analysis—whatever your workflow needs |
| Enterprise-Ready | REST API, job queuing, progress tracking, audit logs—built for production environments |
| Batch Processing | Process hundreds of documents at once with recursive directory scanning |
- 90% reduction in manual data entry time
- 10x faster document digitization
- 50+ languages supported out of the box
- Zero expertise required to get started
"We processed 10,000 legacy documents in a single weekend. What used to take months now takes days." — Document Management Team, Enterprise Client
# Clone and launch
git clone <repository-url>
cd ocr-processor
# Create data directories and place your PDFs in data/input/
mkdir -p data/input data/output data/archive
# Launch services
docker-compose up -d
# Process your first document (replace filename.pdf with your actual file)
curl -X POST http://localhost:8000/jobs \
-H "Content-Type: application/json" \
-d '{"input_path": "/app/data/input/filename.pdf", "mode": "cli"}'

# Install and run
pip install -r requirements.txt
python cli/ocr_combined.py document.pdf --mode force

python cli/pdf_ocr_gui.py

That's it. No complex configuration. No expert knowledge needed.
Convert historical documents, medical records, and legal files into searchable, accessible digital assets. Preserve the past, make it searchable for the future.
Automate invoice processing, contract analysis, and report extraction. Integrate OCR into your existing document workflows without disruption.
Extract text from academic papers, technical manuals, and historical archives. Build datasets from previously inaccessible sources.
Process patient records, case files, and compliance documents with industry-standard PDF/A output—perfect for long-term archival requirements.
Handle Hebrew/English documents seamlessly. Support for 50+ languages means your global documents are never left behind.
Build OCR directly into your applications:
import requests
response = requests.post(
"http://localhost:8000/jobs",
json={
"input_path": "/documents/invoice.pdf",
"mode": "force",
"language": "heb+eng",
"webhook_url": "https://your-system.com/callback"
}
)
job_id = response.json()["job_id"]

API endpoints:
- POST /jobs → Create processing job
- GET /jobs/{job_id} → Check status
- DELETE /jobs/{job_id} → Cancel job
- GET /health → System health check
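Once a job is created, the status endpoint listed above can be polled until the job finishes. The following is a minimal sketch, not the project's client: the `status` field name and the terminal values (`completed`, `failed`, `cancelled`) are assumptions, since the response schema is not shown here. Check the interactive API docs of a running instance for the real field names.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the examples above

# Assumption: illustrative status strings; the real API may use different values.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def fetch_job(job_id):
    """Call GET /jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=fetch_job, interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll the job until it reports a terminal state or max_polls is exhausted."""
    for _ in range(max_polls):
        job = fetch(job_id)
        # Assumption: the response carries a "status" field.
        if str(job.get("status", "")).lower() in TERMINAL_STATES:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running server. If you pass `webhook_url` when creating the job, polling may not be needed at all.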
# Process everything in a directory
python cli/ocr_combined.py --mode force ./invoices/
# English only, fast mode
python cli/ocr_combined.py --lang eng --mode cli document.pdf
# Visual mode with bounding boxes
python cli/ocr_combined.py --mode visual --archive-dir ./backup documents/

Point-and-click interface for non-technical users. Drag, drop, process. That's it.
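The command-line examples above hand a whole directory to the CLI. If you would rather drive the REST API for a batch, one option is to walk the directory yourself and prepare one job payload per PDF. This is a sketch under stated assumptions: `build_job_payloads` is a hypothetical helper, and the payload keys simply mirror the `input_path`, `mode`, and `language` fields used in the API example earlier.

```python
from pathlib import Path


def build_job_payloads(root, mode="cli", language="heb+eng"):
    """Return one POST /jobs payload per PDF found under root, searching recursively."""
    root = Path(root)
    pdfs = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"  # match .pdf and .PDF
    )
    return [
        {"input_path": str(p), "mode": mode, "language": language}
        for p in pdfs
    ]
```

Each payload can then be sent with `requests.post("http://localhost:8000/jobs", json=payload)`. Note that the paths must be valid from the server's point of view. With the Docker setup, that likely means paths under `/app/data/input/` rather than host paths.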
| Mode | Speed | What It Does | Best For |
|---|---|---|---|
| CLI ⚡ | Fastest | Skips existing text, preserves layout | Quick enhancement, preserving existing text |
| Force 💪 | Thorough | Forces OCR on every page | Complete text replacement |
| Visual 👁️ | Moderate | Creates bounding box overlays | Layout analysis, forensic work |
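Since the engine is built on OCRmyPDF, the first two modes resemble two of its documented options: `--skip-text` (leave pages that already contain text alone) and `--force-ocr` (rasterize and OCR every page). The mapping below is a guess at how the modes could translate, not a description of this project's internals. `ocrmypdf_args` is a hypothetical helper, and the visual mode is left out because its implementation is not described here.

```python
def ocrmypdf_args(mode, src, dst, language="heb+eng"):
    """Build an ocrmypdf command line for the 'cli' or 'force' mode (assumed mapping)."""
    # Assumption: 'cli' behaves like --skip-text and 'force' like --force-ocr.
    flags = {
        "cli": ["--skip-text"],
        "force": ["--force-ocr"],
    }
    if mode not in flags:
        raise ValueError(f"no assumed OCRmyPDF mapping for mode {mode!r}")
    return [
        "ocrmypdf",
        *flags[mode],
        "--language", language,   # Tesseract codes joined with '+'
        "--output-type", "pdfa",  # archival PDF/A output
        src,
        dst,
    ]
```

The resulting list can be passed to `subprocess.run(args, check=True)` on a machine where OCRmyPDF and the required Tesseract language packs are installed.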
Each processed document produces:
output_folder/
├── ocr_output.pdf # ✅ Searchable PDF/A (archival quality)
├── ocr_output.txt # 📝 Plain text extraction
├── ocr_log.txt # 📊 Processing details
└── archive.zip # 📦 Compressed output (force mode)
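When processing many documents, it can help to confirm that each output folder contains the files listed above. A small sketch, assuming the file names are exactly as shown and treating `archive.zip` as optional because it is only listed for force mode (`missing_outputs` is a hypothetical helper):

```python
from pathlib import Path

# File names taken from the listing above.
REQUIRED_OUTPUTS = ("ocr_output.pdf", "ocr_output.txt", "ocr_log.txt")
OPTIONAL_OUTPUTS = ("archive.zip",)  # only listed for force mode


def missing_outputs(folder, require_archive=False):
    """Return the names of expected output files that are absent from folder."""
    folder = Path(folder)
    expected = REQUIRED_OUTPUTS + (OPTIONAL_OUTPUTS if require_archive else ())
    return [name for name in expected if not (folder / name).is_file()]
```

An empty list means the folder looks complete. Pass `require_archive=True` when checking force-mode runs.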
Built for scale, designed for reliability:
┌─────────────────────────────────────────────────────────┐
│ OCR Processor Engine │
├─────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ REST API │ │ Progress │ │ Database │ │
│ │ (FastAPI) │ │ Tracker │ │ (SQL) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └───────────────┼───────────────┘ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ OCRmyPDF Core │ │
│ │ (Tesseract-based) │ │
│ └─────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Visual │ │ HOCR │ │ Text │ │
│ │ Output │ │ Output │ │ Output │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
Enterprise features included:
- Job queuing with priority levels
- Real-time progress tracking
- Multi-channel notifications (email, webhook, Slack)
- Structured JSON logging
- Security validation & quarantine
- Audit trail for compliance
We speak your language—literally:
| Language | Code | Notes |
|---|---|---|
| Hebrew + English | heb+eng | Default, bi-directional support |
| English | eng | Standard US/UK |
| French | fra | Plus combinations |
| German | deu | Plus combinations |
| Spanish | spa | Plus combinations |
| ...and 45+ more | Any Tesseract code | Custom combinations |
Mix and match: heb+eng+fra+deu for multilingual documents
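Combined language strings are just Tesseract codes joined with `+`. A small helper can assemble them from a list while removing duplicates and rejecting obviously malformed entries. This is a sketch: `combine_languages` is a hypothetical name, and it does not check that the corresponding Tesseract language pack is actually installed.

```python
def combine_languages(*codes):
    """Join Tesseract language codes with '+', keeping first-seen order without duplicates."""
    combined = []
    for code in codes:
        code = code.strip()
        # Reject empty entries and entries that would corrupt the joined string.
        if not code or "+" in code or any(ch.isspace() for ch in code):
            raise ValueError(f"invalid language code: {code!r}")
        if code not in combined:
            combined.append(code)
    if not combined:
        raise ValueError("at least one language code is required")
    return "+".join(combined)


print(combine_languages("heb", "eng", "fra", "deu"))  # heb+eng+fra+deu
```

The result can be used as the `language` field in an API request or as the value of `--lang` on the command line.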
| Resource | Description |
|---|---|
| 📚 Complete Documentation | In-depth technical reference |
| 🚀 Deployment Guide | Production deployment instructions |
| 👨‍💼 Admin Guide | System administration & monitoring |
| 🔗 API Documentation | Interactive API docs (when running) |
Want to make OCR Processor even better?
- ⭐ Star the repo if you find it useful
- 🐛 Report bugs via GitHub issues
- 💡 Submit feature requests
- 🔧 Submit pull requests
Need help?
- Check the Complete Documentation
- Search existing issues
- Create a new issue with detailed reproduction steps
Part of the VirtualBox Technologies toolkit. Available for document processing workflows.
Stop typing. Start processing.
# One command. One minute. Unlimited possibilities.
docker-compose up -d

Your documents are waiting to be unlocked. 🔓