Every organization faces the same frustrating reality: stacks of scanned documents piling up, valuable information trapped in images, and teams spending hours manually retyping text. Sound familiar?
You're drowning in PDFs that look good but can't be searched. Your team wastes 30+ minutes per document rekeying content. Critical data slips through the cracks because finding information in scanned files is like finding a needle in a haystack.
What if you could unlock that trapped text in seconds, not hours?
We're not just another OCR tool. We've built a complete document processing ecosystem that transforms how organizations handle scanned documents.
| What You Get | Why It Matters |
|---|---|
| One-Click Processing | Drop a file, get searchable text. No expertise required. |
| Multi-Language Magic | Hebrew, English, French, German, Spanish: handle mixed-language documents effortlessly |
| Three Processing Modes | Choose fast (cli), thorough (force), or visual analysis, whatever your workflow needs |
| Enterprise-Ready | REST API, job queuing, progress tracking, audit logs: built for production environments |
| Batch Processing | Process hundreds of documents at once with recursive directory scanning |
- 90% reduction in manual data entry time
- 10x faster document digitization
- 50+ languages supported out of the box
- Zero expertise required to get started
"We processed 10,000 legacy documents in a single weekend. What used to take months now takes days." - Document Management Team, Enterprise Client
# Clone and launch
git clone <repository-url>
cd ocr-processor
# Create data directories and place your PDFs in data/input/
mkdir -p data/input data/output data/archive
# Launch services
docker-compose up -d
# Process your first document (replace filename.pdf with your actual file)
curl -X POST http://localhost:8000/jobs \
-H "Content-Type: application/json" \
-d '{"input_path": "/app/data/input/filename.pdf", "mode": "cli"}'
# Install and run
pip install -r requirements.txt
python cli/ocr_combined.py document.pdf --mode force
# Launch the GUI
python cli/pdf_ocr_gui.py
That's it. No complex configuration. No expert knowledge needed.
Convert historical documents, medical records, and legal files into searchable, accessible digital assets. Preserve the past, make it searchable for the future.
Automate invoice processing, contract analysis, and report extraction. Integrate OCR into your existing document workflows without disruption.
Extract text from academic papers, technical manuals, and historical archives. Build datasets from previously inaccessible sources.
Process patient records, case files, and compliance documents with industry-standard PDF/A output, well suited to long-term archival requirements.
Handle Hebrew/English documents seamlessly. Support for 50+ languages means your global documents are never left behind.
Build OCR directly into your applications:
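import requests
# Submit a job; the field values are examples to adapt to your setup
response = requests.post(
    "http://localhost:8000/jobs",
    json={
        "input_path": "/documents/invoice.pdf",
        "mode": "force",
        "language": "heb+eng",
        "webhook_url": "https://your-system.com/callback",
    },
    timeout=30,
)
response.raise_for_status()  # stop here on an HTTP error instead of parsing an error body
job_id = response.json()["job_id"]
API endpoints: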
- POST /jobs: Create processing job
- GET /jobs/{job_id}: Check status
- DELETE /jobs/{job_id}: Cancel job
- GET /health: System health check
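Once a job exists, its status can be polled through `GET /jobs/{job_id}`. The following is a minimal sketch rather than the project's own client: the `status` field, the terminal state names, and the helper names `fetch_job` and `wait_for_job` are assumptions that should be checked against the interactive API docs.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port from the quick-start example

# Assumption: these state names are hypothetical; confirm them in the API docs.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def fetch_job(job_id, base_url=BASE_URL):
    """Call GET /jobs/{job_id} and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=fetch_job, interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll until the job reports a terminal status or max_polls is exhausted."""
    for _ in range(max_polls):
        job = fetch(job_id)
        # Assumption: the response body carries a "status" field.
        if job.get("status") in TERMINAL_STATES:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

If a `webhook_url` is supplied when the job is created, polling like this may be unnecessary; the loop is mainly useful for scripts that have no public callback address.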
# Process everything in a directory
python cli/ocr_combined.py --mode force ./invoices/
# English only, fast mode
python cli/ocr_combined.py --lang eng --mode cli document.pdf
# Visual mode with bounding boxes
python cli/ocr_combined.py --mode visual --archive-dir ./backup documents/
Point-and-click interface for non-technical users. Drag, drop, process. That's it.
| Mode | Speed | What It Does | Best For |
|---|---|---|---|
| CLI | Fastest | Skips existing text, preserves layout | Quick enhancement, preserving existing text |
| Force | Thorough | Forces OCR on every page | Complete text replacement |
| Visual | Moderate | Creates bounding box overlays | Layout analysis, forensic work |
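For scripted runs, the mode can be validated before the CLI is invoked. This sketch assembles an argument list using only the flags that appear in this README (`--mode`, `--lang`, `--archive-dir`); the helper name `build_ocr_command` is hypothetical, and the script may accept options not covered here.

```python
import sys

VALID_MODES = ("cli", "force", "visual")  # the three modes from the table above


def build_ocr_command(target, mode="cli", lang=None, archive_dir=None):
    """Return an argv list for cli/ocr_combined.py, rejecting unknown modes early."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    cmd = [sys.executable, "cli/ocr_combined.py", "--mode", mode]
    if lang:
        cmd += ["--lang", lang]
    if archive_dir:
        cmd += ["--archive-dir", str(archive_dir)]
    cmd.append(str(target))  # a single PDF or a directory to scan
    return cmd
```

The result can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building a list instead of a shell string avoids quoting problems with file names that contain spaces.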
Each processed document produces:
output_folder/
├── ocr_output.pdf   # Searchable PDF/A (archival quality)
├── ocr_output.txt   # Plain text extraction
├── ocr_log.txt      # Processing details
└── archive.zip      # Compressed output (force mode)
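A downstream script may want to confirm that a run produced the files listed above before moving on. This is a small sketch under the assumption that the file names are exactly as shown and that `archive.zip` appears only in force mode; `missing_outputs` is a hypothetical helper, not part of the tool.

```python
from pathlib import Path

# File names taken from the listing above.
BASE_OUTPUTS = ("ocr_output.pdf", "ocr_output.txt", "ocr_log.txt")
FORCE_ONLY_OUTPUTS = ("archive.zip",)  # assumption: produced only in force mode


def missing_outputs(output_folder, mode="cli"):
    """Return the expected output file names that are absent from output_folder."""
    folder = Path(output_folder)
    expected = BASE_OUTPUTS + (FORCE_ONLY_OUTPUTS if mode == "force" else ())
    return [name for name in expected if not (folder / name).is_file()]
```

An empty list means every expected file is present; anything else is a list of names worth looking up in `ocr_log.txt`.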
Built for scale, designed for reliability:
+-----------------------------------------------------------+
|                   OCR Processor Engine                    |
+-----------------------------------------------------------+
|  +-------------+    +-------------+    +-------------+    |
|  |  REST API   |    |  Progress   |    |  Database   |    |
|  |  (FastAPI)  |    |   Tracker   |    |    (SQL)    |    |
|  +-------------+    +-------------+    +-------------+    |
|         |                  |                  |           |
|         +------------------+------------------+           |
|                            v                              |
|                 +---------------------+                   |
|                 |    OCRmyPDF Core    |                   |
|                 |  (Tesseract-based)  |                   |
|                 +---------------------+                   |
|                            |                              |
|            +---------------+---------------+              |
|            v               v               v              |
|      +----------+    +----------+    +----------+         |
|      |  Visual  |    |   HOCR   |    |   Text   |         |
|      |  Output  |    |  Output  |    |  Output  |         |
|      +----------+    +----------+    +----------+         |
+-----------------------------------------------------------+
Enterprise features included:
- Job queuing with priority levels
- Real-time progress tracking
- Multi-channel notifications (email, webhook, Slack)
- Structured JSON logging
- Security validation & quarantine
- Audit trail for compliance
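To illustrate what job queuing with priority levels can look like, here is a minimal in-memory sketch built on `heapq`. It is not the project's implementation: the `JobQueue` class, the numeric priority scale, and the default priority are all assumptions made for the example.

```python
import heapq
import itertools


class JobQueue:
    """Priority queue sketch: lower number runs first, FIFO within one level."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves insertion order

    def push(self, job_id, priority=5):
        """Queue a job; priority 1 is treated as more urgent than priority 5."""
        heapq.heappush(self._heap, (priority, next(self._counter), job_id))

    def pop(self):
        """Remove and return the most urgent job id."""
        if not self._heap:
            raise IndexError("pop from empty JobQueue")
        _, _, job_id = heapq.heappop(self._heap)
        return job_id

    def __len__(self):
        return len(self._heap)
```

The counter matters: without it, two jobs at the same priority would be ordered by comparing their ids, which would break first-in, first-out behavior.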
We speak your language, literally:
| Language | Code | Notes |
|---|---|---|
| Hebrew + English | heb+eng | Default, bi-directional support |
| English | eng | Standard US/UK |
| French | fra | Plus combinations |
| German | deu | Plus combinations |
| Spanish | spa | Plus combinations |
| ...and 45+ more | Any Tesseract code | Custom combinations |
Mix and match: heb+eng+fra+deu for multilingual documents
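Combined language strings such as `heb+eng+fra+deu` are Tesseract codes joined with `+`. The helper below is a hedged sketch for building one from a list; `join_languages` is a hypothetical name, and the pattern is a loose sanity check that may reject valid Tesseract names (script models, for instance), so treat it as illustrative.

```python
import re

# Loose check: three lowercase letters, optionally followed by "_variant"
# (for example "chi_sim"). This is an assumption, not Tesseract's own rule.
_CODE_RE = re.compile(r"^[a-z]{3}(_[a-z]+)?$")


def join_languages(codes):
    """Join language codes with '+', dropping duplicates in first-seen order."""
    seen = []
    for code in codes:
        code = code.strip()
        if not _CODE_RE.match(code):
            raise ValueError(f"unexpected language code: {code!r}")
        if code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)
```

The resulting string can be used wherever this README shows a language value, such as the `language` field of a job request or the `--lang` flag.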
| Resource | Description |
|---|---|
| Complete Documentation | In-depth technical reference |
| Deployment Guide | Production deployment instructions |
| Admin Guide | System administration & monitoring |
| API Documentation | Interactive API docs (when running) |
Want to make OCR Processor even better?
- Star the repo if you find it useful
- Report bugs via GitHub issues
- Submit feature requests
- Submit pull requests
Need help?
- Check the Complete Documentation
- Search existing issues
- Create a new issue with detailed reproduction steps
Part of the VirtualBox Technologies toolkit. Available for document processing workflows.
Stop typing. Start processing.
# One command. One minute. Unlimited possibilities.
docker-compose up -d
Your documents are waiting to be unlocked.