Summary
Implement OCR (optical character recognition) for scanned PDFs and images using the local vision model, enabling text extraction and indexing of non-searchable documents.
Motivation
Many documents journalists work with are scanned:
- Leaked documents photographed or scanned
- Court filings from older systems
- FOIA responses as image PDFs
- Historical documents
Currently these can be added to collections but their content isn't searchable. OCR unlocks their full value.
Proposed Approach
Detection
Detect when a PDF needs OCR:
- Extract text via lopdf
- If text is empty/minimal but pages exist → likely scanned
- Optionally: check if PDF contains only images
```rust
pub fn needs_ocr(pdf_bytes: &[u8]) -> Result<bool> {
    let text = extract_text(pdf_bytes)?;
    let page_count = get_page_count(pdf_bytes)?;
    // Heuristic: fewer than 100 chars per page suggests a scanned document
    Ok(text.len() < page_count * 100)
}
```
Extraction Pipeline
- PDF → Images: Render each page as an image
- Image → Text: Send to vision model with OCR prompt
- Text → Index: Store extracted text, index in milli
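The three stages above can be sketched end to end. This is a minimal sketch with the rendering and vision-model stages mocked out; `render_pages`, `ocr_page`, and `extract_via_ocr` are hypothetical names, not existing APIs in this codebase.

```rust
// Rendered page bitmap (mocked; a real implementation would hold image data).
struct PageImage(Vec<u8>);

// Stage 1 (mock): render each PDF page to an image. Real impl: pdfium-render.
fn render_pages(_pdf_bytes: &[u8]) -> Vec<PageImage> {
    vec![PageImage(vec![]), PageImage(vec![])] // fake a two-page document
}

// Stage 2 (mock): send one page image to the local vision model with the OCR prompt.
fn ocr_page(_page: &PageImage) -> String {
    "extracted text".to_string()
}

// Stage 3 feeds the joined text into the normal indexing path (milli).
fn extract_via_ocr(pdf_bytes: &[u8]) -> String {
    render_pages(pdf_bytes)
        .iter()
        .map(ocr_page)
        .collect::<Vec<_>>()
        .join("\n\n") // blank line between pages
}

fn main() {
    let text = extract_via_ocr(&[]);
    assert_eq!(text, "extracted text\n\nextracted text");
    println!("{text}");
}
```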
Page Rendering
Options for PDF-to-image conversion:
- pdfium-render - Chromium's PDF engine, good quality
- pdf-render - pure Rust, simpler but less mature
- Shell out to pdftoppm - reliable but an external dependency
Recommend pdfium-render for quality and self-containment.
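With pdfium-render, the rendering step could look roughly like this. This is a sketch assuming the pdfium-render 0.8-era API (`bind_to_system_library`, `load_pdf_from_byte_slice`, `PdfRenderConfig`); verify names against the version actually pinned in Cargo.toml.

```rust
use pdfium_render::prelude::*;

/// Render every page of a PDF to a bitmap suitable for OCR.
fn render_pages(pdf_bytes: &[u8]) -> Result<Vec<image::DynamicImage>, PdfiumError> {
    // Bind to the pdfium library shipped with the app (or found on the system).
    let pdfium = Pdfium::new(Pdfium::bind_to_system_library()?);
    let document = pdfium.load_pdf_from_byte_slice(pdf_bytes, None)?;
    // Target width is a stand-in value; the right DPI is an open question.
    let config = PdfRenderConfig::new().set_target_width(1700);
    document
        .pages()
        .iter()
        .map(|page| page.render_with_config(&config).map(|bitmap| bitmap.as_image()))
        .collect()
}
```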
Vision Model Prompts
OCR prompt: "Extract all text from this document image.
Preserve paragraph structure. Output only the extracted text."
For structured documents:
"Extract all text from this document. Identify headers,
paragraphs, and any tables. Format tables as markdown."
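Choosing between the two prompts could be a small helper; the function name and the `structured` flag are hypothetical, and the prompt strings are the ones above.

```rust
/// Pick the vision-model prompt for a page. Hypothetical helper:
/// how `structured` gets decided (user toggle? table detection?) is open.
fn ocr_prompt(structured: bool) -> &'static str {
    if structured {
        "Extract all text from this document. Identify headers, \
         paragraphs, and any tables. Format tables as markdown."
    } else {
        "Extract all text from this document image. \
         Preserve paragraph structure. Output only the extracted text."
    }
}

fn main() {
    assert!(ocr_prompt(true).contains("markdown"));
    assert!(ocr_prompt(false).contains("Preserve paragraph structure"));
}
```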
Storage
OCR'd text is stored the same way as regular extracted text:
- Text blob in iroh-blobs with text_hash
- Metadata entry indicates OCR source: "extraction": "ocr"
- Syncs to peers, who can then search without re-running OCR
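A metadata entry might look like the following. This is an illustrative shape only: every field except "extraction" (and the text_hash key named above) is a hypothetical placeholder, not a defined schema.

```json
{
  "text_hash": "<blob hash of extracted text>",
  "extraction": "ocr",
  "ocr_model": "<local vision model id>",
  "pages": 10
}
```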
Tasks
- needs_ocr() detection in ingestion pipeline
Performance Considerations
- OCR is slow: a 10-page document might take 30-60 seconds
- Should run in background, not block UI
- Consider page-level progress updates
- Cache aggressively: OCR results stored as blobs sync to peers, so each document only needs to be OCR'd once
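Running OCR off the UI thread with page-level progress can be sketched with a worker thread and a channel. The `Progress` enum and `ocr_in_background` are hypothetical names, and the per-page OCR call is mocked.

```rust
use std::sync::mpsc;
use std::thread;

// Progress messages the UI can consume without blocking on OCR.
enum Progress {
    Page { done: usize, total: usize },
    Finished(String),
}

// Run OCR on a worker thread; report each completed page over a channel.
fn ocr_in_background(pages: Vec<Vec<u8>>) -> mpsc::Receiver<Progress> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let total = pages.len();
        let mut text = String::new();
        for (i, _page) in pages.iter().enumerate() {
            text.push_str("page text\n"); // mock OCR of one page
            let _ = tx.send(Progress::Page { done: i + 1, total });
        }
        let _ = tx.send(Progress::Finished(text));
        // tx drops here, which ends the receiver's iteration.
    });
    rx
}

fn main() {
    let rx = ocr_in_background(vec![vec![], vec![]]);
    let mut pages_done = 0;
    let mut final_text = None;
    for msg in rx {
        match msg {
            Progress::Page { done, .. } => pages_done = done,
            Progress::Finished(t) => final_text = Some(t),
        }
    }
    assert_eq!(pages_done, 2);
    assert_eq!(final_text.unwrap(), "page text\npage text\n");
}
```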
Open Questions
- Should OCR run automatically on import, or be user-triggered?
- How to handle mixed PDFs (some pages scanned, some not)?
- Quality vs speed tradeoff - render at what DPI?
Dependencies