OCR pipeline using local vision models #12

@monneyboi

Description

Summary

Implement OCR (optical character recognition) for scanned PDFs and images using the local vision model, enabling text extraction and indexing of non-searchable documents.

Motivation

Many documents journalists work with are scanned:

  • Leaked documents photographed or scanned
  • Court filings from older systems
  • FOIA responses as image PDFs
  • Historical documents

Currently these can be added to collections but their content isn't searchable. OCR unlocks their full value.

Proposed Approach

Detection

Detect when a PDF needs OCR:

  1. Extract text via lopdf
  2. If text is empty/minimal but pages exist → likely scanned
  3. Optionally: check if PDF contains only images

pub fn needs_ocr(pdf_bytes: &[u8]) -> Result<bool> {
    // extract_text / get_page_count are the existing lopdf-based helpers.
    let text = extract_text(pdf_bytes)?;
    let page_count = get_page_count(pdf_bytes)?;

    // Heuristic: fewer than 100 characters per page suggests a scanned PDF.
    Ok(text.trim().len() < page_count * 100)
}

Extraction Pipeline

  1. PDF → Images: Render each page as an image
  2. Image → Text: Send to vision model with OCR prompt
  3. Text → Index: Store extracted text, index in milli
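The three steps above can be sketched end to end. `render_pages`, `ocr_page`, and `index_text` are hypothetical stand-ins for the real pdfium-render call, the vision model call, and the milli indexer:

```rust
// Orchestrate PDF -> images -> text -> index. The stage functions are
// injected so this sketch stays independent of the concrete backends.
fn ocr_pipeline<R, O, I>(
    pdf_bytes: &[u8],
    render_pages: R,
    ocr_page: O,
    mut index_text: I,
) -> Result<String, String>
where
    R: Fn(&[u8]) -> Result<Vec<Vec<u8>>, String>, // PDF -> page images (encoded bytes)
    O: Fn(&[u8]) -> Result<String, String>,       // page image -> extracted text
    I: FnMut(&str) -> Result<(), String>,         // full text -> search index
{
    let pages = render_pages(pdf_bytes)?;
    let mut full_text = String::new();
    for (i, image) in pages.iter().enumerate() {
        let text = ocr_page(image)?;
        if i > 0 {
            full_text.push_str("\n\n"); // page boundary
        }
        full_text.push_str(&text);
    }
    index_text(&full_text)?;
    Ok(full_text)
}
```

Keeping the stages as separate functions also makes the failure-handling task easier: a per-page `ocr_page` error can be caught and the remaining pages still indexed.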

Page Rendering

Options for PDF to image conversion:

  • pdfium-render - Rust bindings to PDFium, the PDF engine used in Chromium; good rendering quality
  • pdf-render - Pure Rust, simpler but less mature
  • Shell out to pdftoppm - reliable, but adds an external dependency (poppler-utils)

Recommend pdfium-render for quality and self-containment.
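A minimal rendering sketch, assuming pdfium-render's high-level API; the binding setup and the target width are illustrative, not settled choices:

```rust
use pdfium_render::prelude::*;

// Render every page of a PDF to an image for the OCR step.
// Sketch only: assumes a pdfium library is available to bind against.
fn render_pages(pdf_bytes: &[u8]) -> Result<Vec<image::DynamicImage>, PdfiumError> {
    let pdfium = Pdfium::default();
    let document = pdfium.load_pdf_from_byte_slice(pdf_bytes, None)?;
    // Width is the quality/speed tradeoff flagged in the open questions.
    let config = PdfRenderConfig::new().set_target_width(1600);
    document
        .pages()
        .iter()
        .map(|page| page.render_with_config(&config).map(|bitmap| bitmap.as_image()))
        .collect()
}
```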

Vision Model Prompts

OCR prompt: "Extract all text from this document image. 
Preserve paragraph structure. Output only the extracted text."

For structured documents:

"Extract all text from this document. Identify headers, 
paragraphs, and any tables. Format tables as markdown."
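If both prompts are kept, selecting between them can be a trivial helper. The function name and the `structured` flag are assumptions; the flag could come from a user setting or a cheap layout probe:

```rust
// Pick the OCR prompt for a page. `structured` requests table-aware output.
fn ocr_prompt(structured: bool) -> &'static str {
    if structured {
        "Extract all text from this document. Identify headers, \
         paragraphs, and any tables. Format tables as markdown."
    } else {
        "Extract all text from this document image. \
         Preserve paragraph structure. Output only the extracted text."
    }
}
```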

Storage

OCR'd text is stored the same way as regular extracted text:

  • Text blob in iroh-blobs with text_hash
  • Metadata entry indicates OCR source: "extraction": "ocr"
  • Syncs to peers who can then search without re-running OCR
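One possible shape for the metadata entry; only the "extraction": "ocr" field comes from this proposal, the other field names are illustrative:

```json
{
  "text_hash": "<hash of the extracted text blob in iroh-blobs>",
  "extraction": "ocr",
  "ocr_model": "<local vision model identifier>",
  "pages": 10
}
```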

Tasks

  • Add needs_ocr() detection in ingestion pipeline
  • Integrate PDF page rendering (pdfium-render)
  • Implement OCR via vision model
  • Add progress events for OCR (can be slow)
  • Store OCR results with appropriate metadata
  • Handle OCR failures gracefully (still index what we can)
  • Add UI indication that document was OCR'd
  • Consider batch processing for multi-page documents

Performance Considerations

  • OCR is slow: a 10-page document might take 30-60 seconds
  • Should run in background, not block UI
  • Consider page-level progress updates
  • Cache aggressively - OCR results stored as blobs sync to peers
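Page-level progress could be modeled as a small event type; these names are assumptions, not existing event types in the codebase:

```rust
// Progress events emitted while OCR runs in the background.
#[derive(Debug, PartialEq)]
enum OcrProgress {
    Started { total_pages: usize },
    PageDone { page: usize, total_pages: usize },
    Finished,
}

impl OcrProgress {
    /// Percentage complete, suitable for a UI progress bar.
    fn percent(&self) -> u8 {
        match self {
            OcrProgress::Started { .. } => 0,
            OcrProgress::PageDone { page, total_pages } => {
                ((*page * 100) / (*total_pages).max(1)) as u8
            }
            OcrProgress::Finished => 100,
        }
    }
}
```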

Open Questions

  1. Should OCR run automatically on import, or be user-triggered?
  2. How to handle mixed PDFs (some pages scanned, some not)?
  3. Quality vs speed tradeoff - render at what DPI?
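For question 2, one answer is to apply the `needs_ocr()` heuristic per page and OCR only the pages that fall below the threshold. `page_char_counts` is a hypothetical helper returning extracted-text length per page:

```rust
// Return indices of pages whose extracted text is short enough that they
// are likely scanned, using the same <100 chars/page heuristic as needs_ocr().
fn pages_needing_ocr(page_char_counts: &[usize]) -> Vec<usize> {
    page_char_counts
        .iter()
        .enumerate()
        .filter(|&(_, &chars)| chars < 100)
        .map(|(i, _)| i)
        .collect()
}
```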
