OCR pipeline using local vision models #12

@monneyboi

Description

Summary

Implement OCR (optical character recognition) for scanned PDFs and images using the local vision model, enabling text extraction and indexing of non-searchable documents.

Motivation

Many documents journalists work with are scanned:

  • Leaked documents photographed or scanned
  • Court filings from older systems
  • FOIA responses as image PDFs
  • Historical documents

Currently these can be added to collections but their content isn't searchable. OCR unlocks their full value.

Proposed Approach

Detection

Detect when a PDF needs OCR:

  1. Extract text via lopdf
  2. If text is empty/minimal but pages exist → likely scanned
  3. Optionally: check if PDF contains only images

pub fn needs_ocr(pdf_bytes: &[u8]) -> Result<bool> {
    // extract_text / get_page_count are the existing lopdf-based helpers.
    let text = extract_text(pdf_bytes)?;
    let page_count = get_page_count(pdf_bytes)?;

    // Heuristic: fewer than 100 characters per page suggests a scanned PDF.
    Ok(text.trim().len() < page_count * 100)
}

Extraction Pipeline

  1. PDF → Images: Render each page as an image
  2. Image → Text: Send to vision model with OCR prompt
  3. Text → Index: Store extracted text, index in milli
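The three steps above can be sketched end to end. `render_pages`, `ocr_page`, and `index_text` are hypothetical stand-ins for the real pdfium-render call, the vision model call, and the milli indexer:

```rust
// Orchestrate PDF -> images -> text -> index. The stage functions are
// injected so this sketch stays independent of the concrete backends.
fn ocr_pipeline<R, O, I>(
    pdf_bytes: &[u8],
    render_pages: R,
    ocr_page: O,
    mut index_text: I,
) -> Result<String, String>
where
    R: Fn(&[u8]) -> Result<Vec<Vec<u8>>, String>, // PDF -> page images (encoded bytes)
    O: Fn(&[u8]) -> Result<String, String>,       // page image -> extracted text
    I: FnMut(&str) -> Result<(), String>,         // full text -> search index
{
    let pages = render_pages(pdf_bytes)?;
    let mut full_text = String::new();
    for (i, image) in pages.iter().enumerate() {
        let text = ocr_page(image)?;
        if i > 0 {
            full_text.push_str("\n\n"); // page boundary
        }
        full_text.push_str(&text);
    }
    index_text(&full_text)?;
    Ok(full_text)
}
```

Keeping the stages as separate functions also makes the failure-handling task easier: a per-page `ocr_page` error can be caught and the remaining pages still indexed.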

Page Rendering

Options for PDF to image conversion:

  • pdfium-render - Rust bindings to PDFium, the PDF engine used in Chromium; good rendering quality
  • pdf-render - Pure Rust, simpler but less mature
  • Shell out to pdftoppm - reliable, but adds an external dependency (poppler-utils)

Recommend pdfium-render for quality and self-containment.
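A minimal rendering sketch, assuming pdfium-render's high-level API; the binding setup and the target width are illustrative, not settled choices:

```rust
use pdfium_render::prelude::*;

// Render every page of a PDF to an image for the OCR step.
// Sketch only: assumes a pdfium library is available to bind against.
fn render_pages(pdf_bytes: &[u8]) -> Result<Vec<image::DynamicImage>, PdfiumError> {
    let pdfium = Pdfium::default();
    let document = pdfium.load_pdf_from_byte_slice(pdf_bytes, None)?;
    // Width is the quality/speed tradeoff flagged in the open questions.
    let config = PdfRenderConfig::new().set_target_width(1600);
    document
        .pages()
        .iter()
        .map(|page| page.render_with_config(&config).map(|bitmap| bitmap.as_image()))
        .collect()
}
```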

Vision Model Prompts

OCR prompt: "Extract all text from this document image. 
Preserve paragraph structure. Output only the extracted text."

For structured documents:

"Extract all text from this document. Identify headers, 
paragraphs, and any tables. Format tables as markdown."
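If both prompts are kept, selecting between them can be a trivial helper. The function name and the `structured` flag are assumptions; the flag could come from a user setting or a cheap layout probe:

```rust
// Pick the OCR prompt for a page. `structured` requests table-aware output.
fn ocr_prompt(structured: bool) -> &'static str {
    if structured {
        "Extract all text from this document. Identify headers, \
         paragraphs, and any tables. Format tables as markdown."
    } else {
        "Extract all text from this document image. \
         Preserve paragraph structure. Output only the extracted text."
    }
}
```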

Storage

OCR'd text is stored the same way as regular extracted text:

  • Text blob in iroh-blobs with text_hash
  • Metadata entry indicates OCR source: "extraction": "ocr"
  • Syncs to peers who can then search without re-running OCR
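One possible shape for the metadata entry; only the "extraction": "ocr" field comes from this proposal, the other field names are illustrative:

```json
{
  "text_hash": "<hash of the extracted text blob in iroh-blobs>",
  "extraction": "ocr",
  "ocr_model": "<local vision model identifier>",
  "pages": 10
}
```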

Tasks

  • Add needs_ocr() detection in ingestion pipeline
  • Integrate PDF page rendering (pdfium-render)
  • Implement OCR via vision model
  • Add progress events for OCR (can be slow)
  • Store OCR results with appropriate metadata
  • Handle OCR failures gracefully (still index what we can)
  • Add UI indication that document was OCR'd
  • Consider batch processing for multi-page documents

Performance Considerations

  • OCR is slow: a 10-page document might take 30-60 seconds
  • Should run in background, not block UI
  • Consider page-level progress updates
  • Cache aggressively - OCR results stored as blobs sync to peers
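Page-level progress could be modeled as a small event type; these names are assumptions, not existing event types in the codebase:

```rust
// Progress events emitted while OCR runs in the background.
#[derive(Debug, PartialEq)]
enum OcrProgress {
    Started { total_pages: usize },
    PageDone { page: usize, total_pages: usize },
    Finished,
}

impl OcrProgress {
    /// Percentage complete, suitable for a UI progress bar.
    fn percent(&self) -> u8 {
        match self {
            OcrProgress::Started { .. } => 0,
            OcrProgress::PageDone { page, total_pages } => {
                ((*page * 100) / (*total_pages).max(1)) as u8
            }
            OcrProgress::Finished => 100,
        }
    }
}
```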

Open Questions

  1. Should OCR run automatically on import, or be user-triggered?
  2. How to handle mixed PDFs (some pages scanned, some not)?
  3. Quality vs speed tradeoff - render at what DPI?
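For question 2, one answer is to apply the `needs_ocr()` heuristic per page and OCR only the pages that fall below the threshold. `page_char_counts` is a hypothetical helper returning extracted-text length per page:

```rust
// Return indices of pages whose extracted text is short enough that they
// are likely scanned, using the same <100 chars/page heuristic as needs_ocr().
fn pages_needing_ocr(page_char_counts: &[usize]) -> Vec<usize> {
    page_char_counts
        .iter()
        .enumerate()
        .filter(|&(_, &chars)| chars < 100)
        .map(|(i, _)| i)
        .collect()
}
```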
