Extraction pipeline redesign: parallel extract → LLM validates

## Current Architecture (wrong)
```
LLM (14 min) → done
OR GLiNER → done  
OR spaCy → done
OR Regex → done
```
Only one method runs per document. The LLM path processes the entire document text (78K chars for a 32-page contract), so it is slow and wasteful, and nothing cross-validates the output of whichever method ran.

## Proposed Architecture
```
Phase 1: Extract (parallel, ~30 sec)
├── GLiNER (zero-shot NER, chunked)     → candidates with confidence
├── spaCy (NLP pipeline, chunked)       → candidates with confidence  
└── Regex (pattern matching, instant)   → dates, money, phones, emails

Phase 2: Merge + Dedup (~instant)
├── Fuzzy match duplicates (Clearview AI ≈ Clearvicn Al)
├── Boost confidence for entities found by multiple methods
├── Keep best spelling/normalization per entity
└── Remove obvious junk (< 3 chars, > 80 chars, common words)

Phase 3: LLM Validation (~30 sec)
├── Input: cleaned entity list (~2K chars, not 78K)
├── "Remove junk. Fix OCR errors. What's missing? Score confidence."
├── Identify relationships between validated entities
└── Output: final entity list with relationships
```
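
A minimal sketch of how Phase 1 might fan out, assuming each extractor is a plain function that takes the document text and returns a list of candidates. The names `Candidate`, `extract_all`, `gliner_extract`, and `spacy_extract` are hypothetical, the two NER functions are placeholders for the extractors that already exist, and the regex patterns are illustrative only.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    type: str
    confidence: float
    source: str  # which extractor produced this candidate


# Illustrative patterns only; real ones would need tuning per corpus.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "MONEY": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def regex_extract(text):
    """Pattern matches get confidence 1.0 (an assumption, not from the issue)."""
    return [
        Candidate(m.group(), label, 1.0, "regex")
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def gliner_extract(text):
    """Placeholder: the existing chunked GLiNER extractor would be called here."""
    return []


def spacy_extract(text):
    """Placeholder: the existing chunked spaCy extractor would be called here."""
    return []


def extract_all(text):
    """Run all three extractors concurrently and pool their candidates."""
    extractors = [gliner_extract, spacy_extract, regex_extract]
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = [pool.submit(fn, text) for fn in extractors]
        return [cand for future in futures for cand in future.result()]
```

Threads are used here only for simplicity. If the model-backed extractors turn out to hold the GIL during inference, a process pool may be the better fit.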

## Why this is better
- **40x less LLM input** — 2K entity list vs 78K document text
- **Cross-validation** — entities found by GLiNER AND spaCy get higher confidence
- **LLM does judgment, not scanning** — what it's actually good at
- **Parallel phase 1** — GLiNER + spaCy + Regex run simultaneously
- **Total time: ~60 sec** instead of 14 min
- **Better quality** — multiple extraction methods catch different things

## LLM validation prompt (compact)
```
Review these entities extracted from a government document.
Remove junk (boilerplate, OCR artifacts, generic words).
Fix OCR errors in names. Add confidence scores.
Identify relationships between entities.

Entities: [list of ~50 candidates with types]

Return cleaned JSON with entities and relationships.
```
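
One way the prompt could be assembled and its reply parsed, as a sketch. The candidate dict shape (`name`, `type`) and the top-level `entities` / `relationships` keys in the reply are assumptions; the prompt above does not pin down a schema. `build_prompt` and `parse_response` are hypothetical names.

```python
import json

PROMPT_TEMPLATE = """Review these entities extracted from a government document.
Remove junk (boilerplate, OCR artifacts, generic words).
Fix OCR errors in names. Add confidence scores.
Identify relationships between entities.

Entities: {entities}

Return cleaned JSON with entities and relationships."""


def build_prompt(candidates):
    """Send only name and type so the prompt stays compact."""
    compact = [{"name": c["name"], "type": c["type"]} for c in candidates]
    return PROMPT_TEMPLATE.format(entities=json.dumps(compact, ensure_ascii=False))


def parse_response(raw):
    """Assumes the model replies with one JSON object (hypothetical schema)."""
    data = json.loads(raw)
    return data.get("entities", []), data.get("relationships", [])
```

In practice the reply may need defensive handling (code fences around the JSON, trailing text), which this sketch leaves out.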

## Implementation notes
- Phase 1 extractors already exist — just need to run all three, not pick one
- Merge logic: group by fuzzy-matched name, keep the highest-confidence entry, and prefer GLiNER's type when the extractors disagree
- LLM validation is optional: if no LLM is available, return the Phase 2 output as the final result
- Regex always runs (catches structured patterns NER models miss)
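
The merge rules in these notes could look roughly like the sketch below. It uses `difflib.SequenceMatcher` from the standard library for the fuzzy match, which is a stand-in and not necessarily what the project would use. The threshold, the boost size, the stopword list, the candidate dict shape, and the names `merge` and `finalize` are all assumptions.

```python
from difflib import SequenceMatcher

STOPWORDS = {"the", "and", "section", "agreement"}  # illustrative only
SIMILARITY_THRESHOLD = 0.7  # assumed; needs tuning on real data
BOOST = 0.1                 # assumed bonus per additional agreeing extractor


def is_junk(name):
    """Phase 2 junk rule: under 3 chars, over 80 chars, or a common word."""
    n = name.strip()
    return len(n) < 3 or len(n) > 80 or n.lower() in STOPWORDS


def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= SIMILARITY_THRESHOLD


def merge(candidates):
    """Group fuzzy-matched names, then collapse each group to one entity."""
    groups = []
    for cand in candidates:
        if is_junk(cand["name"]):
            continue
        for group in groups:
            # Greedy: compare against the first member of each group only.
            if similar(cand["name"], group[0]["name"]):
                group.append(cand)
                break
        else:
            groups.append([cand])

    merged = []
    for group in groups:
        best = max(group, key=lambda c: c["confidence"])
        sources = {c["source"] for c in group}
        gliner_types = [c["type"] for c in group if c["source"] == "gliner"]
        merged.append({
            "name": best["name"],  # spelling of the most confident hit
            "type": gliner_types[0] if gliner_types else best["type"],
            "confidence": min(1.0, best["confidence"] + BOOST * (len(sources) - 1)),
            "sources": sorted(sources),
        })
    return merged


def finalize(candidates, llm_validate=None):
    """Return the Phase 2 output as-is when no LLM validator is supplied."""
    merged = merge(candidates)
    return llm_validate(merged) if llm_validate else merged
```

With this threshold the OCR pair from the architecture sketch, `Clearview AI` and `Clearvicn Al`, lands in one group. The greedy first-member comparison is order-dependent, so a proper clustering step may be worth considering if groups drift.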
