Current Architecture (wrong)
LLM (14 min) → done
OR GLiNER → done
OR spaCy → done
OR Regex → done
Only ONE method runs per document. When the LLM path is chosen, it scans the entire document text (78K chars for a 32-page contract): slow, wasteful, and with no cross-validation between methods.
Proposed Architecture
Phase 1: Extract (parallel, ~30 sec)
├── GLiNER (zero-shot NER, chunked) → candidates with confidence
├── spaCy (NLP pipeline, chunked) → candidates with confidence
└── Regex (pattern matching, instant) → dates, money, phones, emails
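A minimal sketch of the parallel Phase 1 fan-out. The regex extractor is real (stdlib only); `gliner_extract` and `spacy_extract` are hypothetical stand-ins for the existing extractors, stubbed out here so the sketch stays self-contained. The candidate tuple shape `(text, label, confidence)` is an assumption, not the project's actual schema.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real GLiNER / spaCy extractors.
# In the real pipeline these would chunk the document and run the models.
def gliner_extract(text):
    return []  # would yield (span, label, confidence) tuples

def spacy_extract(text):
    return []

# Regex patterns for structured entities (illustrative, not exhaustive)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
MONEY_RE = re.compile(r"\$[\d,]+(?:\.\d{2})?")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def regex_extract(text):
    out = []
    for label, rx in (("DATE", DATE_RE), ("MONEY", MONEY_RE), ("EMAIL", EMAIL_RE)):
        out += [(m.group(), label, 0.95) for m in rx.finditer(text)]
    return out

def phase1(text):
    """Run all three extractors concurrently and concatenate candidates."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, text)
                   for f in (gliner_extract, spacy_extract, regex_extract)]
        candidates = []
        for fut in futures:
            candidates += fut.result()
    return candidates
```

Threads (not processes) are enough here because GLiNER and spaCy inference release the GIL during model execution; the regex pass is effectively instant either way.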
Phase 2: Merge + Dedup (~instant)
├── Fuzzy match duplicates (Clearview AI ≈ Clearvicn Al)
├── Boost confidence for entities found by multiple methods
├── Keep best spelling/normalization per entity
└── Remove obvious junk (< 3 chars, > 80 chars, common words)
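The Phase 2 steps above can be sketched with stdlib `difflib` for fuzzy matching. The 0.7 similarity threshold is an assumption chosen so OCR-level variants merge ("Clearview AI" vs "Clearvicn Al" scores about 0.75); the junk word list and the +0.1 cross-method boost are likewise illustrative. This sketch keeps the type from the highest-confidence candidate rather than always preferring GLiNER's.

```python
from difflib import SequenceMatcher

COMMON_WORDS = {"the", "and", "agreement", "section"}  # illustrative junk list

def similar(a, b, threshold=0.7):
    """Fuzzy match, loose enough to catch OCR variants like 'Clearvicn Al'."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def merge_candidates(candidates):
    """candidates: list of (name, type, confidence, method) tuples."""
    groups = []
    for cand in candidates:
        name = cand[0]
        if len(name) < 3 or len(name) > 80 or name.lower() in COMMON_WORDS:
            continue  # remove obvious junk
        for group in groups:
            if similar(group[0][0], name):
                group.append(cand)  # fuzzy-matched duplicate
                break
        else:
            groups.append([cand])
    merged = []
    for group in groups:
        best = max(group, key=lambda c: c[2])   # keep best spelling + type
        methods = {c[3] for c in group}
        # boost confidence when multiple methods found the entity
        conf = min(1.0, best[2] + 0.1 * (len(methods) - 1))
        merged.append((best[0], best[1], round(conf, 2)))
    return merged
```

For production-scale candidate lists a dedicated library (e.g. rapidfuzz) would be faster than pairwise `SequenceMatcher`, but the logic is the same.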
Phase 3: LLM Validation (~30 sec)
├── Input: cleaned entity list (~2K chars, not 78K)
├── "Remove junk. Fix OCR errors. What's missing? Score confidence."
├── Identify relationships between validated entities
└── Output: final entity list with relationships
Why this is better
- 40x less LLM input — 2K entity list vs 78K document text
- Cross-validation — entities found by GLiNER AND spaCy get higher confidence
- LLM does judgment, not scanning — what it's actually good at
- Parallel phase 1 — GLiNER + spaCy + Regex run simultaneously
- Total time: ~60 sec instead of 14 min
- Better quality — multiple extraction methods catch different things
LLM validation prompt (compact)
Review these entities extracted from a government document.
Remove junk (boilerplate, OCR artifacts, generic words).
Fix OCR errors in names. Add confidence scores.
Identify relationships between entities.
Entities: [list of ~50 candidates with types]
Return cleaned JSON with entities and relationships.
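The compact prompt above can be built mechanically from the merged candidates. A minimal sketch; the exact template wording and the `(name, type, conf)` tuple shape are assumptions:

```python
PROMPT_TEMPLATE = """Review these entities extracted from a government document.
Remove junk (boilerplate, OCR artifacts, generic words).
Fix OCR errors in names. Add confidence scores.
Identify relationships between entities.

Entities:
{entities}

Return cleaned JSON with keys "entities" and "relationships"."""

def build_validation_prompt(candidates):
    """Format ~50 (name, type, conf) candidates into the compact prompt."""
    lines = [f"- {name} ({etype}, conf={conf})" for name, etype, conf in candidates]
    return PROMPT_TEMPLATE.format(entities="\n".join(lines))
```

Even with 50 candidates the rendered prompt stays around 2K characters, which is the whole point: the LLM sees a short list, not the 78K-char document.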
Implementation notes
- Phase 1 extractors already exist — just need to run all three, not pick one
- Merge logic: group by fuzzy-matched name, keep highest confidence, prefer GLiNER type
- LLM validation is optional — if no LLM available, skip to Phase 2 output
- Regex always runs (catches structured patterns NER models miss)
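The notes above reduce to a small orchestrator: every extractor always runs, merge is mandatory, and LLM validation degrades gracefully to the Phase 2 output when no LLM is available. A shape sketch with the extractors and merge passed in as callables (names are illustrative):

```python
def run_pipeline(text, extractors, merge, llm_validate=None):
    # Phase 1: run ALL extractors, never pick one; regex is always in the list
    candidates = [c for extract in extractors for c in extract(text)]
    # Phase 2: fuzzy merge + dedup + junk removal
    merged = merge(candidates)
    # Phase 3: optional — skip straight to Phase 2 output if no LLM available
    if llm_validate is None:
        return merged
    return llm_validate(merged)
```

Keeping `llm_validate` as an injectable callable means the offline path is the same code path, just with the last phase skipped.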