Extraction pipeline redesign: parallel extract → LLM validates #60

@JordanCoin

Current Architecture (wrong)

LLM (14 min) → done
OR GLiNER → done  
OR spaCy → done
OR Regex → done

Only ONE method runs. When that method is the LLM, it processes the entire document text (78K chars for a 32-page contract), which takes about 14 minutes. Slow, wasteful, no cross-validation.

Proposed Architecture

Phase 1: Extract (parallel, ~30 sec)
├── GLiNER (zero-shot NER, chunked)     → candidates with confidence
├── spaCy (NLP pipeline, chunked)       → candidates with confidence  
└── Regex (pattern matching, instant)   → dates, money, phones, emails

Phase 2: Merge + Dedup (~instant)
├── Fuzzy match duplicates (Clearview AI ≈ Clearvicn Al)
├── Boost confidence for entities found by multiple methods
├── Keep best spelling/normalization per entity
└── Remove obvious junk (< 3 chars, > 80 chars, common words)

Phase 3: LLM Validation (~30 sec)
├── Input: cleaned entity list (~2K chars, not 78K)
├── "Remove junk. Fix OCR errors. What's missing? Score confidence."
├── Identify relationships between validated entities
└── Output: final entity list with relationships
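
A minimal sketch of the Phase 2 merge step, assuming candidates arrive as simple records with a text, type, confidence, and source. The `Candidate` class, the 0.7 similarity threshold, the 0.1 boost per extra source, and the `COMMON_WORDS` list are all illustrative assumptions, not the project's actual code. Fuzzy matching here uses the standard library's `difflib.SequenceMatcher`; a dedicated fuzzy-matching library could be swapped in.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    text: str
    label: str
    confidence: float
    source: str  # "gliner", "spacy", or "regex" (hypothetical source tags)


# Illustrative stop list; the real "common words" list is not given in the issue.
COMMON_WORDS = {"the", "and", "section", "page"}


def similar(a, b, threshold=0.7):
    """Case-insensitive fuzzy match; the threshold is an assumed value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def is_junk(text):
    """Drop obvious junk: < 3 chars, > 80 chars, or a common word."""
    t = text.strip()
    return len(t) < 3 or len(t) > 80 or t.lower() in COMMON_WORDS


def merge_candidates(candidates, threshold=0.7, boost=0.1):
    # Group candidates whose names fuzzy-match the first member of a group.
    groups = []
    for cand in candidates:
        if is_junk(cand.text):
            continue
        for group in groups:
            if similar(cand.text, group[0].text, threshold):
                group.append(cand)
                break
        else:
            groups.append([cand])

    merged = []
    for group in groups:
        # Keep the spelling of the highest-confidence candidate.
        best = max(group, key=lambda c: c.confidence)
        sources = sorted({c.source for c in group})
        # Prefer the GLiNER type when GLiNER found the entity.
        gliner = [c for c in group if c.source == "gliner"]
        label = gliner[0].label if gliner else best.label
        # Boost confidence once per additional method that agreed.
        conf = min(1.0, best.confidence + boost * (len(sources) - 1))
        merged.append(Candidate(best.text, label, round(conf, 3), "+".join(sources)))
    return merged
```

With these assumed settings, "Clearview AI" from GLiNER and "Clearvicn Al" from spaCy land in one group, and the higher-confidence spelling is kept. The greedy first-match grouping is order-dependent; a clustering pass would be more robust but is beyond this sketch.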

Why this is better

  • ~40x less LLM input: a ~2K-char entity list vs 78K chars of document text
  • Cross-validation: entities found by both GLiNER and spaCy get higher confidence
  • LLM does judgment, not scanning, which is what it's actually good at
  • Parallel Phase 1: GLiNER, spaCy, and Regex run simultaneously
  • Total time: ~60 sec (~30 sec for Phase 1 plus ~30 sec for Phase 3) instead of 14 min
  • Better quality: multiple extraction methods catch different things
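
One way the parallel Phase 1 could be wired up, sketched with `concurrent.futures` from the standard library. The `run_gliner` and `run_spacy` functions are hypothetical placeholders standing in for the existing extractors, and the two regex patterns are simplified examples rather than the project's real patterns. Whether threads give a real speed-up depends on how much time the model libraries spend outside the Python interpreter lock; a process pool is the usual alternative.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Simplified illustrative patterns (assumptions, not the project's real ones).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MONEY_RE = re.compile(r"\$\d[\d,]*(?:\.\d{2})?")


def run_regex(text):
    """Return (text, type, confidence, source) tuples for structured patterns."""
    found = [(m.group(), "EMAIL", 1.0, "regex") for m in EMAIL_RE.finditer(text)]
    found += [(m.group(), "MONEY", 1.0, "regex") for m in MONEY_RE.finditer(text)]
    return found


def run_gliner(text):
    """Placeholder for the existing chunked GLiNER extractor."""
    return []


def run_spacy(text):
    """Placeholder for the existing chunked spaCy extractor."""
    return []


def extract_all(text, extractors):
    """Run every extractor concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in extractors.items()}
        # .result() blocks until that extractor finishes and re-raises its errors.
        return {name: fut.result() for name, fut in futures.items()}


EXTRACTORS = {"gliner": run_gliner, "spacy": run_spacy, "regex": run_regex}
```

Calling `extract_all(document_text, EXTRACTORS)` returns one candidate list per method, which keeps the source of each candidate available for the confidence boost in Phase 2.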

LLM validation prompt (compact)

Review these entities extracted from a government document.
Remove junk (boilerplate, OCR artifacts, generic words).
Fix OCR errors in names. Add confidence scores.
Identify relationships between entities.

Entities: [list of ~50 candidates with types]

Return cleaned JSON with entities and relationships.
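
A rough sketch of how the prompt above might be assembled and its reply parsed. The function names and the expected reply shape (a JSON object with `entities` and `relationships` keys) are assumptions for illustration; the issue does not specify the output schema.

```python
import json

PROMPT_TEMPLATE = """Review these entities extracted from a government document.
Remove junk (boilerplate, OCR artifacts, generic words).
Fix OCR errors in names. Add confidence scores.
Identify relationships between entities.

Entities:
{entities}

Return cleaned JSON with entities and relationships."""


def build_validation_prompt(entities):
    """entities: iterable of (text, type) pairs from the merge step."""
    lines = "\n".join(f"- {text} ({label})" for text, label in entities)
    return PROMPT_TEMPLATE.format(entities=lines)


def parse_validation_response(raw):
    """Parse the model's reply; return None if it is not the assumed JSON shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "entities" not in data:
        return None
    # Relationships are optional in this sketch; default to an empty list.
    data.setdefault("relationships", [])
    return data
```

Returning `None` on an unparseable reply lets the caller fall back to the unvalidated entity list instead of failing the whole run.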

Implementation notes

  • Phase 1 extractors already exist — just need to run all three, not pick one
  • Merge logic: group by fuzzy-matched name, keep highest confidence, prefer GLiNER type
  • LLM validation is optional: if no LLM is available, return the Phase 2 output as the final result
  • Regex always runs (catches structured patterns NER models miss)
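
The notes above could translate into control flow roughly like this. It is a sketch only: `run_pipeline` and its callable parameters are hypothetical names, and the result dictionary shape is an assumption.

```python
def run_pipeline(text, extract, merge, llm_validate=None):
    """Run Phase 1 and Phase 2 always; run Phase 3 only if an LLM is supplied.

    extract:      callable(text) -> list of candidates from all extractors
    merge:        callable(candidates) -> cleaned, deduplicated entity list
    llm_validate: optional callable(entities) -> dict with "entities" and,
                  optionally, "relationships"
    """
    candidates = extract(text)   # Phase 1: all extractors, regex included
    merged = merge(candidates)   # Phase 2: merge + dedup

    if llm_validate is None:
        # No LLM available: the Phase 2 output is the final result.
        return {"entities": merged, "relationships": [], "validated": False}

    result = llm_validate(merged)  # Phase 3: LLM sees only the entity list
    return {
        "entities": result["entities"],
        "relationships": result.get("relationships", []),
        "validated": True,
    }
```

Passing the phases in as callables keeps the sketch independent of any particular extractor or model client; the `validated` flag is one way to let downstream code know whether Phase 3 ran.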
