Entity extraction: filter junk entities from regex fallback #57

## Problem
When using the regex fallback (no GLiNER/spaCy/LLM), extraction produces junk entities from OCR artifacts and document boilerplate. These pollute the graph and waste crossref API calls.

Examples from a 32-page Clearview AI contract:
- `To:` — picked up as organization
- `Bill To:` — picked up as organization  
- `APPROVED` — picked up as organization
- `EC .Orchestrating a or g-?te' 22` — OCR garbage picked up as organization
- `CID SES Criminal lnte` — truncated OCR artifact
- `Justification of request: Clcorwcw will assist the department with identifying suspects through facial recognition.` — entire sentence picked up as person
- `Sgt. Lakca Gaither, 09/94/2019, For FMU Use Only` — name + metadata jammed together
- `Clearvicn Al. Inc.` — OCR misspelling of "Clearview AI"

## Fixes needed
1. **Blocklist** — common boilerplate words that aren't entities: "To:", "From:", "Bill To:", "APPROVED", "RE:", "CC:", etc.
2. **Length filter** — skip entities shorter than 3 chars or longer than 80 chars (full sentences aren't entities)
3. **Character ratio** — if >30% special characters or digits, probably OCR garbage
4. **Fuzzy dedup** — "Clearview AI" and "Clearvicn Al" should merge (Levenshtein distance)
5. **Sentence detection** — if the "entity" contains a verb, it's probably a sentence not a name
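
A minimal sketch of how fixes 1 to 4 might look, using only the standard library so it still works in the regex-only tier. Everything here is an assumption for illustration: the names `BLOCKLIST`, `is_junk`, `noise_ratio`, `levenshtein` and `dedup` are hypothetical and not taken from the codebase, and the 0.3 relative-distance threshold for merging is a guess that would need tuning. Sentence detection (fix 5) is left out because it needs a POS tagger or a verb list.

```python
# Hypothetical blocklist, compared case-insensitively with any trailing colon removed.
BLOCKLIST = {"to", "from", "bill to", "approved", "re", "cc"}

MIN_LEN, MAX_LEN = 3, 80   # length bounds from fix 2
MAX_NOISE_RATIO = 0.30     # threshold from fix 3


def noise_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are not letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    return sum(1 for c in chars if not c.isalpha()) / len(chars)


def is_junk(text: str) -> bool:
    """Apply the length filter, blocklist, and character-ratio check."""
    cleaned = text.strip()
    if not (MIN_LEN <= len(cleaned) <= MAX_LEN):
        return True
    if cleaned.rstrip(":").strip().lower() in BLOCKLIST:
        return True
    return noise_ratio(cleaned) > MAX_NOISE_RATIO


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def dedup(entities: list[str], max_rel_distance: float = 0.3) -> list[str]:
    """Keep the first-seen spelling of each group of near-duplicates."""
    kept: list[str] = []
    for ent in entities:
        key = ent.lower()
        for existing in kept:
            other = existing.lower()
            longest = max(len(key), len(other), 1)
            if levenshtein(key, other) / longest <= max_rel_distance:
                break  # close enough to an entity we already kept
        else:
            kept.append(ent)
    return kept


candidates = ["To:", "Bill To:", "APPROVED", "Clearview AI", "Clearvicn Al"]
clean = dedup([c for c in candidates if not is_junk(c)])
```

Two caveats with this sketch. `EC .Orchestrating a or g-?te' 22` has a noise ratio of roughly 22%, so a 30% threshold alone would not reject it; it likely needs a lower threshold or an additional check. `dedup` also keeps whichever spelling it sees first, which may be the OCR error rather than the correct name.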

## Context
GLiNER and spaCy handle these correctly — this only affects the regex fallback tier. But regex is the safety net that always works, so it needs to be cleaner.
