Entity extraction: filter junk entities from regex fallback #57

@JordanCoin

Problem

When using the regex fallback (no GLiNER/spaCy/LLM), extraction produces junk entities from OCR artifacts and document boilerplate. These pollute the graph and waste crossref API calls.

Examples from a 32-page Clearview AI contract:

  • To: — picked up as organization
  • Bill To: — picked up as organization
  • APPROVED — picked up as organization
  • EC .Orchestrating a or g-?te' 22 — OCR garbage picked up as organization
  • CID SES Criminal lnte — truncated OCR artifact
  • Justification of request: Clcorwcw will assist the department with identifying suspects through facial recognition. — entire sentence picked up as person
  • Sgt. Lakca Gaither, 09/94/2019, For FMU Use Only — name + metadata jammed together
  • Clearvicn Al. Inc. — OCR misspelling of "Clearview AI"

Fixes needed

  1. Blocklist — common boilerplate words that aren't entities: "To:", "From:", "Bill To:", "APPROVED", "RE:", "CC:", etc.
  2. Length filter — skip entities shorter than 3 chars or longer than 80 chars (full sentences aren't entities)
  3. Character ratio — if >30% special characters or digits, probably OCR garbage
  4. Fuzzy dedup — "Clearview AI" and "Clearvicn Al" should merge (Levenshtein distance)
  5. Sentence detection — if the "entity" contains a verb, it's probably a sentence not a name
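A rough sketch of how fixes 1 to 5 could fit together, using only the standard library so it still works in the fallback tier. Everything here is an assumption for illustration: the names `BLOCKLIST`, `COMMON_VERBS`, `is_junk`, and `dedupe` are hypothetical and not taken from the codebase, the verb list is a tiny stand-in for real sentence detection, and the 25% edit-distance threshold for merging is a guess that would need tuning.

```python
import re

# Hypothetical blocklist, seeded from the boilerplate examples in this issue.
BLOCKLIST = {"to", "from", "bill to", "approved", "re", "cc"}
# Hypothetical, deliberately tiny verb list; a crude stand-in for sentence detection.
COMMON_VERBS = {"is", "are", "was", "were", "will", "shall", "has", "have"}


def normalize(text):
    """Collapse whitespace, strip surrounding punctuation, lowercase."""
    return re.sub(r"\s+", " ", text).strip(" .,:;-").lower()


def is_junk(text, min_len=3, max_len=80, max_noise=0.30):
    """Return True if a candidate entity fails any of the proposed filters."""
    stripped = text.strip()
    # Fix 2: length filter.
    if not (min_len <= len(stripped) <= max_len):
        return True
    # Fix 1: blocklist, compared after normalization so "To:" matches "to".
    if normalize(stripped) in BLOCKLIST:
        return True
    # Fix 3: share of characters that are neither letters nor whitespace.
    noise = sum(1 for ch in stripped if not (ch.isalpha() or ch.isspace()))
    if noise / len(stripped) > max_noise:
        return True
    # Fix 5: any word from the small verb list suggests a sentence.
    words = set(re.findall(r"[a-z]+", stripped.lower()))
    return bool(words & COMMON_VERBS)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def dedupe(names, max_ratio=0.25):
    """Fix 4: keep the first spelling seen, drop later near-duplicates."""
    canonical = []
    for name in names:
        key = normalize(name)
        for kept in canonical:
            kept_key = normalize(kept)
            limit = max_ratio * max(len(key), len(kept_key))
            if levenshtein(key, kept_key) <= limit:
                break  # close enough to an entity we already kept
        else:
            canonical.append(name)
    return canonical


def clean_entities(candidates):
    """Filter junk first, then merge near-duplicate spellings."""
    return dedupe([c for c in candidates if not is_junk(c)])
```

With these thresholds, "To:", "Bill To:", "APPROVED", and the "Justification of request: ..." sentence are all rejected, and "Clearvicn Al" merges into "Clearview AI" (edit distance 3 over 12 characters). Two caveats: the "EC .Orchestrating ..." and "Sgt. Lakca Gaither, ..." examples stay under the 30% noise threshold in this sketch, so they would likely need an extra signal, and the verb check would wrongly reject a real name containing a word like "Will".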

Context

GLiNER and spaCy handle these cases correctly, so this only affects the regex fallback tier. But regex is the safety net that always works, so its output needs to be cleaner.
