PrincetonAfeez/Contact-Extractor

Contact extractor

Reads messy text (pasted lists, threads, notes) and extracts email addresses and US/Canada NANP phone numbers. Produces CSV columns: name_if_found, email, phone, source_line, confidence_score, pairing_uncertain.

Requirements

  • Python 3.10+
  • Project modules in this folder: contact_extractor.py, string_sanitizer.py, errors.py
  • Tests: pip install -r requirements.txt (pytest)

Command line

From this directory:

python contact_extractor.py --file input.txt --output contacts_output.csv

Or:

python -m contact_extractor --file input.txt

Flag              Description
--file            Input text file (required)
--output          CSV path (default: contacts_output.csv)
--min-confidence  Minimum score 0.0–1.0 (default: 0.0)
--encoding        Input encoding (default: utf-8; invalid bytes replaced)
--show-rejected   List every rejected/duplicate candidate in findings (default caps at 10 + summary)
--stats           Print stats to stderr after a run
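
For orientation, the flags listed above could be declared with argparse roughly as follows. This is a sketch rather than the project's actual parser: build_parser is a hypothetical name, and only the flag names and defaults are taken from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the project's own code)."""
    parser = argparse.ArgumentParser(prog="contact_extractor")
    parser.add_argument("--file", required=True, help="Input text file")
    parser.add_argument("--output", default="contacts_output.csv", help="CSV path")
    parser.add_argument("--min-confidence", type=float, default=0.0,
                        help="Minimum score, 0.0 to 1.0")
    parser.add_argument("--encoding", default="utf-8", help="Input encoding")
    parser.add_argument("--show-rejected", action="store_true",
                        help="List every rejected/duplicate candidate")
    parser.add_argument("--stats", action="store_true",
                        help="Print stats to stderr after a run")
    return parser


# argparse turns --min-confidence into args.min_confidence, and so on.
args = build_parser().parse_args(["--file", "input.txt", "--min-confidence", "0.5"])
print(args.file, args.output, args.min_confidence, args.show_rejected)
```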

Progress: a line is printed to stderr every 100 lines processed.

Programmatic use

from contact_extractor import run, run_file

result = run(text, config={"min_confidence": 0.5, "source_name": "paste.txt"})
result = run_file("input.txt", encoding="utf-8", config={"show_rejected": True})

run_file reads the whole file into memory, then runs the same pipeline as run (the sanitizer operates on the full text first).

Pipeline

  1. Sanitize: string_sanitizer normalizes whitespace, smart quotes, zero-width characters, etc.
  2. Per line: find regex candidates for emails and phones, then validate, score, and deduplicate across the file.
  3. CSV: csv.writer with minimal quoting.
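
The first and last stages can be sketched in a few lines. The sanitize and rows_to_csv functions below are illustrative stand-ins, not the contents of string_sanitizer.py or contact_extractor.py; the exact set of characters the real sanitizer handles is an assumption here.

```python
import csv
import io
import re

# Assumed character sets; the real string_sanitizer may cover more cases.
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")

COLUMNS = ["name_if_found", "email", "phone", "source_line",
           "confidence_score", "pairing_uncertain"]


def sanitize(text: str) -> str:
    """Drop zero-width characters, straighten quotes, collapse spaces per line."""
    text = ZERO_WIDTH.sub("", text).translate(SMART_QUOTES)
    text = text.replace("\u00a0", " ")  # non-breaking space becomes a plain space
    return "\n".join(re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())


def rows_to_csv(rows) -> str:
    """Write rows with minimal quoting: only fields that need quotes get them."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buf.getvalue()


print(sanitize("Jane\u200b  Doe \u201chi\u201d"))
print(rows_to_csv([["Doe, Jane", "j@x.org", "", 3, 0.8, "no"]]))
```

With minimal quoting, only the name field containing a comma is wrapped in quotes; the other fields are written bare.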

Email rules

  • Regex capture plus extra checks: no consecutive dots (..), sensible local part and domain labels, TLD length 2–63.
  • This is practical / RFC-inspired, not full RFC 5322 (no quoted locals, IP domains, etc.).
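
A minimal sketch of this two-step approach (a loose regex capture followed by stricter checks) might look like the following. The patterns and the names EMAIL_CANDIDATE, LABEL, is_plausible_email, and find_emails are assumptions made for illustration; the project's actual regex and its definition of a "sensible" local part may differ.

```python
import re

# Loose capture: the follow-up checks do the real filtering.
EMAIL_CANDIDATE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,63}")
# A domain label: alphanumeric at both ends, hyphens allowed inside.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_plausible_email(candidate: str) -> bool:
    """Apply the extra checks: no '..', sane local part and labels, TLD length 2-63."""
    if ".." in candidate or candidate.count("@") != 1:
        return False
    local, domain = candidate.split("@")
    if not local or local.startswith(".") or local.endswith("."):
        return False
    labels = domain.split(".")
    if len(labels) < 2 or not all(LABEL.fullmatch(label) for label in labels):
        return False
    tld = labels[-1]
    return tld.isalpha() and 2 <= len(tld) <= 63


def find_emails(line: str) -> list[str]:
    """Return the candidates on a line that survive validation."""
    return [m.group(0) for m in EMAIL_CANDIDATE.finditer(line)
            if is_plausible_email(m.group(0))]


print(find_emails("Jane <jane.doe@example.com>, bad: a..b@example.com"))
```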

Phone rules (US/Canada NANP)

  • Accepts common formats: (555) 123-4567, 555-123-4567, 555.123.4567, +1-555-123-4567, and bare 10- or 11-digit runs (11-digit must start with 1).
  • Rejects digit strings that fail NANP structure: area code and exchange must each start with 2–9.
  • Normalized display: (XXX) XXX-XXXX with optional +1 prefix.
  • Not general international E.164 for arbitrary countries.
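
The structural rule can be illustrated with a short sketch: strip everything but digits, drop a leading 1 from 11-digit runs, then check the first digit of the area code and of the exchange. The function name normalize_nanp, the keep_country_code parameter, and the exact "+1 (XXX) XXX-XXXX" spacing are assumptions, not the project's API.

```python
import re
from typing import Optional


def normalize_nanp(candidate: str, keep_country_code: bool = False) -> Optional[str]:
    """Return a normalized NANP number, or None if the structure is invalid."""
    digits = re.sub(r"\D", "", candidate)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        return None
    area, exchange, line = digits[:3], digits[3:6], digits[6:]
    # Area code and exchange must each start with 2-9.
    if area[0] in "01" or exchange[0] in "01":
        return None
    formatted = f"({area}) {exchange}-{line}"
    return f"+1 {formatted}" if keep_country_code else formatted


print(normalize_nanp("555.234.5678"))
print(normalize_nanp("+1-555-234-5678", keep_country_code=True))
print(normalize_nanp("555-123-4567"))
```

Under this sketch, a number such as 555-123-4567 is rejected because its exchange (123) starts with 1, so the format examples in the list above illustrate punctuation only.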

Names and pairing

  • Name: heuristic match for a Title-case pattern such as Jane Doe on the same line or the line above. If nothing matches, the name is recorded as unknown.
  • Pairing: when a line contains both emails and phones, they are zipped in order of appearance. If the counts differ, every row produced from that line is marked pairing_uncertain = yes, because the email–phone alignment may be wrong.
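
Both heuristics can be sketched as below. Several details are assumptions rather than documented behavior: the NAME pattern, the literal "unknown" fallback string, the "no" value for certain pairings, and the choice to emit leftover emails or phones as rows with an empty partner field. find_name and pair_contacts are hypothetical names.

```python
import re
from itertools import zip_longest

# Two or more capitalized words in a row, e.g. "Jane Doe" (assumed pattern).
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def find_name(lines: list[str], index: int) -> str:
    """Look for a Title-case name on the current line, then on the line above."""
    for i in (index, index - 1):
        if i >= 0:
            match = NAME.search(lines[i])
            if match:
                return match.group(0)
    return "unknown"


def pair_contacts(emails: list[str], phones: list[str]) -> list[dict]:
    """Zip emails and phones by order; flag every row when the counts differ."""
    uncertain = bool(emails) and bool(phones) and len(emails) != len(phones)
    flag = "yes" if uncertain else "no"
    return [
        {"email": email, "phone": phone, "pairing_uncertain": flag}
        for email, phone in zip_longest(emails, phones, fillvalue="")
    ]


lines = ["Jane Doe", "jane@example.com 555-234-5678"]
print(find_name(lines, 1))
print(pair_contacts(["a@x.org", "b@x.org"], ["(555) 234-5678"]))
```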

Confidence scores

Scores are deterministic 0.0–1.0 from structure and context (TLD, formatting, “cell/phone/…”, nearby name). They are heuristics, not proof of deliverability.
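
To make the idea concrete, here is a sketch of a deterministic additive score for a phone candidate. The weights, the base value, and the extra context words (mobile, tel) are invented for illustration and are not the project's actual numbers; score_phone is a hypothetical name.

```python
import re

# "cell" and "phone" come from the README; "mobile" and "tel" are assumed extras.
CONTEXT_WORDS = re.compile(r"\b(cell|phone|mobile|tel)\b", re.IGNORECASE)


def score_phone(raw: str, line: str, has_nearby_name: bool) -> float:
    """Return a score in [0.0, 1.0]; the same inputs always give the same score."""
    score = 0.5  # assumed base for a structurally valid number
    if re.search(r"[()\-.+]", raw):
        score += 0.2  # explicit formatting suggests a deliberate phone number
    if CONTEXT_WORDS.search(line):
        score += 0.2  # a label such as "cell:" appears on the line
    if has_nearby_name:
        score += 0.1  # a name was found on this line or the one above
    return round(min(score, 1.0), 2)


print(score_phone("(555) 234-5678", "cell: (555) 234-5678", True))
print(score_phone("5552345678", "5552345678", False))
```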

Findings and stats

  • Sanitizer findings are always included.
  • Extraction rejections are merged into findings; without show_rejected / --show-rejected, only the first 10 rejection entries are listed, then a note with the omitted count (stats.rejection_findings_omitted, stats.rejection_findings_truncated).
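
The capping behavior can be sketched as follows. The two stat names come from this README; their types (an integer count and a boolean), the wording of the summary note, and the merge_findings function itself are assumptions.

```python
def merge_findings(sanitizer_findings, rejections, show_rejected=False, cap=10):
    """Combine findings; cap rejection entries unless show_rejected is set."""
    findings = list(sanitizer_findings)  # sanitizer findings are always kept
    stats = {"rejection_findings_omitted": 0, "rejection_findings_truncated": False}
    if show_rejected or len(rejections) <= cap:
        findings.extend(rejections)
    else:
        omitted = len(rejections) - cap
        findings.extend(rejections[:cap])
        findings.append(f"{omitted} more rejected candidates omitted")
        stats["rejection_findings_omitted"] = omitted
        stats["rejection_findings_truncated"] = True
    return findings, stats


rejections = [f"rejected candidate {i}" for i in range(13)]
findings, stats = merge_findings(["sanitizer: removed zero-width chars"], rejections)
print(len(findings), stats)
```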

Errors

  • Unreadable input path, OS read failure, or unknown text encoding raises errors.InputError.
  • Non-file path (e.g. a directory) raises errors.ParseError.
  • Invalid min_confidence (not a number, non-finite, or outside 0.0–1.0) raises errors.ParseError or errors.ValidationError.
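
The three min_confidence checks (numeric, finite, in range) can be sketched like this. The ValidationError class below is a local stand-in for errors.ValidationError, whose real definition is not shown here, and this README does not say which of the two exception types each case raises; check_min_confidence is a hypothetical helper.

```python
import math


class ValidationError(ValueError):
    """Stand-in for errors.ValidationError (assumed, not the project's class)."""


def check_min_confidence(value) -> float:
    """Return value as a float, or raise if it is not a finite number in [0.0, 1.0]."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        raise ValidationError(f"min_confidence is not a number: {value!r}") from None
    if not math.isfinite(number):
        raise ValidationError(f"min_confidence is not finite: {value!r}")
    if not 0.0 <= number <= 1.0:
        raise ValidationError(f"min_confidence outside 0.0-1.0: {value!r}")
    return number


for candidate in (0.5, "abc", float("nan"), 1.5):
    try:
        print(check_min_confidence(candidate))
    except ValidationError as exc:
        print("rejected:", exc)
```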

Tests

cd contact_extractor
python -m pytest tests/ -v

About

Reads a messy text file (copy-pasted contact lists, email threads, business cards dumped into text) and extracts every valid email and phone number. Outputs a clean CSV with columns: name_if_found, email, phone, source_line, confidence_score.
