Reads messy text (pasted lists, threads, notes) and extracts email addresses and US/Canada NANP phone numbers. Produces CSV columns: name_if_found, email, phone, source_line, confidence_score, pairing_uncertain.
- Python 3.10+
- Project modules in this folder: `contact_extractor.py`, `string_sanitizer.py`, `errors.py`
- Tests: `pip install -r requirements.txt` (pytest)
From this directory:
python contact_extractor.py --file input.txt --output contacts_output.csv
Or:
python -m contact_extractor --file input.txt
| Flag | Description |
|---|---|
| `--file` | Input text file (required) |
| `--output` | CSV path (default: `contacts_output.csv`) |
| `--min-confidence` | Minimum score 0.0–1.0 (default: 0.0) |
| `--encoding` | Input encoding (default: `utf-8`; invalid bytes replaced) |
| `--show-rejected` | List every rejected/duplicate candidate in findings (default caps at 10 + summary) |
| `--stats` | Print stats to stderr after a run |
Progress: a line is printed to stderr every 100 lines processed.
from contact_extractor import run, run_file
result = run(text, config={"min_confidence": 0.5, "source_name": "paste.txt"})
result = run_file("input.txt", encoding="utf-8", config={"show_rejected": True})

`run_file` reads the whole file into memory, then runs the same pipeline as `run` (the sanitizer operates on the full text first).
- Sanitize: `string_sanitizer` normalizes whitespace, smart quotes, zero-width characters, etc.
- Per line: regex candidates for emails and phones; validate, score, deduplicate across the file.
- CSV: `csv.writer` with minimal quoting.
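
The CSV step can be sketched as follows. This is a minimal illustration, not the project's code: `write_rows` and the dict-shaped rows are hypothetical, and the `"\n"` line terminator is an assumption chosen for readable output. Only the column names and the use of `csv.writer` with minimal quoting come from the description above.

```python
import csv
import io

# Column order taken from the README's description of the output CSV.
COLUMNS = ["name_if_found", "email", "phone", "source_line",
           "confidence_score", "pairing_uncertain"]

def write_rows(rows, stream):
    """Hypothetical helper: write dict rows using csv.writer with minimal quoting."""
    # QUOTE_MINIMAL quotes a field only when it contains the delimiter,
    # the quote character, or a line break.
    writer = csv.writer(stream, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        # Missing keys become empty fields (an assumption of this sketch).
        writer.writerow([row.get(col, "") for col in COLUMNS])

buf = io.StringIO()
write_rows([{"name_if_found": "Doe, Jane", "email": "jane@example.com",
             "phone": "(212) 555-0123", "source_line": 3,
             "confidence_score": 0.9, "pairing_uncertain": "no"}], buf)
print(buf.getvalue())
```

In this sketch only the name field is quoted, because it is the only one containing a comma.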
- Regex capture plus extra checks: no `..`, sensible local part and domain labels, TLD length 2–63.
- This is practical / RFC-inspired, not full RFC 5322 (no quoted locals, IP domains, etc.).
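
As a rough illustration of "regex capture plus extra checks", the sketch below pairs a deliberately simple pattern with the rules listed above. The pattern, the names `EMAIL_RE`, `is_plausible_email`, and `find_emails`, and the exact meaning of "sensible" (no leading or trailing dot in the local part, no empty or hyphen-edged domain labels) are assumptions; the real module may differ.

```python
import re

# Assumed pattern: unquoted local part, dotted domain, alphabetic TLD of 2-63 chars.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,63}")

def is_plausible_email(candidate: str) -> bool:
    """Apply the extra checks a regex alone does not enforce."""
    if ".." in candidate:                      # no consecutive dots anywhere
        return False
    local, _, domain = candidate.rpartition("@")
    if not local or local.startswith(".") or local.endswith("."):
        return False
    labels = domain.split(".")
    if len(labels) < 2:
        return False
    for label in labels:
        # Each label must be non-empty and must not start or end with a hyphen.
        if not label or label.startswith("-") or label.endswith("-"):
            return False
    return 2 <= len(labels[-1]) <= 63          # TLD length 2-63

def find_emails(line: str) -> list[str]:
    """Return regex candidates on one line that also pass the extra checks."""
    return [m.group(0) for m in EMAIL_RE.finditer(line)
            if is_plausible_email(m.group(0))]

print(find_emails("Mail jane.doe@example.com or bad..dots@example.com"))
```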
- Accepts common formats: `(555) 123-4567`, `555-123-4567`, `555.123.4567`, `+1-555-123-4567`, and bare 10- or 11-digit runs (11-digit must start with `1`).
- Rejects digit strings that fail NANP structure: area code and exchange must each start with 2–9.
- Normalized display: `(XXX) XXX-XXXX` with optional `+1` prefix.
- Not general international E.164 for arbitrary countries.
- Name: heuristic match on the same line or the line above, using a pattern like `Jane Doe` (Title-case words). Otherwise `unknown`.
- Pairing: when both emails and phones appear on one line, they are zipped by order. If the counts differ, each output row gets `pairing_uncertain=yes`, because the email-phone alignment may be wrong.
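
Pairing by order might look like the sketch below. `pair_contacts` is hypothetical. Two behaviors are assumptions not stated above: leftover items are emitted with an empty counterpart (via `zip_longest`), and a line with only emails or only phones is marked `no`.

```python
from itertools import zip_longest

def pair_contacts(emails: list[str], phones: list[str]) -> list[dict]:
    """Zip emails and phones found on one line, flagging mismatched counts."""
    # Alignment is only in doubt when both kinds are present in unequal numbers.
    mismatch = bool(emails) and bool(phones) and len(emails) != len(phones)
    uncertain = "yes" if mismatch else "no"
    rows = []
    for email, phone in zip_longest(emails, phones, fillvalue=""):
        rows.append({"email": email, "phone": phone,
                     "pairing_uncertain": uncertain})
    return rows

for row in pair_contacts(["a@example.com", "b@example.com"], ["(212) 555-0123"]):
    print(row)
```

Here both rows are flagged, including the first one, since with two emails and one phone there is no way to know which email the phone belongs to.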
Scores are deterministic 0.0–1.0 from structure and context (TLD, formatting, “cell/phone/…”, nearby name). They are heuristics, not proof of deliverability.
- Sanitizer findings are always included.
- Extraction rejections are merged into `findings`; without `show_rejected` / `--show-rejected`, only the first 10 rejection entries are listed, then a note with the omitted count (`stats.rejection_findings_omitted`, `stats.rejection_findings_truncated`).
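
The capping behavior can be pictured with the sketch below. `merge_findings`, the string-shaped findings, the wording of the summary note, and the types of the two stats values (an int and a bool) are all assumptions; only the cap of 10 and the stats key names come from the description above.

```python
REJECTION_CAP = 10  # the README's default cap on listed rejections

def merge_findings(sanitizer_findings, rejections, show_rejected=False):
    """Combine sanitizer findings with (possibly truncated) rejection entries."""
    findings = list(sanitizer_findings)        # sanitizer findings are always kept
    stats = {"rejection_findings_omitted": 0,
             "rejection_findings_truncated": False}
    if show_rejected or len(rejections) <= REJECTION_CAP:
        findings.extend(rejections)
        return findings, stats
    omitted = len(rejections) - REJECTION_CAP
    findings.extend(rejections[:REJECTION_CAP])
    # Hypothetical wording for the summary note.
    findings.append(f"{omitted} more rejection(s) omitted; use --show-rejected to list all")
    stats["rejection_findings_omitted"] = omitted
    stats["rejection_findings_truncated"] = True
    return findings, stats

findings, stats = merge_findings(["sanitizer: removed zero-width chars"],
                                 [f"rejected candidate {i}" for i in range(13)])
print(len(findings), stats)
```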
- Unreadable input path, OS read failure, or unknown text encoding raises `errors.InputError`.
- Non-file path (e.g. a directory) raises `errors.ParseError`.
- Invalid `min_confidence` (not a number, non-finite, or outside 0.0–1.0) raises `errors.ParseError` or `errors.ValidationError`.
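
The three `min_confidence` failure modes can be checked as in the sketch below. The local `ValidationError` class is a stand-in defined so the example runs on its own; it is not the project's `errors.ValidationError`. `check_min_confidence` is likewise hypothetical, as is the choice to accept numeric strings.

```python
import math

class ValidationError(ValueError):
    """Stand-in for errors.ValidationError so this sketch is self-contained."""

def check_min_confidence(value) -> float:
    """Return value as a float in [0.0, 1.0], or raise ValidationError."""
    try:
        number = float(value)                  # accepts numbers and numeric strings
    except (TypeError, ValueError) as exc:
        raise ValidationError(f"min_confidence is not a number: {value!r}") from exc
    if not math.isfinite(number):              # rejects nan, inf, -inf
        raise ValidationError(f"min_confidence is not finite: {value!r}")
    if not 0.0 <= number <= 1.0:
        raise ValidationError(f"min_confidence outside 0.0-1.0: {value!r}")
    return number

for candidate in ("0.5", "abc", float("nan"), 1.5):
    try:
        print(candidate, "->", check_min_confidence(candidate))
    except ValidationError as err:
        print(candidate, "-> rejected:", err)
```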
cd contact_extractor
python -m pytest tests/ -v