Contact extractor

Reads messy text (pasted lists, threads, notes) and extracts email addresses and US/Canada NANP phone numbers. Produces CSV columns: name_if_found, email, phone, source_line, confidence_score, pairing_uncertain.

Requirements

Python 3.10+
Project modules in this folder: contact_extractor.py, string_sanitizer.py, errors.py
Tests: pip install -r requirements.txt (pytest)

Command line

From this directory:

python contact_extractor.py --file input.txt --output contacts_output.csv

Or:

python -m contact_extractor --file input.txt

| Flag | Description |
| --- | --- |
| `--file` | Input text file (required) |
| `--output` | CSV path (default: `contacts_output.csv`) |
| `--min-confidence` | Minimum score 0.0–1.0 (default: 0.0) |
| `--encoding` | Input encoding (default: `utf-8`; invalid bytes replaced) |
| `--show-rejected` | List every rejected/duplicate candidate in `findings` (default caps at 10 + summary) |
| `--stats` | Print `stats` to stderr after a run |

Progress: a line is printed to stderr every 100 lines processed.
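
The `--encoding` and progress behaviour described above can be approximated with the standard library alone. The sketch below is illustrative, not the project's code: `decode_lenient` and `iter_lines_with_progress` are hypothetical names, and the wording of the progress message is an assumption.

```python
import sys
from typing import Iterator


def decode_lenient(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, replacing each invalid sequence with U+FFFD."""
    # An unknown encoding name raises LookupError here.
    return raw.decode(encoding, errors="replace")


def iter_lines_with_progress(text: str, every: int = 100) -> Iterator[str]:
    """Yield lines one by one, reporting to stderr every `every` lines."""
    for count, line in enumerate(text.splitlines(), start=1):
        yield line
        if count % every == 0:
            # Hypothetical message format; the real tool may word it differently.
            print(f"processed {count} lines", file=sys.stderr)
```

Writing progress to stderr keeps it out of any output that is piped or redirected from stdout.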

Programmatic use

from contact_extractor import run, run_file

result = run(text, config={"min_confidence": 0.5, "source_name": "paste.txt"})
result = run_file("input.txt", encoding="utf-8", config={"show_rejected": True})

run_file reads the whole file into memory, then runs the same pipeline as run (the sanitizer operates on the full text first).

Pipeline

Sanitize — string_sanitizer normalizes whitespace, smart quotes, zero-width characters, etc.
Per line — Regex candidates for emails and phones; validate, score, deduplicate across the file.
CSV — csv.writer with minimal quoting.
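
For the CSV step, a minimal sketch of `csv.writer` with minimal quoting might look like the following. The column names come from the top of this README; `rows_to_csv` and the dict-per-row shape are assumptions made for illustration.

```python
import csv
import io

# Column order as listed at the top of this README.
COLUMNS = [
    "name_if_found",
    "email",
    "phone",
    "source_line",
    "confidence_score",
    "pairing_uncertain",
]


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize rows (hypothetical dicts keyed by column name) to CSV text."""
    buf = io.StringIO(newline="")
    # QUOTE_MINIMAL quotes a field only when it contains the delimiter,
    # the quote character, or a line-terminator character.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([row.get(col, "") for col in COLUMNS])
    return buf.getvalue()
```

With minimal quoting, a name such as `Doe, Jane` is quoted because it contains a comma, while a phone such as `(555) 234-5678` is written bare.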

Email rules

Regex capture plus extra checks: no .., sensible local part and domain labels, TLD length 2–63.
This is practical / RFC-inspired, not full RFC 5322 (no quoted locals, IP domains, etc.).
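
As a rough illustration of "regex capture plus extra checks", the sketch below captures candidates with a permissive pattern and then applies the rules listed above. The patterns and the names `EMAIL_RE`, `is_plausible_email`, and `find_emails` are hypothetical, and the reading of "sensible local part" (allowed characters, no leading or trailing dot) is an assumption.

```python
import re

# Hypothetical capture pattern: local part, "@", dotted domain, alphabetic TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,63}")
LOCAL_RE = re.compile(r"[A-Za-z0-9._%+-]+")
# A domain label: alphanumeric at both ends, hyphens allowed inside, max 63 chars.
LABEL_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_plausible_email(candidate: str) -> bool:
    """Apply the extra checks: no '..', sane local part and labels, TLD 2-63."""
    if ".." in candidate or candidate.count("@") != 1:
        return False
    local, domain = candidate.split("@")
    if not LOCAL_RE.fullmatch(local) or local[0] == "." or local[-1] == ".":
        return False
    labels = domain.split(".")
    if len(labels) < 2 or not all(LABEL_RE.fullmatch(lb) for lb in labels):
        return False
    tld = labels[-1]
    return tld.isalpha() and 2 <= len(tld) <= 63


def find_emails(line: str) -> list[str]:
    """Return the regex candidates on a line that also pass the extra checks."""
    return [m.group(0) for m in EMAIL_RE.finditer(line)
            if is_plausible_email(m.group(0))]
```

Separating capture from validation keeps the regex simple and makes it possible to report why a candidate was rejected.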

Phone rules (US/Canada NANP)

Accepts common formats: (555) 234-5678, 555-234-5678, 555.234.5678, +1-555-234-5678, and bare 10- or 11-digit runs (an 11-digit run must start with 1).
Rejects digit strings that fail NANP structure: area code and exchange must each start with 2–9.
Normalized display: (XXX) XXX-XXXX with optional +1 prefix.
Not general international E.164 for arbitrary countries.
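
The NANP rules above can be sketched as a small normalizer. This is an illustration under assumptions, not the project's implementation: `normalize_nanp` is a hypothetical name, it expects a candidate already captured by a regex, and the exact spacing of the optional `+1` prefix is a guess.

```python
import re
from typing import Optional


def normalize_nanp(candidate: str, keep_country_code: bool = False) -> Optional[str]:
    """Return '(XXX) XXX-XXXX' for a valid NANP number, else None."""
    digits = re.sub(r"\D", "", candidate)  # drop punctuation, spaces, '+'
    if len(digits) == 11 and digits[0] == "1":
        digits = digits[1:]  # strip the country code
    if len(digits) != 10:
        return None
    area, exchange, subscriber = digits[:3], digits[3:6], digits[6:]
    # NANP structure: area code and exchange must each start with 2-9.
    if area[0] in "01" or exchange[0] in "01":
        return None
    display = f"({area}) {exchange}-{subscriber}"
    # Assumed prefix layout; the README only says the +1 prefix is optional.
    return f"+1 {display}" if keep_country_code else display
```

Under these rules `555.234.5678` normalizes to `(555) 234-5678`, while a number whose exchange starts with 0 or 1 is rejected.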

Names and pairing

Name — Heuristic: same line or line above, pattern like Jane Doe (Title-case words). Otherwise unknown.
Pairing — When both emails and phones appear on one line, they are zipped by order. If the counts differ, each output row still gets pairing_uncertain = yes (email–phone alignment may be wrong).
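
Zipping by order can be sketched with `itertools.zip_longest`. The function name `pair_contacts`, the choice to emit leftover items with an empty partner field, and marking equal-count lines as `no` are all assumptions for illustration; the actual tool may behave differently in those cases.

```python
from itertools import zip_longest


def pair_contacts(emails: list[str], phones: list[str]) -> list[dict[str, str]]:
    """Pair emails and phones found on one line by their order of appearance."""
    # Assumption: pairing is flagged only when both kinds are present
    # and their counts differ.
    mismatch = bool(emails) and bool(phones) and len(emails) != len(phones)
    uncertain = "yes" if mismatch else "no"
    rows = []
    # zip_longest keeps unpaired leftovers, filling the missing side with "".
    for email, phone in zip_longest(emails, phones, fillvalue=""):
        rows.append({"email": email, "phone": phone, "pairing_uncertain": uncertain})
    return rows
```

Flagging every row on a mismatched line, rather than only the leftover, reflects that a missing item anywhere on the line can shift all later pairs.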

Confidence scores

Scores are deterministic 0.0–1.0 from structure and context (TLD, formatting, “cell/phone/…”, nearby name). They are heuristics, not proof of deliverability.
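
To make the idea of a deterministic structural score concrete, here is a toy scorer for phone candidates. Every weight, the keyword list beyond "cell" and "phone", and the name `score_phone` are invented for this example; the real scoring rules are not documented here.

```python
import re

# "cell" and "phone" come from this README; "mobile" and "tel" are guesses.
PHONE_HINT_RE = re.compile(r"\b(cell|phone|mobile|tel)\b", re.IGNORECASE)


def score_phone(raw: str, line: str, has_name: bool) -> float:
    """Score a phone candidate in 0.0-1.0 from formatting and context."""
    points = 50  # hypothetical base for any structurally valid number
    if re.search(r"[()\-.+ ]", raw):
        points += 20  # explicit formatting suggests a deliberate phone number
    if PHONE_HINT_RE.search(line):
        points += 20  # a label such as "cell:" appears on the same line
    if has_name:
        points += 10  # a name was found nearby
    # Integer points avoid floating-point drift; the same input always
    # yields the same score.
    return min(points, 100) / 100
```

Such a score ranks candidates for review; it says nothing about whether the number is in service.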

Findings and stats

Sanitizer findings are always included.
Extraction rejections are merged into findings; without show_rejected / --show-rejected, only the first 10 rejection entries are listed, then a note with the omitted count (stats.rejection_findings_omitted, stats.rejection_findings_truncated).
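
The capping rule can be sketched as below. The stat names are taken from the text above; `merge_findings`, the use of plain strings for findings, the stat value types, and the wording of the summary note are assumptions.

```python
def merge_findings(sanitizer_findings, rejections, show_rejected=False, cap=10):
    """Merge rejection findings into sanitizer findings, capping unless asked not to."""
    findings = list(sanitizer_findings)  # sanitizer findings are always kept
    stats = {"rejection_findings_omitted": 0, "rejection_findings_truncated": False}
    if show_rejected or len(rejections) <= cap:
        findings.extend(rejections)
    else:
        omitted = len(rejections) - cap
        findings.extend(rejections[:cap])
        # Hypothetical wording for the summary note.
        findings.append(f"{omitted} more rejection findings omitted")
        stats["rejection_findings_omitted"] = omitted
        stats["rejection_findings_truncated"] = True
    return findings, stats
```

For example, 12 rejections with the default cap produce 10 listed entries plus one note, with `rejection_findings_omitted` set to 2.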

Errors

Unreadable input path, OS read failure, or unknown text encoding raises errors.InputError.
Non-file path (e.g. a directory) raises errors.ParseError.
Invalid min_confidence (not a number, non-finite, or outside 0.0–1.0) raises errors.ParseError or errors.ValidationError.
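
The three `min_confidence` failure cases (not a number, non-finite, out of range) can be checked as in the sketch below. The `ValidationError` class here is a local stand-in so the example runs on its own; which of `errors.ParseError` and `errors.ValidationError` the real code raises in each case is not specified, and `check_min_confidence` is a hypothetical name.

```python
import math


class ValidationError(ValueError):
    """Local stand-in for errors.ValidationError (assumption for this sketch)."""


def check_min_confidence(value) -> float:
    """Return value as a float if it is a finite number in 0.0-1.0."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValidationError(f"min_confidence must be a number, got {value!r}")
    if isinstance(value, float) and not math.isfinite(value):
        raise ValidationError("min_confidence must be finite")
    if not 0 <= value <= 1:
        raise ValidationError("min_confidence must be between 0.0 and 1.0")
    return float(value)
```

Callers that pass user-supplied configuration can catch the error and report it without a traceback.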

Tests

cd contact_extractor
python -m pytest tests/ -v
