A desktop application for PDF text extraction, font-aware layout analysis, and citation detection, designed for academic and technical documents.
- Text extraction with exact span-to-PDF coordinate mapping
- Font-aware layout analysis (font name, size, style)
- Superscript & bracket citation detection engine
- Bibliography parsing with false-positive suppression
- Modern GUI with synchronized PDF/Text views
PDF (pdfplumber)
↓
PageData / LineData / CharData
↓
Citation Channels (Superscript / Bracket)
↓
Fusion Engine (confidence scoring + filtering)
↓
RefEntry / Occurrence
↓
GUI (PDF ↔ Text ↔ Citation sync)
- All GUI updates run on the main thread
- Background tasks are cancelable via job-id invalidation
- Image rendering uses LRU cache to cap memory usage
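
The LRU cap on rendered images can be illustrated with a minimal sketch. The class name `LRUCache`, its methods, and the idea of keying entries by page number are assumptions made for illustration, not the application's actual code.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: evicts the least recently used entry once full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

    def __len__(self):
        return len(self._items)


# Hypothetical usage: keys are page numbers, values stand in for rendered images.
cache = LRUCache(capacity=2)
cache.put(1, "image-1")
cache.put(2, "image-2")
cache.get(1)             # page 1 becomes the most recently used entry
cache.put(3, "image-3")  # page 2 is evicted, so at most two images stay in memory
```

Bounding the number of cached entries keeps memory use roughly proportional to `capacity` rather than to the page count of the document.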
- Superscript channel: geometric + font-size based detection
- Bracket channel: `[n]`, `(n)` style inline citations
- Strict line-head ID matching (`^\s*(\[(\d+)\]|(\d+)\.)`)
- Year-number filtering (1900–2099) to prevent ID pollution
- `max_id_multiplier` false-citation upper bound
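
As a rough illustration of the line-head pattern and the year filter, the sketch below applies the regular expression quoted above to bibliography lines and drops any number in the 1900–2099 range. The function names, the inline pattern, and the decision to apply the year filter to both inline and line-head numbers are assumptions. The `max_id_multiplier` bound is not shown because its exact semantics are not stated here.

```python
import re

# Line-head pattern as quoted in the list above: "[12] ..." or "12. ..."
LINE_HEAD_ID = re.compile(r"^\s*(\[(\d+)\]|(\d+)\.)")

# Assumed inline pattern for "[n]" and "(n)" style citations.
INLINE_ID = re.compile(r"\[(\d+)\]|\((\d+)\)")


def is_year_like(number):
    """Numbers in 1900..2099 are treated as publication years, not IDs."""
    return 1900 <= number <= 2099


def parse_line_head_id(line):
    """Return the reference ID at the start of a line, or None."""
    match = LINE_HEAD_ID.match(line)
    if match is None:
        return None
    number = int(match.group(2) or match.group(3))
    return None if is_year_like(number) else number


def find_inline_ids(text):
    """Return inline citation IDs in order of appearance, skipping years."""
    ids = []
    for match in INLINE_ID.finditer(text):
        number = int(match.group(1) or match.group(2))
        if not is_year_like(number):
            ids.append(number)
    return ids


print(parse_line_head_id("[12] A. Author. Some title."))          # 12
print(parse_line_head_id("2019. Proceedings of ..."))              # None
print(find_inline_ids("as shown in [3] and (12), cf. Doe (2019)"))  # [3, 12]
```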
- When bibliography is reliable (≥ N entries), unlinked citations are penalized but not discarded
- Small or missing bibliographies automatically disable penalties
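
The penalty rule can be sketched as a small scoring function. The names `score_citation`, `min_reliable_entries`, and `unlinked_penalty`, and their default values, are placeholders chosen for illustration. The source does not state the actual threshold N or the penalty size.

```python
def score_citation(base_confidence, cited_id, bibliography_ids,
                   min_reliable_entries=5, unlinked_penalty=0.5):
    """Return an adjusted confidence for one detected citation.

    Assumed behavior: a citation whose ID has no bibliography entry is
    down-weighted, never dropped, and only when the bibliography is
    large enough to be trusted.
    """
    reliable = len(bibliography_ids) >= min_reliable_entries
    if reliable and cited_id not in bibliography_ids:
        return base_confidence * unlinked_penalty
    return base_confidence


bibliography = {1, 2, 3, 4, 5, 6}
print(score_citation(0.8, 3, bibliography))  # linked: confidence unchanged
print(score_citation(0.8, 9, bibliography))  # unlinked: penalized, still kept
print(score_citation(0.8, 9, {1, 2}))        # small bibliography: no penalty
```

Scaling the confidence instead of discarding the citation leaves the final keep-or-drop decision to the later filtering step.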
- Thread-safe background execution (no Tk access in workers)
- Job ID mechanism prevents stale callbacks from overwriting state
- PDF handle caching with guaranteed release on exit/errors
- Debug reports never include raw document text
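
A minimal sketch of job-ID invalidation follows, assuming workers hand results to the main thread through a queue and never touch the GUI themselves. `JobTracker`, `run_in_background`, and `drain_results` are hypothetical names, not the application's API.

```python
import queue
import threading


class JobTracker:
    """Hands out increasing job IDs; only the newest one counts as current."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = 0

    def start(self):
        with self._lock:
            self._current += 1
            return self._current

    def is_current(self, job_id):
        with self._lock:
            return job_id == self._current


def run_in_background(tracker, results, work, *args):
    """Run `work` in a worker thread; the worker only posts to the queue."""
    job_id = tracker.start()

    def worker():
        # No GUI access here: the result is handed over through the queue.
        results.put((job_id, work(*args)))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return job_id, thread


def drain_results(tracker, results, apply):
    """Call from the main thread; applies only results of the current job."""
    applied = 0
    while True:
        try:
            job_id, value = results.get_nowait()
        except queue.Empty:
            return applied
        if tracker.is_current(job_id):
            apply(value)
            applied += 1
        # Results tagged with an older job ID are silently dropped.
```

In a Tk application, `drain_results` would typically be polled from the main loop (for example via the widget `after` method), so stale results from a superseded job never overwrite state.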
- Run from source: `python app_gui.py`
- Prebuilt executable: `dist_exe/PDFTextRecognizer.exe`
- No Python environment required
Included test scripts:
- `test_citation_improvements.py` – citation logic validation
- `reconciliation_check.py` – configuration & import-path verification
All tests must pass before deployment.
- Not intended for OCR (scanned PDFs)
- Not a reference manager or citation formatter
- Focused on structural correctness and traceability, not heuristics-only extraction
MIT License