PDF Text Recognizer Pro

A desktop application for PDF text extraction, font-aware layout analysis, and citation detection, designed for academic and technical documents.

Features

Core Capabilities

Text extraction with exact span-to-PDF coordinate mapping
Font-aware layout analysis (font name, size, style)
Superscript & bracket citation detection engine
Bibliography parsing with false-positive suppression
Modern GUI with synchronized PDF/Text views

Architecture

PDF (pdfplumber)
   ↓
PageData / LineData / CharData
   ↓
Citation Channels (Superscript / Bracket)
   ↓
Fusion Engine (confidence scoring + filtering)
   ↓
RefEntry / Occurrence
   ↓
GUI (PDF ↔ Text ↔ Citation sync)
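
The data layer in the diagram can be pictured with a small sketch. The class names come from the pipeline above, but every field and the `span_bbox` helper are assumptions made for illustration, not the project's actual definitions. The sketch assumes one `CharData` per text character, which is what makes a text span map directly to PDF coordinates.

```python
from dataclasses import dataclass, field


@dataclass
class CharData:
    # Hypothetical fields, loosely modelled on pdfplumber's per-character dicts.
    text: str
    x0: float
    top: float
    x1: float
    bottom: float
    fontname: str = ""
    size: float = 0.0


@dataclass
class LineData:
    chars: list = field(default_factory=list)

    @property
    def text(self):
        # One CharData per character, so string indices equal char indices.
        return "".join(c.text for c in self.chars)

    def span_bbox(self, start, end):
        """Return (x0, top, x1, bottom) covering self.text[start:end]."""
        span = self.chars[start:end]
        if not span:
            raise ValueError("empty span")
        return (
            min(c.x0 for c in span),
            min(c.top for c in span),
            max(c.x1 for c in span),
            max(c.bottom for c in span),
        )


@dataclass
class PageData:
    number: int
    lines: list = field(default_factory=list)
```

With this layout, highlighting a text selection in the PDF view reduces to calling `span_bbox` with the selection's start and end offsets within the line.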

All GUI updates run on the main thread
Background tasks are cancelable via job-id invalidation
Image rendering uses an LRU cache to cap memory usage
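
As an illustration of the memory cap, here is a minimal LRU cache for rendered page images. The class name `LRUImageCache`, its default capacity, and the `(page, zoom)` key shape are assumptions for the sketch; the application's own cache may differ.

```python
from collections import OrderedDict


class LRUImageCache:
    """Hold at most `capacity` rendered images; evict the least recently used."""

    def __init__(self, capacity=8):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, image):
        self._items[key] = image
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._items)
```

Keying on something like `(page_number, zoom)` keeps re-renders at a new zoom level from colliding with cached images at the old one.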

Citation Engine Design

Detection Channels

Superscript channel: geometric + font-size based detection
Bracket channel: [n], (n) style inline citations
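
The two channels could look roughly like the sketch below. It is not the project's implementation: the function names, the `size_ratio` threshold, and the median-based baseline are assumptions. Characters are plain dicts using the `text`, `size`, and `bottom` keys that pdfplumber reports, with `bottom` measured from the top of the page.

```python
import re
from statistics import median

# Matches [n] or (n) with a purely numeric body.
BRACKET_RE = re.compile(r"\[(\d+)\]|\((\d+)\)")


def find_bracket_citations(text):
    """Return (start, end, citation_id) for each [n] or (n) found in text."""
    return [
        (m.start(), m.end(), int(m.group(1) or m.group(2)))
        for m in BRACKET_RE.finditer(text)
    ]


def find_superscript_digits(chars, size_ratio=0.8):
    """Return digit chars that are both smaller and raised relative to the line.

    `size_ratio` is a hypothetical threshold: a char qualifies when its font
    size is at most size_ratio times the line's median size and its bottom
    edge sits above the line's median bottom edge.
    """
    if not chars:
        return []
    body_size = median(c["size"] for c in chars)
    baseline = median(c["bottom"] for c in chars)
    return [
        c for c in chars
        if c["text"].isdigit()
        and c["size"] <= size_ratio * body_size
        and c["bottom"] < baseline
    ]
```

Note that `(n)` also matches list numbering and equation labels, which is one reason candidates from this channel need later scoring and filtering rather than being accepted outright.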

Bibliography Handling

Strict line-head ID matching (^\s*(\[(\d+)\]|(\d+)\.))
Year-number filtering (1900–2099) to prevent ID pollution
max_id_multiplier upper bound on citation IDs to reject false citations
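
A minimal sketch of these three rules follows. The regular expression is the one quoted above. The helper names are hypothetical, and the reading of `max_id_multiplier` (a citation ID is rejected when it exceeds the number of bibliography entries times the multiplier) is an assumption, as is its default value.

```python
import re

# Line-head ID pattern from the README: "[12] ..." or "12. ..."
LINE_HEAD_ID = re.compile(r"^\s*(\[(\d+)\]|(\d+)\.)")


def parse_entry_id(line):
    """Return the bibliography entry ID at the head of a line, or None."""
    m = LINE_HEAD_ID.match(line)
    if m is None:
        return None
    entry_id = int(m.group(2) or m.group(3))
    # Year-like numbers (1900-2099) are treated as dates, not entry IDs.
    if 1900 <= entry_id <= 2099:
        return None
    return entry_id


def is_plausible_citation(cite_id, entry_count, max_id_multiplier=2.0):
    """Assumed semantics: reject IDs far above the bibliography size."""
    if entry_count == 0:
        return True  # assumption: with no bibliography there is no bound
    return 1 <= cite_id <= entry_count * max_id_multiplier
```

Anchoring the pattern at the line head is what keeps an inline "see [4]" inside an entry's text from being read as a new entry.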

Soft Constraint System

When the bibliography is reliable (≥ N entries), unlinked citations are penalized but not discarded
Small or missing bibliographies automatically disable penalties
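
The soft constraint can be sketched as a scoring adjustment. The README does not state the value of N or the size of the penalty, so `min_reliable_entries` and `unlinked_penalty` below are placeholder parameters, and `score_citation` is a hypothetical name.

```python
def score_citation(base_confidence, cite_id, bibliography_ids,
                   min_reliable_entries=5, unlinked_penalty=0.3):
    """Lower the confidence of a citation with no matching bibliography entry.

    The penalty applies only when the bibliography has at least
    `min_reliable_entries` entries; the candidate is never dropped here.
    """
    reliable = len(bibliography_ids) >= min_reliable_entries
    if reliable and cite_id not in bibliography_ids:
        return max(0.0, base_confidence - unlinked_penalty)
    return base_confidence
```

Because the result is a reduced score rather than a rejection, a later filtering step can still keep an unlinked citation that had strong evidence from the detection channels.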

Reliability & Safety

Thread-safe background execution (no Tk access in workers)
Job ID mechanism prevents stale callbacks from overwriting state
PDF handle caching with guaranteed release on exit/errors
Debug reports never include raw document text
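
One way to combine the first two points is sketched below: workers only push results onto a queue, and the main thread discards anything whose job ID is no longer current. `JobRunner` and its methods are hypothetical names, not the application's API.

```python
import queue
import threading


class JobRunner:
    """Run work in background threads; deliver only the latest job's result."""

    def __init__(self):
        self._current_job = 0          # touched only from the main thread
        self._results = queue.Queue()  # the only object shared with workers

    def submit(self, work, *args):
        """Start a new job; any job submitted earlier becomes stale."""
        self._current_job += 1
        job_id = self._current_job

        def run():
            # Workers never touch GUI objects; they only enqueue results.
            self._results.put((job_id, work(*args)))

        thread = threading.Thread(target=run, daemon=True)
        thread.start()
        return thread

    def cancel(self):
        """Invalidate the running job by advancing the current job ID."""
        self._current_job += 1

    def poll(self, on_result):
        """Drain finished jobs on the main thread, skipping stale ones."""
        while True:
            try:
                job_id, result = self._results.get_nowait()
            except queue.Empty:
                return
            if job_id == self._current_job:
                on_result(result)
```

In a Tk application, `poll` would typically be rescheduled from the main loop with the widget `after` method, so `on_result` is free to update widgets.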

Build & Run

Run from source

python app_gui.py

Windows executable

Prebuilt executable: dist_exe/PDFTextRecognizer.exe
No Python environment required

Testing & Verification

Included test scripts:

test_citation_improvements.py – citation logic validation
reconciliation_check.py – configuration & import-path verification

All tests must pass before deployment.

Scope & Non-Goals

Not intended for OCR (scanned PDFs)
Not a reference manager or citation formatter
Focused on structural correctness and traceability, not heuristics-only extraction

License

MIT License

