Skip to content

feat: integrate docling for document data extraction#76

Draft
Abdeali099 wants to merge 27 commits intoversion-15from
use-docling-to-extract-data
Draft

feat: integrate docling for document data extraction#76
Abdeali099 wants to merge 27 commits intoversion-15from
use-docling-to-extract-data

Conversation

@Abdeali099
Copy link
Copy Markdown
Member

@Abdeali099 Abdeali099 commented Mar 16, 2026

  • https://www.docling.ai/

  • Implement docling support

  • User can selected PDF Processor from Transaction Parser Settings

  • To add a new PDF Processor add it too hook and options of settings

Note

OCR currently disable. Will enable in future with proper package

  • no-docs

@vorasmit
Copy link
Copy Markdown
Member

@greptileai

@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Mar 16, 2026

Confidence Score: 4/5

  • This PR is safe to merge; the substantive issues from the previous review cycle have all been addressed, with only a minor type annotation inaccuracy remaining.
  • The core logic is sound: lazy imports, proper stream management, converter caching, and idempotent patching are all correctly implemented. The one remaining new finding is a misleading str | None return type on FileProcessor.get_content — a cosmetic type annotation issue that has no runtime impact. The two known accepted trade-offs (hard docling dependency, hardcoded dropdown options) are deliberate product decisions. No critical logic or security issues were found.
  • No files require special attention beyond the minor type annotation fix in transaction_parser/transaction_parser/utils/file_processor.py.

Important Files Changed

Filename Overview
transaction_parser/transaction_parser/utils/pdf_processor.py New file introducing PDFProcessor ABC, DoclingPDFProcessor, OCRMyPDFProcessor, and the get_pdf_processor factory. Most issues flagged in the previous review cycle (lazy imports, stream seek resets, InputFormat enum key, converter caching, status checks) have been properly addressed in this revision.
transaction_parser/transaction_parser/utils/file_processor.py Refactored to delegate PDF processing to the new PDFProcessor strategy. PDF/spreadsheet logic is now cleanly separated. Minor: get_content is annotated `str
transaction_parser/patches/set_default_pdf_processor.py Idempotent patch that sets pdf_processor to OCRMyPDF only when no value is currently stored, preventing the unconditional overwrite flagged in the prior review.
transaction_parser/patches/init.py Empty __init__.py added to make transaction_parser.patches a proper Python package, fixing the previously reported ModuleNotFoundError during bench migrate.
transaction_parser/transaction_parser/ai_integration/parser.py Tightened get_content return type from `dict
transaction_parser/hooks.py Adds pdf_processors hook dict mapping processor names to class paths — clean extension point for third-party apps.
pyproject.toml Adds docling>=2.75.0 as a mandatory install-time dependency. The developer has opted to keep it as a hard dependency (the optional-extras approach was discussed in the previous review and declined).
transaction_parser/transaction_parser/doctype/transaction_parser_settings/transaction_parser_settings.json Adds pdf_processor Select field with hardcoded OCRMyPDF\nDocling options. The extensibility limitation (hook-registered processors not appearing in the dropdown) is a known accepted trade-off to be addressed in a follow-up PR via Property Setter.
transaction_parser/transaction_parser/init.py Adds now=frappe.conf.developer_mode to the background job enqueue call so that parsing runs synchronously in development. Deliberate choice documented and accepted by the team.
transaction_parser/patches.txt Registers the new set_default_pdf_processor patch under [post_model_sync]. The #2 comment on the prior patch and #1 here appear to be ordering notes, though this numbering scheme should be documented if it has semantic meaning.

Last reviewed commit: 9f02aaf

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099 Abdeali099 self-assigned this Mar 17, 2026
@Abdeali099 Abdeali099 added P1 Highest Priority enhancement New feature or request labels Mar 17, 2026
@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

@Abdeali099
Copy link
Copy Markdown
Member Author

@greptileai

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request P1 Highest Priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants