feat: integrate docling for document data extraction #76
Draft
Abdeali099 wants to merge 27 commits into version-15
…ced document handling
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| transaction_parser/transaction_parser/utils/pdf_processor.py | New file introducing PDFProcessor ABC, DoclingPDFProcessor, OCRMyPDFProcessor, and the get_pdf_processor factory. Most issues flagged in the previous review cycle (lazy imports, stream seek resets, InputFormat enum key, converter caching, status checks) have been properly addressed in this revision. |
| transaction_parser/transaction_parser/utils/file_processor.py | Refactored to delegate PDF processing to the new PDFProcessor strategy. PDF/spreadsheet logic is now cleanly separated. Minor: get_content is annotated `str |
| transaction_parser/patches/set_default_pdf_processor.py | Idempotent patch that sets pdf_processor to OCRMyPDF only when no value is currently stored, preventing the unconditional overwrite flagged in the prior review. |
| `transaction_parser/patches/__init__.py` | Empty `__init__.py` added to make transaction_parser.patches a proper Python package, fixing the previously reported ModuleNotFoundError during bench migrate. |
| transaction_parser/transaction_parser/ai_integration/parser.py | Tightened get_content return type from `dict |
| transaction_parser/hooks.py | Adds pdf_processors hook dict mapping processor names to class paths — clean extension point for third-party apps. |
| pyproject.toml | Adds docling>=2.75.0 as a mandatory install-time dependency. The developer has opted to keep it as a hard dependency (the optional-extras approach was discussed in the previous review and declined). |
| transaction_parser/transaction_parser/doctype/transaction_parser_settings/transaction_parser_settings.json | Adds a pdf_processor Select field whose options are hardcoded as `OCRMyPDF\nDocling`. The extensibility limitation (hook-registered processors do not appear in the dropdown) is a known, accepted trade-off to be addressed in a follow-up PR via Property Setter. |
| `transaction_parser/transaction_parser/__init__.py` | Adds now=frappe.conf.developer_mode to the background job enqueue call so that parsing runs synchronously in development. Deliberate choice documented and accepted by the team. |
| transaction_parser/patches.txt | Registers the new set_default_pdf_processor patch under [post_model_sync]. The #2 comment on the prior patch and #1 here appear to be ordering notes, though this numbering scheme should be documented if it has semantic meaning. |
Last reviewed commit: 9f02aaf
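The strategy-plus-factory layout summarized in the table can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only the names `PDFProcessor` and `get_pdf_processor` come from the review, while `EchoPDFProcessor`, `load_class`, the `process` method, the `registry` and `loader` parameters, and the error messages are assumptions made for the example. The registry stands in for the `pdf_processors` hook (processor name to dotted class path).

```python
from abc import ABC, abstractmethod
from importlib import import_module
from io import BytesIO
from typing import Callable


class PDFProcessor(ABC):
    """Strategy interface: turn a PDF byte stream into text."""

    @abstractmethod
    def process(self, stream: BytesIO) -> str:
        """Return the extracted text for `stream`."""


class EchoPDFProcessor(PDFProcessor):
    """Hypothetical stand-in; a real processor would call OCRMyPDF or Docling."""

    def process(self, stream: BytesIO) -> str:
        stream.seek(0)  # rewind in case an earlier reader consumed the stream
        size = len(stream.read())
        stream.seek(0)  # leave the stream rewound for the next consumer
        return f"{size} bytes"


def load_class(dotted_path: str) -> type:
    """Import lazily, so a heavy backend is loaded only when it is selected."""
    module_path, _, class_name = dotted_path.rpartition(".")
    return getattr(import_module(module_path), class_name)


def get_pdf_processor(
    name: str,
    registry: dict[str, str],
    loader: Callable[[str], type] = load_class,
) -> PDFProcessor:
    """Resolve `name` through a name -> class-path mapping and instantiate it."""
    if name not in registry:
        raise ValueError(f"Unknown PDF processor: {name!r}")
    cls = loader(registry[name])
    if not (isinstance(cls, type) and issubclass(cls, PDFProcessor)):
        raise TypeError(f"{registry[name]} is not a PDFProcessor subclass")
    return cls()
```

Keeping the import inside `load_class` means a processor's dependencies are only imported when that processor is actually chosen, which is one way to get the lazy-import behavior the review mentions. The `loader` argument exists mainly so the lookup can be exercised without a real installed module.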
https://www.docling.ai/
Implement docling support
Users can select the PDF Processor from Transaction Parser Settings.
To add a new PDF Processor, add it to the `pdf_processors` hook and to the options of the settings field.
Note
OCR is currently disabled. It will be enabled in the future with a proper package.
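Because a new processor has to be added in two places (the hook and the Select options), a small consistency check can catch a processor that was registered but is not selectable. The sketch below is illustrative only: `my_app`, `MyPDFProcessor`, `SETTINGS_OPTIONS`, and `missing_options` are hypothetical names, and the options string simply mirrors the newline-separated `OCRMyPDF\nDocling` value of the settings field.

```python
# Hypothetical hooks.py entry in a third-party app; `my_app` and the class
# path are illustrative names, not part of this PR.
pdf_processors = {
    "My Processor": "my_app.pdf.MyPDFProcessor",
}

# Select options are stored as one newline-separated string.
SETTINGS_OPTIONS = "OCRMyPDF\nDocling"


def missing_options(hook: dict[str, str], select_options: str) -> list[str]:
    """Return hook-registered names that the Select field does not list yet."""
    listed = {line.strip() for line in select_options.splitlines() if line.strip()}
    return [name for name in hook if name not in listed]
```

With the values above, `missing_options(pdf_processors, SETTINGS_OPTIONS)` reports `"My Processor"` until that name is appended to the field's options.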
no-docs