feat: add benchmark test configs and doctypes#77
feat: add benchmark test configs and doctypes#77Abdeali099 wants to merge 105 commits intouse-docling-to-extract-datafrom
Conversation
Confidence Score: 2/5Not safe to merge — the public parse endpoint is broken due to a keyword argument name mismatch that causes a TypeError on every background job. A P0 defect in
|
| Filename | Overview |
|---|---|
| transaction_parser/transaction_parser/init.py | Refactored to support multiple file URLs; introduces a critical parameter name mismatch (file_url vs file_urls) in frappe.enqueue that will break every background parse job, plus document_name: file (a Document object) in the duplicate-entry notification path. |
| transaction_parser/parser_benchmark/runner.py | New benchmark runner implementing file-parsing, AI-parsing, cost calculation, and accuracy scoring steps; correctly uses tracemalloc for memory measurement and handles multi-file scenarios by mirroring the controller's merge flow. |
| transaction_parser/parser_benchmark/scorer.py | New recursive scorer with proper normalization, weighted accuracy, and mismatch reporting; logic is sound and handles nested dicts, lists, and numeric precision correctly. |
| transaction_parser/transaction_parser/utils/response_merger.py | New schema-driven ResponseMerger that fills missing primitive/object fields from subsequent file responses and deduplicates list items by match keys; implementation is clean and correct. |
| transaction_parser/transaction_parser/overrides/communication.py | Refactored to match email accounts before checking party emails (fixing an ordering issue), adds process_one_document_per_communication mode, and filters unsupported extensions before enqueueing. |
| transaction_parser/transaction_parser/ai_integration/parser.py | Exposes ai_response on AIParser for downstream use by the benchmark runner, adds company parameter to parse()/_build_messages(), and initializes ai_response = {} to avoid AttributeError before first call. |
| transaction_parser/transaction_parser/controllers/transaction.py | Refactored to accept a list of files, introducing _parse_single_file, _parse_multiple_files, and get_match_keys; _attach_file updated to attach all files to the created document. |
Comments Outside Diff (2)
-
transaction_parser/transaction_parser/__init__.py, line 27 (link)Parameter name mismatch breaks all background jobs
frappe.enqueuepasses the argument asfile_url=cstr(file_url)(line 27), but_parsewas renamed to expectfile_urls(line 39). When the background worker calls_parse(file_url=..., ...), it will raiseTypeError: _parse() got an unexpected keyword argument 'file_url', making every invocation through the publicparse()endpoint fail. -
transaction_parser/transaction_parser/__init__.py, line 109-112 (link)Error notification uses raw
filevariable (Document object, not string)In the duplicate-entry exception branch,
fileis now the lastfrappe.Documentobject assigned inside the loop (for file_name in file_names: file = frappe.get_doc(...)). The notification sets"document_name": file(a Document object, not a name string) and"subject": _("Duplicate entry found for {0}").format(file_urls)(which formats the raw list, e.g.['url1', 'url2']). Before this refactor,file.namewas used. This will produce a malformed notification.
Reviews (16): Last reviewed commit: "refactor: patch for process one document..." | Re-trigger Greptile
…urement in file parsing
…ics and benchmarking
…rder and Purchase Invoice
…d version comparison reports
…sion comparison reports
…prove response merging
…rty and update company context role hints in prompts
| def generate( | ||
| self, file, ai_model: str | None = None, page_limit: int | None = None | ||
| self, | ||
| files, |
| if isinstance(files, str): | ||
| files = [files] | ||
|
|
||
| self.files = files |
| document_schema=schema, | ||
| document_data=content, | ||
| file_doc_name=self.file.name, | ||
| file_doc_name=file.name, |
There was a problem hiding this comment.
Need to update benchmark
| def get_match_keys(self) -> dict[str, list[str]]: | ||
| """Return list field name -> key fields used to match items during merge.""" | ||
| return { | ||
| "item_list": ["party_item_code", "quantity", "rate", "description"], |
There was a problem hiding this comment.
if we hard code party_item_code and someone change item schema like company_item_code then?
…etting and update modified timestamp
…case in notification
…and enhance error logging
No description provided.