fix(#132): sort CC transactions by inferring year from Payment Due date#133
Merged
fix(#132): sort CC transactions by inferring year from Payment Due date#133
Conversation
Two related bugs in RowMergerService caused empty rows in CC output: 1. Ref: continuation lines — AIB CC PDFs emit a 'Ref: <digits>' line after each transaction. Without a specific classifier, TransactionClassifier picked these up as transactions (they have a date) and emitted phantom empty rows. Added RefContinuationClassifier (priority 3) to catch the Ref: pattern before TransactionClassifier runs. 2. Y-split date rows — some transactions have their Transaction Date word at a slightly different Y-coordinate, causing RowBuilder to split the transaction into a date-only row + a dateless row with the actual description/amount. Added date-only split detection in merge_continuation_lines: when a transaction row contains only date-column values and the next row is a transaction with no date, carry the date forward and collapse them. Tests: added TestRefContinuationClassifier unit tests, chain integration test for Ref: classification, and two RowMerger integration tests covering both patterns.
Yearless dates (e.g. "3 Feb") from AIB CC statements failed to parse, causing all transactions to fall back to epoch and sort in undefined order. - PageHeaderAnalyser.extract_statement_year(): scans full page 1 text for "Payment Due" / "Payment Due Date: DD Mon YYYY" and returns the year - ExtractionResult gains statement_year: int | None field - PDFTableExtractor.extract() restructured: pre-scans page 1 to extract card number and statement year before building the row processor; warns when year cannot be determined - RowPostProcessor stamps statement_year onto each transaction row so it flows into Transaction.additional_fields - DateParserService gains YEARLESS_DATE_FORMATS (%d %b, %d %B) and _parse_yearless_date(date_str, hint_year); parse_transaction_date() accepts optional hint_year parameter - ChronologicalSortingStrategy reads statement_year from additional_fields and passes it as hint_year to the date parser 39 new tests across test_date_parser.py, test_page_header_analyser.py, test_sorting_service.py, and test_row_post_processor.py
Extract _is_date_only_split, _collect_continuations, and _handle_orphaned_continuation helpers to reduce cyclomatic complexity of merge_continuation_lines from D (23) to B (8). Also fix pre-existing isort ordering in parser-free tests.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
3 Febafter column alias mappingDateParserServicehad no format for%d %bso every CC date fell back to epoch, producing undefined sort orderPayment Due/Payment Due Datefield on page 1 of the PDF and propagated through the pipeline to the sort stepChanges
PageHeaderAnalyser.extract_statement_year()— scans full page 1 text forPayment Due [Date]: DD Mon YYYYand returns the year asint | None; logs a warning when not foundExtractionResult— newstatement_year: int | None = NonefieldPDFTableExtractor.extract()— restructured to pre-scan page 1 (card number + statement year) before building the row processor, so the year is known at construction timeRowPostProcessor— acceptsstatement_yearparam, stampsrow["statement_year"]on every transaction row so it flows intoTransaction.additional_fieldsDateParserService— addsYEARLESS_DATE_FORMATS(%d %b,%d %B),_parse_yearless_date(date_str, hint_year), andhint_yearparam onparse_transaction_date()ChronologicalSortingStrategy— readsstatement_yearfromtx.additional_fieldsand passes ashint_yearto the date parserType
Testing
make docker-integrationpassed locally (required when touchingDockerfile,entrypoint.sh,docker-compose.yml, orpackages/parser-core/)New tests: 39 across 4 files
tests/services/test_date_parser.py(new — 22 tests for yearless parsing and hint_year)tests/extraction/test_page_header_analyser.py(+9 tests forextract_statement_year)tests/services/test_sorting_service.py(+5 tests for yearless date sort order)tests/extraction/test_row_post_processor.py(+3 tests forstatement_yearstamping)Full suite: 1523 passed, 9 skipped (all pre-existing)
Checklist
Downstream impact
bankstatements_core(exported class, function, or exception)ExtractionResultgainsstatement_year: int | None = None(optional field, fully backward-compatible)DateParserService.parse_transaction_date()gainshint_year: int | None = None(optional param, fully backward-compatible)