fix(#132): sort CC transactions by inferring year from Payment Due date by longieirl · Pull Request #133 · longieirl/bankstatementprocessor

longieirl · 2026-04-09T14:24:45Z

Summary

- CC statement transactions (e.g. AIB CC) carry yearless dates such as 3 Feb after column alias mapping.
- DateParserService had no format for %d %b, so every CC date fell back to the epoch and the sort order was undefined.
- The year is now extracted from the Payment Due / Payment Due Date field on page 1 of the PDF and propagated through the pipeline to the sort step.
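The failure mode can be reproduced in a few lines. This is only a sketch: KNOWN_FORMATS and parse_or_epoch are hypothetical stand-ins, since the format list and fallback logic inside DateParserService are not shown in this PR description.

```python
from datetime import datetime

# Hypothetical stand-in for the formats known before this fix; the real list
# in DateParserService is not shown in this PR description.
KNOWN_FORMATS = ["%d/%m/%Y", "%d %b %Y", "%d-%m-%Y"]
EPOCH = datetime(1970, 1, 1)


def parse_or_epoch(date_str: str) -> datetime:
    """Try each known format; fall back to the epoch when none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return EPOCH


rows = ["14 Feb", "3 Feb", "28 Jan"]
keys = [parse_or_epoch(r) for r in rows]

# Every key is the same epoch value, so the sort has nothing to compare on
# and the rows stay in whatever order extraction produced them.
all_epoch = all(k == EPOCH for k in keys)
sorted_rows = sorted(rows, key=parse_or_epoch)
```

Because all sort keys are equal, the output order is whatever the input order happened to be, not a chronological one.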

Changes

- PageHeaderAnalyser.extract_statement_year(): scans full page 1 text for Payment Due [Date]: DD Mon YYYY and returns the year as int | None; logs a warning when not found
- ExtractionResult: new statement_year: int | None = None field
- PDFTableExtractor.extract(): restructured to pre-scan page 1 (card number + statement year) before building the row processor, so the year is known at construction time
- RowPostProcessor: accepts statement_year param, stamps row["statement_year"] on every transaction row so it flows into Transaction.additional_fields
- DateParserService: adds YEARLESS_DATE_FORMATS (%d %b, %d %B), _parse_yearless_date(date_str, hint_year), and hint_year param on parse_transaction_date()
- ChronologicalSortingStrategy: reads statement_year from tx.additional_fields and passes it as hint_year to the date parser
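The page 1 scan could look roughly like the following. This is a minimal sketch written as a free function: the regex, the logger name, and the warning text are assumptions, and the actual pattern in PageHeaderAnalyser may differ.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed pattern: "Payment Due" or "Payment Due Date", an optional colon,
# then a date such as "3 Mar 2026". The real regex may differ.
PAYMENT_DUE_RE = re.compile(
    r"Payment\s+Due(?:\s+Date)?\s*:?\s*\d{1,2}\s+[A-Za-z]{3,9}\s+(\d{4})",
    re.IGNORECASE,
)


def extract_statement_year(page_text: str) -> Optional[int]:
    """Return the year of the Payment Due date, or None when it is absent."""
    match = PAYMENT_DUE_RE.search(page_text)
    if match is None:
        logger.warning("Payment Due date not found on page 1; statement year unknown")
        return None
    return int(match.group(1))
```

Returning None instead of raising lets the rest of the pipeline continue when a statement has no Payment Due field; downstream code then has to handle the missing year.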

Type

Bug fix

Testing

Tests pass (coverage ≥ 91%)
Manually tested
make docker-integration passed locally (required when touching Dockerfile, entrypoint.sh, docker-compose.yml, or packages/parser-core/)

New tests: 39 across 4 files

tests/services/test_date_parser.py (new — 22 tests for yearless parsing and hint_year)
tests/extraction/test_page_header_analyser.py (+9 tests for extract_statement_year)
tests/services/test_sorting_service.py (+5 tests for yearless date sort order)
tests/extraction/test_row_post_processor.py (+3 tests for statement_year stamping)

Full suite: 1523 passed, 9 skipped (all pre-existing)

Checklist

Code follows project style
Self-reviewed
Documentation updated (if needed)
No new warnings

Downstream impact

This PR changes a public interface in bankstatements_core (exported class, function, or exception)
- ExtractionResult gains statement_year: int | None = None (optional field, fully backward-compatible)
- DateParserService.parse_transaction_date() gains hint_year: int | None = None (optional param, fully backward-compatible)
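
Why a defaulted field is backward-compatible can be shown with a small dataclass. This is a sketch only: the transactions field is a hypothetical placeholder, and ExtractionResult may not be a dataclass in bankstatements_core.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractionResult:
    # "transactions" is a hypothetical placeholder for the existing fields.
    transactions: List[dict] = field(default_factory=list)
    # New in this PR: defaults to None, so existing call sites need no change.
    statement_year: Optional[int] = None


# A caller written before this PR keeps working unchanged.
old_style = ExtractionResult(transactions=[{"Details": "Coffee"}])
# A caller that knows the year can pass it explicitly.
new_style = ExtractionResult(transactions=[], statement_year=2026)
```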

Two related bugs in RowMergerService caused empty rows in CC output:

1. Ref: continuation lines. AIB CC PDFs emit a 'Ref: <digits>' line after each transaction. Without a specific classifier, TransactionClassifier picked these up as transactions (they have a date) and emitted phantom empty rows. Added RefContinuationClassifier (priority 3) to catch the Ref: pattern before TransactionClassifier runs.
2. Y-split date rows. Some transactions have their Transaction Date word at a slightly different Y-coordinate, causing RowBuilder to split the transaction into a date-only row + a dateless row with the actual description/amount. Added date-only split detection in merge_continuation_lines: when a transaction row contains only date-column values and the next row is a transaction with no date, carry the date forward and collapse them.

Tests: added TestRefContinuationClassifier unit tests, chain integration test for Ref: classification, and two RowMerger integration tests covering both patterns.
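
Both patterns can be illustrated on plain dict rows. This is a rough sketch, not the RowMergerService code: the column names, the helper names, and the choice to drop Ref: lines outright (the real merger may instead attach them to the preceding transaction) are all assumptions.

```python
import re
from typing import Dict, List

Row = Dict[str, str]

# Hypothetical column names; the real ones come from the statement template.
DATE_COL = "Transaction Date"
DETAILS_COL = "Details"
REF_RE = re.compile(r"^\s*Ref:\s*\d+")


def is_ref_continuation(row: Row) -> bool:
    """True when the row's description is a 'Ref: <digits>' line."""
    return bool(REF_RE.match(row.get(DETAILS_COL, "")))


def is_date_only(row: Row) -> bool:
    """True when only the date column holds a value."""
    return bool(row.get(DATE_COL)) and not any(
        value for key, value in row.items() if key != DATE_COL
    )


def merge_rows(rows: List[Row]) -> List[Row]:
    merged: List[Row] = []
    i = 0
    while i < len(rows):
        row = rows[i]
        if is_ref_continuation(row):
            # Assumption: skip the Ref: line rather than emit it as a transaction.
            i += 1
            continue
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        if (
            is_date_only(row)
            and nxt is not None
            and not nxt.get(DATE_COL)
            and not is_ref_continuation(nxt)
        ):
            # Carry the date forward and collapse the two rows into one.
            merged.append({**nxt, DATE_COL: row[DATE_COL]})
            i += 2
            continue
        merged.append(row)
        i += 1
    return merged


rows = [
    {"Transaction Date": "3 Feb", "Details": "TESCO", "Amount": "12.50"},
    {"Transaction Date": "3 Feb", "Details": "Ref: 123456", "Amount": ""},
    {"Transaction Date": "4 Feb", "Details": "", "Amount": ""},
    {"Transaction Date": "", "Details": "AMAZON", "Amount": "30.00"},
]
result = merge_rows(rows)
```

Here the four extracted rows collapse to two transactions: the Ref: line disappears and the date-only row donates its date to the dateless row that follows it.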

Yearless dates (e.g. "3 Feb") from AIB CC statements failed to parse, causing all transactions to fall back to epoch and sort in undefined order.

- PageHeaderAnalyser.extract_statement_year(): scans full page 1 text for "Payment Due" / "Payment Due Date: DD Mon YYYY" and returns the year
- ExtractionResult gains statement_year: int | None field
- PDFTableExtractor.extract() restructured: pre-scans page 1 to extract card number and statement year before building the row processor; warns when year cannot be determined
- RowPostProcessor stamps statement_year onto each transaction row so it flows into Transaction.additional_fields
- DateParserService gains YEARLESS_DATE_FORMATS (%d %b, %d %B) and _parse_yearless_date(date_str, hint_year); parse_transaction_date() accepts optional hint_year parameter
- ChronologicalSortingStrategy reads statement_year from additional_fields and passes it as hint_year to the date parser

39 new tests across test_date_parser.py, test_page_header_analyser.py, test_sorting_service.py, and test_row_post_processor.py
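
One way the hint_year path and the sort step might fit together is sketched below. It assumes an English locale for month names and models transactions as plain dicts; parse_yearless_date and sort_key are illustrative free functions, not the methods on DateParserService or ChronologicalSortingStrategy.

```python
from datetime import datetime
from typing import Optional

# The yearless formats named in the PR description.
YEARLESS_DATE_FORMATS = ["%d %b", "%d %B"]
EPOCH = datetime(1970, 1, 1)


def parse_yearless_date(date_str: str, hint_year: Optional[int]) -> Optional[datetime]:
    """Parse 'D Mon' or 'D Month' using hint_year; None when it cannot be parsed."""
    if hint_year is None:
        return None
    for fmt in YEARLESS_DATE_FORMATS:
        try:
            # Append the year before parsing so that "29 Feb" is validated
            # against the hinted year rather than strptime's default of 1900.
            return datetime.strptime(f"{date_str.strip()} {hint_year}", f"{fmt} %Y")
        except ValueError:
            continue
    return None


def sort_key(tx: dict) -> datetime:
    """Hypothetical sort key: read statement_year and use it as the year hint."""
    year = tx.get("additional_fields", {}).get("statement_year")
    return parse_yearless_date(tx["date"], year) or EPOCH


txs = [
    {"date": "14 Feb", "additional_fields": {"statement_year": 2026}},
    {"date": "3 Feb", "additional_fields": {"statement_year": 2026}},
    {"date": "28 Jan", "additional_fields": {"statement_year": 2026}},
]
ordered = [tx["date"] for tx in sorted(txs, key=sort_key)]
```

Parsing the date together with the year, rather than parsing first and replacing the year afterwards, keeps leap-day handling correct. A statement that spans a year boundary (December transactions with a January due date) is not handled by this sketch.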

Extract _is_date_only_split, _collect_continuations, and _handle_orphaned_continuation helpers to reduce cyclomatic complexity of merge_continuation_lines from D (23) to B (8). Also fix pre-existing isort ordering in parser-free tests.

web-flow added 2 commits April 9, 2026 14:53

longieirl self-assigned this Apr 9, 2026

github-actions bot added the bug Something isn't working label Apr 9, 2026

web-flow added 2 commits April 9, 2026 15:35

fix: remove unused type: ignore comments flagged by mypy

92bcb5e

longieirl merged commit 57b743d into main Apr 9, 2026
10 checks passed

longieirl deleted the fix/132-cc-yearless-date-sorting branch April 9, 2026 14:42

longieirl merged 4 commits into main from fix/132-cc-yearless-date-sorting
