
fix(#132): sort CC transactions by inferring year from Payment Due date#133

Merged
longieirl merged 4 commits into main from fix/132-cc-yearless-date-sorting
Apr 9, 2026


Conversation

@longieirl
Owner

Summary

  • CC statement transactions (e.g. AIB CC) use yearless dates like 3 Feb after column alias mapping
  • DateParserService had no format for %d %b, so every CC date fell back to the epoch, leaving the sort order undefined
  • Year is now extracted from the Payment Due / Payment Due Date field on page 1 of the PDF and propagated through the pipeline to the sort step

Changes

  • PageHeaderAnalyser.extract_statement_year() — scans full page 1 text for Payment Due [Date]: DD Mon YYYY and returns the year as int | None; logs a warning when not found
  • ExtractionResult — new statement_year: int | None = None field
  • PDFTableExtractor.extract() — restructured to pre-scan page 1 (card number + statement year) before building the row processor, so the year is known at construction time
  • RowPostProcessor — accepts statement_year param, stamps row["statement_year"] on every transaction row so it flows into Transaction.additional_fields
  • DateParserService — adds YEARLESS_DATE_FORMATS (%d %b, %d %B), _parse_yearless_date(date_str, hint_year), and hint_year param on parse_transaction_date()
  • ChronologicalSortingStrategy — reads statement_year from tx.additional_fields and passes as hint_year to the date parser
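
The year extraction described above can be sketched as a single regular-expression scan over the page 1 text. This is a minimal illustration, not the code in this PR: the standalone function, the exact pattern, and the warning text are assumptions; only the method name, the `int | None` return type, and the Payment Due [Date]: DD Mon YYYY shape come from the description.

```python
from __future__ import annotations

import logging
import re

logger = logging.getLogger(__name__)

# Assumed pattern: "Payment Due" or "Payment Due Date", an optional colon,
# then a day, a month name (abbreviated or full), and a four-digit year.
_PAYMENT_DUE_RE = re.compile(
    r"Payment\s+Due(?:\s+Date)?\s*:?\s*\d{1,2}\s+[A-Za-z]{3,9}\s+(\d{4})",
    re.IGNORECASE,
)


def extract_statement_year(page_text: str) -> int | None:
    """Return the year of the first Payment Due date in the text, or None."""
    match = _PAYMENT_DUE_RE.search(page_text)
    if match is None:
        # Mirrors the described behaviour: warn when the year is not found.
        logger.warning("Payment Due date not found on page 1; statement year unknown")
        return None
    return int(match.group(1))
```

Returning `None` rather than raising lets callers keep the previous behaviour for statements that carry no Payment Due field.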

Type

  • Bug fix

Testing

  • Tests pass (coverage ≥ 91%)
  • Manually tested
  • make docker-integration passed locally (required when touching Dockerfile, entrypoint.sh, docker-compose.yml, or packages/parser-core/)

New tests: 39 across 4 files

  • tests/services/test_date_parser.py (new — 22 tests for yearless parsing and hint_year)
  • tests/extraction/test_page_header_analyser.py (+9 tests for extract_statement_year)
  • tests/services/test_sorting_service.py (+5 tests for yearless date sort order)
  • tests/extraction/test_row_post_processor.py (+3 tests for statement_year stamping)

Full suite: 1523 passed, 9 skipped (all pre-existing)

Checklist

  • Code follows project style
  • Self-reviewed
  • Documentation updated (if needed)
  • No new warnings

Downstream impact

  • This PR changes a public interface in bankstatements_core (exported class, function, or exception)
    • ExtractionResult gains statement_year: int | None = None (optional field, fully backward-compatible)
    • DateParserService.parse_transaction_date() gains hint_year: int | None = None (optional param, fully backward-compatible)
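
To illustrate why the new parameter is backward-compatible, here is a hedged sketch of a parser that only tries yearless formats when a `hint_year` is supplied. The standalone function, the `FULL_DATE_FORMATS` tuple, and the `EPOCH` constant are assumptions for illustration; the real `DateParserService` may be organised differently.

```python
from __future__ import annotations

from datetime import datetime

EPOCH = datetime(1970, 1, 1)  # assumed fallback value for unparseable dates
FULL_DATE_FORMATS = ("%d %b %Y", "%d %B %Y", "%d/%m/%Y")  # assumed existing formats
YEARLESS_DATE_FORMATS = ("%d %b", "%d %B")


def parse_transaction_date(date_str: str, hint_year: int | None = None) -> datetime:
    """Parse a transaction date, using hint_year for yearless strings."""
    text = date_str.strip()
    for fmt in FULL_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    if hint_year is not None:
        for fmt in YEARLESS_DATE_FORMATS:
            try:
                # Append the year before parsing so that 29 Feb is validated
                # against the hinted year, not strptime's default year.
                return datetime.strptime(f"{text} {hint_year}", f"{fmt} %Y")
            except ValueError:
                continue
    return EPOCH
```

Callers that never pass `hint_year` take exactly the same path as before, so a yearless string still falls back to the epoch for them.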

web-flow added 2 commits April 9, 2026 14:53
Two related bugs in RowMergerService caused empty rows in CC output:

1. Ref: continuation lines — AIB CC PDFs emit a 'Ref: <digits>' line
   after each transaction. Without a specific classifier, TransactionClassifier
   picked these up as transactions (they have a date) and emitted phantom
   empty rows. Added RefContinuationClassifier (priority 3) to catch the
   Ref: pattern before TransactionClassifier runs.

2. Y-split date rows — some transactions have their Transaction Date word
   at a slightly different Y-coordinate, causing RowBuilder to split the
   transaction into a date-only row + a dateless row with the actual
   description/amount. Added date-only split detection in merge_continuation_lines:
   when a transaction row contains only date-column values and the next row
   is a transaction with no date, carry the date forward and collapse them.
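
Both patterns can be sketched on plain dict rows. This is an illustration under assumptions, not the RowMergerService code: the column names, the regular expression, and the function names below are hypothetical.

```python
from __future__ import annotations

import re

DATE_COLUMNS = ("Transaction Date", "Posting Date")  # assumed column names
REF_RE = re.compile(r"^\s*Ref:\s*\d+\s*$")


def is_ref_continuation(row: dict[str, str]) -> bool:
    """True when some cell is a bare 'Ref: <digits>' line."""
    return any(REF_RE.match(value) for value in row.values() if value)


def is_date_only(row: dict[str, str]) -> bool:
    """True when every non-empty cell sits in a date column."""
    filled = [col for col, value in row.items() if value.strip()]
    return bool(filled) and all(col in DATE_COLUMNS for col in filled)


def has_no_date(row: dict[str, str]) -> bool:
    """True when all date columns are empty or missing."""
    return not any(row.get(col, "").strip() for col in DATE_COLUMNS)


def collapse_date_only_splits(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Merge a date-only row into the dateless row that follows it."""
    merged: list[dict[str, str]] = []
    i = 0
    while i < len(rows):
        row = rows[i]
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        if nxt is not None and is_date_only(row) and has_no_date(nxt):
            combined = dict(nxt)
            for col in DATE_COLUMNS:
                if row.get(col, "").strip():
                    combined[col] = row[col]  # carry the date forward
            merged.append(combined)
            i += 2  # both halves of the split are consumed
        else:
            merged.append(row)
            i += 1
    return merged
```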

Tests: added TestRefContinuationClassifier unit tests, chain integration
test for Ref: classification, and two RowMerger integration tests covering
both patterns.

Yearless dates (e.g. "3 Feb") from AIB CC statements failed to parse,
causing all transactions to fall back to epoch and sort in undefined order.

- PageHeaderAnalyser.extract_statement_year(): scans full page 1 text for
  "Payment Due" / "Payment Due Date: DD Mon YYYY" and returns the year
- ExtractionResult gains statement_year: int | None field
- PDFTableExtractor.extract() restructured: pre-scans page 1 to extract
  card number and statement year before building the row processor; warns
  when year cannot be determined
- RowPostProcessor stamps statement_year onto each transaction row so it
  flows into Transaction.additional_fields
- DateParserService gains YEARLESS_DATE_FORMATS (%d %b, %d %B) and
  _parse_yearless_date(date_str, hint_year); parse_transaction_date()
  accepts optional hint_year parameter
- ChronologicalSortingStrategy reads statement_year from additional_fields
  and passes it as hint_year to the date parser
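
The last step, reading the stamped year back out at sort time, might look like the sketch below. The `Tx` dataclass is a hypothetical stand-in for `Transaction`, and the inline parsing is simplified to one format; only the `statement_year` key, `additional_fields`, and the epoch fallback come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime

EPOCH = datetime(1970, 1, 1)  # fallback when the date cannot be resolved


@dataclass
class Tx:
    """Hypothetical stand-in for Transaction."""

    date: str
    additional_fields: dict[str, object] = field(default_factory=dict)


def sort_key(tx: Tx) -> datetime:
    """Build a sortable datetime from a yearless date plus the stamped year."""
    year = tx.additional_fields.get("statement_year")
    if not isinstance(year, int):
        return EPOCH
    try:
        return datetime.strptime(f"{tx.date.strip()} {year}", "%d %b %Y")
    except ValueError:
        return EPOCH


def sort_chronologically(txs: list[Tx]) -> list[Tx]:
    return sorted(txs, key=sort_key)
```

Because the year travels with each row, the sorting step needs no access to the PDF or the extraction result.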

39 new tests across test_date_parser.py, test_page_header_analyser.py,
test_sorting_service.py, and test_row_post_processor.py
@longieirl longieirl self-assigned this Apr 9, 2026
@github-actions github-actions bot added the bug Something isn't working label Apr 9, 2026
web-flow added 2 commits April 9, 2026 15:35
Extract _is_date_only_split, _collect_continuations, and
_handle_orphaned_continuation helpers to reduce cyclomatic complexity
of merge_continuation_lines from D (23) to B (8). Also fix pre-existing
isort ordering in parser-free tests.
@longieirl longieirl merged commit 57b743d into main Apr 9, 2026
10 checks passed
@longieirl longieirl deleted the fix/132-cc-yearless-date-sorting branch April 9, 2026 14:42

Labels

bug Something isn't working


2 participants