Skip to content

RFC: Encapsulate date-threading state in StatefulPageRowProcessor #31

@longieirl

Description

@longieirl

Summary

RowPostProcessor.process(row, current_date) -> str both mutates row["Date"] in-place and returns an updated current_date for the caller to thread through the extraction loop. The caller in pdf_extractor.py must reassign current_date on every row — if that assignment is omitted, date propagation silently regresses. When a page is skipped (_extract_page returns None), the stale current_date from the previous page bleeds into the next valid page with no signal and no test coverage.

This RFC proposes a StatefulPageRowProcessor wrapper that encapsulates the threading entirely and makes both failure modes impossible or assertable.


Problem

Current extraction loop (pdf_extractor.py:83–128)

rows: list[dict] = []
current_date = ""                          # manual state variable

post_processor = RowPostProcessor(
    columns=self.columns,
    row_classifier=self._row_classifier,
    template=self.template,
    filename_date=filename_date,
    filename=pdf_path.name,
)

with self._pdf_reader.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        page_rows = self._extract_page(page, page_num)
        if page_rows is None:
            continue                       # current_date bleeds silently

        for row in page_rows:
            current_date = post_processor.process(row, current_date)   # reassignment required
            if row:
                rows.append(row)

Failure mode 1 — forgotten reassignment

# Silent regression — current_date never advances
post_processor.process(row, current_date)   # return value discarded

Nothing in the type system or linter catches this. Every dateless row silently gets the first date ever seen.

Failure mode 2 — skipped-page date bleed

If pages 3–5 are skipped (invalid table structure, no headers) and page 6 has dateless rows, those rows inherit the date from page 2 with no signal. This is sometimes correct (dates span pages) but untestable — no unit test verifies cross-page date state under skipped pages.

RowPostProcessor.process() — dual responsibility

def process(self, row: dict, current_date: str) -> str:
    # 1. Classifies row; skips non-transactions
    # 2. Mutates row["Date"] in-place
    # 3. Returns new current_date for caller to reassign

Mutation and return value must both be used correctly. This is a hidden contract.

Current test gap

test_row_post_processor.py tests RowPostProcessor in isolation. There is no test for:

  • A sequence of rows across pages where a page is skipped
  • That date state does not bleed incorrectly after a skip
  • That the source of a filled date (propagated vs filename fallback) is correct

Proposed Interface

StatefulPageRowProcessor

A thin wrapper around the existing RowPostProcessor. RowPostProcessor itself is unchanged.

class StatefulPageRowProcessor:
    """Encapsulates date-threading state for the page extraction loop.

    Wraps RowPostProcessor so callers process an entire page's rows
    in one call. Date state is held internally and advanced automatically;
    the caller cannot forget the reassignment or corrupt state with a
    skipped page.
    """

    def __init__(self, post_processor: RowPostProcessor) -> None:
        """
        Args:
            post_processor: Fully configured RowPostProcessor instance.
        """
        ...

    def process_page(self, page_rows: list[dict]) -> list[dict]:
        """Process all rows on a page, threading date state automatically.

        Args:
            page_rows: Rows from a single page (may be empty).

        Returns:
            Rows with metadata tagged (Filename, document_type, template_id,
            Date filled where missing).
        """
        ...

    def current_date(self) -> str:
        """Most recently seen date. For logging and test assertions."""
        ...

    def last_date_source(self) -> str:
        """Source of the last date fill.

        Returns one of: 'row_data' | 'propagated' | 'filename' | 'initial'

        Allows tests to assert not just that a date was filled, but why.
        """
        ...

    def reset(self, to: str = "") -> None:
        """Reset internal date state. Call between PDFs in a batch.

        Args:
            to: Date to seed with (default: empty string for a clean start).
        """
        ...

Updated extraction loop (pdf_extractor.py)

# Before (6 lines of manual threading)
for row in page_rows:
    current_date = post_processor.process(row, current_date)
    if row:
        rows.append(row)

# After (1 line, no reassignment, no state variable in caller)
rows.extend(page_processor.process_page(page_rows))

Full loop after refactor:

filename_date = extract_filename_date(pdf_path.name)
post_processor = RowPostProcessor(
    columns=self.columns,
    row_classifier=self._row_classifier,
    template=self.template,
    filename_date=filename_date,
    filename=pdf_path.name,
)
page_processor = StatefulPageRowProcessor(post_processor)

with self._pdf_reader.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        # ... credit card check, IBAN extraction unchanged ...

        page_rows = self._extract_page(page, page_num)
        if page_rows is None:
            continue              # wrapper not called → state not corrupted

        rows.extend(page_processor.process_page(page_rows))

The current_date local variable disappears entirely from pdf_extractor.py.


What this enables in tests

Skipped-page bleed is now assertable

def test_date_preserved_across_skipped_pages():
    """Date from page 1 propagates to page 3 when page 2 is skipped."""
    processor = StatefulPageRowProcessor(_make_post_processor())

    page1_rows = [{"Date": "05/01/2024", "Details": "Tx1", "Debit €": "", "Credit €": "", "Balance €": ""}]
    processor.process_page(page1_rows)
    assert processor.current_date() == "05/01/2024"

    # Page 2 skipped — do not call process_page()

    page3_rows = [{"Date": "", "Details": "Tx2", "Debit €": "", "Credit €": "", "Balance €": ""}]
    result = processor.process_page(page3_rows)

    assert result[0]["Date"] == "05/01/2024"
    assert processor.last_date_source() == "propagated"


def test_no_date_bleed_after_reset():
    """reset() clears internal state between PDFs in a batch."""
    processor = StatefulPageRowProcessor(_make_post_processor(filename_date="01 Jan 2025"))

    processor.process_page([{"Date": "15/03/2025", "Details": "Tx", "Debit €": "", "Credit €": "", "Balance €": ""}])
    assert processor.current_date() == "15/03/2025"

    processor.reset()
    assert processor.current_date() == ""

    processor.process_page([{"Date": "", "Details": "Tx2", "Debit €": "", "Credit €": "", "Balance €": ""}])
    assert processor.last_date_source() == "filename"


def test_date_source_row_data():
    """Row with its own date records source as row_data."""
    processor = StatefulPageRowProcessor(_make_post_processor())
    processor.process_page([{"Date": "10/02/2024", "Details": "Tx", "Debit €": "", "Credit €": "", "Balance €": ""}])
    assert processor.last_date_source() == "row_data"


def test_date_source_filename_fallback():
    """Dateless row with no prior date falls back to filename_date."""
    processor = StatefulPageRowProcessor(_make_post_processor(filename_date="01 Feb 2024"))
    processor.process_page([{"Date": "", "Details": "Tx", "Debit €": "", "Credit €": "", "Balance €": ""}])
    assert processor.last_date_source() == "filename"

Misuse is now structurally prevented

Old misuse New state
post_processor.process(row, current_date) — return discarded No return value to discard; process_page() returns the rows
current_date local var not reassigned No local var; state is internal
Skipped page corrupts current_date Wrapper not called on skipped pages; state unchanged

What stays unchanged

Component Change
RowPostProcessor constructor and process() signature Minor: add last_source side-effect attribute to expose date source to wrapper
extract_filename_date() None
test_row_post_processor.py existing tests None — all continue to pass
RowBuilder, PDFTableExtractor.__init__ None

What is explicitly out of scope

  • Per-row observability audit trail (ProcessingMetadata, DateSource enum, observer callback) — real value but a separate concern that would change RowPostProcessor's constructor and return type
  • Changes to boundary detection or the classifier chain

Implementation steps

  1. Add last_source tracking to RowPostProcessor.process() — set self._last_source as a side-effect attribute ("row_data", "propagated", "filename") each time a transaction row is processed. Return type stays str. This is the only change to RowPostProcessor.

  2. Add StatefulPageRowProcessor to row_post_processor.py

    • __init__: stores post_processor, initialises _current_date = "", _last_date_source = "initial"
    • process_page(page_rows): inner loop calling post_processor.process(row, self._current_date), reassigning self._current_date, reading post_processor._last_source into self._last_date_source, collecting non-empty rows
    • current_date(): getter
    • last_date_source(): getter
    • reset(to=""): sets _current_date = to, _last_date_source = "initial"
  3. Update PDFTableExtractor.extract() — remove current_date local var and inner loop; add StatefulPageRowProcessor; replace inner loop with rows.extend(page_processor.process_page(page_rows))

  4. Add 4 tests to test_row_post_processor.py — the four tests shown above


Coverage impact

Current: 92% (1288 tests). Four new tests, no files deleted, RowPostProcessor minimally changed. Net: +4 tests covering the previously-untestable cross-page date state.


Files affected

File Change
src/bankstatements_core/extraction/row_post_processor.py Add StatefulPageRowProcessor; add _last_source side-effect to RowPostProcessor.process()
src/bankstatements_core/extraction/pdf_extractor.py Replace inner loop + current_date var with StatefulPageRowProcessor
tests/extraction/test_row_post_processor.py Add 4 new tests

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions