RFC: Encapsulate date-threading state in StatefulPageRowProcessor

## Summary

`RowPostProcessor.process(row, current_date) -> str` both mutates `row["Date"]` in-place and returns an updated `current_date` for the caller to thread through the extraction loop. The caller in `pdf_extractor.py` must reassign `current_date` on every row — if that assignment is omitted, date propagation silently regresses. When a page is skipped (`_extract_page` returns `None`), the stale `current_date` from the previous page bleeds into the next valid page with no signal and no test coverage.

This RFC proposes a `StatefulPageRowProcessor` wrapper that encapsulates the threading entirely and makes both failure modes impossible or assertable.

---

## Problem

### Current extraction loop (`pdf_extractor.py:83–128`)

```python
rows: list[dict] = []
current_date = ""                          # manual state variable

post_processor = RowPostProcessor(
    columns=self.columns,
    row_classifier=self._row_classifier,
    template=self.template,
    filename_date=filename_date,
    filename=pdf_path.name,
)

with self._pdf_reader.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        page_rows = self._extract_page(page, page_num)
        if page_rows is None:
            continue                       # current_date bleeds silently

        for row in page_rows:
            current_date = post_processor.process(row, current_date)   # reassignment required
            if row:
                rows.append(row)
```

### Failure mode 1 — forgotten reassignment

```python
# Silent regression — current_date never advances
post_processor.process(row, current_date)   # return value discarded
```

Nothing in the type system or linter catches this. Every dateless row silently gets the first date ever seen.

### Failure mode 2 — skipped-page date bleed

If pages 3–5 are skipped (invalid table structure, no headers) and page 6 has dateless rows, those rows inherit the date from page 2 with no signal. This is sometimes correct (dates span pages) but **untestable** — no unit test verifies cross-page date state under skipped pages.

### `RowPostProcessor.process()` — dual responsibility

```python
def process(self, row: dict, current_date: str) -> str:
    # 1. Classifies row; skips non-transactions
    # 2. Mutates row["Date"] in-place
    # 3. Returns new current_date for caller to reassign
```

Mutation and return value must both be used correctly. This is a hidden contract.

### Current test gap

`test_row_post_processor.py` tests `RowPostProcessor` in isolation. There is no test for:
- A sequence of rows across pages where a page is skipped
- That date state does not bleed incorrectly after a skip
- That the source of a filled date (propagated vs filename fallback) is correct

---

## Proposed Interface

### `StatefulPageRowProcessor`

A thin wrapper around the existing `RowPostProcessor`. **`RowPostProcessor` itself is unchanged.**

```python
class StatefulPageRowProcessor:
    """Encapsulates date-threading state for the page extraction loop.

    Wraps RowPostProcessor so callers process an entire page's rows
    in one call. Date state is held internally and advanced automatically;
    the caller cannot forget the reassignment or corrupt state with a
    skipped page.
    """

    def __init__(self, post_processor: RowPostProcessor) -> None:
        """
        Args:
            post_processor: Fully configured RowPostProcessor instance.
        """
        ...

    def process_page(self, page_rows: list[dict]) -> list[dict]:
        """Process all rows on a page, threading date state automatically.

        Args:
            page_rows: Rows from a single page (may be empty).

        Returns:
            Rows with metadata tagged (Filename, document_type, template_id,
            Date filled where missing).
        """
        ...

    def current_date(self) -> str:
        """Most recently seen date. For logging and test assertions."""
        ...

    def last_date_source(self) -> str:
        """Source of the last date fill.

        Returns one of: 'row_data' | 'propagated' | 'filename' | 'initial'

        Allows tests to assert not just that a date was filled, but why.
        """
        ...

    def reset(self, to: str = "") -> None:
        """Reset internal date state. Call between PDFs in a batch.

        Args:
            to: Date to seed with (default: empty string for a clean start).
        """
        ...
```

### Updated extraction loop (`pdf_extractor.py`)

```python
# Before (6 lines of manual threading)
for row in page_rows:
    current_date = post_processor.process(row, current_date)
    if row:
        rows.append(row)

# After (1 line, no reassignment, no state variable in caller)
rows.extend(page_processor.process_page(page_rows))
```

Full loop after refactor:

```python
filename_date = extract_filename_date(pdf_path.name)
post_processor = RowPostProcessor(
    columns=self.columns,
    row_classifier=self._row_classifier,
    template=self.template,
    filename_date=filename_date,
    filename=pdf_path.name,
)
page_processor = StatefulPageRowProcessor(post_processor)

with self._pdf_reader.open(pdf_path) as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        # ... credit card check, IBAN extraction unchanged ...

        page_rows = self._extract_page(page, page_num)
        if page_rows is None:
            continue              # wrapper not called → state not corrupted

        rows.extend(page_processor.process_page(page_rows))
```

The `current_date` local variable disappears entirely from `pdf_extractor.py`.

---

## What this enables in tests

### Skipped-page bleed is now assertable

```python
def test_date_preserved_across_skipped_pages():
    """Date from page 1 propagates to page 3 when page 2 is skipped."""
    processor = StatefulPageRowProcessor(_make_post_processor())

    page1_rows = [{"Date": "05/01/2024", "Details": "Tx1", "Debit €": "", "Credit €": "", "Balance €": ""}]
    processor.process_page(page1_rows)
    assert processor.current_date() == "05/01/2024"

    # Page 2 skipped — do not call process_page()

    page3_rows = [{"Date": "", "Details": "Tx2", "Debit €": "", "Credit €": "", "Balance €": ""}]
    result = processor.process_page(page3_rows)

    assert result[0]["Date"] == "05/01/2024"
    assert processor.last_date_source() == "propagated"


def test_no_date_bleed_after_reset():
    """reset() clears internal state between PDFs in a batch."""
    processor = StatefulPageRowProcessor(_make_post_processor(filename_date="01 Jan 2025"))

    processor.process_page([{"Date": "15/03/2025", "Details": "Tx", "Debit €": "", "Credit €": "", "Balance €": ""}])
    assert processor.current_date() == "15/03/2025"

    processor.reset()
    assert processor.current_date() == ""

    processor.process_page([{"Date": "", "Details": "Tx2", "Debit €": "", "Credit €": "", "Balance €": ""}])
    assert processor.last_date_source() == "filename"


def test_date_source_row_data():
    """Row with its own date records source as row_data."""
    processor = StatefulPageRowProcessor(_make_post_processor())
    processor.process_page([{"Date": "10/02/2024", "Details": "Tx", "Debit €": "", "Credit €": "", "Balance €": ""}])
    assert processor.last_date_source() == "row_data"


def test_date_source_filename_fallback():
    """Dateless row with no prior date falls back to filename_date."""
    processor = StatefulPageRowProcessor(_make_post_processor(filename_date="01 Feb 2024"))
    processor.process_page([{"Date": "", "Details": "Tx", "Debit €": "", "Credit €": "", "Balance €": ""}])
    assert processor.last_date_source() == "filename"
```

### Misuse is now structurally prevented

| Old misuse | New state |
|---|---|
| `post_processor.process(row, current_date)` — return discarded | No return value to discard; `process_page()` returns the rows |
| `current_date` local var not reassigned | No local var; state is internal |
| Skipped page corrupts `current_date` | Wrapper not called on skipped pages; state unchanged |

---

## What stays unchanged

| Component | Change |
|---|---|
| `RowPostProcessor` constructor and `process()` signature | Minor: add `last_source` side-effect attribute to expose date source to wrapper |
| `extract_filename_date()` | None |
| `test_row_post_processor.py` existing tests | None — all continue to pass |
| `RowBuilder`, `PDFTableExtractor.__init__` | None |

---

## What is explicitly out of scope

- Per-row observability audit trail (`ProcessingMetadata`, `DateSource` enum, observer callback) — real value but a separate concern that would change `RowPostProcessor`'s constructor and return type
- Changes to boundary detection or the classifier chain

---

## Implementation steps

1. **Add `last_source` tracking to `RowPostProcessor.process()`** — set `self._last_source` as a side-effect attribute (`"row_data"`, `"propagated"`, `"filename"`) each time a transaction row is processed. Return type stays `str`. This is the only change to `RowPostProcessor`.

2. **Add `StatefulPageRowProcessor` to `row_post_processor.py`**
   - `__init__`: stores `post_processor`, initialises `_current_date = ""`, `_last_date_source = "initial"`
   - `process_page(page_rows)`: inner loop calling `post_processor.process(row, self._current_date)`, reassigning `self._current_date`, reading `post_processor._last_source` into `self._last_date_source`, collecting non-empty rows
   - `current_date()`: getter
   - `last_date_source()`: getter
   - `reset(to="")`: sets `_current_date = to`, `_last_date_source = "initial"`

3. **Update `PDFTableExtractor.extract()`** — remove `current_date` local var and inner loop; add `StatefulPageRowProcessor`; replace inner loop with `rows.extend(page_processor.process_page(page_rows))`

4. **Add 4 tests to `test_row_post_processor.py`** — the four tests shown above

---

## Coverage impact

Current: 92% (1288 tests). Four new tests, no files deleted, `RowPostProcessor` minimally changed. Net: +4 tests covering the previously-untestable cross-page date state.

---

## Files affected

| File | Change |
|---|---|
| `src/bankstatements_core/extraction/row_post_processor.py` | Add `StatefulPageRowProcessor`; add `_last_source` side-effect to `RowPostProcessor.process()` |
| `src/bankstatements_core/extraction/pdf_extractor.py` | Replace inner loop + `current_date` var with `StatefulPageRowProcessor` |
| `tests/extraction/test_row_post_processor.py` | Add 4 new tests |


File	Change
`src/bankstatements_core/extraction/row_post_processor.py`	Add `StatefulPageRowProcessor`; add `_last_source` side-effect to `RowPostProcessor.process()`
`src/bankstatements_core/extraction/pdf_extractor.py`	Replace inner loop + `current_date` var with `StatefulPageRowProcessor`
`tests/extraction/test_row_post_processor.py`	Add 4 new tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Encapsulate date-threading state in StatefulPageRowProcessor #31

Summary

Problem

Current extraction loop (`pdf_extractor.py:83–128`)

Failure mode 1 — forgotten reassignment

Failure mode 2 — skipped-page date bleed

`RowPostProcessor.process()` — dual responsibility

Current test gap

Proposed Interface

`StatefulPageRowProcessor`

Updated extraction loop (`pdf_extractor.py`)

What this enables in tests

Skipped-page bleed is now assertable

Misuse is now structurally prevented

What stays unchanged

What is explicitly out of scope

Implementation steps

Coverage impact

Files affected

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Old misuse	New state
`post_processor.process(row, current_date)` — return discarded	No return value to discard; `process_page()` returns the rows
`current_date` local var not reassigned	No local var; state is internal
Skipped page corrupts `current_date`	Wrapper not called on skipped pages; state unchanged

Component	Change
`RowPostProcessor` constructor and `process()` signature	Minor: add `last_source` side-effect attribute to expose date source to wrapper
`extract_filename_date()`	None
`test_row_post_processor.py` existing tests	None — all continue to pass
`RowBuilder`, `PDFTableExtractor.__init__`	None

RFC: Encapsulate date-threading state in StatefulPageRowProcessor #31

Description

Summary

Problem

Current extraction loop (pdf_extractor.py:83–128)

Failure mode 1 — forgotten reassignment

Failure mode 2 — skipped-page date bleed

RowPostProcessor.process() — dual responsibility

Current test gap

Proposed Interface

StatefulPageRowProcessor

Updated extraction loop (pdf_extractor.py)

What this enables in tests

Skipped-page bleed is now assertable

Misuse is now structurally prevented

What stays unchanged

What is explicitly out of scope

Implementation steps

Coverage impact

Files affected

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Current extraction loop (`pdf_extractor.py:83–128`)

`RowPostProcessor.process()` — dual responsibility

`StatefulPageRowProcessor`

Updated extraction loop (`pdf_extractor.py`)