Summary
RowPostProcessor.process(row, current_date) -> str both mutates row["Date"] in-place and returns an updated current_date for the caller to thread through the extraction loop. The caller in pdf_extractor.py must reassign current_date on every row — if that assignment is omitted, date propagation silently regresses. When a page is skipped (_extract_page returns None), the stale current_date from the previous page bleeds into the next valid page with no signal and no test coverage.
This RFC proposes a StatefulPageRowProcessor wrapper that encapsulates the threading entirely and makes both failure modes impossible or assertable.
Problem
Current extraction loop (pdf_extractor.py:83–128)
rows: list[dict] = []
current_date = "" # manual state variable
post_processor = RowPostProcessor(
columns=self.columns,
row_classifier=self._row_classifier,
template=self.template,
filename_date=filename_date,
filename=pdf_path.name,
)
with self._pdf_reader.open(pdf_path) as pdf:
for page_num, page in enumerate(pdf.pages, 1):
page_rows = self._extract_page(page, page_num)
if page_rows is None:
continue # current_date bleeds silently
for row in page_rows:
current_date = post_processor.process(row, current_date) # reassignment required
if row:
rows.append(row)
Failure mode 1 — forgotten reassignment
# Silent regression — current_date never advances
post_processor.process(row, current_date) # return value discarded
Nothing in the type system or linter catches this. Every dateless row silently gets the first date ever seen.
Failure mode 2 — skipped-page date bleed
If pages 3–5 are skipped (invalid table structure, no headers) and page 6 has dateless rows, those rows inherit the date from page 2 with no signal. This is sometimes correct (dates span pages) but untestable — no unit test verifies cross-page date state under skipped pages.
RowPostProcessor.process() — dual responsibility
def process(self, row: dict, current_date: str) -> str:
# 1. Classifies row; skips non-transactions
# 2. Mutates row["Date"] in-place
# 3. Returns new current_date for caller to reassign
Mutation and return value must both be used correctly. This is a hidden contract.
Current test gap
test_row_post_processor.py tests RowPostProcessor in isolation. There is no test for:
- A sequence of rows across pages where a page is skipped
- That date state does not bleed incorrectly after a skip
- That the source of a filled date (propagated vs filename fallback) is correct
Proposed Interface
StatefulPageRowProcessor
A thin wrapper around the existing RowPostProcessor. RowPostProcessor itself is unchanged.
class StatefulPageRowProcessor:
"""Encapsulates date-threading state for the page extraction loop.
Wraps RowPostProcessor so callers process an entire page's rows
in one call. Date state is held internally and advanced automatically;
the caller cannot forget the reassignment or corrupt state with a
skipped page.
"""
def __init__(self, post_processor: RowPostProcessor) -> None:
"""
Args:
post_processor: Fully configured RowPostProcessor instance.
"""
...
def process_page(self, page_rows: list[dict]) -> list[dict]:
"""Process all rows on a page, threading date state automatically.
Args:
page_rows: Rows from a single page (may be empty).
Returns:
Rows with metadata tagged (Filename, document_type, template_id,
Date filled where missing).
"""
...
def current_date(self) -> str:
"""Most recently seen date. For logging and test assertions."""
...
def last_date_source(self) -> str:
"""Source of the last date fill.
Returns one of: 'row_data' | 'propagated' | 'filename' | 'initial'
Allows tests to assert not just that a date was filled, but why.
"""
...
def reset(self, to: str = "") -> None:
"""Reset internal date state. Call between PDFs in a batch.
Args:
to: Date to seed with (default: empty string for a clean start).
"""
...
Updated extraction loop (pdf_extractor.py)
# Before (6 lines of manual threading)
for row in page_rows:
current_date = post_processor.process(row, current_date)
if row:
rows.append(row)
# After (1 line, no reassignment, no state variable in caller)
rows.extend(page_processor.process_page(page_rows))
Full loop after refactor:
filename_date = extract_filename_date(pdf_path.name)
post_processor = RowPostProcessor(
columns=self.columns,
row_classifier=self._row_classifier,
template=self.template,
filename_date=filename_date,
filename=pdf_path.name,
)
page_processor = StatefulPageRowProcessor(post_processor)
with self._pdf_reader.open(pdf_path) as pdf:
for page_num, page in enumerate(pdf.pages, 1):
# ... credit card check, IBAN extraction unchanged ...
page_rows = self._extract_page(page, page_num)
if page_rows is None:
continue # wrapper not called → state not corrupted
rows.extend(page_processor.process_page(page_rows))
The current_date local variable disappears entirely from pdf_extractor.py.
What this enables in tests
Skipped-page bleed is now assertable
def test_date_preserved_across_skipped_pages():
"""Date from page 1 propagates to page 3 when page 2 is skipped."""
processor = StatefulPageRowProcessor(_make_post_processor())
page1_rows = [{"Date": "05/01/2024", "Details": "Tx1", "Debit €": "", "Credit €": "", "Balance €": ""}]
processor.process_page(page1_rows)
assert processor.current_date() == "05/01/2024"
# Page 2 skipped — do not call process_page()
page3_rows = [{"Date": "", "Details": "Tx2", "Debit €": "", "Credit €": "", "Balance €": ""}]
result = processor.process_page(page3_rows)
assert result[0]["Date"] == "05/01/2024"
assert processor.last_date_source() == "propagated"
def test_no_date_bleed_after_reset():
"""reset() clears internal state between PDFs in a batch."""
processor = StatefulPageRowProcessor(_make_post_processor(filename_date="01 Jan 2025"))
processor.process_page([{"Date": "15/03/2025", "Details": "Tx", "Debit €": "", "Credit €": "", "Balance €": ""}])
assert processor.current_date() == "15/03/2025"
processor.reset()
assert processor.current_date() == ""
processor.process_page([{"Date": "", "Details": "Tx2", "Debit €": "", "Credit €": "", "Balance €": ""}])
assert processor.last_date_source() == "filename"
def test_date_source_row_data():
"""Row with its own date records source as row_data."""
processor = StatefulPageRowProcessor(_make_post_processor())
processor.process_page([{"Date": "10/02/2024", "Details": "Tx", "Debit €": "", "Credit €": "", "Balance €": ""}])
assert processor.last_date_source() == "row_data"
def test_date_source_filename_fallback():
"""Dateless row with no prior date falls back to filename_date."""
processor = StatefulPageRowProcessor(_make_post_processor(filename_date="01 Feb 2024"))
processor.process_page([{"Date": "", "Details": "Tx", "Debit €": "", "Credit €": "", "Balance €": ""}])
assert processor.last_date_source() == "filename"
Misuse is now structurally prevented
| Old misuse |
New state |
post_processor.process(row, current_date) — return discarded |
No return value to discard; process_page() returns the rows |
current_date local var not reassigned |
No local var; state is internal |
Skipped page corrupts current_date |
Wrapper not called on skipped pages; state unchanged |
What stays unchanged
| Component |
Change |
RowPostProcessor constructor and process() signature |
Minor: add last_source side-effect attribute to expose date source to wrapper |
extract_filename_date() |
None |
test_row_post_processor.py existing tests |
None — all continue to pass |
RowBuilder, PDFTableExtractor.__init__ |
None |
What is explicitly out of scope
- Per-row observability audit trail (
ProcessingMetadata, DateSource enum, observer callback) — real value but a separate concern that would change RowPostProcessor's constructor and return type
- Changes to boundary detection or the classifier chain
Implementation steps
-
Add last_source tracking to RowPostProcessor.process() — set self._last_source as a side-effect attribute ("row_data", "propagated", "filename") each time a transaction row is processed. Return type stays str. This is the only change to RowPostProcessor.
-
Add StatefulPageRowProcessor to row_post_processor.py
__init__: stores post_processor, initialises _current_date = "", _last_date_source = "initial"
process_page(page_rows): inner loop calling post_processor.process(row, self._current_date), reassigning self._current_date, reading post_processor._last_source into self._last_date_source, collecting non-empty rows
current_date(): getter
last_date_source(): getter
reset(to=""): sets _current_date = to, _last_date_source = "initial"
-
Update PDFTableExtractor.extract() — remove current_date local var and inner loop; add StatefulPageRowProcessor; replace inner loop with rows.extend(page_processor.process_page(page_rows))
-
Add 4 tests to test_row_post_processor.py — the four tests shown above
Coverage impact
Current: 92% (1288 tests). Four new tests, no files deleted, RowPostProcessor minimally changed. Net: +4 tests covering the previously-untestable cross-page date state.
Files affected
| File |
Change |
src/bankstatements_core/extraction/row_post_processor.py |
Add StatefulPageRowProcessor; add _last_source side-effect to RowPostProcessor.process() |
src/bankstatements_core/extraction/pdf_extractor.py |
Replace inner loop + current_date var with StatefulPageRowProcessor |
tests/extraction/test_row_post_processor.py |
Add 4 new tests |
Summary
RowPostProcessor.process(row, current_date) -> strboth mutatesrow["Date"]in-place and returns an updatedcurrent_datefor the caller to thread through the extraction loop. The caller inpdf_extractor.pymust reassigncurrent_dateon every row — if that assignment is omitted, date propagation silently regresses. When a page is skipped (_extract_pagereturnsNone), the stalecurrent_datefrom the previous page bleeds into the next valid page with no signal and no test coverage.This RFC proposes a
StatefulPageRowProcessorwrapper that encapsulates the threading entirely and makes both failure modes impossible or assertable.Problem
Current extraction loop (
pdf_extractor.py:83–128)Failure mode 1 — forgotten reassignment
Nothing in the type system or linter catches this. Every dateless row silently gets the first date ever seen.
Failure mode 2 — skipped-page date bleed
If pages 3–5 are skipped (invalid table structure, no headers) and page 6 has dateless rows, those rows inherit the date from page 2 with no signal. This is sometimes correct (dates span pages) but untestable — no unit test verifies cross-page date state under skipped pages.
RowPostProcessor.process()— dual responsibilityMutation and return value must both be used correctly. This is a hidden contract.
Current test gap
test_row_post_processor.pytestsRowPostProcessorin isolation. There is no test for:Proposed Interface
StatefulPageRowProcessorA thin wrapper around the existing
RowPostProcessor.RowPostProcessoritself is unchanged.Updated extraction loop (
pdf_extractor.py)Full loop after refactor:
The
current_datelocal variable disappears entirely frompdf_extractor.py.What this enables in tests
Skipped-page bleed is now assertable
Misuse is now structurally prevented
post_processor.process(row, current_date)— return discardedprocess_page()returns the rowscurrent_datelocal var not reassignedcurrent_dateWhat stays unchanged
RowPostProcessorconstructor andprocess()signaturelast_sourceside-effect attribute to expose date source to wrapperextract_filename_date()test_row_post_processor.pyexisting testsRowBuilder,PDFTableExtractor.__init__What is explicitly out of scope
ProcessingMetadata,DateSourceenum, observer callback) — real value but a separate concern that would changeRowPostProcessor's constructor and return typeImplementation steps
Add
last_sourcetracking toRowPostProcessor.process()— setself._last_sourceas a side-effect attribute ("row_data","propagated","filename") each time a transaction row is processed. Return type staysstr. This is the only change toRowPostProcessor.Add
StatefulPageRowProcessortorow_post_processor.py__init__: storespost_processor, initialises_current_date = "",_last_date_source = "initial"process_page(page_rows): inner loop callingpost_processor.process(row, self._current_date), reassigningself._current_date, readingpost_processor._last_sourceintoself._last_date_source, collecting non-empty rowscurrent_date(): getterlast_date_source(): getterreset(to=""): sets_current_date = to,_last_date_source = "initial"Update
PDFTableExtractor.extract()— removecurrent_datelocal var and inner loop; addStatefulPageRowProcessor; replace inner loop withrows.extend(page_processor.process_page(page_rows))Add 4 tests to
test_row_post_processor.py— the four tests shown aboveCoverage impact
Current: 92% (1288 tests). Four new tests, no files deleted,
RowPostProcessorminimally changed. Net: +4 tests covering the previously-untestable cross-page date state.Files affected
src/bankstatements_core/extraction/row_post_processor.pyStatefulPageRowProcessor; add_last_sourceside-effect toRowPostProcessor.process()src/bankstatements_core/extraction/pdf_extractor.pycurrent_datevar withStatefulPageRowProcessortests/extraction/test_row_post_processor.py