Skip to content

RFC: Establish Transaction as pipeline currency with single extraction boundary #16

@longieirl

Description

@longieirl

Problem

The codebase has three representations of the same domain entity (a bank transaction) in active use:

  • Raw dict — produced by PDFTableExtractor._extract_rows_from_words()
  • Transaction dataclass — the domain model in src/domain/models/transaction.py
  • TransactionRow TypedDict — defined in src/models.py, referenced nowhere at runtime (dead code)

The result is a dict → Transaction → dict churn pattern that happens redundantly on every service call. Each of the following services independently converts in and then converts back out:

  • TransactionFilterService.filter_empty_rows / filter_header_rows / filter_invalid_dates
  • ChronologicalSortingStrategy.sort
  • MonthlySummaryService.generate
  • ExpenseAnalysisService.analyze

This means for a single PDF run, the same 300 transactions may be converted to Transaction objects and back to dict 4–5 times. More importantly, the dict representation is untyped — column name inconsistencies ("Debit €" vs "Debit_EUR" vs "Debit EUR") are already causing silent bugs (e.g. CreditCardDuplicateStrategy.create_key reads "Debit_EUR" which never matches the live key "Debit €").

Additionally, document_type and transaction_type — first-class domain concepts — are stamped onto raw dicts in _process_row and have no typed home in Transaction. They currently leak into additional_fields or get silently dropped depending on the code path.

Proposed Solution

Transaction becomes the pipeline currency. Raw dict objects exist only inside the extraction layer and are never returned to callers. All internal service signatures accept and return list[Transaction]. Output writers are the single serialisation site.

Changes

1. Promote document_type and transaction_type to first-class Transaction fields

# src/domain/models/transaction.py
@dataclass
class Transaction:
    date:              str
    details:           str
    debit:             str | None
    credit:            str | None
    balance:           str | None
    filename:          str
    document_type:     str = "bank_statement"   # was leaking via additional_fields
    transaction_type:  str = "other"            # was ghost in TransactionRow TypedDict
    additional_fields: dict[str, str] = field(default_factory=dict)

from_dict maps "document_type" and "transaction_type" explicitly, excluding them from additional_fields. to_dict() deliberately omits them — they are pipeline metadata, not output columns.

2. Normalise to_dict() output column names

def to_dict(self) -> dict[str, str | None]:
    result: dict[str, str | None] = {
        "Date":        self.date,
        "Details":     self.details,
        "Debit EUR":   self.debit,      # was "Debit €"
        "Credit EUR":  self.credit,     # was "Credit €"
        "Balance EUR": self.balance,    # was "Balance €"
        "Filename":    self.filename,
    }
    result.update(self.additional_fields)   # custom template columns verbatim
    return result

from_dict key alias lists gain "Debit EUR" / "Credit EUR" / "Balance EUR" alongside the existing entries to maintain round-trip integrity.

3. Establish ExtractionResult as the extraction boundary type

# src/extraction/extraction_result.py
@dataclass(frozen=True)
class ExtractionResult:
    transactions: list[Transaction]
    page_count:   int
    iban:         str | None
    source_path:  Path

PDFTableExtractor.extract() returns ExtractionResult instead of tuple[list[dict], int, str | None]. The conversion from raw dicts to Transaction objects happens inside extract() — one list comprehension, the single conversion site in the entire codebase.

# Inside PDFTableExtractor.extract() — the single boundary
transactions = [Transaction.from_dict(row) for row in raw_rows]
return ExtractionResult(transactions=transactions, page_count=..., iban=..., source_path=pdf_path)

ExtractionOrchestrator.extract_from_pdf() return type updates to match.

4. Update all internal service signatures

# transaction_filter.py
def apply_all_filters(self, transactions: list[Transaction]) -> list[Transaction]: ...

# sorting_service.py
def sort(self, transactions: list[Transaction]) -> list[Transaction]: ...

# duplicate_detector.py
def detect_and_separate(self, transactions: list[Transaction]) -> tuple[list[Transaction], list[Transaction]]: ...

# transaction_processing_orchestrator.py
def process_transaction_group(self, transactions: list[Transaction], ...) -> tuple[list[Transaction], list[Transaction]]: ...

# pdf_processing_orchestrator.py
def process_all_pdfs(self, ...) -> tuple[list[Transaction], int, dict[str, str]]: ...

Protocol interfaces in src/domain/protocols/services.py update to match.

5. OutputOrchestrator becomes the single serialisation site

def write_output_files(
    self,
    unique_transactions: list[Transaction],
    duplicate_transactions: list[Transaction],
    iban_suffix: str | None = None,
) -> dict[str, str]:
    unique_rows    = [tx.to_dict() for tx in unique_transactions]     # ← only call to to_dict()
    duplicate_rows = [tx.to_dict() for tx in duplicate_transactions]
    df_unique      = pd.DataFrame(unique_rows, columns=self.column_names)
    ...

OutputFormatStrategy implementations remain list[dict] consumers — they do not change.

6. Delete TransactionRow TypedDict and domain/converters.py

src/models.py (TransactionRow) is deleted. src/domain/converters.py (dicts_to_transactions, transactions_to_dicts, etc.) becomes dead code once service signatures are migrated and is deleted. src/domain/__init__.py updated to export only Transaction.

7. Fix _is_header_row placement

TransactionFilterService._is_header_row currently checks whether a cell value matches a column name — a check that requires a flat dict. Move this check inside PDFTableExtractor._extract_rows_from_words() (pre-boundary), where it logically belongs. Post-boundary filter methods work on typed Transaction fields only.

Orchestrator experience after the change

# PDFProcessingOrchestrator — no conversion ceremony anywhere
result = self.extraction_orchestrator.extract_from_pdf(pdf)
filtered = self.filter_service.apply_all_filters(result.transactions)
all_transactions.extend(filtered)

# TransactionProcessingOrchestrator
unique, dupes = self.duplicate_detector.detect_and_separate(classified)
sorted_txns   = self.sorting_service.sort(unique)

# Custom template columns survive verbatim — no special handling required
# tx.additional_fields == {"Reference": "VT-449201", "Merchant Category": "Groceries"}
# tx.to_dict() → {"Date": ..., "Debit EUR": ..., "Reference": "VT-449201", ...}

What stays the same

  • OutputFormatStrategy implementations (CSV, JSON, XLSX) — unchanged
  • MonthlySummaryService and ExpenseAnalysisService — receive list[dict] from OutputOrchestrator post-serialisation, unchanged
  • additional_fields round-trip — custom template columns survive and appear verbatim in CSV/XLSX output
  • IPDFReader injection on PDFTableExtractor — unchanged
  • Entitlements enforcement surface — unchanged
  • The bankstatements-premium consumer — unaffected (never touches transaction types)

Migration Sequence

This is a hard cut with no gradual path. Recommended order to keep tests green at each step:

  1. Add document_type and transaction_type fields to Transaction; update from_dict and to_dict
  2. Add ExtractionResult; update PDFTableExtractor.extract() return type; update ExtractionOrchestrator
  3. Update PDFProcessingOrchestrator to consume ExtractionResult
  4. Move _is_header_row pre-boundary into PDFTableExtractor
  5. Update TransactionFilterService signatures; delete internal conversion calls
  6. Update SortingStrategy / TransactionSortingService signatures
  7. Update DuplicateDetectionStrategy / DuplicateDetectionService signatures; fix CreditCardDuplicateStrategy.create_key (currently reads "Debit_EUR" — silent bug)
  8. Update TransactionProcessingOrchestrator signatures
  9. Update OutputOrchestrator.write_output_files to accept list[Transaction]; move to_dict() calls here
  10. Update protocol interfaces in src/domain/protocols/services.py
  11. Update processor.py top-level orchestration
  12. Delete src/models.py (TransactionRow)
  13. Delete src/domain/converters.py
  14. Update src/domain/__init__.py
  15. Update column name references throughout ("Debit €""Debit EUR" in templates, tests, strategies)

Test Impact

  • Tests that construct list[dict] to call service methods must be updated to construct Transaction objects. This is more verbose but catches column-name bugs at construction time rather than silently at runtime.
  • The ExcelOutputStrategy._is_numeric_column check for "€" in column names can be removed (no longer matches after normalisation; "debit" / "credit" / "balance" substring checks still apply).
  • Coverage gaps in exclusion_detector.py (59%) and expense_analysis.py (~70%) are unaffected by this change and remain as separate work.

Out of Scope

  • Per-format column remapping (ColumnMap) — not needed given the consumer is a single thin shell
  • Bank-specific pluggable converters — not needed for the current consumer surface
  • Changes to MonthlySummaryService or ExpenseAnalysisService internal logic

Metadata

Metadata

Assignees

Labels

No labels
No labels

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions