Problem
The codebase has three representations of the same domain entity (a bank transaction) in active use:
- Raw
dict — produced by PDFTableExtractor._extract_rows_from_words()
Transaction dataclass — the domain model in src/domain/models/transaction.py
TransactionRow TypedDict — defined in src/models.py, referenced nowhere at runtime (dead code)
The result is a dict → Transaction → dict churn pattern that happens redundantly on every service call. Each of the following services independently converts in and then converts back out:
TransactionFilterService.filter_empty_rows / filter_header_rows / filter_invalid_dates
ChronologicalSortingStrategy.sort
MonthlySummaryService.generate
ExpenseAnalysisService.analyze
This means for a single PDF run, the same 300 transactions may be converted to Transaction objects and back to dict 4–5 times. More importantly, the dict representation is untyped — column name inconsistencies ("Debit €" vs "Debit_EUR" vs "Debit EUR") are already causing silent bugs (e.g. CreditCardDuplicateStrategy.create_key reads "Debit_EUR" which never matches the live key "Debit €").
Additionally, document_type and transaction_type — first-class domain concepts — are stamped onto raw dicts in _process_row and have no typed home in Transaction. They currently leak into additional_fields or get silently dropped depending on the code path.
Proposed Solution
Transaction becomes the pipeline currency. Raw dict objects exist only inside the extraction layer and are never returned to callers. All internal service signatures accept and return list[Transaction]. Output writers are the single serialisation site.
Changes
1. Promote document_type and transaction_type to first-class Transaction fields
# src/domain/models/transaction.py
@dataclass
class Transaction:
date: str
details: str
debit: str | None
credit: str | None
balance: str | None
filename: str
document_type: str = "bank_statement" # was leaking via additional_fields
transaction_type: str = "other" # was ghost in TransactionRow TypedDict
additional_fields: dict[str, str] = field(default_factory=dict)
from_dict maps "document_type" and "transaction_type" explicitly, excluding them from additional_fields. to_dict() deliberately omits them — they are pipeline metadata, not output columns.
2. Normalise to_dict() output column names
def to_dict(self) -> dict[str, str | None]:
result: dict[str, str | None] = {
"Date": self.date,
"Details": self.details,
"Debit EUR": self.debit, # was "Debit €"
"Credit EUR": self.credit, # was "Credit €"
"Balance EUR": self.balance, # was "Balance €"
"Filename": self.filename,
}
result.update(self.additional_fields) # custom template columns verbatim
return result
from_dict key alias lists gain "Debit EUR" / "Credit EUR" / "Balance EUR" alongside the existing entries to maintain round-trip integrity.
3. Establish ExtractionResult as the extraction boundary type
# src/extraction/extraction_result.py
@dataclass(frozen=True)
class ExtractionResult:
transactions: list[Transaction]
page_count: int
iban: str | None
source_path: Path
PDFTableExtractor.extract() returns ExtractionResult instead of tuple[list[dict], int, str | None]. The conversion from raw dicts to Transaction objects happens inside extract() — one list comprehension, the single conversion site in the entire codebase.
# Inside PDFTableExtractor.extract() — the single boundary
transactions = [Transaction.from_dict(row) for row in raw_rows]
return ExtractionResult(transactions=transactions, page_count=..., iban=..., source_path=pdf_path)
ExtractionOrchestrator.extract_from_pdf() return type updates to match.
4. Update all internal service signatures
# transaction_filter.py
def apply_all_filters(self, transactions: list[Transaction]) -> list[Transaction]: ...
# sorting_service.py
def sort(self, transactions: list[Transaction]) -> list[Transaction]: ...
# duplicate_detector.py
def detect_and_separate(self, transactions: list[Transaction]) -> tuple[list[Transaction], list[Transaction]]: ...
# transaction_processing_orchestrator.py
def process_transaction_group(self, transactions: list[Transaction], ...) -> tuple[list[Transaction], list[Transaction]]: ...
# pdf_processing_orchestrator.py
def process_all_pdfs(self, ...) -> tuple[list[Transaction], int, dict[str, str]]: ...
Protocol interfaces in src/domain/protocols/services.py update to match.
5. OutputOrchestrator becomes the single serialisation site
def write_output_files(
self,
unique_transactions: list[Transaction],
duplicate_transactions: list[Transaction],
iban_suffix: str | None = None,
) -> dict[str, str]:
unique_rows = [tx.to_dict() for tx in unique_transactions] # ← only call to to_dict()
duplicate_rows = [tx.to_dict() for tx in duplicate_transactions]
df_unique = pd.DataFrame(unique_rows, columns=self.column_names)
...
OutputFormatStrategy implementations remain list[dict] consumers — they do not change.
6. Delete TransactionRow TypedDict and domain/converters.py
src/models.py (TransactionRow) is deleted. src/domain/converters.py (dicts_to_transactions, transactions_to_dicts, etc.) becomes dead code once service signatures are migrated and is deleted. src/domain/__init__.py updated to export only Transaction.
7. Fix _is_header_row placement
TransactionFilterService._is_header_row currently checks whether a cell value matches a column name — a check that requires a flat dict. Move this check inside PDFTableExtractor._extract_rows_from_words() (pre-boundary), where it logically belongs. Post-boundary filter methods work on typed Transaction fields only.
Orchestrator experience after the change
# PDFProcessingOrchestrator — no conversion ceremony anywhere
result = self.extraction_orchestrator.extract_from_pdf(pdf)
filtered = self.filter_service.apply_all_filters(result.transactions)
all_transactions.extend(filtered)
# TransactionProcessingOrchestrator
unique, dupes = self.duplicate_detector.detect_and_separate(classified)
sorted_txns = self.sorting_service.sort(unique)
# Custom template columns survive verbatim — no special handling required
# tx.additional_fields == {"Reference": "VT-449201", "Merchant Category": "Groceries"}
# tx.to_dict() → {"Date": ..., "Debit EUR": ..., "Reference": "VT-449201", ...}
What stays the same
OutputFormatStrategy implementations (CSV, JSON, XLSX) — unchanged
MonthlySummaryService and ExpenseAnalysisService — receive list[dict] from OutputOrchestrator post-serialisation, unchanged
additional_fields round-trip — custom template columns survive and appear verbatim in CSV/XLSX output
IPDFReader injection on PDFTableExtractor — unchanged
- Entitlements enforcement surface — unchanged
- The
bankstatements-premium consumer — unaffected (never touches transaction types)
Migration Sequence
This is a hard cut with no gradual path. Recommended order to keep tests green at each step:
- Add
document_type and transaction_type fields to Transaction; update from_dict and to_dict
- Add
ExtractionResult; update PDFTableExtractor.extract() return type; update ExtractionOrchestrator
- Update
PDFProcessingOrchestrator to consume ExtractionResult
- Move
_is_header_row pre-boundary into PDFTableExtractor
- Update
TransactionFilterService signatures; delete internal conversion calls
- Update
SortingStrategy / TransactionSortingService signatures
- Update
DuplicateDetectionStrategy / DuplicateDetectionService signatures; fix CreditCardDuplicateStrategy.create_key (currently reads "Debit_EUR" — silent bug)
- Update
TransactionProcessingOrchestrator signatures
- Update
OutputOrchestrator.write_output_files to accept list[Transaction]; move to_dict() calls here
- Update protocol interfaces in
src/domain/protocols/services.py
- Update
processor.py top-level orchestration
- Delete
src/models.py (TransactionRow)
- Delete
src/domain/converters.py
- Update
src/domain/__init__.py
- Update column name references throughout (
"Debit €" → "Debit EUR" in templates, tests, strategies)
Test Impact
- Tests that construct
list[dict] to call service methods must be updated to construct Transaction objects. This is more verbose but catches column-name bugs at construction time rather than silently at runtime.
- The
ExcelOutputStrategy._is_numeric_column check for "€" in column names can be removed (no longer matches after normalisation; "debit" / "credit" / "balance" substring checks still apply).
- Coverage gaps in
exclusion_detector.py (59%) and expense_analysis.py (~70%) are unaffected by this change and remain as separate work.
Out of Scope
- Per-format column remapping (
ColumnMap) — not needed given the consumer is a single thin shell
- Bank-specific pluggable converters — not needed for the current consumer surface
- Changes to
MonthlySummaryService or ExpenseAnalysisService internal logic
Problem
The codebase has three representations of the same domain entity (a bank transaction) in active use:
dict— produced byPDFTableExtractor._extract_rows_from_words()Transactiondataclass — the domain model insrc/domain/models/transaction.pyTransactionRowTypedDict — defined insrc/models.py, referenced nowhere at runtime (dead code)The result is a
dict → Transaction → dictchurn pattern that happens redundantly on every service call. Each of the following services independently converts in and then converts back out:TransactionFilterService.filter_empty_rows / filter_header_rows / filter_invalid_datesChronologicalSortingStrategy.sortMonthlySummaryService.generateExpenseAnalysisService.analyzeThis means for a single PDF run, the same 300 transactions may be converted to
Transactionobjects and back todict4–5 times. More importantly, thedictrepresentation is untyped — column name inconsistencies ("Debit €"vs"Debit_EUR"vs"Debit EUR") are already causing silent bugs (e.g.CreditCardDuplicateStrategy.create_keyreads"Debit_EUR"which never matches the live key"Debit €").Additionally,
document_typeandtransaction_type— first-class domain concepts — are stamped onto raw dicts in_process_rowand have no typed home inTransaction. They currently leak intoadditional_fieldsor get silently dropped depending on the code path.Proposed Solution
Transactionbecomes the pipeline currency. Rawdictobjects exist only inside the extraction layer and are never returned to callers. All internal service signatures accept and returnlist[Transaction]. Output writers are the single serialisation site.Changes
1. Promote
document_typeandtransaction_typeto first-classTransactionfieldsfrom_dictmaps"document_type"and"transaction_type"explicitly, excluding them fromadditional_fields.to_dict()deliberately omits them — they are pipeline metadata, not output columns.2. Normalise
to_dict()output column namesfrom_dictkey alias lists gain"Debit EUR"/"Credit EUR"/"Balance EUR"alongside the existing entries to maintain round-trip integrity.3. Establish
ExtractionResultas the extraction boundary typePDFTableExtractor.extract()returnsExtractionResultinstead oftuple[list[dict], int, str | None]. The conversion from raw dicts toTransactionobjects happens insideextract()— one list comprehension, the single conversion site in the entire codebase.ExtractionOrchestrator.extract_from_pdf()return type updates to match.4. Update all internal service signatures
Protocol interfaces in
src/domain/protocols/services.pyupdate to match.5.
OutputOrchestratorbecomes the single serialisation siteOutputFormatStrategyimplementations remainlist[dict]consumers — they do not change.6. Delete
TransactionRowTypedDict anddomain/converters.pysrc/models.py(TransactionRow) is deleted.src/domain/converters.py(dicts_to_transactions,transactions_to_dicts, etc.) becomes dead code once service signatures are migrated and is deleted.src/domain/__init__.pyupdated to export onlyTransaction.7. Fix
_is_header_rowplacementTransactionFilterService._is_header_rowcurrently checks whether a cell value matches a column name — a check that requires a flat dict. Move this check insidePDFTableExtractor._extract_rows_from_words()(pre-boundary), where it logically belongs. Post-boundary filter methods work on typedTransactionfields only.Orchestrator experience after the change
What stays the same
OutputFormatStrategyimplementations (CSV, JSON, XLSX) — unchangedMonthlySummaryServiceandExpenseAnalysisService— receivelist[dict]fromOutputOrchestratorpost-serialisation, unchangedadditional_fieldsround-trip — custom template columns survive and appear verbatim in CSV/XLSX outputIPDFReaderinjection onPDFTableExtractor— unchangedbankstatements-premiumconsumer — unaffected (never touches transaction types)Migration Sequence
This is a hard cut with no gradual path. Recommended order to keep tests green at each step:
document_typeandtransaction_typefields toTransaction; updatefrom_dictandto_dictExtractionResult; updatePDFTableExtractor.extract()return type; updateExtractionOrchestratorPDFProcessingOrchestratorto consumeExtractionResult_is_header_rowpre-boundary intoPDFTableExtractorTransactionFilterServicesignatures; delete internal conversion callsSortingStrategy/TransactionSortingServicesignaturesDuplicateDetectionStrategy/DuplicateDetectionServicesignatures; fixCreditCardDuplicateStrategy.create_key(currently reads"Debit_EUR"— silent bug)TransactionProcessingOrchestratorsignaturesOutputOrchestrator.write_output_filesto acceptlist[Transaction]; moveto_dict()calls heresrc/domain/protocols/services.pyprocessor.pytop-level orchestrationsrc/models.py(TransactionRow)src/domain/converters.pysrc/domain/__init__.py"Debit €"→"Debit EUR"in templates, tests, strategies)Test Impact
list[dict]to call service methods must be updated to constructTransactionobjects. This is more verbose but catches column-name bugs at construction time rather than silently at runtime.ExcelOutputStrategy._is_numeric_columncheck for"€"in column names can be removed (no longer matches after normalisation;"debit"/"credit"/"balance"substring checks still apply).exclusion_detector.py(59%) andexpense_analysis.py(~70%) are unaffected by this change and remain as separate work.Out of Scope
ColumnMap) — not needed given the consumer is a single thin shellMonthlySummaryServiceorExpenseAnalysisServiceinternal logic