RFC: Establish Transaction as pipeline currency with single extraction boundary

## Problem

The codebase has three representations of the same domain entity (a bank transaction) in active use:

- **Raw `dict`** — produced by `PDFTableExtractor._extract_rows_from_words()`
- **`Transaction` dataclass** — the domain model in `src/domain/models/transaction.py`
- **`TransactionRow` TypedDict** — defined in `src/models.py`, referenced nowhere at runtime (dead code)

The result is a `dict → Transaction → dict` churn pattern that happens redundantly on every service call. Each of the following services independently converts in and then converts back out:

- `TransactionFilterService.filter_empty_rows / filter_header_rows / filter_invalid_dates`
- `ChronologicalSortingStrategy.sort`
- `MonthlySummaryService.generate`
- `ExpenseAnalysisService.analyze`

This means for a single PDF run, the same 300 transactions may be converted to `Transaction` objects and back to `dict` 4–5 times. More importantly, the `dict` representation is untyped — column name inconsistencies (`"Debit €"` vs `"Debit_EUR"` vs `"Debit EUR"`) are already causing silent bugs (e.g. `CreditCardDuplicateStrategy.create_key` reads `"Debit_EUR"` which never matches the live key `"Debit €"`).

Additionally, `document_type` and `transaction_type` — first-class domain concepts — are stamped onto raw dicts in `_process_row` and have no typed home in `Transaction`. They currently leak into `additional_fields` or get silently dropped depending on the code path.

## Proposed Solution

**`Transaction` becomes the pipeline currency.** Raw `dict` objects exist only inside the extraction layer and are never returned to callers. All internal service signatures accept and return `list[Transaction]`. Output writers are the single serialisation site.

### Changes

#### 1. Promote `document_type` and `transaction_type` to first-class `Transaction` fields

```python
# src/domain/models/transaction.py
@dataclass
class Transaction:
    date:              str
    details:           str
    debit:             str | None
    credit:            str | None
    balance:           str | None
    filename:          str
    document_type:     str = "bank_statement"   # was leaking via additional_fields
    transaction_type:  str = "other"            # was ghost in TransactionRow TypedDict
    additional_fields: dict[str, str] = field(default_factory=dict)
```

`from_dict` maps `"document_type"` and `"transaction_type"` explicitly, excluding them from `additional_fields`. `to_dict()` deliberately omits them — they are pipeline metadata, not output columns.

#### 2. Normalise `to_dict()` output column names

```python
def to_dict(self) -> dict[str, str | None]:
    result: dict[str, str | None] = {
        "Date":        self.date,
        "Details":     self.details,
        "Debit EUR":   self.debit,      # was "Debit €"
        "Credit EUR":  self.credit,     # was "Credit €"
        "Balance EUR": self.balance,    # was "Balance €"
        "Filename":    self.filename,
    }
    result.update(self.additional_fields)   # custom template columns verbatim
    return result
```

`from_dict` key alias lists gain `"Debit EUR"` / `"Credit EUR"` / `"Balance EUR"` alongside the existing entries to maintain round-trip integrity.

#### 3. Establish `ExtractionResult` as the extraction boundary type

```python
# src/extraction/extraction_result.py
@dataclass(frozen=True)
class ExtractionResult:
    transactions: list[Transaction]
    page_count:   int
    iban:         str | None
    source_path:  Path
```

`PDFTableExtractor.extract()` returns `ExtractionResult` instead of `tuple[list[dict], int, str | None]`. The conversion from raw dicts to `Transaction` objects happens inside `extract()` — one list comprehension, the single conversion site in the entire codebase.

```python
# Inside PDFTableExtractor.extract() — the single boundary
transactions = [Transaction.from_dict(row) for row in raw_rows]
return ExtractionResult(transactions=transactions, page_count=..., iban=..., source_path=pdf_path)
```

`ExtractionOrchestrator.extract_from_pdf()` return type updates to match.

#### 4. Update all internal service signatures

```python
# transaction_filter.py
def apply_all_filters(self, transactions: list[Transaction]) -> list[Transaction]: ...

# sorting_service.py
def sort(self, transactions: list[Transaction]) -> list[Transaction]: ...

# duplicate_detector.py
def detect_and_separate(self, transactions: list[Transaction]) -> tuple[list[Transaction], list[Transaction]]: ...

# transaction_processing_orchestrator.py
def process_transaction_group(self, transactions: list[Transaction], ...) -> tuple[list[Transaction], list[Transaction]]: ...

# pdf_processing_orchestrator.py
def process_all_pdfs(self, ...) -> tuple[list[Transaction], int, dict[str, str]]: ...
```

Protocol interfaces in `src/domain/protocols/services.py` update to match.

#### 5. `OutputOrchestrator` becomes the single serialisation site

```python
def write_output_files(
    self,
    unique_transactions: list[Transaction],
    duplicate_transactions: list[Transaction],
    iban_suffix: str | None = None,
) -> dict[str, str]:
    unique_rows    = [tx.to_dict() for tx in unique_transactions]     # ← only call to to_dict()
    duplicate_rows = [tx.to_dict() for tx in duplicate_transactions]
    df_unique      = pd.DataFrame(unique_rows, columns=self.column_names)
    ...
```

`OutputFormatStrategy` implementations remain `list[dict]` consumers — they do not change.

#### 6. Delete `TransactionRow` TypedDict and `domain/converters.py`

`src/models.py` (`TransactionRow`) is deleted. `src/domain/converters.py` (`dicts_to_transactions`, `transactions_to_dicts`, etc.) becomes dead code once service signatures are migrated and is deleted. `src/domain/__init__.py` updated to export only `Transaction`.

#### 7. Fix `_is_header_row` placement

`TransactionFilterService._is_header_row` currently checks whether a cell *value* matches a column *name* — a check that requires a flat dict. Move this check inside `PDFTableExtractor._extract_rows_from_words()` (pre-boundary), where it logically belongs. Post-boundary filter methods work on typed `Transaction` fields only.

### Orchestrator experience after the change

```python
# PDFProcessingOrchestrator — no conversion ceremony anywhere
result = self.extraction_orchestrator.extract_from_pdf(pdf)
filtered = self.filter_service.apply_all_filters(result.transactions)
all_transactions.extend(filtered)

# TransactionProcessingOrchestrator
unique, dupes = self.duplicate_detector.detect_and_separate(classified)
sorted_txns   = self.sorting_service.sort(unique)

# Custom template columns survive verbatim — no special handling required
# tx.additional_fields == {"Reference": "VT-449201", "Merchant Category": "Groceries"}
# tx.to_dict() → {"Date": ..., "Debit EUR": ..., "Reference": "VT-449201", ...}
```

### What stays the same

- `OutputFormatStrategy` implementations (CSV, JSON, XLSX) — unchanged
- `MonthlySummaryService` and `ExpenseAnalysisService` — receive `list[dict]` from `OutputOrchestrator` post-serialisation, unchanged
- `additional_fields` round-trip — custom template columns survive and appear verbatim in CSV/XLSX output
- `IPDFReader` injection on `PDFTableExtractor` — unchanged
- Entitlements enforcement surface — unchanged
- The `bankstatements-premium` consumer — unaffected (never touches transaction types)

## Migration Sequence

This is a hard cut with no gradual path. Recommended order to keep tests green at each step:

1. Add `document_type` and `transaction_type` fields to `Transaction`; update `from_dict` and `to_dict`
2. Add `ExtractionResult`; update `PDFTableExtractor.extract()` return type; update `ExtractionOrchestrator`
3. Update `PDFProcessingOrchestrator` to consume `ExtractionResult`
4. Move `_is_header_row` pre-boundary into `PDFTableExtractor`
5. Update `TransactionFilterService` signatures; delete internal conversion calls
6. Update `SortingStrategy` / `TransactionSortingService` signatures
7. Update `DuplicateDetectionStrategy` / `DuplicateDetectionService` signatures; fix `CreditCardDuplicateStrategy.create_key` (currently reads `"Debit_EUR"` — silent bug)
8. Update `TransactionProcessingOrchestrator` signatures
9. Update `OutputOrchestrator.write_output_files` to accept `list[Transaction]`; move `to_dict()` calls here
10. Update protocol interfaces in `src/domain/protocols/services.py`
11. Update `processor.py` top-level orchestration
12. Delete `src/models.py` (`TransactionRow`)
13. Delete `src/domain/converters.py`
14. Update `src/domain/__init__.py`
15. Update column name references throughout (`"Debit €"` → `"Debit EUR"` in templates, tests, strategies)

## Test Impact

- Tests that construct `list[dict]` to call service methods must be updated to construct `Transaction` objects. This is more verbose but catches column-name bugs at construction time rather than silently at runtime.
- The `ExcelOutputStrategy._is_numeric_column` check for `"€"` in column names can be removed (no longer matches after normalisation; `"debit"` / `"credit"` / `"balance"` substring checks still apply).
- Coverage gaps in `exclusion_detector.py` (59%) and `expense_analysis.py` (~70%) are unaffected by this change and remain as separate work.

## Out of Scope

- Per-format column remapping (`ColumnMap`) — not needed given the consumer is a single thin shell
- Bank-specific pluggable converters — not needed for the current consumer surface
- Changes to `MonthlySummaryService` or `ExpenseAnalysisService` internal logic


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Establish Transaction as pipeline currency with single extraction boundary #16

Problem

Proposed Solution

Changes

1. Promote `document_type` and `transaction_type` to first-class `Transaction` fields

2. Normalise `to_dict()` output column names

3. Establish `ExtractionResult` as the extraction boundary type

4. Update all internal service signatures

5. `OutputOrchestrator` becomes the single serialisation site

6. Delete `TransactionRow` TypedDict and `domain/converters.py`

7. Fix `_is_header_row` placement

Orchestrator experience after the change

What stays the same

Migration Sequence

Test Impact

Out of Scope

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

RFC: Establish Transaction as pipeline currency with single extraction boundary #16

Description

Problem

Proposed Solution

Changes

1. Promote document_type and transaction_type to first-class Transaction fields

2. Normalise to_dict() output column names

3. Establish ExtractionResult as the extraction boundary type

4. Update all internal service signatures

5. OutputOrchestrator becomes the single serialisation site

6. Delete TransactionRow TypedDict and domain/converters.py

7. Fix _is_header_row placement

Orchestrator experience after the change

What stays the same

Migration Sequence

Test Impact

Out of Scope

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

1. Promote `document_type` and `transaction_type` to first-class `Transaction` fields

2. Normalise `to_dict()` output column names

3. Establish `ExtractionResult` as the extraction boundary type

5. `OutputOrchestrator` becomes the single serialisation site

6. Delete `TransactionRow` TypedDict and `domain/converters.py`

7. Fix `_is_header_row` placement