45 changes: 45 additions & 0 deletions CHANGELOG.md
@@ -6,3 +6,48 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

---

## [0.1.2] — 2026-03-25

### Fixed
- **#47** — `filter_service.apply_all_filters()` result was computed and logged but silently discarded. Filtered rows are now written back to `result.transactions` in `PDFProcessingOrchestrator.process_all_pdfs()`, so `filter_empty_rows`, `filter_header_rows`, and `filter_invalid_dates` are applied to every successfully extracted PDF.
- **#52** — `BankStatementProcessorBuilder.with_duplicate_strategy()` and `.with_date_sorting()` were inert: `build()` called `ServiceRegistry.from_config()` with no services, causing the registry to create its own defaults and silently ignore the configured strategy. The builder now constructs `DuplicateDetectionService` and `TransactionSortingService` from its configured values and passes them explicitly into `ServiceRegistry.from_config()`.
- **#55** — Credit card / no-IBAN PDFs were excluded from the `pdfs_extracted` count in processing output. `process_all_pdfs()` now returns a 3-tuple `(results, pdf_count, pages_read)`.
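
The #47 fix comes down to assigning the filter output back onto the result instead of only logging it. A minimal sketch of that pattern, assuming simplified stand-ins: `Result`, this `apply_all_filters`, and the row shape are hypothetical and are not the project's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    # Hypothetical stand-in for the per-PDF extraction result.
    transactions: list = field(default_factory=list)


def apply_all_filters(rows):
    # Illustrative filter only: drop rows whose values are all blank.
    return [row for row in rows if any(str(v).strip() for v in row.values())]


def process_buggy(result):
    filtered = apply_all_filters(result.transactions)
    print(f"kept {len(filtered)} rows")  # computed and logged, then discarded
    return result


def process_fixed(result):
    # The fix: write the filtered rows back onto the result.
    result.transactions = apply_all_filters(result.transactions)
    return result


rows = [{"Date": "01/03/2026", "Details": "Coffee"}, {"Date": "", "Details": " "}]
buggy = process_buggy(Result(list(rows)))
fixed = process_fixed(Result(list(rows)))
```

In the buggy variant the blank row survives; in the fixed variant only the populated row remains on `result.transactions`.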

### Changed (architecture cleanup — PRs #56, #57)
- **#49** — `ChronologicalSortingStrategy` sorts dicts directly via `DateParserService`, removing a redundant `Transaction` round-trip.
- **#48** — Deferred circular imports in `processor.py` removed; `service_registry`, `monthly_summary`, and `expense_analysis` import `ColumnAnalysisService`/`DateParserService` directly at module level.
- **#50** — `TransactionClassifier._looks_like_date` delegates to `RowAnalysisService.looks_like_date`, removing a duplicate regex and fixing a subtle 1-or-2-digit day matching bug.
- **#51** — `ProcessorFactory.create_from_config()` builds `ProcessorConfig` in one block via `BankStatementProcessorBuilder.with_processor_config()`; new config knobs now touch ≤2 files.
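
To illustrate the idea behind #49, the sketch below sorts row dicts by a parsed date key without converting them to objects and back. The `"Date"` key, the accepted formats, and both helper names are assumptions for illustration; the real `DateParserService` may behave differently.

```python
from datetime import datetime

# Assumed input formats, chosen for illustration only.
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d")


def parse_date(value):
    """Return a datetime for the first matching format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None


def sort_chronologically(rows):
    """Sort dicts directly; rows with unparseable dates go last."""
    def key(row):
        parsed = parse_date(row.get("Date", ""))
        return (parsed is None, parsed or datetime.min)

    # sorted() is stable, so undated rows keep their relative order.
    return sorted(rows, key=key)


rows = [{"Date": "15/03/2026"}, {"Date": "n/a"}, {"Date": "2026-01-02"}]
ordered = sort_chronologically(rows)
```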

---

## [0.1.1] — 2026-03-25

### Added (v1.1 — Transaction Pipeline & Word Utils)
- **Transaction enrichment** (`source_page: int | None`, `confidence_score: float`, `extraction_warnings: list[str]`) — all three fields default correctly and survive `to_dict` / `from_dict` round-trips (#16 / Phase 21).
- **`ExtractionResult` dataclass** (`domain/models/extraction_result.py`) — typed extraction boundary with `transactions`, `page_count`, `iban`, `source_file`, and `warnings` fields. Architecture guard test enforces placement in `domain/models/` (#16 / Phase 22).
- **End-to-end `ExtractionResult` pipeline** — `PDFTableExtractor.extract()`, `ExtractionOrchestrator`, `PDFProcessingOrchestrator`, and `processor` all produce and consume `ExtractionResult`; zero tuple-index unpacking remains (#16 / Phase 23).
- **`extraction/word_utils.py`** — canonical module for `group_words_by_y`, `assign_words_to_columns` (with `strict_rightmost` flag), and `calculate_column_coverage`. Five callers migrated; four private duplicate methods deleted (#21 / Phase 24).
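
A rough sketch of what the enrichment round-trip could look like. The three enrichment field names and types come from the entry above; the `date` and `details` fields, the default values, and the method bodies are assumptions rather than the project's code.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field


@dataclass
class Transaction:
    # Reduced, hypothetical core fields.
    date: str
    details: str
    # Enrichment fields; the defaults shown here are assumed.
    source_page: int | None = None
    confidence_score: float = 1.0
    extraction_warnings: list[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> Transaction:
        # Missing enrichment keys fall back to the defaults.
        return cls(
            date=data["date"],
            details=data["details"],
            source_page=data.get("source_page"),
            confidence_score=data.get("confidence_score", 1.0),
            extraction_warnings=list(data.get("extraction_warnings", [])),
        )


original = Transaction("01/03/2026", "Coffee", source_page=2,
                       extraction_warnings=["low contrast"])
restored = Transaction.from_dict(original.to_dict())
legacy = Transaction.from_dict({"date": "02/03/2026", "details": "Rent"})
```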

### Changed
- **ServiceRegistry** introduced (`feat/28`, PR #44) — `ServiceRegistry.from_config(ProcessorConfig, Entitlements)` wires all transaction-processing services. `TransactionProcessingOrchestrator` deleted (PR #46 / issue #45).
- **ClassifierRegistry** with explicit integer priorities added to `row_classifiers.py` (fix/29, PR #39).
- **`recursive_scan` default** changed `False → True` in `ProcessingConfig`, `AppConfig`, `ProcessorBuilder`, and `PDFDiscoveryService`; `RECURSIVE_SCAN` env var added to `docker-compose.yml` (fix/40, PR #41).
- **`ScoringConfig` injectable** via `BankStatementProcessorBuilder.with_scoring_config()` (feat/32, PR #36).
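
As an illustration of a registry with explicit integer priorities, the sketch below runs classifiers in priority order and returns the first label produced. The method names, the rule that lower numbers run first, and the sample classifiers are all assumptions; the actual `ClassifierRegistry` in `row_classifiers.py` may differ.

```python
from typing import Callable, Optional

Classifier = Callable[[dict], Optional[str]]


class ClassifierRegistry:
    def __init__(self) -> None:
        self._entries = []  # (priority, name, classifier) tuples

    def register(self, priority: int, name: str, classifier: Classifier) -> None:
        self._entries.append((priority, name, classifier))
        # Assumption: lower priority numbers are consulted first.
        self._entries.sort(key=lambda entry: entry[0])

    def classify(self, row: dict) -> str:
        for _, _, classifier in self._entries:
            label = classifier(row)
            if label is not None:
                return label
        return "unknown"


registry = ClassifierRegistry()
# Registered out of order on purpose; priority decides, not insertion order.
registry.register(20, "transaction", lambda r: "transaction" if r.get("Date") else None)
registry.register(10, "header", lambda r: "header" if r.get("Date") == "Date" else None)

header_label = registry.classify({"Date": "Date"})
row_label = registry.classify({"Date": "01/03/2026"})
```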

---

## [0.1.0] — 2026-03-24

### Added (v1.0 — Architecture RFC)
- **`extraction/word_utils.py`** foundation work — `RowClassifier` chain injected as shared dependency (issue #17, PR #22).
- **`PDFTableExtractor` decomposed** into `PageHeaderAnalyser`, `RowBuilder`, and `RowPostProcessor` (issue #18, PR #23).
- **Facade passthroughs deleted** — `content_analysis_facade.py`, `validation_facade.py`, `row_classification_facade.py` removed; service→shim circular import chain broken (issue #20, Phase 20).
- **`pdf_table_extractor.py` shim** rewired to module-level singletons; `pdf_extractor.py` cleaned of four lazy facade imports.
- Architecture guard test `test_facade_modules_deleted` added.

### Changed
- Credit card templates (`aib_credit_card.json`, `credit_card_default.json`) removed from open-source repo; credit card support is PAID tier only via `require_iban=False` in `Entitlements.paid_tier()`.
75 changes: 56 additions & 19 deletions docs/architecture.md
@@ -9,8 +9,8 @@ This document describes the structure of the `bankstatementprocessor` monorepo a
```
bankstatementprocessor/
├── packages/
│ ├── parser-core/ PyPI: bankstatements-core
│ └── parser-free/ PyPI: bankstatements-free
│ ├── parser-core/ PyPI: bankstatements-core (v0.1.2)
│ └── parser-free/ PyPI: bankstatements-free (v0.1.0)
├── templates/ shared IBAN-based bank templates
└── .github/workflows/
├── ci.yml lint + test both packages
```
@@ -22,15 +22,16 @@ bankstatementprocessor/

The shared parsing library. Contains:

- **`extraction/`** — PDF → rows pipeline (`pdf_extractor`, `boundary_detector`, `row_classifiers`)
- **`services/`** — 21 single-responsibility services (duplicate detection, sorting, monthly summary, GDPR audit log, etc.)
- **`extraction/`** — PDF → rows pipeline (`pdf_extractor`, `boundary_detector`, `row_classifiers`, `word_utils`)
- **`services/`** — single-responsibility services (duplicate detection, sorting, filtering, monthly summary, GDPR audit log, etc.)
- **`builders/`** — `BankStatementProcessorBuilder` fluent builder
- **`templates/`** — template model, registry, detectors, and bundled IBAN-based bank templates
- **`domain/`** — domain models, protocols, currency, dataframe utilities
- **`config/`** — `AppConfig` dataclass validated from environment variables
- **`domain/`** — domain models (`Transaction`, `ExtractionResult`), protocols, currency, converters, dataframe utilities
- **`config/`** — `AppConfig` dataclass validated from environment variables; `ProcessorConfig` for programmatic use
- **`patterns/`** — Strategy, Factory, Repository implementations
- **`facades/`** — `BankStatementProcessingFacade` (main orchestrator)
- **`facades/`** — `BankStatementProcessingFacade` (main orchestrator entry point)
- **`entitlements.py`** — `Entitlements` frozen dataclass (`free_tier()` and `paid_tier()`)
- **`processor.py`** — `BankStatementProcessor` (PDF extraction → dedup → sort → output)
- **`processor.py`** — `BankStatementProcessor` (PDF extraction → filter → dedup → sort → output)

This package has no dependency on any licensing code. The `paid_tier()` entitlement is defined here because it describes a feature set (`require_iban=False`), not access control — activating it requires a valid signed license issued externally.
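
A minimal sketch of a frozen entitlements dataclass along these lines. Only `require_iban=False` for `paid_tier()` is stated in the text; the single-field shape and the `free_tier()` value of `True` are assumptions.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Entitlements:
    # Assumed single flag; the real class may carry more fields.
    require_iban: bool = True

    @classmethod
    def free_tier(cls):
        # Assumption: the free tier only accepts statements with an IBAN.
        return cls(require_iban=True)

    @classmethod
    def paid_tier(cls):
        return cls(require_iban=False)


free = Entitlements.free_tier()
paid = Entitlements.paid_tier()

try:
    free.require_iban = False  # frozen dataclasses reject mutation
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because the instance is frozen, a caller cannot flip `require_iban` after construction; a different tier means constructing a different object.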

@@ -47,21 +48,36 @@ The free tier processes bank statements that include an IBAN pattern. Credit car

## Processing Pipeline

The core flow is the same across all distributions:

```
app.py
app.py / ProcessorFactory
└── BankStatementProcessingFacade.process_with_error_handling()
└── BankStatementProcessor
├── PDFExtractor (page iteration)
│ └── BoundaryDetector
│ └── RowClassifiers (Chain of Responsibility)
├── DuplicateDetectionService
├── SortingService
└── OutputService (CSV / JSON / Excel)
└── BankStatementProcessor.run()
├── PDFProcessingOrchestrator.process_all_pdfs()
│ └── ExtractionOrchestrator.extract_from_pdf()
│ └── BankStatementProcessingFacade.extract_tables_from_pdf()
│ └── PDFTableExtractor.extract() → ExtractionResult
│ ├── BoundaryDetector (word_utils)
│ ├── RowClassifiers (chain of responsibility)
│ └── RowBuilder (word_utils)
│ └── TransactionFilterService.apply_all_filters()
│ ├── filter_empty_rows
│ ├── filter_header_rows
│ └── filter_invalid_dates
└── ServiceRegistry.process_transaction_group()
├── EnrichmentService (Filename, document_type, transaction_type)
├── DuplicateDetectionService
├── TransactionSortingService
└── OutputService (CSV / JSON / Excel)
```

`AppConfig` (from environment variables) is the single source of truth for runtime configuration. Use `get_config_singleton()` to access it.
`ExtractionResult` is the typed boundary between the extraction layer and the service layer:
- Produced by `PDFTableExtractor.extract()` and propagated unchanged through `ExtractionOrchestrator` and `PDFProcessingOrchestrator`
- Fields: `transactions: list[Transaction]`, `page_count: int`, `iban: str | None`, `source_file: Path`, `warnings: list[str]`
- `processor.run()` converts `result.transactions` to `list[dict]` via `transactions_to_dicts()` before handing off to `ServiceRegistry`
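
The boundary described above might be sketched as follows. The `ExtractionResult` field names and types are taken from the list; the reduced `Transaction` and the body of `transactions_to_dicts()` are assumptions about behaviour, not the project's code.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Transaction:
    # Reduced stand-in; the real model has more fields.
    date: str
    details: str


@dataclass
class ExtractionResult:
    transactions: list[Transaction]
    page_count: int
    iban: str | None
    source_file: Path
    warnings: list[str] = field(default_factory=list)


def transactions_to_dicts(transactions: list[Transaction]) -> list[dict]:
    # Assumed behaviour: one plain dict per transaction.
    return [asdict(t) for t in transactions]


result = ExtractionResult(
    transactions=[Transaction("01/03/2026", "Coffee")],
    page_count=1,
    iban=None,
    source_file=Path("statement.pdf"),
)
rows = transactions_to_dicts(result.transactions)
```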

`ServiceRegistry` is the wiring point for all post-extraction services. It is constructed by `BankStatementProcessorBuilder.build()` via `ServiceRegistry.from_config()`, which accepts optional injected services to override defaults — enabling custom duplicate strategies and sort orders.

`AppConfig` (from environment variables) is the single source of truth for runtime configuration via Docker/CLI. Use `get_config_singleton()` to access it. For programmatic use, `ProcessorConfig` is constructed directly by the builder.

---

@@ -112,6 +128,27 @@ The free-tier CLI always calls `free_tier()`. The premium distribution validates

---

## ServiceRegistry

`ServiceRegistry` centralises all transaction-processing service wiring. It is the single construction point for `DuplicateDetectionService`, `TransactionSortingService`, and `IBANGroupingService`.

```python
# Default construction (services built from config)
registry = ServiceRegistry.from_config(config, entitlements=entitlements)

# Custom strategy injection (builder passes these in)
registry = ServiceRegistry.from_config(
    config,
    entitlements=entitlements,
    duplicate_detector=DuplicateDetectionService(my_strategy),
    sorting_service=TransactionSortingService(my_sort_strategy),
)
```

`BankStatementProcessorBuilder` constructs services from its configured strategies before calling `from_config()`, so `.with_duplicate_strategy()` and `.with_date_sorting()` are guaranteed to be honoured.
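
One way the override behaviour could work is sketched below. The keyword names `duplicate_detector` and `sorting_service` match the example above; the stub service classes, their string `strategy` argument, and the fallback logic are assumptions.

```python
from dataclasses import dataclass


class DuplicateDetectionService:
    def __init__(self, strategy="default"):
        self.strategy = strategy  # stub: the real service takes a strategy object


class TransactionSortingService:
    def __init__(self, strategy="default"):
        self.strategy = strategy  # stub


@dataclass
class ServiceRegistry:
    duplicate_detector: DuplicateDetectionService
    sorting_service: TransactionSortingService

    @classmethod
    def from_config(cls, config, entitlements=None,
                    duplicate_detector=None, sorting_service=None):
        # Build a default only when the caller injected nothing.
        if duplicate_detector is None:
            duplicate_detector = DuplicateDetectionService()
        if sorting_service is None:
            sorting_service = TransactionSortingService()
        return cls(duplicate_detector, sorting_service)


default_registry = ServiceRegistry.from_config(config={})
custom_registry = ServiceRegistry.from_config(
    config={},
    duplicate_detector=DuplicateDetectionService("fuzzy"),
)
```

Checking `is None` rather than truthiness means an injected service is never replaced by a default, even if it happens to evaluate as falsy.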

---

## Premium Distribution

A separate premium distribution extends the open-source packages with: