From 3b96693b4ea259b67fe5612de55734b56f904d14 Mon Sep 17 00:00:00 2001
From: longieirl
Date: Wed, 25 Mar 2026 17:25:56 +0000
Subject: [PATCH] docs: populate CHANGELOG and update architecture for v1.2 (PRs #56-#58)

---
 CHANGELOG.md         | 45 ++++++++++++++++++++++++++
 docs/architecture.md | 75 +++++++++++++++++++++++++++++++++-----------
 2 files changed, 101 insertions(+), 19 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 11bddf3..fdc1b02 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,3 +6,48 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 ## [Unreleased]
+
+---
+
+## [0.1.2] — 2026-03-25
+
+### Fixed
+- **#47** — `filter_service.apply_all_filters()` result was computed and logged but silently discarded. Filtered rows are now written back to `result.transactions` in `PDFProcessingOrchestrator.process_all_pdfs()`, so `filter_empty_rows`, `filter_header_rows`, and `filter_invalid_dates` are applied to every successfully extracted PDF.
+- **#52** — `BankStatementProcessorBuilder.with_duplicate_strategy()` and `.with_date_sorting()` were inert: `build()` called `ServiceRegistry.from_config()` with no services, causing the registry to create its own defaults and silently ignore the configured strategy. The builder now constructs `DuplicateDetectionService` and `TransactionSortingService` from its configured values and passes them explicitly into `ServiceRegistry.from_config()`.
+- **#55** — Credit card / no-IBAN PDFs excluded from the `pdfs_extracted` count in processing output. `process_all_pdfs()` now returns a 3-tuple `(results, pdf_count, pages_read)`.
+
+### Changed (architecture cleanup — PRs #56, #57)
+- **#49** — `ChronologicalSortingStrategy` sorts dicts directly via `DateParserService`, removing a redundant `Transaction` round-trip.
+- **#48** — Deferred circular imports in `processor.py` removed; `service_registry`, `monthly_summary`, and `expense_analysis` import `ColumnAnalysisService`/`DateParserService` directly at module level.
+- **#50** — `TransactionClassifier._looks_like_date` delegates to `RowAnalysisService.looks_like_date`, removing a duplicate regex and fixing a subtle 1-or-2-digit day matching bug.
+- **#51** — `ProcessorFactory.create_from_config()` builds `ProcessorConfig` in one block via `BankStatementProcessorBuilder.with_processor_config()`; new config knobs now touch ≤2 files.
+
+---
+
+## [0.1.1] — 2026-03-25
+
+### Added (v1.1 — Transaction Pipeline & Word Utils)
+- **Transaction enrichment** (`source_page: int | None`, `confidence_score: float`, `extraction_warnings: list[str]`) — all three fields default correctly and survive `to_dict` / `from_dict` round-trips (#16 / Phase 21).
+- **`ExtractionResult` dataclass** (`domain/models/extraction_result.py`) — typed extraction boundary with `transactions`, `page_count`, `iban`, `source_file`, and `warnings` fields. Architecture guard test enforces placement in `domain/models/` (#16 / Phase 22).
+- **End-to-end `ExtractionResult` pipeline** — `PDFTableExtractor.extract()`, `ExtractionOrchestrator`, `PDFProcessingOrchestrator`, and `processor` all produce and consume `ExtractionResult`; zero tuple-index unpacking remains (#16 / Phase 23).
+- **`extraction/word_utils.py`** — canonical module for `group_words_by_y`, `assign_words_to_columns` (with `strict_rightmost` flag), and `calculate_column_coverage`. Five callers migrated; four private duplicate methods deleted (#21 / Phase 24).
+
+### Changed
+- **ServiceRegistry** introduced (`feat/28`, PR #44) — `ServiceRegistry.from_config(ProcessorConfig, Entitlements)` wires all transaction-processing services. `TransactionProcessingOrchestrator` deleted (PR #46 / issue #45).
+- **ClassifierRegistry** with explicit integer priorities added to `row_classifiers.py` (fix/29, PR #39).
+- **`recursive_scan` default** changed `False → True` in `ProcessingConfig`, `AppConfig`, `ProcessorBuilder`, and `PDFDiscoveryService`; `RECURSIVE_SCAN` env var added to `docker-compose.yml` (fix/40, PR #41).
+- **`ScoringConfig` injectable** via `BankStatementProcessorBuilder.with_scoring_config()` (feat/32, PR #36).
+
+---
+
+## [0.1.0] — 2026-03-24
+
+### Added (v1.0 — Architecture RFC)
+- **`extraction/word_utils.py`** foundation work — `RowClassifier` chain injected as shared dependency (issue #17, PR #22).
+- **`PDFTableExtractor` decomposed** into `PageHeaderAnalyser`, `RowBuilder`, and `RowPostProcessor` (issue #18, PR #23).
+- **Facade passthroughs deleted** — `content_analysis_facade.py`, `validation_facade.py`, `row_classification_facade.py` removed; service→shim circular import chain broken (issue #20, Phase 20).
+- **`pdf_table_extractor.py` shim** rewired to module-level singletons; `pdf_extractor.py` cleaned of four lazy facade imports.
+- Architecture guard test `test_facade_modules_deleted` added.
+
+### Changed
+- Credit card templates (`aib_credit_card.json`, `credit_card_default.json`) removed from open-source repo; credit card support is PAID tier only via `require_iban=False` in `Entitlements.paid_tier()`.
diff --git a/docs/architecture.md b/docs/architecture.md
index a62d90e..ef63929 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -9,8 +9,8 @@ This document describes the structure of the `bankstatementprocessor` monorepo a
 ```
 bankstatementprocessor/
 ├── packages/
-│   ├── parser-core/    PyPI: bankstatements-core
-│   └── parser-free/    PyPI: bankstatements-free
+│   ├── parser-core/    PyPI: bankstatements-core (v0.1.2)
+│   └── parser-free/    PyPI: bankstatements-free (v0.1.0)
 ├── templates/          shared IBAN-based bank templates
 └── .github/workflows/
     ├── ci.yml          lint + test both packages
@@ -22,15 +22,16 @@ bankstatementprocessor/
 
 The shared parsing library. Contains:
 
-- **`extraction/`** — PDF → rows pipeline (`pdf_extractor`, `boundary_detector`, `row_classifiers`)
-- **`services/`** — 21 single-responsibility services (duplicate detection, sorting, monthly summary, GDPR audit log, etc.)
+- **`extraction/`** — PDF → rows pipeline (`pdf_extractor`, `boundary_detector`, `row_classifiers`, `word_utils`)
+- **`services/`** — single-responsibility services (duplicate detection, sorting, filtering, monthly summary, GDPR audit log, etc.)
+- **`builders/`** — `BankStatementProcessorBuilder` fluent builder
 - **`templates/`** — template model, registry, detectors, and bundled IBAN-based bank templates
-- **`domain/`** — domain models, protocols, currency, dataframe utilities
-- **`config/`** — `AppConfig` dataclass validated from environment variables
+- **`domain/`** — domain models (`Transaction`, `ExtractionResult`), protocols, currency, converters, dataframe utilities
+- **`config/`** — `AppConfig` dataclass validated from environment variables; `ProcessorConfig` for programmatic use
 - **`patterns/`** — Strategy, Factory, Repository implementations
-- **`facades/`** — `BankStatementProcessingFacade` (main orchestrator)
+- **`facades/`** — `BankStatementProcessingFacade` (main orchestrator entry point)
 - **`entitlements.py`** — `Entitlements` frozen dataclass (`free_tier()` and `paid_tier()`)
-- **`processor.py`** — `BankStatementProcessor` (PDF extraction → dedup → sort → output)
+- **`processor.py`** — `BankStatementProcessor` (PDF extraction → filter → dedup → sort → output)
 
 This package has no dependency on any licensing code. The `paid_tier()` entitlement is defined here because it describes a feature set (`require_iban=False`), not access control — activating it requires a valid signed license issued externally.
 
@@ -47,21 +48,36 @@ The free tier processes bank statements that include an IBAN pattern. Credit car
 
 ## Processing Pipeline
 
-The core flow is the same across all distributions:
-
 ```
-app.py
+app.py / ProcessorFactory
   └── BankStatementProcessingFacade.process_with_error_handling()
-        └── BankStatementProcessor
-              ├── PDFExtractor (page iteration)
-              │     └── BoundaryDetector
-              │           └── RowClassifiers (Chain of Responsibility)
-              ├── DuplicateDetectionService
-              ├── SortingService
-              └── OutputService (CSV / JSON / Excel)
+        └── BankStatementProcessor.run()
+              ├── PDFProcessingOrchestrator.process_all_pdfs()
+              │     └── ExtractionOrchestrator.extract_from_pdf()
+              │           └── BankStatementProcessingFacade.extract_tables_from_pdf()
+              │                 └── PDFTableExtractor.extract() → ExtractionResult
+              │                       ├── BoundaryDetector (word_utils)
+              │                       ├── RowClassifiers (chain of responsibility)
+              │                       └── RowBuilder (word_utils)
+              │     └── TransactionFilterService.apply_all_filters()
+              │           ├── filter_empty_rows
+              │           ├── filter_header_rows
+              │           └── filter_invalid_dates
+              └── ServiceRegistry.process_transaction_group()
+                    ├── EnrichmentService (Filename, document_type, transaction_type)
+                    ├── DuplicateDetectionService
+                    ├── TransactionSortingService
+                    └── OutputService (CSV / JSON / Excel)
 ```
 
-`AppConfig` (from environment variables) is the single source of truth for runtime configuration. Use `get_config_singleton()` to access it.
+`ExtractionResult` is the typed boundary between the extraction layer and the service layer:
+- Produced by `PDFTableExtractor.extract()` and propagated unchanged through `ExtractionOrchestrator` and `PDFProcessingOrchestrator`
+- Fields: `transactions: list[Transaction]`, `page_count: int`, `iban: str | None`, `source_file: Path`, `warnings: list[str]`
+- `processor.run()` converts `result.transactions` to `list[dict]` via `transactions_to_dicts()` before handing off to `ServiceRegistry`
+
+`ServiceRegistry` is the wiring point for all post-extraction services. It is constructed by `BankStatementProcessorBuilder.build()` via `ServiceRegistry.from_config()`, which accepts optional injected services to override defaults — enabling custom duplicate strategies and sort orders.
+
+`AppConfig` (from environment variables) is the single source of truth for runtime configuration via Docker/CLI. Use `get_config_singleton()` to access it. For programmatic use, `ProcessorConfig` is constructed directly by the builder.
 
 ---
 
@@ -112,6 +128,27 @@ The free-tier CLI always calls `free_tier()`. The premium distribution validates
 
 ---
 
+## ServiceRegistry
+
+`ServiceRegistry` centralises all transaction-processing service wiring. It is the single construction point for `DuplicateDetectionService`, `TransactionSortingService`, and `IBANGroupingService`.
+
+```python
+# Default construction (services built from config)
+registry = ServiceRegistry.from_config(config, entitlements=entitlements)
+
+# Custom strategy injection (builder passes these in)
+registry = ServiceRegistry.from_config(
+    config,
+    entitlements=entitlements,
+    duplicate_detector=DuplicateDetectionService(my_strategy),
+    sorting_service=TransactionSortingService(my_sort_strategy),
+)
+```
+
+`BankStatementProcessorBuilder` constructs services from its configured strategies before calling `from_config()`, so `.with_duplicate_strategy()` and `.with_date_sorting()` are guaranteed to be honoured.
+
+---
+
 ## Premium Distribution
 
 A separate premium distribution extends the open-source packages with: