Releases: sebastienrousseau/bankstatementparser
v0.0.8 — Full Platform
Full Platform. Closes every gap identified in the competitive analysis. Five new features: multi-currency balance verification, hledger/beancount export, bulk directory scanner, account mapping rules, and a REST API microservice.
New features
1. Multi-currency balance verification
```python
from decimal import Decimal

from bankstatementparser.hybrid import verify_balance_multi_currency

results = verify_balance_multi_currency(transactions, balances={
    "GBP": (Decimal("500"), Decimal("570")),
    "EUR": (Decimal("1000"), Decimal("1150")),
})
```

Groups by `Transaction.currency` and runs the Golden Rule independently per group. No more false `DISCREPANCY` on multi-currency statements.
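For intuition, the per-currency grouping can be sketched in a few lines. `verify_multi` below is an illustrative stand-in, not the library's `verify_balance_multi_currency`; transactions are simplified to `(currency, amount)` tuples rather than the real `Transaction` model.

```python
from collections import defaultdict
from decimal import Decimal

def verify_multi(transactions, balances):
    """Run the Golden Rule (opening + sum(tx) == closing) per currency."""
    sums = defaultdict(lambda: Decimal("0"))
    for currency, amount in transactions:
        sums[currency] += amount
    return {
        ccy: "VERIFIED" if opening + sums[ccy] == closing else "DISCREPANCY"
        for ccy, (opening, closing) in balances.items()
    }

txs = [("GBP", Decimal("70")), ("EUR", Decimal("150"))]
print(verify_multi(txs, {
    "GBP": (Decimal("500"), Decimal("570")),
    "EUR": (Decimal("1000"), Decimal("1150")),
}))  # {'GBP': 'VERIFIED', 'EUR': 'VERIFIED'}
```

Because each currency is checked against its own opening/closing pair, a GBP-only mismatch can no longer poison the EUR verdict.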
2. hledger + beancount export
```python
from pathlib import Path

from bankstatementparser.export import to_hledger, to_beancount

Path("journal.ledger").write_text(to_hledger(transactions))
Path("journal.beancount").write_text(to_beancount(transactions))
```

Uses `Transaction.category` as the contra-account when set by the enrichment module. Zero external dependencies.
3. Bulk directory scanner
```python
from bankstatementparser.hybrid import scan_and_ingest

batch = scan_and_ingest("statements/", pattern="**/*.pdf")
print(f"{batch.file_count} files, {batch.total_unique} unique transactions")
```

Scans a folder tree, runs `smart_ingest` on every match, and deduplicates across the entire batch. Supports `seen_hashes` for cross-batch persistence.
4. Account mapping rules
```python
from bankstatementparser.enrichment import AccountMapper

mapper = AccountMapper.from_json("mapping.json")
accounts = mapper.map_batch(transactions)
```

Ordered regex rules, first match wins, loaded from a JSON config. Pairs with the ledger exporter for end-to-end plain-text-accounting workflows.
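The first-match-wins semantics can be sketched as follows. The rule list and `map_description` helper are hypothetical illustrations of the idea, not the library's `AccountMapper` internals or its actual `mapping.json` schema.

```python
import re

# Illustrative ordered rules: earlier entries win; the catch-all comes last.
RULES = [
    (r"AMZN|AMAZON", "Expenses:Shopping"),
    (r"TFL|TRAINLINE", "Expenses:Transport"),
    (r".*", "Expenses:Uncategorised"),  # catch-all must come last
]

def map_description(description: str) -> str:
    """Return the account for the first rule whose pattern matches."""
    for pattern, account in RULES:
        if re.search(pattern, description, re.IGNORECASE):
            return account
    return "Expenses:Uncategorised"

print(map_description("AMZN MKTPLACE"))  # Expenses:Shopping
print(map_description("TFL TRAVEL"))     # Expenses:Transport
```

Ordering matters: a broad catch-all placed first would shadow every later rule, which is why the rules are evaluated in config order.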
5. REST API
```bash
pip install 'bankstatementparser[api]'
bankstatementparser-api --port 8000

# POST a file, get JSON back
curl -F file=@statement.pdf http://localhost:8000/ingest
curl http://localhost:8000/health
```

FastAPI microservice with `/ingest` and `/health` endpoints. Default bind is 127.0.0.1 (safe); use `--host 0.0.0.0` for containers. Gated behind the `[api]` extra.
Install
```bash
pip install bankstatementparser                    # core
pip install 'bankstatementparser[hybrid]'          # + text-LLM
pip install 'bankstatementparser[hybrid-vision]'   # + vision
pip install 'bankstatementparser[enrichment]'      # + categorization
pip install 'bankstatementparser[api]'             # + REST API
```

Test plan
- 723 tests at 100% line + branch coverage
- `mypy --strict` clean on 29 source files
- `ruff check` + `bandit -r` clean
- Python 3.14 asyncio compatibility fix included
- 44 CI checks pass
Full changelog: CHANGELOG.md
Pull request: #52
v0.0.7 — Universal Vision
Universal Vision. Turns the local Ollama vision path from 🔴 (600 s LiteLLM timeout, hallucinated output) to 🟢 (all 11 rows extracted in ~33 s, correct currency and balances). Three independent improvements, all verified end-to-end against real local Ollama models on Apple Silicon.
What's new
1. Direct Ollama bridge — bankstatementparser.hybrid.ollama_direct
```python
# Auto-selected for any ollama/* model — zero opt-in needed
from bankstatementparser.hybrid import smart_ingest

result = smart_ingest("scan.pdf")  # just works: ~33 s instead of a 600 s timeout
```

A ~220-line drop-in replacement for `litellm.completion` that targets Ollama's `/api/chat` endpoint via httpx. Sidesteps the upstream LiteLLM ↔ Ollama integration bug where vision calls with long structured-JSON system prompts hang until the 600 s timeout.
- `ollama_direct_completion(**kwargs)` — accepts OpenAI-style messages (including multimodal `image_url` blocks) and returns an OpenAI-style response envelope
- `is_ollama_model(model)` — returns `True` for `ollama/<name>` or `ollama_chat/<name>`
- Auto-selection in both `VisionExtractor` and `LLMExtractor` — no user action required
- No new dependencies — `httpx` is already a transitive dependency of LiteLLM in `[hybrid]`
2. ollama/minicpm-v recommended default
```bash
ollama pull minicpm-v
export BSP_HYBRID_VISION_MODEL=ollama/minicpm-v
```

minicpm-v:8b (5.5 GB) is explicitly trained for OCR and document understanding. It replaces `ollama/llava:7b`, a general-purpose multimodal model that was not designed for dense statement tables.
| Model | Result on synthetic scanned PDF |
|---|---|
| `ollama/llava:7b` | 🔴 Hallucinates INR currency, fabricated rows |
| `ollama/minicpm-v:8b` | 🟢 All 11 transactions, GBP, balances correct, ~33 s |
3. Strip mode — VisionExtractor(strip_rows=True)
```python
from bankstatementparser.hybrid import VisionExtractor, smart_ingest

vision = VisionExtractor(strip_rows=True, n_strips=4)
result = smart_ingest("dense_statement.pdf", vision_extractor=vision)
```

Splits each page into N overlapping horizontal strips (default 4, 10% overlap). The header strip extracts balances, body strips extract transactions, and results are merged by `transaction_hash`. Designed for dense pages (≥ 15 rows) where small local models can't process the full page: CLIP's 336×336 internal downscale destroys fine table detail on a full A4 page but preserves it on a strip.
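The strip geometry is easy to picture: each strip covers 1/N of the page height plus a small overlap into its neighbours. The helper below is a sketch of that arithmetic (overlap interpreted as a fraction of one strip's height on each side, which is an assumption), not the library's implementation.

```python
def strip_bounds(n_strips: int = 4, overlap: float = 0.10) -> list[tuple[float, float]]:
    """Return (top, bottom) page-height fractions for n horizontal strips,
    each extended by `overlap` of a strip's height on both sides."""
    height = 1.0 / n_strips
    bounds = []
    for i in range(n_strips):
        top = max(0.0, i * height - overlap * height)
        bottom = min(1.0, (i + 1) * height + overlap * height)
        bounds.append((round(top, 3), round(bottom, 3)))
    return bounds

print(strip_bounds())
# [(0.0, 0.275), (0.225, 0.525), (0.475, 0.775), (0.725, 1.0)]
```

The overlap ensures a transaction row straddling a strip boundary appears whole in at least one strip; the hash-based merge then drops the duplicate copy.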
Smoke-test results
| Path | Model | Mode | Result |
|---|---|---|---|
| Text-LLM | `ollama/llama3` | single-shot | ✅ All 11 rows, VERIFIED, ~25 s |
| Vision-LLM | `ollama/minicpm-v:8b` | single-shot | ✅ All 11 rows, GBP, ~33 s |
| Vision-LLM | `ollama/minicpm-v:8b` | `strip_rows=True` | ✅ Sign convention correct, ~43 s |
Install
```bash
pip install 'bankstatementparser[hybrid-vision]'
```

Migration from v0.0.6
Fully backwards compatible. Existing code keeps working — it just runs faster. Three opt-in upgrade patterns:
```python
import os

from bankstatementparser.hybrid import VisionExtractor, smart_ingest

# 1. Do nothing — the auto-bridge activates for ollama/* models
result = smart_ingest("scan.pdf")

# 2. Switch to minicpm-v
os.environ["BSP_HYBRID_VISION_MODEL"] = "ollama/minicpm-v"

# 3. Enable strip mode for dense pages
vision = VisionExtractor(strip_rows=True, n_strips=4)
result = smart_ingest("dense.pdf", vision_extractor=vision)
```

Test plan
- 677 tests at 100% line + branch coverage (up from 649 on v0.0.6)
- `mypy --strict` clean on 24 source files
- `ruff check` + `bandit -r` clean
- 32 docs accuracy tests all pass
- All examples verified end-to-end
- 44 CI checks pass
Full changelog
See CHANGELOG.md for the complete v0.0.7 entry.
Pull request: #51 (8 commits, all SSH-signed)
v0.0.6 — Intelligence Layer
Intelligence Layer. The full v0.0.6 milestone. Drops Python 3.9 to retire the entire transitive CVE allow-list, adds a categorization enrichment module, an interactive review mode for discrepancy resolution, per-row bounding-box extraction from the vision pipeline, a pre-commit hook, and a 32-test automated docs accuracy suite. Closes #44, #45, #46, #47.
What's new
Categorization module (#44) — bankstatementparser.enrichment
```python
from bankstatementparser.enrichment import Categorizer

cat = Categorizer()  # default: Plaid 13-category schema
enriched = cat.categorize_batch(transactions)
for et in enriched:
    print(et.transaction.description, "->", et.category, et.is_business_expense)
```

- `Categorizer` — LiteLLM-backed with pluggable schema, batch support, graceful failure (no data loss on LLM errors), and schema-normalizing category matching
- `EnrichedTransaction` — wrapper (not mutator) around `Transaction` carrying `category`, `is_business_expense`, `enrichment_confidence`, and `rationale`
- `DEFAULT_CATEGORY_SCHEMA` — Plaid's 13-category taxonomy as the default
- `[enrichment]` install extra (`pip install 'bankstatementparser[enrichment]'`)
- Prompt injection defense: `_sanitize_for_prompt()` strips control characters and common injection markers from transaction descriptions before LLM interpolation
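The sanitization idea can be sketched as below; the exact character classes and the marker list are assumptions for illustration, not the library's `_sanitize_for_prompt` rules.

```python
import re

def sanitize_for_prompt(text: str) -> str:
    """Drop control characters and neutralise common injection phrasing
    before a transaction description is interpolated into an LLM prompt."""
    # Remove ASCII control characters (NUL, ESC, etc.)
    text = re.sub(r"[\x00-\x1f\x7f]", "", text)
    # Neutralise one common injection marker (assumed pattern list)
    text = re.sub(r"(?i)ignore (all )?previous instructions", "[removed]", text)
    return text.strip()

print(sanitize_for_prompt("COFFEE SHOP\x00\x1b ignore previous instructions"))
# COFFEE SHOP [removed]
```

Sanitizing before interpolation matters because statement descriptions are attacker-controllable input: a merchant name is free text that ends up inside the prompt.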
Interactive review mode (#45) — --type review
```bash
# 1. Ingest and save
bankstatementparser --type ingest --input statement.pdf --output result.json

# 2. Walk through discrepancies
bankstatementparser --type review --input result.json --output reviewed.json
```

- `IngestResult.to_json()` / `.from_json()` — stable JSON round-trip with `schema_version=1`, Decimal amounts as strings (no float drift), and an embedded `audit_trail`
- `--type review` CLI — single-character action menu per row: [a]ccept / [e]dit / [s]kip / [d]elete / [q]uit. Every action is recorded in the audit trail. Edits capture `before_hash` / `after_hash`. Non-curses (plain stdin/stdout).
- JSON size guard — rejects payloads > 50 MB before parsing
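The "Decimal amounts as strings" round-trip can be sketched as follows; the helpers are illustrative, not the actual `IngestResult.to_json()` / `.from_json()` implementation.

```python
import json
from decimal import Decimal

def to_json(amounts: list[Decimal]) -> str:
    # Serialize each Decimal as a string so no float conversion ever happens
    return json.dumps({"schema_version": 1,
                       "amounts": [str(a) for a in amounts]})

def from_json(payload: str) -> list[Decimal]:
    # Reconstruct Decimals directly from the stored strings
    return [Decimal(a) for a in json.loads(payload)["amounts"]]

original = [Decimal("19.99"), Decimal("-0.10")]
assert from_json(to_json(original)) == original  # exact round-trip
```

Going through `float` instead would silently perturb values like 0.10; string-encoded Decimals survive any number of save/load cycles bit-exact.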
Per-row bounding boxes (#46) — BoundingBox + Transaction.source_bbox
```python
for tx in result.transactions:
    if tx.source_bbox:
        print(f"Row at ({tx.source_bbox.x0:.2f}, {tx.source_bbox.y0:.2f})")
```

- `BoundingBox` Pydantic model with normalized (0.0–1.0) coordinates and `page_index`, exported from the top-level package
- `Transaction.source_bbox` — populated by the vision path when the model returns spatial coordinates
- Inverted-box validation — a `model_validator` rejects `x0 > x1` or `y0 > y1`
- Vision prompt updated to request per-row bounding boxes in the JSON schema
Python 3.9 retirement (#47)
- Minimum Python bumped to 3.10 (Python 3.9 reached EOL 2025-10-31)
- All 9 transitive CVE allow-list entries deleted — every vulnerable package now resolves to its patched series:
| Package | v0.0.5 | v0.0.6 | Advisories closed |
|---|---|---|---|
| litellm | 1.80.0 | 1.83.4 | GHSA-jjhc-v7c2-5hh6, GHSA-53mr-6c8q-9789, GHSA-69x8-hrgq-fjj8 |
| cryptography | 43.0.3 | 46.0.7 | GHSA-r6ph-v2qm-q3c2, GHSA-79v4-65xg-pq4g, GHSA-m959-cc7f-wv43 |
| pillow | 11.3.0 | 12.2.0 | GHSA-cfh3-3jmp-rvhc |
| filelock | 3.19.1 | 3.25.2 | GHSA-w853-jp5j-5j7f, GHSA-qmgc-5h2g-mvrw |
| requests | 2.32.5 | 2.33.1 | GHSA-gc5v-m9x4-r6x2 |
Security hardening
- Prompt injection defense in the enrichment categorizer (`_sanitize_for_prompt`)
- JSON deserialization size guard (50 MB cap in `IngestResult.from_json`)
- Frozen-dataclass immutability fix — `IngestResult` fields changed from `list` to `tuple`
- BoundingBox inverted-box validation via a Pydantic `model_validator`
- Duplicate-index warning when the LLM returns the same row index twice
Developer experience
- Pre-commit hook (`.githooks/pre-commit`) runs `make verify` (ruff + mypy + pytest + bandit) before every commit. Setup: `make install-hooks`
- Automated docs accuracy test suite (`test_docs_accuracy.py`, 32 tests) validates every factual claim in README, FAQ, CHANGELOG, CONTRIBUTING, and SECURITY against the actual codebase
- Modernised Makefile with `install`, `install-all`, `install-hooks`, `test`, `lint`, `typecheck`, `security`, `verify`, `dist`, `release` targets
- PowerShell CLI walkthrough (`06_cli_walkthrough.ps1`) for native Windows users
Install
```bash
pip install bankstatementparser                    # core (deterministic parsers)
pip install 'bankstatementparser[hybrid]'          # + text-LLM for digital PDFs
pip install 'bankstatementparser[hybrid-vision]'   # + vision for scanned PDFs
pip install 'bankstatementparser[enrichment]'      # + categorization
```

Migration from v0.0.5
The public API is unchanged. v0.0.5 code runs on v0.0.6 without modification provided the interpreter is Python 3.10+. If you are on Python 3.9, pin to v0.0.5:
```
bankstatementparser==0.0.5
```
Test plan
- 649 tests at 100% line + branch coverage (up from 541 on v0.0.5)
- `mypy --strict` clean on 23 source files
- `ruff check` + `bandit -r` clean
- 44 CI checks pass on Python 3.10–3.14
- All hybrid examples verified end-to-end
- Deep-dive security + correctness audit completed with all findings fixed
Full changelog
See CHANGELOG.md for the complete v0.0.6 entry.
Pull request: #48 (15 commits, all SSH-signed)
v0.0.5 — Universal Extraction
Universal Extraction. Combines the deterministic reliability of the existing ISO/exchange-format parsers with an adaptive LLM layer for unstandardized PDFs, including a multimodal vision fallback for scanned/image-only statements. The core "data only, no inference" philosophy of the library is preserved — categorization and review-mode UI are intentionally deferred to v0.0.6.
Three extraction paths via smart_ingest()
| Path | Trigger | Cost | Module |
|---|---|---|---|
| A — Deterministic | `detect_statement_format()` returns a non-PDF format | $0, fastest | existing parsers |
| B — Text-LLM | PDF with ≥ 50 chars of extractable text | tokens | hybrid/llm_extractor.py |
| C — Vision-LLM | PDF below `LOW_TEXT_DENSITY_THRESHOLD` (scan/photo) | tokens + compute | hybrid/vision.py |

`IngestResult.source_method` is tagged with `"deterministic" | "llm" | "vision"` for full audit provenance on every row.
```python
from bankstatementparser.hybrid import smart_ingest

result = smart_ingest("statement.pdf")
print(result.source_method)         # "deterministic" | "llm" | "vision"
print(result.verification.status)   # VERIFIED | DISCREPANCY | FAILED
for tx in result.transactions:
    print(tx.transaction_hash, tx.amount, tx.description)
```

Install
```bash
# Core install — deterministic parsers only (zero AI dependencies)
pip install bankstatementparser

# Add the text-LLM path for digital PDFs
pip install 'bankstatementparser[hybrid]'

# Add higher-fidelity table extraction (adds pdfplumber)
pip install 'bankstatementparser[hybrid-plus]'

# Add the multimodal vision path for scanned/photocopied PDFs
pip install 'bankstatementparser[hybrid-vision]'
```

Every `[hybrid*]` extra is opt-in and pure-Python — no poppler, no system libraries, no GPU required. Works identically on macOS, Linux, and WSL.
Highlights
New bankstatementparser.hybrid subpackage
- `smart_ingest()` — single entry point that implements the three-path routing above. Auto-routes to vision when `pypdf` extracts fewer than `LOW_TEXT_DENSITY_THRESHOLD` (50) characters.
- `LLMExtractor` — LiteLLM-backed text extractor with provider-agnostic configuration via `BSP_HYBRID_MODEL`. Default model is `ollama/llama3` (local, private). Tolerant JSON parsing handles markdown fences and prose wrappers.
- `VisionExtractor` — multimodal extractor for scanned/image-only PDFs. Renders pages with `pypdfium2` (pure-Python wheel, no poppler dependency) and sends base64 PNGs via LiteLLM's multimodal payload. The vision model is opt-in only via `BSP_HYBRID_VISION_MODEL` — no surprise downloads.
- `verify_balance()` — Golden Rule integrity check returning `VERIFIED | DISCREPANCY | FAILED` with the exact delta when mismatched.
- Structured prompts that explicitly instruct the model to sort transactions chronologically, mitigating PDF reading-order issues.
Transaction model upgrades
- `transaction_hash` — computed field, MD5 of `date | normalized_description | amount`. Every row carries an immutable fingerprint for idempotent re-ingestion.
- `source_method` — `Literal["deterministic", "llm"]`, audit provenance per row.
- `confidence` — `Optional[float]`, populated for LLM rows.
- `category` and `raw_source_text` — reserved placeholders for the v0.0.6 "Intelligence Layer" release.
normalize_description() noise stripping
Strips inline dates (2026-04-01), times (12:49), and long alphanumeric IDs so that recurring charges hash identically. `AMZN MKTPLACE 2026-04-01 #A1B2C3` and `AMZN MKTPLACE 2026-04-02 #Z9Y8X7` collapse to the same normalized form, which means `dedupe_by_hash()` actually catches real duplicates instead of being defeated by one rotating reference character.
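A minimal sketch of that normalization, with assumed regex patterns (the library's actual rules may differ, e.g. in how reference IDs are recognized):

```python
import re

def normalize_description(text: str) -> str:
    """Strip per-occurrence noise so recurring charges hash identically."""
    text = re.sub(r"\d{4}-\d{2}-\d{2}", "", text)   # inline ISO dates
    text = re.sub(r"\b\d{1,2}:\d{2}\b", "", text)   # times
    text = re.sub(r"#[A-Z0-9]{4,}\b", "", text)     # #-prefixed reference IDs (assumed form)
    return re.sub(r"\s+", " ", text).strip()

a = normalize_description("AMZN MKTPLACE 2026-04-01 #A1B2C3")
b = normalize_description("AMZN MKTPLACE 2026-04-02 #Z9Y8X7")
assert a == b == "AMZN MKTPLACE"
```

Once the rotating parts are gone, the MD5 fingerprint over `date | normalized_description | amount` is stable across statements that render the same charge slightly differently.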
Deduplicator.dedupe_by_hash()
New strict identity filter using `Transaction.transaction_hash`, designed for incremental ingestion (syncing to Google Sheets / a database). Mutates a caller-owned `seen_hashes: set[str]` so consumers can persist state across batches. Coexists with the existing fuzzy/temporal `deduplicate()` method.
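The caller-owned `seen_hashes` pattern looks roughly like this; the function below is an illustration over plain strings, not the library's `Deduplicator.dedupe_by_hash` signature.

```python
import hashlib

def dedupe_by_hash(rows: list[str], seen_hashes: set[str]) -> list[str]:
    """Keep only rows whose hash has not been seen; mutate seen_hashes
    in place so state carries over to the next batch."""
    fresh = []
    for row in rows:
        h = hashlib.md5(row.encode()).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            fresh.append(row)
    return fresh

seen: set[str] = set()
print(dedupe_by_hash(["a", "b", "a"], seen))  # ['a', 'b']
print(dedupe_by_hash(["b", "c"], seen))       # ['c'] — 'b' was seen in batch 1
```

Because the caller owns the set, it can be serialized between runs (or stored next to a Google Sheet) so re-ingesting an overlapping statement adds nothing twice.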
CLI
```bash
bankstatementparser --type ingest --input statement.pdf [--output ledger.csv]
```

New `bankstatementparser` console-script entry point. Both forms work in parallel:

```bash
bankstatementparser --type ingest --input file.pdf
python -m bankstatementparser.cli --type ingest --input file.pdf
```

Graceful degradation when the `[hybrid]` extra is missing — surfaces the specific missing dependency name and prints a `pip install` hint.
Examples — examples/hybrid/
Eight new files including a Mermaid flow diagram, prerequisites table, 15-minute quick start, mock-vs-live mode comparison, cross-platform verification matrix, and troubleshooting table. generate_sample_pdfs.py produces reproducible synthetic UK-bank PDFs (digital + scanned) so the LLM examples are runnable without real bank PDFs. Each LLM example runs in two modes — MOCK (default, fully offline, CI-safe) and LIVE (set BSP_HYBRID_MODEL / BSP_HYBRID_VISION_MODEL).
See examples/hybrid/README.md for the full walkthrough.
Smoke-test results (real Ollama models, Apple Silicon, 2026-04-08)
| Path | Model | Result |
|---|---|---|
| A — Deterministic | n/a | ✅ CAMT.053 fixture, 3 transactions, all hashes computed |
| B — Text-LLM | `ollama/llama3` (4.7 GB) | ✅ All 11 transactions extracted with confidence=1.00, balance VERIFIED, ~25 s end-to-end |
| C — Vision-LLM | `ollama/llava:7b` (4.7 GB) | 🔴 Hallucinated output locally; hosted frontier vision models recommended instead (gpt-4o, claude-opus-4-6, gemini-2.5-pro) |
| Golden Rule | n/a | ✅ All three outcomes (VERIFIED, DISCREPANCY, FAILED) reproduce as documented |
| Dedupe | n/a | ✅ Recurring Amazon dup caught in batch 1, both already-seen rows caught in batch 2 |
| CLI `--type ingest` | n/a | ✅ Deterministic path produces expected DataFrame with all v0.0.5 columns |
Test plan
- 541 tests pass (up from 484 on v0.0.4)
- 100% line + branch coverage across the entire package, including the new hybrid subpackage
- `mypy --strict` clean on 21 source files
- `ruff check` clean on bankstatementparser/, tests/, and examples/
- `bandit -r` clean
- All optional dependencies monkeypatched in tests — CI does not require any `[hybrid*]` extra to be installed
- 48 CI checks green on the merge commit
Security
Allow-listed nine transitive CVEs across litellm (3), cryptography (3), pillow (1), filelock (2), and requests (1). All nine share the same root cause: their patched versions require Python ≥ 3.10, while this release still supports Python 3.9. Each advisory is documented per-CVE with the reason its vulnerable code path is unreachable from anything we ship. The entire allow-list can be deleted in a single commit when the minimum Python is raised — see the strategic note in the v0.0.5 commit history.
Deferred to v0.0.6 — "Intelligence Layer"
- Categorization (`category` field populated, `is_business_expense` flag) — will ship as an opt-in `bankstatementparser.enrichment` module
- Interactive review mode — separate `--type review` subcommand consuming saved `IngestResult` JSON
- OCR chunk-to-row mapping — true bounding-box mapping from the vision path
- Drop Python 3.9 support — Python 3.9 reached EOL on 2025-10-31
Full changelog
See CHANGELOG.md for the complete v0.0.5 entry.
Pull request: #43 (13 commits, all SSH-signed)
v0.0.4 — 27K tx/s streaming, parallel parsing, Python 3.14, ISO 13485
Performance
| Metric | CAMT | PAIN.001 |
|---|---|---|
| Throughput | 27,000+ tx/s | 52,000+ tx/s |
| Per-transaction latency | 37 µs | 19 µs |
| Time to first result | < 1 ms | < 2 ms |
| Memory scaling | Constant (1K–50K) | Constant (1K–50K) |
- 20% CAMT streaming optimization (xpath → find/findtext)
- True streaming for PAIN.001 files > 50 MB via chunk-based temp file
- CI-enforced TPS minimums and latency contracts
New Features
- `parse_files_parallel()` — process multiple statement files across CPU cores using `ProcessPoolExecutor`
- `Deduplicator` — deterministic transaction deduplication with explainable confidence scores
- `Transaction` — Pydantic model normalizing records from any parser with `Decimal` precision
- `to_polars()` / `to_polars_lazy()` — optional Polars DataFrame export (`pip install 'bankstatementparser[polars]'`)
- Python 3.13 and 3.14 — full support with CI matrix testing
Dependencies
| Package | Change |
|---|---|
| lxml | 4.9.3 → 6.0.2 |
| Pygments | 2.19.2 → 2.20.0 (CVE-2026-4539 fix) |
| pydantic | Added (^2.11.0) |
| hypothesis | Added (>=6.82,<7) |
| polars | Added (^1.32.0, optional) |
Documentation
- FAQ.md — 11 questions across 3 personas (CFO/Auditor, Fintech Dev, Treasury Analyst)
- docs/MAPPING.md — Complete XML tag to DataFrame column mapping for all 6 formats
- README — Performance table, parallel parsing, deduplication, PII redaction, output examples
ISO 13485 Compliance Suite
- Risk Register — 7 quantified hazards with severity/probability scoring and residual risk
- V&V Plan — 5-phase, 19-step with pass criteria and evidence retention
- Change Control Procedure — Change workflow, impact assessment, rollback
- SOUP Register — 22 tracked components with risk levels and EOL
- Traceability Matrix — 17 design inputs mapped to implementation and verification
- Secure Path to Production — Gate criteria per stage with approval authority
- Security Policy — Response SLAs (48h ack, 30d fix), severity classification
Quality
| Metric | Value |
|---|---|
| Tests | 467 passed, 0 skipped |
| Branch coverage | 100% |
| Modules | 13 |
| Bandit SAST | 0 findings |
| pip-audit | 0 CVEs |
| Commits | All signed (ED25519) |
| SOUP components | 22 |
| Design inputs | 17 |
Breaking Changes
None. All existing APIs are backward-compatible.
THE ARCHITECT ᛫ Sebastien Rousseau ᛫ https://sebastienrousseau.com
THE ENGINE ᛞ EUXIS ᛫ Enterprise Unified Execution Intelligence System ᛫ https://euxis.co
v0.0.3
What's Changed
- feat(v0.0.3): deduplication, parser performance, and typed hardening by @sebastienrousseau in #25
Full Changelog: v0.0.2...v0.0.3
v0.0.2
Highlights
- Add secure in-memory CAMT parsing with `CamtParser.from_string(...)` and `CamtParser.from_bytes(...)`
- Add hardened ZIP processing for XML statements via `iter_secure_xml_entries(...)`
- Add parser support for bank CSV, OFX/QFX, and MT940 formats
- Add automatic statement-format detection with `detect_statement_format(...)` and `create_parser(...)`
- Add CI, security scanning, SBOM, checksum, and provenance hardening
- Refresh docs, examples, contribution guidance, and cross-platform behavior
- Refresh docs, examples, contribution guidance, and cross-platform behavior
Verification
- PR checks were green before merge
- Release Integrity workflow for tag v0.0.2 passed successfully on 2026-03-22
- Attached artifacts include the wheel, sdist, SHA256 checksums, SBOM, and dependency report
v0.0.1 — 2023-11-08
Bank Statement Parser v0.0.1 🐍
The Bank Statement Parser is a Python library built for finance and treasury professionals. It simplifies the analysis of CAMT and SEPA transaction files: its streamlined design replaces cumbersome manual data review with a concise, accurate report ready for further analysis, so you can focus on financial insights and decisions.
Key Features
- Versatile Parsing: Easily handle formats like CAMT (ISO 20022) and beyond.
- Financial Insights: Unlock detailed analysis with powerful calculation utilities.
- Simple CLI: Automate and integrate with a straightforward command-line interface.
Why Choose the Bank Statement Parser
- Designed for Finance: Tailored features for the finance sector's needs.
- Efficiency at Heart: Transform complex data tasks into simple ones.
- Community First: Built and enhanced by experts, for experts.
Functionality
- CamtParser: Parse CAMT format files with ease.
- Pain001Parser: Handle SEPA PAIN.001 files effortlessly.
Installation
Create a Virtual Environment
We recommend creating a virtual environment to install the Bank Statement Parser. This ensures the package is installed in an isolated environment and will not affect other projects.

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Getting Started
Install bankstatementparser with just one command:
```bash
pip install bankstatementparser
```

Usage
CAMT Files
```python
from bankstatementparser import CamtParser

# Initialize the parser with the CAMT file path
camt_parser = CamtParser('path/to/camt/file.xml')

# Parse the file and get the results
results = camt_parser.parse()
```

PAIN.001 Files
```python
from bankstatementparser import Pain001Parser

# Initialize the parser with the PAIN.001 file path
pain_parser = Pain001Parser('path/to/pain/file.xml')

# Parse the file and get the results
results = pain_parser.parse()
```

Command Line Interface (CLI) Guide
Leverage the CLI for quick parsing tasks:
Basic Command
```bash
python cli.py --type <file_type> --input <input_file> [--output <output_file>]
```

- `--type`: Type of the bank statement file. Currently supported types are "camt" and "pain001".
- `--input`: Path to the bank statement file.
- `--output`: (Optional) Path to save the parsed data. If not provided, data is printed to the console.

