Problem
TemplateDetector aggregates confidence scores from 7 detectors to select a bank template. The scoring policy — detector weights and minimum threshold — is hard-coded as module-level constants:
DETECTOR_WEIGHTS = {"IBAN": 2.0, "ColumnHeader": 1.5, "Header": 1.0, "Filename": 0.8, ...}
MIN_CONFIDENCE_THRESHOLD = 0.6
This creates three friction points:
- No test safety net for weights/threshold. The 18 existing tests mock entire detectors; none validates that a specific detector signal at a known confidence selects (or rejects) a template. If a weight or threshold is changed to tune behaviour for a new bank format, there is no test that will catch a regression.
- Debugging requires reading 8 files. To understand why template X was chosen over Y, you must read DETECTOR_WEIGHTS, MIN_CONFIDENCE_THRESHOLD, and all 7 detector implementations. There is no structured way to ask "what did each detector contribute for this PDF?"
- Testing threshold boundaries requires arithmetic gymnastics. The comment in test_detect_below_minimum_threshold already shows this: # Filename detector returns low confidence (0.5 * 0.8 weight = 0.4 < 0.6 threshold). A test author must read the module constants, do the arithmetic, and pick a magic number, which silently breaks if either constant changes.
Proposed Solution
1. ScoringConfig — injectable weights + threshold
Add a frozen dataclass to template_detector.py that carries both constants as fields:
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringConfig:
    weights: dict[str, float]
    min_confidence_threshold: float

    def __post_init__(self) -> None:
        # Threshold must lie in (0.0, 1.0]; weights must be non-negative.
        if not 0.0 < self.min_confidence_threshold <= 1.0:
            raise ValueError(...)
        for name, w in self.weights.items():
            if w < 0.0:
                raise ValueError(...)

    @classmethod
    def default(cls) -> "ScoringConfig":
        """Production scoring — used when no config is injected."""
        return cls(
            weights={"IBAN": 2.0, "CardNumber": 2.0, "LoanReference": 2.0,
                     "ColumnHeader": 1.5, "Header": 1.0, "Filename": 0.8, "Exclusion": 0.0},
            min_confidence_threshold=0.6,
        )

    def weight_for(self, detector_name: str) -> float:
        # Detectors without a configured weight fall back to 1.0.
        return self.weights.get(detector_name, 1.0)
Module-level DETECTOR_WEIGHTS and MIN_CONFIDENCE_THRESHOLD are deleted. ScoringConfig.default() is the single source of truth.
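As a quick illustration of how the validation and the fallback weight behave, the sketch below repeats a trimmed copy of the class and exercises it. The ValueError messages and the `is_rejected` helper are placeholders invented for the example, since the proposal elides the messages. One caveat it also demonstrates: frozen=True prevents reassigning fields but does not make the weights dict itself immutable.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ScoringConfig:
    """Trimmed copy of the proposed class; error messages are placeholders."""

    weights: dict[str, float]
    min_confidence_threshold: float

    def __post_init__(self) -> None:
        if not 0.0 < self.min_confidence_threshold <= 1.0:
            raise ValueError("threshold must be in (0.0, 1.0]")
        for name, w in self.weights.items():
            if w < 0.0:
                raise ValueError(f"negative weight for {name}")

    def weight_for(self, detector_name: str) -> float:
        return self.weights.get(detector_name, 1.0)


def is_rejected(**kwargs) -> bool:
    """Hypothetical helper: True if ScoringConfig refuses the arguments."""
    try:
        ScoringConfig(**kwargs)
    except ValueError:
        return True
    return False


config = ScoringConfig(
    weights={"IBAN": 2.0, "Filename": 0.8}, min_confidence_threshold=0.6
)

known_weight = config.weight_for("IBAN")         # configured weight
fallback_weight = config.weight_for("Unlisted")  # unknown detectors get 1.0

zero_threshold_rejected = is_rejected(weights={}, min_confidence_threshold=0.0)
negative_weight_rejected = is_rejected(
    weights={"IBAN": -1.0}, min_confidence_threshold=0.6
)

# frozen=True blocks field reassignment...
try:
    config.min_confidence_threshold = 0.1  # type: ignore[misc]
    reassignment_blocked = False
except FrozenInstanceError:
    reassignment_blocked = True

# ...but the dict held in `weights` can still be mutated in place.
config.weights["Filename"] = 0.9
dict_still_mutable = config.weights["Filename"] == 0.9
```

If in-place mutation of the weights is a concern, copying the mapping into a read-only view at construction time would be one option; that is outside what the proposal specifies.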
2. Backward-compatible constructor
class TemplateDetector:
    def __init__(
        self,
        registry: TemplateRegistry,
        scoring: ScoringConfig | None = None,
    ) -> None:
        self.registry = registry
        self._scoring = scoring if scoring is not None else ScoringConfig.default()
        # detector list unchanged
ExtractionOrchestrator._initialize_template_system() calls TemplateDetector(template_registry) today — zero changes required.
3. Two mechanical substitutions in the scoring loop
# Before
weight = DETECTOR_WEIGHTS.get(result.detector_name, 1.0)
...
if best_score < MIN_CONFIDENCE_THRESHOLD:
# After
weight = self._scoring.weight_for(result.detector_name)
...
if best_score < self._scoring.min_confidence_threshold:
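To make the effect of the two substitutions concrete, here is a minimal sketch of a weighted scoring loop driven by an injected config. It assumes scores are summed per template as confidence times weight, which matches the "weighted total" wording used for the explanation fields but is not taken from the actual implementation. `DetectionResult` is a simplified stand-in, `select` is a hypothetical function, and returning `None` stands in for whatever default-template fallback the real code applies.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoringConfig:
    """Reduced stand-in for the proposed config (validation omitted)."""

    weights: dict[str, float]
    min_confidence_threshold: float

    def weight_for(self, detector_name: str) -> float:
        return self.weights.get(detector_name, 1.0)


@dataclass(frozen=True)
class DetectionResult:
    """Simplified stand-in: the real class carries a template object."""

    template_id: str
    confidence: float
    detector_name: str


def select(results: list[DetectionResult], scoring: ScoringConfig) -> Optional[str]:
    """Sum weighted confidences per template; return the best id or None."""
    totals: dict[str, float] = defaultdict(float)
    for result in results:
        weight = scoring.weight_for(result.detector_name)
        totals[result.template_id] += result.confidence * weight
    if not totals:
        return None
    best_id, best_score = max(totals.items(), key=lambda item: item[1])
    if best_score < scoring.min_confidence_threshold:
        return None  # assumption: the real code falls back to a default template
    return best_id


signals = [
    DetectionResult("aib", 0.5, "IBAN"),
    DetectionResult("revolut", 0.6, "ColumnHeader"),
]

# Production-like weights: IBAN 0.5 * 2.0 = 1.0 beats ColumnHeader 0.6 * 1.5 = 0.9.
production_like = ScoringConfig({"IBAN": 2.0, "ColumnHeader": 1.5, "Filename": 0.8}, 0.6)
weighted_winner = select(signals, production_like)

# Filename 0.5 * 0.8 = 0.4 stays below the 0.6 threshold.
below_threshold = select([DetectionResult("revolut", 0.5, "Filename")], production_like)

# Same signals under a flat config (every weight 1.0): raw confidence decides.
flat_winner = select(signals, ScoringConfig({}, 0.6))
```

Because the config is an argument rather than a module global, a test can swap in the flat config and see the winner change without patching anything.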
4. get_detection_explanation() — structured debugging without mocking
@dataclass
class DetectionExplanation:
    selected_template_id: str
    selected_score: float
    threshold: float
    passed_threshold: bool
    per_template_scores: dict[str, float]  # template_id -> weighted total
    per_template_breakdown: dict[str, list[str]]  # template_id -> ["IBAN=0.95*2.0=1.90", ...]
    tie_broken: bool
    tie_winner_reason: str | None
    used_default: bool
    default_reason: str | None


class TemplateDetector:
    # existing members unchanged

    def get_detection_explanation(
        self, pdf_path: Path, first_page: Page
    ) -> DetectionExplanation:
        ...
Implementation: extract a private helper, _run_detection(pdf_path, first_page), that returns tuple[BankTemplate, DetectionExplanation]. detect_template() discards the explanation; get_detection_explanation() discards the template. Each public method runs the detectors once per call, so producing an explanation does not require a second detection pass.
get_detection_explanation() is not added to the ITemplateDetector protocol — it is a concrete-class-only debug/test method.
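One way the shared helper could assemble the score-related fields is sketched below. This is an illustration only: `explain_scores` is a hypothetical function, detector results are reduced to plain tuples, and tie-breaking and default handling are left out because the proposal does not spell them out. The breakdown strings follow the `IBAN=0.95*2.0=1.90` format shown in the dataclass comment.

```python
# (template_id, detector_name, confidence): simplified stand-in for DetectionResult
Signal = tuple[str, str, float]


def explain_scores(
    signals: list[Signal], weights: dict[str, float], threshold: float
) -> dict[str, object]:
    """Build the score fields of an explanation; expects at least one signal."""
    if not signals:
        raise ValueError("at least one signal is required")
    scores: dict[str, float] = {}
    breakdown: dict[str, list[str]] = {}
    for template_id, detector_name, confidence in signals:
        weight = weights.get(detector_name, 1.0)
        contribution = confidence * weight
        scores[template_id] = scores.get(template_id, 0.0) + contribution
        breakdown.setdefault(template_id, []).append(
            f"{detector_name}={confidence:.2f}*{weight:.1f}={contribution:.2f}"
        )
    best_id = max(scores, key=lambda template_id: scores[template_id])
    return {
        "selected_template_id": best_id,
        "selected_score": scores[best_id],
        "threshold": threshold,
        "passed_threshold": scores[best_id] >= threshold,
        "per_template_scores": scores,
        "per_template_breakdown": breakdown,
    }


explanation = explain_scores(
    [("aib", "IBAN", 0.95), ("aib", "Filename", 0.5), ("revolut", "Header", 0.7)],
    weights={"IBAN": 2.0, "Header": 1.0, "Filename": 0.8},
    threshold=0.6,
)
aib_lines = explanation["per_template_breakdown"]["aib"]
```

Keeping the breakdown as preformatted strings makes it easy to log, at the cost of being awkward to assert on numerically; tests that care about values can use per_template_scores instead.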
5. Exports
Add ScoringConfig and DetectionExplanation to bankstatements_core/templates/__init__.py __all__.
Tests enabled by this change
Threshold boundary without magic numbers:
def test_filename_at_threshold_selects_template(mock_registry, mock_page):
    scoring = ScoringConfig.default()  # weight=0.8, threshold=0.6
    # 0.75 * 0.8 = 0.60 >= 0.60 → should select
    # patch.object is used because patch() expects a dotted-string target
    with patch.object(FilenameDetector, "detect") as mock_fn, ...:
        mock_fn.return_value = [DetectionResult(revolut, 0.75, "Filename", {})]
        ...
        result = TemplateDetector(mock_registry, scoring=scoring).detect_template(...)
    assert result == revolut
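The 0.75 in the test above is still worked out by hand from the two defaults. If the goal is for boundary tests to survive retuning, the confidence could instead be derived from the config under test. The sketch below shows that idea with a hypothetical helper, `confidence_for_score`. It takes the weight and threshold as plain numbers so that it runs on its own, and it aims slightly above and below the threshold rather than exactly at it, since floating-point rounding makes an exact-boundary comparison fragile.

```python
def confidence_for_score(target_score: float, weight: float) -> float:
    """Raw confidence a detector must report to contribute target_score."""
    if weight <= 0.0:
        raise ValueError("weight must be positive to reach a non-zero score")
    return target_score / weight


# Values mirror ScoringConfig.default(): Filename weight 0.8, threshold 0.6.
# In a real test they would come from scoring.weight_for("Filename") and
# scoring.min_confidence_threshold instead of being written out here.
weight = 0.8
threshold = 0.6
margin = 0.01  # arbitrary illustrative margin, well clear of rounding noise

just_above = confidence_for_score(threshold + margin, weight)
just_below = confidence_for_score(threshold - margin, weight)

selects = just_above * weight >= threshold  # weighted score clears the threshold
rejects = just_below * weight < threshold   # weighted score falls short
```

A test built this way keeps passing if the Filename weight or the threshold is retuned, as long as the derived confidence stays within the range the detector can report.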
Weight ordering determines winner:
def test_iban_weight_beats_higher_raw_column_header_confidence(mock_registry, mock_page):
    # aib: IBAN(0.5 * 2.0 = 1.0) vs revolut: ColumnHeader(0.6 * 1.5 = 0.9)
    # aib wins even though ColumnHeader raw confidence is higher
    scoring = ScoringConfig.default()
    ...
    assert result == aib
Explanation reveals tie-break reason without reading _break_tie():
def test_explanation_reports_iban_tie_break_reason(mock_registry, mock_page):
    ...
    explanation = detector.get_detection_explanation(Path("x.pdf"), mock_page)
    assert explanation.tie_broken is True
    assert explanation.tie_winner_reason == "IBAN match"
    assert explanation.per_template_scores["aib"] == pytest.approx(1.0)
    assert explanation.per_template_scores["revolut"] == pytest.approx(1.0)
Scope
| File | Change |
| --- | --- |
| templates/template_detector.py | Add ScoringConfig, DetectionExplanation; revise __init__; replace 2 global refs; add _run_detection, get_detection_explanation |
| templates/__init__.py | Export ScoringConfig, DetectionExplanation |
| tests/templates/test_template_detector.py | Add 3–4 new tests; existing 18 tests untouched |
| templates/detectors/*.py (7 files) | No changes |
| services/extraction_orchestrator.py | No changes |
| domain/protocols/services.py | No changes |
What this does NOT change
- The 7 detector files: they remain stateless signal producers
- ExtractionOrchestrator: zero call-site changes needed
- The ITemplateDetector protocol: get_detection_explanation is concrete-only
- The detector list ordering: ExclusionDetector stays first by convention; no ordering guard needed since the list is not injectable in this design
- Existing test mocking patterns: all 18 current @patch tests continue to work unchanged