RFC: Make template confidence scoring injectable via ScoringConfig #32

@longieirl

Problem

TemplateDetector aggregates confidence scores from 7 detectors to select a bank template. The scoring policy — detector weights and minimum threshold — is hard-coded as module-level constants:

DETECTOR_WEIGHTS = {"IBAN": 2.0, "ColumnHeader": 1.5, "Header": 1.0, "Filename": 0.8, ...}
MIN_CONFIDENCE_THRESHOLD = 0.6

This creates three friction points:

  1. No test safety net for weights/threshold. The 18 existing tests mock entire detectors; none validates that a specific detector signal at a known confidence selects (or rejects) a template. If a weight or threshold is changed to tune behaviour for a new bank format, there is no test that will catch a regression.

  2. Debugging requires reading 8 files. To understand why template X was chosen over Y, you must read DETECTOR_WEIGHTS, MIN_CONFIDENCE_THRESHOLD, and all 7 detector implementations. There is no structured way to ask "what did each detector contribute for this PDF?"

  3. Testing threshold boundaries requires arithmetic gymnastics. The comment in test_detect_below_minimum_threshold already shows this: # Filename detector returns low confidence (0.5 * 0.8 weight = 0.4 < 0.6 threshold). A test author must read the module constants, do the arithmetic, and pick a magic number — which silently breaks if either constant changes.
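The arithmetic in point 3 can be made concrete with a small sketch. It uses the Filename weight and threshold quoted above; the helper name passes_threshold and the retuned weight of 1.5 are hypothetical, chosen only to show how a hard-coded 0.5 changes meaning when a constant moves.

```python
# Constants copied from the module-level values quoted in this RFC.
FILENAME_WEIGHT = 0.8
MIN_CONFIDENCE_THRESHOLD = 0.6


def passes_threshold(raw_confidence: float, weight: float, threshold: float) -> bool:
    """Hypothetical helper: a score is rejected only when it is below the threshold."""
    return raw_confidence * weight >= threshold


# Today: 0.5 * 0.8 = 0.4, which is below 0.6, so the template is rejected.
rejected_today = not passes_threshold(0.5, FILENAME_WEIGHT, MIN_CONFIDENCE_THRESHOLD)

# If the weight were retuned to 1.5 (an assumed value), the same magic 0.5
# would score 0.75 and pass, so the test would no longer exercise rejection.
passes_after_retune = passes_threshold(0.5, 1.5, MIN_CONFIDENCE_THRESHOLD)

print(rejected_today, passes_after_retune)
```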


Proposed Solution

1. ScoringConfig — injectable weights + threshold

Add a frozen dataclass to template_detector.py that carries both constants as fields:

from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringConfig:
    """Scoring policy for TemplateDetector: per-detector weights plus a minimum threshold."""

    weights: dict[str, float]
    min_confidence_threshold: float

    def __post_init__(self) -> None:
        # Reject invalid policies at construction time, not during detection.
        if not 0.0 < self.min_confidence_threshold <= 1.0:
            raise ValueError(...)
        for name, w in self.weights.items():
            if w < 0.0:
                raise ValueError(...)

    @classmethod
    def default(cls) -> "ScoringConfig":
        """Production scoring — used when no config is injected."""
        return cls(
            weights={"IBAN": 2.0, "CardNumber": 2.0, "LoanReference": 2.0,
                     "ColumnHeader": 1.5, "Header": 1.0, "Filename": 0.8, "Exclusion": 0.0},
            min_confidence_threshold=0.6,
        )

    def weight_for(self, detector_name: str) -> float:
        # Unlisted detectors weigh 1.0, matching DETECTOR_WEIGHTS.get(name, 1.0).
        return self.weights.get(detector_name, 1.0)

Module-level DETECTOR_WEIGHTS and MIN_CONFIDENCE_THRESHOLD are deleted. ScoringConfig.default() is the single source of truth.
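As a rough illustration of how an injected policy might be built and validated, the sketch below uses a condensed copy of the proposed dataclass. The error messages and the custom weights are assumptions made for the example; the RFC leaves the messages unspecified.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ScoringConfig:
    """Condensed copy of the proposed dataclass; error messages are illustrative."""

    weights: dict[str, float]
    min_confidence_threshold: float

    def __post_init__(self) -> None:
        if not 0.0 < self.min_confidence_threshold <= 1.0:
            raise ValueError("min_confidence_threshold must be in (0.0, 1.0]")
        for name, w in self.weights.items():
            if w < 0.0:
                raise ValueError(f"weight for {name!r} must be non-negative")

    def weight_for(self, detector_name: str) -> float:
        return self.weights.get(detector_name, 1.0)


# A test-only policy: boost IBAN and lower the threshold (assumed values).
custom = ScoringConfig(weights={"IBAN": 3.0}, min_confidence_threshold=0.5)
print(custom.weight_for("IBAN"))     # the configured weight
print(custom.weight_for("Unknown"))  # unlisted detectors fall back to 1.0

# Invalid policies fail at construction time rather than during detection.
for bad_kwargs in (
    {"weights": {}, "min_confidence_threshold": 0.0},
    {"weights": {"IBAN": -1.0}, "min_confidence_threshold": 0.6},
):
    try:
        ScoringConfig(**bad_kwargs)
    except ValueError as exc:
        print("rejected:", exc)

# frozen=True blocks reassignment of fields after construction.
try:
    custom.min_confidence_threshold = 0.9  # type: ignore[misc]
except FrozenInstanceError:
    print("config fields cannot be reassigned")
```

Note that frozen=True prevents rebinding the fields but does not make the weights dict itself immutable.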

2. Backward-compatible constructor

class TemplateDetector:
    def __init__(
        self,
        registry: TemplateRegistry,
        scoring: ScoringConfig | None = None,
    ) -> None:
        self.registry = registry
        self._scoring = scoring if scoring is not None else ScoringConfig.default()
        # detector list unchanged

ExtractionOrchestrator._initialize_template_system() calls TemplateDetector(template_registry) today — zero changes required.

3. Two mechanical substitutions in the scoring loop

# Before
weight = DETECTOR_WEIGHTS.get(result.detector_name, 1.0)
...
if best_score < MIN_CONFIDENCE_THRESHOLD:

# After
weight = self._scoring.weight_for(result.detector_name)
...
if best_score < self._scoring.min_confidence_threshold:
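For context, the loop these two substitutions live in might look roughly like the sketch below. This is not the project's actual code: the DetectionResult stand-in (with a template_id string rather than a BankTemplate), the select_template function, and the assumption that weighted confidences are summed per template are all illustrative. Weights and threshold are passed in explicitly, standing in for self._scoring.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class DetectionResult(NamedTuple):
    """Simplified stand-in for the project's DetectionResult."""

    template_id: str
    confidence: float
    detector_name: str


def select_template(
    results: list[DetectionResult],
    weights: dict[str, float],
    threshold: float,
) -> Optional[str]:
    """Sum weighted confidences per template; return the best id, or None."""
    totals: dict[str, float] = defaultdict(float)
    for result in results:
        # Mirrors ScoringConfig.weight_for: unlisted detectors weigh 1.0.
        weight = weights.get(result.detector_name, 1.0)
        totals[result.template_id] += result.confidence * weight
    if not totals:
        return None
    best_id = max(totals, key=lambda template_id: totals[template_id])
    if totals[best_id] < threshold:
        # Below threshold: no template is selected in this sketch.
        return None
    return best_id


weights = {"IBAN": 2.0, "ColumnHeader": 1.5, "Filename": 0.8}
winner = select_template(
    [
        DetectionResult("aib", 0.5, "IBAN"),              # 0.5 * 2.0 = 1.0
        DetectionResult("revolut", 0.6, "ColumnHeader"),  # 0.6 * 1.5 = 0.9
    ],
    weights,
    0.6,
)
print(winner)
```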

4. get_detection_explanation() — structured debugging without mocking

@dataclass
class DetectionExplanation:
    selected_template_id: str
    selected_score: float
    threshold: float
    passed_threshold: bool
    per_template_scores: dict[str, float]       # template_id -> weighted total
    per_template_breakdown: dict[str, list[str]] # template_id -> ["IBAN=0.95*2.0=1.90", ...]
    tie_broken: bool
    tie_winner_reason: str | None
    used_default: bool
    default_reason: str | None

def get_detection_explanation(
    self, pdf_path: Path, first_page: Page
) -> DetectionExplanation:
    ...

Implementation: extract a private helper, _run_detection(pdf_path, first_page), that returns tuple[BankTemplate, DetectionExplanation]. detect_template() discards the explanation; get_detection_explanation() discards the template. Each public method makes a single detection pass, so neither parses the PDF more than once per call.
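The shape of that refactor can be sketched with stubs. Everything here is illustrative: plain strings stand in for BankTemplate, Path and Page, the Explanation class is a trimmed stand-in for DetectionExplanation, and the runs counter exists only to show that each public call makes one detection pass.

```python
from dataclasses import dataclass


@dataclass
class Explanation:
    """Trimmed stand-in for DetectionExplanation."""

    selected_template_id: str
    selected_score: float


class SketchDetector:
    """Illustrative only; not the real TemplateDetector."""

    def __init__(self) -> None:
        self.runs = 0  # counts detection passes, for demonstration

    def _run_detection(self, pdf_path: str, first_page: str) -> tuple[str, Explanation]:
        self.runs += 1
        # A real implementation would run the detectors and score them here.
        template_id = "aib"
        return template_id, Explanation(template_id, 1.0)

    def detect_template(self, pdf_path: str, first_page: str) -> str:
        template, _explanation = self._run_detection(pdf_path, first_page)
        return template

    def get_detection_explanation(self, pdf_path: str, first_page: str) -> Explanation:
        _template, explanation = self._run_detection(pdf_path, first_page)
        return explanation


detector = SketchDetector()
template = detector.detect_template("x.pdf", "page-1")
explanation = detector.get_detection_explanation("x.pdf", "page-1")
print(template, explanation.selected_template_id, detector.runs)
```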

get_detection_explanation() is not added to the ITemplateDetector protocol — it is a concrete-class-only debug/test method.
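The per_template_breakdown entries follow the pattern shown in the field comment ("IBAN=0.95*2.0=1.90"). A minimal formatter for such entries might look like this; the function name format_contribution is an assumption, not part of the proposal.

```python
def format_contribution(detector_name: str, confidence: float, weight: float) -> str:
    """Render one detector's contribution as name=confidence*weight=score."""
    score = confidence * weight
    return f"{detector_name}={confidence:.2f}*{weight:.1f}={score:.2f}"


line = format_contribution("IBAN", 0.95, 2.0)
print(line)  # IBAN=0.95*2.0=1.90
```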

5. Exports

Add ScoringConfig and DetectionExplanation to bankstatements_core/templates/__init__.py __all__.


Tests enabled by this change

Threshold boundary without magic numbers:

def test_filename_at_threshold_selects_template(mock_registry, mock_page):
    scoring = ScoringConfig.default()  # weight=0.8, threshold=0.6
    # 0.75 * 0.8 = 0.60 >= 0.60 → should select
    with patch.object(FilenameDetector, "detect") as mock_fn, ...:
        mock_fn.return_value = [DetectionResult(revolut, 0.75, "Filename", {})]
        ...
        result = TemplateDetector(mock_registry, scoring=scoring).detect_template(...)
    assert result == revolut
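One way to avoid hard-coding 0.75 is to derive the raw confidence from the injected config. The helper below is hypothetical and uses plain floats in place of ScoringConfig so that it runs on its own. It places the score a small, assumed margin either side of the threshold, which also sidesteps floating-point rounding exactly at the boundary.

```python
def confidence_for_score(target_score: float, weight: float) -> float:
    """Raw detector confidence that yields target_score after weighting."""
    if weight <= 0.0:
        raise ValueError("weight must be positive to derive a confidence")
    return target_score / weight


# Stand-ins for scoring.weight_for("Filename") and scoring.min_confidence_threshold.
weight = 0.8
threshold = 0.6
margin = 0.05  # assumed margin, large enough to dwarf rounding error

just_above = confidence_for_score(threshold + margin, weight)
just_below = confidence_for_score(threshold - margin, weight)

print(just_above * weight >= threshold)  # should be selected
print(just_below * weight < threshold)   # should be rejected
```

A test written this way keeps exercising the boundary if either the weight or the threshold is retuned.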

Weight ordering determines winner:

def test_iban_weight_beats_higher_raw_column_header_confidence(mock_registry, mock_page):
    # aib: IBAN(0.5 * 2.0 = 1.0) vs revolut: ColumnHeader(0.6 * 1.5 = 0.9)
    # aib wins even though ColumnHeader raw confidence is higher
    scoring = ScoringConfig.default()
    ...
    assert result == aib

Explanation reveals tie-break reason without reading _break_tie():

def test_explanation_reports_iban_tie_break_reason(mock_registry, mock_page):
    ...
    explanation = detector.get_detection_explanation(Path("x.pdf"), mock_page)
    assert explanation.tie_broken is True
    assert explanation.tie_winner_reason == "IBAN match"
    assert explanation.per_template_scores["aib"] == pytest.approx(1.0)
    assert explanation.per_template_scores["revolut"] == pytest.approx(1.0)

Scope

File | Change
--- | ---
templates/template_detector.py | Add ScoringConfig, DetectionExplanation; revise __init__; replace 2 global refs; add _run_detection, get_detection_explanation
templates/__init__.py | Export ScoringConfig, DetectionExplanation
tests/templates/test_template_detector.py | Add 3–4 new tests; existing 18 tests untouched
templates/detectors/*.py (7 files) | No changes
services/extraction_orchestrator.py | No changes
domain/protocols/services.py | No changes

What this does NOT change

  • The 7 detector files — they remain stateless signal producers
  • ExtractionOrchestrator — zero call-site changes needed
  • The ITemplateDetector protocol — get_detection_explanation is concrete-only
  • The detector list ordering — ExclusionDetector stays first by convention; no ordering guard needed since the list is not injectable in this design
  • Existing test mocking patterns — all 18 current @patch tests continue to work unchanged
