feat(analyzer): add recognizer-level threshold config #2116
Open
rodboev wants to merge 7 commits into
Author
@microsoft-github-policy-service agree
Contributor
Pull request overview
Adds support in presidio-analyzer for configuring score thresholds per recognizer (and optionally per entity within that recognizer) via YAML, while preserving the existing global default_score_threshold / request-level score_threshold behavior.
Changes:
- Introduces the `recognizer_score_thresholds` configuration and applies it during result filtering (before duplicate collapse) when no request-level `score_threshold` is provided.
- Adds validation/normalization for the new configuration shape (including numeric shorthand) and expands unit test coverage for precedence and error cases.
- Updates analyzer configuration docs, the no-code tutorial, and the root changelog to document the new option.
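To make the validation/normalization point above concrete, here is a minimal sketch of how numeric shorthand could be expanded and range-checked. It is not the code from this PR: the function names and the `default` / `entities` keys are hypothetical, chosen only to illustrate a "recognizer threshold with optional per-entity override" shape.

```python
from numbers import Real
from typing import Any, Dict


def _check_score(value: Any, where: str) -> float:
    """Return value as a float if it is a number in [0, 1], else raise ValueError."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, Real):
        raise ValueError(f"{where}: threshold must be a number, got {value!r}")
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{where}: threshold must be within [0, 1], got {value!r}")
    return float(value)


def normalize_thresholds(raw: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Expand numeric shorthand into a full mapping (hypothetical shape)."""
    normalized: Dict[str, Dict[str, Any]] = {}
    for recognizer, spec in raw.items():
        if not isinstance(spec, dict):
            # Shorthand: `RecognizerName: 0.6` means a recognizer-wide threshold.
            normalized[recognizer] = {
                "default": _check_score(spec, recognizer),
                "entities": {},
            }
            continue
        unknown = set(spec) - {"default", "entities"}
        if unknown:
            raise ValueError(f"{recognizer}: unknown keys {unknown!r}")
        entities = spec.get("entities") or {}
        if not isinstance(entities, dict):
            raise ValueError(f"{recognizer}: 'entities' must be a mapping")
        default = spec.get("default")
        normalized[recognizer] = {
            "default": None if default is None else _check_score(default, recognizer),
            "entities": {
                entity: _check_score(score, f"{recognizer}.{entity}")
                for entity, score in entities.items()
            },
        }
    return normalized
```

Normalizing once at load time means the filtering code only ever sees one shape, so the shorthand does not need special handling on the hot path.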
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/analyzer_engine.py | Adds recognizer/entity-specific threshold support and applies filtering before deduplication. |
| presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py | Loads recognizer_score_thresholds from analyzer YAML and passes it into AnalyzerEngine. |
| presidio-analyzer/presidio_analyzer/input_validation/schemas.py | Validates and normalizes recognizer_score_thresholds in analyzer configuration validation. |
| presidio-analyzer/presidio_analyzer/conf/default_analyzer_full.yaml | Documents the new YAML option with a commented example. |
| presidio-analyzer/tests/test_analyzer_engine.py | Adds deterministic tests covering precedence, shorthand, invalid values, and dedupe ordering. |
| presidio-analyzer/tests/test_analyzer_engine_provider.py | Verifies provider passes/normalizes thresholds from YAML and that defaults remain unchanged otherwise. |
| presidio-analyzer/tests/test_configuration_validator.py | Adds validator tests for valid shorthand + invalid threshold structures/types/ranges. |
| docs/tutorial/08_no_code.md | Updates no-code YAML example and explains when to use recognizer-level thresholds. |
| docs/analyzer/analyzer_engine_provider.md | Documents the new configuration key and provides a YAML example. |
| CHANGELOG.md | Adds an Unreleased entry documenting the new analyzer YAML capability. |
Change Description
Adds analyzer-level YAML configuration for recognizer-specific score thresholds in `presidio-analyzer`. The new configuration lets users set a threshold for a recognizer and optionally override it for a specific entity type, while preserving the current scalar `score_threshold` and `default_score_threshold` behavior when the new config is absent.
Proposed YAML shape:
Recognizer and entity thresholds are applied in the existing result-filtering path after recognizer execution and context enhancement, before duplicate removal collapses equivalent spans. Explicit `score_threshold` arguments remain a global per-call override, and unmatched recognizers continue to fall back to `default_score_threshold`.
This also updates the analyzer docs, the no-code tutorial example, and the root changelog.
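The precedence and ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `Result` is a stand-in rather than presidio's own result class, the `default` / `entities` keys are hypothetical, the `>=` comparison is assumed, and the duplicate removal is simplified to "keep the highest score per identical span and entity type".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Result:
    """Stand-in for a recognizer result (not the presidio class)."""
    recognizer: str
    entity_type: str
    start: int
    end: int
    score: float


def effective_threshold(
    result: Result,
    thresholds: Dict[str, dict],
    default_score_threshold: float,
    score_threshold: Optional[float] = None,
) -> float:
    """Resolve: per-call override > entity > recognizer > global default."""
    if score_threshold is not None:
        return score_threshold  # explicit per-call argument wins globally
    spec = thresholds.get(result.recognizer)
    if spec is not None:
        entity_value = spec.get("entities", {}).get(result.entity_type)
        if entity_value is not None:
            return entity_value
        if spec.get("default") is not None:
            return spec["default"]
    return default_score_threshold  # unmatched recognizers fall back here


def filter_then_dedupe(
    results: List[Result],
    thresholds: Dict[str, dict],
    default_score_threshold: float,
    score_threshold: Optional[float] = None,
) -> List[Result]:
    """Apply thresholds first, then collapse identical spans (simplified)."""
    kept = [
        r for r in results
        if r.score >= effective_threshold(
            r, thresholds, default_score_threshold, score_threshold
        )
    ]
    best: Dict[tuple, Result] = {}
    for r in kept:
        key = (r.entity_type, r.start, r.end)
        if key not in best or r.score > best[key].score:
            best[key] = r
    return list(best.values())
```

Filtering before deduplication matters when two recognizers report the same span. If recognizer `A` scores 0.7 against a threshold of 0.8 and recognizer `B` scores 0.5 against a threshold of 0.4, deduplicating first would keep `A` and then drop it, losing the span entirely; filtering first keeps `B`.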
Issue reference
Fixes #1572
Tests
- `poetry run pytest tests/test_analyzer_engine.py tests/test_analyzer_engine_provider.py tests/test_configuration_validator.py -q`
- `poetry run ruff check presidio_analyzer/analyzer_engine.py presidio_analyzer/analyzer_engine_provider.py presidio_analyzer/input_validation/schemas.py tests/test_analyzer_engine.py tests/test_analyzer_engine_provider.py tests/test_configuration_validator.py`
Note on CHANGELOG
Update `CHANGELOG.md` under `[unreleased]` → `Analyzer` → `Added`.
Checklist