feat(analyzer): add country-specific predefined recognizer for Taiwan (TW) National ID #2072
ArjunPakhan wants to merge 1 commit into
@microsoft-github-policy-service agree
@ArjunPakhan please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Contributor License Agreement
Contribution License Agreement
This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
Pull request overview
Adds a new country-specific predefined recognizer to presidio-analyzer for detecting Taiwan National ID numbers (TW_NATIONAL_ID), wiring it into the predefined recognizer exports and multiple analyzer configuration profiles, along with a new unit test.
Changes:
- Added `TwNationalIdRecognizer` under `predefined_recognizers/country_specific/tw/` with regex + context keywords and a lowercase invalidation override.
- Exported the recognizer via `predefined_recognizers/__init__.py` and the `tw` package `__init__.py`.
- Updated analyzer config YAMLs and introduced a dedicated test module.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| presidio-analyzer/tests/test_tw_national_id_recognizer.py | Adds unit tests for the new TW National ID recognizer. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/tw/tw_national_id_recognizer.py | Implements the new TwNationalIdRecognizer (pattern + context + invalidation). |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/tw/\_\_init\_\_.py | Exports TwNationalIdRecognizer from the TW country package. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/\_\_init\_\_.py | Adds the new recognizer to global predefined exports and `__all__`. |
| presidio-analyzer/presidio_analyzer/conf/slim.yaml | Attempts to include the TW recognizer in the slim registry profile. |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Attempts to add the TW recognizer to the default recognizer list (no-code config). |
| presidio-analyzer/presidio_analyzer/conf/default_analyzer_full.yaml | Attempts to include the TW recognizer in the full analyzer registry profile. |
Comments suppressed due to low confidence (1)
presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml:70

This insertion breaks YAML metadata: `UsSsnRecognizer` no longer has `type`/`supported_languages`/`country_code`, and `TwNationalIdRecognizer` is incorrectly tagged with `supported_languages: [en]` and `country_code: us` (which will raise `ValueError` because the class declares `COUNTRY_CODE = "tw"`).

    - name: UsSsnRecognizer
    - name: TwNationalIdRecognizer
      supported_languages:
      - en
      type: predefined
      country_code: us
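
The mismatch the reviewer describes can be illustrated with a small consistency check. This is only a sketch of the idea, not Presidio's loader code: the stand-in class and the `check_country_code` helper are hypothetical names introduced here, and only the `COUNTRY_CODE = "tw"` attribute comes from the review comment.

```python
class TwNationalIdRecognizer:
    """Stand-in for the PR's class; only the declared country code matters here."""

    COUNTRY_CODE = "tw"


def check_country_code(recognizer_cls, configured_code: str) -> None:
    """Hypothetical helper: fail if the config disagrees with the class."""
    declared = getattr(recognizer_cls, "COUNTRY_CODE", None)
    if declared is not None and configured_code.lower() != declared.lower():
        raise ValueError(
            f"{recognizer_cls.__name__} declares country code {declared!r}, "
            f"but the configuration says {configured_code!r}"
        )


# The entry as written in default_recognizers.yaml uses "us", which conflicts.
try:
    check_country_code(TwNationalIdRecognizer, "us")
    config_error = None
except ValueError as exc:
    config_error = str(exc)
```

With `country_code: tw` in the YAML entry, the same check would pass without raising.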
    - name: UsSsnRecognizer
    - name: TwNationalIdRecognizer
      type: predefined
    # Match 1 uppercase letter, a gender/status digit (1, 2, 8, or 9), then 8 digits
    PATTERNS = [
        Pattern("National ID (weak)", r"\b[A-Z][1289][0-9]{8}\b", 0.3),
    ]
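
To see what this expression accepts on its own, here is a minimal sketch using only the standard `re` module, with no Presidio flags applied. The sample strings are taken from the PR's test cases; the variable names are illustrative.

```python
import re

# Same expression as the PR's "National ID (weak)" pattern, compiled
# case-sensitively (Presidio's own flags are deliberately left out here).
TW_ID_REGEX = re.compile(r"\b[A-Z][1289][0-9]{8}\b")

samples = [
    "My ID is A123456789.",  # valid layout embedded in a sentence
    "A323456789",            # second character is not 1, 2, 8, or 9
    "A12345678",             # one digit short
    "A1234567890",           # one digit too many, so the trailing \b fails
    "a123456789",            # lowercase leading letter
]

# Map each sample to the (start, end) spans of its matches.
spans = {text: [m.span() for m in TW_ID_REGEX.finditer(text)] for text in samples}

for text, found in spans.items():
    print(f"{text!r}: {found}")
```

Only the first sample matches, at span `(9, 19)`, which lines up with the positions expected in the test module.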
    "身分證",  # identity card
    "身份證",  # identity card (variant spelling)
    "身分證字號",  # identity card number
    "統一編號",  # unified number
    def invalidate_result(self, pattern_text: str) -> bool:
        """
        Check if the pattern text cannot be validated as a TW_NATIONAL_ID entity.

        :param pattern_text: Text detected as pattern by regex
        :return: True if invalidated
        """
        # Presidio uses IGNORECASE by default. Explicitly reject lowercase starting letters.
        if not pattern_text[0].isupper():
            return True

        return False
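
The reason this override is needed can be shown with plain `re`. The sketch below assumes Presidio's default regex flags are `re.DOTALL | re.MULTILINE | re.IGNORECASE`; the free function `invalidate_result` mirrors the method above, and `find_ids` is a hypothetical helper standing in for the recognizer's match-then-invalidate step.

```python
import re

PATTERN = r"\b[A-Z][1289][0-9]{8}\b"
# Assumption: these mirror the flags Presidio applies to patterns by default.
FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE


def invalidate_result(pattern_text: str) -> bool:
    """Return True when the match must be discarded (lowercase leading letter)."""
    return not pattern_text[0].isupper()


def find_ids(text: str) -> list:
    """Hypothetical helper: regex matches that survive invalidation."""
    return [
        m.group()
        for m in re.finditer(PATTERN, text, FLAGS)
        if not invalidate_result(m.group())
    ]


text = "a123456789 and A123456789"
raw_matches = [m.group() for m in re.finditer(PATTERN, text, FLAGS)]
kept_matches = find_ids(text)
```

Under `IGNORECASE` the character class `[A-Z]` also matches `a`, so `raw_matches` contains both candidates, while `kept_matches` keeps only the uppercase one.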
    @pytest.mark.parametrize(
        "text, expected_len, expected_positions, expected_score_ranges",
        [
            # fmt: off
            # Valid Taiwan IDs (Citizen Male, Citizen Female, Resident Male, Resident Female)
            ("My ID is A123456789.", 1, ((9, 19),), ((0.2, 0.4),),),
            ("B298765432", 1, ((0, 10),), ((0.2, 0.4),),),
            ("F800000014", 1, ((0, 10),), ((0.2, 0.4),),),
            ("H987654321", 1, ((0, 10),), ((0.2, 0.4),),),

            # Invalid Formats / Non-Matches
            ("A323456789", 0, (), (),),  # Invalid starting digit (3 is not a valid gender code)
            ("A12345678", 0, (), (),),  # Too short (only 8 digits)
            ("A1234567890", 0, (), (),),  # Too long (10 digits)
            ("1123456789", 0, (), (),),  # Missing leading character letter
            ("a123456789", 0, (), (),),  # Lowercase character letter
            # fmt: on
    class TwNationalIdRecognizer(PatternRecognizer):
        """Recognize Taiwan National Identification Number using regex.

        :param patterns: List of patterns to be used by this recognizer
        :param context: List of context words to increase confidence in detection
        :param supported_language: Language this recognizer supports
        :param supported_entity: The entity this recognizer can detect
        """
Description
This PR introduces a new country-specific predefined recognizer for identifying Taiwan (TW) National Identification Numbers within the `presidio-analyzer` engine.

The implementation detects the standard identification layout: an initial uppercase letter representing the region, followed by a gender/status identifier digit (`1`, `2`, `8`, or `9`) and 8 further digits.

Key Modifications:
- Added `TwNationalIdRecognizer`, which uses a regex pattern (`\b[A-Z][1289][0-9]{8}\b`) alongside English and Traditional Chinese context keywords (`身分證`, `統一編號`, etc.) to raise detection confidence.
- Overrode `invalidate_result` to reject lowercase leading letters, counteracting Presidio's default global `IGNORECASE` regex flag.
- Registered the recognizer in `predefined_recognizers/__init__.py` and exported it via the `tw` country package.
- Added `- name: TwNationalIdRecognizer` to all three analyzer configuration files (`default_recognizers.yaml`, `slim.yaml`, and `default_analyzer_full.yaml`).

Target Entity Specifications
- Entity: `TW_NATIONAL_ID`
- Language: `zh` (Chinese / Traditional Chinese)
- Base score: `0.3` (weak pattern baseline, augmented via context words)

Technical Visual Architecture
The module interacts within the standard Presidio predefined pipeline as follows: