feat(analyzer): add nine South African predefined recognizers #2069
Open
thatomokoena wants to merge 5 commits into
Extend ZA coverage beyond ZA_ID_NUMBER with passport, tax, VAT, CIPC registration, eNaTIS driver's licence and traffic register numbers, licence plates, and mobile/telephone numbers split by line type. All recognizers are disabled by default in default_recognizers.yaml. Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a suite of South Africa (ZA) PII recognizers and corresponding tests, and wires them into the predefined recognizer registry, documentation, and default configuration.
Changes:
- Introduce new ZA recognizers (VAT, passport, income tax, driver license, company registration, traffic register number, license plate, mobile & telephone numbers).
- Add pytest coverage for each new recognizer’s analyze/validation behavior.
- Register the recognizers in package exports, default registry YAML, supported entities docs, and changelog.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_vat_number_recognizer.py | Adds ZA VAT number recognizer with regex + validation. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_traffic_register_number_recognizer.py | Adds ZA TRN recognizer with ID-number disambiguation. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_phone_number_recognizer.py | Adds ZA phone base recognizer and mobile/telephone subclasses using phonenumbers. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_passport_recognizer.py | Adds ZA passport recognizer with prefix rules + validation. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_license_plate_recognizer.py | Adds ZA license plate recognizer with multiple patterns + sanitizing validation. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_income_tax_number_recognizer.py | Adds ZA income tax reference recognizer with leading-digit rules. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_driver_license_recognizer.py | Adds ZA driver license recognizer with regex + validation rules. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_company_registration_recognizer.py | Adds ZA company registration recognizer with modern + legacy formats and year validation. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/__init__.py | Exports new ZA recognizers from the south_africa package. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py | Exposes new ZA recognizers at the predefined_recognizers package level. |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Adds ZA recognizers to the default registry (disabled by default). |
| docs/supported_entities.md | Documents newly supported ZA entity types. |
| CHANGELOG.md | Notes addition of ZA recognizers. |
| presidio-analyzer/tests/test_za_vat_number_recognizer.py | Adds tests for ZA VAT recognizer analyze/validate_result. |
| presidio-analyzer/tests/test_za_traffic_register_number_recognizer.py | Adds tests for ZA TRN recognizer analyze/validate_result. |
| presidio-analyzer/tests/test_za_telephone_number_recognizer.py | Adds tests for ZA telephone recognizer analyze + supported props. |
| presidio-analyzer/tests/test_za_passport_recognizer.py | Adds tests for ZA passport recognizer analyze/validate_result. |
| presidio-analyzer/tests/test_za_mobile_number_recognizer.py | Adds tests for ZA mobile recognizer analyze + supported props. |
| presidio-analyzer/tests/test_za_license_plate_recognizer.py | Adds tests for ZA license plate recognizer analyze/validate_result. |
| presidio-analyzer/tests/test_za_income_tax_number_recognizer.py | Adds tests for ZA income tax recognizer analyze/validate_result. |
| presidio-analyzer/tests/test_za_driver_license_recognizer.py | Adds tests for ZA driver license recognizer analyze/validate_result. |
| presidio-analyzer/tests/test_za_company_registration_recognizer.py | Adds tests for ZA company registration recognizer analyze/validate_result. |
Refactor the year validation logic in the ZaCompanyRegistrationRecognizer and ZaDriverLicenseRecognizer to ensure the current year is accurately checked without allowing for the next year. Remove unnecessary dependency on ZaIdNumberRecognizer in ZaDriverLicenseRecognizer to streamline the code. Update the validate_result method in ZaLicensePlateRecognizer to return a boolean type for consistency.
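The year rule described in this commit can be illustrated with a minimal sketch. It assumes the modern CIPC `YYYY/NNNNNN/NN` layout from the PR description and a hypothetical helper name, `plausible_cipc_number`; it is not the code in `ZaCompanyRegistrationRecognizer`, and it applies no lower bound on the year.

```python
import re
from datetime import date
from typing import Optional

# Modern CIPC layout from the PR description: YYYY/NNNNNN/NN.
_CIPC_MODERN = re.compile(r"^(\d{4})/(\d{6})/(\d{2})$")


def plausible_cipc_number(value: str, today: Optional[date] = None) -> bool:
    """Hypothetical helper: accept only registration years up to the current year."""
    today = today or date.today()
    match = _CIPC_MODERN.match(value)
    if match is None:
        return False
    year = int(match.group(1))
    # The current year is the upper bound; next year is rejected.
    return year <= today.year
```

Passing `today` explicitly keeps the check deterministic in tests, which matters for a rule that depends on the calendar.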
Contributor
Author
@microsoft-github-policy-service agree
Correct 08x NSN fallback classification, tighten driver licence validation, and replace the unstable passport docstring reference. Co-authored-by: Cursor <cursoragent@cursor.com>
Hi @SharonHart @omri374. When you have time, would you be willing to take a look at this PR to add more South African recognizers? Happy to address any feedback. Thanks!
Comment on lines +11 to +12
eNaTIS licence numbers are alphanumeric strings of roughly 10–12
characters combining digit blocks with trailing letter groups.
| ZA_ID_NUMBER | The South African identity number is a 13-digit identifier in the `YYMMDDSSSSCAZ` format, where the trailing digit is validated with the Luhn algorithm. | Pattern match, context, and checksum. |
| ZA_PASSPORT | The South African passport number is a 9-character identifier with prefix letter A, D, M, or T followed by 8 digits. | Pattern match, context, and validation. |
| ZA_INCOME_TAX_NUMBER | The South African SARS income tax reference number is a 10-digit numeric identifier, commonly starting with 0, 1, 2, 3, or 9. | Pattern match, context, and validation. |
| ZA_DRIVER_LICENSE | The South African eNaTIS driver's licence number is an alphanumeric identifier of roughly 10–12 characters. | Pattern match, context, and validation. |
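As a rough illustration of the format rules in these rows, the sketch below checks a 13-digit ID with the Luhn algorithm, a passport against the `[ADMT]` + 8 digits shape, and an income tax reference against the listed leading digits. The function names are hypothetical and the code is not taken from the recognizers in this PR; in particular it skips the date-of-birth checks a real ID validator would likely apply.

```python
import re


def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def looks_like_za_id(value: str) -> bool:
    # 13 digits with a valid Luhn check digit (date fields are not inspected here).
    return bool(re.fullmatch(r"\d{13}", value)) and luhn_valid(value)


def looks_like_za_passport(value: str) -> bool:
    # Prefix letter A, D, M, or T followed by 8 digits.
    return bool(re.fullmatch(r"[ADMT]\d{8}", value))


def looks_like_za_income_tax(value: str) -> bool:
    # 10 digits with a leading 0, 1, 2, 3, or 9.
    return bool(re.fullmatch(r"[01239]\d{9}", value))
```

Shape checks like these only narrow candidates; the table lists context words as a further signal for each entity.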
Comment on lines +96 to +101
    try:
        parsed_number = phonenumbers.parse(
            text[match.start : match.end], self.REGION
        )
    except NumberParseException:
        continue
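For context on the prefix fallback mentioned in the PR description, here is a standard-library-only sketch of classifying a number by its national significant number (NSN). The `80`/`86`/`87` mapping to `ZA_TELEPHONE_NUMBER` follows the description; the mobile and geographic prefix sets are assumptions made for illustration, and the helper names `to_nsn` and `classify_za_number` are hypothetical. The PR itself relies on `phonenumbers` first, which this sketch does not reproduce.

```python
import re
from typing import Optional

# From the PR description: toll-free (080), sharecall (086), VoIP (087).
TELEPHONE_SERVICE_PREFIXES = ("80", "86", "87")
# Assumption for illustration only: indicative cellular ranges.
ASSUMED_MOBILE_PREFIXES = ("6", "7", "81", "82", "83", "84")
# Assumption for illustration only: geographic area codes start with 1 to 5.
ASSUMED_GEOGRAPHIC_FIRST_DIGITS = "12345"


def to_nsn(raw: str) -> Optional[str]:
    """Strip formatting plus a +27 country code or a leading 0 trunk prefix."""
    digits = re.sub(r"\D", "", raw)
    if digits.startswith("27") and len(digits) == 11:
        digits = digits[2:]
    elif digits.startswith("0") and len(digits) == 10:
        digits = digits[1:]
    return digits if len(digits) == 9 else None


def classify_za_number(raw: str) -> Optional[str]:
    """Return an entity name based on the NSN prefix, or None if unknown."""
    nsn = to_nsn(raw)
    if nsn is None:
        return None
    # Check the 08x service ranges before the broader mobile ranges.
    if nsn.startswith(TELEPHONE_SERVICE_PREFIXES):
        return "ZA_TELEPHONE_NUMBER"
    if nsn.startswith(ASSUMED_MOBILE_PREFIXES):
        return "ZA_MOBILE_NUMBER"
    if nsn[0] in ASSUMED_GEOGRAPHIC_FIRST_DIGITS:
        return "ZA_TELEPHONE_NUMBER"
    return None
```

Because of number portability, a prefix table like this can only be indicative, which is why it is framed as a fallback rather than the primary classifier.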
Change Description
Add nine South African (`za`) predefined recognizers to `presidio-analyzer`, building on the merged `ZA_ID_NUMBER` recognizer (feat(analyzer): add South African ID number recognizer (ZA_ID_NUMBER) #2064). All recognizers are disabled by default in `default_recognizers.yaml` and tagged with `country_code: za`.

- `ZA_PASSPORT` (`ZaPassportRecognizer`): `[ADMT]` + 8 digits with prefix validation.
- `ZA_INCOME_TAX_NUMBER` (`ZaIncomeTaxNumberRecognizer`): 10-digit SARS tax reference; leading digit `0`/`1`/`2`/`3`/`9` (excludes VAT `4`).
- `ZA_DRIVER_LICENSE` (`ZaDriverLicenseRecognizer`): eNaTIS alphanumeric licence numbers (~10–14 chars); rejects valid `ZA_ID_NUMBER` matches.
- `ZA_VAT_NUMBER` (`ZaVatNumberRecognizer`): `4` + 9 digits.
- `ZA_COMPANY_REGISTRATION` (`ZaCompanyRegistrationRecognizer`): modern CIPC `YYYY/NNNNNN/NN` and legacy prefixed formats (e.g. `CK`).
- `ZA_TRAFFIC_REGISTER_NUMBER` (`ZaTrafficRegisterNumberRecognizer`): 13 digits that fail `ZA_ID_NUMBER` validation.
- `ZA_LICENSE_PLATE` (`ZaLicensePlateRecognizer`): multi-pattern provincial formats with province suffix whitelist (`GP`, `ZN`, `WP`, etc.).
- `ZA_MOBILE_NUMBER` / `ZA_TELEPHONE_NUMBER` (`ZaMobileNumberRecognizer`, `ZaTelephoneNumberRecognizer`): extend `PhoneRecognizer` with a shared base that locks to region `ZA`, filters by `phonenumbers.number_type()`, and applies ICASA NSN prefix fallback. Toll-free (`080`), sharecall (`086`), and VoIP (`087`) map to `ZA_TELEPHONE_NUMBER`.
- Updates `CHANGELOG.md`, `docs/supported_entities.md`, `default_recognizers.yaml`, and `predefined_recognizers/__init__.py`.
- Adds one test module per recognizer (`test_za_*_recognizer.py`); 169 `test_za_*` tests pass locally. Full-repo `ruff check` passes.

Known limitations:

`ZA_TRAFFIC_REGISTER_NUMBER` vs `ZA_ID_NUMBER` and `ZA_INCOME_TAX_NUMBER` vs `ZA_VAT_NUMBER` may collide on numeric strings without context; licence plate patterns do not cover all legacy no-suffix formats; cellular prefix ranges are indicative due to number portability.

Issue reference
N/A
Checklist