feat(analyzer): add nine South African predefined recognizers#2069

Open
thatomokoena wants to merge 5 commits into
data-privacy-stack:mainfrom
thatomokoena:feature/za-recognizers

Conversation

@thatomokoena thatomokoena commented Jun 17, 2026


Change Description

  • Adds nine new South African (za) predefined recognizers to presidio-analyzer, building on the merged ZA_ID_NUMBER recognizer (feat(analyzer): add South African ID number recognizer (ZA_ID_NUMBER) #2064). All recognizers are disabled by default in default_recognizers.yaml and tagged with country_code: za.
  • ZA_PASSPORT (ZaPassportRecognizer): [ADMT] + 8 digits with prefix validation.
  • ZA_INCOME_TAX_NUMBER (ZaIncomeTaxNumberRecognizer): 10-digit SARS tax reference; leading digit 0/1/2/3/9 (excludes VAT 4).
  • ZA_DRIVER_LICENSE (ZaDriverLicenseRecognizer): eNaTIS alphanumeric licence numbers (~10–14 chars); rejects valid ZA_ID_NUMBER matches.
  • ZA_VAT_NUMBER (ZaVatNumberRecognizer): 4 + 9 digits.
  • ZA_COMPANY_REGISTRATION (ZaCompanyRegistrationRecognizer): modern CIPC YYYY/NNNNNN/NN and legacy prefixed formats (e.g. CK).
  • ZA_TRAFFIC_REGISTER_NUMBER (ZaTrafficRegisterNumberRecognizer): 13 digits that fail ZA_ID_NUMBER validation.
  • ZA_LICENSE_PLATE (ZaLicensePlateRecognizer): multi-pattern provincial formats with province suffix whitelist (GP, ZN, WP, etc.).
  • ZA_MOBILE_NUMBER / ZA_TELEPHONE_NUMBER (ZaMobileNumberRecognizer, ZaTelephoneNumberRecognizer): extend PhoneRecognizer with a shared base that locks to region ZA, filters by phonenumbers.number_type(), and applies ICASA NSN prefix fallback. Toll-free (080), sharecall (086), and VoIP (087) map to ZA_TELEPHONE_NUMBER.
  • Updated CHANGELOG.md, docs/supported_entities.md, default_recognizers.yaml, and predefined_recognizers/__init__.py.
  • Added nine test modules (test_za_*_recognizer.py); 169 test_za_* tests pass locally. Full-repo ruff check passes.
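The fixed-width formats in the list above can be approximated with plain regular expressions. The sketch below is built only from the bullet descriptions; the pattern names, the `PATTERNS` mapping, and the `match_entities` helper are hypothetical and are not taken from the PR's source. The legacy CIPC prefixes and the recognizers' extra validation steps are not covered.

```python
import re

# Illustrative patterns derived from the PR description only.
# The recognizers in this PR may use different regexes and further validation.
PATTERNS = {
    # Prefix letter A, D, M, or T followed by 8 digits.
    "ZA_PASSPORT": re.compile(r"\b[ADMT]\d{8}\b"),
    # 10 digits with leading digit 0/1/2/3/9 (4 is reserved for VAT).
    "ZA_INCOME_TAX_NUMBER": re.compile(r"\b[01239]\d{9}\b"),
    # Leading 4 followed by 9 digits.
    "ZA_VAT_NUMBER": re.compile(r"\b4\d{9}\b"),
    # Modern CIPC format YYYY/NNNNNN/NN (legacy prefixed formats omitted).
    "ZA_COMPANY_REGISTRATION": re.compile(r"\b\d{4}/\d{6}/\d{2}\b"),
}


def match_entities(text: str) -> list:
    """Return (entity_type, matched_text) pairs for every pattern hit."""
    results = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            results.append((entity_type, match.group()))
    return results


sample = "Passport A12345678, VAT 4123456789, tax ref 0123456789, reg 2015/123456/07"
found = match_entities(sample)
```

Because the income tax and VAT patterns differ only in the leading digit, a sketch like this keeps them disjoint; context words would still be needed to separate them from other 10-digit strings.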

Known limitations: ZA_TRAFFIC_REGISTER_NUMBER vs ZA_ID_NUMBER and ZA_INCOME_TAX_NUMBER vs ZA_VAT_NUMBER may collide on numeric strings without context; licence plate patterns do not cover all legacy no-suffix formats; cellular prefix ranges are indicative due to number portability.
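The ZA_TRAFFIC_REGISTER_NUMBER versus ZA_ID_NUMBER split can be illustrated with a small sketch. It assumes that ID validation is reduced to a Luhn checksum over the 13 digits; the actual ZA_ID_NUMBER recognizer may also check the embedded date and other fields. The function names `luhn_valid` and `classify_13_digits` are hypothetical.

```python
from typing import Optional


def luhn_valid(digits: str) -> bool:
    """Standard Luhn check: double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def classify_13_digits(value: str) -> Optional[str]:
    """Assumption: a 13-digit string is an ID if Luhn passes, else a TRN candidate."""
    if len(value) != 13 or not (value.isascii() and value.isdigit()):
        return None
    if luhn_valid(value):
        return "ZA_ID_NUMBER"
    return "ZA_TRAFFIC_REGISTER_NUMBER"
```

Under this assumption any 13-digit string that fails the checksum becomes a traffic register candidate, which is why surrounding context matters for this entity.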

Issue reference

N/A

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

Extend ZA coverage beyond ZA_ID_NUMBER with passport, tax, VAT, CIPC
registration, eNaTIS driver's licence and traffic register numbers,
licence plates, and mobile/telephone numbers split by line type.

All recognizers are disabled by default in default_recognizers.yaml.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings June 17, 2026 13:23

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a suite of South Africa (ZA) PII recognizers and corresponding tests, and wires them into the predefined recognizer registry, documentation, and default configuration.

Changes:

  • Introduce new ZA recognizers (VAT, passport, income tax, driver license, company registration, traffic register number, license plate, mobile & telephone numbers).
  • Add pytest coverage for each new recognizer’s analyze/validation behavior.
  • Register the recognizers in package exports, default registry YAML, supported entities docs, and changelog.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_vat_number_recognizer.py` | Adds ZA VAT number recognizer with regex + validation. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_traffic_register_number_recognizer.py` | Adds ZA TRN recognizer with ID-number disambiguation. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_phone_number_recognizer.py` | Adds ZA phone base recognizer and mobile/telephone subclasses using phonenumbers. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_passport_recognizer.py` | Adds ZA passport recognizer with prefix rules + validation. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_license_plate_recognizer.py` | Adds ZA license plate recognizer with multiple patterns + sanitizing validation. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_income_tax_number_recognizer.py` | Adds ZA income tax reference recognizer with leading-digit rules. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_driver_license_recognizer.py` | Adds ZA driver license recognizer with regex + validation rules. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/za_company_registration_recognizer.py` | Adds ZA company registration recognizer with modern + legacy formats and year validation. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/south_africa/__init__.py` | Exports new ZA recognizers from the south_africa package. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py` | Exposes new ZA recognizers at the predefined_recognizers package level. |
| `presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml` | Adds ZA recognizers to the default registry (disabled by default). |
| `docs/supported_entities.md` | Documents newly supported ZA entity types. |
| `CHANGELOG.md` | Notes addition of ZA recognizers. |
| `presidio-analyzer/tests/test_za_vat_number_recognizer.py` | Adds tests for ZA VAT recognizer analyze/validate_result. |
| `presidio-analyzer/tests/test_za_traffic_register_number_recognizer.py` | Adds tests for ZA TRN recognizer analyze/validate_result. |
| `presidio-analyzer/tests/test_za_telephone_number_recognizer.py` | Adds tests for ZA telephone recognizer analyze + supported props. |
| `presidio-analyzer/tests/test_za_passport_recognizer.py` | Adds tests for ZA passport recognizer analyze/validate_result. |
| `presidio-analyzer/tests/test_za_mobile_number_recognizer.py` | Adds tests for ZA mobile recognizer analyze + supported props. |
| `presidio-analyzer/tests/test_za_license_plate_recognizer.py` | Adds tests for ZA license plate recognizer analyze/validate_result. |
| `presidio-analyzer/tests/test_za_income_tax_number_recognizer.py` | Adds tests for ZA income tax recognizer analyze/validate_result. |
| `presidio-analyzer/tests/test_za_driver_license_recognizer.py` | Adds tests for ZA driver license recognizer analyze/validate_result. |
| `presidio-analyzer/tests/test_za_company_registration_recognizer.py` | Adds tests for ZA company registration recognizer analyze/validate_result. |

Refactor the year validation in ZaCompanyRegistrationRecognizer and ZaDriverLicenseRecognizer so that the current year is the upper bound and the next year is no longer accepted. Remove the unnecessary dependency on ZaIdNumberRecognizer from ZaDriverLicenseRecognizer to streamline the code. Update validate_result in ZaLicensePlateRecognizer to return a boolean for consistency.
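A year bound of this kind could look like the sketch below for the modern CIPC format. The function name, the injectable `today` parameter, and the `earliest_year` lower bound of 1900 are assumptions made for illustration; the recognizer's real bounds are not shown in this conversation.

```python
import re
from datetime import date
from typing import Optional

# Modern CIPC format YYYY/NNNNNN/NN, capturing the registration year.
CIPC_MODERN = re.compile(r"^(\d{4})/\d{6}/\d{2}$")


def registration_year_is_plausible(
    value: str,
    today: Optional[date] = None,
    earliest_year: int = 1900,  # hypothetical lower bound
) -> bool:
    """Accept only years from earliest_year up to and including the current year."""
    match = CIPC_MODERN.match(value)
    if match is None:
        return False
    year = int(match.group(1))
    current_year = (today or date.today()).year
    # The upper bound is the current year itself, not current_year + 1.
    return earliest_year <= year <= current_year
```

Passing `today` explicitly keeps the check deterministic in tests, which avoids failures when the suite runs across a year boundary.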
@thatomokoena


@microsoft-github-policy-service agree

Copilot AI review requested due to automatic review settings June 18, 2026 06:27

Copilot AI left a comment


Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Correct 08x NSN fallback classification, tighten driver licence validation,
and replace the unstable passport docstring reference.

Co-authored-by: Cursor <cursoragent@cursor.com>
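The 08x fallback mentioned in the commit above can be pictured as a prefix lookup on the national number. In this sketch only the 080/086/087 mapping comes from the PR description; the mobile prefixes, the geographic 01 to 05 range, the normalisation of `+27`, and the function name are assumptions, and prefixes are written with the trunk `0` for readability.

```python
from typing import Optional

# From the PR description: toll-free, sharecall, and VoIP map to telephone.
TELEPHONE_08X = ("080", "086", "087")
# Assumption: indicative cellular ranges; number portability makes these approximate.
MOBILE_PREFIXES = ("06", "07", "081", "082", "083", "084")
# Assumption: geographic area codes begin with 01 to 05.
GEOGRAPHIC_PREFIXES = ("01", "02", "03", "04", "05")


def classify_by_prefix(number: str) -> Optional[str]:
    """Classify a ZA number by prefix when library metadata gives no line type."""
    digits = "".join(ch for ch in number if ch.isdigit())
    # Normalise international forms (+27..., 0027...) to national 0XXXXXXXXX.
    if digits.startswith("0027"):
        digits = "0" + digits[4:]
    elif digits.startswith("27") and len(digits) == 11:
        digits = "0" + digits[2:]
    if len(digits) != 10 or not digits.startswith("0"):
        return None
    # Check the 08x service ranges before the broader mobile prefixes.
    if digits.startswith(TELEPHONE_08X):
        return "ZA_TELEPHONE_NUMBER"
    if digits.startswith(MOBILE_PREFIXES):
        return "ZA_MOBILE_NUMBER"
    if digits.startswith(GEOGRAPHIC_PREFIXES):
        return "ZA_TELEPHONE_NUMBER"
    return None
```

Unknown prefixes return `None` rather than a guess, so a caller can drop the match instead of mislabelling it.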
@thatoisnaked


Hi @SharonHart @omri374. When you have time, would you be willing to take a look at this PR to add more South African recognizers? Happy to address any feedback. Thanks!

Copilot AI review requested due to automatic review settings June 22, 2026 06:36

Copilot AI left a comment


Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.

Comment on lines +11 to +12
eNaTIS licence numbers are alphanumeric strings of roughly 10–12
characters combining digit blocks with trailing letter groups.
| ZA_ID_NUMBER | The South African identity number is a 13-digit identifier in the `YYMMDDSSSSCAZ` format, where the trailing digit is validated with the Luhn algorithm. | Pattern match, context, and checksum. |
| ZA_PASSPORT | The South African passport number is a 9-character identifier with prefix letter A, D, M, or T followed by 8 digits. | Pattern match, context, and validation. |
| ZA_INCOME_TAX_NUMBER | The South African SARS income tax reference number is a 10-digit numeric identifier, commonly starting with 0, 1, 2, 3, or 9. | Pattern match, context, and validation. |
| ZA_DRIVER_LICENSE | The South African eNaTIS driver's licence number is an alphanumeric identifier of roughly 10–12 characters. | Pattern match, context, and validation. |
Comment on lines +96 to +101
try:
    parsed_number = phonenumbers.parse(
        text[match.start : match.end], self.REGION
    )
except NumberParseException:
    continue
3 participants