fix(analyzer): detect Mastercard 2-series and 18-19 digit credit cards #2075
Open
AUTHENSOR wants to merge 1 commit into
Conversation
The CreditCardRecognizer PAN regex only matched leading digits 4/5/6/1/3 and a 13-17 digit window, so two families of real, Luhn-valid cards produced no CREDIT_CARD result and leaked unredacted:

- Mastercard 2-series (BIN range 2221-2720, issued since 2017)
- 18-19 digit PANs (ISO/IEC 7812 allows up to 19, e.g. UnionPay/Maestro)

Widen the regex to cover these ranges and raise the length ceiling to 19. Luhn (validate_result) still gates every match, so non-card numbers are not flagged. Existing detections and Unix-timestamp/Luhn-invalid rejections are unchanged.
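For readers unfamiliar with the Luhn gate mentioned above, here is a minimal standalone sketch of the checksum. It is an illustration only: the function name `luhn_valid` is hypothetical, and this is not Presidio's `validate_result` implementation, just the textbook algorithm it is described as applying.

```python
def luhn_valid(pan: str) -> bool:
    """Return True if the digits in `pan` pass the Luhn checksum.

    Hypothetical helper for illustration; separators such as spaces
    or dashes are ignored.
    """
    digits = [int(ch) for ch in pan if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit and folding
    # two-digit results back into one digit (e.g. 14 -> 1 + 4 = 5).
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


if __name__ == "__main__":
    # PANs taken from the test cases in this PR.
    for pan in ("2221000000000009", "2221000000000001", "4109906958483040118"):
        print(pan, luhn_valid(pan))
```

With the PANs from this PR, `2221000000000009` and `4109906958483040118` pass, while `2221000000000001` fails, which is why the last one can serve as a Luhn-invalid regression case.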
Contributor
Pull request overview
This PR updates Presidio Analyzer’s CreditCardRecognizer to reduce false negatives by expanding the credit card PAN regex to detect Mastercard 2‑series BINs (2221–2720) and longer (18–19 digit) PANs, and it adds targeted unit tests plus a changelog entry documenting the fix.
Changes:
- Expanded the credit card regex to include Mastercard 2‑series prefixes and allow matching up to 19-digit PANs.
- Added unit tests covering Mastercard 2‑series and 18–19 digit Luhn-valid PANs (plus a Luhn-invalid regression case).
- Documented the fix in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/credit_card_recognizer.py` | Widened PAN regex to match Mastercard 2‑series and longer PAN lengths. |
| `presidio-analyzer/tests/test_credit_card_recognizer.py` | Added regression tests for newly supported PAN ranges and lengths. |
| `CHANGELOG.md` | Added a “Fixed” entry describing the credit-card false-negative fix. |
      Pattern(
          "All Credit Cards (weak)",
    -     r"\b(?!1\d{12}(?!\d))((4\d{3})|(5[0-5]\d{2})|(6\d{3})|(1\d{3})|(3\d{3}))[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,5})\b",  # noqa: E501
    +     r"\b(?!1\d{12}(?!\d))((4\d{3})|(5[0-5]\d{2})|(2(22[1-9]|2[3-9]\d|[3-6]\d\d|7[01]\d|720))|(6\d{3})|(1\d{3})|(3\d{3}))[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,7})\b",  # noqa: E501
Comment on lines +48 to +50
    ("4109906958483040118", 1, (1.0,), ((0, 19),),),
    ("6298036494205552661", 1, (1.0,), ((0, 19),),),
    ("675919345145061238", 1, (1.0,), ((0, 18),),),
Comment on lines +25 to +26
    # Luhn (validate_result) still gates every match, so widening the
    # range does not flag non-card numbers.
    - Added `supported_entity` parameter to `PhoneRecognizer`. Previously, this recognizer hard-coded `["PHONE_NUMBER"]` as the only possible supported entity.

    #### Fixed
    - Fixed a false-negative in `CreditCardRecognizer` where real, Luhn-valid cards were passing as clean (no `CREDIT_CARD` result) and leaking unredacted. The PAN regex now also matches Mastercard 2-series cards (BIN range 2221-2720, the first four digits; issued since 2017) and 18-19 digit PANs (ISO/IEC 7812 allows up to 19 digits, e.g. UnionPay/Maestro), in addition to the existing ranges. Luhn (`validate_result`) still gates every match, so the widened range does not introduce false positives on non-card numbers.
Change Description
`CreditCardRecognizer` was missing two whole families of real, current, Luhn-valid cards. Because the PAN regex never matched them, the recognizer returned no `CREDIT_CARD` result, so these card numbers passed Presidio as "clean" and would leak unredacted downstream (analyze → anonymize):

- Mastercard 2-series (BIN range 2221-2720, issued since 2017): `2221000000000009` (Luhn-valid) returned `[]`.
- 18-19 digit PANs (UnionPay, Maestro, some Visa). The old length window (`\d{3,4} \d{3,4} \d{3,5}` after a 4-digit prefix) capped matches at 17 digits, so a Luhn-valid 19-digit card returned `[]`.

A leading-4 Visa such as `4111111111111111` was, and still is, detected at score `1.0`, so this is purely a false-negative fix on the leak surface.

Fix
Widen the PAN regex in `presidio_analyzer/predefined_recognizers/generic/credit_card_recognizer.py`:

- Add the prefix branch `2(22[1-9]|2[3-9]\d|[3-6]\d\d|7[01]\d|720)`, which matches exactly the four-digit values 2221-2720 (same 4-digit shape as the existing `4\d{3}` / `5[0-5]\d{2}` / `6\d{3}` branches).
- Widen the final digit group from `\d{3,5}` to `\d{3,7}` so the pattern spans 13-19 digit PANs.

The change is conservative: Luhn (`validate_result`) still gates every match, and the existing negative lookahead that rejects 13-digit Unix timestamps starting with `1` (`(?!1\d{12}(?!\d))`) is untouched. Widening the range therefore does not flag non-card numbers: a 19-digit run that is not Luhn-valid is still rejected, exactly as today.
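The regex claims above can be checked in isolation with the standard `re` module. The sketch below copies the old and new patterns from the diff; the names `OLD_PAN`, `NEW_PAN`, `MC_2_SERIES`, `matched_prefixes`, and `matched_lengths` are hypothetical helpers for this illustration, not part of Presidio. It exercises only the regex layer, so Luhn validation is deliberately left out.

```python
import re

# Patterns copied from the diff in this PR (old = before, new = after).
OLD_PAN = re.compile(
    r"\b(?!1\d{12}(?!\d))((4\d{3})|(5[0-5]\d{2})|(6\d{3})|(1\d{3})|(3\d{3}))"
    r"[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,5})\b"
)
NEW_PAN = re.compile(
    r"\b(?!1\d{12}(?!\d))((4\d{3})|(5[0-5]\d{2})"
    r"|(2(22[1-9]|2[3-9]\d|[3-6]\d\d|7[01]\d|720))"
    r"|(6\d{3})|(1\d{3})|(3\d{3}))"
    r"[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,7})\b"
)

# The new prefix branch on its own, for an exhaustive range check.
MC_2_SERIES = re.compile(r"2(22[1-9]|2[3-9]\d|[3-6]\d\d|7[01]\d|720)")


def matched_prefixes():
    """Return every 4-digit value that the 2-series branch accepts."""
    return [n for n in range(10000) if MC_2_SERIES.fullmatch(f"{n:04d}")]


def matched_lengths(pattern):
    """Return the digit-run lengths (10..22) matched for a run starting with 4."""
    return [n for n in range(10, 23) if pattern.search("4" + "0" * (n - 1))]


if __name__ == "__main__":
    prefixes = matched_prefixes()
    print(prefixes[0], prefixes[-1], len(prefixes))
    print(matched_lengths(OLD_PAN))
    print(matched_lengths(NEW_PAN))
    print(bool(OLD_PAN.search("2221000000000009")))
    print(bool(NEW_PAN.search("2221000000000009")))
    print(bool(NEW_PAN.search("1748503543012")))
```

Enumerating all 10,000 four-digit values is cheap and avoids reasoning about the alternation by hand. The length probe shows the old window spanning 13 to 17 digits and the new one 13 to 19, and the 13-digit timestamp is still rejected by the lookahead.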
Verification

Targeted tests (run from the `presidio-analyzer` source dir):

Lint on the changed source file:

New parametrized cases assert:

- Mastercard 2-series PANs (`2221000000000009`, `2720990000000007`, `2223000048400011`) are detected as `CREDIT_CARD` at score `1.0`, including with context (`my credit card: ...`).
- 18-19 digit PANs (`675919345145061238`, `4109906958483040118`, `6298036494205552661`) are detected.
- The Unix-timestamp case (`1748503543012`) is still not flagged, the Luhn-invalid Visa/Discover cases stay rejected, and a new Luhn-invalid 2-series PAN (`2221000000000001`) is rejected.

Issue reference
Checklist