Skip to content

Fix TH TNIN regex rejecting valid province codes 22, 52 and 58#2107

Open
jichaowang02-lang wants to merge 1 commit into
data-privacy-stack:mainfrom
jichaowang02-lang:fix/th-tnin-province-regex
Open

Fix TH TNIN regex rejecting valid province codes 22, 52 and 58#2107
jichaowang02-lang wants to merge 1 commit into
data-privacy-stack:mainfrom
jichaowang02-lang:fix/th-tnin-province-regex

Conversation

@jichaowang02-lang

Copy link
Copy Markdown
Contributor

Change Description

The Thai National ID (TNIN) pattern excludes province codes (the 2nd–3rd digits, N2N3) that aren't real Thai provinces. But the shared character class is over-narrow:

r"\b[1-9](?:[134][0-9]|[25][0134567]|[67][01234567]|[89][0123456])\d{10}\b"
#                      ^^^^^^^^^^^^^ shared by N2=2 and N2=5

[25][0134567] allows N3 ∈ {0,1,3,4,5,6,7} for both N2=2 and N2=5, dropping N3=2 and N3=8. But per the recognizer's own docstring, the only forbidden combinations are 28, 29 (for N2=2) and 59 (for N2=5). So the class wrongly rejects the valid province codes:

code province should be
22 Chanthaburi valid
52 Lampang valid
58 Mae Hong Son valid

An ID with one of these codes never matches the pattern, so a genuinely valid TNIN (correct mod-11 check digit) is never detected — a false negative.

Fix

Split the shared class into 2[0-7]|5[0-8], matching exactly the set of N2N3 combinations the docstring documents as valid.

Checklist

  • I have reviewed the contribution guidelines
  • I have added tests to cover my changes
  • All new and existing tests passed

Tests

$ pytest tests/test_th_tnin_recognizer.py -q
46 passed
  • New cases for valid TNINs 1220000000007 / 1520000000004 / 1580000000004 (provinces 22/52/58) — fail before the fix, pass after.
  • Verified exhaustively across all 90 N2N3 combinations: the new regex accepts every documented-valid code and still rejects all 13 forbidden ones (the existing forbidden-combo tests still pass).

The TNIN pattern excludes province codes (digits N2N3) that are not real Thai
provinces, but the shared character class `[25][0134567]` is over-narrow: it
drops N3=2 and N3=8 for *both* N2=2 and N2=5. Per the recognizer's own
docstring, only 28, 29 (for N2=2) and 59 (for N2=5) should be excluded, so the
class wrongly rejected the valid province codes 22 (Chanthaburi), 52 (Lampang)
and 58 (Mae Hong Son). IDs with those codes never matched the pattern, so a
genuinely valid TNIN (correct mod-11 checksum) was never detected.

Split the shared class into `2[0-7]|5[0-8]`, which matches exactly the set of
N2N3 combinations the docstring documents as valid. Verified exhaustively
across all 90 N2N3 combinations: the regex now accepts every documented-valid
code and still rejects all 13 forbidden ones.

Adds regression cases for valid TNINs with province codes 22/52/58.
Copilot AI review requested due to automatic review settings June 27, 2026 02:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants