Fix ISO 8601 date pattern accepting impossible month/day values #2113

Open
jichaowang02-lang wants to merge 1 commit into data-privacy-stack:main from jichaowang02-lang:fix/date-iso8601-month-day-range

@jichaowang02-lang

Contributor

Summary

The "ISO 8601 datetime" pattern in DateRecognizer uses [01]\d for the month and [0-3]\d for the day. These ranges admit values that cannot occur in a real date:

  • month: 00 and 13–19
  • day: 00 and 32–39

So strings like 2024-13-15T14:30:00Z or 2024-12-32T14:30Z are detected as DATE_TIME.
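
As a quick check of the ranges listed above, the sketch below enumerates every two-digit string that the old character classes accept but that falls outside 01–12 (month) or 01–31 (day). The names `OLD_MONTH`, `OLD_DAY`, `bad_months`, and `bad_days` are illustrative only and do not come from the recognizer.

```python
import re

# The month and day fragments used by the old ISO 8601 pattern.
OLD_MONTH = re.compile(r"[01]\d")
OLD_DAY = re.compile(r"[0-3]\d")

# Every zero-padded two-digit string from "00" to "99".
two_digit = [f"{n:02d}" for n in range(100)]

# Values the old fragments accept even though no real date contains them.
bad_months = [s for s in two_digit if OLD_MONTH.fullmatch(s) and not 1 <= int(s) <= 12]
bad_days = [s for s in two_digit if OLD_DAY.fullmatch(s) and not 1 <= int(s) <= 31]

print(bad_months)  # ['00', '13', '14', '15', '16', '17', '18', '19']
print(bad_days)    # ['00', '32', '33', '34', '35', '36', '37', '38', '39']
```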

Every other date pattern in this same file already constrains the month to 01–12 ([1-9]|0[1-9]|1[0-2]) and the day to 01–31 ([1-9]|0[1-9]|[1-2][0-9]|3[0-1]). Only the ISO 8601 pattern was left loose, which looks like an oversight rather than intent.

Fix

Tighten the month/day fields of the ISO pattern to valid ISO 8601 ranges (zero-padded, since ISO 8601 always uses 2-digit fields):

-\d{4}-[01]\d-[0-3]\dT...
+\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])T...

Non-capturing groups are used so the existing capture-group positions in the pattern are unchanged. No valid ISO 8601 datetime is lost — the excluded values are not valid dates.
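
To illustrate the capture-group claim, here is a minimal sketch that rebuilds both versions of the pattern from the fragments in the diff and compares their group layout on a few valid inputs. The helper `build_iso_pattern` and the sample strings are assumptions made for this example; the recognizer itself defines the pattern as a single literal.

```python
import re

def build_iso_pattern(date: str) -> re.Pattern:
    """Assemble the three-alternative ISO pattern around a given date fragment."""
    zone = r"([+-][0-2]\d:[0-5]\d|Z)"   # UTC offset or Z (capturing, as in the diff)
    hm = r"T[0-2]\d:[0-5]\d"            # hours and minutes
    sec = r":[0-5]\d"                   # seconds
    return re.compile(
        rf"\b({date}{hm}{sec}\.\d+{zone})|({date}{hm}{sec}{zone})|({date}{hm}{zone})\b"
    )

OLD = build_iso_pattern(r"\d{4}-[01]\d-[0-3]\d")
NEW = build_iso_pattern(r"\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])")

# Hypothetical valid inputs, one per alternative of the pattern.
valid = [
    "2024-12-31T14:30:00.250Z",
    "2024-02-09T08:05:59+01:00",
    "2024-10-01T23:59Z",
]

# The (?:...) groups add no capture groups, so the count is unchanged.
print(OLD.groups, NEW.groups)

# For valid input, every capture group holds the same text before and after.
same_groups = all(OLD.search(s).groups() == NEW.search(s).groups() for s in valid)
print(same_groups)
```

Had the month and day alternations been written with plain parentheses, each of the three alternatives would gain two groups and any code indexing into the match would shift.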

Tests

Added parametrized cases for invalid month (00, 13) and day (00, 32). All existing test_date_recognizer.py cases still pass (39 passed total). ruff check . is clean.
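
The added cases amount to a table like the one sketched below. This is a standalone approximation that runs the tightened regex directly with `re` rather than going through DateRecognizer. The names `ISO_DATETIME`, `CASES`, and `is_detected`, the surrounding sentence, and the `00` and valid sample strings are assumptions for illustration. The same table could be passed to `pytest.mark.parametrize`.

```python
import re

# Tightened pattern from this PR, split into fragments for readability.
DATE = r"\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])"
ZONE = r"([+-][0-2]\d:[0-5]\d|Z)"
ISO_DATETIME = re.compile(
    rf"\b({DATE}T[0-2]\d:[0-5]\d:[0-5]\d\.\d+{ZONE})"
    rf"|({DATE}T[0-2]\d:[0-5]\d:[0-5]\d{ZONE})"
    rf"|({DATE}T[0-2]\d:[0-5]\d{ZONE})\b"
)

# (input, should it be detected?)
CASES = [
    ("2024-00-15T14:30:00Z", False),  # month 00
    ("2024-13-15T14:30:00Z", False),  # month 13
    ("2024-12-00T14:30Z", False),     # day 00
    ("2024-12-32T14:30Z", False),     # day 32
    ("2024-01-01T14:30:00Z", True),   # lowest valid month/day
    ("2024-12-31T14:30Z", True),      # highest valid month/day
]

def is_detected(text: str) -> bool:
    # Embed the value in a sentence, since recognizers scan free text.
    return ISO_DATETIME.search(f"logged at {text} by the job") is not None

failures = [(text, expected) for text, expected in CASES if is_detected(text) != expected]
print(failures)  # []
```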

The "ISO 8601 datetime" pattern in DateRecognizer used `[01]\d` for the
month and `[0-3]\d` for the day. These ranges admit impossible values:
month `00` and `13`-`19`, and day `00` and `32`-`39`. As a result
strings such as `2024-13-15T14:30:00Z` and `2024-12-32T14:30Z` were
detected as DATE_TIME.

Every other date pattern in this same file already constrains the
month to `01`-`12` and the day to `01`-`31`; only the ISO 8601 pattern
was loose. Tighten the ISO month/day fields to match (using
non-capturing groups so existing capture-group positions are
unaffected). No valid ISO 8601 datetime is lost, since those values are
not valid dates to begin with.

Adds parametrized cases for invalid month (00, 13) and day (00, 32).
Copilot AI review requested due to automatic review settings June 27, 2026 16:59

Copilot AI left a comment

Contributor

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Pattern(
"ISO 8601 datetime",
-r"\b(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d\.\d+([+-][0-2]\d:[0-5]\d|Z))|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d([+-][0-2]\d:[0-5]\d|Z))|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d([+-][0-2]\d:[0-5]\d|Z))\b",
+r"\b(\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])T[0-2]\d:[0-5]\d:[0-5]\d\.\d+([+-][0-2]\d:[0-5]\d|Z))|(\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])T[0-2]\d:[0-5]\d:[0-5]\d([+-][0-2]\d:[0-5]\d|Z))|(\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])T[0-2]\d:[0-5]\d([+-][0-2]\d:[0-5]\d|Z))\b",