Extract published date from abbreviated and non-ISO formats #251
Draft
drstarry wants to merge 3 commits into kepano:main from
When <time>April 8, 2026</time> has no datetime attribute,
getTimeElement returned the raw text instead of an ISO date. Run
parseDateText on the text content first and fall back to the raw
value if the parser doesn't recognize the format.
Also replaces Array.from(querySelectorAll('time'))[0] with
querySelector('time') — same selector, no NodeList/Array allocation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
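The fallback order described in this commit could be sketched roughly as follows. This is a minimal illustration, not the actual `getTimeElement` in `src/metadata.ts`: the helper name `publishedFromTime`, the `TimeLike` type, and the `parse` callback (standing in for `parseDateText`) are hypothetical, and it assumes the parser returns an empty string for formats it does not recognize and that a present `datetime` attribute is preferred.

```typescript
// Minimal structural type so the sketch runs without a DOM.
// A real <time> element satisfies it.
interface TimeLike {
  getAttribute(name: string): string | null;
  textContent: string | null;
}

// Hypothetical helper. `parse` stands in for parseDateText and is assumed
// to return '' when it does not recognize the format.
function publishedFromTime(
  el: TimeLike | null,
  parse: (text: string) => string,
): string {
  if (!el) return '';

  // Assumption: a datetime attribute, when present, is used as-is.
  const datetime = el.getAttribute('datetime')?.trim();
  if (datetime) return datetime;

  // No datetime attribute: try to normalize the text content,
  // and fall back to the raw text if the parser gives up.
  const text = el.textContent?.trim() ?? '';
  if (!text) return '';
  return parse(text) || text;
}
```

With this shape, `<time>April 8, 2026</time>` is normalized when the parser recognizes it, and an unrecognized string still comes back unchanged rather than being dropped.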
Some publishers ship non-ISO date strings in places defuddle trusts to already be ISO, e.g. JSON-LD `datePublished` as `"Oct 20, 2025"`, or place the date in a subtree that `h1.nextElementSibling` walks can't reach. Both cases return empty or raw output, which downstream consumers then try to pass to `new Date()`, hitting V8's year-2001 fallback for incomplete date strings.

Three complementary changes in `getPublished`:

1. `parseDateText` now accepts 3-4 letter month abbreviations (Jan, Feb, Sept, etc.) in addition to full month names. Uses a shared MONTH_PATTERN so both day-first and month-first regexes stay in sync.
2. Results from JSON-LD, meta tags, and abbr/time elements pass through `parseDateText` before returning, so non-ISO inputs get normalized to ISO 8601 where possible. ISO strings fall through unchanged.
3. When h1 forward-sibling search finds nothing, walk up to three ancestor levels and scan their `p`/`time` descendants. Covers layouts where the article date lives in a cousin subtree (e.g. separate header and metadata columns under a shared wrapper).

Tests: new unit tests for parseDateText, new fixtures for the JSON-LD and h1-cousin cases, plus two existing fixtures updated to reflect that previously-missed dates are now extracted.
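To illustrate the first two changes, here is a minimal sketch of a parser built around a shared month pattern. It is not the actual `parseDateText`: the names `parseDateSketch`, `toIso`, `MONTH_FIRST`, `DAY_FIRST`, `ISO_PREFIX`, and `MONTH_KEYS` are hypothetical, the exact regexes are assumptions, and it assumes an empty string signals an unrecognized format. The output shape follows the `YYYY-MM-DDT00:00:00+00:00` form quoted in this PR.

```typescript
// Month tokens: full names plus 3-4 letter abbreviations (Jan, Sept, ...).
// Shared by both regexes so they cannot drift apart.
const MONTH_PATTERN =
  'jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|' +
  'aug(?:ust)?|sept?(?:ember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?';

// Compiled once at module load. The optional trailing period sits outside
// the capture group, so the captured token is always a bare month name.
const MONTH_FIRST = new RegExp(
  `\\b(${MONTH_PATTERN})\\.?\\s+(\\d{1,2}),?\\s+(\\d{4})\\b`, 'i');
const DAY_FIRST = new RegExp(
  `\\b(\\d{1,2})\\s+(${MONTH_PATTERN})\\.?,?\\s+(\\d{4})\\b`, 'i');
const ISO_PREFIX = /^\d{4}-\d{2}-\d{2}/;

const MONTH_KEYS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
                    'jul', 'aug', 'sep', 'oct', 'nov', 'dec'];

// Builds the midnight-UTC ISO string; returns '' for an impossible day.
function toIso(monthToken: string, day: string, year: string): string {
  const month = MONTH_KEYS.indexOf(monthToken.slice(0, 3).toLowerCase()) + 1;
  const d = Number(day);
  if (month === 0 || d < 1 || d > 31) return '';
  const pad = (n: number) => String(n).padStart(2, '0');
  return `${year}-${pad(month)}-${pad(d)}T00:00:00+00:00`;
}

// Assumption: '' means "format not recognized", so callers can fall back
// to the raw value.
function parseDateSketch(text: string): string {
  const trimmed = text.trim();
  if (ISO_PREFIX.test(trimmed)) return trimmed; // ISO falls through unchanged
  const mf = MONTH_FIRST.exec(trimmed);
  if (mf) return toIso(mf[1]!, mf[2]!, mf[3]!);
  const df = DAY_FIRST.exec(trimmed);
  if (df) return toIso(df[2]!, df[1]!, df[3]!);
  return '';
}
```

A caller on the JSON-LD or meta-tag path would then use something like `parseDateSketch(raw) || raw`, so unrecognized strings are returned as-is.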
Three refinements after code review:

- Ancestor walk now scans siblings of the path element at each level rather than running querySelectorAll on the whole ancestor subtree. The previous version could reach into an <article> body at depth 2-3 and match a date-shaped phrase in prose as a false positive.
- parseDateText uses pre-compiled static regexes instead of building them via `new RegExp(...)` on every call. This is a hot path during the h1-proximity fallback, which may invoke parseDateText once per p/time element in the header region.
- Removed the dead `.replace(/\.$/, '')` calls: the optional trailing period is already outside the capture group, so the captured token is always a bare month name.

Adds a regression fixture with dates in body prose that must NOT be picked up as metadata, plus unit tests for trailing-period and uppercase abbreviations.
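The sibling-scoped ancestor walk from the first refinement might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `NodeLike`, `el`, and `findDateNearHeading` are hypothetical names, the tree type replaces the real DOM so the example runs standalone, `parse` stands in for `parseDateText` (assumed to return an empty string on failure), and limiting each sibling to itself plus its direct `p`/`time` children is an assumed way of keeping the scan shallow.

```typescript
// Tiny DOM stand-in so the sketch runs without a browser.
interface NodeLike {
  tag: string;
  text: string;
  parent: NodeLike | null;
  children: NodeLike[];
}

// Hypothetical builder that also wires up parent pointers.
function el(tag: string, text = '', children: NodeLike[] = []): NodeLike {
  const node: NodeLike = { tag, text, parent: null, children };
  for (const child of children) child.parent = node;
  return node;
}

const isDateTag = (n: NodeLike) => n.tag === 'p' || n.tag === 'time';

// Walk up to `maxLevels` ancestors. At each level, look only at siblings
// of the element on the path from the h1 upward, never at the whole
// ancestor subtree.
function findDateNearHeading(
  h1: NodeLike,
  parse: (text: string) => string,
  maxLevels = 3,
): string {
  let path = h1;
  for (let level = 0; level < maxLevels; level++) {
    const parent = path.parent;
    if (!parent) break;
    for (const sibling of parent.children) {
      if (sibling === path) continue;
      // Assumption: check the sibling and its direct p/time children only.
      const candidates = [sibling, ...sibling.children].filter(isDateTag);
      for (const candidate of candidates) {
        const parsed = parse(candidate.text.trim());
        if (parsed) return parsed;
      }
    }
    path = parent;
  }
  return '';
}
```

In a header/metadata two-column layout, the date column is a sibling of the column holding the h1, so it is found at the second level. A date-shaped phrase nested inside article prose sits deeper than a direct child of any path sibling and is not visited.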
Symptom
Two related bugs when extracting the published date from pages with non-ISO or non-sibling date layouts.
Case 1: JSON-LD ships non-ISO strings. Some publishers emit schema.org structured data like `"datePublished": "Oct 20, 2025"` instead of the ISO 8601 format the spec requires. Defuddle returned the raw string, which downstream consumers then passed to `new Date()`, hitting V8's year-2001 fallback for incomplete date strings (`new Date("Oct 20")` → `2001-10-20`).

Case 2: Date lives in an h1 cousin subtree. Other pages have no JSON-LD date at all and structure their markup so the date is a descendant of an h1 sibling's sibling, not the h1 itself. Defuddle's existing `h1.nextElementSibling` walk couldn't reach it and returned empty.

Both patterns show up on live Anthropic pages: the news section hits case 1, the engineering section hits case 2.
Fix
Three complementary changes in `src/metadata.ts`:

1. `parseDateText` accepts month abbreviations. `Jan`, `Feb`, `Sept`, `Oct`, etc. in addition to full month names. A shared `MONTH_PATTERN` keeps the day-first and month-first regexes in sync.
2. `getPublished` normalizes all source paths through `parseDateText`. JSON-LD, meta tags, abbr elements, and time elements now pass through the parser before returning. ISO strings fall through unchanged; natural-language strings get normalized to `YYYY-MM-DDT00:00:00+00:00`.
3. h1-proximity search walks ancestors. When `h1.nextElementSibling` finds nothing, the search walks up to three ancestor levels and scans their `p` and `time` descendants. This scopes the search to the article header area without scanning the whole document.

This PR supersedes #248: it subsumes the `<time>` element text parsing fix from that branch and expands it to cover the JSON-LD and cousin-subtree cases that showed up in real Anthropic pages.

Tests

- `tests/parse-date-text.test.ts`: 11 cases covering full names, abbreviations, day-of-week prefixes, and ISO fall-through
- `metadata--json-ld-natural-date.html`: reproduces the JSON-LD `"Oct 20, 2025"` pattern
- `metadata--h1-cousin-date.html`: reproduces the cousin-subtree layout
- `metadata--time-element-no-datetime.html`: (from prior branch commit) reproduces `<time>April 8, 2026</time>` without a datetime attribute
- `math--katex.md`: `""` → `"2020-04-29T00:00:00+00:00"`; the fixture contains "29 Apr 2020", which the expanded parser now matches. This was previously a silent miss.
- `general--www.figma.com-blog-introducing-codex-to-figma.md`: `"February 26, 2026"` → `"2026-02-26T00:00:00+00:00"`; the JSON-LD normalization now converts it. Also previously a silent miss.

All 253 tests pass (11 new parseDateText units + 148 fixtures + 94 other tests).
Before / After

| Input | Before | After |
| --- | --- | --- |
| `"datePublished": "Oct 20, 2025"` | `"Oct 20, 2025"` (raw) | `"2025-10-20T00:00:00+00:00"` |
| `<time>April 8, 2026</time>` | `"April 8, 2026"` (raw) | `"2026-04-08T00:00:00+00:00"` |
| `<p class="date">Feb 05, 2026</p>` in sibling subtree | `""` (empty) | `"2026-02-05T00:00:00+00:00"` |
| `content="February 26, 2026"` | `"February 26, 2026"` (raw) | `"2026-02-26T00:00:00+00:00"` |
| `"2025-01-15"` | `"2025-01-15"` | `"2025-01-15"` (unchanged) |
| `"some unknown format"` | `"some unknown format"` | `"some unknown format"` (unchanged fallback) |

Verified against real pages

- https://www.anthropic.com/news/claude-code-on-the-web → `published: "2025-10-20T00:00:00+00:00"` ✓
- https://www.anthropic.com/engineering/building-c-compiler → `published: "2026-02-05T00:00:00+00:00"` ✓