Extract published date from abbreviated and non-ISO formats by drstarry · Pull Request #251 · kepano/defuddle

drstarry · 2026-04-14T12:23:13Z

Symptom

Two related bugs when extracting the published date from pages with non-ISO or non-sibling date layouts.

Case 1 — JSON-LD ships non-ISO strings. Some publishers emit schema.org structured data like "datePublished": "Oct 20, 2025" instead of the ISO 8601 format the spec requires. Defuddle returned the raw string, which downstream consumers then passed to new Date() — hitting V8's year-2001 fallback for incomplete date strings (new Date("Oct 20") → 2001-10-20).

Case 2 — Date lives in an h1 cousin subtree. Other pages have no JSON-LD date at all and structure their markup so the date is a descendant of an h1 sibling's sibling, not the h1 itself. Defuddle's existing h1.nextElementSibling walk couldn't reach it and returned empty.

Both patterns show up on live Anthropic pages — the news section hits case 1, the engineering section hits case 2.

Fix

Three complementary changes in src/metadata.ts:

parseDateText accepts month abbreviations. Jan, Feb, Sept, Oct, etc. in addition to full month names. A shared MONTH_PATTERN keeps the day-first and month-first regexes in sync.
getPublished normalizes all source paths through parseDateText. JSON-LD, meta tags, abbr elements, and time elements now pass through the parser before returning. ISO strings fall through unchanged; natural-language strings get normalized to YYYY-MM-DDT00:00:00+00:00.
h1-proximity search walks ancestors. When h1.nextElementSibling finds nothing, the search walks up to three ancestor levels and scans their p and time descendants. This scopes the search to the article header area without scanning the whole document.

// 1. Shared month pattern (abbreviations + full names, optional trailing period)
private static readonly MONTH_PATTERN =
    '(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sept?(?:ember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\\.?';

// 2. Normalize all getPublished sources
if (result) return this.parseDateText(result) || result;

// 3. Walk h1 ancestors when forward-sibling search fails
let ancestor: Element | null = h1.parentElement;
for (let depth = 0; depth < 3 && ancestor; depth++) {
    for (const child of Array.from(ancestor.querySelectorAll('p, time'))) {
        if (child === h1 || h1.contains(child)) continue;
        const parsed = this.parseDateText(child.textContent?.trim() || '');
        if (parsed) return parsed;
    }
    ancestor = ancestor.parentElement;
}

This PR supersedes #248 — it subsumes the <time> element text parsing fix from that branch and expands it to cover the JSON-LD and cousin-subtree cases that showed up in real Anthropic pages.

Tests

New unit tests (tests/parse-date-text.test.ts) — 11 cases covering full names, abbreviations, day-of-week prefixes, and ISO fall-through
New fixture metadata--json-ld-natural-date.html — reproduces the JSON-LD "Oct 20, 2025" pattern
New fixture metadata--h1-cousin-date.html — reproduces the cousin-subtree layout
New fixture metadata--time-element-no-datetime.html — (from prior branch commit) reproduces <time>April 8, 2026</time> without a datetime attribute
Updated expected math--katex.md — "" → "2020-04-29T00:00:00+00:00"; the fixture contains "29 Apr 2020" which the expanded parser now matches. This was previously a silent miss.
Updated expected general--www.figma.com-blog-introducing-codex-to-figma.md — "February 26, 2026" → "2026-02-26T00:00:00+00:00"; the JSON-LD normalization now converts it. Also previously a silent miss.

All 253 tests pass (11 new parseDateText units + 148 fixtures + 94 other tests).

Before / After

Source	Input	Before	After
JSON-LD	`"datePublished": "Oct 20, 2025"`	`"Oct 20, 2025"` (raw)	`"2025-10-20T00:00:00+00:00"`
`<time>`	`<time>April 8, 2026</time>`	`"April 8, 2026"` (raw)	`"2026-04-08T00:00:00+00:00"`
h1 cousin	`<p class="date">Feb 05, 2026</p>` in sibling subtree	`""` (empty)	`"2026-02-05T00:00:00+00:00"`
meta tag	`content="February 26, 2026"`	`"February 26, 2026"` (raw)	`"2026-02-26T00:00:00+00:00"`
ISO string	`"2025-01-15"`	`"2025-01-15"`	`"2025-01-15"` (unchanged)
Unknown	`"some unknown format"`	`"some unknown format"`	`"some unknown format"` (unchanged fallback)

Verified against real pages

https://www.anthropic.com/news/claude-code-on-the-web — published: "2025-10-20T00:00:00+00:00" ✓
https://www.anthropic.com/engineering/building-c-compiler — published: "2026-02-05T00:00:00+00:00" ✓

When <time>April 8, 2026</time> has no datetime attribute, getTimeElement returned the raw text instead of an ISO date. Run parseDateText on the text content first and fall back to the raw value if the parser doesn't recognize the format. Also replaces Array.from(querySelectorAll('time'))[0] with querySelector('time') — same selector, no NodeList/Array allocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Some publishers ship non-ISO date strings in places defuddle trusts to already be ISO — e.g. JSON-LD `datePublished` as `"Oct 20, 2025"` — or place the date in a subtree that `h1.nextElementSibling` walks can't reach. Both cases return empty or raw output, which downstream consumers then try to pass to `new Date()`, hitting V8's year-2001 fallback for incomplete date strings. Three complementary changes in `getPublished`: 1. `parseDateText` now accepts 3-4 letter month abbreviations (Jan, Feb, Sept, etc.) in addition to full month names. Uses a shared MONTH_PATTERN so both day-first and month-first regexes stay in sync. 2. Results from JSON-LD, meta tags, and abbr/time elements pass through `parseDateText` before returning, so non-ISO inputs get normalized to ISO 8601 where possible. ISO strings fall through unchanged. 3. When h1 forward-sibling search finds nothing, walk up to three ancestor levels and scan their `p`/`time` descendants. Covers layouts where the article date lives in a cousin subtree (e.g. separate header and metadata columns under a shared wrapper). Tests: new unit tests for parseDateText, new fixtures for the JSON-LD and h1-cousin cases, plus two existing fixtures updated to reflect that previously-missed dates are now extracted.

Three refinements after code review: - Ancestor walk now scans siblings of the path element at each level rather than running querySelectorAll on the whole ancestor subtree. The previous version could reach into an <article> body at depth 2-3 and match a date-shaped phrase in prose as a false positive. - parseDateText uses pre-compiled static regexes instead of building them via `new RegExp(...)` on every call. This is a hot path during the h1-proximity fallback, which may invoke parseDateText once per p/time element in the header region. - Removed the dead `.replace(/\.$/, '')` calls — the optional trailing period is already outside the capture group, so the captured token is always a bare month name. Adds a regression fixture with dates in body prose that must NOT be picked up as metadata, plus unit tests for trailing-period and uppercase abbreviations.

drstarry and others added 3 commits April 13, 2026 10:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract published date from abbreviated and non-ISO formats#251

Extract published date from abbreviated and non-ISO formats#251
drstarry wants to merge 3 commits intokepano:mainfrom
drstarry:fix/time-element-parse-text

drstarry commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

drstarry commented Apr 14, 2026

Symptom

Fix

Tests

Before / After

Verified against real pages

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant