Extract published date from abbreviated and non-ISO formats #251

Draft
drstarry wants to merge 3 commits into kepano:main from drstarry:fix/time-element-parse-text

Symptom

Two related bugs when extracting the published date from pages with non-ISO or non-sibling date layouts.

Case 1: JSON-LD ships non-ISO strings. Some publishers emit schema.org structured data like "datePublished": "Oct 20, 2025" instead of the ISO 8601 format the spec requires. Defuddle returned the raw string, which downstream consumers then passed to new Date(), hitting V8's year-2001 fallback for incomplete date strings (new Date("Oct 20") → 2001-10-20).

Case 2: Date lives in an h1 cousin subtree. Other pages have no JSON-LD date at all, and their markup places the date inside a sibling of one of the h1's ancestors (a "cousin" of the h1) rather than in an element that directly follows the h1. Defuddle's existing h1.nextElementSibling walk couldn't reach it and returned an empty string.

Both patterns show up on live Anthropic pages — the news section hits case 1, the engineering section hits case 2.
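
Case 1 can be reproduced in isolation. The sketch below is not Defuddle's code: `readDatePublished` and `ISO_DATE_PREFIX` are hypothetical names introduced here, and the JSON-LD handling is deliberately minimal (a single object or an array of objects, no `@graph` support). It only shows how a non-ISO string survives a naive JSON-LD read.

```typescript
// Loose check for an ISO 8601 calendar date, optionally followed by a time part.
const ISO_DATE_PREFIX = /^\d{4}-\d{2}-\d{2}(?:T|$)/;

// Hypothetical helper: pull datePublished out of one JSON-LD script body.
function readDatePublished(jsonLdText: string): string | undefined {
  let data: unknown;
  try {
    data = JSON.parse(jsonLdText);
  } catch (err) {
    return undefined; // malformed JSON-LD is treated as "no date"
  }
  // Assumption: the payload is one object or an array of objects.
  const items: unknown[] = Array.isArray(data) ? data : [data];
  for (const item of items) {
    if (item && typeof item === 'object') {
      const value = (item as Record<string, unknown>).datePublished;
      if (typeof value === 'string') return value;
    }
  }
  return undefined;
}

const raw = readDatePublished('{"@type":"NewsArticle","datePublished":"Oct 20, 2025"}');
// raw is the string "Oct 20, 2025"; it does not match ISO_DATE_PREFIX,
// so a consumer that assumes ISO input receives a natural-language date.
const looksIso = raw !== undefined && ISO_DATE_PREFIX.test(raw);
```

Here `looksIso` is false, which is the condition the normalization step in this PR is meant to remove.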

Fix

Three complementary changes in src/metadata.ts:

  1. parseDateText accepts month abbreviations (Jan, Feb, Sept, Oct, etc.) in addition to full month names. A shared MONTH_PATTERN keeps the day-first and month-first regexes in sync.

  2. getPublished normalizes all source paths through parseDateText. JSON-LD, meta tags, abbr elements, and time elements now pass through the parser before returning. ISO strings fall through unchanged; natural-language strings get normalized to YYYY-MM-DDT00:00:00+00:00.

  3. h1-proximity search walks ancestors. When h1.nextElementSibling finds nothing, the search walks up to three ancestor levels and scans their p and time descendants. This scopes the search to the article header area without scanning the whole document.

// 1. Shared month pattern (abbreviations + full names, optional trailing period)
private static readonly MONTH_PATTERN =
    '(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sept?(?:ember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\\.?';

// 2. Normalize all getPublished sources
if (result) return this.parseDateText(result) || result;

// 3. Walk h1 ancestors when forward-sibling search fails
let ancestor: Element | null = h1.parentElement;
for (let depth = 0; depth < 3 && ancestor; depth++) {
    for (const child of Array.from(ancestor.querySelectorAll('p, time'))) {
        if (child === h1 || h1.contains(child)) continue;
        const parsed = this.parseDateText(child.textContent?.trim() || '');
        if (parsed) return parsed;
    }
    ancestor = ancestor.parentElement;
}
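
For readers who want to try changes 1 and 2 outside the repository, here is a standalone sketch. It reuses the MONTH_PATTERN shown above, but the two regex shapes, `parseDateTextSketch`, `normalizePublished`, and the helper names are assumptions made for illustration, not the code in src/metadata.ts. It does not validate calendar dates beyond a 1-31 day range.

```typescript
// Month token: abbreviation or full name, with the optional trailing period
// kept outside the capture group (same shape as the PR's MONTH_PATTERN).
const MONTH_PATTERN =
  '(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|' +
  'Aug(?:ust)?|Sept?(?:ember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\\.?';

// Assumed regex shapes, compiled once rather than on every call.
const MONTH_FIRST = new RegExp('\\b' + MONTH_PATTERN + '\\s+(\\d{1,2}),?\\s+(\\d{4})\\b', 'i');
const DAY_FIRST = new RegExp('\\b(\\d{1,2})\\s+' + MONTH_PATTERN + ',?\\s+(\\d{4})\\b', 'i');

const MONTH_KEYS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'];

function pad2(n: number): string {
  return n < 10 ? '0' + n : String(n);
}

// Build YYYY-MM-DDT00:00:00+00:00 from parsed parts, or undefined if they are implausible.
function toIsoMidnight(year: number, monthToken: string, day: number): string | undefined {
  // The first three letters identify the month for both "Sep" and "Sept".
  const month = MONTH_KEYS.indexOf(monthToken.slice(0, 3).toLowerCase()) + 1;
  if (month === 0 || day < 1 || day > 31) return undefined;
  return year + '-' + pad2(month) + '-' + pad2(day) + 'T00:00:00+00:00';
}

function parseDateTextSketch(text: string): string | undefined {
  const monthFirst = MONTH_FIRST.exec(text); // e.g. "Oct 20, 2025"
  if (monthFirst) return toIsoMidnight(Number(monthFirst[3]), monthFirst[1], Number(monthFirst[2]));
  const dayFirst = DAY_FIRST.exec(text); // e.g. "29 Apr 2020"
  if (dayFirst) return toIsoMidnight(Number(dayFirst[3]), dayFirst[2], Number(dayFirst[1]));
  return undefined; // ISO strings and unknown formats are not touched here
}

// Change 2: normalize when possible, otherwise return the source string unchanged.
function normalizePublished(raw: string): string {
  return parseDateTextSketch(raw) || raw;
}
```

With this sketch, `normalizePublished("Oct 20, 2025")` yields `"2025-10-20T00:00:00+00:00"`, while `normalizePublished("2025-01-15")` returns its input unchanged because neither regex matches an ISO string.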

This PR supersedes #248 — it subsumes the <time> element text parsing fix from that branch and expands it to cover the JSON-LD and cousin-subtree cases that showed up in real Anthropic pages.

Tests

  • New unit tests (tests/parse-date-text.test.ts) — 11 cases covering full names, abbreviations, day-of-week prefixes, and ISO fall-through
  • New fixture metadata--json-ld-natural-date.html — reproduces the JSON-LD "Oct 20, 2025" pattern
  • New fixture metadata--h1-cousin-date.html — reproduces the cousin-subtree layout
  • New fixture metadata--time-element-no-datetime.html — (from prior branch commit) reproduces <time>April 8, 2026</time> without a datetime attribute
  • Updated expected math--katex.md: "" → "2020-04-29T00:00:00+00:00"; the fixture contains "29 Apr 2020", which the expanded parser now matches. This was previously a silent miss.
  • Updated expected general--www.figma.com-blog-introducing-codex-to-figma.md: "February 26, 2026" → "2026-02-26T00:00:00+00:00"; the JSON-LD normalization now converts it. Also previously a silent miss.

All 253 tests pass (11 new parseDateText units + 148 fixtures + 94 other tests).

Before / After

| Source | Input | Before | After |
|---|---|---|---|
| JSON-LD | `"datePublished": "Oct 20, 2025"` | `"Oct 20, 2025"` (raw) | `"2025-10-20T00:00:00+00:00"` |
| `<time>` | `<time>April 8, 2026</time>` | `"April 8, 2026"` (raw) | `"2026-04-08T00:00:00+00:00"` |
| h1 cousin | `<p class="date">Feb 05, 2026</p>` in sibling subtree | `""` (empty) | `"2026-02-05T00:00:00+00:00"` |
| meta tag | `content="February 26, 2026"` | `"February 26, 2026"` (raw) | `"2026-02-26T00:00:00+00:00"` |
| ISO string | `"2025-01-15"` | `"2025-01-15"` | `"2025-01-15"` (unchanged) |
| Unknown | `"some unknown format"` | `"some unknown format"` | `"some unknown format"` (unchanged fallback) |

Verified against real pages

  • https://www.anthropic.com/news/claude-code-on-the-web → published: "2025-10-20T00:00:00+00:00"
  • https://www.anthropic.com/engineering/building-c-compiler → published: "2026-02-05T00:00:00+00:00"

drstarry and others added 3 commits April 13, 2026 10:41
When <time>April 8, 2026</time> has no datetime attribute,
getTimeElement returned the raw text instead of an ISO date. Run
parseDateText on the text content first and fall back to the raw
value if the parser doesn't recognize the format.

Also replaces Array.from(querySelectorAll('time'))[0] with
querySelector('time') — same selector, no NodeList/Array allocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Some publishers ship non-ISO date strings in places defuddle trusts to
already be ISO — e.g. JSON-LD `datePublished` as `"Oct 20, 2025"` — or
place the date in a subtree that `h1.nextElementSibling` walks can't
reach. Both cases return empty or raw output, which downstream consumers
then try to pass to `new Date()`, hitting V8's year-2001 fallback for
incomplete date strings.

Three complementary changes in `getPublished`:

1. `parseDateText` now accepts 3-4 letter month abbreviations (Jan, Feb,
   Sept, etc.) in addition to full month names. Uses a shared
   MONTH_PATTERN so both day-first and month-first regexes stay in sync.

2. Results from JSON-LD, meta tags, and abbr/time elements pass through
   `parseDateText` before returning, so non-ISO inputs get normalized to
   ISO 8601 where possible. ISO strings fall through unchanged.

3. When h1 forward-sibling search finds nothing, walk up to three
   ancestor levels and scan their `p`/`time` descendants. Covers layouts
   where the article date lives in a cousin subtree (e.g. separate
   header and metadata columns under a shared wrapper).

Tests: new unit tests for parseDateText, new fixtures for the JSON-LD
and h1-cousin cases, plus two existing fixtures updated to reflect that
previously-missed dates are now extracted.

Three refinements after code review:

- Ancestor walk now scans siblings of the path element at each level
  rather than running querySelectorAll on the whole ancestor subtree.
  The previous version could reach into an <article> body at depth 2-3
  and match a date-shaped phrase in prose as a false positive.

- parseDateText uses pre-compiled static regexes instead of building
  them via `new RegExp(...)` on every call. This is a hot path during
  the h1-proximity fallback, which may invoke parseDateText once per
  p/time element in the header region.

- Removed the dead `.replace(/\.$/, '')` calls — the optional trailing
  period is already outside the capture group, so the captured token
  is always a bare month name.

Adds a regression fixture with dates in body prose that must NOT be
picked up as metadata, plus unit tests for trailing-period and
uppercase abbreviations.
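
The refined ancestor walk described in the last commit might look roughly like the sketch below. The commit's actual traversal is not shown on this page, so the details here are assumptions: `ElementLike` is a minimal stand-in for a DOM Element so the code runs without a browser, `el` is a hypothetical tree-building helper, the parser is passed in as a parameter, and each sibling is inspected only at the sibling itself and its direct children.

```typescript
// Minimal structural stand-in for a DOM Element (assumption for this sketch).
interface ElementLike {
  tagName: string;
  textContent: string;
  parentElement: ElementLike | null;
  children: ElementLike[];
}

// Hypothetical helper for building small trees in examples and tests.
function el(tagName: string, text: string, children: ElementLike[] = []): ElementLike {
  const node: ElementLike = { tagName: tagName.toUpperCase(), textContent: text, parentElement: null, children: children };
  for (const child of children) child.parentElement = node;
  return node;
}

type DateParser = (text: string) => string | undefined;

const DATE_TAGS = ['P', 'TIME'];

function findDateNearHeading(h1: ElementLike, parse: DateParser, maxDepth: number = 3): string | undefined {
  let path: ElementLike = h1; // the element on the h1-to-root path at this level
  for (let depth = 0; depth < maxDepth; depth++) {
    const ancestor = path.parentElement;
    if (!ancestor) break;
    for (const sibling of ancestor.children) {
      if (sibling === path) continue; // skip the branch that contains the h1
      // Assumption: look at the sibling and its direct children only, so
      // deeply nested body prose is never reached.
      const candidates = [sibling].concat(sibling.children);
      for (const candidate of candidates) {
        if (DATE_TAGS.indexOf(candidate.tagName) === -1) continue;
        const parsed = parse(candidate.textContent.trim());
        if (parsed) return parsed;
      }
    }
    path = ancestor;
  }
  return undefined;
}
```

Bounding how far each sibling is inspected is the design choice that matters here: a date in a metadata column next to the header is found, while a date-shaped phrase several levels down inside the article body is out of reach.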