fix(links): repair rotted citations and harden the daily link sweep#53

Merged: jdevalk merged 4 commits into main from fix/dead-links-and-link-sweep on Jun 24, 2026

jdevalk (Owner) commented Jun 23, 2026

Why

The scheduled External links sweep (links.yml) failed, reporting 211 broken links out of 1468. Almost all were false positives, caused by hosts that block or rate-limit any headless link checker:

| Code | Cause |
| --- | --- |
| 429 (×210) | GitHub per-page "edit this page" / self-repo links, burst-limited |
| 403 (×174) | www.w3.org & co. behind a Cloudflare "Just a moment…" JS challenge |
| 400 (×4) | developers.facebook.com bot-blocking |
| 405 (×2) | the a2a/v1 endpoint is POST-only |
| 503 (×2) | chromium.googlesource.com rate limit |

Underneath the noise were ~14 genuinely dead/moved citation URLs.
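
The triage described above can be sketched as a small filter over crawl results. This is a minimal illustration, not code from the repository: the `LIKELY_BLOCKED` mapping and the `triage` function are hypothetical names, and the status codes are the ones listed in the table. A status in the "suspected" bucket is only a hint, since a blocked host can also hide a genuinely dead page.

```python
# Status codes the description attributes to bot-blocking or rate limiting
# rather than to dead pages. Illustrative only; not taken from the repo.
LIKELY_BLOCKED = {
    429: "rate-limited",
    403: "bot challenge",
    400: "bot-blocking",
    405: "method not allowed (POST-only endpoint)",
    503: "upstream rate limit",
}


def triage(results):
    """Split (url, status) pairs into suspected false positives and other failures.

    2xx results are dropped. Failures whose status appears in LIKELY_BLOCKED
    go to the first list; every other failure goes to the second list.
    """
    suspected, other = [], []
    for url, status in results:
        if 200 <= status < 300:
            continue
        bucket = suspected if status in LIKELY_BLOCKED else other
        bucket.append((url, status))
    return suspected, other


# Made-up sample data for demonstration.
sample = [
    ("https://www.w3.org/TR/some-spec/", 403),
    ("https://example.org/moved-page", 404),
    ("https://example.org/fine", 200),
]
suspected, other = triage(sample)
print(len(suspected), "suspected false positives,", len(other), "other failures")
```

Entries in the second list are the ones worth checking by hand first, because nothing about their status code suggests bot-blocking.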

What this does

1. Fixes every real dead link (each replacement verified 200 and on-topic):

  • web-bot-auth — IETF draft renamed → draft-meunier-http-message-signatures-directory
  • speculation-rules — No-Vary-Search → MDN reference
  • bfcache — dead Chrome docs page → DevTools back/forward-cache page
  • caa-records — MDN entry deleted → RFC 8657 (CAA ACME extensions; keeps it standards-led)
  • privacy-policy — EDPB transparency guidelines → current slug
  • content-signals — IAB group renamed → Content Monetization Protocols (CoMP) for AI
  • data-minimization — ICO dropped the /the-principles/ path segment
  • script-loading + critical-css — render-blocking → Chrome for Developers
  • scrollbar-gutter — web.dev article → Baseline scrollbar-props post
  • css-containment — web.dev learn page deleted → web.dev content-visibility
  • accessibility-overlays — WebAIM overlay survey → Practitioners Survey #3 (body sentence reworded to match the source)
  • view-transitions — WebKit blog 16557 → 16967
  • cookie-consent — CNIL cookies → current "new guidelines" page
  • nlweb — docs/nlweb-rest.md → docs/nlweb-rest-api.md

2. Hardens the sweep so it stops crying wolf (linkinator.config.json):

  • retry + retryErrors — re-attempt transient 429s / 5xx
  • concurrency: 25 + 30s timeout — gentler crawl, fewer self-inflicted 429s
  • skip[] — only hosts that hard-block any headless checker (W3C/validator/securityheaders behind Cloudflare, facebook devs, our own repo chrome + /edit/ links, the POST-only a2a endpoint, developer.android.com, the reserved example.com). Third-party citations stay checked. Rationale documented inline in links.yml.

Trade-off: genuine rot on the skipped hosts won't be auto-caught — verify those by hand when citing. Noted in the workflow comment.
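
To make the skip[] trade-off concrete, here is a rough sketch of how a skip list filters URLs, assuming the entries behave as regular expressions matched against each URL. The patterns below are hypothetical, modelled on some of the hosts named above; the real list lives in linkinator.config.json and may differ.

```python
import re

# Hypothetical skip patterns based on hosts named in the description.
# The actual entries in linkinator.config.json are not shown in this PR text.
SKIP_PATTERNS = [
    r"^https?://www\.w3\.org/",
    r"^https?://developers\.facebook\.com/",
    r"^https?://developer\.android\.com/",
    r"^https?://(www\.)?example\.com(/|$)",
]
_COMPILED = [re.compile(pattern) for pattern in SKIP_PATTERNS]


def should_check(url: str) -> bool:
    """Return True when no skip pattern matches, so the URL is still crawled."""
    return not any(pattern.search(url) for pattern in _COMPILED)


for candidate in (
    "https://www.w3.org/TR/css-contain-2/",
    "https://developer.mozilla.org/en-US/docs/Web/CSS/contain",
):
    print(candidate, "checked" if should_check(candidate) else "skipped")
```

Anchoring each pattern to a scheme and host keeps the skip narrow, so a third-party citation that merely mentions a skipped host in its path is still checked.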

Verification

Local full crawl with the new config: 211 → 0 real failures (1229 links scanned). astro check clean (0 errors), lint/format gate green.

🤖 Generated with Claude Code

…sweep

The scheduled External links sweep was failing on 211 "broken" links, but
almost all were false positives — hosts that block or rate-limit any headless
checker (W3C/securityheaders behind a Cloudflare JS challenge → 403, GitHub
per-page edit/self-links → 429, developers.facebook.com → 400, the a2a
endpoint is POST-only → 405). Underneath were ~14 genuinely dead/moved URLs.

Citations fixed (each verified 200, on the same topic):
- web-bot-auth: draft renamed → draft-meunier-http-message-signatures-directory
- speculation-rules: No-Vary-Search → MDN reference
- bfcache: Chrome docs page → DevTools back/forward-cache page
- caa-records: dropped MDN (deleted) → RFC 8657 (CAA ACME extensions)
- privacy-policy: EDPB transparency guidelines → current slug
- content-signals: IAB group renamed → Content Monetization Protocols (CoMP)
- data-minimization: ICO dropped /the-principles/ path segment
- script-loading, critical-css: render-blocking → Chrome for Developers
- scrollbar-gutter: web.dev article → Baseline scrollbar-props post
- css-containment: web.dev learn (deleted) → web.dev content-visibility
- accessibility-overlays: WebAIM overlay survey → Practitioners Survey #3
- view-transitions: WebKit blog 16557 → 16967
- cookie-consent: CNIL cookies → current "new guidelines" page
- nlweb: docs/nlweb-rest.md → docs/nlweb-rest-api.md

Workflow hardening (linkinator.config.json):
- retry / retryErrors so transient 429s and 5xx don't fail the run
- concurrency 25 + 30s timeout for a gentler crawl
- skip[] only hosts that hard-block any headless checker (documented in
  links.yml), so a red run now means real rot, not bot-blocking

Local full crawl after the changes: 211 → 0 real failures (1229 links scanned).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages Bot commented Jun 23, 2026

Deploying specification-website with Cloudflare Pages

Latest commit: 3920664
Status: ✅  Deploy successful!
Preview URL: https://2cfad0ca.specification-website.pages.dev
Branch Preview URL: https://fix-dead-links-and-link-swee.specification-website.pages.dev

jdevalk and others added 3 commits June 23, 2026 14:19
GitHub's burst-limit 429s carry no retry-after header, so linkinator's own
--retry can't catch them, and they hit a random third-party github.com blob
link each run. Wrap the whole crawl in a 3-attempt loop: genuinely dead URLs
fail every attempt and stay red; transient flakes clear on a re-run. Keeps
real-rot detection on third-party github citations instead of skipping them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

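
The 3-attempt wrapper this commit describes can be sketched as follows. It is a simplified model, not the workflow's actual shell step: `sweep` and the injected `check` callable are hypothetical names, and a real crawl would invoke linkinator instead of a Python function.

```python
from typing import Callable, Iterable, List, Tuple


def sweep(
    urls: Iterable[str],
    check: Callable[[str], bool],
    attempts: int = 3,
) -> Tuple[bool, List[str]]:
    """Re-run the whole crawl up to `attempts` times.

    Returns (True, []) as soon as one full pass has no failures. If every
    pass has failures, returns (False, failures_from_the_last_pass).
    `check` is an injected stand-in for the real link checker.
    """
    url_list = list(urls)
    failures: List[str] = []
    for _ in range(attempts):
        failures = [url for url in url_list if not check(url)]
        if not failures:
            return True, []
    return False, failures


# Demonstration with a stand-in checker: one URL always fails.
ok, failed = sweep(
    ["https://example.org/ok", "https://example.org/dead"],
    check=lambda url: not url.endswith("/dead"),
)
print("green" if ok else "red", failed)
```

A URL that fails on every pass keeps the run red, while a failure that clears on a later pass does not.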
GitHub 429s the shared GitHub Actions runner IP for github.com web requests
regardless of link validity, and the limit window outlasts the retry loop, so
a valid citation (e.g. the NLWeb docs file) fails every attempt. Skip the whole
host rather than ship a red-by-default sweep; github citations are verified by
hand when added. The retry loop stays as a net for other transient hosts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The outer 3× crawl loop compounded with linkinator's retry-after waits and
could stall the step for 10+ minutes. Its original purpose (surviving GitHub's
header-less 429s) is moot now that github.com is skipped. Revert to a single
linkinator pass and add timeout-minutes: 10 as a fail-fast backstop against a
single upstream returning a large retry-after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jdevalk merged commit fd35ce4 into main Jun 24, 2026
10 checks passed
jdevalk deleted the fix/dead-links-and-link-sweep branch June 24, 2026 12:06