discovery: walk Squarespace blog archive via ?format=json-pretty #66

Open
davipontesblog wants to merge 1 commit into Automattic:main from davipontesblog:improvement/squarespace-archive-index-walker
@davipontesblog

What this changes

Every Squarespace blog section exposes itself as a paginated JSON feed at
<blog-prefix>?format=json-pretty[&offset=<addedOn>]. Each page returns ~20
items with title, urlId, fullUrl, publishOn, addedOn, categories, tags,
assetUrl (cover image), authorId, and full author name. The endpoint
works without auth on both 7.0 and 7.1 and, crucially, still serves real
metadata on 7.1 Fluid Engine sites, where per-URL ?format=json returns an
empty body.
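
For illustration, a minimal sketch of walking that feed. It is not the code in this PR. The names `walkArchive`, `archivePageUrl`, `ArchiveItem`, `ArchivePage`, and `PageFetcher` are hypothetical, and the top-level `items` key, the numeric `addedOn`, and the choice of the last item's `addedOn` as the next offset are assumptions about the response shape. The stop conditions (empty page, non-advancing offset, page cap) and the dedupe by urlId follow what this PR describes for fetchBlogArchive().

```typescript
// Hypothetical shapes: field names follow the PR description; the
// top-level "items" key and the numeric addedOn are assumptions.
interface ArchiveItem {
  urlId: string;
  fullUrl: string;
  title: string;
  addedOn: number;
}

interface ArchivePage {
  items?: ArchiveItem[];
}

// Injected so the loop can be exercised without network access.
type PageFetcher = (url: string) => Promise<ArchivePage>;

// Builds <blog-prefix>?format=json-pretty[&offset=<addedOn>].
function archivePageUrl(origin: string, blogPrefix: string, offset?: number): string {
  const url = new URL(blogPrefix, origin);
  url.searchParams.set("format", "json-pretty");
  if (offset !== undefined) url.searchParams.set("offset", String(offset));
  return url.toString();
}

async function walkArchive(
  origin: string,
  blogPrefix: string,
  fetchPage: PageFetcher,
  maxPages = 200, // safety cap
): Promise<ArchiveItem[]> {
  const seen = new Map<string, ArchiveItem>();
  let offset: number | undefined;
  for (let page = 0; page < maxPages; page++) {
    const { items = [] } = await fetchPage(archivePageUrl(origin, blogPrefix, offset));
    if (items.length === 0) break; // empty page: end of the archive
    for (const item of items) {
      // Dedupe by urlId; the first occurrence wins.
      if (!seen.has(item.urlId)) seen.set(item.urlId, item);
    }
    // Assumption: the next offset is the addedOn of the last item on the page.
    const next = items[items.length - 1].addedOn;
    if (next === offset) break; // non-advancing offset: avoid looping forever
    offset = next;
  }
  return Array.from(seen.values());
}
```

Passing the fetcher in as a parameter keeps the loop testable with canned pages; a real caller would wrap `fetch` and parse the JSON body.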

Adds to squarespaceAdapter:

  • SqsBlogArchiveEntry type
  • detectBlogPrefixes() — finds blog sections by date-based URL pattern
    (/foo/YYYY/M/D/slug) or known conventions (/blog, /journal, /news,
    /posts, /stories)
  • fetchBlogArchive() — paginates via ?offset=, dedupes by urlId, stops
    on empty page / non-advancing offset / 200-page safety cap
  • discover() merges any post URLs the sitemap missed and stashes per-post
    metadata on SquarespaceInventory.blogArchive
  • extract() builds an archiveByUrl map inside extractPage() and uses
    the archive entries as a fallback for assetUrl, categories, tags,
    publishOn, and author display name when per-URL JSON is empty
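
The prefix detection in the list above could look roughly like the sketch below. This is an illustration under stated assumptions, not the PR's implementation: the name `detectPrefixes`, the `minDatedPosts` threshold and its default of 3, and the ordering (date-based prefixes first, most posts first, then conventional ones) are all hypothetical. Only the date-based pattern and the five conventional prefixes come from the description.

```typescript
// Conventional blog prefixes named in the PR description.
const CONVENTIONAL_PREFIXES = ["/blog", "/journal", "/news", "/posts", "/stories"];

// Matches /<prefix>/YYYY/M/D/<slug> and captures /<prefix>.
const DATE_POST_PATH = /^(\/[^/]+)\/\d{4}\/\d{1,2}\/\d{1,2}\/[^/]+\/?$/;

// Hypothetical signature: takes absolute URLs (e.g. from the sitemap).
function detectPrefixes(urls: string[], minDatedPosts = 3): string[] {
  const datedCounts = new Map<string, number>();
  const conventional = new Set<string>();

  for (const raw of urls) {
    let path: string;
    try {
      path = new URL(raw).pathname;
    } catch {
      continue; // skip anything that is not an absolute URL
    }
    const dated = DATE_POST_PATH.exec(path);
    if (dated) {
      datedCounts.set(dated[1], (datedCounts.get(dated[1]) ?? 0) + 1);
    }
    const firstSegment = "/" + (path.split("/")[1] ?? "");
    if (CONVENTIONAL_PREFIXES.includes(firstSegment)) conventional.add(firstSegment);
  }

  // Assumed threshold: a prefix needs several dated posts to count as a blog.
  const datedPrefixes = Array.from(datedCounts.entries())
    .filter(([, count]) => count >= minDatedPosts)
    .sort((a, b) => b[1] - a[1])
    .map(([prefix]) => prefix);

  // Conventional prefixes follow, without repeating any already found.
  const rest = CONVENTIONAL_PREFIXES.filter(
    (p) => conventional.has(p) && !datedPrefixes.includes(p),
  );
  return datedPrefixes.concat(rest);
}
```

With a sitemap containing three `/travel/YYYY/M/D/slug` posts, a single `/diary/...` post, and a `/news` page, this sketch returns `/travel` followed by `/news` and drops `/diary` as below the threshold.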

How I found it

While migrating walkaboutchronicles.com (1,500+ post 7.1 blog, two
contributors). The per-URL probe path would have needed ~3,000 requests for
author reattribution + featured-image rebuild; the archive walker did it in
~75. Author IDs from the archive disambiguated two contributors who shared a
display name format.
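
The request counts above can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes exactly 1,500 posts, 20 items per archive page, and two per-URL probes per post (one for the author, one for the featured image); the two-probe figure is an assumption, not something stated in the description, and both function names are hypothetical.

```typescript
// Archive walk: one request per page of ~20 items.
function archiveRequests(postCount: number, pageSize = 20): number {
  return Math.ceil(postCount / pageSize);
}

// Per-URL probing: assumed two requests per post (author + featured image).
function perUrlRequests(postCount: number, probesPerPost = 2): number {
  return postCount * probesPerPost;
}

const posts = 1500;
console.log(archiveRequests(posts)); // 75
console.log(perUrlRequests(posts)); // 3000
```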

Tested against

  • Real site (analogous PHP port, then verified against the live
    walkaboutchronicles.com archive feed)
  • Tests pass (npx vitest run — 409 passed, 2 skipped)
  • npx tsc --noEmit clean
  • 5 new unit tests for detectBlogPrefixes (date-based, conventional,
    ordering, threshold, no-match)
  • No new dependencies

Discovery log entry added to DISCOVERIES.md

  • Yes
