discovery: walk Squarespace blog archive via ?format=json-pretty by davipontesblog · Pull Request #66 · Automattic/data-liberation-agent

davipontesblog · 2026-05-28T22:46:53Z

What this changes

Every Squarespace blog section exposes itself as a paginated JSON feed at
<blog-prefix>?format=json-pretty[&offset=<addedOn>]. Each page returns ~20
items with title, urlId, fullUrl, publishOn, addedOn, categories, tags,
assetUrl (cover image), authorId, and full author name. The endpoint
works without auth, on both 7.0 and 7.1, and crucially still serves real
metadata on 7.1 Fluid Engine sites where per-URL ?format=json returns an
empty body.
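
For reviewers who have not seen the feed, here is a minimal sketch of how one page URL could be built and its entries read. The entry field names come from the list above; their types, the `archivePageUrl` and `readItems` helpers, and the top-level `items` key are assumptions for illustration and are not taken from the adapter code.

```typescript
// Field names come from the list above; the types are guesses.
interface ArchiveEntrySketch {
  title: string;
  urlId: string;
  fullUrl: string;
  publishOn: number;
  addedOn: number;
  categories: string[];
  tags: string[];
  assetUrl?: string;
  authorId?: string;
}

// Build the feed URL for one page. `origin` and `blogPrefix` are
// illustrative parameters, e.g. "https://example.com" and "/blog".
function archivePageUrl(origin: string, blogPrefix: string, offset?: number): string {
  const url = new URL(blogPrefix, origin);
  url.searchParams.set("format", "json-pretty");
  if (offset !== undefined) {
    url.searchParams.set("offset", String(offset));
  }
  return url.toString();
}

// Pull the entries out of a parsed response body. The top-level `items`
// key is an assumption about the response shape.
function readItems(body: unknown): ArchiveEntrySketch[] {
  const items = (body as { items?: unknown } | null)?.items;
  return Array.isArray(items) ? (items as ArchiveEntrySketch[]) : [];
}
```

The first page is requested without an offset; later pages pass the offset through unchanged.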

Adds to squarespaceAdapter:

- SqsBlogArchiveEntry type
- detectBlogPrefixes(): finds blog sections by date-based URL pattern
  (/foo/YYYY/M/D/slug) or known conventions (/blog, /journal, /news,
  /posts, /stories)
- fetchBlogArchive(): paginates via ?offset=, dedupes by urlId, and stops
  on an empty page, a non-advancing offset, or the 200-page safety cap
- discover() merges any post URLs the sitemap missed and stashes per-post
  metadata on SquarespaceInventory.blogArchive
- extract() builds an archiveByUrl map inside extractPage() and uses
  the archive entries as a fallback for assetUrl, categories, tags,
  publishOn, and author display name when per-URL JSON is empty
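
The walk and the fallback described in the list above can be sketched roughly as follows. This is not the code in the PR: `walkArchiveSketch`, `withArchiveFallback`, the injected `FetchPage` callback, and the rule that the next offset is the `addedOn` of the last item on the page are all assumptions made to keep the example self-contained.

```typescript
interface ArchiveEntry {
  urlId: string;
  fullUrl: string;
  addedOn: number;
  assetUrl?: string;
  categories?: string[];
  tags?: string[];
  publishOn?: number;
}

// Hypothetical page source: returns the entries for the page at `offset`.
type FetchPage = (offset?: number) => Promise<ArchiveEntry[]>;

async function walkArchiveSketch(fetchPage: FetchPage, maxPages = 200): Promise<ArchiveEntry[]> {
  const byUrlId = new Map<string, ArchiveEntry>();
  let offset: number | undefined;
  for (let page = 0; page < maxPages; page++) { // safety cap
    const items = await fetchPage(offset);
    if (items.length === 0) break; // empty page
    for (const item of items) {
      if (!byUrlId.has(item.urlId)) byUrlId.set(item.urlId, item); // dedupe by urlId
    }
    const last = items[items.length - 1];
    if (!last || last.addedOn === offset) break; // offset did not advance
    offset = last.addedOn;
  }
  return Array.from(byUrlId.values());
}

// Use the per-URL value unless it is missing or an empty array.
function pick<T>(own: T | undefined, fallback: T | undefined): T | undefined {
  const empty = own === undefined || (Array.isArray(own) && own.length === 0);
  return empty ? fallback : own;
}

// Fill gaps in per-URL metadata from the matching archive entry, if any.
function withArchiveFallback(
  perUrl: Partial<ArchiveEntry>,
  archived?: ArchiveEntry,
): Partial<ArchiveEntry> {
  if (!archived) return perUrl;
  return {
    ...perUrl,
    assetUrl: pick(perUrl.assetUrl, archived.assetUrl),
    categories: pick(perUrl.categories, archived.categories),
    tags: pick(perUrl.tags, archived.tags),
    publishOn: pick(perUrl.publishOn, archived.publishOn),
  };
}
```

Injecting the page source keeps the stop conditions testable without network access; a real caller would wrap an HTTP request in `fetchPage`.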

How I found it

Found while migrating walkaboutchronicles.com (a 7.1 blog with 1,500+ posts
and two contributors). The per-URL probe path would have needed ~3,000
requests for author reattribution + featured-image rebuild; the archive
walker did it in ~75 requests. Author IDs from the archive disambiguated two
contributors who shared a display name format.

Tested against

- Real site (analogous PHP port, then verified against the live
  walkaboutchronicles.com archive feed)
- Tests pass (npx vitest run: 409 passed, 2 skipped)
- npx tsc --noEmit clean
- 5 new unit tests for detectBlogPrefixes (date-based, conventional,
  ordering, threshold, no-match)
- No new dependencies
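
To make the five detectBlogPrefixes cases concrete, here is a hedged sketch of one way such detection could work. It is not the adapter's implementation: the `detectPrefixesSketch` name, the regular expression, the `minDatedPosts` threshold of 2, and the ordering (date-based prefixes first, then conventional ones) are all assumptions.

```typescript
const CONVENTIONAL_PREFIXES = ["/blog", "/journal", "/news", "/posts", "/stories"];

// Matches /foo/YYYY/M/D/slug and captures the /foo prefix.
const DATED_POST = /^(\/[^/]+)\/\d{4}\/\d{1,2}\/\d{1,2}\/[^/]+\/?$/;

function detectPrefixesSketch(paths: string[], minDatedPosts = 2): string[] {
  const datedCounts = new Map<string, number>();
  const conventionalSeen = new Set<string>();

  for (const path of paths) {
    const prefix = DATED_POST.exec(path)?.[1];
    if (prefix) {
      datedCounts.set(prefix, (datedCounts.get(prefix) ?? 0) + 1);
      continue;
    }
    const first = "/" + (path.split("/")[1] ?? "");
    if (CONVENTIONAL_PREFIXES.includes(first)) conventionalSeen.add(first);
  }

  // Date-based prefixes that meet the threshold, most posts first.
  const dated = Array.from(datedCounts.entries())
    .filter(([, count]) => count >= minDatedPosts)
    .sort((x, y) => y[1] - x[1])
    .map(([prefix]) => prefix);

  // Conventional prefixes not already found by the date pattern.
  const conventional = CONVENTIONAL_PREFIXES.filter(
    (p) => conventionalSeen.has(p) && !dated.includes(p),
  );
  return dated.concat(conventional);
}
```

Under these assumptions, a sitemap with two dated posts under /foo, one post under /blog, and a single dated post under /bar would yield /foo and /blog, with /bar falling below the threshold.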

Discovery log entry added to DISCOVERIES.md

Yes


davipontesblog wants to merge 1 commit into Automattic:main from davipontesblog:improvement/squarespace-archive-index-walker
