discovery: walk Squarespace blog archive via ?format=json-pretty #66
Open
davipontesblog wants to merge 1 commit into
Every Squarespace blog section exposes itself as a paginated JSON feed at <blog-prefix>?format=json-pretty[&offset=<addedOn>]. Each page returns ~20 items with title, urlId, fullUrl, publishOn, addedOn, categories, tags, assetUrl (cover image), authorId, and full author name. The endpoint works without auth, on both 7.0 and 7.1, and crucially still serves real metadata on 7.1 Fluid Engine sites where the per-URL ?format=json body comes back empty.

Adds:
- SqsBlogArchiveEntry type
- detectBlogPrefixes(): finds blog sections by date-based URL pattern (/foo/YYYY/M/D/slug) or known conventions (/blog, /journal, /news, /posts, /stories)
- fetchBlogArchive(): paginates via ?offset=, dedupes by urlId, stops on empty page / non-advancing offset / 200-page cap
- discover() merges any post URLs the sitemap missed and stashes per-post metadata on SquarespaceInventory.blogArchive
- extract() uses an archiveByUrl map inside extractPage() as a fallback for assetUrl, categories, tags, publishOn, and author display name

Found while migrating walkaboutchronicles.com (1,500+ post 7.1 blog, two contributors). The per-URL probe path would have needed ~3,000 requests for author reattribution + featured-image rebuild; the archive walker did it in ~75.

5 new unit tests for detectBlogPrefixes (all 409 tests pass). See DISCOVERIES.md for full context.
What this changes
Every Squarespace blog section exposes itself as a paginated JSON feed at
`<blog-prefix>?format=json-pretty[&offset=<addedOn>]`. Each page returns ~20
items with `title`, `urlId`, `fullUrl`, `publishOn`, `addedOn`, `categories`, `tags`,
`assetUrl` (cover image), `authorId`, and full author name. The endpoint
works without auth, on both 7.0 and 7.1, and crucially still serves real
metadata on 7.1 Fluid Engine sites where per-URL `?format=json` returns an
empty body.
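
As a rough illustration of walking such a feed, here is a minimal sketch rather than the code in this PR. It assumes the next `offset` is the `addedOn` value of the last item on the current page and that `addedOn` is a number; `FeedItem`, `FeedPage`, `PageFetcher`, `buildFeedUrl`, and `walkArchive` are hypothetical names introduced only for this example.

```typescript
// Hypothetical shapes: only the field names are taken from the PR description.
interface FeedItem {
  urlId: string;
  fullUrl: string;
  addedOn: number; // assumed to be a numeric timestamp
}

interface FeedPage {
  items?: FeedItem[];
}

// The fetcher is injected so the walker can be exercised without network access.
type PageFetcher = (url: string) => Promise<FeedPage>;

const MAX_PAGES = 200; // safety cap on the number of pages walked

// Builds <blog-prefix>?format=json-pretty[&offset=<addedOn>].
function buildFeedUrl(origin: string, blogPrefix: string, offset?: number): string {
  const url = new URL(blogPrefix, origin);
  url.searchParams.set("format", "json-pretty");
  if (offset !== undefined) url.searchParams.set("offset", String(offset));
  return url.toString();
}

async function walkArchive(
  origin: string,
  blogPrefix: string,
  fetchPage: PageFetcher,
): Promise<FeedItem[]> {
  const seen = new Map<string, FeedItem>(); // dedupe by urlId
  let offset: number | undefined;

  for (let page = 0; page < MAX_PAGES; page++) {
    const { items = [] } = await fetchPage(buildFeedUrl(origin, blogPrefix, offset));
    if (items.length === 0) break; // empty page: end of archive

    for (const item of items) {
      if (!seen.has(item.urlId)) seen.set(item.urlId, item);
    }

    // Assumption: the last item's addedOn is the offset for the next page.
    const next = items[items.length - 1].addedOn;
    if (next === offset) break; // offset did not advance: avoid looping forever
    offset = next;
  }
  return Array.from(seen.values());
}
```

In real use the fetcher could be something like `async (url) => (await fetch(url)).json()`; injecting it keeps the loop itself easy to test against canned pages.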
Adds to `squarespaceAdapter`:

- `SqsBlogArchiveEntry` type
- `detectBlogPrefixes()`: finds blog sections by date-based URL pattern (`/foo/YYYY/M/D/slug`) or known conventions (`/blog`, `/journal`, `/news`, `/posts`, `/stories`)
- `fetchBlogArchive()`: paginates via `?offset=`, dedupes by `urlId`, stops on empty page / non-advancing offset / 200-page safety cap
- `discover()` merges any post URLs the sitemap missed and stashes per-post metadata on `SquarespaceInventory.blogArchive`
- `extract()` builds an `archiveByUrl` map inside `extractPage()` and uses the archive entries as a fallback for `assetUrl`, `categories`, `tags`, `publishOn`, and author display name when per-URL JSON is empty

How I found it
I found it while migrating walkaboutchronicles.com (1,500+ post 7.1 blog, two
contributors). The per-URL probe path would have needed ~3,000 requests for
author reattribution + featured-image rebuild; the archive walker did it in
~75. Author IDs from the archive disambiguated two contributors who shared a
display name format.
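
The request counts can be sanity-checked with a little arithmetic. This is a back-of-the-envelope sketch: the figure of two per-URL probes per post (one for author reattribution, one for the featured image) is an assumption made here to reproduce the ~3,000 number, not something the PR states.

```typescript
const posts = 1500;        // "1,500+ post" blog
const itemsPerPage = 20;   // "~20 items" per archive page
const probesPerPost = 2;   // assumption: one probe for author, one for cover image

// Per-URL path: every post is probed individually.
const perUrlRequests = posts * probesPerPost;

// Archive path: one request per feed page.
const archiveRequests = Math.ceil(posts / itemsPerPage);

console.log({ perUrlRequests, archiveRequests });
// perUrlRequests is 3000 and archiveRequests is 75 for these inputs
```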
Tested against
- walkaboutchronicles.com archive feed
- `npx vitest run`: 409 passed, 2 skipped
- `npx tsc --noEmit` clean
- 5 new unit tests for `detectBlogPrefixes` (date-based, conventional, ordering, threshold, no-match)
- Discovery log entry added to DISCOVERIES.md
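
For readers who want a feel for what the `detectBlogPrefixes` test cases exercise, the following is a hedged sketch, not the implementation in this PR. The function name `detectBlogPrefixesSketch`, the threshold of three dated posts, the single-segment prefix, and the ordering (dated prefixes by descending post count, then conventional prefixes) are all assumptions chosen for illustration.

```typescript
// Conventional blog paths named in the PR description.
const CONVENTIONAL = ["/blog", "/journal", "/news", "/posts", "/stories"];

// Matches /foo/YYYY/M/D/slug and captures the one-segment prefix (/foo).
const DATE_PATH = /^(\/[^/]+)\/\d{4}\/\d{1,2}\/\d{1,2}\/[^/]+\/?$/;

function detectBlogPrefixesSketch(urls: string[], minDatedPosts = 3): string[] {
  const datedCounts = new Map<string, number>();
  const conventionalHits = new Set<string>();

  for (const raw of urls) {
    // The base URL is a placeholder so relative paths parse as well.
    const path = new URL(raw, "https://placeholder.invalid").pathname;

    const dated = DATE_PATH.exec(path);
    if (dated) {
      const prefix = dated[1];
      datedCounts.set(prefix, (datedCounts.get(prefix) ?? 0) + 1);
      continue;
    }

    const firstSegment = "/" + (path.split("/")[1] ?? "");
    if (CONVENTIONAL.includes(firstSegment)) conventionalHits.add(firstSegment);
  }

  // Threshold (assumed): ignore prefixes with too few dated posts.
  // Ordering (assumed): most dated posts first.
  const byDate = Array.from(datedCounts.entries())
    .filter(([, count]) => count >= minDatedPosts)
    .sort((a, b) => b[1] - a[1])
    .map(([prefix]) => prefix);

  const byConvention = CONVENTIONAL.filter(
    (p) => conventionalHits.has(p) && !byDate.includes(p),
  );
  return byDate.concat(byConvention);
}
```

With this sketch, a sitemap containing three `/travel/YYYY/M/D/slug` URLs, one `/notes/...` dated URL, and `/blog/hello` would yield `/travel` followed by `/blog`, while a sitemap with only `/about` and `/contact` would yield an empty list.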