discovery: prefer data-image over .src in Squarespace 7.1 DOM fallback #64
Open
davipontesblog wants to merge 1 commit into
Squarespace 7.1 lazy-loads <img> elements: the visible 'src' is often a 1x1 SVG placeholder (or empty) until the JS loader fires, and the real CDN URL waits in 'data-image='. On Playwright-driven DOM fallback this meant we captured placeholders for any image not yet hydrated by the loader — a Fluid Engine post with 8 photos would land in the WXR with 1 distinct media URL (the placeholder) referenced 8 times. Prefer data-image, then data-src, then fall back to .src. Validate with a https? regex so we don't pick up data URIs. Found while migrating walkaboutchronicles.com (1,500+ post 7.1 photo blog). See DISCOVERIES.md for full context.
What this changes
In Squarespace 7.1 (Fluid Engine especially) the per-URL `?format=json` `body` often comes back empty, so the adapter falls through to the Playwright DOM extractor. That extractor reads `(img as HTMLImageElement).src`, but Squarespace 7.1 lazy-loads `<img>`: the visible `src` is a 1×1 SVG placeholder (or empty) until its JS image loader fires, and the real CDN URL waits in `data-image=`. For any off-screen image not yet hydrated by the loader, the adapter was capturing the placeholder.
Result on real sites: a Fluid Engine post with 8 images landed in the WXR
with 1 distinct media URL (the placeholder) referenced 8 times.
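To make the symptom concrete, here is a minimal sketch of a check that would flag this collapse. `summarizeMediaUrls` and `MediaUrlSummary` are hypothetical names for illustration, not part of the adapter, and the placeholder string is an assumed stand-in for whatever Squarespace actually puts in `src`.

```typescript
// Hypothetical helper, not part of the adapter: summarizes the media URLs
// extracted for one post so a placeholder collapse is easy to spot.
interface MediaUrlSummary {
  references: number; // how many image references the post contains
  distinct: number;   // how many different URLs those references use
  nonHttp: number;    // references that are not http(s) URLs (e.g. data URIs)
  collapsed: boolean; // true when several references share a single URL
}

function summarizeMediaUrls(urls: string[]): MediaUrlSummary {
  const distinct = new Set(urls).size;
  const nonHttp = urls.filter((u) => !/^https?:\/\//.test(u)).length;
  return {
    references: urls.length,
    distinct,
    nonHttp,
    collapsed: urls.length > 1 && distinct === 1,
  };
}

// Assumed placeholder value, repeated for a post with 8 images.
const placeholder = "data:image/svg+xml;base64,PHN2Zy8+";
const summary = summarizeMediaUrls(Array.from({ length: 8 }, () => placeholder));
console.log(summary);
```

A post that reports 8 references, 1 distinct URL, and 8 non-http entries matches the failure described above, even when extraction itself raised no errors.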
Fix: read `data-image` first, then `data-src`, then fall back to `.src`. Validate with a `^https?://` regex so we don't pick up data URIs.
How I found it
While migrating https://www.walkaboutchronicles.com — a 1,500+ post Squarespace
7.1 photo blog. The imported posts showed broken thumbnails everywhere; verify
reported no extraction failures but the actual media files were all 1×1 SVG
placeholders.
Tested against
- walkaboutchronicles.com, but only via the analogous PHP port we built first (see https://github.com/davipontesblog/sqs71-to-gutenberg; same root cause)
- `npx vitest run` (411 passed, 2 skipped)
- `npx tsc --noEmit` clean
- Discovery log entry added to DISCOVERIES.md
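For reviewers, the preference order described under "What this changes" can be sketched roughly as follows. This is an illustration under assumptions, not the code in the diff: `ImgLike`, `pickImageUrl`, and `fakeImg` are hypothetical names, and the example URLs are made up. A real `HTMLImageElement` satisfies `ImgLike`, so the same function could be called on DOM nodes.

```typescript
// Hypothetical minimal shape of an <img>; HTMLImageElement satisfies it.
interface ImgLike {
  getAttribute(name: string): string | null;
  src: string;
}

// Only accept http(s) URLs so data URIs (the SVG placeholder) are rejected.
const HTTP_URL = /^https?:\/\//;

// Prefer data-image, then data-src, then fall back to .src.
function pickImageUrl(img: ImgLike): string | null {
  const candidates: Array<string | null> = [
    img.getAttribute("data-image"),
    img.getAttribute("data-src"),
    img.src,
  ];
  for (const candidate of candidates) {
    const url = candidate?.trim();
    if (url && HTTP_URL.test(url)) return url;
  }
  return null; // nothing usable: caller decides whether to skip or report
}

// Test double standing in for a DOM element (illustration only).
function fakeImg(attrs: Record<string, string>, src = ""): ImgLike {
  return { getAttribute: (name) => attrs[name] ?? null, src };
}

// A not-yet-hydrated image: placeholder in src, real URL in data-image.
const lazy = fakeImg(
  { "data-image": "https://cdn.example.com/photo-1.jpg" },
  "data:image/svg+xml;base64,PHN2Zy8+",
);
console.log(pickImageUrl(lazy));
```

Returning `null` rather than the placeholder keeps a missing image visible to the caller instead of silently writing a data URI into the export.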