discovery: prefer data-image over .src in Squarespace 7.1 DOM fallback #64

Open
davipontesblog wants to merge 1 commit into Automattic:main from davipontesblog:improvement/squarespace-data-image-lazyload


@davipontesblog

What this changes

In Squarespace 7.1 (Fluid Engine especially) the per-URL ?format=json body
often comes back empty, so the adapter falls through to the Playwright DOM
extractor. That extractor reads (img as HTMLImageElement).src, but
Squarespace 7.1 lazy-loads <img>: the visible src is a 1×1 SVG placeholder
(or empty) until its JS image loader fires, and the real CDN URL waits in
data-image=. For any off-screen image not yet hydrated by the loader, the
adapter was capturing the placeholder.

Result on real sites: a Fluid Engine post with 8 images landed in the WXR
with 1 distinct media URL (the placeholder) referenced 8 times.
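
The symptom above (many image references collapsing to one URL) can be checked for mechanically. The following is a minimal sketch, not code from the adapter: the helper name `looksLikePlaceholderCollapse` and its input shape are hypothetical, and a post that legitimately repeats one image would also trip it, so treat a hit as a prompt to inspect the post rather than as proof of a bug.

```typescript
// Hypothetical helper, not part of the adapter: flags a post whose image
// references all resolve to the same URL, which is how the placeholder
// capture showed up (8 images, 1 distinct media URL).
function looksLikePlaceholderCollapse(mediaUrls: string[]): boolean {
  // A post with zero or one image cannot "collapse".
  if (mediaUrls.length < 2) return false;
  // A Set keeps only distinct values; size 1 means every reference is identical.
  return new Set(mediaUrls).size === 1;
}
```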

Fix: read data-image first, then data-src, then fall back to .src.
Validate with a ^https?:// regex so we don't pick up data URIs.
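
The preference order can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in the diff: the `ImgCandidates` interface and the `pickImageUrl` name are hypothetical, and the three fields stand for the values the extractor would read from `data-image`, `data-src`, and `.src` on one `<img>` element.

```typescript
// Hypothetical shape: the three candidate values read from a single <img>.
interface ImgCandidates {
  dataImage: string | null; // value of the data-image attribute
  dataSrc: string | null;   // value of the data-src attribute
  src: string | null;       // the element's .src (may be a placeholder)
}

// Accept only absolute http(s) URLs; this rejects data: URIs and empty strings.
const HTTP_URL = /^https?:\/\//;

function pickImageUrl(img: ImgCandidates): string | null {
  // Try candidates in priority order: data-image, then data-src, then .src.
  for (const candidate of [img.dataImage, img.dataSrc, img.src]) {
    const value = candidate?.trim();
    if (value && HTTP_URL.test(value)) return value;
  }
  // Nothing usable: the caller decides whether to skip or report the image.
  return null;
}
```

With this ordering, an image that has not been hydrated yet (placeholder in `.src`, CDN URL in `data-image`) yields the CDN URL, while an element that only has a normal `.src` still works as before.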

How I found it

While migrating https://www.walkaboutchronicles.com — a 1,500+ post Squarespace
7.1 photo blog. The imported posts showed broken thumbnails everywhere; verify
reported no extraction failures but the actual media files were all 1×1 SVG
placeholders.

Tested against

  • Real site (walkaboutchronicles.com), but only via the analogous PHP
    port we built first, which hit the same root cause; see
    https://github.com/davipontesblog/sqs71-to-gutenberg
  • Tests pass (npx vitest run — 411 passed, 2 skipped)
  • npx tsc --noEmit clean
  • No new dependencies

Discovery log entry added to DISCOVERIES.md

  • Yes

Squarespace 7.1 lazy-loads <img> elements: the visible 'src' is often a
1x1 SVG placeholder (or empty) until the JS loader fires, and the real
CDN URL waits in 'data-image='. On Playwright-driven DOM fallback this
meant we captured placeholders for any image not yet hydrated by the
loader — a Fluid Engine post with 8 photos would land in the WXR with
1 distinct media URL (the placeholder) referenced 8 times.

Prefer data-image, then data-src, then fall back to .src. Validate
with a ^https?:// regex so we don't pick up data URIs.

Found while migrating walkaboutchronicles.com (1,500+ post 7.1 photo
blog). See DISCOVERIES.md for full context.
