Skip to content

fix: image fetching, rendering, and lightbox click-through#19

Open
19-84 wants to merge 8 commits into
mainfrom
fix/image-pipeline-rendering
Open

fix: image fetching, rendering, and lightbox click-through#19
19-84 wants to merge 8 commits into
mainfrom
fix/image-pipeline-rendering

Conversation

@19-84

@19-84 19-84 commented Jun 9, 2026

Copy link
Copy Markdown
Owner

Fixes 10 verified root causes in the image pipeline surfaced by an xhigh review (image fetching, non-rendered images, and the medium-inline / high-res-on-click lightbox behavior).

Non-rendered images

  • Root-relative /uploads/… never rewritten — downloaded under the absolutized URL, but the exporter skipped non-http srcs and asset_map was keyed by absolute URL, so the <img> kept /uploads/… → 404 offline. Now resolved via a URL-path index. (both HTML + Markdown)
  • Third-party-CDN / S3 inline images dropped (regression from e6d17e9) — restored by allowing Discourse upload path segments (uploads/optimized/original) from any host, as exact segments (not the bare substring CodeQL flagged).
  • Uppercase --urls domain dropped every inline imagesite_domain is now lowercased.
  • Protocol-relative // srcset — only the first entry was normalized; retina 1.5x/2x variants stayed remote. Now normalized per entry.

Lightbox click-through (medium vs. high quality)

  • Fragile higher-res pairing removedsplit("_")[0] substring + larger-on-disk matched unrelated images, and wrapped imgs already inside <a class="lightbox"> into invalid nested <a>. Now relies on Discourse's native lightbox anchor (rewritten to the local original).
  • Markdown click-through served the medium — directional startswith(stem) resolved an original href to its medium variant. Replaced with path-index + exact-filename matching.
  • Lightbox originals were fetched unfiltered (SSRF reopened) — now filtered through is_forum_url like srcset.

Path divergence & concurrency

  • update command wrote assets to output_dir/images (no assets/ marker) → broken images. Now output_dir/assets.
  • Watch daemon wrote to the wrong dir and never downloaded lightbox originals. Both fixed.
  • Concurrent filename collision — non-atomic check-then-write across 20 workers could silently overwrite distinct images with the same basename. Filename assignment is now lock-guarded.

Tests

5 regression tests added (root-relative resolution, per-entry srcset //, no lightbox double-wrap, markdown original-not-degraded-to-medium). Full suite: 566 passed, 39 skipped; ruff + pyright clean.

Note (separate, not fixed here)

Discourse's native <a class="lightbox"> anchors get no target="_blank" and aren't styled (only the now-removed .discourse-image-link was), so most click-throughs navigate in-page with no zoom affordance. Worth a follow-up: ship a small lightbox JS/CSS or add target="_blank" + cursor styling to a.lightbox.

19-84 added 8 commits June 9, 2026 01:48
is_forum_url regressions from e6d17e9:
- Lowercase site_domain so an uppercase --urls domain still matches the
  lowercase upload URLs Discourse emits (every inline image was dropped).
- Re-allow Discourse upload media (uploads/optimized/original path
  segments) served from a self-hosted S3/CloudFront/BunnyCDN host, which
  the host-only allowlist dropped. Uses exact path segments, not the bare
  substring the CodeQL fix removed.
- Filter lightbox hrefs through is_forum_url too (was unfiltered), closing
  the arbitrary-host fetch reopened for the click-through path and the
  inline-broken/click-works asymmetry.

Also write 'update' command assets under output_dir/assets like archive,
so the HTML exporter's assets/ marker resolves them.
The collision guard read local_path.exists() ~40 lines before the write,
with no lock across the 20-worker pool. Two distinct URLs sanitizing to
the same basename could both skip the hash suffix and overwrite each
other. Reserve the chosen path under a lock and disambiguate with a URL
hash when taken.
The watch daemon wrote assets to output_dir/images (missing the assets/
marker the HTML exporter needs) and never downloaded lightbox originals,
so watch-updated topics had broken images and broken click-through.
Mirror the archive path: output_dir/assets + extract_lightbox_urls.
rewrite_with_full_resolution_links:
- Add a URL-path index so root-relative '/uploads/..' srcs (stored under
  their absolutized URL) and host/scheme mismatches resolve to local
  files instead of rendering broken offline.
- Drop the synthetic higher-res wrapping: it matched unrelated images via
  split('_')[0] substring + larger-on-disk, and wrapped imgs already
  inside <a class=lightbox> producing invalid nested anchors. Rely on the
  lightbox href rewrite (Discourse already links the original).
- Normalize protocol-relative // per srcset entry, not just the first, so
  retina 1.5x/2x variants resolve.
- Lightbox/img resolution shares one helper using exact filename matching
  (never degrades an original to a medium variant).
_resolve_asset_url used a directional pname.startswith(stem) fallback
that (a) resolved a lightbox original to its medium variant, breaking the
medium-inline/full-on-click contract, and (b) failed when the stored name
was shorter than the queried one. Replace with a URL-path index plus
exact-filename matching.
Covers root-relative src resolution, per-entry srcset // normalization,
no lightbox double-wrap/nested anchors, and the markdown
original-not-degraded-to-medium case.
… JS)

Add target="_blank"+rel="noopener" to the rewritten lightbox anchors and
re-point the orphaned .discourse-image-link CSS (zoom cursor, hover
magnifier, focus ring) at a.lightbox, which is the anchor Discourse
actually emits. Keeps the export JS-free; the .meta caption is already
stripped by enhance_all_image_alt_text.
HTML pages resolve assets under <html_dir>/assets (page_depth assumes it),
but the AssetDownloader writes images/avatars/emoji/site to
<archive_root>/assets — a sibling of html/. copy_assets only brought in the
static css/js/fonts, so every downloaded image and avatar 404'd offline
even though the file existed (the dominant 'non-rendered images' cause).
Copy the downloaded media subdirs into html/assets so the tree is
self-contained and the page_depth=3 refs resolve.

Verified end-to-end against meta.discourse.org: all 42 inline/lightbox/
avatar refs across 8 topics now resolve on disk.
@19-84

19-84 commented Jun 9, 2026

Copy link
Copy Markdown
Owner Author

Real-world verification (meta.discourse.org) surfaced the primary bug

Verified this branch by archiving a live meta.discourse.org sample to an NFS share and inspecting the exported HTML. The PR's rewrite fixes all confirmed working in real output (relative srcs resolved, srcset rewritten, lightbox target="_blank" → local original, no nested anchors, collision-safe filenames).

But verification exposed a deeper, pre-existing bug underneath the rewrite layer (also on main): downloaded media is written to <root>/assets/ while HTML pages resolve assets under <root>/html/assets/ — only static css/js/fonts were copied there, so every downloaded image and avatar 404'd offline even though the file existed. This is almost certainly the dominant "non-rendered images" symptom.

Fixed in f211979 by copying the downloaded media subdirs into html/assets/. Re-verified end-to-end: all 42 inline/lightbox/avatar refs across 8 topics now resolve on disk (0 missing).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant