fix: image fetching, rendering, and lightbox click-through#19
Conversation
is_forum_url regressions from e6d17e9: - Lowercase site_domain so an uppercase --urls domain still matches the lowercase upload URLs Discourse emits (every inline image was dropped). - Re-allow Discourse upload media (uploads/optimized/original path segments) served from a self-hosted S3/CloudFront/BunnyCDN host, which the host-only allowlist dropped. Uses exact path segments, not the bare substring the CodeQL fix removed. - Filter lightbox hrefs through is_forum_url too (was unfiltered), closing the arbitrary-host fetch reopened for the click-through path and the inline-broken/click-works asymmetry. Also write 'update' command assets under output_dir/assets like archive, so the HTML exporter's assets/ marker resolves them.
The collision guard read local_path.exists() ~40 lines before the write, with no lock across the 20-worker pool. Two distinct URLs sanitizing to the same basename could both skip the hash suffix and overwrite each other. Reserve the chosen path under a lock and disambiguate with a URL hash when taken.
The watch daemon wrote assets to output_dir/images (missing the assets/ marker the HTML exporter needs) and never downloaded lightbox originals, so watch-updated topics had broken images and broken click-through. Mirror the archive path: output_dir/assets + extract_lightbox_urls.
rewrite_with_full_resolution_links:
- Add a URL-path index so root-relative '/uploads/..' srcs (stored under
their absolutized URL) and host/scheme mismatches resolve to local
files instead of rendering broken offline.
- Drop the synthetic higher-res wrapping: it matched unrelated images via
split('_')[0] substring + larger-on-disk, and wrapped imgs already
inside <a class=lightbox> producing invalid nested anchors. Rely on the
lightbox href rewrite (Discourse already links the original).
- Normalize protocol-relative // per srcset entry, not just the first, so
retina 1.5x/2x variants resolve.
- Lightbox/img resolution shares one helper using exact filename matching
(never degrades an original to a medium variant).
_resolve_asset_url used a directional pname.startswith(stem) fallback that (a) resolved a lightbox original to its medium variant, breaking the medium-inline/full-on-click contract, and (b) failed when the stored name was shorter than the queried one. Replace with a URL-path index plus exact-filename matching.
Covers root-relative src resolution, per-entry srcset // normalization, no lightbox double-wrap/nested anchors, and the markdown original-not-degraded-to-medium case.
… JS) Add target="_blank"+rel="noopener" to the rewritten lightbox anchors and re-point the orphaned .discourse-image-link CSS (zoom cursor, hover magnifier, focus ring) at a.lightbox, which is the anchor Discourse actually emits. Keeps the export JS-free; the .meta caption is already stripped by enhance_all_image_alt_text.
HTML pages resolve assets under <html_dir>/assets (page_depth assumes it), but the AssetDownloader writes images/avatars/emoji/site to <archive_root>/assets — a sibling of html/. copy_assets only brought in the static css/js/fonts, so every downloaded image and avatar 404'd offline even though the file existed (the dominant 'non-rendered images' cause). Copy the downloaded media subdirs into html/assets so the tree is self-contained and the page_depth=3 refs resolve. Verified end-to-end against meta.discourse.org: all 42 inline/lightbox/ avatar refs across 8 topics now resolve on disk.
Real-world verification (meta.discourse.org) surfaced the primary bugVerified this branch by archiving a live meta.discourse.org sample to an NFS share and inspecting the exported HTML. The PR's rewrite fixes all confirmed working in real output (relative srcs resolved, srcset rewritten, lightbox But verification exposed a deeper, pre-existing bug underneath the rewrite layer (also on Fixed in |
Fixes 10 verified root causes in the image pipeline surfaced by an xhigh review (image fetching, non-rendered images, and the medium-inline / high-res-on-click lightbox behavior).
Non-rendered images
/uploads/…never rewritten — downloaded under the absolutized URL, but the exporter skipped non-httpsrcs andasset_mapwas keyed by absolute URL, so the<img>kept/uploads/…→ 404 offline. Now resolved via a URL-path index. (both HTML + Markdown)e6d17e9) — restored by allowing Discourse upload path segments (uploads/optimized/original) from any host, as exact segments (not the bare substring CodeQL flagged).--urlsdomain dropped every inline image —site_domainis now lowercased.//srcset — only the first entry was normalized; retina1.5x/2xvariants stayed remote. Now normalized per entry.Lightbox click-through (medium vs. high quality)
split("_")[0]substring + larger-on-disk matched unrelated images, and wrapped imgs already inside<a class="lightbox">into invalid nested<a>. Now relies on Discourse's native lightbox anchor (rewritten to the local original).startswith(stem)resolved an original href to its medium variant. Replaced with path-index + exact-filename matching.is_forum_urllike srcset.Path divergence & concurrency
updatecommand wrote assets tooutput_dir/images(noassets/marker) → broken images. Nowoutput_dir/assets.Tests
5 regression tests added (root-relative resolution, per-entry srcset
//, no lightbox double-wrap, markdown original-not-degraded-to-medium). Full suite: 566 passed, 39 skipped; ruff + pyright clean.Note (separate, not fixed here)
Discourse's native
<a class="lightbox">anchors get notarget="_blank"and aren't styled (only the now-removed.discourse-image-linkwas), so most click-throughs navigate in-page with no zoom affordance. Worth a follow-up: ship a small lightbox JS/CSS or addtarget="_blank"+ cursor styling toa.lightbox.