reader: serve image-dominant PDF pages as <img>, drop full-PDF mode by ajslater · Pull Request #743 · ajslater/codex

ajslater · 2026-05-09T03:37:22Z

Summary

Most "comic PDFs" are scanned-image wrappers — one full-bleed JPEG or PNG per page, no real vector content. Today every PDF page routes through vue-pdf-embed (and pdf.js on the client) regardless. After this PR, the backend runs an image-dominant detector and serves matched pages as plain image bytes; the browser renders them through <img> like any CBZ page. Vector-content pages keep the existing single-page-PDF + vue-pdf-embed path.

End result: comic-style PDF pages render through Codex's existing image reader (same path as CBZ/CBR), and "real" multi-element PDFs keep vector-quality rendering. The user can override per-comic when the detector misjudges.

The full design + empirical validation against a 14-PDF private corpus lives in tasks/pdf-image-detection/.

Sequencing — depends on two upstream PRs

This PR's CI is wired through [tool.uv.sources] git refs to:

ajslater/pdffile#22 — comicbox-pdffile 0.6.0 with classify_page, read_image_if_dominant, read_full_pixmap_jpeg.
ajslater/comicbox#131 — widens the pdffile pin (>=0.6,<0.7).

Merge order:

Land + publish pdffile 0.6.x to PyPI
Land + publish comicbox 3.0.x to PyPI
Drop the [tool.uv.sources] block + the explicit comicbox-pdffile direct dep here, regenerate uv.lock, merge.

Backend changes

codex/views/reader/page.py — new ?format=auto|pdf|image parameter. _try_pdf_image_serve reaches through Comicbox._get_archive() to the underlying PDFFile (private API for now — noqa: SLF001 flags the seam) and dispatches based on classify_page. Per-page verdicts memoized on _ArchiveEntry.
codex/views/reader/_archive_cache.py — adds a verdicts: dict slot on _ArchiveEntry. New open_entry() context manager yields the entry directly; existing open() keeps its Comicbox-only contract.
Old ?pixmap= parameter dropped (latent bug: returned PPM with image/jpeg content-type; no caller).

Frontend changes

BookPage always tries <ImgPage> first. On error for a PDF book it sets pdfFallback=true and the page re-mounts as <PDFDoc> with ?format=pdf appended. No HEAD pre-flight, no verdict threaded through the API response — the browser's image-load failure on an application/pdf body is the natural signal.
PagerFullPDF deleted. The full-document-load mode it served was the most fragile branch of the old code (CSS hacks, height-fudge for vertical, two-page mode unsupported). pager.vue picks between PagerHorizontal / PagerVertical purely on reading direction now.
Drops the cacheBook carve-out for vertical PDFs — PDFs prefetch alongside CBZ.
Adds a per-comic "PDF Rendering" radio in the reader settings drawer (Auto / Force image / Force vector) wired to a new clientSettings.pdfRenderMode field.

Test plan

make ty clean
make test-python — 48 passed
bun run build — clean; pdf.js chunk (~817 KB gzip) is lazy-loaded — only fetched when a vector PDF page actually mounts
bun run test:ci — frontend tests pass
End-to-end against a 14-PDF private corpus — Watchmen (image-dominant 6/6), Amphigorey (mixed: 6/12 image-dominant, rest correctly fall through), Nolo Press (text PDF: 0/12), CMYK page transcoded correctly
Browser eyeball test (can't drive a browser as an agent — needs human verification):
- Open a comic-style PDF → pages render as <img> with no pdf.js worker visible in DevTools network panel
- Open a vector PDF → first request returns application/pdf, the <img> errors silently, pdf-doc chunk loads, page renders via <vue-pdf-embed>
- Toggle "PDF Rendering → Force image" → text PDF pages now rasterize server-side
- Toggle "Force vector" → all PDF pages route through <vue-pdf-embed>
- Vertical scroll on a PDF → no longer falls back to per-page mode

Pre-existing lint failures (not from this PR)

make lint surfaces two pre-existing issues that fail on develop too:

NEWS.md prettier formatting (project-root prettier check)
remark --quiet . errors with "Cannot process specified file: it's ignored" — happens regardless of branch

🤖 Generated with Claude Code

Most "comic PDFs" are scanned-image wrappers — one full-bleed JPEG or PNG per page, no real vector content. Previously every PDF page was routed through ``vue-pdf-embed`` (and through pdf.js on the client) regardless. Now the backend runs an image-dominant page detector via ``comicbox-pdffile`` 0.6 and serves matched pages as plain image bytes; the browser renders them through ``<img>`` like any CBZ page. Vector-content pages keep the existing single-page-PDF + ``vue-pdf-embed`` path. Backend ======= * ``codex/views/reader/page.py`` - ``?format=auto|pdf|image`` query parameter. ``auto`` (default) runs the detector. ``pdf`` skips the detector and forces the legacy single-page-PDF path. ``image`` always rasterizes — works for any PDF page but spends more CPU on vector-heavy pages. - ``_try_pdf_image_serve`` reaches through Comicbox to the underlying ``PDFFile`` (private API for now; comments mark the seam for a future public-getter swap) and dispatches to ``classify_page`` / ``read_image_if_dominant`` / ``read_full_pixmap_jpeg``. Per-page verdicts are memoized on the ``_ArchiveEntry`` so prev/curr/next prefetch is effectively free. - Old ``?pixmap=`` query parameter dropped — it returned PPM bytes with ``image/jpeg`` content-type (latent labeling bug; no caller). - OpenAPI schema advertises three possible response content-types: ``application/pdf`` (fallback), ``image/jpeg``, ``image/png``. * ``codex/views/reader/_archive_cache.py`` - Adds a ``verdicts: dict[int, PageVerdict]`` slot on ``_ArchiveEntry`` for the per-page detector cache. - New ``open_entry()`` context manager yields the entry directly (existing ``open()`` keeps yielding ``Comicbox`` for callers that don't need verdict state). Frontend ======== * ``BookPage`` (``page/page.vue``) always tries ``<ImgPage>`` first. On ``error`` for a PDF book it sets ``pdfFallback=true`` and the page re-mounts as ``<PDFDoc>`` against the same URL with ``?format=pdf`` appended. No HEAD pre-flight, no verdict threaded through the API response — the browser's image-load failure on a ``application/pdf`` body is the natural signal. * Drops ``PagerFullPDF`` and the whole-document-load mode it served. ``pager.vue`` now picks between ``PagerHorizontal`` / ``PagerVertical`` based purely on reading direction. * Drops the ``cacheBook`` carve-out for vertical PDFs in ``stores/reader.js`` — PDFs prefetch alongside CBZ now. * Adds a per-comic "PDF Rendering" radio in the reader settings drawer (Auto / Force image / Force vector) wired to a new ``clientSettings.pdfRenderMode`` field. Forwarded to the page endpoint as ``?format=``. * ``getComicPageSource`` accepts an optional ``format`` parameter. Omitting it preserves the URL shape so HTTP caches don't fragment. Sequencing ========== This change depends on two upstream PRs: * ``comicbox-pdffile`` 0.6 — image-dominant detector + extractors: ajslater/pdffile#22 * ``comicbox`` widened pdffile pin (>=0.6,<0.7): ajslater/comicbox#131 The ``[tool.uv.sources]`` block in ``pyproject.toml`` temporarily points both deps at their PR branches so this branch's CI can resolve. Once both upstreams land on PyPI, drop the sources block and the explicit ``comicbox-pdffile`` direct dep. The full design + empirical validation against a 14-PDF private corpus lives in ``tasks/pdf-image-detection/``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reports of 404s on PDF page requests with the new image-dominant detector landed. The catch-all ``except Exception`` in ``ReaderPageView.get`` was logging only ``str(exc)`` — no traceback, no path context — so the actual failure point was invisible. Hook loguru's ``logger.exception`` to surface the traceback, plus ``logger.debug`` markers at the decision points so we can see which branch each request took: * image-serve auto: which verdict + xref + ext we got * image-serve force-image: bytes/ext returned * image-serve declined: why we fell through * legacy PDF path: bytes/content-type served, plus a wrapper that surfaces ``get_page_by_index`` failures with a traceback before the catch-all flattens them to 404 All log lines tagged ``[pdf-debug]`` for grep + an easy revert. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The earlier debug pass logged the catch-all exception path but silently 404'd through the Comic.DoesNotExist / FileNotFoundError handlers (no log line). User reports a 404 on /api/v3/c/1/0/page.jpg that has *no* matching [pdf-debug] log line, meaning the request either bombs out before _get_page_image (in _resolve_path_and_type's ACL/DB lookup) or gets caught by one of the silent handlers. Add view-entry, view-exit, and per-handler log lines so every 404 correlates to one specific [pdf-debug] line. Entry log includes the User-Agent and Referer so we can tell whether a failing request comes from <img>, vue-pdf-embed's fetch, a prefetch, or a 'Read in Tab' direct nav. All TEMP DEBUG; revert with grep -l '[pdf-debug]' once stable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bug === PDF pages were 404'ing with ``?format=pdf`` (the ``serve as a single single-page PDF blob to vue-pdf-embed`` path). Server logs showed the URL request reaching ``ReaderPageView.get`` for the no-param ``<img>`` attempt and serving the PDF bytes successfully. The follow-up ``?format=pdf`` request 404'd before the view's entry-debug log fired, with the response body shaped like DRF's ``{"detail": "Not found."}`` and ``Content-Type: application/json``. Cause ===== DRF reserves ``?format=`` (``REST_FRAMEWORK['URL_FORMAT_OVERRIDE']``, default ``'format'``) as a renderer-format selector. ``DefaultContentNegotiation.filter_renderers(renderers, 'pdf')`` runs inside ``APIView.dispatch.initial`` *before* the view's ``get`` handler. With no PDF renderer registered, that method raises ``exceptions.NotFound`` per DRF source: def filter_renderers(self, renderers, format): renderers = [r for r in renderers if r.format == format] if not renderers: raise exceptions.NotFound(...) return renderers So the request was getting a 404 from DRF's content negotiator before the view code ran — explaining the missing entry log. Fix === Rename the query parameter from ``format`` to ``serve`` end-to-end: * Backend: ``_FORMAT_*`` constants → ``_SERVE_*``; OpenAPI schema parameter renamed; debug-log fields renamed. * Frontend: ``getComicPageSource({ ..., format })`` → ``({ ..., serve })`` and the URL builder emits ``&serve=`` instead of ``&format=``. Internal field names (``pdfRenderMode`` etc.) unchanged — only the wire param renamed. Verified by curl: HEAD ?ts=...&format=pdf → 404 (DRF NotFound) HEAD ?ts=...&serve=pdf → 200 application/pdf (✓ fixed) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Companion to f970a3b. The ``cache_control(max_age=PAGE_MAX_AGE, public=True)`` decorator on ``page.jpg`` was patching its ``Cache-Control`` header onto *every* response — including 4xx — because Django's ``cache_control`` doesn't filter by status. That turned every transient failure into a week-long cache poison. Symptom in the field: three concurrent ``?format=pdf`` 404s in the browser surfaced only one server-side log line; the other two were served from the browser HTTP cache (cached from earlier ``format=pdf`` requests in prior sessions, when DRF's URL_FORMAT_OVERRIDE rejected the param). Once a 4xx ships with ``Cache-Control: public, max-age= 604800`` the browser pins it for a week and never asks again. Add ``codex.views.util.cache_control_2xx`` — same shape as Django's ``cache_control`` but only patches the header on responses with status 200-299. Swap it in on the page endpoint. Other routes that use ``cache_control`` are mostly cover endpoints + book.pdf (server-side cached via ``cache_page``), where 4xx is rare and short-lived; leave them alone for now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The PDF rendering paths are stable now (?format= → ?serve= rename fixed the DRF NotFound issue, cache_control_2xx fixed the cached-404 issue). Strip the temporary view-entry/exit, image-serve-decision, and exception logging that helped diagnose those. Reverts: * logger.exception(...) in catch-all → logger.warning(exc) (original) * Surfacing legacy get_page_by_index failures with traceback * All [pdf-debug] DEBUG-level decision logs The structural fixes — ?serve= param, classify-on-cache-entry, cache_control_2xx — stay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ajslater and others added 7 commits May 8, 2026 20:36

Merge branch 'develop' into claude/eager-beaver-b576f1

7d7a72c

ajslater merged commit 12d1239 into develop May 9, 2026
3 checks passed

ajslater deleted the claude/eager-beaver-b576f1 branch May 11, 2026 00:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reader: serve image-dominant PDF pages as <img>, drop full-PDF mode#743

reader: serve image-dominant PDF pages as <img>, drop full-PDF mode#743
ajslater merged 7 commits into
developfrom
claude/eager-beaver-b576f1

ajslater commented May 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ajslater commented May 9, 2026

Summary

Sequencing — depends on two upstream PRs

Backend changes

Frontend changes

Test plan

Pre-existing lint failures (not from this PR)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant