reader: serve image-dominant PDF pages as <img>, drop full-PDF mode#743
Merged
Conversation
Most "comic PDFs" are scanned-image wrappers — one full-bleed JPEG
or PNG per page, no real vector content. Previously every PDF page
was routed through ``vue-pdf-embed`` (and through pdf.js on the
client) regardless. Now the backend runs an image-dominant page
detector via ``comicbox-pdffile`` 0.6 and serves matched pages as
plain image bytes; the browser renders them through ``<img>`` like
any CBZ page. Vector-content pages keep the existing single-page-PDF
+ ``vue-pdf-embed`` path.
Backend
=======
* ``codex/views/reader/page.py``
- ``?format=auto|pdf|image`` query parameter. ``auto`` (default)
runs the detector. ``pdf`` skips the detector and forces the
legacy single-page-PDF path. ``image`` always rasterizes — works
for any PDF page but spends more CPU on vector-heavy pages.
- ``_try_pdf_image_serve`` reaches through Comicbox to the
underlying ``PDFFile`` (private API for now; comments mark the
seam for a future public-getter swap) and dispatches to
``classify_page`` / ``read_image_if_dominant`` /
``read_full_pixmap_jpeg``. Per-page verdicts are memoized on the
``_ArchiveEntry`` so prev/curr/next prefetch is effectively free.
- Old ``?pixmap=`` query parameter dropped — it returned PPM bytes
with ``image/jpeg`` content-type (latent labeling bug; no caller).
- OpenAPI schema advertises three possible response content-types:
``application/pdf`` (fallback), ``image/jpeg``, ``image/png``.
* ``codex/views/reader/_archive_cache.py``
- Adds a ``verdicts: dict[int, PageVerdict]`` slot on
``_ArchiveEntry`` for the per-page detector cache.
- New ``open_entry()`` context manager yields the entry directly
(existing ``open()`` keeps yielding ``Comicbox`` for callers
that don't need verdict state).
Frontend
========
* ``BookPage`` (``page/page.vue``) always tries ``<ImgPage>`` first.
On ``error`` for a PDF book it sets ``pdfFallback=true`` and the
page re-mounts as ``<PDFDoc>`` against the same URL with
``?format=pdf`` appended. No HEAD pre-flight, no verdict threaded
through the API response — the browser's image-load failure on a
``application/pdf`` body is the natural signal.
* Drops ``PagerFullPDF`` and the whole-document-load mode it served.
``pager.vue`` now picks between ``PagerHorizontal`` /
``PagerVertical`` based purely on reading direction.
* Drops the ``cacheBook`` carve-out for vertical PDFs in
``stores/reader.js`` — PDFs prefetch alongside CBZ now.
* Adds a per-comic "PDF Rendering" radio in the reader settings
drawer (Auto / Force image / Force vector) wired to a new
``clientSettings.pdfRenderMode`` field. Forwarded to the page
endpoint as ``?format=``.
* ``getComicPageSource`` accepts an optional ``format`` parameter.
Omitting it preserves the URL shape so HTTP caches don't fragment.
Sequencing
==========
This change depends on two upstream PRs:
* ``comicbox-pdffile`` 0.6 — image-dominant detector + extractors:
ajslater/pdffile#22
* ``comicbox`` widened pdffile pin (>=0.6,<0.7):
ajslater/comicbox#131
The ``[tool.uv.sources]`` block in ``pyproject.toml`` temporarily
points both deps at their PR branches so this branch's CI can
resolve. Once both upstreams land on PyPI, drop the sources block
and the explicit ``comicbox-pdffile`` direct dep.
The full design + empirical validation against a 14-PDF private
corpus lives in ``tasks/pdf-image-detection/``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reports of 404s on PDF page requests with the new image-dominant detector landed. The catch-all ``except Exception`` in ``ReaderPageView.get`` was logging only ``str(exc)`` — no traceback, no path context — so the actual failure point was invisible. Hook loguru's ``logger.exception`` to surface the traceback, plus ``logger.debug`` markers at the decision points so we can see which branch each request took: * image-serve auto: which verdict + xref + ext we got * image-serve force-image: bytes/ext returned * image-serve declined: why we fell through * legacy PDF path: bytes/content-type served, plus a wrapper that surfaces ``get_page_by_index`` failures with a traceback before the catch-all flattens them to 404 All log lines tagged ``[pdf-debug]`` for grep + an easy revert. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier debug pass logged the catch-all exception path but silently 404'd through the Comic.DoesNotExist / FileNotFoundError handlers (no log line). User reports a 404 on /api/v3/c/1/0/page.jpg that has *no* matching [pdf-debug] log line, meaning the request either bombs out before _get_page_image (in _resolve_path_and_type's ACL/DB lookup) or gets caught by one of the silent handlers. Add view-entry, view-exit, and per-handler log lines so every 404 correlates to one specific [pdf-debug] line. Entry log includes the User-Agent and Referer so we can tell whether a failing request comes from <img>, vue-pdf-embed's fetch, a prefetch, or a 'Read in Tab' direct nav. All TEMP DEBUG; revert with grep -l '[pdf-debug]' once stable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug
===
PDF pages were 404'ing with ``?format=pdf`` (the ``serve as a single
single-page PDF blob to vue-pdf-embed`` path). Server logs showed the
URL request reaching ``ReaderPageView.get`` for the no-param ``<img>``
attempt and serving the PDF bytes successfully. The follow-up
``?format=pdf`` request 404'd before the view's entry-debug log
fired, with the response body shaped like DRF's
``{"detail": "Not found."}`` and ``Content-Type: application/json``.
Cause
=====
DRF reserves ``?format=`` (``REST_FRAMEWORK['URL_FORMAT_OVERRIDE']``,
default ``'format'``) as a renderer-format selector.
``DefaultContentNegotiation.filter_renderers(renderers, 'pdf')`` runs
inside ``APIView.dispatch.initial`` *before* the view's
``get`` handler. With no PDF renderer registered, that method
raises ``exceptions.NotFound`` per DRF source:
def filter_renderers(self, renderers, format):
renderers = [r for r in renderers if r.format == format]
if not renderers:
raise exceptions.NotFound(...)
return renderers
So the request was getting a 404 from DRF's content negotiator
before the view code ran — explaining the missing entry log.
Fix
===
Rename the query parameter from ``format`` to ``serve`` end-to-end:
* Backend: ``_FORMAT_*`` constants → ``_SERVE_*``; OpenAPI schema
parameter renamed; debug-log fields renamed.
* Frontend: ``getComicPageSource({ ..., format })`` →
``({ ..., serve })`` and the URL builder emits ``&serve=`` instead
of ``&format=``. Internal field names (``pdfRenderMode`` etc.)
unchanged — only the wire param renamed.
Verified by curl:
HEAD ?ts=...&format=pdf → 404 (DRF NotFound)
HEAD ?ts=...&serve=pdf → 200 application/pdf (✓ fixed)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to f970a3b. The ``cache_control(max_age=PAGE_MAX_AGE, public=True)`` decorator on ``page.jpg`` was patching its ``Cache-Control`` header onto *every* response — including 4xx — because Django's ``cache_control`` doesn't filter by status. That turned every transient failure into a week-long cache poison. Symptom in the field: three concurrent ``?format=pdf`` 404s in the browser surfaced only one server-side log line; the other two were served from the browser HTTP cache (cached from earlier ``format=pdf`` requests in prior sessions, when DRF's URL_FORMAT_OVERRIDE rejected the param). Once a 4xx ships with ``Cache-Control: public, max-age= 604800`` the browser pins it for a week and never asks again. Add ``codex.views.util.cache_control_2xx`` — same shape as Django's ``cache_control`` but only patches the header on responses with status 200-299. Swap it in on the page endpoint. Other routes that use ``cache_control`` are mostly cover endpoints + book.pdf (server-side cached via ``cache_page``), where 4xx is rare and short-lived; leave them alone for now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PDF rendering paths are stable now (?format= → ?serve= rename fixed the DRF NotFound issue, cache_control_2xx fixed the cached-404 issue). Strip the temporary view-entry/exit, image-serve-decision, and exception logging that helped diagnose those. Reverts: * logger.exception(...) in catch-all → logger.warning(exc) (original) * Surfacing legacy get_page_by_index failures with traceback * All [pdf-debug] DEBUG-level decision logs The structural fixes — ?serve= param, classify-on-cache-entry, cache_control_2xx — stay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Most "comic PDFs" are scanned-image wrappers — one full-bleed JPEG or PNG per page, no real vector content. Today every PDF page routes through
vue-pdf-embed(and pdf.js on the client) regardless. After this PR, the backend runs an image-dominant detector and serves matched pages as plain image bytes; the browser renders them through<img>like any CBZ page. Vector-content pages keep the existing single-page-PDF +vue-pdf-embedpath.End result: comic-style PDF pages render through Codex's existing image reader (same path as CBZ/CBR), and "real" multi-element PDFs keep vector-quality rendering. The user can override per-comic when the detector misjudges.
The full design + empirical validation against a 14-PDF private corpus lives in
tasks/pdf-image-detection/.Sequencing — depends on two upstream PRs
This PR's CI is wired through
[tool.uv.sources]git refs to:ajslater/pdffile#22— comicbox-pdffile 0.6.0 withclassify_page,read_image_if_dominant,read_full_pixmap_jpeg.ajslater/comicbox#131— widens the pdffile pin (>=0.6,<0.7).Merge order:
[tool.uv.sources]block + the explicitcomicbox-pdffiledirect dep here, regenerateuv.lock, merge.Backend changes
codex/views/reader/page.py— new?format=auto|pdf|imageparameter._try_pdf_image_servereaches throughComicbox._get_archive()to the underlyingPDFFile(private API for now —noqa: SLF001flags the seam) and dispatches based onclassify_page. Per-page verdicts memoized on_ArchiveEntry.codex/views/reader/_archive_cache.py— adds averdicts: dictslot on_ArchiveEntry. Newopen_entry()context manager yields the entry directly; existingopen()keeps itsComicbox-only contract.?pixmap=parameter dropped (latent bug: returned PPM withimage/jpegcontent-type; no caller).Frontend changes
BookPagealways tries<ImgPage>first. Onerrorfor a PDF book it setspdfFallback=trueand the page re-mounts as<PDFDoc>with?format=pdfappended. No HEAD pre-flight, no verdict threaded through the API response — the browser's image-load failure on anapplication/pdfbody is the natural signal.PagerFullPDFdeleted. The full-document-load mode it served was the most fragile branch of the old code (CSS hacks, height-fudge for vertical, two-page mode unsupported).pager.vuepicks betweenPagerHorizontal/PagerVerticalpurely on reading direction now.cacheBookcarve-out for vertical PDFs — PDFs prefetch alongside CBZ.clientSettings.pdfRenderModefield.Test plan
make tycleanmake test-python— 48 passedbun run build— clean; pdf.js chunk (~817 KB gzip) is lazy-loaded — only fetched when a vector PDF page actually mountsbun run test:ci— frontend tests pass<img>with no pdf.js worker visible in DevTools network panelapplication/pdf, the<img>errors silently, pdf-doc chunk loads, page renders via<vue-pdf-embed><vue-pdf-embed>Pre-existing lint failures (not from this PR)
make lintsurfaces two pre-existing issues that fail ondeveloptoo:NEWS.mdprettier formatting (project-root prettier check)remark --quiet .errors with "Cannot process specified file: it's ignored" — happens regardless of branch🤖 Generated with Claude Code