Skip to content

reader: serve image-dominant PDF pages as <img>, drop full-PDF mode#743

Merged
ajslater merged 7 commits into
developfrom
claude/eager-beaver-b576f1
May 9, 2026
Merged

reader: serve image-dominant PDF pages as <img>, drop full-PDF mode#743
ajslater merged 7 commits into
developfrom
claude/eager-beaver-b576f1

Conversation

@ajslater
Copy link
Copy Markdown
Owner

@ajslater ajslater commented May 9, 2026

Summary

Most "comic PDFs" are scanned-image wrappers — one full-bleed JPEG or PNG per page, no real vector content. Today every PDF page routes through vue-pdf-embed (and pdf.js on the client) regardless. After this PR, the backend runs an image-dominant detector and serves matched pages as plain image bytes; the browser renders them through <img> like any CBZ page. Vector-content pages keep the existing single-page-PDF + vue-pdf-embed path.

End result: comic-style PDF pages render through Codex's existing image reader (same path as CBZ/CBR), and "real" multi-element PDFs keep vector-quality rendering. The user can override per-comic when the detector misjudges.

The full design + empirical validation against a 14-PDF private corpus lives in tasks/pdf-image-detection/.

Sequencing — depends on two upstream PRs

This PR's CI is wired through [tool.uv.sources] git refs to:

  1. ajslater/pdffile#22 — comicbox-pdffile 0.6.0 with classify_page, read_image_if_dominant, read_full_pixmap_jpeg.
  2. ajslater/comicbox#131 — widens the pdffile pin (>=0.6,<0.7).

Merge order:

  1. Land + publish pdffile 0.6.x to PyPI
  2. Land + publish comicbox 3.0.x to PyPI
  3. Drop the [tool.uv.sources] block + the explicit comicbox-pdffile direct dep here, regenerate uv.lock, merge.

Backend changes

  • codex/views/reader/page.py — new ?format=auto|pdf|image parameter. _try_pdf_image_serve reaches through Comicbox._get_archive() to the underlying PDFFile (private API for now — noqa: SLF001 flags the seam) and dispatches based on classify_page. Per-page verdicts memoized on _ArchiveEntry.
  • codex/views/reader/_archive_cache.py — adds a verdicts: dict slot on _ArchiveEntry. New open_entry() context manager yields the entry directly; existing open() keeps its Comicbox-only contract.
  • Old ?pixmap= parameter dropped (latent bug: returned PPM with image/jpeg content-type; no caller).

Frontend changes

  • BookPage always tries <ImgPage> first. On error for a PDF book it sets pdfFallback=true and the page re-mounts as <PDFDoc> with ?format=pdf appended. No HEAD pre-flight, no verdict threaded through the API response — the browser's image-load failure on an application/pdf body is the natural signal.
  • PagerFullPDF deleted. The full-document-load mode it served was the most fragile branch of the old code (CSS hacks, height-fudge for vertical, two-page mode unsupported). pager.vue picks between PagerHorizontal / PagerVertical purely on reading direction now.
  • Drops the cacheBook carve-out for vertical PDFs — PDFs prefetch alongside CBZ.
  • Adds a per-comic "PDF Rendering" radio in the reader settings drawer (Auto / Force image / Force vector) wired to a new clientSettings.pdfRenderMode field.

Test plan

  • make ty clean
  • make test-python — 48 passed
  • bun run build — clean; pdf.js chunk (~817 KB gzip) is lazy-loaded — only fetched when a vector PDF page actually mounts
  • bun run test:ci — frontend tests pass
  • End-to-end against a 14-PDF private corpus — Watchmen (image-dominant 6/6), Amphigorey (mixed: 6/12 image-dominant, rest correctly fall through), Nolo Press (text PDF: 0/12), CMYK page transcoded correctly
  • Browser eyeball test (can't drive a browser as an agent — needs human verification):
    • Open a comic-style PDF → pages render as <img> with no pdf.js worker visible in DevTools network panel
    • Open a vector PDF → first request returns application/pdf, the <img> errors silently, pdf-doc chunk loads, page renders via <vue-pdf-embed>
    • Toggle "PDF Rendering → Force image" → text PDF pages now rasterize server-side
    • Toggle "Force vector" → all PDF pages route through <vue-pdf-embed>
    • Vertical scroll on a PDF → no longer falls back to per-page mode

Pre-existing lint failures (not from this PR)

make lint surfaces two pre-existing issues that fail on develop too:

  1. NEWS.md prettier formatting (project-root prettier check)
  2. remark --quiet . errors with "Cannot process specified file: it's ignored" — happens regardless of branch

🤖 Generated with Claude Code

ajslater and others added 7 commits May 8, 2026 20:36
Most "comic PDFs" are scanned-image wrappers — one full-bleed JPEG
or PNG per page, no real vector content. Previously every PDF page
was routed through ``vue-pdf-embed`` (and through pdf.js on the
client) regardless. Now the backend runs an image-dominant page
detector via ``comicbox-pdffile`` 0.6 and serves matched pages as
plain image bytes; the browser renders them through ``<img>`` like
any CBZ page. Vector-content pages keep the existing single-page-PDF
+ ``vue-pdf-embed`` path.

Backend
=======

* ``codex/views/reader/page.py``
  - ``?format=auto|pdf|image`` query parameter. ``auto`` (default)
    runs the detector. ``pdf`` skips the detector and forces the
    legacy single-page-PDF path. ``image`` always rasterizes — works
    for any PDF page but spends more CPU on vector-heavy pages.
  - ``_try_pdf_image_serve`` reaches through Comicbox to the
    underlying ``PDFFile`` (private API for now; comments mark the
    seam for a future public-getter swap) and dispatches to
    ``classify_page`` / ``read_image_if_dominant`` /
    ``read_full_pixmap_jpeg``. Per-page verdicts are memoized on the
    ``_ArchiveEntry`` so prev/curr/next prefetch is effectively free.
  - Old ``?pixmap=`` query parameter dropped — it returned PPM bytes
    with ``image/jpeg`` content-type (latent labeling bug; no caller).
  - OpenAPI schema advertises three possible response content-types:
    ``application/pdf`` (fallback), ``image/jpeg``, ``image/png``.
* ``codex/views/reader/_archive_cache.py``
  - Adds a ``verdicts: dict[int, PageVerdict]`` slot on
    ``_ArchiveEntry`` for the per-page detector cache.
  - New ``open_entry()`` context manager yields the entry directly
    (existing ``open()`` keeps yielding ``Comicbox`` for callers
    that don't need verdict state).

Frontend
========

* ``BookPage`` (``page/page.vue``) always tries ``<ImgPage>`` first.
  On ``error`` for a PDF book it sets ``pdfFallback=true`` and the
  page re-mounts as ``<PDFDoc>`` against the same URL with
  ``?format=pdf`` appended. No HEAD pre-flight, no verdict threaded
  through the API response — the browser's image-load failure on a
  ``application/pdf`` body is the natural signal.
* Drops ``PagerFullPDF`` and the whole-document-load mode it served.
  ``pager.vue`` now picks between ``PagerHorizontal`` /
  ``PagerVertical`` based purely on reading direction.
* Drops the ``cacheBook`` carve-out for vertical PDFs in
  ``stores/reader.js`` — PDFs prefetch alongside CBZ now.
* Adds a per-comic "PDF Rendering" radio in the reader settings
  drawer (Auto / Force image / Force vector) wired to a new
  ``clientSettings.pdfRenderMode`` field. Forwarded to the page
  endpoint as ``?format=``.
* ``getComicPageSource`` accepts an optional ``format`` parameter.
  Omitting it preserves the URL shape so HTTP caches don't fragment.

Sequencing
==========

This change depends on two upstream PRs:

* ``comicbox-pdffile`` 0.6 — image-dominant detector + extractors:
  ajslater/pdffile#22
* ``comicbox`` widened pdffile pin (>=0.6,<0.7):
  ajslater/comicbox#131

The ``[tool.uv.sources]`` block in ``pyproject.toml`` temporarily
points both deps at their PR branches so this branch's CI can
resolve. Once both upstreams land on PyPI, drop the sources block
and the explicit ``comicbox-pdffile`` direct dep.

The full design + empirical validation against a 14-PDF private
corpus lives in ``tasks/pdf-image-detection/``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reports of 404s on PDF page requests with the new image-dominant
detector landed. The catch-all ``except Exception`` in
``ReaderPageView.get`` was logging only ``str(exc)`` — no traceback,
no path context — so the actual failure point was invisible. Hook
loguru's ``logger.exception`` to surface the traceback, plus
``logger.debug`` markers at the decision points so we can see which
branch each request took:

* image-serve auto: which verdict + xref + ext we got
* image-serve force-image: bytes/ext returned
* image-serve declined: why we fell through
* legacy PDF path: bytes/content-type served, plus a wrapper that
  surfaces ``get_page_by_index`` failures with a traceback before
  the catch-all flattens them to 404

All log lines tagged ``[pdf-debug]`` for grep + an easy revert.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier debug pass logged the catch-all exception path but
silently 404'd through the Comic.DoesNotExist / FileNotFoundError
handlers (no log line). User reports a 404 on /api/v3/c/1/0/page.jpg
that has *no* matching [pdf-debug] log line, meaning the request
either bombs out before _get_page_image (in _resolve_path_and_type's
ACL/DB lookup) or gets caught by one of the silent handlers.

Add view-entry, view-exit, and per-handler log lines so every 404
correlates to one specific [pdf-debug] line. Entry log includes the
User-Agent and Referer so we can tell whether a failing request
comes from <img>, vue-pdf-embed's fetch, a prefetch, or a 'Read in
Tab' direct nav.

All TEMP DEBUG; revert with grep -l '[pdf-debug]' once stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug
===

PDF pages were 404'ing with ``?format=pdf`` (the ``serve as a single
single-page PDF blob to vue-pdf-embed`` path). Server logs showed the
URL request reaching ``ReaderPageView.get`` for the no-param ``<img>``
attempt and serving the PDF bytes successfully. The follow-up
``?format=pdf`` request 404'd before the view's entry-debug log
fired, with the response body shaped like DRF's
``{"detail": "Not found."}`` and ``Content-Type: application/json``.

Cause
=====

DRF reserves ``?format=`` (``REST_FRAMEWORK['URL_FORMAT_OVERRIDE']``,
default ``'format'``) as a renderer-format selector.
``DefaultContentNegotiation.filter_renderers(renderers, 'pdf')`` runs
inside ``APIView.dispatch.initial`` *before* the view's
``get`` handler. With no PDF renderer registered, that method
raises ``exceptions.NotFound`` per DRF source:

    def filter_renderers(self, renderers, format):
        renderers = [r for r in renderers if r.format == format]
        if not renderers:
            raise exceptions.NotFound(...)
        return renderers

So the request was getting a 404 from DRF's content negotiator
before the view code ran — explaining the missing entry log.

Fix
===

Rename the query parameter from ``format`` to ``serve`` end-to-end:

* Backend: ``_FORMAT_*`` constants → ``_SERVE_*``; OpenAPI schema
  parameter renamed; debug-log fields renamed.
* Frontend: ``getComicPageSource({ ..., format })`` →
  ``({ ..., serve })`` and the URL builder emits ``&serve=`` instead
  of ``&format=``. Internal field names (``pdfRenderMode`` etc.)
  unchanged — only the wire param renamed.

Verified by curl:

    HEAD ?ts=...&format=pdf  →  404 (DRF NotFound)
    HEAD ?ts=...&serve=pdf   →  200 application/pdf  (✓ fixed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to f970a3b. The ``cache_control(max_age=PAGE_MAX_AGE,
public=True)`` decorator on ``page.jpg`` was patching its
``Cache-Control`` header onto *every* response — including 4xx —
because Django's ``cache_control`` doesn't filter by status.

That turned every transient failure into a week-long cache poison.
Symptom in the field: three concurrent ``?format=pdf`` 404s in the
browser surfaced only one server-side log line; the other two were
served from the browser HTTP cache (cached from earlier ``format=pdf``
requests in prior sessions, when DRF's URL_FORMAT_OVERRIDE rejected
the param). Once a 4xx ships with ``Cache-Control: public, max-age=
604800`` the browser pins it for a week and never asks again.

Add ``codex.views.util.cache_control_2xx`` — same shape as Django's
``cache_control`` but only patches the header on responses with
status 200-299. Swap it in on the page endpoint. Other routes that
use ``cache_control`` are mostly cover endpoints + book.pdf
(server-side cached via ``cache_page``), where 4xx is rare and
short-lived; leave them alone for now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PDF rendering paths are stable now (?format= → ?serve= rename
fixed the DRF NotFound issue, cache_control_2xx fixed the cached-404
issue). Strip the temporary view-entry/exit, image-serve-decision,
and exception logging that helped diagnose those.

Reverts:
* logger.exception(...) in catch-all → logger.warning(exc) (original)
* Surfacing legacy get_page_by_index failures with traceback
* All [pdf-debug] DEBUG-level decision logs

The structural fixes — ?serve= param, classify-on-cache-entry,
cache_control_2xx — stay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajslater ajslater merged commit 12d1239 into develop May 9, 2026
3 checks passed
@ajslater ajslater deleted the claude/eager-beaver-b576f1 branch May 11, 2026 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant