
reader: ?hide_text=1 to suppress visible OCR text on PDF pages #740

Merged
ajslater merged 1 commit into develop from claude/reader-hide-text-pdf
May 9, 2026

Conversation

@ajslater
Owner

@ajslater ajslater commented May 9, 2026

Summary

Some scanned PDFs draw their OCR layer with rendering mode 0 (visible) on top of the page's rasterized scan, doubling the text under any renderer that respects the content stream — including PDF.js as embedded by vue-pdf-embed. Setting textLayer={false} client-side doesn't help because that prop only gates the selectable overlay, not text drawn into the canvas from the content stream.

This PR plumbs a new ?hide_text=1 query param through the reader page view down to the PDF backend. With it set, every PDF page is served with 3 Tr (text rendering mode = invisible) prepended to its content stream — visible glyphs go away, but text content stays in the file so PDF.js's selectable text overlay continues to work for selection / search.
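The content-stream change described above can be illustrated with a minimal sketch. This is not the comicbox-pdffile implementation; `hide_visible_text` and the sample stream are hypothetical and operate on an already-decoded content stream held as bytes. It assumes the stream never sets its own rendering mode, since a later explicit `0 Tr` would override the prepended operator.

```python
# Hypothetical illustration: prepend the PDF operator "3 Tr"
# (text rendering mode 3 = invisible) to a decoded page content stream.
INVISIBLE_TEXT_MODE = b"3 Tr\n"


def hide_visible_text(content_stream: bytes) -> bytes:
    """Return the stream with invisible text mode set before any other operator."""
    if content_stream.startswith(INVISIBLE_TEXT_MODE):
        # Already patched; avoid stacking the prefix on repeated calls.
        return content_stream
    return INVISIBLE_TEXT_MODE + content_stream


# A made-up stream: draw the scanned image, then draw OCR text over it.
sample = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n"
)
patched = hide_visible_text(sample)
```

The text-showing operators stay in the stream untouched, which is why text extraction and the selectable overlay keep working; only the way glyphs are painted changes.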

Dependency chain

The pyright/ty ignore comments on the call site cover the dev gap until comicbox >= 3.0.1 and comicbox-pdffile >= 0.5.1 land on PyPI and this repo's pyproject.toml deps are bumped to the released versions. Removing them is a one-line cleanup follow-up after the bump.

Test plan

  • make fix clean
  • make lint clean (the only error is the pre-existing remark config issue on develop — see codex#738)
  • bin/test-python.sh — 48 passed
  • End-to-end verified through archive_cache.open(...).get_page_by_index(..., hide_text=True) on a real badly-OCR'd PDF: bytes differ from baseline, rendered output differs, text stays extractable
  • Manual: navigate the reader to a problem PDF, confirm ?hide_text=1 URL serves the un-doubled rendering and that selection still works in vue-pdf-embed

Out of scope

  • UI knob for the toggle (settings drawer / per-comic preference) — the query param is the minimum surface; happy to follow up with a reader setting if you want a clickable affordance.

🤖 Generated with Claude Code

Some scanned PDFs draw their OCR layer with rendering mode 0
(visible) on top of the page's rasterized scan, doubling the text
under any renderer that respects the content stream — including
PDF.js as embedded by vue-pdf-embed. Setting ``textLayer={false}``
client-side doesn't help because that prop only gates the
selectable overlay, not text drawn from the content stream.

Forward the new ``hide_text`` kwarg from comicbox >= 3.0.1 (which
forwards to comicbox-pdffile >= 0.5.1) when ``?hide_text=1`` is
present on the page request. The PDF / pixmap response still
contains the text content — only the rendering mode changes — so
the selectable overlay continues to work.
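A rough sketch of the forwarding step, assuming only the literal value `1` enables the flag. The helper names `wants_hidden_text` and `page_kwargs` and the example URL paths are hypothetical, not codex's actual view code:

```python
from urllib.parse import parse_qs, urlsplit


def wants_hidden_text(url: str) -> bool:
    """Return True when the request URL carries ?hide_text=1."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("hide_text", [])
    # If the parameter is repeated, the last occurrence wins (an assumption).
    return bool(values) and values[-1] == "1"


def page_kwargs(url: str) -> dict[str, bool]:
    """Build the extra keyword arguments to forward to the page getter."""
    kwargs: dict[str, bool] = {}
    if wants_hidden_text(url):
        # Only pass the kwarg when requested, so the default path is unchanged.
        kwargs["hide_text"] = True
    return kwargs
```

Passing the kwarg only when the flag is set keeps requests without the parameter on exactly the same call path as before.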

The pyright/ty ignores cover the dev gap until ``pyproject.toml``
deps are bumped to the released versions of comicbox /
comicbox-pdffile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajslater ajslater merged commit 8572935 into develop May 9, 2026
3 checks passed
@ajslater ajslater deleted the claude/reader-hide-text-pdf branch May 11, 2026 00:10