reader: ?hide_text=1 to suppress visible OCR text on PDF pages #740
Merged
Some scanned PDFs draw their OCR layer with rendering mode 0
(visible) on top of the page's rasterized scan, doubling the text
under any renderer that respects the content stream — including
PDF.js as embedded by vue-pdf-embed. Setting ``textLayer={false}``
client-side doesn't help because that prop only gates the
selectable overlay, not text drawn from the content stream.
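
To make the content-stream side concrete: the text rendering mode is set by the PDF `Tr` operator, where mode 0 fills glyphs and mode 3 draws nothing. The sketch below is only an illustration on raw bytes, not the comicbox-pdffile implementation. The function name and the sample stream are hypothetical, and it assumes the stream never sets `Tr` itself.

```python
def prepend_invisible_text_mode(content_stream: bytes) -> bytes:
    """Return the stream with `3 Tr` (invisible text) set before anything else."""
    # Assumption: the stream never emits its own Tr operator, so the mode
    # set here stays in effect for every text-showing operator that follows.
    return b"3 Tr\n" + content_stream


# Hypothetical OCR layer: one line of text drawn in the default mode 0.
ocr_layer = b"BT /F1 12 Tf 72 720 Td (scanned words) Tj ET"
hidden = prepend_invisible_text_mode(ocr_layer)
```

The string operands are untouched, so text extraction and the selectable overlay still see the same text. A stream that sets `0 Tr` explicitly later on would override the prefix and would need a different approach.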
Forward the new ``hide_text`` kwarg from comicbox >= 3.0.1 (which
forwards to comicbox-pdffile >= 0.5.1) when ``?hide_text=1`` is
present on the page request. The PDF / pixmap response still
contains the text content — only the rendering mode changes — so
the selectable overlay continues to work.
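
A minimal, framework-free sketch of the opt-in forwarding described above. The helper name `page_read_kwargs` is hypothetical and the real view code may read the parameter differently. The point is only that the kwarg is passed when `?hide_text=1` is present and omitted otherwise.

```python
from urllib.parse import parse_qs


def page_read_kwargs(query_string: str) -> dict[str, bool]:
    """Build extra kwargs for the page read from the request's query string."""
    params = parse_qs(query_string)
    # Only ``hide_text=1`` opts in; an absent parameter or any other value
    # leaves the call unchanged.
    if params.get("hide_text", [""])[-1] == "1":
        return {"hide_text": True}
    return {}
```

A caller could then splat the result into the page read, for example `get_page_by_index(index, **page_read_kwargs(qs))`, so requests without the parameter behave exactly as before.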
The pyright/ty ignores cover the dev gap until ``pyproject.toml``
deps are bumped to the released versions of comicbox and
comicbox-pdffile.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Some scanned PDFs draw their OCR layer with rendering mode 0 (visible) on top of the page's rasterized scan, doubling the text under any renderer that respects the content stream, including PDF.js as embedded by vue-pdf-embed. Setting `textLayer={false}` client-side doesn't help because that prop only gates the selectable overlay, not text drawn into the canvas from the content stream.

This PR plumbs a new `?hide_text=1` query param through the reader page view down to the PDF backend. With it set, every PDF page is served with `3 Tr` (text rendering mode = invisible) prepended to its content stream. Visible glyphs go away, but text content stays in the file so PDF.js's selectable text overlay continues to work for selection / search.

Dependency chain
- `hide_text` knob on `read_pdf` / `read_pixmap` / `read`.
- `hide_text` from `get_page_by_index` etc. through to `PDFFile.read`.

The pyright/ty `# ignore`s on the call site cover the dev gap until those two land on PyPI and this repo's `pyproject.toml` deps are bumped to the released versions. They're a one-line cleanup follow-up after the bump.

Test plan
- `make fix` clean
- `make lint` clean (the only error is the pre-existing `remark` config issue on `develop`; see codex#738)
- `bin/test-python.sh`: 48 passed
- `archive_cache.open(...).get_page_by_index(..., hide_text=True)` on a real badly-OCR'd PDF: bytes differ from baseline, rendered output differs, text stays extractable
- `?hide_text=1` URL serves the un-doubled rendering and that selection still works in vue-pdf-embed

Out of scope
🤖 Generated with Claude Code