reader: ?hide_text=1 to suppress visible OCR text on PDF pages by ajslater · Pull Request #740 · ajslater/codex

ajslater · 2026-05-09T00:12:21Z

Summary

Some scanned PDFs draw their OCR layer with rendering mode 0 (visible) on top of the page's rasterized scan, doubling the text under any renderer that respects the content stream — including PDF.js as embedded by vue-pdf-embed. Setting textLayer={false} client-side doesn't help because that prop only gates the selectable overlay, not text drawn into the canvas from the content stream.

This PR plumbs a new ?hide_text=1 query param through the reader page view down to the PDF backend. With it set, every PDF page is served with 3 Tr (text rendering mode = invisible) prepended to its content stream — visible glyphs go away, but text content stays in the file so PDF.js's selectable text overlay continues to work for selection / search.
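To make the mechanism concrete, here is a minimal sketch of the content-stream change, operating on an already-decoded stream held as bytes. The helper name `hide_visible_text` and the sample stream are hypothetical; the actual pdffile implementation is not shown in this PR and may differ.

```python
# Sketch only: assumes the page's content stream is already decoded to bytes.
# hide_visible_text is a hypothetical name, not the pdffile API.

# "3 Tr" sets the text rendering mode to 3 (neither fill nor stroke).
INVISIBLE_TEXT_OP = b"3 Tr\n"


def hide_visible_text(content_stream: bytes) -> bytes:
    """Return the stream with text rendering mode 3 set before any drawing."""
    if content_stream.startswith(INVISIBLE_TEXT_OP):
        # Already patched; avoid stacking the operator on repeated calls.
        return content_stream
    return INVISIBLE_TEXT_OP + content_stream


# A toy page: draw the scanned image, then show OCR text on top of it.
page = b"q 612 0 0 792 0 0 cm /Im0 Do Q\nBT /F1 12 Tf 72 720 Td (OCR text) Tj ET\n"
patched = hide_visible_text(page)
```

The text-showing operators are left untouched, which is why the text stays extractable. A stream that sets its own `Tr` later on would override the prepended mode, so this sketch only covers pages that rely on the default mode 0.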

Dependency chain

1. pdffile#19, "pdffile: add hide_text knob to read_pdf / read_pixmap / read": exposes the hide_text knob on read_pdf / read_pixmap / read.
2. comicbox#129, "comicbox: forward hide_text through archive read path to PDFFile": forwards hide_text from get_page_by_index etc. through to PDFFile.read.
3. This PR: wires the query param to the comicbox call.

The pyright/ty # ignores on the call site cover the dev gap until those two land on PyPI and this repo's pyproject.toml deps are bumped to the released versions. They're a one-line cleanup follow-up after the bump.
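The last link in the chain, turning the query param into a keyword argument, can be sketched as follows. The function names and the example URL are hypothetical, and accepting only the literal value `1` is an assumption based on the `?hide_text=1` form documented here; the real view code may parse the flag differently.

```python
from urllib.parse import parse_qs, urlsplit


def wants_hide_text(url: str) -> bool:
    """Return True when the request URL carries hide_text=1 (assumed spelling)."""
    params = parse_qs(urlsplit(url).query)
    values = params.get("hide_text", [])
    # If the param is repeated, the last occurrence wins in this sketch.
    return bool(values) and values[-1] == "1"


def page_read_kwargs(url: str) -> dict[str, bool]:
    """Build extra keyword arguments for the page read call (hypothetical helper)."""
    # Only pass hide_text when requested, so the default call is unchanged.
    return {"hide_text": True} if wants_hide_text(url) else {}


# Hypothetical page URL, for illustration only.
kwargs = page_read_kwargs("/reader/page/3?hide_text=1")
```

Passing the kwarg only when the flag is set keeps the default request path identical to the pre-PR behaviour, which is one way to limit exposure while the dependency releases are still pending.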

Test plan

make fix clean
make lint clean (the only error is the pre-existing remark config issue on develop — see codex#738)
bin/test-python.sh — 48 passed
End-to-end verified through archive_cache.open(...).get_page_by_index(..., hide_text=True) on a real badly-OCR'd PDF: bytes differ from baseline, rendered output differs, text stays extractable
Manual: navigate the reader to a problem PDF, confirm ?hide_text=1 URL serves the un-doubled rendering and that selection still works in vue-pdf-embed

Out of scope

UI knob for the toggle (settings drawer / per-comic preference) — the query param is the minimum surface; happy to follow up with a reader setting if you want a clickable affordance.

🤖 Generated with Claude Code

Some scanned PDFs draw their OCR layer with rendering mode 0 (visible) on top of the page's rasterized scan, doubling the text under any renderer that respects the content stream, including PDF.js as embedded by vue-pdf-embed. Setting ``textLayer={false}`` client-side doesn't help because that prop only gates the selectable overlay, not text drawn from the content stream.

Forward the new ``hide_text`` kwarg from comicbox >= 3.0.1 (which forwards to comicbox-pdffile >= 0.5.1) when ``?hide_text=1`` is present on the page request. The PDF / pixmap response still contains the text content (only the rendering mode changes), so the selectable overlay continues to work.

The pyright/ty ignores cover the dev gap until ``pyproject.toml`` deps are bumped to the released versions of comicbox / comicbox-pdffile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ajslater merged commit 8572935 into develop May 9, 2026
3 checks passed

ajslater mentioned this pull request May 9, 2026: Revert "reader: ?hide_text=1 to suppress visible OCR text on PDF pages" #741 (merged)

ajslater deleted the claude/reader-hide-text-pdf branch May 11, 2026 00:10

ajslater merged 1 commit into develop from claude/reader-hide-text-pdf
