Cache file header bytes on PathInfo for detectors by ajslater · Pull Request #120 · ajslater/picopt

ajslater · 2026-05-05T06:04:42Z

Summary

PathInfo.header_bytes() lazily reads and caches the first 4 KB of the file (or full data when in-memory).
PdfDetector.identify() now consumes path_info.header_bytes() instead of opening the file and reading 1024 bytes itself.
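
The lazy read-and-cache behaviour described above can be sketched roughly as follows. This is an illustration, not picopt's actual code: the class name `PathInfoSketch`, the `HEADER_SIZE` constant and the `disk_reads` counter are hypothetical. Only the `_data` and `_header_bytes` attribute names and the `header_bytes()` method name come from this PR.

```python
from __future__ import annotations

from pathlib import Path

HEADER_SIZE = 4096  # 4 KB, the size quoted in the PR description


class PathInfoSketch:
    """Hypothetical stand-in for PathInfo; not picopt's real class."""

    def __init__(self, path: Path | None = None, data: bytes | None = None) -> None:
        self.path = path
        self._data = data  # set for in-archive entries held in memory
        self._header_bytes: bytes | None = None  # filled on first access
        self.disk_reads = 0  # illustration only: counts real file opens

    def header_bytes(self) -> bytes:
        """Return the cached file head, reading it at most once."""
        if self._header_bytes is None:
            if self._data is not None:
                # In-memory entry: reuse the full cached data, no disk I/O.
                self._header_bytes = self._data
            elif self.path is not None:
                with self.path.open("rb") as f:
                    self._header_bytes = f.read(HEADER_SIZE)
                self.disk_reads += 1
            else:
                self._header_bytes = b""
        return self._header_bytes
```

Under these assumptions, a second call to `header_bytes()` returns the cached value without reopening the file, and an entry constructed with `data=` never touches the disk.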

Why

PdfDetector is the only non-PIL detector that lacks a suffix gate — it opens every file PIL didn't recognize and reads 1024 bytes to check for %PDF-. The other archive detectors gate by suffix and short-circuit in <1us per file.

Impact

Microbenchmark of running the full non-PIL detector chain on non-image files (pyproject.toml, README.md, uv.lock, Makefile):

| | Before | After |
| --- | --- | --- |
| Total non-PIL detector chain | ~16us / file | ~2us / file |
| `PdfDetector.identify()` alone | ~14us | ~1us |

Modest absolute savings, but consistent on every non-image / non-PDF file. The new header_bytes slot also leaves room for future detectors that want a magic-byte gate without a fresh open().
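
A magic-byte gate over already-cached header bytes might look like the sketch below. The function name `looks_like_pdf` and the constants are hypothetical, and searching anywhere within the first 1024 bytes (rather than only at offset 0) is an assumption about how the `%PDF-` check is done; the PR states only that the magic check and the 1024-byte window are unchanged.

```python
PDF_MAGIC = b"%PDF-"
PDF_WINDOW = 1024  # the window the PR says is preserved via slicing


def looks_like_pdf(header: bytes) -> bool:
    """Check for the PDF magic inside the first PDF_WINDOW bytes.

    `header` is expected to be the cached file head (up to 4 KB),
    so no file is opened here. Assumption: the magic may appear
    anywhere in the window, not only at the very start.
    """
    return PDF_MAGIC in header[:PDF_WINDOW]
```

Because the gate takes plain bytes, a detector can call it on whatever `header_bytes()` returned; a non-PDF file such as pyproject.toml fails the check without another `open()`.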

Test plan

make test passes (150 passed, 6 skipped).
make fix, make lint, make ty all clean.
PDF detection still works for files with and without .pdf extension (unchanged behavior — same magic check, same 1024-byte window via slicing).
In-archive PathInfos (with _data set) populate header_bytes from cached data without disk I/O.

🤖 Generated with Claude Code

PdfDetector was opening every file and reading 1024 bytes for every detection attempt -- including non-PDF files where it would read, fail the magic check, and return None. On a directory of non-image non-archive files, this was ~14us per file (88% of the non-PIL detector chain cost).

PathInfo now lazily reads up to 4 KB of file head into _header_bytes on first access. PdfDetector consumes path_info.header_bytes() instead of reopening the file.

Local benchmark of running the full non-PIL detector chain on non-image files (pyproject.toml, README.md, uv.lock, Makefile): ~16us before -> ~2us after (8x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ajslater merged commit c708dc0 into develop May 5, 2026
2 checks passed

ajslater merged 1 commit into develop from claude/cache-detector-header-bytes
