Cache file header bytes on PathInfo for detectors #120
Merged
PdfDetector was opening every file and reading 1024 bytes for every detection attempt -- including non-PDF files where it would read, fail the magic check, and return None. On a directory of non-image non-archive files, this was ~14us per file (88% of the non-PIL detector chain cost).

PathInfo now lazily reads up to 4 KB of file head into _header_bytes on first access. PdfDetector consumes path_info.header_bytes() instead of reopening the file.

Local benchmark of running the full non-PIL detector chain on non-image files (pyproject.toml, README.md, uv.lock, Makefile): ~16us before -> ~2us after (8x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
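The caching described in this commit message can be illustrated with a minimal sketch. This is not the PR's code: the PathInfo class below is a hypothetical stand-in that keeps only the fields needed here, and looks_like_pdf, HEADER_SIZE, PDF_WINDOW, the disk_reads counter, and the OSError fallback are all assumptions made for illustration. It also assumes the magic check searches for %PDF- anywhere in the first 1024 bytes; the real detector may instead require it at offset 0.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional

HEADER_SIZE = 4096   # "up to 4 KB of file head" (name is an assumption)
PDF_WINDOW = 1024    # the window the detector read before the change
PDF_MAGIC = b"%PDF-"


class PathInfo:
    """Hypothetical stand-in for the project's PathInfo."""

    def __init__(self, path: Optional[Path] = None, data: Optional[bytes] = None) -> None:
        self.path = path
        self.data = data  # in-memory contents, if already loaded
        self._header_bytes: Optional[bytes] = None
        self.disk_reads = 0  # illustration only: counts real file reads

    def header_bytes(self) -> bytes:
        """Return the file head, reading from disk at most once."""
        if self._header_bytes is None:
            if self.data is not None:
                # In-memory case: reuse the cached data, no disk I/O.
                self._header_bytes = self.data
            elif self.path is not None:
                try:
                    with open(self.path, "rb") as f:
                        self._header_bytes = f.read(HEADER_SIZE)
                    self.disk_reads += 1
                except OSError:
                    # Assumption: an unreadable file yields an empty header.
                    self._header_bytes = b""
            else:
                self._header_bytes = b""
        return self._header_bytes


def looks_like_pdf(path_info: PathInfo) -> bool:
    """Magic check over the same 1024-byte window, taken by slicing."""
    return PDF_MAGIC in path_info.header_bytes()[:PDF_WINDOW]
```

Because the header is stored on the PathInfo instance, any later detector that calls header_bytes() on the same instance reuses the bytes rather than opening the file again.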
Summary
- PathInfo.header_bytes() lazily reads and caches the first 4 KB of the file (or the full data when in-memory).
- PdfDetector.identify() now consumes path_info.header_bytes() instead of opening the file and reading 1024 bytes itself.

Why
PdfDetector is the only non-PIL detector that lacks a suffix gate: it opens every file PIL didn't recognize and reads 1024 bytes to check for %PDF-. The other archive detectors gate by suffix and short-circuit in <1us per file.

Impact
Microbenchmark of running the full non-PIL detector chain on non-image files (pyproject.toml, README.md, uv.lock, Makefile):
- Full non-PIL detector chain: ~16us before, ~2us after (8x).
- PdfDetector.identify() alone: ~14us per file before the change (88% of the chain cost).

Modest absolute savings, but consistent on every non-image / non-PDF file. The new header_bytes slot also leaves room for future detectors that want a magic-byte gate without a fresh open().

Test plan
- make test passes (150 passed, 6 skipped).
- make fix, make lint, make ty all clean.
- [...] .pdf extension (unchanged behavior: same magic check, same 1024-byte window via slicing).
- [...] _dataset) populate header_bytes from cached data without disk I/O.

🤖 Generated with Claude Code
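The per-file cost discussed under Impact can be explored with a rough timeit sketch. This is not the PR's benchmark harness and will not reproduce its numbers: check_by_reopening, check_cached, and bench are hypothetical names, the sample file is synthetic, and the check assumes %PDF- may appear anywhere in the first 1024 bytes. It only compares reopening a file for each check against slicing header bytes that are already in memory.

```python
from __future__ import annotations

import tempfile
import timeit
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def check_by_reopening(path: Path) -> bool:
    """Old-style check: open the file and read 1024 bytes every time."""
    with open(path, "rb") as f:
        return PDF_MAGIC in f.read(1024)


def check_cached(header: bytes) -> bool:
    """New-style check: slice header bytes that were read once already."""
    return PDF_MAGIC in header[:1024]


def bench(n: int = 2000) -> tuple[float, float]:
    """Return (reopen, cached) cost in microseconds per call."""
    with tempfile.TemporaryDirectory() as d:
        path = Path(d) / "README.md"
        path.write_bytes(b"# not a pdf\n" * 50)
        # Read the header once, up front, as a cached header would be.
        with open(path, "rb") as f:
            header = f.read(4096)
        reopen = timeit.timeit(lambda: check_by_reopening(path), number=n)
        cached = timeit.timeit(lambda: check_cached(header), number=n)
    return reopen / n * 1e6, cached / n * 1e6


if __name__ == "__main__":
    reopen_us, cached_us = bench()
    print(f"reopen: {reopen_us:.2f}us/file  cached: {cached_us:.2f}us/file")
```

Absolute timings depend on the filesystem, OS page cache, and Python version, so only the relative gap between the two figures is meaningful.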