Cache file header bytes on PathInfo for detectors #120

Merged: ajslater merged 1 commit into develop from claude/cache-detector-header-bytes on May 5, 2026

@ajslater (Owner) commented May 5, 2026

Summary

  • PathInfo.header_bytes() lazily reads and caches the first 4 KB of the file (or full data when in-memory).
  • PdfDetector.identify() now consumes path_info.header_bytes() instead of opening the file and reading 1024 bytes itself.
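
The lazy read-and-cache behavior described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the class name `PathInfoSketch`, the constructor arguments, the `_path` attribute, and the `HEADER_SIZE` constant are hypothetical; only `header_bytes()`, `_data`, `_header_bytes`, and the 4 KB size come from the description.

```python
from __future__ import annotations

from pathlib import Path

HEADER_SIZE = 4096  # 4 KB, per the summary; the constant name is an assumption


class PathInfoSketch:
    """Hypothetical stand-in for PathInfo, reduced to the header cache."""

    def __init__(self, path: Path | None = None, data: bytes | None = None) -> None:
        self._path = path  # on-disk location, if any (assumed attribute)
        self._data = data  # in-memory contents, e.g. an archive member
        self._header_bytes: bytes | None = None  # filled on first access

    def header_bytes(self) -> bytes:
        """Return the cached file head, reading it at most once."""
        if self._header_bytes is None:
            if self._data is not None:
                # In-memory case: reuse the full data, no disk I/O.
                self._header_bytes = self._data
            elif self._path is not None:
                with self._path.open("rb") as f:
                    self._header_bytes = f.read(HEADER_SIZE)
            else:
                self._header_bytes = b""
        return self._header_bytes
```

Every detector that calls `header_bytes()` after the first one gets the same bytes object back, so the cost of `open()` is paid once per file rather than once per detector.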

Why

PdfDetector is the only non-PIL detector that lacks a suffix gate — it opens every file PIL didn't recognize and reads 1024 bytes to check for %PDF-. The other archive detectors gate by suffix and short-circuit in <1us per file.
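
For contrast with the per-file `open()` described above, a magic-byte check that works on already-cached header bytes might look like this. It is a sketch under assumptions: the function name `looks_like_pdf` and both constants are hypothetical, and it assumes the check is a substring search for `%PDF-` inside the first 1024 bytes, which the PR text implies but does not show.

```python
PDF_MAGIC = b"%PDF-"
PDF_MAGIC_WINDOW = 1024  # the window the detector previously read from disk


def looks_like_pdf(header: bytes) -> bool:
    """Check for the PDF magic inside the first 1024 bytes of a cached header.

    Slicing keeps the window identical to the old 1024-byte read even when
    the cached header is larger (4 KB, or the full in-memory data).
    """
    return PDF_MAGIC in header[:PDF_MAGIC_WINDOW]
```

Because the function takes bytes rather than a path, a non-PDF file costs one slice and one substring search instead of a system call.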

Impact

Microbenchmark of running the full non-PIL detector chain on non-image files (pyproject.toml, README.md, uv.lock, Makefile):

| | Before | After |
|---|---|---|
| Total non-PIL detector chain | ~16us / file | ~2us / file |
| PdfDetector.identify() alone | ~14us | ~1us |

Modest absolute savings, but consistent on every non-image / non-PDF file. The new header_bytes slot also leaves room for future detectors that want a magic-byte gate without a fresh open().
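
A comparison of this kind can be approximated with `timeit`. The script below is an illustrative sketch, not the benchmark used for the numbers above: the helper names, the temporary test file, and the iteration count are all assumptions, and absolute timings will vary by machine and filesystem.

```python
import tempfile
import timeit
from pathlib import Path

PDF_MAGIC = b"%PDF-"


def check_with_open(path: Path) -> bool:
    """Old shape: open the file and read 1024 bytes on every call."""
    with path.open("rb") as f:
        return PDF_MAGIC in f.read(1024)


def check_with_cached(header: bytes) -> bool:
    """New shape: slice bytes that were already read once."""
    return PDF_MAGIC in header[:1024]


def bench(number: int = 10_000) -> tuple[float, float]:
    """Return (open_us, cached_us), the mean microseconds per call."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "README.md"
        path.write_bytes(b"# not a pdf\n" * 100)
        header = path.read_bytes()[:4096]  # simulate the cached 4 KB head
        t_open = timeit.timeit(lambda: check_with_open(path), number=number)
        t_cached = timeit.timeit(lambda: check_with_cached(header), number=number)
    return t_open / number * 1e6, t_cached / number * 1e6


if __name__ == "__main__":
    open_us, cached_us = bench()
    print(f"open+read: {open_us:.2f}us/call  cached: {cached_us:.2f}us/call")
```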

Test plan

  • make test passes (150 passed, 6 skipped).
  • make fix, make lint, make ty all clean.
  • PDF detection still works for files with and without .pdf extension (unchanged behavior — same magic check, same 1024-byte window via slicing).
  • In-archive PathInfos (with _data set) populate header_bytes from cached data without disk I/O.

🤖 Generated with Claude Code

PdfDetector was opening every file and reading 1024 bytes for every
detection attempt -- including non-PDF files where it would read,
fail the magic check, and return None. On a directory of non-image
non-archive files, this was ~14us per file (88% of the non-PIL
detector chain cost).

PathInfo now lazily reads up to 4 KB of file head into _header_bytes
on first access. PdfDetector consumes path_info.header_bytes() instead
of reopening the file.

Local benchmark of running the full non-PIL detector chain on
non-image files (pyproject.toml, README.md, uv.lock, Makefile):
~16us before -> ~2us after (8x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajslater merged commit c708dc0 into develop on May 5, 2026
2 checks passed