Summary
The per-file read cap added by the bounded-reads work (50 MiB, enforced in build_context._validate_file_sizes and the analyzer guards) sits downstream of InputHandler.resolve(), which pulls the scan target from a URL, zip, or git clone with no size budget of its own. A large download or a decompression bomb exhausts memory or disk before the per-file gate runs, so the 50 MiB guarantee does not hold for url/zip/git inputs.
Specifics (src/skillspector/input_handler.py)
_download_file (157-159): client.get(url) followed by response.content buffers the whole body in memory, with no streaming and no Content-Length or byte-count limit. A multi-GB URL is a memory DoS.
_extract_zip (182): zf.extractall(extract_dir) with no uncompressed-size or file-count budget. A zip bomb fills the disk. (zipfile sanitises member names, so this is not Zip Slip / path traversal.)
_clone_git (131): git clone --depth 1 has a 60s timeout but no post-clone size cap; a large shallow repo still lands on disk.
Suggested direction
- Stream URL downloads with a hard byte ceiling, abort once exceeded.
- Bound zip extraction by total uncompressed size and member count (check ZipInfo.file_size before extracting).
- Cap clone size (a disk-usage check after clone, or a partial-clone filter), and surface the cap.
- Document in the README that 50 MiB is a per-file analysis limit, not an ingest limit.
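The first three bullets could be sketched roughly as below. This is a minimal illustration, not the project's code: the budget constants, IngestLimitExceeded, read_capped, check_zip_budget, and dir_size are all hypothetical names and values. read_capped takes any iterable of byte chunks so it stays independent of the HTTP library; if the client is httpx (an assumption, the issue does not name it), the chunks could come from response.iter_bytes() inside a client.stream("GET", url) block.

```python
import os
import zipfile
from pathlib import Path
from typing import Iterable

# Hypothetical budgets; none of these values come from the project.
MAX_DOWNLOAD_BYTES = 200 * 1024 * 1024
MAX_ZIP_TOTAL_BYTES = 500 * 1024 * 1024
MAX_ZIP_MEMBERS = 10_000


class IngestLimitExceeded(Exception):
    """Raised when an input exceeds an ingest budget (hypothetical type)."""


def read_capped(chunks: Iterable[bytes], max_bytes: int = MAX_DOWNLOAD_BYTES) -> bytes:
    """Accumulate streamed chunks, aborting before the ceiling is crossed."""
    buf = bytearray()
    for chunk in chunks:
        # Check before appending so the buffer never grows past max_bytes.
        if len(buf) + len(chunk) > max_bytes:
            raise IngestLimitExceeded(f"download exceeds {max_bytes} bytes")
        buf.extend(chunk)
    return bytes(buf)


def check_zip_budget(
    zf: zipfile.ZipFile,
    max_total_bytes: int = MAX_ZIP_TOTAL_BYTES,
    max_members: int = MAX_ZIP_MEMBERS,
) -> None:
    """Reject an archive by declared sizes before anything is extracted."""
    infos = zf.infolist()
    if len(infos) > max_members:
        raise IngestLimitExceeded(f"zip has more than {max_members} members")
    total = 0
    for info in infos:
        total += info.file_size  # declared uncompressed size of this member
        if total > max_total_bytes:
            raise IngestLimitExceeded(
                f"zip expands to more than {max_total_bytes} bytes"
            )


def dir_size(root: Path) -> int:
    """Sum file sizes under root without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total
```

In this sketch, check_zip_budget would run before zf.extractall(extract_dir), and dir_size would run on the clone directory after git clone returns. The zip check relies on sizes declared in the archive's own headers, so a stricter version might also count bytes as they are written.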
Tests missing
No coverage exists for the zip-bomb, oversized-URL, oversized-clone, or single-file fail-closed cases.
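A zip-bomb fixture does not need to be large on disk. The sketch below builds a highly compressible archive in memory and checks that it is small when compressed but large when expanded. make_compressible_zip and declared_uncompressed_size are hypothetical helpers, and the sizes are arbitrary. A real test would then hand the archive to InputHandler.resolve() and assert that it fails closed; that step is omitted because the issue does not specify the failure mode.

```python
import io
import zipfile


def make_compressible_zip(member_size: int, members: int = 1) -> bytes:
    """Build an in-memory zip whose members are runs of zero bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        payload = b"\0" * member_size
        for i in range(members):
            zf.writestr(f"member_{i}.bin", payload)
    return buf.getvalue()


def declared_uncompressed_size(data: bytes) -> int:
    """Total uncompressed size as declared in the archive's headers."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return sum(info.file_size for info in zf.infolist())


def test_fixture_is_small_compressed_but_large_expanded():
    ten_mib = 10 * 1024 * 1024
    data = make_compressible_zip(member_size=ten_mib)
    # Zeros deflate extremely well, so the archive stays far below 100 KiB.
    assert len(data) < 100 * 1024
    assert declared_uncompressed_size(data) == ten_mib
```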
Surfaced during review of the bounded-reads PR (#19) by @rng1995.