Ingest layer (URL / zip / git clone) is unbounded upstream of the 50 MiB per-file gate #131

Description

@wernerkasselman-au

Summary

The per-file read cap added by the bounded-reads work (50 MiB, enforced in build_context._validate_file_sizes and the analyzer guards) sits downstream of InputHandler.resolve(), which pulls the scan target from a URL, zip, or git clone with no size budget of its own. A large download or a decompression bomb exhausts memory or disk before the per-file gate runs, so the 50 MiB guarantee does not hold for url/zip/git inputs.

Specifics (src/skillspector/input_handler.py)

  • _download_file (157-159): client.get(url) then response.content buffers the whole body in memory, no streaming, no Content-Length or byte-count limit. A multi-GB URL is a memory DoS.
  • _extract_zip (182): zf.extractall(extract_dir) with no uncompressed-size or file-count budget. A zip bomb fills the disk. (zipfile sanitises member names, so this is not Zip Slip / path traversal.)
  • _clone_git (131): git clone --depth 1 has a 60s timeout but no post-clone size cap; a large shallow repo still lands on disk.
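
For the _download_file case, the gap is that nothing counts bytes as they arrive. Below is a minimal sketch of a counting copy, assuming the response body can be consumed as an iterable of byte chunks. The names copy_capped and IngestLimitExceeded are hypothetical and do not exist in input_handler.py.

```python
from typing import BinaryIO, Iterable


class IngestLimitExceeded(Exception):
    """Hypothetical error type for an ingest budget violation."""


def copy_capped(chunks: Iterable[bytes], dest: BinaryIO, max_bytes: int) -> int:
    """Write chunks to dest, aborting once the running total passes max_bytes.

    Returns the number of bytes written when the stream ends within budget.
    """
    written = 0
    for chunk in chunks:
        written += len(chunk)
        if written > max_bytes:
            # Fail closed: stop before writing the chunk that crosses the ceiling.
            raise IngestLimitExceeded(f"download exceeded {max_bytes} bytes")
        dest.write(chunk)
    return written
```

If the client is httpx (an assumption; the issue only shows client.get(url) and response.content), the chunks could come from response.iter_bytes() inside a client.stream("GET", url) block. A Content-Length check can reject obvious cases early, but the running count is what enforces the ceiling, since the header may be absent or wrong.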

Suggested direction

  • Stream URL downloads with a hard byte ceiling, abort once exceeded.
  • Bound zip extraction by total uncompressed size and member count (check ZipInfo.file_size before extracting).
  • Cap clone size (a disk-usage check after clone, or a partial-clone filter), and surface the cap.
  • Document in the README that 50 MiB is a per-file analysis limit, not an ingest limit.
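
The zip bullet above could be sketched roughly as follows, using only the standard zipfile module. The budget values, the function name extract_zip_bounded, and the ZipBudgetExceeded error are placeholders chosen for illustration, not values or names taken from the project.

```python
import zipfile
from pathlib import Path
from typing import Union

# Hypothetical budgets; the real values are a project decision.
MAX_TOTAL_UNCOMPRESSED = 200 * 1024 * 1024
MAX_MEMBERS = 10_000


class ZipBudgetExceeded(Exception):
    """Hypothetical error raised when an archive exceeds the ingest budget."""


def extract_zip_bounded(
    zip_path: Union[str, Path],
    extract_dir: Union[str, Path],
    max_total: int = MAX_TOTAL_UNCOMPRESSED,
    max_members: int = MAX_MEMBERS,
) -> int:
    """Extract zip_path into extract_dir only if it fits the budget.

    Returns the total declared uncompressed size in bytes.
    """
    with zipfile.ZipFile(zip_path) as zf:
        infos = zf.infolist()
        if len(infos) > max_members:
            raise ZipBudgetExceeded(f"{len(infos)} members exceeds {max_members}")
        # ZipInfo.file_size is the uncompressed size declared in the archive.
        declared = sum(info.file_size for info in infos)
        if declared > max_total:
            raise ZipBudgetExceeded(f"{declared} bytes exceeds {max_total}")
        # Nothing is written to disk unless both checks pass.
        zf.extractall(extract_dir)
        return declared
```

This checks the sizes the archive declares about itself before anything is written. A stricter variant would also count bytes while each member is decompressed, so that the budget does not depend on the archive's headers being honest.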

Tests missing

There is no coverage yet for the zip-bomb, oversized-URL, oversized-clone, or single-file fail-closed cases.
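
A zip-bomb test does not need a large fixture checked into the repo. One possible sketch builds a small, highly compressible archive in memory and reads back its declared uncompressed size. Both helper names are hypothetical, and wiring the fixture into the project's actual test suite is left open.

```python
import io
import zipfile


def make_compressible_zip(member_count: int, member_size: int) -> bytes:
    """Build an in-memory zip whose members are runs of zero bytes."""
    buf = io.BytesIO()
    payload = b"\0" * member_size
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i in range(member_count):
            zf.writestr(f"member_{i}.bin", payload)
    return buf.getvalue()


def declared_uncompressed_size(archive: bytes) -> int:
    """Sum the uncompressed sizes the archive declares, without extracting."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return sum(info.file_size for info in zf.infolist())
```

A test could write the returned bytes to a temporary file, pass that path to the zip ingest path, and assert that it fails closed with nothing left in the extraction directory.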

Surfaced during review of the bounded-reads PR (#19) by @rng1995.
