Bound build_context file reads and fail closed on oversized inputs#19
Bound build_context file reads and fail closed on oversized inputs#19wernerkasselman-au wants to merge 4 commits into
Conversation
|
I opened #21 for the ingest-layer hardening I mentioned above (the unbounded URL download, the |
|
While I was in this area I also opened #20, which is a different concern but a related one: detecting large blocks of whitespace used to hide prompt-injection instructions from a human reviewer (padding a file so the injected text sits well below, or off to the right of, what anyone sees on opening it). It is a detection pattern rather than ingest bounding, so it does not belong in this MR, but I am noting it here since it came out of the same read-through. |
rng1995
left a comment
There was a problem hiding this comment.
Thanks for the unusually candid writeup — the self-review of the known gaps is exactly the right way to raise something like this. The core bounded-reads work is well designed, but the diff as it stands does more than the description says, and that plus one real bug is why I'm requesting changes rather than approving. Reviewed the bounded-reads commit (a4b9fc83…) on its own, separate from the stacked #6 commit.
The bounded-reads core (the stated purpose) — looks good
_validate_file_sizesruns onstat().st_sizebefore_read_file_cache, so an oversized file aborts the scan instead of being cached as empty and reported "safe". That directly fixes the original concern, and failing closed (raising) is the right call for a security tool.- The byte-counted
raise_if_content_exceeds_limitguard added to the static / AST / taint / YARA entry points closes the "bypassbuild_contextand silently skip" path consistently, and switching fromlen(content)(chars) tolen(content.encode("utf-8"))(bytes) is the correct unit. - Chunked newline counting in
_count_linesand the bounded manifest prefix keep peak memory bounded for pathological inputs. Collapsing the cap to a single number aligned withstatic_runner.MAX_FILE_BYTESis good hygiene.
Blocking 1 — the PR is not actually self-contained; it bundles unrelated, conflicting changes
The description says this MR is "kept to the per-file build_context bounding (which is self-contained)", but commit a4b9fc83… also rewrites llm_analyzer_base.py (+133/−19) and meta_analyzer.py (+56/−3) with things that have nothing to do with bounded reads:
- rate-limit retry/backoff helpers (
_ainvoke_with_retry/_invoke_with_retry) and a configurableDEFAULT_MAX_CONCURRENCY— this is the same feature as #29; - per-batch failure isolation in
run_batches/arun_batches(try/except → skip/return None) plus a_partition_findings_by_batch_coveragefallback inmeta_analyzer— this is the same feature as #32; - Pydantic schema changes (
LLMFinding/MetaAnalyzerFinding: droppingge/lebounds, adding runtimefield_validators) for structured-output compatibility.
None of these are mentioned in the title or body, and the retry/isolation pieces directly overlap competing implementations in #29 and #32 (the PR is already mergeable_state: dirty). Please split this PR down to just the bounded-reads work so it matches its description, and let #29 / #32 own retry and batch-isolation (or coordinate which PR lands those). The Pydantic structured-output change in particular is a meaningful behavioral change to the LLM contract and deserves its own PR + rationale rather than riding along here.
Blocking 2 — import-time crash on an empty/invalid env var
DEFAULT_MAX_CONCURRENCY = int(os.environ.get("SKILLSPECTOR_MAX_CONCURRENCY", "2"))
_RATE_LIMIT_MAX_RETRIES = int(os.environ.get("SKILLSPECTOR_RATELIMIT_RETRIES", "5"))
_RATE_LIMIT_BASE_DELAY = float(os.environ.get("SKILLSPECTOR_RATELIMIT_BASE_DELAY", "2.0"))These run at module import. os.environ.get(key, default) only returns the default when the key is absent; a present-but-empty value (SKILLSPECTOR_MAX_CONCURRENCY=) or a non-numeric one makes int()/float() raise ValueError at import, which breaks import skillspector.llm_analyzer_base and takes the whole tool down before any scan runs. (Same class of bug as #29, but worse because it's at import time.) Parse defensively — strip, fall back to the default on empty/invalid, and clamp concurrency to ≥ 1. If the retry/concurrency code moves out per Blocking 1, fix it wherever it lands.
Please also confirm / track the gaps you flagged
Your self-review is accurate and I'd treat these as follow-ups, with one I'd prioritize:
- (Most important) The upstream
input_handleringest layer is unbounded — full-body URL buffering,extractallwith no uncompressed-size budget (zip bomb), and an unbounded clone — so the 50 MiB per-file gate can be defeated before it runs. A multi-GB download / decompression bomb is a real DoS for a scanner. Worth at least a tracking issue and a code/README note so the 50 MiB guarantee isn't overstated. - LLM
semantic_*path:get_batcheshas no size check, so a programmatic caller injecting an oversizedfile_cachestill gets through. --yara-rules-dirrule files are compiled with no size cap.- The manifest prefix read is in text mode, so it bounds characters, not UTF-8 bytes — the "same memory bound" comment is slightly off for multibyte content.
Also worth calling out
MAX_FILE_BYTESgoes from 1 MB to 50 MiB — a 50× increase. It's defensible (mid-size files that used to be silently skipped are now actually analyzed, which is better for detection, and it aligns the two limits), but it also means those files now get chunked and sent to the LLM, so it's a real resource/cost change. Please make sure that's intentional and ideally documented.- Not currently mergeable (dirty) — expected while stacked on #6; will need a rebase once #6 lands.
- Minor/non-blocking:
_count_linescounts only\n, so bare-\ror Unicode-separator line endings count differently thansplitlines()did. It's documented and not security-relevant.
Net: the bounded-reads design is solid and I'd be glad to approve it once it's scoped down to that, the import-time env parsing is hardened, and the ingest-layer gap at least has a tracking issue. Happy to keep iterating on the thread.
build_context read whole files into memory with no size guard, so a large or binary file in a scanned directory, zip, or cloned repo could exhaust memory. Bound the read paths and fail the scan (rather than silently skip) when a file exceeds the per-file analysis limit, so oversized content cannot hide from analysis and be reported as safe: - _MAX_READ_BYTES raised to 50 MiB and aligned with static_runner. - build_context aborts via _validate_file_sizes (on stat().st_size, before populating file_cache) when any discovered file exceeds the limit. - _count_lines counts newlines in fixed 64 KiB binary chunks, so a multi-GB or newline-free file is never materialized. - _parse_manifest reads only a bounded prefix (frontmatter lives at the top of the file). - Direct analyzer entry points (static_runner, behavioral_ast, behavioral_taint_tracking, static_yara) raise via raise_if_content_exceeds_limit, counting UTF-8 bytes (not Python chars), so bypassing build_context cannot reintroduce a silent skip. Adds regression tests for the fail-closed behavior and all bounded read paths; each fails against the pre-fix code. Signed-off-by: Werner Kasselman <145896621+wernerkasselman-au@users.noreply.github.com>
…ifest Close the bounded-reads follow-up gaps so the per-file limit holds on the paths the core commit left open: - get_batches calls raise_if_content_exceeds_limit per cached file, so a programmatic caller injecting an oversized file_cache on the semantic_* LLM path fails closed instead of being chunked and sent to the LLM. Imported lazily to avoid an analyzers-package import cycle. - _collect_rule_files skips operator-supplied --yara-rules-dir files over the byte limit (with a warning), so an oversized rule file cannot exhaust memory at yara.compile(). - _parse_manifest reads a bounded byte prefix (binary, then decode) rather than a text-mode read, so the cap is a true byte ceiling, not a character count that multibyte content could inflate. Adds a regression test for each path. Signed-off-by: Werner Kasselman <145896621+wernerkasselman-au@users.noreply.github.com>
raise_if_content_exceeds_limit re-encoded content that build_context decodes with errors="replace", so each undecodable on-disk byte (a 3-byte U+FFFD on re-encode) inflated the count. A legitimate binary file under the 50 MiB on-disk gate (_validate_file_sizes, via stat) could then exceed the guard and falsely abort the whole scan: a ~17 MiB binary re-encodes past 50 MiB. Character count is always <= the source UTF-8 byte length, so content the authoritative on-disk gate admitted never trips this guard, while a programmatically injected oversized string still does. The byte-accurate enforcement stays where it can be measured correctly: _validate_file_sizes on stat().st_size. Adds a regression test for the sub-limit multibyte case. Signed-off-by: Werner Kasselman <145896621+wernerkasselman-au@users.noreply.github.com>
a4b9fc8 to
dea25b0
Compare
Record that the per-file cap fails closed (not silently skips), that it is a per-file analysis limit rather than an ingest limit (the url/zip/git fetch is unbounded, tracked separately), and that raising it from 1 MiB to 50 MiB trades resource and token cost for better mid-size-file coverage. Signed-off-by: Werner Kasselman <145896621+wernerkasselman-au@users.noreply.github.com>
|
Thanks again, all fair. The split, the env crash, and the follow-ups are handled, and I have rebased onto current main now that #6 has landed, so the earlier dirty state is gone. Blocking 1 (not self-contained). Done. The PR is now just the per-file read bounding (12 files, +323/-29 against main): Blocking 2 (import-time crash). That code is out of this PR with the split. For the record, the concurrency work I am proposing no longer parses env at import at all: the module-level int/float reads are replaced by a helper called lazily from The follow-ups, now folded in (not deferred). On reflection these all sit inside the bounded-reads surface, so they belong here:
One change worth your eye: the analyzer guard now counts characters, not re-encoded bytes. You rightly liked the switch to MAX_FILE_BYTES, 1 MB to 50 MiB. Intentional, and you are right about the cost: it stops silently skipping mid-size files (better detection, aligns the two limits), but those files now get chunked and sent to the LLM, so it is a real resource and spend change, noted in the docs. The ingest layer (the one I would prioritize). Out of scope for this PR but tracked now at #131:
|
|
Whenever you get a chance, this one is ready for another look. I split it down to just the per-file read bounding as you asked, moved the env parsing to a lazy, defensive helper so that an empty or non-numeric value falls back to the default rather than crashing at import, and rebased onto current main now that #6 has landed, so the earlier dirty state is gone. The Pydantic structured-output change went out on its own as #133 (merged), with the rationale you wanted. No rush, just flagging that the three blockers from the 06-21 review are all handled now. |
Closes NVIDIA#21. Closes NVIDIA#131. The per-file analysis cap (MAX_FILE_BYTES, 1 MB) sits downstream of InputHandler.resolve(), which pulls scan targets from URLs, zips, or git clones with no size budget of its own. As a result, the per-file cap can be defeated upstream: a multi-GB URL is a memory DoS, a zip bomb fills the disk before extraction is gated, and a large clone lands on disk regardless of the per-file analysis budget. PR NVIDIA#19 explicitly deferred this; this PR addresses the deferred work. Adds two ingest budgets enforced at every remote/archive ingest path: - INGEST_MAX_BYTES (100 MiB): caps streamed URL downloads, total uncompressed size of zip archives, and post-clone disk usage of Git repos. - INGEST_MAX_ZIP_MEMBERS (10,000): caps the number of entries in a single zip (defends against the "many tiny files" zip-bomb variant). Per ingest path: - _download_file streams the body in 64 KiB chunks directly to a temp file inside the existing session temp dir, with a running byte counter. The cap check fires before each chunk is written, so the response body is never accumulated in memory. Content-Length is checked up-front when the server provides it (aborts before reading any body bytes); the streamed counter is authoritative if the header is missing or malformed. A breach mid-stream removes the partial file before the exception propagates, so an attacker who ships exactly INGEST_MAX_BYTES + 1 bytes cannot fill the temp dir. - _extract_zip sums ZipInfo.file_size across all members and checks the member count *before* calling extractall, so classic zip bombs (small archive, huge declared uncompressed size) are rejected without materialising any of the bomb on disk. - _clone_git measures the cloned tree's on-disk size after the existing 60s timeout completes and rejects + cleans up if it exceeds the cap. Symlinks are skipped to avoid runaway counts on malicious /dev/zero-style links. All limit breaches raise IngestLimitExceededError (subclass of ValueError so existing callers that catch ValueError keep working) with a clear message including the limit and the observed size. Adds 10 unit tests covering: under-cap success paths for URL/zip/clone; oversized Content-Length rejected before body read; chunked overflow caught by the streamed counter; *the partial download file is removed when a breach fires mid-stream*; *streaming to disk produces a file of the expected size with no intermediate in-memory concatenation*; declared-uncompressed-oversize zip rejected without extraction; member-count zip-bomb rejected; oversize clone rejected and cleaned up. Full suite: 730 passed, 12 skipped. README documents both caps and their relationship to the per-file analysis cap. Signed-off-by: Rohan Isawe <rohan.isawe@smartsheet.com>
rng1995
left a comment
There was a problem hiding this comment.
[Automated SkillSpector Review]
Re-review: all blockers resolved — approving.
- Scope: the PR is now just the bounded-reads work — the retry/concurrency (->#29), per-batch isolation (->#32), and Pydantic structured-output change (shipped separately as #133) are all removed;
llm_analyzer_base.pyis down to a singleraise_if_content_exceeds_limitguard andmeta_analyzer.pyis untouched. - Import-time env crash: gone with the removed concurrency code (no
int(os.environ...)at import remains). - Ingest-layer gap: documented in README (per-file analysis limit vs unbounded ingest) and tracked/bounded separately in #164.
The fail-closed design is sound: _validate_file_sizes (pre-read stat gate) plus consistent raise_if_content_exceeds_limit guards across static/AST/taint/YARA/LLM entry points, a well-reasoned char-vs-byte measurement (with a multibyte no-false-abort test), bounded _count_lines, and a byte-bounded manifest prefix — all well tested. Approving.
Hi Keshav,
This is the separate MR we agreed to on #6, holding the
_MAX_READ_BYTESwork that I pulled out so that one could stay scoped to the TP3 fix. It is stacked on #6 (the first commit shown here is #6's TP3 commit), so please merge #6 first, after which this diff reduces to the single bounded-reads commit.What it does:
static_runner.MAX_FILE_BYTES, so the limit lives as one number rather than two that can drift.build_contextfail closed: it now aborts in_validate_file_sizesbefore_read_file_cacheever runs, rather than silently caching an oversized file as empty. This is the heart of your original concern, a file large enough that we skip it should not come back reported as "safe".build_contextdoes not reintroduce a silent skip.Before raising this I had three other models review it independently against the actual code (not my description of it), and I want to be upfront about what they found, because it bears on exactly the discussion you wanted to have:
semantic_*analyzers.llm_analyzer_base.get_batcheshas no size check, so a programmatic caller that injects an oversizedfile_cachestill gets through on that path._parse_manifestdirectly.--yara-rules-dir) are compiled with no size cap. They are not skill components, so they sidestep the gate.input_handler.pyruns beforebuild_contextever sees a file, and it is unbounded. The URL download buffers the full body in memory,extractallhas no uncompressed-size bound so a "zip bomb" fills the disk, and the git clone has no post-clone size check. So the 50 MiB per-file gate can be defeated upstream, before it runs.My instinct, and I am happy to be argued out of it, is to keep this MR to the per-file
build_contextbounding (which is self-contained and tested at 605 passing), and to raise a separate issue for the ingest-layer hardening, since that is a bigger change with its own design questions (streaming downloads against a running byte counter, a total-uncompressed-size budget for zip extraction, a clone size cap). I can fold (1) to (3) into this MR if you would rather have them here, or split them out as well, whichever you prefer.Please add comments inline or on the thread, whichever is easier for you.