Skip to content

Bound build_context file reads and fail closed on oversized inputs#19

Open
wernerkasselman-au wants to merge 4 commits into
NVIDIA:mainfrom
wernerkasselman-au:fix/bound-build-context-reads
Open

Bound build_context file reads and fail closed on oversized inputs#19
wernerkasselman-au wants to merge 4 commits into
NVIDIA:mainfrom
wernerkasselman-au:fix/bound-build-context-reads

Conversation

@wernerkasselman-au

Copy link
Copy Markdown
Contributor

Hi Keshav,

This is the separate MR we agreed to on #6, holding the _MAX_READ_BYTES work that I pulled out so that one could stay scoped to the TP3 fix. It is stacked on #6 (the first commit shown here is #6's TP3 commit), so please merge #6 first, after which this diff reduces to the single bounded-reads commit.

What it does:

  • raises the per-file cap to 50 MiB and aligns it with static_runner.MAX_FILE_BYTES, so the limit lives as one number rather than two that can drift.
  • makes build_context fail closed: it now aborts in _validate_file_sizes before _read_file_cache ever runs, rather than silently caching an oversized file as empty. This is the heart of your original concern, a file large enough that we skip it should not come back reported as "safe".
  • counts lines in fixed 64 KiB binary chunks (so a multi-GB, newline-free file is never materialised whole).
  • reads only a bounded prefix of an oversized SKILL.md, since the frontmatter sits at the top of the file.
  • adds the same byte-counted guard to the direct analyzer entry points (static, AST, taint, YARA), so a caller that bypasses build_context does not reintroduce a silent skip.

Before raising this I had three other models review it independently against the actual code (not my description of it), and I want to be upfront about what they found, because it bears on exactly the discussion you wanted to have:

  1. the fail-closed guard does not yet cover the LLM semantic_* analyzers. llm_analyzer_base.get_batches has no size check, so a programmatic caller that injects an oversized file_cache still gets through on that path.
  2. the manifest prefix read is in text mode, so the cap is counting characters and not UTF-8 bytes (the comment claims bytes), which only matters for multibyte content reaching _parse_manifest directly.
  3. user-supplied YARA rule files (via --yara-rules-dir) are compiled with no size cap. They are not skill components, so they sidestep the gate.
  4. the larger one, which is really a surrounding issue rather than something this MR introduces: the whole ingest layer in input_handler.py runs before build_context ever sees a file, and it is unbounded. The URL download buffers the full body in memory, extractall has no uncompressed-size bound so a "zip bomb" fills the disk, and the git clone has no post-clone size check. So the 50 MiB per-file gate can be defeated upstream, before it runs.

My instinct, and I am happy to be argued out of it, is to keep this MR to the per-file build_context bounding (which is self-contained and tested at 605 passing), and to raise a separate issue for the ingest-layer hardening, since that is a bigger change with its own design questions (streaming downloads against a running byte counter, a total-uncompressed-size budget for zip extraction, a clone size cap). I can fold (1) to (3) into this MR if you would rather have them here, or split them out as well, whichever you prefer.

Please add comments inline or on the thread, whichever is easier for you.

@wernerkasselman-au

Copy link
Copy Markdown
Contributor Author

I opened #21 for the ingest-layer hardening I mentioned above (the unbounded URL download, the extractall zip-bomb, the git clone with no size check), so this MR can stay scoped to the per-file build_context bounding. A multi-model review of the issue draft turned up two more paths worth folding into that work, the single-file shutil.copy2 in _wrap_single_file, and symlinks planted in a zip or clone (a symlink to /etc/passwd passes the size gate today and gets read), so #21 now covers those as well.

@wernerkasselman-au

Copy link
Copy Markdown
Contributor Author

While I was in this area I also opened #20, which is a different concern but a related one: detecting large blocks of whitespace used to hide prompt-injection instructions from a human reviewer (padding a file so the injected text sits well below, or off to the right of, what anyone sees on opening it). It is a detection pattern rather than ingest bounding, so it does not belong in this MR, but I am noting it here since it came out of the same read-through.

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the unusually candid writeup — the self-review of the known gaps is exactly the right way to raise something like this. The core bounded-reads work is well designed, but the diff as it stands does more than the description says, and that plus one real bug is why I'm requesting changes rather than approving. Reviewed the bounded-reads commit (a4b9fc83…) on its own, separate from the stacked #6 commit.

The bounded-reads core (the stated purpose) — looks good

  • _validate_file_sizes runs on stat().st_size before _read_file_cache, so an oversized file aborts the scan instead of being cached as empty and reported "safe". That directly fixes the original concern, and failing closed (raising) is the right call for a security tool.
  • The byte-counted raise_if_content_exceeds_limit guard added to the static / AST / taint / YARA entry points closes the "bypass build_context and silently skip" path consistently, and switching from len(content) (chars) to len(content.encode("utf-8")) (bytes) is the correct unit.
  • Chunked newline counting in _count_lines and the bounded manifest prefix keep peak memory bounded for pathological inputs. Collapsing the cap to a single number aligned with static_runner.MAX_FILE_BYTES is good hygiene.

Blocking 1 — the PR is not actually self-contained; it bundles unrelated, conflicting changes
The description says this MR is "kept to the per-file build_context bounding (which is self-contained)", but commit a4b9fc83… also rewrites llm_analyzer_base.py (+133/−19) and meta_analyzer.py (+56/−3) with things that have nothing to do with bounded reads:

  • rate-limit retry/backoff helpers (_ainvoke_with_retry / _invoke_with_retry) and a configurable DEFAULT_MAX_CONCURRENCY — this is the same feature as #29;
  • per-batch failure isolation in run_batches/arun_batches (try/except → skip/return None) plus a _partition_findings_by_batch_coverage fallback in meta_analyzer — this is the same feature as #32;
  • Pydantic schema changes (LLMFinding / MetaAnalyzerFinding: dropping ge/le bounds, adding runtime field_validators) for structured-output compatibility.

None of these are mentioned in the title or body, and the retry/isolation pieces directly overlap competing implementations in #29 and #32 (the PR is already mergeable_state: dirty). Please split this PR down to just the bounded-reads work so it matches its description, and let #29 / #32 own retry and batch-isolation (or coordinate which PR lands those). The Pydantic structured-output change in particular is a meaningful behavioral change to the LLM contract and deserves its own PR + rationale rather than riding along here.

Blocking 2 — import-time crash on an empty/invalid env var

DEFAULT_MAX_CONCURRENCY = int(os.environ.get("SKILLSPECTOR_MAX_CONCURRENCY", "2"))
_RATE_LIMIT_MAX_RETRIES = int(os.environ.get("SKILLSPECTOR_RATELIMIT_RETRIES", "5"))
_RATE_LIMIT_BASE_DELAY = float(os.environ.get("SKILLSPECTOR_RATELIMIT_BASE_DELAY", "2.0"))

These run at module import. os.environ.get(key, default) only returns the default when the key is absent; a present-but-empty value (SKILLSPECTOR_MAX_CONCURRENCY=) or a non-numeric one makes int()/float() raise ValueError at import, which breaks import skillspector.llm_analyzer_base and takes the whole tool down before any scan runs. (Same class of bug as #29, but worse because it's at import time.) Parse defensively — strip, fall back to the default on empty/invalid, and clamp concurrency to ≥ 1. If the retry/concurrency code moves out per Blocking 1, fix it wherever it lands.

Please also confirm / track the gaps you flagged
Your self-review is accurate and I'd treat these as follow-ups, with one I'd prioritize:

  • (Most important) The upstream input_handler ingest layer is unbounded — full-body URL buffering, extractall with no uncompressed-size budget (zip bomb), and an unbounded clone — so the 50 MiB per-file gate can be defeated before it runs. A multi-GB download / decompression bomb is a real DoS for a scanner. Worth at least a tracking issue and a code/README note so the 50 MiB guarantee isn't overstated.
  • LLM semantic_* path: get_batches has no size check, so a programmatic caller injecting an oversized file_cache still gets through.
  • --yara-rules-dir rule files are compiled with no size cap.
  • The manifest prefix read is in text mode, so it bounds characters, not UTF-8 bytes — the "same memory bound" comment is slightly off for multibyte content.

Also worth calling out

  • MAX_FILE_BYTES goes from 1 MB to 50 MiB — a 50× increase. It's defensible (mid-size files that used to be silently skipped are now actually analyzed, which is better for detection, and it aligns the two limits), but it also means those files now get chunked and sent to the LLM, so it's a real resource/cost change. Please make sure that's intentional and ideally documented.
  • Not currently mergeable (dirty) — expected while stacked on #6; will need a rebase once #6 lands.
  • Minor/non-blocking: _count_lines counts only \n, so bare-\r or Unicode-separator line endings count differently than splitlines() did. It's documented and not security-relevant.

Net: the bounded-reads design is solid and I'd be glad to approve it once it's scoped down to that, the import-time env parsing is hardened, and the ingest-layer gap at least has a tracking issue. Happy to keep iterating on the thread.

build_context read whole files into memory with no size guard, so a
large or binary file in a scanned directory, zip, or cloned repo could
exhaust memory. Bound the read paths and fail the scan (rather than
silently skip) when a file exceeds the per-file analysis limit, so
oversized content cannot hide from analysis and be reported as safe:

- _MAX_READ_BYTES raised to 50 MiB and aligned with static_runner.
- build_context aborts via _validate_file_sizes (on stat().st_size,
  before populating file_cache) when any discovered file exceeds the
  limit.
- _count_lines counts newlines in fixed 64 KiB binary chunks, so a
  multi-GB or newline-free file is never materialized.
- _parse_manifest reads only a bounded prefix (frontmatter lives at the
  top of the file).
- Direct analyzer entry points (static_runner, behavioral_ast,
  behavioral_taint_tracking, static_yara) raise via
  raise_if_content_exceeds_limit, counting UTF-8 bytes (not Python
  chars), so bypassing build_context cannot reintroduce a silent skip.

Adds regression tests for the fail-closed behavior and all bounded read
paths; each fails against the pre-fix code.

Signed-off-by: Werner Kasselman <145896621+wernerkasselman-au@users.noreply.github.com>
…ifest

Close the bounded-reads follow-up gaps so the per-file limit holds on the
paths the core commit left open:

- get_batches calls raise_if_content_exceeds_limit per cached file, so a
  programmatic caller injecting an oversized file_cache on the semantic_*
  LLM path fails closed instead of being chunked and sent to the LLM.
  Imported lazily to avoid an analyzers-package import cycle.
- _collect_rule_files skips operator-supplied --yara-rules-dir files over
  the byte limit (with a warning), so an oversized rule file cannot
  exhaust memory at yara.compile().
- _parse_manifest reads a bounded byte prefix (binary, then decode)
  rather than a text-mode read, so the cap is a true byte ceiling, not a
  character count that multibyte content could inflate.

Adds a regression test for each path.

Signed-off-by: Werner Kasselman <145896621+wernerkasselman-au@users.noreply.github.com>
raise_if_content_exceeds_limit re-encoded content that build_context
decodes with errors="replace", so each undecodable on-disk byte (a 3-byte
U+FFFD on re-encode) inflated the count. A legitimate binary file under the
50 MiB on-disk gate (_validate_file_sizes, via stat) could then exceed the
guard and falsely abort the whole scan: a ~17 MiB binary re-encodes past
50 MiB.

Character count is always <= the source UTF-8 byte length, so content the
authoritative on-disk gate admitted never trips this guard, while a
programmatically injected oversized string still does. The byte-accurate
enforcement stays where it can be measured correctly: _validate_file_sizes
on stat().st_size. Adds a regression test for the sub-limit multibyte case.

Signed-off-by: Werner Kasselman <145896621+wernerkasselman-au@users.noreply.github.com>
Record that the per-file cap fails closed (not silently skips), that it is
a per-file analysis limit rather than an ingest limit (the url/zip/git
fetch is unbounded, tracked separately), and that raising it from 1 MiB to
50 MiB trades resource and token cost for better mid-size-file coverage.

Signed-off-by: Werner Kasselman <145896621+wernerkasselman-au@users.noreply.github.com>
@wernerkasselman-au

Copy link
Copy Markdown
Contributor Author

Thanks again, all fair. The split, the env crash, and the follow-ups are handled, and I have rebased onto current main now that #6 has landed, so the earlier dirty state is gone.

Blocking 1 (not self-contained). Done. The PR is now just the per-file read bounding (12 files, +323/-29 against main): build_context.py, static_runner.py, static_yara.py, behavioral_ast.py, behavioral_taint_tracking.py, llm_analyzer_base.py (one guard, see below) and their tests. I dropped the three pieces that did not belong: the retry/backoff and configurable concurrency (overlaps #29), the per-batch isolation and the _partition_findings_by_batch_coverage fallback (overlaps #32), and the Pydantic structured-output change (LLMFinding / MetaAnalyzerFinding ge/le to field_validator), which I will raise as its own PR with the rationale since it changes the LLM contract. meta_analyzer.py is untouched here, and I am happy to coordinate with whoever lands retry and batch-isolation.

Blocking 2 (import-time crash). That code is out of this PR with the split. For the record, the concurrency work I am proposing no longer parses env at import at all: the module-level int/float reads are replaced by a helper called lazily from resolve_max_concurrency() and the retry paths, which strips, falls back on empty or non-numeric input, warns, and clamps concurrency to a minimum of 1, so SKILLSPECTOR_MAX_CONCURRENCY= yields the default rather than a traceback.

The follow-ups, now folded in (not deferred). On reflection these all sit inside the bounded-reads surface, so they belong here:

  • get_batches now calls raise_if_content_exceeds_limit per cached file, so a programmatic caller injecting an oversized file_cache on the semantic_* path fails closed instead of being chunked and sent to the LLM. It is imported lazily inside the function, since a module-level import would cycle through the analyzers package __init__ (which imports the semantic_* analyzers, which import llm_analyzer_base).
  • --yara-rules-dir rule files are now stat-capped in _collect_rule_files: an operator rule file over the limit is skipped with a warning (rule files are config, not the artifact under analysis, so skip rather than abort).
  • _parse_manifest now reads a bounded binary prefix and decodes, so the cap is a true byte ceiling rather than a character count, closing the multibyte gap you flagged in the "same memory bound" comment.

One change worth your eye: the analyzer guard now counts characters, not re-encoded bytes. You rightly liked the switch to len(content.encode("utf-8")), but it mis-fires on the build_context path: _read_file_cache decodes with errors="replace", so an undecodable on-disk byte becomes a 3-byte U+FFFD on re-encode. A legitimate binary file that passes the authoritative on-disk gate (_validate_file_sizes, stat) then inflates past 50 MiB on re-encode and aborts the whole scan (a ~17 MiB binary re-encodes to ~52 MiB). So raise_if_content_exceeds_limit now measures len(content) characters, always <= the source UTF-8 byte length, so content the on-disk gate admitted can never trip it while an injected oversized string still does. The byte-accurate enforcement stays where it can be measured correctly, on stat().st_size in _validate_file_sizes. There is a regression test for the sub-limit multibyte case.

MAX_FILE_BYTES, 1 MB to 50 MiB. Intentional, and you are right about the cost: it stops silently skipping mid-size files (better detection, aligns the two limits), but those files now get chunked and sent to the LLM, so it is a real resource and spend change, noted in the docs.

The ingest layer (the one I would prioritize). Out of scope for this PR but tracked now at #131: _download_file buffers the full response.content (input_handler.py:159), _extract_zip calls zf.extractall with no uncompressed-size or file-count budget (input_handler.py:182), and the shallow clone has a 60s timeout but no post-clone size cap (input_handler.py:131), so a multi-GB download or a decompression bomb defeats the 50 MiB per-file gate before it runs (zipfile sanitises member names, so DoS not path traversal). The README note makes clear the 50 MiB figure is a per-file analysis limit, not an ingest limit.

_count_lines counting only \n: acknowledged, documented, not security-relevant; happy to switch to a splitlines-equivalent if you would prefer parity.

@wernerkasselman-au

Copy link
Copy Markdown
Contributor Author

Whenever you get a chance, this one is ready for another look. I split it down to just the per-file read bounding as you asked, moved the env parsing to a lazy, defensive helper so that an empty or non-numeric value falls back to the default rather than crashing at import, and rebased onto current main now that #6 has landed, so the earlier dirty state is gone. The Pydantic structured-output change went out on its own as #133 (merged), with the rationale you wanted. No rush, just flagging that the three blockers from the 06-21 review are all handled now.

rcha0s pushed a commit to rcha0s/SkillSpector that referenced this pull request Jun 23, 2026
Closes NVIDIA#21. Closes NVIDIA#131.

The per-file analysis cap (MAX_FILE_BYTES, 1 MB) sits downstream of
InputHandler.resolve(), which pulls scan targets from URLs, zips, or
git clones with no size budget of its own.  As a result, the per-file
cap can be defeated upstream: a multi-GB URL is a memory DoS, a zip
bomb fills the disk before extraction is gated, and a large clone
lands on disk regardless of the per-file analysis budget.  PR NVIDIA#19
explicitly deferred this; this PR addresses the deferred work.

Adds two ingest budgets enforced at every remote/archive ingest path:

- INGEST_MAX_BYTES (100 MiB): caps streamed URL downloads, total
  uncompressed size of zip archives, and post-clone disk usage of
  Git repos.
- INGEST_MAX_ZIP_MEMBERS (10,000): caps the number of entries in a
  single zip (defends against the "many tiny files" zip-bomb variant).

Per ingest path:

- _download_file streams the body in 64 KiB chunks directly to a temp
  file inside the existing session temp dir, with a running byte
  counter.  The cap check fires before each chunk is written, so the
  response body is never accumulated in memory.  Content-Length is
  checked up-front when the server provides it (aborts before reading
  any body bytes); the streamed counter is authoritative if the
  header is missing or malformed.  A breach mid-stream removes the
  partial file before the exception propagates, so an attacker who
  ships exactly INGEST_MAX_BYTES + 1 bytes cannot fill the temp dir.
- _extract_zip sums ZipInfo.file_size across all members and checks
  the member count *before* calling extractall, so classic zip bombs
  (small archive, huge declared uncompressed size) are rejected
  without materialising any of the bomb on disk.
- _clone_git measures the cloned tree's on-disk size after the
  existing 60s timeout completes and rejects + cleans up if it
  exceeds the cap.  Symlinks are skipped to avoid runaway counts on
  malicious /dev/zero-style links.

All limit breaches raise IngestLimitExceededError (subclass of
ValueError so existing callers that catch ValueError keep working)
with a clear message including the limit and the observed size.

Adds 10 unit tests covering: under-cap success paths for URL/zip/clone;
oversized Content-Length rejected before body read; chunked overflow
caught by the streamed counter; *the partial download file is removed
when a breach fires mid-stream*; *streaming to disk produces a file of
the expected size with no intermediate in-memory concatenation*;
declared-uncompressed-oversize zip rejected without extraction;
member-count zip-bomb rejected; oversize clone rejected and cleaned
up.  Full suite: 730 passed, 12 skipped.

README documents both caps and their relationship to the per-file
analysis cap.

Signed-off-by: Rohan Isawe <rohan.isawe@smartsheet.com>

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Automated SkillSpector Review]

Re-review: all blockers resolved — approving.

  1. Scope: the PR is now just the bounded-reads work — the retry/concurrency (->#29), per-batch isolation (->#32), and Pydantic structured-output change (shipped separately as #133) are all removed; llm_analyzer_base.py is down to a single raise_if_content_exceeds_limit guard and meta_analyzer.py is untouched.
  2. Import-time env crash: gone with the removed concurrency code (no int(os.environ...) at import remains).
  3. Ingest-layer gap: documented in README (per-file analysis limit vs unbounded ingest) and tracked/bounded separately in #164.

The fail-closed design is sound: _validate_file_sizes (pre-read stat gate) plus consistent raise_if_content_exceeds_limit guards across static/AST/taint/YARA/LLM entry points, a well-reasoned char-vs-byte measurement (with a multibyte no-false-abort test), bounded _count_lines, and a byte-bounded manifest prefix — all well tested. Approving.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants