fix(sieve): utilize constrained depth file discovery for dependency checks#225
fix(sieve): utilize constrained depth file discovery for dependency checks#225Jaydeep869 wants to merge 2 commits into
Conversation
mlieberman85
left a comment
There was a problem hiding this comment.
Thanks for tackling this! CI lint is red — uv run ruff check --fix . will clear the W293 / SIM114 warnings in the new helper. Once that's clean, two things need a closer look:
-
Possible bug — at
builtin_handlers.py:64,for file in files + dirs:iterates over the combined files-and-directories lists fromos.walk. That means a directory namedpackage.jsonwould match apackage.jsonfile pattern. Did you meanfor file in files:? -
Sieve-wide blast radius —
max_depth=3is now the default for every control usingfile_existsandregex, not just the nested-dependency case from #221. That's a behavioral change across ~60 controls.For this PR, let's keep the current default behavior unchanged (unbounded recursive glob) and make the new depth-limited search opt-in via TOML — controls that need it set
max_depth = 3explicitly, everyone else keeps today's semantics. That keeps scope tight and unblocks #221 without risking regressions elsewhere.I'll open a follow-up issue to explore whether depth-limited search should become the default later (with a proper audit-regression check).
Minor: the pattern.replace("**/", "") strip-and-retry fallback isn't standard glob semantics. If you need recursive matching with a depth cap, pathlib.Path.rglob plus a depth filter would be more predictable than the current branching.
|
Follow-up issue for evaluating depth-limited search as the default: #226 |
|
Thanks for the work here, @Jaydeep869. I dug into this and there's a gap I want to flag before we merge. The PR ships the capability but no control opts inAfter commit 21399f3 made $ grep -c max_depth packages/darnit-baseline/openssf-baseline.toml
0I then reproduced issue #221 directly against this branch: # /tmp/nested-test/src/app/package.json exists
config = {
"files": ["package.json", "package-lock.json", "requirements.txt",
"pyproject.toml", "Cargo.toml", "go.mod", "go.sum",
"pom.xml", "build.gradle", "Gemfile.lock"],
} # exact OSPS-BR-05.01 config from the TOML
ctx = HandlerContext(local_path="/tmp/nested-test", control_id="OSPS-BR-05.01")
file_exists_handler(config, ctx)
# Status: fail
# Msg: None of the required files found
config["max_depth"] = 3
file_exists_handler(config, ctx)
# Status: pass
# Msg: Required file found: src/app/package.jsonSo as currently configured, a user running Why opt-in is correct (don't change this part)Several controls intentionally check root-only because finding the file anywhere would be a false positive: Suggested fixAdd [[controls."OSPS-BR-05.01".passes]]
handler = "file_exists"
max_depth = 3
files = [
"package.json", "package-lock.json", ...
]Worth applying to other dep-adjacent controls too (e.g. Smaller items while we're here
|
21399f3 to
723300e
Compare
|
Hey @mlieberman85, can you review this pr now. |
mlieberman85
left a comment
There was a problem hiding this comment.
Code Review: Constrained Depth File Discovery
Overview
Fixes #221 by replacing root-only file discovery in file_exists_handler and regex_handler with a depth-bounded walk (os.walk + ignore-list pruning). New optional max_depth config field on passes; OSPS-BR-05.01 opts in with max_depth = 3 so monorepo manifests like src/app/package.json are found. Adds three tests.
What works
- Targeted fix lands cleanly on the reported bug (BR-05.01) without changing other controls' default behavior.
os.walk+dirs[:] = [...]pruning is the correct idiom for skippingnode_modules/.gitcheaply.- Sorted, deduped output (
sorted(set(found))) is deterministic. - Schema compatibility:
PassConfigusesextra="allow", somax_depthflows through without schema changes. - Tests cover both directions: nested-found (positive) and
node_modules-pruned (negative).
Issues to address
1. Docstring contradicts behavior (builtin_handlers.py:115)
max_depth: int - Max directory depth to search (default: 3)
Actual default is None (unbounded glob, root-only for non-globs). Either change the docstring to (default: None, no depth limit), or actually default to 3 in the handler. Misleading docstrings here are a footgun for TOML authors.
2. Subtle behavior change for file_exists_handler glob patterns
Old: glob.glob(path) — recursive=False.
New (max_depth=None branch): glob.glob(path, recursive=True).
For * patterns this is identical, but **/... patterns now recurse where before they didn't. Practically no current TOML uses **/ with file_exists, but worth a note in the commit message and ideally a regression test asserting backward-compat for the unbounded path.
3. Code/comment mismatch on boundary-depth dirs (builtin_handlers.py:80–88)
if depth >= max_depth:
dirs.clear()
...
for item in files + dirs: # dirs is now emptyThe comment claims "directory names at exactly the boundary depth are still yielded in dirs and can match" — but dirs.clear() empties the list before iteration, so they can't. Either fix the comment, or move the dirs.clear() to after the match loop if matching boundary-depth dirs is intentional. Low real-world impact, but the comment will mislead the next reader.
4. Issue #221 mentions go.mod and pyproject.toml — those live in other controls and didn't get max_depth = 3 here. If the goal is to fully resolve the bug as filed, audit other dependency-manifest controls (requirements.txt, Cargo.toml, go.mod, pyproject.toml, Gemfile, pom.xml, build.gradle) for the same opt-in. Otherwise the bug recurs for non-JS monorepos.
5. ignore_dirs set is JS/Python-biased
Missing common large dirs: target (Rust/Java), vendor (Go/PHP), bin, obj, .gradle, .next, .nuxt, .cache, .pytest_cache, .mypy_cache. Consider expanding — pruning is essentially free and avoids surprise slowdowns.
Suggestions
- The three-strategy match (pathlib + fnmatch-stripped + plain fnmatch) is over-engineered for the patterns in use.
PurePath(rel_path).match(pattern)alone covers all current TOML patterns and is easier to reason about. Consider simplifying unless there's a concrete pattern that needs the fallbacks. - Hoist imports (
fnmatch,glob,PurePath) to the module top. The existing per-callimport globis a pre-existing wart, not a convention worth preserving. - Default
max_depth = 3for the helper, opt-out viaNone, rather than opt-in. Most controls would benefit from depth-limited search; few benefit from unbounded recursion. Follow-up PR if scope-limiting this one.
Risk
Low for the targeted scope (BR-05.01). Medium-low elsewhere — the recursive=True flip in the unbounded path is the main thing to watch in CI on real repos.
Test coverage
Adequate for the new code paths. Missing: a backward-compat assertion that file_exists_handler({"files": ["pkg.json"]}, ctx) (no max_depth) behaves exactly as before.
Verdict: Docstring fix (#1) and comment fix (#3) as blockers; the rest as follow-ups or commit-message notes.
…ecks Add max_depth support to _find_files_with_depth() so dependency controls can discover nested manifests (e.g. src/app/package.json) without imposing unbounded recursive scans. The default behavior (max_depth=None) remains unchanged—unbounded glob—keeping all existing controls unaffected. Key changes: - builtin_handlers.py: clarify match-expression precedence by splitting the or/and chain into named booleans (pathlib_match, recursive_match, simple_match). Document recursive=True in the unbounded-glob fast path, and add a comment explaining dirs.clear() semantics at the depth boundary. - openssf-baseline.toml: opt OSPS-BR-05.01 into max_depth=3 so nested dependency manifests are correctly discovered. Metadata-file controls (LICENSE, SECURITY, CODE_OF_CONDUCT) are intentionally left unbounded to avoid false positives from vendored copies. - tests: move test_pass_when_file_exists_nested into TestFileExistsHandler (was misplaced in TestProjectUpdateHandler), add coverage for regex_handler with max_depth, and add ignore_dirs pruning assertion. Fixes kusari-oss#221 Signed-off-by: jaydeep869 <jaydeeppokhariya2106@gmail.com>
- Update docstring for max_depth in file_exists_handler - Fix boundary-depth directory iteration bug inside _find_files_with_depth - Expand ignore_dirs list for other language ecosystems (target, vendor, etc.) - Opt-in max_depth=3 for OSPS-QA-02.01 so standard manifests like go.mod are found Note: Unbounded globs now default to passing recursive=True down to glob. Signed-off-by: jaydeep869 <jaydeeppokhariya2106@gmail.com>
e5b8315 to
af83acf
Compare
Summary
Fixes #221. Modified the Sieve dependency checkers (
file_exists_handlerandregex_handler) by switching from unbounded recursive globbing to a constrained depth search (os.walklimited to 3 directories deep). This efficiently discovers nested manifests likesrc/app/package.jsonwhile explicitly rejecting large unneeded directories (.git,node_modules,venv, etc.) without incurring the massive performance penalty of full recursive scans.Type of Change
Framework Changes Checklist
If this PR modifies the darnit framework (
packages/darnit/):openspec/specs/framework-design/spec.md) if behavior changeduv run python scripts/validate_sync.py --verboseand it passesuv run python scripts/generate_docs.pyand committed any doc changesControl/TOML Changes Checklist
If this PR modifies controls or TOML configuration:
Testing
uv run pytest tests/ -v)uv run ruff check .)Additional Notes
Added
test_pass_when_file_exists_nestedtotests/darnit/sieve/test_builtin_handlers.pyto assert that files hidden two folders deep (e.g.src/app/package.json) are correctly flagged by the updated file_exists_handler logic.