Skip to content

fix(sieve): utilize constrained depth file discovery for dependency checks#225

Open
Jaydeep869 wants to merge 2 commits into
kusari-oss:mainfrom
Jaydeep869:fix-issue-221-nested-dependencies
Open

fix(sieve): utilize constrained depth file discovery for dependency checks#225
Jaydeep869 wants to merge 2 commits into
kusari-oss:mainfrom
Jaydeep869:fix-issue-221-nested-dependencies

Conversation

@Jaydeep869
Copy link
Copy Markdown
Contributor

Summary

Fixes #221. Modified the Sieve dependency checkers (file_exists_handler and regex_handler) by switching from unbounded recursive globbing to a constrained depth search (os.walk limited to 3 directories deep). This efficiently discovers nested manifests like src/app/package.json while explicitly rejecting large unneeded directories (.git, node_modules, venv, etc.) without incurring the massive performance penalty of full recursive scans.

Type of Change

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality)
  • Breaking change (fix or feature causing existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)

Framework Changes Checklist

If this PR modifies the darnit framework (packages/darnit/):

  • Updated framework spec (openspec/specs/framework-design/spec.md) if behavior changed
  • Ran uv run python scripts/validate_sync.py --verbose and it passes
  • Ran uv run python scripts/generate_docs.py and committed any doc changes

Control/TOML Changes Checklist

If this PR modifies controls or TOML configuration:

  • Control metadata defined in TOML (not Python code)
  • SARIF fields (description, severity, help_url) included where appropriate
  • Ran validation to confirm TOML schema compliance

Testing

  • Tests pass locally (uv run pytest tests/ -v)
  • Added tests for new functionality (if applicable)
  • Linting passes (uv run ruff check .)

Additional Notes

Added test_pass_when_file_exists_nested to tests/darnit/sieve/test_builtin_handlers.py to assert that files hidden two folders deep (e.g. src/app/package.json) are correctly flagged by the updated file_exists_handler logic.

Copy link
Copy Markdown
Contributor

@mlieberman85 mlieberman85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this! CI lint is red — uv run ruff check --fix . will clear the W293 / SIM114 warnings in the new helper. Once that's clean, two things need a closer look:

  1. Possible bug — at builtin_handlers.py:64, for file in files + dirs: iterates over the combined files-and-directories lists from os.walk. That means a directory named package.json would match a package.json file pattern. Did you mean for file in files:?

  2. Sieve-wide blast radiusmax_depth=3 is now the default for every control using file_exists and regex, not just the nested-dependency case from #221. That's a behavioral change across ~60 controls.

    For this PR, let's keep the current default behavior unchanged (unbounded recursive glob) and make the new depth-limited search opt-in via TOML — controls that need it set max_depth = 3 explicitly, everyone else keeps today's semantics. That keeps scope tight and unblocks #221 without risking regressions elsewhere.

    I'll open a follow-up issue to explore whether depth-limited search should become the default later (with a proper audit-regression check).

Minor: the pattern.replace("**/", "") strip-and-retry fallback isn't standard glob semantics. If you need recursive matching with a depth cap, pathlib.Path.rglob plus a depth filter would be more predictable than the current branching.

@mlieberman85
Copy link
Copy Markdown
Contributor

Follow-up issue for evaluating depth-limited search as the default: #226

@mlieberman85
Copy link
Copy Markdown
Contributor

Thanks for the work here, @Jaydeep869. I dug into this and there's a gap I want to flag before we merge.

The PR ships the capability but no control opts in

After commit 21399f3 made max_depth opt-in (which is the right call — see below), I checked whether anything actually uses it:

$ grep -c max_depth packages/darnit-baseline/openssf-baseline.toml
0

I then reproduced issue #221 directly against this branch:

# /tmp/nested-test/src/app/package.json exists
config = {
    "files": ["package.json", "package-lock.json", "requirements.txt",
              "pyproject.toml", "Cargo.toml", "go.mod", "go.sum",
              "pom.xml", "build.gradle", "Gemfile.lock"],
}  # exact OSPS-BR-05.01 config from the TOML
ctx = HandlerContext(local_path="/tmp/nested-test", control_id="OSPS-BR-05.01")

file_exists_handler(config, ctx)
# Status: fail
# Msg: None of the required files found

config["max_depth"] = 3
file_exists_handler(config, ctx)
# Status: pass
# Msg: Required file found: src/app/package.json

So as currently configured, a user running darnit audit against a monorepo with src/app/package.json will see the same FAIL they did before the PR. The fix needs the corresponding TOML opt-in.

Why opt-in is correct (don't change this part)

Several controls intentionally check root-only because finding the file anywhere would be a false positive: LICENSE, CODE_OF_CONDUCT.md, SECURITY.md, CONTRIBUTING.md. A repo that vendors a dep with vendor/somelib/LICENSE shouldn't pass a license check on the transitive copy — the question is "does this project declare X" not "is there an X anywhere in the tree." The ignore_dirs set doesn't cover Go vendor/, git submodules, third_party/, etc., so a global default-on would silently flip several controls into false positives. That violates our conservative-by-default principle.

Suggested fix

Add max_depth = 3 to the controls that should match nested manifests — the dependency/build controls, not metadata files. For #221 specifically:

[[controls."OSPS-BR-05.01".passes]]
handler = "file_exists"
max_depth = 3
files = [
    "package.json", "package-lock.json", ...
]

Worth applying to other dep-adjacent controls too (e.g. OSPS-BR-04.01 if it checks lockfiles). Please skip LICENSE/SECURITY/CODE_OF_CONDUCT-style controls.

Smaller items while we're here

  • The match expression is hard to scan due to or/and precedence:

    if PurePath(rel_path).match(pattern) or pattern.startswith("**/") and fnmatch.fnmatch(rel_path, pattern[3:]) or fnmatch.fnmatch(rel_path, pattern):

    Python parses this as A or (B and C) or D. Splitting into named branches would make the intent clearer.

  • dirs.clear() at depth >= max_depth means directory matches at the boundary depth are skipped (verified: a target/ dir at depth 2 with max_depth=2 returns [] while a target.txt file at the same depth is found). Probably fine for these handlers since manifests are files, but worth a comment.

  • file_exists_handler now passes recursive=True to glob.glob for unbounded mode — it didn't before. This is a subtle behavior change for patterns like **/foo. Probably benign but uncalled-out.

  • Test coverage: the new test covers file_exists_handler with max_depth=3. No coverage for regex_handler with max_depth, and no test that ignore_dirs actually prunes.

  • New TOML schema field (max_depth) — the spec validation checkbox in the PR description is unchecked. Worth running uv run python scripts/validate_sync.py --verbose and updating the spec if it touches the schema docs.

@Jaydeep869 Jaydeep869 force-pushed the fix-issue-221-nested-dependencies branch from 21399f3 to 723300e Compare May 1, 2026 00:40
@Jaydeep869
Copy link
Copy Markdown
Contributor Author

Hey @mlieberman85, can you review this pr now.

Copy link
Copy Markdown
Contributor

@mlieberman85 mlieberman85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Constrained Depth File Discovery

Overview

Fixes #221 by replacing root-only file discovery in file_exists_handler and regex_handler with a depth-bounded walk (os.walk + ignore-list pruning). New optional max_depth config field on passes; OSPS-BR-05.01 opts in with max_depth = 3 so monorepo manifests like src/app/package.json are found. Adds three tests.

What works

  • Targeted fix lands cleanly on the reported bug (BR-05.01) without changing other controls' default behavior.
  • os.walk + dirs[:] = [...] pruning is the correct idiom for skipping node_modules/.git cheaply.
  • Sorted, deduped output (sorted(set(found))) is deterministic.
  • Schema compatibility: PassConfig uses extra="allow", so max_depth flows through without schema changes.
  • Tests cover both directions: nested-found (positive) and node_modules-pruned (negative).

Issues to address

1. Docstring contradicts behavior (builtin_handlers.py:115)

max_depth: int - Max directory depth to search (default: 3)

Actual default is None (unbounded glob, root-only for non-globs). Either change the docstring to (default: None, no depth limit), or actually default to 3 in the handler. Misleading docstrings here are a footgun for TOML authors.

2. Subtle behavior change for file_exists_handler glob patterns
Old: glob.glob(path)recursive=False.
New (max_depth=None branch): glob.glob(path, recursive=True).
For * patterns this is identical, but **/... patterns now recurse where before they didn't. Practically no current TOML uses **/ with file_exists, but worth a note in the commit message and ideally a regression test asserting backward-compat for the unbounded path.

3. Code/comment mismatch on boundary-depth dirs (builtin_handlers.py:80–88)

if depth >= max_depth:
    dirs.clear()
...
for item in files + dirs:  # dirs is now empty

The comment claims "directory names at exactly the boundary depth are still yielded in dirs and can match" — but dirs.clear() empties the list before iteration, so they can't. Either fix the comment, or move the dirs.clear() to after the match loop if matching boundary-depth dirs is intentional. Low real-world impact, but the comment will mislead the next reader.

4. Issue #221 mentions go.mod and pyproject.toml — those live in other controls and didn't get max_depth = 3 here. If the goal is to fully resolve the bug as filed, audit other dependency-manifest controls (requirements.txt, Cargo.toml, go.mod, pyproject.toml, Gemfile, pom.xml, build.gradle) for the same opt-in. Otherwise the bug recurs for non-JS monorepos.

5. ignore_dirs set is JS/Python-biased
Missing common large dirs: target (Rust/Java), vendor (Go/PHP), bin, obj, .gradle, .next, .nuxt, .cache, .pytest_cache, .mypy_cache. Consider expanding — pruning is essentially free and avoids surprise slowdowns.

Suggestions

  • The three-strategy match (pathlib + fnmatch-stripped + plain fnmatch) is over-engineered for the patterns in use. PurePath(rel_path).match(pattern) alone covers all current TOML patterns and is easier to reason about. Consider simplifying unless there's a concrete pattern that needs the fallbacks.
  • Hoist imports (fnmatch, glob, PurePath) to the module top. The existing per-call import glob is a pre-existing wart, not a convention worth preserving.
  • Default max_depth = 3 for the helper, opt-out via None, rather than opt-in. Most controls would benefit from depth-limited search; few benefit from unbounded recursion. Follow-up PR if scope-limiting this one.

Risk

Low for the targeted scope (BR-05.01). Medium-low elsewhere — the recursive=True flip in the unbounded path is the main thing to watch in CI on real repos.

Test coverage

Adequate for the new code paths. Missing: a backward-compat assertion that file_exists_handler({"files": ["pkg.json"]}, ctx) (no max_depth) behaves exactly as before.

Verdict: Docstring fix (#1) and comment fix (#3) as blockers; the rest as follow-ups or commit-message notes.

Jaydeep869 added 2 commits May 8, 2026 01:45
…ecks

Add max_depth support to _find_files_with_depth() so dependency controls
can discover nested manifests (e.g. src/app/package.json) without
imposing unbounded recursive scans. The default behavior (max_depth=None)
remains unchanged—unbounded glob—keeping all existing controls unaffected.

Key changes:

- builtin_handlers.py: clarify match-expression precedence by splitting
  the or/and chain into named booleans (pathlib_match, recursive_match,
  simple_match). Document recursive=True in the unbounded-glob fast
  path, and add a comment explaining dirs.clear() semantics at the
  depth boundary.

- openssf-baseline.toml: opt OSPS-BR-05.01 into max_depth=3 so nested
  dependency manifests are correctly discovered. Metadata-file controls
  (LICENSE, SECURITY, CODE_OF_CONDUCT) are intentionally left unbounded
  to avoid false positives from vendored copies.

- tests: move test_pass_when_file_exists_nested into TestFileExistsHandler
  (was misplaced in TestProjectUpdateHandler), add coverage for
  regex_handler with max_depth, and add ignore_dirs pruning assertion.

Fixes kusari-oss#221

Signed-off-by: jaydeep869 <jaydeeppokhariya2106@gmail.com>
- Update docstring for max_depth in file_exists_handler
- Fix boundary-depth directory iteration bug inside _find_files_with_depth
- Expand ignore_dirs list for other language ecosystems (target, vendor, etc.)
- Opt-in max_depth=3 for OSPS-QA-02.01 so standard manifests like go.mod are found
Note: Unbounded globs now default to passing recursive=True down to glob.

Signed-off-by: jaydeep869 <jaydeeppokhariya2106@gmail.com>
@Jaydeep869 Jaydeep869 force-pushed the fix-issue-221-nested-dependencies branch from e5b8315 to af83acf Compare May 7, 2026 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Dependency detection (e.g., go.mod, pyproject.toml) fails if not at absolute repo root

2 participants