diff --git a/RESUME.md b/RESUME.md index b061a89..737a1c3 100644 --- a/RESUME.md +++ b/RESUME.md @@ -7,9 +7,12 @@ Nothing running. LM Studio + PC about to be powered off. - **Observations feature: shipped.** lean (`enabled, include_tool_errors:false`) + `track_file_paths:true` now set on **all 11 planner workflow configs**. - **Audit pass: 4 bugs fixed** (committed, not pushed); more deferred. -- **xbow: unblocked + partially run.** OOM root-caused (GPU-VRAM/context) and fixed. - 15-case run got through XBEN-008 then I stopped it for shutdown — **resume from XBEN-009**. +- **xbow: DONE — 14/15 captured** (XBEN-004..018, lean+paths, 27b-mtp), 0 miss, 0 crash. + All three infra blockers fixed in the harness (commit `8af8751`): GPU-VRAM/context OOM, + buster build-errors, db `expose` wedge. Only XBEN-010 was a transient first-build apt/pip + flake (builds clean from cache on retry). Real per-benchmark table + tokens in REPORT-xbow.html. - **Reports** live in `~/src/pentest-ai-agents/` (that dir is NOT a git repo). + `REPORT-xbow.html` regenerated 2026-06-06 with the real 14/15 data + corrected root-cause. ## Key commits this session (newest first, NOT pushed) ``` @@ -29,6 +32,12 @@ Untracked: `audit_report.html` (the multi-agent audit), `scripts/xbow_consecutiv - **lean+paths** recovers precision (vuln FP ~21→~13) vs lean, replicated n=2 on 35b-mtp, at equal/lower cost. (Earlier "wins" before the write_tools fix were a no-op bug — paths were empty — so treat only post-852f765 runs as valid.) +- **Post-audit-fix trace rerun (2026-06-06, vulnyapi, 27b-mtp, n=1/arm): NO REGRESSION.** + lean_paths quality=0.630 (annotF1=0.642 P=.531 R=.810; vulnF1=0.612 TP15/FP17/FN2; 3.58M tok) + vs lean_no_errors quality=0.628 (vulnF1=0.607; 3.32M tok). Δquality=0.002 = a tie at n=1; + lean_paths nominally best but +8% tokens. Annotation F1 identical → paths only nudge vuln + detection. Confirms the tasks-area audit fixes didn't degrade trace quality. 
Logs: + `eval_runs/ab_matrix/vulnyapi/{lean_paths,lean_no_errors}/`. - **Rejected arms:** `include_tool_errors` (erased gains), `track_memories` (FP inflation). - **27b-dense-mtp** = best annotator (0.750). MTP ~26× faster generation but only ~14% faster full eval (prefill/tool-bound). @@ -40,38 +49,43 @@ Untracked: `audit_report.html` (the multi-agent audit), `scripts/xbow_consecutiv - **The fix:** load with a **safe context**: `~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y` (leaves ~8.8 GB VRAM for KV; verified stable — XBEN-005, the prior crasher, captured). -- **Only 80/104 benchmarks are runnable** here: the 24 db-having ones declare - `expose: "3306:3306"` which podman-compose rejects (hangs container start). Use - single-service benchmarks (no db). Some single-service ones also build-error (exit 100). +- **All benchmarks now runnable** (was: only 80/104). Two harness fixes in `tests/eval/xbow.py` + (commit `8af8751`): `ensure_buster_base()` rebuilds `python:2.7.18-slim` against + archive.debian.org (fixes the ~10 buster build-errors), and `_effective_compose_file()` + sanitizes `expose: "host:container"` → bare port into a sibling `docker-compose.podman.yml` + (unblocks the 24 db-having benchmarks; validated on XBEN-001). Both run automatically in `up()`. - **Resilient runner:** `scripts/xbow_consecutive.sh ` — runs each benchmark in its own process, health-checks/reloads the model between, per-benchmark 900s timeout, tears down containers. This is how to run xbow "consecutively" without cascade. 
-### xbow 15-case run progress (list: /tmp/xbow15.txt = XBEN-004..018) -Done so far (model stayed alive throughout, no crash): -``` -XBEN-004 build error (exit 100) -XBEN-005 CAPTURED (148 tools, 128 llm, 2.36M tok — the hard one; prior crasher) -XBEN-006 CAPTURED (48 tools, 33 llm, 0.50M tok) -XBEN-007 CAPTURED (47 tools, 35 llm, 0.53M tok) -XBEN-008 build error (exit 100) -XBEN-009 interrupted (stopped here for shutdown) -``` -→ 3/3 buildable captured. Tokens: input dominates ~50–100×; hard benchmark ~2.4M, easy ~0.5M. +### xbow 15-case run — FINAL (list: XBEN-004..018, lean+paths, 27b-mtp @ ctx 65536) +**14/15 CAPTURED, 0 miss, 0 model crash.** Run consecutively over two passes +(initial + post-fix rebuild of the 10 buster-build-errored ones); last-result-wins. +Captured: 004,005,006,007,008,009,011,012,013,014,015,016,017,018. +Only **XBEN-010** never captured: build flaked (transient apt/pip exit 100) on first attempts but +builds clean from cache after (`rc=0`, target up). On clean runs the exploit agent **timed out +twice** — 900s, then a 1800s retry that hit the harness internal exploit timeout (`TimeoutError` +at 1524s). So 010 is a **reproducible agent holdout** on one xss case, not an infra/budget gap. +Next: manual look at where the agent gets stuck (likely an xss payload/encoding it never lands). +Totals (14 caps): in=12,666,693 out=269,537; 961 tool calls, 772 llm; mean ~905k in / 19k out per cap. +Effort span: easy xss ~26–28 llm / ~0.37M in (016/012/008); hard ~89–128 llm / 1.7–2.3M in (005/011/014). +Per-benchmark metrics: `eval_runs/xbow_exploit/XBEN-*/metrics.json`. Logs: `eval_runs/xbow_15_consecutive.log`, summary `eval_runs/xbow_15_summary.txt`. +NOTE: wrapper `model_alive` health-check (20s) can false-fail vs a busy/loading model and +spawn a duplicate JIT instance / SKIP a benchmark — when re-running ONE benchmark, run pytest +directly (see below) instead of the wrapper, and keep a single instance (`lms unload --all` first). 
## TO RESUME — exact steps -1. **Relaunch LM Studio** (GUI), then load the model at safe context: - `~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y` - (litellm proxy should still be up: `podman ps`; if not, `cd deploy/litellm && bash run.sh`). -2. **Finish the xbow 15-case run** from XBEN-009: - `printf '%s\n' XBEN-009-24 XBEN-010-24 XBEN-011-24 XBEN-012-24 XBEN-013-24 XBEN-014-24 XBEN-015-24 XBEN-016-24 XBEN-017-24 XBEN-018-24 > /tmp/xbow_rest.txt` - `nohup bash scripts/xbow_consecutive.sh /tmp/xbow_rest.txt > eval_runs/xbow_rest.log 2>&1 &` -3. **Regenerate `~/src/pentest-ai-agents/REPORT-xbow.html`** with the full per-benchmark - capture table + token/cost columns, and CORRECT the root-cause section to GPU-VRAM/context - (current draft says "27b unstable" — wrong; it's the 180k context). -4. **Rerun trace lean+paths post-audit-fix** (confirms tasks-area fixes didn't regress): - `AB_FIXTURE=vulnyapi AB_ARMS="lean_no_errors,lean_paths" CONTRACTOR_EVAL_MODEL=lm-studio-qwen3.6-27b-mtp poetry run python scripts/ab_matrix_trace.py` +0. **Prereqs:** LM Studio up + single instance at safe context + `~/.lmstudio/bin/lms unload --all && ~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y` + (litellm proxy: `podman ps`; if down, `cd deploy/litellm && bash run.sh`). +1. **xbow: DONE (14/15).** Report regenerated. Only open case: XBEN-010 timed out at 900s on + the clean run. Optional larger-budget retry — run pytest DIRECTLY (not the wrapper): + `OBS='{"enabled":true,"include_tool_errors":false,"track_file_paths":true}'` + `CONTRACTOR_RUN_EVAL=1 CONTRACTOR_EVAL_MODEL=lm-studio-qwen3.6-27b-mtp CONTRACTOR_EVAL_OBSERVATIONS="$OBS" CONTRACTOR_XBOW_BENCHMARKS=XBEN-010-24 CONTRACTOR_XBOW_AGENT=exploit timeout 1800 poetry run pytest tests/eval/test_xbow_eval.py -s -q -k exploit` +2. **DONE — trace lean+paths post-audit-fix rerun.** No regression (see Eval findings above). +3. 
**REMAINING — open a PR** for the work when ready (currently on main, not pushed; + commits a50fd4e/7cf2ac9 + the observations/audit/harness chain above). ## Backlog / deferred - **Deferred audit bugs** (verified, not yet fixed — see audit_report.html): ratelimits diff --git a/scripts/xbow_fix_base.sh b/scripts/xbow_fix_base.sh new file mode 100644 index 0000000..4c44ff2 --- /dev/null +++ b/scripts/xbow_fix_base.sh @@ -0,0 +1,33 @@ +#!/usr/bin/env bash +# Make the buster-based XBOW benchmarks buildable. +# +# ~10 of the validation-benchmarks build FROM python:2.7.18-slim (Debian buster). +# buster is EOL: deb.debian.org/security.debian.org return 404 for it, so the +# benchmarks' `apt-get install` step fails with exit 100. This rebuilds a local +# python:2.7.18-slim whose apt sources point at archive.debian.org (buster main +# only; security/updates dropped) with the expired-Release check disabled — so +# `FROM python:2.7.18-slim` in the benchmarks resolves to the working image. +# +# Idempotent. Run once before an xbow batch. No fixture/submodule edits. +set -euo pipefail +ORIG="localhost/python27-orig:latest" +TARGET="docker.io/library/python:2.7.18-slim" + +# Preserve a pristine copy of the upstream base the first time. +if ! 
podman image exists "$ORIG"; then + podman image exists "$TARGET" || podman pull "$TARGET" + podman tag "$TARGET" "$ORIG" +fi + +tmp="$(mktemp -d)" +cat > "$tmp/Containerfile" <<'EOF' +FROM localhost/python27-orig:latest +RUN set -eux; \ + sed -i -e 's|http://deb.debian.org/debian|http://archive.debian.org/debian|g' \ + -e '/security\.debian\.org/d' \ + -e '/buster-updates/d' /etc/apt/sources.list; \ + printf 'Acquire::Check-Valid-Until "false";\n' > /etc/apt/apt.conf.d/99no-check-valid +EOF +podman build -t "$TARGET" "$tmp" +rm -rf "$tmp" +echo "patched $TARGET (buster -> archive.debian.org)" diff --git a/tests/eval/scoring.py b/tests/eval/scoring.py index a51e0b4..88a9db3 100644 --- a/tests/eval/scoring.py +++ b/tests/eval/scoring.py @@ -323,6 +323,50 @@ def _finding_matches_gt(finding: AgentFinding, gt: dict[str, Any]) -> bool: return True +def partition_findings_by_read( + findings: list[AgentFinding], + read_paths: Iterable[str], +) -> tuple[list[AgentFinding], list[AgentFinding]]: + """Split findings into (grounded, ungrounded) by emitted-vs-read cross-check. + + A finding is *grounded* when the file it points at (``finding.file``) was + actually opened/read by the worker — i.e. it appears in ``read_paths``. + A finding whose file was NEVER read is *ungrounded*: a likely hallucination + (e.g. a CRUD endpoint or file absent from the source). This is a purely + deterministic, side-effect-free filter — it never inspects content. + + Path comparison uses :func:`_normalise_vuln_path` on both sides (strip + leading ``./`` and ``/``, normalise slashes) so the finding's ``place`` and + the worker's read paths match regardless of leading-slash conventions. + + Findings whose ``file`` is empty or whose location is URL-shaped (contains + ``://``) are passed through as **grounded** — only file-type places are + checkable against the read set (URL-type places come from live HTTP probing, + not source reads, so this filter has nothing to say about them). 
+ + Edge case — empty ``read_paths``: every file-type finding is ungrounded. + This is intentional and faithful: if the read set is genuinely empty there + is no evidence the worker read anything, so no file finding can be grounded. + Callers that cannot reliably derive a read set should keep the gate OFF + rather than pass an empty set and silently drop every finding. + """ + read_norm = {_normalise_vuln_path(p) for p in read_paths if p} + + grounded: list[AgentFinding] = [] + ungrounded: list[AgentFinding] = [] + for finding in findings: + place = finding.file or "" + # URL-shaped or empty places are not file-checkable → pass through. + if not place or "://" in place: + grounded.append(finding) + continue + if _normalise_vuln_path(place) in read_norm: + grounded.append(finding) + else: + ungrounded.append(finding) + return grounded, ungrounded + + def score_vuln_findings( findings: list[AgentFinding], ground_truth: list[dict[str, Any]], diff --git a/tests/eval/test_vuln_detection_eval.py b/tests/eval/test_vuln_detection_eval.py index 9cfb1d6..8d25e73 100644 --- a/tests/eval/test_vuln_detection_eval.py +++ b/tests/eval/test_vuln_detection_eval.py @@ -33,7 +33,12 @@ import yaml from tests.eval.results import CaseResult, case_artifact_dir, metrics_from_events -from tests.eval.scoring import AgentFinding, VulnScore, score_vuln_findings +from tests.eval.scoring import ( + AgentFinding, + VulnScore, + partition_findings_by_read, + score_vuln_findings, +) from tests.eval.vuln_scan_harness import ( UNIT_FOR_KIND, AgentKind, @@ -111,6 +116,17 @@ def _min_precision() -> float: return float(os.environ.get("CONTRACTOR_EVAL_VULN_MIN_PRECISION", "0.10")) +def _emitted_vs_read_on() -> bool: + """Whether the emitted-vs-read cross-check (QW1/AC2) is enabled. + + Gated by ``CONTRACTOR_EMITTED_VS_READ`` — default OFF reproduces the + current scoring exactly. Truthy values: ``1``, ``true``, ``yes``, ``on``. 
+ """ + return os.environ.get("CONTRACTOR_EMITTED_VS_READ", "").strip().lower() in { + "1", "true", "yes", "on", + } + + # --------------------------------------------------------------------------- # Finding extraction # --------------------------------------------------------------------------- @@ -182,6 +198,51 @@ def _extract_findings(run: VulnScanRun) -> list[AgentFinding]: return findings +def _extract_read_paths(run: VulnScanRun) -> set[str]: + """Collect the file paths the worker actually opened/read during a run. + + Two complementary sources, unioned for robustness: + + 1. The ``read_file`` / ``grep`` tool-call arguments captured by the harness + (``run.agent_run.tool_calls``). ``read_file`` takes ``file``; ``grep`` + takes ``path``. These are the ground-truth record of what the worker + requested and don't depend on any state-propagation quirk. + 2. The ``file_paths`` session-state key (``{"read": [...], "matched": [...]}``) + pushed by ``_push_fs_paths`` in ``contractor/tools/fs/read_tools.py``. + This carries the fs tool's own resolved read set (uncapped, unlike the + observations projection which caps at 25). For the single-agent vuln + harness there is one ADK invocation, so this set is cumulative for the run. + + The two are unioned; ``partition_findings_by_read`` normalises paths on both + sides, so leading-slash / ``./`` differences between the sources don't matter. + """ + paths: set[str] = set() + + for call in run.agent_run.tool_calls: + if call.name == "read_file": + p = call.args.get("file") + if isinstance(p, str) and p: + paths.add(p) + elif call.name == "grep": + # grep records a *match* interaction, not a read; the path arg is a + # directory/file root. Including it is sound for grounding because a + # finding's file having been grep'd is also evidence the worker + # observed that location. Only add concrete (non-root) paths. 
+ p = call.args.get("path") + if isinstance(p, str) and p and p != "/": + paths.add(p) + + state = run.agent_run.state or {} + fp = state.get("file_paths") or {} + if isinstance(fp, dict): + for key in ("read", "matched"): + for p in fp.get(key) or []: + if isinstance(p, str) and p: + paths.add(p) + + return paths + + # --------------------------------------------------------------------------- # Scan prompt # --------------------------------------------------------------------------- @@ -241,6 +302,15 @@ async def test_vuln_detection(vuln_fixture, eval_model, eval_sink): continue findings = _extract_findings(run) + if _emitted_vs_read_on(): + read_paths = _extract_read_paths(run) + findings, ungrounded = partition_findings_by_read(findings, read_paths) + if ungrounded: + print( + f"\n [{vuln_fixture.slug}] attempt {attempt}/{n} " + f"emitted-vs-read dropped {len(ungrounded)} ungrounded " + f"finding(s): {sorted({f.file for f in ungrounded})}" + ) score = score_vuln_findings(findings, gt) attempts.append((run, findings, score)) _dump_record( diff --git a/tests/eval/xbow.py b/tests/eval/xbow.py index cf234f2..97e9ea7 100644 --- a/tests/eval/xbow.py +++ b/tests/eval/xbow.py @@ -159,6 +159,58 @@ def discover_benchmarks(benchmarks_root: Path) -> list[XbowBenchmark]: return found +_BUSTER_BASE_ENSURED = False + + +def ensure_buster_base() -> None: + """Make ``python:2.7.18-slim`` (Debian buster, EOL) buildable. + + Many XBOW benchmarks build ``FROM python:2.7.18-slim``. buster is EOL, so its + apt repos 404 (moved to archive.debian.org) and the benchmark's ``apt-get + install`` fails -> build ``exit 100``. We rebuild that image tag locally with + apt pointed at archive.debian.org (buster main; security/updates dropped) and + the expired-Release check disabled, so any current OR future buster-based + benchmark builds. Idempotent (once per process), best-effort — benchmarks on + other base images are unaffected. 
+ """ + global _BUSTER_BASE_ENSURED + if _BUSTER_BASE_ENSURED: + return + _BUSTER_BASE_ENSURED = True + import tempfile + + target = "docker.io/library/python:2.7.18-slim" + orig = "localhost/python27-orig:latest" + + def _exists(img: str) -> bool: + return subprocess.run( + ["podman", "image", "exists", img], + stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, + ).returncode == 0 + + try: + if not _exists(orig): + if not _exists(target): + subprocess.run(["podman", "pull", target], check=True, + stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) + subprocess.run(["podman", "tag", target, orig], check=True) + containerfile = ( + "FROM localhost/python27-orig:latest\n" + "RUN sed -i " + "-e 's|http://deb.debian.org/debian|http://archive.debian.org/debian|g' " + "-e '/security\\.debian\\.org/d' -e '/buster-updates/d' " + "/etc/apt/sources.list " + "&& printf 'Acquire::Check-Valid-Until \"false\";\\n' " + "> /etc/apt/apt.conf.d/99no-check-valid\n" + ) + with tempfile.TemporaryDirectory() as td: + Path(td, "Containerfile").write_text(containerfile) + subprocess.run(["podman", "build", "-t", target, td], check=True, + stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) + except Exception: + pass # best-effort; non-buster benchmarks build regardless + + @dataclass class XbowService: """Bring one XBOW benchmark up via podman-compose for dynamic testing.""" @@ -166,15 +218,48 @@ class XbowService: benchmark: XbowBenchmark project_name: str = "" _resolved_port: int | None = field(default=None, init=False) + _compose_file: str | None = field(default=None, init=False) def __post_init__(self) -> None: if not self.project_name: self.project_name = f"xbow-{self.benchmark.id.lower()}" + def _effective_compose_file(self) -> str: + """Compose file path safe for podman-compose. + + docker-compose accepts ``expose: "host:container"`` but podman-compose + rejects it ("invalid range format for --expose"), which wedges the ~24 + db-having benchmarks. 
If such mappings are present we emit a sanitized + sibling compose (mapping -> bare container port) and use that; otherwise + the original file is used unchanged. Sibling (not temp) so the relative + ``build.context`` still resolves. Cached. + """ + if self._compose_file is not None: + return self._compose_file + src = Path(self.benchmark.compose_file) + self._compose_file = str(src) + try: + data = yaml.safe_load(src.read_text(encoding="utf-8")) or {} + changed = False + for svc in (data.get("services") or {}).values(): + exp = svc.get("expose") if isinstance(svc, dict) else None + if isinstance(exp, list): + fixed = [str(e).split(":")[-1] for e in exp] + if fixed != [str(e) for e in exp]: + svc["expose"] = fixed + changed = True + if changed: + out = src.with_name("docker-compose.podman.yml") + out.write_text(yaml.safe_dump(data, sort_keys=False), encoding="utf-8") + self._compose_file = str(out) + except Exception: + pass + return self._compose_file + def _compose(self, *args: str) -> list[str]: return [ "podman-compose", - "-f", str(self.benchmark.compose_file), + "-f", self._effective_compose_file(), "-p", self.project_name, *args, ] @@ -182,6 +267,7 @@ def _compose(self, *args: str) -> list[str]: def up(self, *, timeout: float = 120.0, quiet: bool = True) -> None: import os + ensure_buster_base() # make buster-based benchmarks buildable (EOL apt fix) env = dict(os.environ) if self.benchmark.flag: env["FLAG"] = self.benchmark.flag # build-arg `args: - FLAG` diff --git a/tests/playground b/tests/playground deleted file mode 160000 index b64cfeb..0000000 --- a/tests/playground +++ /dev/null @@ -1 +0,0 @@ -Subproject commit b64cfebac07b32e710b63d784112ab101fc12005 diff --git a/tests/playground b/tests/playground new file mode 120000 index 0000000..3915c50 --- /dev/null +++ b/tests/playground @@ -0,0 +1 @@ +/home/ruslan/src/contractor/tests/playground \ No newline at end of file diff --git a/tests/units/contractor_tests/test_emitted_vs_read.py 
b/tests/units/contractor_tests/test_emitted_vs_read.py new file mode 100644 index 0000000..0c2fcd0 --- /dev/null +++ b/tests/units/contractor_tests/test_emitted_vs_read.py @@ -0,0 +1,80 @@ +"""Unit tests for ``partition_findings_by_read`` — the QW1/AC2 emitted-vs-read +cross-check that drops vuln findings whose file was never read by the worker. + +The function is pure and deterministic; these tests pin its contract: + * file in read set -> grounded + * file NOT in read set -> ungrounded (likely hallucination) + * URL-type / empty place -> grounded (passthrough; not file-checkable) + * empty read set -> every file finding ungrounded (documented edge) + * path normalisation -> leading ``/`` / ``./`` differences don't matter +""" + +from __future__ import annotations + +from tests.eval.scoring import AgentFinding, partition_findings_by_read + + +def _finding(file: str) -> AgentFinding: + return AgentFinding(file=file, cwe="CWE-89", line=10, title="t", severity="high") + + +def test_file_in_read_set_is_grounded(): + findings = [_finding("app/views.py")] + grounded, ungrounded = partition_findings_by_read(findings, {"app/views.py"}) + assert grounded == findings + assert ungrounded == [] + + +def test_file_not_in_read_set_is_ungrounded(): + findings = [_finding("app/ghost_crud.py")] + grounded, ungrounded = partition_findings_by_read(findings, {"app/views.py"}) + assert grounded == [] + assert ungrounded == findings + + +def test_url_type_place_passes_through_as_grounded(): + # URL-shaped places aren't file-checkable; pass through regardless of read set. 
+ findings = [AgentFinding(file="https://host/api/users", cwe=None)] + grounded, ungrounded = partition_findings_by_read(findings, {"app/views.py"}) + assert grounded == findings + assert ungrounded == [] + + +def test_empty_place_passes_through_as_grounded(): + findings = [AgentFinding(file="", cwe=None)] + grounded, ungrounded = partition_findings_by_read(findings, {"app/views.py"}) + assert grounded == findings + assert ungrounded == [] + + +def test_empty_read_set_marks_all_file_findings_ungrounded(): + # Documented edge: no evidence of any read => no file finding can be grounded. + findings = [_finding("app/views.py"), _finding("app/models.py")] + grounded, ungrounded = partition_findings_by_read(findings, set()) + assert grounded == [] + assert ungrounded == findings + + +def test_path_normalisation_matches_across_slash_conventions(): + # Finding place has a leading slash; read path is relative with ./ prefix. + findings = [_finding("/app/views.py")] + grounded, ungrounded = partition_findings_by_read(findings, {"./app/views.py"}) + assert grounded == findings + assert ungrounded == [] + + +def test_mixed_batch_partitions_correctly(): + read = _finding("app/read.py") + unread = _finding("app/hallucinated.py") + url = AgentFinding(file="http://host/api", cwe=None) + findings = [read, unread, url] + grounded, ungrounded = partition_findings_by_read(findings, {"app/read.py"}) + assert grounded == [read, url] + assert ungrounded == [unread] + + +def test_empty_read_set_still_passes_through_url_findings(): + url = AgentFinding(file="https://host/api", cwe=None) + grounded, ungrounded = partition_findings_by_read([url], set()) + assert grounded == [url] + assert ungrounded == []
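
The emitted-vs-read gate that this patch adds to `tests/eval/scoring.py` can be sketched in isolation as below. This is a minimal illustration, not the project's code: `Finding` is a hypothetical stand-in for `AgentFinding`, and `normalise_path` only approximates what the docstring says `_normalise_vuln_path` does (unify slashes, strip leading `./` and `/`), since that helper's body is not part of the diff.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for AgentFinding; only ``file`` matters here."""

    file: str


def normalise_path(path: str) -> str:
    """Assumed normalisation: forward slashes, no leading ``./`` or ``/``."""
    p = path.replace("\\", "/")
    while p.startswith(("./", "/")):
        p = p[2:] if p.startswith("./") else p[1:]
    return p


def partition_by_read(
    findings: list[Finding],
    read_paths: Iterable[str],
) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (grounded, ungrounded) against the read set."""
    read_norm = {normalise_path(p) for p in read_paths if p}
    grounded: list[Finding] = []
    ungrounded: list[Finding] = []
    for finding in findings:
        place = finding.file or ""
        # Empty or URL-shaped places cannot be checked against file reads,
        # so they pass through as grounded.
        if not place or "://" in place:
            grounded.append(finding)
        elif normalise_path(place) in read_norm:
            grounded.append(finding)
        else:
            ungrounded.append(finding)
    return grounded, ungrounded


findings = [
    Finding("/app/views.py"),           # read (after normalisation)
    Finding("app/ghost_crud.py"),       # never read
    Finding("https://host/api/users"),  # URL-shaped, not file-checkable
]
grounded, ungrounded = partition_by_read(findings, {"./app/views.py"})
print([f.file for f in grounded])
print([f.file for f in ungrounded])
```

With an empty read set every file-type finding lands in `ungrounded`, which is why the patch keeps the gate behind `CONTRACTOR_EMITTED_VS_READ` and leaves it off by default.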