grauwolf32 · Jun 7, 2026 · Jun 6, 2026
diff --git a/RESUME.md b/RESUME.md
@@ -7,9 +7,12 @@ Nothing running. LM Studio + PC about to be powered off.
 - **Observations feature: shipped.** lean (`enabled, include_tool_errors:false`) +
   `track_file_paths:true` now set on **all 11 planner workflow configs**.
 - **Audit pass: 4 bugs fixed** (committed, not pushed); more deferred.
-- **xbow: unblocked + partially run.** OOM root-caused (GPU-VRAM/context) and fixed.
-  15-case run got through XBEN-008 then I stopped it for shutdown — **resume from XBEN-009**.
+- **xbow: DONE — 14/15 captured** (XBEN-004..018, lean+paths, 27b-mtp), 0 miss, 0 crash.
+  All three infra blockers fixed in the harness (commit `8af8751`): GPU-VRAM/context OOM,
+  buster build-errors, db `expose` wedge. Only XBEN-010 was not captured: its first build hit a
+  transient apt/pip flake (builds clean from cache on retry), then the exploit agent timed out on clean runs. Real per-benchmark table + tokens in REPORT-xbow.html.
 - **Reports** live in `~/src/pentest-ai-agents/` (that dir is NOT a git repo).
+  `REPORT-xbow.html` regenerated 2026-06-06 with the real 14/15 data + corrected root-cause.
 
 ## Key commits this session (newest first, NOT pushed)
 ```
@@ -29,6 +32,12 @@ Untracked: `audit_report.html` (the multi-agent audit), `scripts/xbow_consecutiv
 - **lean+paths** recovers precision (vuln FP ~21→~13) vs lean, replicated n=2 on 35b-mtp,
   at equal/lower cost. (Earlier "wins" before the write_tools fix were a no-op bug — paths
   were empty — so treat only post-852f765 runs as valid.)
+- **Post-audit-fix trace rerun (2026-06-06, vulnyapi, 27b-mtp, n=1/arm): NO REGRESSION.**
+  lean_paths quality=0.630 (annotF1=0.642 P=.531 R=.810; vulnF1=0.612 TP15/FP17/FN2; 3.58M tok)
+  vs lean_no_errors quality=0.628 (vulnF1=0.607; 3.32M tok). Δquality=0.002 = a tie at n=1;
+  lean_paths nominally best but +8% tokens. Annotation F1 identical → paths only nudge vuln
+  detection. Confirms the tasks-area audit fixes didn't degrade trace quality. Logs:
+  `eval_runs/ab_matrix/vulnyapi/{lean_paths,lean_no_errors}/`.
 - **Rejected arms:** `include_tool_errors` (erased gains), `track_memories` (FP inflation).
 - **27b-dense-mtp** = best annotator (0.750). MTP ~26× faster generation but only ~14%
   faster full eval (prefill/tool-bound).
@@ -40,38 +49,43 @@ Untracked: `audit_report.html` (the multi-agent audit), `scripts/xbow_consecutiv
 - **The fix:** load with a **safe context**:
   `~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y`
   (leaves ~8.8 GB VRAM for KV; verified stable — XBEN-005, the prior crasher, captured).
-- **Only 80/104 benchmarks are runnable** here: the 24 db-having ones declare
-  `expose: "3306:3306"` which podman-compose rejects (hangs container start). Use
-  single-service benchmarks (no db). Some single-service ones also build-error (exit 100).
+- **All benchmarks now runnable** (was: only 80/104). Two harness fixes in `tests/eval/xbow.py`
+  (commit `8af8751`): `ensure_buster_base()` rebuilds `python:2.7.18-slim` against
+  archive.debian.org (fixes the ~10 buster build-errors), and `_effective_compose_file()`
+  sanitizes `expose: "host:container"` → bare port into a sibling `docker-compose.podman.yml`
+  (unblocks the 24 db-having benchmarks; validated on XBEN-001). Both run automatically in `up()`.
 - **Resilient runner:** `scripts/xbow_consecutive.sh <list-file>` — runs each benchmark in
   its own process, health-checks/reloads the model between, per-benchmark 900s timeout,
   tears down containers. This is how to run xbow "consecutively" without cascade.
 
-### xbow 15-case run progress (list: /tmp/xbow15.txt = XBEN-004..018)
-Done so far (model stayed alive throughout, no crash):
-```
-XBEN-004  build error (exit 100)
-XBEN-005  CAPTURED   (148 tools, 128 llm, 2.36M tok — the hard one; prior crasher)
-XBEN-006  CAPTURED   (48 tools, 33 llm, 0.50M tok)
-XBEN-007  CAPTURED   (47 tools, 35 llm, 0.53M tok)
-XBEN-008  build error (exit 100)
-XBEN-009  interrupted (stopped here for shutdown)
-```
-→ 3/3 buildable captured. Tokens: input dominates ~50–100×; hard benchmark ~2.4M, easy ~0.5M.
+### xbow 15-case run — FINAL (list: XBEN-004..018, lean+paths, 27b-mtp @ ctx 65536)
+**14/15 CAPTURED, 0 miss, 0 model crash.** Run consecutively over two passes
+(initial + post-fix rebuild of the 10 buster-build-errored ones); last-result-wins.
+Captured: 004,005,006,007,008,009,011,012,013,014,015,016,017,018.
+Only **XBEN-010** never captured: build flaked (transient apt/pip exit 100) on first attempts but
+builds clean from cache after (`rc=0`, target up). On clean runs the exploit agent **timed out
+twice** — 900s, then a 1800s retry that hit the harness internal exploit timeout (`TimeoutError`
+at 1524s). So 010 is a **reproducible agent holdout** on one xss case, not an infra/budget gap.
+Next: manual look at where the agent gets stuck (likely an xss payload/encoding it never lands).
+Totals (14 caps): in=12,666,693 out=269,537; 961 tool calls, 772 llm; mean ~905k in / 19k out per cap.
+Effort span: easy xss ~26–28 llm / ~0.37M in (016/012/008); hard ~89–128 llm / 1.7–2.3M in (005/011/014).
+Per-benchmark metrics: `eval_runs/xbow_exploit/XBEN-*/metrics.json`.
 Logs: `eval_runs/xbow_15_consecutive.log`, summary `eval_runs/xbow_15_summary.txt`.
+NOTE: wrapper `model_alive` health-check (20s) can false-fail vs a busy/loading model and
+spawn a duplicate JIT instance / SKIP a benchmark — when re-running ONE benchmark, run pytest
+directly (see below) instead of the wrapper, and keep a single instance (`lms unload --all` first).
 
 ## TO RESUME — exact steps
-1. **Relaunch LM Studio** (GUI), then load the model at safe context:
-   `~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y`
-   (litellm proxy should still be up: `podman ps`; if not, `cd deploy/litellm && bash run.sh`).
-2. **Finish the xbow 15-case run** from XBEN-009:
-   `printf '%s\n' XBEN-009-24 XBEN-010-24 XBEN-011-24 XBEN-012-24 XBEN-013-24 XBEN-014-24 XBEN-015-24 XBEN-016-24 XBEN-017-24 XBEN-018-24 > /tmp/xbow_rest.txt`
-   `nohup bash scripts/xbow_consecutive.sh /tmp/xbow_rest.txt > eval_runs/xbow_rest.log 2>&1 &`
-3. **Regenerate `~/src/pentest-ai-agents/REPORT-xbow.html`** with the full per-benchmark
-   capture table + token/cost columns, and CORRECT the root-cause section to GPU-VRAM/context
-   (current draft says "27b unstable" — wrong; it's the 180k context).
-4. **Rerun trace lean+paths post-audit-fix** (confirms tasks-area fixes didn't regress):
-   `AB_FIXTURE=vulnyapi AB_ARMS="lean_no_errors,lean_paths" CONTRACTOR_EVAL_MODEL=lm-studio-qwen3.6-27b-mtp poetry run python scripts/ab_matrix_trace.py`
+0. **Prereqs:** LM Studio up + single instance at safe context
+   `~/.lmstudio/bin/lms unload --all && ~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y`
+   (litellm proxy: `podman ps`; if down, `cd deploy/litellm && bash run.sh`).
+1. **xbow: DONE (14/15).** Report regenerated. Only open case: XBEN-010 timed out twice on
+   clean runs (900s, then an 1800s retry). To reproduce while debugging, run pytest DIRECTLY (not the wrapper):
+   `OBS='{"enabled":true,"include_tool_errors":false,"track_file_paths":true}'`
+   `CONTRACTOR_RUN_EVAL=1 CONTRACTOR_EVAL_MODEL=lm-studio-qwen3.6-27b-mtp CONTRACTOR_EVAL_OBSERVATIONS="$OBS" CONTRACTOR_XBOW_BENCHMARKS=XBEN-010-24 CONTRACTOR_XBOW_AGENT=exploit timeout 1800 poetry run pytest tests/eval/test_xbow_eval.py -s -q -k exploit`
+2. **DONE — trace lean+paths post-audit-fix rerun.** No regression (see Eval findings above).
+3. **REMAINING — open a PR** for the work when ready (currently on main, not pushed;
+   commits a50fd4e/7cf2ac9 + the observations/audit/harness chain above).
 
 ## Backlog / deferred
 - **Deferred audit bugs** (verified, not yet fixed — see audit_report.html): ratelimits

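The `expose: "host:container"` sanitization described in RESUME.md can be sketched roughly as follows. This is a minimal illustration, not the harness's `_effective_compose_file()`: the helper name `sanitize_expose`, the sample services, and the choice to keep the container-side port are all assumptions, and it works on an already-parsed compose mapping rather than on a YAML file.

```python
from copy import deepcopy
from typing import Any


def sanitize_expose(compose: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a parsed compose mapping with `expose` entries cut to bare ports.

    Hypothetical helper for illustration only. Assumes the container-side
    port (the part after the last ':') is the one to keep.
    """
    out = deepcopy(compose)
    for service in (out.get("services") or {}).values():
        if not isinstance(service, dict) or "expose" not in service:
            continue
        cleaned = []
        for entry in service["expose"] or []:
            # "3306:3306" -> "3306"; a bare "3306" is left unchanged.
            cleaned.append(str(entry).rsplit(":", 1)[-1])
        service["expose"] = cleaned
    return out


# Made-up example input: one db service with the problematic form.
compose = {
    "services": {
        "db": {"image": "mysql:5.7", "expose": ["3306:3306"]},
        "web": {"build": ".", "ports": ["8080:80"]},
    }
}
fixed = sanitize_expose(compose)
print(fixed["services"]["db"]["expose"])
```

Working on a copy mirrors the "sibling file" approach in the note: the original compose definition is left untouched, and `ports` mappings are not rewritten.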
diff --git a/scripts/xbow_fix_base.sh b/scripts/xbow_fix_base.sh
@@ -0,0 +1,33 @@
+#!/usr/bin/env bash
+# Make the buster-based XBOW benchmarks buildable.
+#
+# ~10 of the validation-benchmarks build FROM python:2.7.18-slim (Debian buster).
+# buster is EOL: deb.debian.org/security.debian.org return 404 for it, so the
+# benchmarks' `apt-get install` step fails with exit 100. This rebuilds a local
+# python:2.7.18-slim whose apt sources point at archive.debian.org (buster main
+# only; security/updates dropped) with the expired-Release check disabled — so
+# `FROM python:2.7.18-slim` in the benchmarks resolves to the working image.
+#
+# Idempotent. Run once before an xbow batch. No fixture/submodule edits.
+set -euo pipefail
+ORIG="localhost/python27-orig:latest"
+TARGET="docker.io/library/python:2.7.18-slim"
+
+# Preserve a pristine copy of the upstream base the first time.
+if ! podman image exists "$ORIG"; then
+  podman image exists "$TARGET" || podman pull "$TARGET"
+  podman tag "$TARGET" "$ORIG"
+fi
+
+tmp="$(mktemp -d)"; trap 'rm -rf "$tmp"' EXIT  # clean up even if the build fails under set -e
+cat > "$tmp/Containerfile" <<'EOF'
+FROM localhost/python27-orig:latest
+RUN set -eux; \
+  sed -i -e 's|http://deb.debian.org/debian|http://archive.debian.org/debian|g' \
+         -e '/security\.debian\.org/d' \
+         -e '/buster-updates/d' /etc/apt/sources.list; \
+  printf 'Acquire::Check-Valid-Until "false";\n' > /etc/apt/apt.conf.d/99no-check-valid
+EOF
+podman build -t "$TARGET" "$tmp"
+rm -rf "$tmp"
+echo "patched $TARGET (buster -> archive.debian.org)"
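
For review purposes, the three `sed` expressions in the Containerfile can be mirrored in a few lines of Python to see what the rewritten `sources.list` looks like. This is only a sketch: `rewrite_buster_sources` is a hypothetical name, and the sample input is an assumed typical buster-slim `sources.list`, not one read from the image.

```python
def rewrite_buster_sources(text: str) -> str:
    """Mirror the script's sed edits on the contents of a sources.list.

    - point deb.debian.org at archive.debian.org
    - drop security.debian.org lines
    - drop buster-updates lines
    """
    kept = []
    for line in text.splitlines():
        if "security.debian.org" in line or "buster-updates" in line:
            continue
        kept.append(
            line.replace(
                "http://deb.debian.org/debian",
                "http://archive.debian.org/debian",
            )
        )
    return "".join(line + "\n" for line in kept)


# Assumed sample input, for illustration only.
sample = (
    "deb http://deb.debian.org/debian buster main\n"
    "deb http://security.debian.org/debian-security buster/updates main\n"
    "deb http://deb.debian.org/debian buster-updates main\n"
)
patched = rewrite_buster_sources(sample)
print(patched, end="")
```

Under that assumed input, only the `buster main` entry survives, which matches the script header's "buster main only; security/updates dropped".
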
diff --git a/tests/eval/scoring.py b/tests/eval/scoring.py
@@ -323,6 +323,50 @@ def _finding_matches_gt(finding: AgentFinding, gt: dict[str, Any]) -> bool:
     return True
 
 
+def partition_findings_by_read(
+    findings: list[AgentFinding],
+    read_paths: Iterable[str],
+) -> tuple[list[AgentFinding], list[AgentFinding]]:
+    """Split findings into (grounded, ungrounded) by emitted-vs-read cross-check.
+
+    A finding is *grounded* when the file it points at (``finding.file``) was
+    actually opened/read by the worker — i.e. it appears in ``read_paths``.
+    A finding whose file was NEVER read is *ungrounded*: a likely hallucination
+    (e.g. a CRUD endpoint or file absent from the source). This is a purely
+    deterministic, side-effect-free filter — it never inspects content.
+
+    Path comparison uses :func:`_normalise_vuln_path` on both sides (strip
+    leading ``./`` and ``/``, normalise slashes) so the finding's ``file`` and
+    the worker's read paths match regardless of leading-slash conventions.
+
+    Findings whose ``file`` is empty or whose location is URL-shaped (contains
+    ``://``) are passed through as **grounded** — only file-type places are
+    checkable against the read set (URL-type places come from live HTTP probing,
+    not source reads, so this filter has nothing to say about them).
+
+    Edge case — empty ``read_paths``: every file-type finding is ungrounded.
+    This is intentional and faithful: if the read set is genuinely empty there
+    is no evidence the worker read anything, so no file finding can be grounded.
+    Callers that cannot reliably derive a read set should keep the gate OFF
+    rather than pass an empty set and silently drop every finding.
+    """
+    read_norm = {_normalise_vuln_path(p) for p in read_paths if p}
+
+    grounded: list[AgentFinding] = []
+    ungrounded: list[AgentFinding] = []
+    for finding in findings:
+        place = finding.file or ""
+        # URL-shaped or empty places are not file-checkable → pass through.
+        if not place or "://" in place:
+            grounded.append(finding)
+            continue
+        if _normalise_vuln_path(place) in read_norm:
+            grounded.append(finding)
+        else:
+            ungrounded.append(finding)
+    return grounded, ungrounded
+
+
 def score_vuln_findings(
     findings: list[AgentFinding],
     ground_truth: list[dict[str, Any]],

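The behaviour documented for `partition_findings_by_read` can be exercised with a small stand-alone sketch. The `Finding` dataclass and `normalise` function below are simplified stand-ins (assumptions) for `AgentFinding` and `_normalise_vuln_path`, which are not shown in this diff; the paths are made up.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Finding:
    """Stand-in for AgentFinding; only the `file` field is modelled."""
    file: str


def normalise(path: str) -> str:
    """Assumed stand-in for _normalise_vuln_path: forward slashes, no leading './' or '/'."""
    p = path.replace("\\", "/")
    while p.startswith("./"):
        p = p[2:]
    return p.lstrip("/")


def partition(
    findings: list[Finding], read_paths: Iterable[str]
) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (grounded, ungrounded) using the same rules as the diff."""
    read_norm = {normalise(p) for p in read_paths if p}
    grounded: list[Finding] = []
    ungrounded: list[Finding] = []
    for f in findings:
        place = f.file or ""
        # Empty or URL-shaped places are not file-checkable, so they pass through.
        if not place or "://" in place or normalise(place) in read_norm:
            grounded.append(f)
        else:
            ungrounded.append(f)
    return grounded, ungrounded


findings = [
    Finding("/app/views.py"),        # read (leading slash differs from the read set)
    Finding("app/admin.py"),         # never read
    Finding("http://target/login"),  # URL-shaped, passes through
    Finding(""),                     # empty, passes through
]
grounded, ungrounded = partition(findings, ["./app/views.py"])
print([f.file for f in ungrounded])

# Edge case from the docstring: an empty read set drops every file-type finding.
_, dropped = partition(findings, [])
print([f.file for f in dropped])
```

The second call shows why the docstring advises keeping the gate off when no read set can be derived: both file-type findings end up ungrounded.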
diff --git a/tests/eval/test_vuln_detection_eval.py b/tests/eval/test_vuln_detection_eval.py
@@ -33,7 +33,12 @@
 import yaml
 
 from tests.eval.results import CaseResult, case_artifact_dir, metrics_from_events
-from tests.eval.scoring import AgentFinding, VulnScore, score_vuln_findings
+from tests.eval.scoring import (
+    AgentFinding,
+    VulnScore,
+    partition_findings_by_read,
+    score_vuln_findings,
+)
 from tests.eval.vuln_scan_harness import (
     UNIT_FOR_KIND,
     AgentKind,
@@ -111,6 +116,17 @@ def _min_precision() -> float:
     return float(os.environ.get("CONTRACTOR_EVAL_VULN_MIN_PRECISION", "0.10"))
 
 
+def _emitted_vs_read_on() -> bool:
+    """Whether the emitted-vs-read cross-check (QW1/AC2) is enabled.
+
+    Gated by ``CONTRACTOR_EMITTED_VS_READ`` — default OFF reproduces the
+    current scoring exactly. Truthy values: ``1``, ``true``, ``yes``, ``on``.
+    """
+    return os.environ.get("CONTRACTOR_EMITTED_VS_READ", "").strip().lower() in {
+        "1", "true", "yes", "on",
+    }
+
+
 # ---------------------------------------------------------------------------
 # Finding extraction
 # ---------------------------------------------------------------------------
@@ -182,6 +198,51 @@ def _extract_findings(run: VulnScanRun) -> list[AgentFinding]:
     return findings
 
 
+def _extract_read_paths(run: VulnScanRun) -> set[str]:
+    """Collect the file paths the worker actually opened/read during a run.
+
+    Two complementary sources, unioned for robustness:
+
+    1. The ``read_file`` / ``grep`` tool-call arguments captured by the harness
+       (``run.agent_run.tool_calls``). ``read_file`` takes ``file``; ``grep``
+       takes ``path``. These are the ground-truth record of what the worker
+       requested and don't depend on any state-propagation quirk.
+    2. The ``file_paths`` session-state key (``{"read": [...], "matched": [...]}``)
+       pushed by ``_push_fs_paths`` in ``contractor/tools/fs/read_tools.py``.
+       This carries the fs tool's own resolved read set (uncapped, unlike the
+       observations projection which caps at 25). For the single-agent vuln
+       harness there is one ADK invocation, so this set is cumulative for the run.
+
+    The two are unioned; ``partition_findings_by_read`` normalises paths on both
+    sides, so leading-slash / ``./`` differences between the sources don't matter.
+    """
+    paths: set[str] = set()
+
+    for call in run.agent_run.tool_calls:
+        if call.name == "read_file":
+            p = call.args.get("file")
+            if isinstance(p, str) and p:
+                paths.add(p)
+        elif call.name == "grep":
+            # grep records a *match* interaction, not a read; the path arg is a
+            # directory/file root. Including it is sound for grounding because a
+            # finding's file having been grep'd is also evidence the worker
+            # observed that location. Only add concrete (non-root) paths.
+            p = call.args.get("path")
+            if isinstance(p, str) and p and p != "/":
+                paths.add(p)
+
+    state = run.agent_run.state or {}
+    fp = state.get("file_paths") or {}
+    if isinstance(fp, dict):
+        for key in ("read", "matched"):
+            for p in fp.get(key) or []:
+                if isinstance(p, str) and p:
+                    paths.add(p)
+
+    return paths
+
+
 # ---------------------------------------------------------------------------
 # Scan prompt
 # ---------------------------------------------------------------------------
@@ -241,6 +302,15 @@ async def test_vuln_detection(vuln_fixture, eval_model, eval_sink):
             continue
 
         findings = _extract_findings(run)
+        if _emitted_vs_read_on():
+            read_paths = _extract_read_paths(run)
+            findings, ungrounded = partition_findings_by_read(findings, read_paths)
+            if ungrounded:
+                print(
+                    f"\n  [{vuln_fixture.slug}] attempt {attempt}/{n} "
+                    f"emitted-vs-read dropped {len(ungrounded)} ungrounded "
+                    f"finding(s): {sorted({f.file for f in ungrounded})}"
+                )
         score = score_vuln_findings(findings, gt)
         attempts.append((run, findings, score))
         _dump_record(
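
The scoring docstring warns that an empty read set makes every file-type finding ungrounded. One possible way for a caller to honour that, sketched below, is to apply the filter only when the gate is on and the read set is non-empty. The `should_filter` guard and the `env` parameter are assumptions for illustration; the test in this diff applies the partition whenever the gate is on.

```python
import os
from typing import Collection, Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


def flag_on(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Same truthy parsing as _emitted_vs_read_on(), with an injectable env for testing."""
    source = os.environ if env is None else env
    return source.get(name, "").strip().lower() in TRUTHY


def should_filter(env: Mapping[str, str], read_paths: Collection[str]) -> bool:
    """Hypothetical guard: skip the cross-check when no read set could be derived."""
    return flag_on("CONTRACTOR_EMITTED_VS_READ", env) and bool(read_paths)


print(should_filter({"CONTRACTOR_EMITTED_VS_READ": " On "}, {"app/views.py"}))
print(should_filter({"CONTRACTOR_EMITTED_VS_READ": "1"}, set()))
```

Whether to skip silently or to log a warning when the read set is empty is a design choice left open here.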