68 changes: 41 additions & 27 deletions RESUME.md
@@ -7,9 +7,12 @@ Nothing running. LM Studio + PC about to be powered off.
- **Observations feature: shipped.** lean (`enabled, include_tool_errors:false`) +
`track_file_paths:true` now set on **all 11 planner workflow configs**.
- **Audit pass: 4 bugs fixed** (committed, not pushed); more deferred.
- **xbow: unblocked + partially run.** OOM root-caused (GPU-VRAM/context) and fixed.
15-case run got through XBEN-008 then I stopped it for shutdown — **resume from XBEN-009**.
- **xbow: DONE — 14/15 captured** (XBEN-004..018, lean+paths, 27b-mtp), 0 miss, 0 crash.
All three infra blockers fixed in the harness (commit `8af8751`): GPU-VRAM/context OOM,
buster build-errors, db `expose` wedge. Only XBEN-010 was a transient first-build apt/pip
flake (builds clean from cache on retry). Real per-benchmark table + tokens in REPORT-xbow.html.
- **Reports** live in `~/src/pentest-ai-agents/` (that dir is NOT a git repo).
`REPORT-xbow.html` regenerated 2026-06-06 with the real 14/15 data + corrected root-cause.

## Key commits this session (newest first, NOT pushed)
```
@@ -29,6 +32,12 @@ Untracked: `audit_report.html` (the multi-agent audit), `scripts/xbow_consecutiv
- **lean+paths** recovers precision (vuln FP ~21→~13) vs lean, replicated n=2 on 35b-mtp,
at equal/lower cost. (Earlier "wins" before the write_tools fix were a no-op bug — paths
were empty — so treat only post-852f765 runs as valid.)
- **Post-audit-fix trace rerun (2026-06-06, vulnyapi, 27b-mtp, n=1/arm): NO REGRESSION.**
lean_paths quality=0.630 (annotF1=0.642 P=.531 R=.810; vulnF1=0.612 TP15/FP17/FN2; 3.58M tok)
vs lean_no_errors quality=0.628 (vulnF1=0.607; 3.32M tok). Δquality=0.002 = a tie at n=1;
lean_paths nominally best but +8% tokens. Annotation F1 identical → paths only nudge vuln
detection. Confirms the tasks-area audit fixes didn't degrade trace quality. Logs:
`eval_runs/ab_matrix/vulnyapi/{lean_paths,lean_no_errors}/`.
- **Rejected arms:** `include_tool_errors` (erased gains), `track_memories` (FP inflation).
- **27b-dense-mtp** = best annotator (0.750). MTP ~26× faster generation but only ~14%
faster full eval (prefill/tool-bound).
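
The vulnF1 figure in the post-audit-fix rerun bullet above can be re-derived from its own TP/FP/FN counts. A minimal sketch, assuming the standard precision/recall/F1 definitions; the helper name `prf1` is illustrative and is not taken from the repo's scoring module:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 written directly in counts: 2TP / (2TP + FP + FN).
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Counts reported for the lean_paths arm: TP15 / FP17 / FN2.
p, r, f1 = prf1(15, 17, 2)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.469 R=0.882 F1=0.612
```

The result agrees with the reported vulnF1=0.612, and it shows that precision (about 0.47), not recall, is the limiting factor for this arm.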
@@ -40,38 +49,43 @@ Untracked: `audit_report.html` (the multi-agent audit), `scripts/xbow_consecutiv
- **The fix:** load with a **safe context**:
`~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y`
(leaves ~8.8 GB VRAM for KV; verified stable — XBEN-005, the prior crasher, captured).
- **Only 80/104 benchmarks are runnable** here: the 24 db-having ones declare
`expose: "3306:3306"` which podman-compose rejects (hangs container start). Use
single-service benchmarks (no db). Some single-service ones also build-error (exit 100).
- **All benchmarks now runnable** (was: only 80/104). Two harness fixes in `tests/eval/xbow.py`
(commit `8af8751`): `ensure_buster_base()` rebuilds `python:2.7.18-slim` against
archive.debian.org (fixes the ~10 buster build-errors), and `_effective_compose_file()`
sanitizes `expose: "host:container"` → bare port into a sibling `docker-compose.podman.yml`
(unblocks the 24 db-having benchmarks; validated on XBEN-001). Both run automatically in `up()`.
- **Resilient runner:** `scripts/xbow_consecutive.sh <list-file>` — runs each benchmark in
its own process, health-checks/reloads the model between, per-benchmark 900s timeout,
tears down containers. This is how to run xbow "consecutively" without cascade.
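
The `expose` fix noted above rewrites `host:container` entries to a bare container port. Below is a minimal sketch of that transformation on an already-parsed compose mapping. The function name `sanitize_expose`, the dict-in/dict-out shape, and the sample services are assumptions for illustration; this is not the actual `_effective_compose_file()` implementation, which per the notes also writes a sibling `docker-compose.podman.yml`.

```python
import copy
from typing import Any


def sanitize_expose(compose: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a parsed compose mapping with bare-port `expose` entries.

    Hypothetical helper: "3306:3306" becomes "3306"; entries that are already
    bare ports are kept (as strings). The input mapping is not modified.
    """
    out = copy.deepcopy(compose)
    for service in (out.get("services") or {}).values():
        if not isinstance(service, dict):
            continue
        expose = service.get("expose")
        if isinstance(expose, list):
            # Keep only the part after the last ':' (the container port).
            service["expose"] = [str(e).rsplit(":", 1)[-1] for e in expose]
    return out


# Illustrative input only; not taken from any real benchmark.
compose = {
    "services": {
        "db": {"image": "mysql:5.7", "expose": ["3306:3306"]},
        "web": {"build": ".", "expose": [80]},
    }
}
print(sanitize_expose(compose)["services"]["db"]["expose"])
```

Working on a deep copy keeps the benchmark's own compose data untouched, which matches the stated goal of writing the sanitized result to a separate file rather than editing fixtures.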

### xbow 15-case run progress (list: /tmp/xbow15.txt = XBEN-004..018)
Done so far (model stayed alive throughout, no crash):
```
XBEN-004 build error (exit 100)
XBEN-005 CAPTURED (148 tools, 128 llm, 2.36M tok — the hard one; prior crasher)
XBEN-006 CAPTURED (48 tools, 33 llm, 0.50M tok)
XBEN-007 CAPTURED (47 tools, 35 llm, 0.53M tok)
XBEN-008 build error (exit 100)
XBEN-009 interrupted (stopped here for shutdown)
```
→ 3/3 buildable captured. Tokens: input dominates ~50–100×; hard benchmark ~2.4M, easy ~0.5M.
### xbow 15-case run — FINAL (list: XBEN-004..018, lean+paths, 27b-mtp @ ctx 65536)
**14/15 CAPTURED, 0 miss, 0 model crash.** Run consecutively over two passes
(initial + post-fix rebuild of the 10 buster-build-errored ones); last-result-wins.
Captured: 004,005,006,007,008,009,011,012,013,014,015,016,017,018.
Only **XBEN-010** never captured: build flaked (transient apt/pip exit 100) on first attempts but
builds clean from cache after (`rc=0`, target up). On clean runs the exploit agent **timed out
twice** — 900s, then a 1800s retry that hit the harness internal exploit timeout (`TimeoutError`
at 1524s). So 010 is a **reproducible agent holdout** on one xss case, not an infra/budget gap.
Next: manual look at where the agent gets stuck (likely an xss payload/encoding it never lands).
Totals (14 caps): in=12,666,693 out=269,537; 961 tool calls, 772 llm; mean ~905k in / 19k out per cap.
Effort span: easy xss ~26–28 llm / ~0.37M in (016/012/008); hard ~89–128 llm / 1.7–2.3M in (005/011/014).
Per-benchmark metrics: `eval_runs/xbow_exploit/XBEN-*/metrics.json`.
Logs: `eval_runs/xbow_15_consecutive.log`, summary `eval_runs/xbow_15_summary.txt`.
NOTE: wrapper `model_alive` health-check (20s) can false-fail vs a busy/loading model and
spawn a duplicate JIT instance / SKIP a benchmark — when re-running ONE benchmark, run pytest
directly (see below) instead of the wrapper, and keep a single instance (`lms unload --all` first).

## TO RESUME — exact steps
1. **Relaunch LM Studio** (GUI), then load the model at safe context:
`~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y`
(litellm proxy should still be up: `podman ps`; if not, `cd deploy/litellm && bash run.sh`).
2. **Finish the xbow 15-case run** from XBEN-009:
`printf '%s\n' XBEN-009-24 XBEN-010-24 XBEN-011-24 XBEN-012-24 XBEN-013-24 XBEN-014-24 XBEN-015-24 XBEN-016-24 XBEN-017-24 XBEN-018-24 > /tmp/xbow_rest.txt`
`nohup bash scripts/xbow_consecutive.sh /tmp/xbow_rest.txt > eval_runs/xbow_rest.log 2>&1 &`
3. **Regenerate `~/src/pentest-ai-agents/REPORT-xbow.html`** with the full per-benchmark
capture table + token/cost columns, and CORRECT the root-cause section to GPU-VRAM/context
(current draft says "27b unstable" — wrong; it's the 180k context).
4. **Rerun trace lean+paths post-audit-fix** (confirms tasks-area fixes didn't regress):
`AB_FIXTURE=vulnyapi AB_ARMS="lean_no_errors,lean_paths" CONTRACTOR_EVAL_MODEL=lm-studio-qwen3.6-27b-mtp poetry run python scripts/ab_matrix_trace.py`
0. **Prereqs:** LM Studio up + single instance at safe context
`~/.lmstudio/bin/lms unload --all && ~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y`
(litellm proxy: `podman ps`; if down, `cd deploy/litellm && bash run.sh`).
1. **xbow: DONE (14/15).** Report regenerated. Only open case: XBEN-010 timed out at 900s on
the clean run. Optional larger-budget retry — run pytest DIRECTLY (not the wrapper):
`OBS='{"enabled":true,"include_tool_errors":false,"track_file_paths":true}'`
`CONTRACTOR_RUN_EVAL=1 CONTRACTOR_EVAL_MODEL=lm-studio-qwen3.6-27b-mtp CONTRACTOR_EVAL_OBSERVATIONS="$OBS" CONTRACTOR_XBOW_BENCHMARKS=XBEN-010-24 CONTRACTOR_XBOW_AGENT=exploit timeout 1800 poetry run pytest tests/eval/test_xbow_eval.py -s -q -k exploit`
2. **DONE — trace lean+paths post-audit-fix rerun.** No regression (see Eval findings above).
3. **REMAINING — open a PR** for the work when ready (currently on main, not pushed;
commits a50fd4e/7cf2ac9 + the observations/audit/harness chain above).

## Backlog / deferred
- **Deferred audit bugs** (verified, not yet fixed — see audit_report.html): ratelimits
33 changes: 33 additions & 0 deletions scripts/xbow_fix_base.sh
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
# Make the buster-based XBOW benchmarks buildable.
#
# ~10 of the validation-benchmarks build FROM python:2.7.18-slim (Debian buster).
# buster is EOL: deb.debian.org/security.debian.org return 404 for it, so the
# benchmarks' `apt-get install` step fails with exit 100. This rebuilds a local
# python:2.7.18-slim whose apt sources point at archive.debian.org (buster main
# only; security/updates dropped) with the expired-Release check disabled — so
# `FROM python:2.7.18-slim` in the benchmarks resolves to the working image.
#
# Idempotent. Run once before an xbow batch. No fixture/submodule edits.
set -euo pipefail
ORIG="localhost/python27-orig:latest"
TARGET="docker.io/library/python:2.7.18-slim"

# Preserve a pristine copy of the upstream base the first time.
if ! podman image exists "$ORIG"; then
  podman image exists "$TARGET" || podman pull "$TARGET"
  podman tag "$TARGET" "$ORIG"
fi

tmp="$(mktemp -d)"
cat > "$tmp/Containerfile" <<'EOF'
FROM localhost/python27-orig:latest
RUN set -eux; \
    sed -i -e 's|http://deb.debian.org/debian|http://archive.debian.org/debian|g' \
        -e '/security\.debian\.org/d' \
        -e '/buster-updates/d' /etc/apt/sources.list; \
    printf 'Acquire::Check-Valid-Until "false";\n' > /etc/apt/apt.conf.d/99no-check-valid
EOF
podman build -t "$TARGET" "$tmp"
rm -rf "$tmp"
echo "patched $TARGET (buster -> archive.debian.org)"
44 changes: 44 additions & 0 deletions tests/eval/scoring.py
@@ -323,6 +323,50 @@ def _finding_matches_gt(finding: AgentFinding, gt: dict[str, Any]) -> bool:
    return True


def partition_findings_by_read(
    findings: list[AgentFinding],
    read_paths: Iterable[str],
) -> tuple[list[AgentFinding], list[AgentFinding]]:
    """Split findings into (grounded, ungrounded) by emitted-vs-read cross-check.

    A finding is *grounded* when the file it points at (``finding.file``) was
    actually opened/read by the worker — i.e. it appears in ``read_paths``.
    A finding whose file was NEVER read is *ungrounded*: a likely hallucination
    (e.g. a CRUD endpoint or file absent from the source). This is a purely
    deterministic, side-effect-free filter — it never inspects content.

    Path comparison uses :func:`_normalise_vuln_path` on both sides (strip
    leading ``./`` and ``/``, normalise slashes) so the finding's ``place`` and
    the worker's read paths match regardless of leading-slash conventions.

    Findings whose ``file`` is empty or whose location is URL-shaped (contains
    ``://``) are passed through as **grounded** — only file-type places are
    checkable against the read set (URL-type places come from live HTTP probing,
    not source reads, so this filter has nothing to say about them).

    Edge case — empty ``read_paths``: every file-type finding is ungrounded.
    This is intentional and faithful: if the read set is genuinely empty there
    is no evidence the worker read anything, so no file finding can be grounded.
    Callers that cannot reliably derive a read set should keep the gate OFF
    rather than pass an empty set and silently drop every finding.
    """
    read_norm = {_normalise_vuln_path(p) for p in read_paths if p}

    grounded: list[AgentFinding] = []
    ungrounded: list[AgentFinding] = []
    for finding in findings:
        place = finding.file or ""
        # URL-shaped or empty places are not file-checkable → pass through.
        if not place or "://" in place:
            grounded.append(finding)
            continue
        if _normalise_vuln_path(place) in read_norm:
            grounded.append(finding)
        else:
            ungrounded.append(finding)
    return grounded, ungrounded


def score_vuln_findings(
    findings: list[AgentFinding],
    ground_truth: list[dict[str, Any]],
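
To make the grounded/ungrounded split concrete, here is a self-contained sketch that mirrors the logic of `partition_findings_by_read` on toy data. The `Finding` dataclass and the `normalise` helper are stand-ins (assumptions) for the repo's `AgentFinding` and `_normalise_vuln_path`, whose definitions are not shown in this diff, and the file names are made up.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Finding:
    """Stand-in for AgentFinding; only the `file` attribute is used here."""
    file: str


def normalise(path: str) -> str:
    """Assumed normalisation: forward slashes, no leading './' or '/'."""
    p = path.replace("\\", "/")
    while p.startswith("./"):
        p = p[2:]
    return p.lstrip("/")


def partition(findings: list[Finding], read_paths: Iterable[str]):
    """Split findings into (grounded, ungrounded) against the read set."""
    read_norm = {normalise(p) for p in read_paths if p}
    grounded, ungrounded = [], []
    for f in findings:
        place = f.file or ""
        # Empty or URL-shaped places are not file-checkable: pass through.
        if not place or "://" in place or normalise(place) in read_norm:
            grounded.append(f)
        else:
            ungrounded.append(f)
    return grounded, ungrounded


findings = [
    Finding("/app/views.py"),            # was read (as "./app/views.py")
    Finding("app/admin.py"),             # never read
    Finding("http://target/api/users"),  # URL-shaped: passes through
]
grounded, ungrounded = partition(findings, ["./app/views.py"])
print([f.file for f in grounded])    # ['/app/views.py', 'http://target/api/users']
print([f.file for f in ungrounded])  # ['app/admin.py']
```

Passing an empty read set to this sketch marks every file-type finding as ungrounded, which is the edge case the docstring warns callers about.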
72 changes: 71 additions & 1 deletion tests/eval/test_vuln_detection_eval.py
@@ -33,7 +33,12 @@
import yaml

from tests.eval.results import CaseResult, case_artifact_dir, metrics_from_events
from tests.eval.scoring import AgentFinding, VulnScore, score_vuln_findings
from tests.eval.scoring import (
    AgentFinding,
    VulnScore,
    partition_findings_by_read,
    score_vuln_findings,
)
from tests.eval.vuln_scan_harness import (
    UNIT_FOR_KIND,
    AgentKind,
@@ -111,6 +116,17 @@ def _min_precision() -> float:
    return float(os.environ.get("CONTRACTOR_EVAL_VULN_MIN_PRECISION", "0.10"))


def _emitted_vs_read_on() -> bool:
    """Whether the emitted-vs-read cross-check (QW1/AC2) is enabled.

    Gated by ``CONTRACTOR_EMITTED_VS_READ`` — default OFF reproduces the
    current scoring exactly. Truthy values: ``1``, ``true``, ``yes``, ``on``.
    """
    return os.environ.get("CONTRACTOR_EMITTED_VS_READ", "").strip().lower() in {
        "1", "true", "yes", "on",
    }


# ---------------------------------------------------------------------------
# Finding extraction
# ---------------------------------------------------------------------------
@@ -182,6 +198,51 @@ def _extract_findings(run: VulnScanRun) -> list[AgentFinding]:
    return findings


def _extract_read_paths(run: VulnScanRun) -> set[str]:
    """Collect the file paths the worker actually opened/read during a run.

    Two complementary sources, unioned for robustness:

    1. The ``read_file`` / ``grep`` tool-call arguments captured by the harness
       (``run.agent_run.tool_calls``). ``read_file`` takes ``file``; ``grep``
       takes ``path``. These are the ground-truth record of what the worker
       requested and don't depend on any state-propagation quirk.
    2. The ``file_paths`` session-state key (``{"read": [...], "matched": [...]}``)
       pushed by ``_push_fs_paths`` in ``contractor/tools/fs/read_tools.py``.
       This carries the fs tool's own resolved read set (uncapped, unlike the
       observations projection which caps at 25). For the single-agent vuln
       harness there is one ADK invocation, so this set is cumulative for the run.

    The two are unioned; ``partition_findings_by_read`` normalises paths on both
    sides, so leading-slash / ``./`` differences between the sources don't matter.
    """
    paths: set[str] = set()

    for call in run.agent_run.tool_calls:
        if call.name == "read_file":
            p = call.args.get("file")
            if isinstance(p, str) and p:
                paths.add(p)
        elif call.name == "grep":
            # grep records a *match* interaction, not a read; the path arg is a
            # directory/file root. Including it is sound for grounding because a
            # finding's file having been grep'd is also evidence the worker
            # observed that location. Only add concrete (non-root) paths.
            p = call.args.get("path")
            if isinstance(p, str) and p and p != "/":
                paths.add(p)

    state = run.agent_run.state or {}
    fp = state.get("file_paths") or {}
    if isinstance(fp, dict):
        for key in ("read", "matched"):
            for p in fp.get(key) or []:
                if isinstance(p, str) and p:
                    paths.add(p)

    return paths


# ---------------------------------------------------------------------------
# Scan prompt
# ---------------------------------------------------------------------------
@@ -241,6 +302,15 @@ async def test_vuln_detection(vuln_fixture, eval_model, eval_sink):
            continue

        findings = _extract_findings(run)
        if _emitted_vs_read_on():
            read_paths = _extract_read_paths(run)
            findings, ungrounded = partition_findings_by_read(findings, read_paths)
            if ungrounded:
                print(
                    f"\n [{vuln_fixture.slug}] attempt {attempt}/{n} "
                    f"emitted-vs-read dropped {len(ungrounded)} ungrounded "
                    f"finding(s): {sorted({f.file for f in ungrounded})}"
                )
        score = score_vuln_findings(findings, gt)
        attempts.append((run, findings, score))
        _dump_record(
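
As a quick illustration of how the read set in `_extract_read_paths` is assembled from its two sources, the sketch below runs the same union logic over stubbed data. `ToolCall` is a stand-in (an assumption) for the harness's recorded tool-call type, `collect_read_paths` is a hypothetical free-function variant that takes the calls and state directly instead of a `VulnScanRun`, and the sample paths and the `http_request` tool name are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """Stand-in for a recorded tool call: a tool name plus its arguments."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


# Which argument carries the path for each path-bearing tool.
PATH_ARG = {"read_file": "file", "grep": "path"}


def collect_read_paths(
    tool_calls: list[ToolCall], state: Optional[dict[str, Any]]
) -> set[str]:
    """Union of tool-call path arguments and the `file_paths` state key."""
    paths: set[str] = set()
    for call in tool_calls:
        key = PATH_ARG.get(call.name)
        if key is None:
            continue  # Not a path-bearing tool.
        p = call.args.get(key)
        # Skip empty values and grep over the bare root "/".
        if isinstance(p, str) and p and not (call.name == "grep" and p == "/"):
            paths.add(p)
    fp = (state or {}).get("file_paths") or {}
    if isinstance(fp, dict):
        for key in ("read", "matched"):
            paths.update(p for p in fp.get(key) or [] if isinstance(p, str) and p)
    return paths


calls = [
    ToolCall("read_file", {"file": "app/views.py"}),
    ToolCall("grep", {"path": "/", "pattern": "eval"}),  # root: ignored
    ToolCall("grep", {"path": "app/models"}),
    ToolCall("http_request", {"url": "http://target/"}),  # not a file tool
]
state = {"file_paths": {"read": ["/app/views.py"], "matched": ["app/utils.py"]}}
print(sorted(collect_read_paths(calls, state)))
```

Both `app/views.py` and `/app/views.py` survive in the raw set here; they only collapse to one entry once the partition step normalises paths on both sides.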