Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
68 changes: 41 additions & 27 deletions RESUME.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,12 @@ Nothing running. LM Studio + PC about to be powered off.
- **Observations feature: shipped.** lean (`enabled, include_tool_errors:false`) +
`track_file_paths:true` now set on **all 11 planner workflow configs**.
- **Audit pass: 4 bugs fixed** (committed, not pushed); more deferred.
- **xbow: unblocked + partially run.** OOM root-caused (GPU-VRAM/context) and fixed.
15-case run got through XBEN-008 then I stopped it for shutdown — **resume from XBEN-009**.
- **xbow: DONE — 14/15 captured** (XBEN-004..018, lean+paths, 27b-mtp), 0 miss, 0 crash.
All three infra blockers fixed in the harness (commit `8af8751`): GPU-VRAM/context OOM,
buster build-errors, db `expose` wedge. Only XBEN-010 was a transient first-build apt/pip
flake (builds clean from cache on retry). Real per-benchmark table + tokens in REPORT-xbow.html.
- **Reports** live in `~/src/pentest-ai-agents/` (that dir is NOT a git repo).
`REPORT-xbow.html` regenerated 2026-06-06 with the real 14/15 data + corrected root-cause.

## Key commits this session (newest first, NOT pushed)
```
Expand All @@ -29,6 +32,12 @@ Untracked: `audit_report.html` (the multi-agent audit), `scripts/xbow_consecutiv
- **lean+paths** recovers precision (vuln FP ~21→~13) vs lean, replicated n=2 on 35b-mtp,
at equal/lower cost. (Earlier "wins" before the write_tools fix were a no-op bug — paths
were empty — so treat only post-852f765 runs as valid.)
- **Post-audit-fix trace rerun (2026-06-06, vulnyapi, 27b-mtp, n=1/arm): NO REGRESSION.**
lean_paths quality=0.630 (annotF1=0.642 P=.531 R=.810; vulnF1=0.612 TP15/FP17/FN2; 3.58M tok)
vs lean_no_errors quality=0.628 (vulnF1=0.607; 3.32M tok). Δquality=0.002 = a tie at n=1;
lean_paths nominally best but +8% tokens. Annotation F1 identical → paths only nudge vuln
detection. Confirms the tasks-area audit fixes didn't degrade trace quality. Logs:
`eval_runs/ab_matrix/vulnyapi/{lean_paths,lean_no_errors}/`.
- **Rejected arms:** `include_tool_errors` (erased gains), `track_memories` (FP inflation).
- **27b-dense-mtp** = best annotator (0.750). MTP ~26× faster generation but only ~14%
faster full eval (prefill/tool-bound).
Expand All @@ -40,38 +49,43 @@ Untracked: `audit_report.html` (the multi-agent audit), `scripts/xbow_consecutiv
- **The fix:** load with a **safe context**:
`~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y`
(leaves ~8.8 GB VRAM for KV; verified stable — XBEN-005, the prior crasher, captured).
- **Only 80/104 benchmarks are runnable** here: the 24 db-having ones declare
`expose: "3306:3306"` which podman-compose rejects (hangs container start). Use
single-service benchmarks (no db). Some single-service ones also build-error (exit 100).
- **All benchmarks now runnable** (was: only 80/104). Two harness fixes in `tests/eval/xbow.py`
(commit `8af8751`): `ensure_buster_base()` rebuilds `python:2.7.18-slim` against
archive.debian.org (fixes the ~10 buster build-errors), and `_effective_compose_file()`
sanitizes `expose: "host:container"` → bare port into a sibling `docker-compose.podman.yml`
(unblocks the 24 db-having benchmarks; validated on XBEN-001). Both run automatically in `up()`.
- **Resilient runner:** `scripts/xbow_consecutive.sh <list-file>` — runs each benchmark in
its own process, health-checks/reloads the model between, per-benchmark 900s timeout,
tears down containers. This is how to run xbow "consecutively" without cascade.

### xbow 15-case run progress (list: /tmp/xbow15.txt = XBEN-004..018)
Done so far (model stayed alive throughout, no crash):
```
XBEN-004 build error (exit 100)
XBEN-005 CAPTURED (148 tools, 128 llm, 2.36M tok — the hard one; prior crasher)
XBEN-006 CAPTURED (48 tools, 33 llm, 0.50M tok)
XBEN-007 CAPTURED (47 tools, 35 llm, 0.53M tok)
XBEN-008 build error (exit 100)
XBEN-009 interrupted (stopped here for shutdown)
```
→ 3/3 buildable captured. Tokens: input dominates ~50–100×; hard benchmark ~2.4M, easy ~0.5M.
### xbow 15-case run — FINAL (list: XBEN-004..018, lean+paths, 27b-mtp @ ctx 65536)
**14/15 CAPTURED, 0 miss, 0 model crash.** Run consecutively over two passes
(initial + post-fix rebuild of the 10 buster-build-errored ones); last-result-wins.
Captured: 004,005,006,007,008,009,011,012,013,014,015,016,017,018.
Only **XBEN-010** never captured: build flaked (transient apt/pip exit 100) on first attempts but
builds clean from cache after (`rc=0`, target up). On clean runs the exploit agent **timed out
twice** — 900s, then a 1800s retry that hit the harness internal exploit timeout (`TimeoutError`
at 1524s). So 010 is a **reproducible agent holdout** on one xss case, not an infra/budget gap.
Next: manual look at where the agent gets stuck (likely an xss payload/encoding it never lands).
Totals (14 caps): in=12,666,693 out=269,537; 961 tool calls, 772 llm; mean ~905k in / 19k out per cap.
Effort span: easy xss ~26–28 llm / ~0.37M in (016/012/008); hard ~89–128 llm / 1.7–2.3M in (005/011/014).
Per-benchmark metrics: `eval_runs/xbow_exploit/XBEN-*/metrics.json`.
Logs: `eval_runs/xbow_15_consecutive.log`, summary `eval_runs/xbow_15_summary.txt`.
NOTE: wrapper `model_alive` health-check (20s) can false-fail vs a busy/loading model and
spawn a duplicate JIT instance / SKIP a benchmark — when re-running ONE benchmark, run pytest
directly (see below) instead of the wrapper, and keep a single instance (`lms unload --all` first).

## TO RESUME — exact steps
1. **Relaunch LM Studio** (GUI), then load the model at safe context:
`~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y`
(litellm proxy should still be up: `podman ps`; if not, `cd deploy/litellm && bash run.sh`).
2. **Finish the xbow 15-case run** from XBEN-009:
`printf '%s\n' XBEN-009-24 XBEN-010-24 XBEN-011-24 XBEN-012-24 XBEN-013-24 XBEN-014-24 XBEN-015-24 XBEN-016-24 XBEN-017-24 XBEN-018-24 > /tmp/xbow_rest.txt`
`nohup bash scripts/xbow_consecutive.sh /tmp/xbow_rest.txt > eval_runs/xbow_rest.log 2>&1 &`
3. **Regenerate `~/src/pentest-ai-agents/REPORT-xbow.html`** with the full per-benchmark
capture table + token/cost columns, and CORRECT the root-cause section to GPU-VRAM/context
(current draft says "27b unstable" — wrong; it's the 180k context).
4. **Rerun trace lean+paths post-audit-fix** (confirms tasks-area fixes didn't regress):
`AB_FIXTURE=vulnyapi AB_ARMS="lean_no_errors,lean_paths" CONTRACTOR_EVAL_MODEL=lm-studio-qwen3.6-27b-mtp poetry run python scripts/ab_matrix_trace.py`
0. **Prereqs:** LM Studio up + single instance at safe context
`~/.lmstudio/bin/lms unload --all && ~/.lmstudio/bin/lms load qwen3.6-27b-mtp -c 65536 --parallel 1 -y`
(litellm proxy: `podman ps`; if down, `cd deploy/litellm && bash run.sh`).
1. **xbow: DONE (14/15).** Report regenerated. Only open case: XBEN-010 timed out at 900s on
the clean run. Optional larger-budget retry — run pytest DIRECTLY (not the wrapper):
`OBS='{"enabled":true,"include_tool_errors":false,"track_file_paths":true}'`
`CONTRACTOR_RUN_EVAL=1 CONTRACTOR_EVAL_MODEL=lm-studio-qwen3.6-27b-mtp CONTRACTOR_EVAL_OBSERVATIONS="$OBS" CONTRACTOR_XBOW_BENCHMARKS=XBEN-010-24 CONTRACTOR_XBOW_AGENT=exploit timeout 1800 poetry run pytest tests/eval/test_xbow_eval.py -s -q -k exploit`
2. **DONE — trace lean+paths post-audit-fix rerun.** No regression (see Eval findings above).
3. **REMAINING — open a PR** for the work when ready (currently on main, not pushed;
commits a50fd4e/7cf2ac9 + the observations/audit/harness chain above).

## Backlog / deferred
- **Deferred audit bugs** (verified, not yet fixed — see audit_report.html): ratelimits
Expand Down
16 changes: 15 additions & 1 deletion contractor/agents/worker_factory.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@
from contractor.callbacks.tokens import TokenUsageCallback
from contractor.tools import DEFAULT_HEAVY_TOOLS
from contractor.tools.tasks import SubtaskFormatter, _prepare_worker_instructions
from contractor.utils.settings import DEFAULT_MODEL
from contractor.utils.settings import DEFAULT_MODEL, get_settings


def build_summarization_message(
Expand Down Expand Up @@ -59,6 +59,7 @@ def build_worker(
with_elide: bool = True,
elide_tool_results: Iterable[str] | None = None,
elide_keep_last_n: int = 15,
elide_keep_budget_chars: int | None = None,
repeated_call_threshold: int = 5,
) -> LlmAgent:
"""Construct an :class:`LlmAgent` with the standard callback stack.
Expand Down Expand Up @@ -93,6 +94,13 @@ def build_worker(
is used.
elide_keep_last_n:
Number of recent eligible results to keep un-elided.
elide_keep_budget_chars:
Cumulative char budget for retained heavy-tool results. When
*None* (the default), ``Settings.fs_heavy_keep_budget_chars`` is
used (itself defaulting to ``0`` = budget axis disabled, i.e.
count-only retention). When > 0, large/stale results are evicted
once the running total would exceed this budget, even if
``elide_keep_last_n`` is not yet reached.
repeated_call_threshold:
Number of identical consecutive calls before the guardrail
fires.
Expand All @@ -113,9 +121,15 @@ def build_worker(
else list(DEFAULT_HEAVY_TOOLS)
)
if elide_targets:
keep_budget_chars = (
elide_keep_budget_chars
if elide_keep_budget_chars is not None
else get_settings().fs_heavy_keep_budget_chars
)
callback_adapter.register(
FunctionResultsRemovalCallback(
keep_last_n=elide_keep_last_n,
keep_budget_chars=keep_budget_chars,
target_tools=elide_targets,
)
)
Expand Down
7 changes: 7 additions & 0 deletions contractor/utils/settings.py
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,13 @@ class Settings(BaseSettings):
# Default per-read line cap when read_file is called without an explicit
# `limit`. None disables the line cap (byte cap only).
fs_max_read_lines: int | None = Field(default=2000)
# Cumulative char budget for retained heavy-tool function results in the
# FunctionResultsRemovalCallback (env: FS_HEAVY_KEEP_BUDGET_CHARS). When > 0,
# large/stale heavy-tool results are elided once the running total of kept
# response sizes would exceed this budget, even if the count cap
# (keep_last_n) is not yet reached. Default 0 disables the budget axis, so
# retention stays count-only (historical behaviour).
fs_heavy_keep_budget_chars: int = Field(default=0)
code_max_walk_depth: int = Field(default=50)
code_max_files_per_walk: int = Field(default=100_000)
graph_max_results: int = Field(default=200)
Expand Down
33 changes: 33 additions & 0 deletions scripts/xbow_fix_base.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
# Make the buster-based XBOW benchmarks buildable.
#
# ~10 of the validation-benchmarks build FROM python:2.7.18-slim (Debian buster).
# buster is EOL: deb.debian.org/security.debian.org return 404 for it, so the
# benchmarks' `apt-get install` step fails with exit 100. This rebuilds a local
# python:2.7.18-slim whose apt sources point at archive.debian.org (buster main
# only; security/updates dropped) with the expired-Release check disabled — so
# `FROM python:2.7.18-slim` in the benchmarks resolves to the working image.
#
# Idempotent. Run once before an xbow batch. No fixture/submodule edits.
set -euo pipefail
ORIG="localhost/python27-orig:latest"
TARGET="docker.io/library/python:2.7.18-slim"

# Preserve a pristine copy of the upstream base the first time.
if ! podman image exists "$ORIG"; then
podman image exists "$TARGET" || podman pull "$TARGET"
podman tag "$TARGET" "$ORIG"
fi

tmp="$(mktemp -d)"
cat > "$tmp/Containerfile" <<'EOF'
FROM localhost/python27-orig:latest
RUN set -eux; \
sed -i -e 's|http://deb.debian.org/debian|http://archive.debian.org/debian|g' \
-e '/security\.debian\.org/d' \
-e '/buster-updates/d' /etc/apt/sources.list; \
printf 'Acquire::Check-Valid-Until "false";\n' > /etc/apt/apt.conf.d/99no-check-valid
EOF
podman build -t "$TARGET" "$tmp"
rm -rf "$tmp"
echo "patched $TARGET (buster -> archive.debian.org)"
88 changes: 87 additions & 1 deletion tests/eval/xbow.py
Original file line number Diff line number Diff line change
Expand Up @@ -159,29 +159,115 @@ def discover_benchmarks(benchmarks_root: Path) -> list[XbowBenchmark]:
return found


_BUSTER_BASE_ENSURED = False


def ensure_buster_base() -> None:
"""Make ``python:2.7.18-slim`` (Debian buster, EOL) buildable.

Many XBOW benchmarks build ``FROM python:2.7.18-slim``. buster is EOL, so its
apt repos 404 (moved to archive.debian.org) and the benchmark's ``apt-get
install`` fails -> build ``exit 100``. We rebuild that image tag locally with
apt pointed at archive.debian.org (buster main; security/updates dropped) and
the expired-Release check disabled, so any current OR future buster-based
benchmark builds. Idempotent (once per process), best-effort — benchmarks on
other base images are unaffected.
"""
global _BUSTER_BASE_ENSURED
if _BUSTER_BASE_ENSURED:
return
_BUSTER_BASE_ENSURED = True
import tempfile

target = "docker.io/library/python:2.7.18-slim"
orig = "localhost/python27-orig:latest"

def _exists(img: str) -> bool:
return subprocess.run(
["podman", "image", "exists", img],
stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
).returncode == 0

try:
if not _exists(orig):
if not _exists(target):
subprocess.run(["podman", "pull", target], check=True,
stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.run(["podman", "tag", target, orig], check=True)
containerfile = (
"FROM localhost/python27-orig:latest\n"
"RUN sed -i "
"-e 's|http://deb.debian.org/debian|http://archive.debian.org/debian|g' "
"-e '/security\\.debian\\.org/d' -e '/buster-updates/d' "
"/etc/apt/sources.list "
"&& printf 'Acquire::Check-Valid-Until \"false\";\\n' "
"> /etc/apt/apt.conf.d/99no-check-valid\n"
)
with tempfile.TemporaryDirectory() as td:
Path(td, "Containerfile").write_text(containerfile)
subprocess.run(["podman", "build", "-t", target, td], check=True,
stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
except Exception:
pass # best-effort; non-buster benchmarks build regardless


@dataclass
class XbowService:
"""Bring one XBOW benchmark up via podman-compose for dynamic testing."""

benchmark: XbowBenchmark
project_name: str = ""
_resolved_port: int | None = field(default=None, init=False)
_compose_file: str | None = field(default=None, init=False)

def __post_init__(self) -> None:
if not self.project_name:
self.project_name = f"xbow-{self.benchmark.id.lower()}"

def _effective_compose_file(self) -> str:
"""Compose file path safe for podman-compose.

docker-compose accepts ``expose: "host:container"`` but podman-compose
rejects it ("invalid range format for --expose"), which wedges the ~24
db-having benchmarks. If such mappings are present we emit a sanitized
sibling compose (mapping -> bare container port) and use that; otherwise
the original file is used unchanged. Sibling (not temp) so the relative
``build.context`` still resolves. Cached.
"""
if self._compose_file is not None:
return self._compose_file
src = Path(self.benchmark.compose_file)
self._compose_file = str(src)
try:
data = yaml.safe_load(src.read_text(encoding="utf-8")) or {}
changed = False
for svc in (data.get("services") or {}).values():
exp = svc.get("expose") if isinstance(svc, dict) else None
if isinstance(exp, list):
fixed = [str(e).split(":")[-1] for e in exp]
if fixed != [str(e) for e in exp]:
svc["expose"] = fixed
changed = True
if changed:
out = src.with_name("docker-compose.podman.yml")
out.write_text(yaml.safe_dump(data, sort_keys=False), encoding="utf-8")
self._compose_file = str(out)
except Exception:
pass
return self._compose_file

def _compose(self, *args: str) -> list[str]:
return [
"podman-compose",
"-f", str(self.benchmark.compose_file),
"-f", self._effective_compose_file(),
"-p", self.project_name,
*args,
]

def up(self, *, timeout: float = 120.0, quiet: bool = True) -> None:
import os

ensure_buster_base() # make buster-based benchmarks buildable (EOL apt fix)
env = dict(os.environ)
if self.benchmark.flag:
env["FLAG"] = self.benchmark.flag # build-arg `args: - FLAG`
Expand Down
Loading
Loading