Qw/emitted vs read#52

Merged
grauwolf32 merged 6 commits into main from qw/emitted-vs-read
Jun 7, 2026
@grauwolf32

Owner

No description provided.

grauwolf32 and others added 6 commits June 6, 2026 16:27
Two harness fixes so current + future XBOW cases run without per-case hacks:

1. ensure_buster_base(): ~10 benchmarks build FROM python:2.7.18-slim (Debian
   buster, EOL) — apt 404s -> build exit 100. up() now rebuilds that image tag
   locally with apt pointed at archive.debian.org (idempotent, best-effort).
   Validated: XBEN-004/010 build + capture after the fix.
2. expose sanitizer: podman-compose rejects docker-compose's
   expose: "host:container" form (used by the ~24 db benchmarks), so the
   harness now emits a sanitized sibling compose file (expose -> bare
   container port). Validated: XBEN-001 (db), which previously wedged, now
   comes up healthy.
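
A minimal sketch of the expose rewrite described in item 2, not the harness code itself. It assumes the compose file has already been parsed into a plain mapping (for example by a YAML loader); the function name `sanitize_expose` is hypothetical.

```python
import copy


def sanitize_expose(compose: dict) -> dict:
    """Return a copy of a parsed compose mapping whose expose entries
    are reduced to the bare container port (hypothetical helper)."""
    out = copy.deepcopy(compose)  # leave the caller's mapping untouched
    for service in (out.get("services") or {}).values():
        if not isinstance(service, dict) or "expose" not in service:
            continue
        cleaned = []
        for entry in service["expose"] or []:
            # "host:container" -> "container"; a bare port passes through
            port = str(entry).rsplit(":", 1)[-1]
            if port not in cleaned:
                cleaned.append(port)
        service["expose"] = cleaned
    return out
```

Writing the result to a sibling file rather than overwriting the benchmark's own compose file keeps the original case untouched, which matches the "sanitized sibling compose" wording above.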

scripts/xbow_fix_base.sh provides the base fix standalone too.
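
The base-image fix can be pictured roughly as follows. This is a sketch under assumptions, not the contents of ensure_buster_base() or scripts/xbow_fix_base.sh: the helper names, the exact apt source lines, and the Check-Valid-Until override are all illustrative choices. It only generates the Dockerfile text and the build command; it does not run anything.

```python
BASE_TAG = "python:2.7.18-slim"

# Assumed archive locations for Debian buster; adjust if the archive layout differs.
ARCHIVE_SOURCES = [
    "deb http://archive.debian.org/debian buster main",
    "deb http://archive.debian.org/debian-security buster/updates main",
]


def buster_base_dockerfile(tag: str = BASE_TAG) -> str:
    """Dockerfile text that points apt at archive.debian.org (hypothetical helper)."""
    first, *rest = ARCHIVE_SOURCES
    # Overwrite sources.list instead of patching it, so rebuilding is idempotent.
    steps = [f"echo '{first}' > /etc/apt/sources.list"]
    steps += [f"echo '{line}' >> /etc/apt/sources.list" for line in rest]
    # Assumption: archived Release files are expired, so skip the validity check.
    steps.append(
        "echo 'Acquire::Check-Valid-Until \"false\";' > /etc/apt/apt.conf.d/99archive"
    )
    return f"FROM {tag}\nRUN " + " && ".join(steps) + "\n"


def build_command(dockerfile_path: str, context_dir: str,
                  engine: str = "podman", tag: str = BASE_TAG) -> list:
    """Command that rebuilds the same tag locally from the patched Dockerfile."""
    return [engine, "build", "-t", tag, "-f", dockerfile_path, context_dir]
```

Because the rebuilt image keeps the original tag, benchmarks that build FROM python:2.7.18-slim pick it up without any per-case change.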
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…QW1/AC2)

Add a deterministic false-positive filter: any reported vuln finding whose
file 'place' was never in the set of files the worker actually read/grep'd is
treated as a hallucination and dropped before scoring. URL-type and empty
places pass through (not file-checkable).
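
A sketch of what such a filter might look like, assuming each finding is a mapping with a 'place' key and that a place containing "://" counts as URL-type. The name `partition_by_read` and the path normalization are illustrative assumptions, not the merged implementation.

```python
import posixpath


def _norm(path: str) -> str:
    # Normalize so "./app/views.py" and "app/views.py" compare equal.
    return posixpath.normpath(path.strip())


def partition_by_read(findings, read_paths):
    """Split findings into (kept, dropped) by whether their file place was read."""
    read = {_norm(p) for p in read_paths}
    kept, dropped = [], []
    for finding in findings:
        place = (finding.get("place") or "").strip()
        if not place or "://" in place:
            kept.append(finding)      # empty or URL-type: not file-checkable
        elif _norm(place) in read:
            kept.append(finding)      # the worker actually read this file
        else:
            dropped.append(finding)   # never read: treated as a hallucination
    return kept, dropped
```

For example, with read paths `{"./app/views.py"}`, a finding placed at `app/views.py` is kept, one at `app/ghost.py` is dropped, and one at `http://localhost:8080/login` passes through.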

Gated by CONTRACTOR_EMITTED_VS_READ, default off; when unset the scoring path
is byte-for-byte the prior behavior. Read paths are derived from the harness
read_file/grep tool-call args unioned with the fs 'file_paths' session-state
key. Pure function partition_findings_by_read lives in tests/eval/scoring.py
with a unit test under tests/units/contractor_tests/.
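
The gating and read-path derivation could be sketched as below. The tool-call record shape, the "path" argument key, the accepted truthy values, and both helper names are assumptions made for illustration; only the environment variable name, the tool names, and the 'file_paths' key come from the description above.

```python
import os


def filter_enabled(env=None) -> bool:
    """True only when CONTRACTOR_EMITTED_VS_READ is set to a truthy value."""
    env = os.environ if env is None else env
    value = env.get("CONTRACTOR_EMITTED_VS_READ", "")
    # Assumption: which strings count as "on" is not specified in the PR.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def collect_read_paths(tool_calls, session_state) -> set:
    """Union of read_file/grep path args and the 'file_paths' session-state key."""
    paths = set()
    for call in tool_calls:
        if call.get("name") in {"read_file", "grep"}:
            path = (call.get("args") or {}).get("path")  # assumed arg key
            if path:
                paths.add(path)
    paths.update(session_state.get("file_paths") or [])
    return paths
```

Checking the flag before any filtering happens is what keeps the unset case identical to the prior scoring path: when `filter_enabled()` is false, the read set is never consulted.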

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@grauwolf32 grauwolf32 merged commit 40996af into main Jun 7, 2026
1 check failed