Upd0607#54

Merged
grauwolf32 merged 23 commits into main from upd0607 on Jun 7, 2026

Conversation

@grauwolf32

Owner

No description provided.

grauwolf32 and others added 23 commits June 6, 2026 16:27
Two harness fixes so current + future XBOW cases run without per-case hacks:

1. ensure_buster_base(): ~10 benchmarks build FROM python:2.7.18-slim (Debian
   buster, EOL) — apt 404s -> build exit 100. up() now rebuilds that image tag
   locally with apt pointed at archive.debian.org (idempotent, best-effort).
   Validated: XBEN-004/010 build + capture after the fix.
2. expose sanitizer: podman-compose rejects docker-compose's
   expose: "host:container" (the ~24 db benchmarks) -> emits a sanitized
   sibling compose (expose -> bare container port). Validated: XBEN-001 (db),
   which previously wedged, now comes up healthy.

scripts/xbow_fix_base.sh provides the base fix standalone too.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
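
The expose sanitizer described in this commit could look roughly like the sketch below. It is an illustration only: the function name sanitize_expose and the use of an already-parsed compose mapping are assumptions, not the harness's actual code. It rewrites each expose entry of the form "host:container" to the bare container port and leaves everything else untouched.

```python
import copy


def sanitize_expose(compose: dict) -> dict:
    """Return a copy of a parsed compose mapping whose `expose` entries
    are reduced to the bare container port (hypothetical helper)."""
    out = copy.deepcopy(compose)  # never mutate the caller's mapping
    for service in (out.get("services") or {}).values():
        if not isinstance(service, dict) or "expose" not in service:
            continue
        cleaned = []
        for entry in service["expose"]:
            text = str(entry)
            # "host:container" -> "container"; a bare port stays as it is.
            cleaned.append(text.rsplit(":", 1)[-1])
        service["expose"] = cleaned
    return out


compose = {
    "services": {
        "db": {"image": "mysql:5.7", "expose": ["3306:3306", 3306]},
        "web": {"ports": ["8080:80"]},
    }
}
sanitized = sanitize_expose(compose)
print(sanitized["services"]["db"]["expose"])
```

Working on a copy matches the commit's "sanitized sibling compose" wording: the original file content is left as is and only the emitted sibling differs.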
Add optional per-fixture cost measurements to FixtureResult: tokens (int)
and latency_ms (float), both defaulting to None. derive_totals sums them
into totals['tokens'] and totals['latency_ms'] only when at least one
fixture supplies a value; when no fixture measures cost, no new keys are
added and the envelope (plus per-fixture dicts) is byte-for-byte unchanged,
keeping existing eval/v1 producers and analytics consumers fully
back-compatible. This is the instrument later token A/Bs read.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
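
A minimal sketch of the additive-totals behaviour described above. The FixtureResult fields name and passed, and the "fixtures" and "passed" keys, are placeholders invented for the example; only tokens, latency_ms and the two totals keys come from the commit message. The sketch also assumes each cost key is added independently when at least one fixture reports that value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FixtureResult:
    name: str                           # hypothetical field
    passed: bool                        # hypothetical field
    tokens: Optional[int] = None        # optional cost measurement
    latency_ms: Optional[float] = None  # optional cost measurement


def derive_totals(results):
    # Baseline keys are placeholders standing in for the existing envelope.
    totals = {
        "fixtures": len(results),
        "passed": sum(1 for r in results if r.passed),
    }
    tokens = [r.tokens for r in results if r.tokens is not None]
    latencies = [r.latency_ms for r in results if r.latency_ms is not None]
    # Cost keys appear only when some fixture measured them, so producers
    # that never measure cost emit exactly the same dict as before.
    if tokens:
        totals["tokens"] = sum(tokens)
    if latencies:
        totals["latency_ms"] = sum(latencies)
    return totals


unmeasured = derive_totals([FixtureResult("a", True), FixtureResult("b", False)])
measured = derive_totals([
    FixtureResult("a", True, tokens=100, latency_ms=12.5),
    FixtureResult("b", True, tokens=50),
])
print(unmeasured, measured)
```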
…QW1/AC2)

Add a deterministic false-positive filter: any reported vuln finding whose
file 'place' was never in the set of files the worker actually read/grep'd is
treated as a hallucination and dropped before scoring. URL-type and empty
places pass through (not file-checkable).

Gated by CONTRACTOR_EMITTED_VS_READ, default off; when unset the scoring path
is byte-for-byte the prior behavior. Read paths are derived from the harness
read_file/grep tool-call args unioned with the fs 'file_paths' session-state
key. Pure function partition_findings_by_read lives in tests/eval/scoring.py
with a unit test under tests/units/contractor_tests/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
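
The filter above can be sketched as a pure partition over findings. The signature, the dict shape of a finding, the "://" URL test and the path normalisation are all assumptions made for illustration; the real partition_findings_by_read in tests/eval/scoring.py may differ.

```python
import posixpath


def _norm(path: str) -> str:
    # Assumed normalisation: forward slashes, no "./" prefix, no leading "/".
    return posixpath.normpath(path.replace("\\", "/")).lstrip("/")


def partition_findings_by_read(findings, read_paths):
    """Split findings into (kept, dropped) by whether their file 'place'
    was ever read. Empty and URL-type places always pass through."""
    read = {_norm(p) for p in read_paths}
    kept, dropped = [], []
    for finding in findings:
        place = (finding.get("place") or "").strip()
        if not place or "://" in place or _norm(place) in read:
            kept.append(finding)
        else:
            dropped.append(finding)  # reported against a file never read
    return kept, dropped


findings = [
    {"place": "src/app.py"},
    {"place": "src/ghost.py"},
    {"place": "https://example.test/login"},
    {"place": ""},
]
kept, dropped = partition_findings_by_read(findings, ["./src/app.py"])
print([f["place"] for f in kept], [f["place"] for f in dropped])
```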
…p_budget_chars (QW3/A2)

Add Settings.fs_heavy_keep_budget_chars (env: FS_HEAVY_KEEP_BUDGET_CHARS,
default 0) and thread it into the FunctionResultsRemovalCallback built in
build_worker as keep_budget_chars=, alongside the existing keep_last_n=15.

Default 0 is a no-op: the budget axis stays disabled and heavy-tool result
retention remains count-only (historical behaviour), so merging is safe.
When set > 0, large/stale heavy-tool results are evicted once the cumulative
kept-char total would exceed the budget, even if keep_last_n is not reached.
An explicit elide_keep_budget_chars kwarg still overrides the setting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
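
One way to picture how the count axis and the budget axis interact is the sketch below. It is not the FunctionResultsRemovalCallback implementation: select_kept_indices is a hypothetical helper, and it assumes results are walked newest-first and that everything older than the first over-budget result is evicted.

```python
def select_kept_indices(result_sizes, keep_last_n=15, keep_budget_chars=0):
    """Given char lengths of heavy-tool results (oldest first), return the
    indices that stay in context (hypothetical helper)."""
    kept = []
    total = 0
    # Walk from the newest result back towards the oldest.
    for idx in range(len(result_sizes) - 1, -1, -1):
        if len(kept) >= keep_last_n:
            break  # count axis: historical behaviour
        size = result_sizes[idx]
        # Budget axis: disabled when the budget is 0 (the default).
        if keep_budget_chars > 0 and total + size > keep_budget_chars:
            break
        kept.append(idx)
        total += size
    return sorted(kept)


sizes = [100, 200, 300, 400]
print(select_kept_indices(sizes))                         # count-only default
print(select_kept_indices(sizes, keep_budget_chars=700))  # budget bites first
```

With the default budget of 0 only keep_last_n limits retention; with a budget of 700 the two newest results (400 + 300 chars) fit and the older ones are evicted even though keep_last_n is far from reached.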
Add dedupe_findings(): a pure, side-effect-free pass that collapses
near-duplicate vuln findings before scoring. Groups by (normalised file,
primary CWE) and, within a group, merges findings whose titles are
near-identical by normalised-token Jaccard (>= 0.6, no external deps),
keeping the most severe/specific representative. Distinct issues (different
file, different CWE, or clearly different title) are kept; never merges
across files.

Gated by CONTRACTOR_VULN_DEDUP, default off; when unset the scoring path is
byte-for-byte the prior behavior. Wired into the vuln-detection eval right
after finding extraction, mirroring the CONTRACTOR_EMITTED_VS_READ wiring
style. Unit test under tests/units/contractor_tests/test_vuln_dedup.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
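
A dependency-free sketch of the grouping and Jaccard merge described above. The finding keys (file, cwe, title, severity), the severity ranking and the tokeniser are assumptions for the example, and the sketch picks the representative by severity only, not by specificity.

```python
import posixpath
import re
from collections import defaultdict

# Assumed severity ordering; the real ranking is not given in the commit.
SEVERITY_RANK = {"critical": 4, "high": 3, "medium": 2, "low": 1, "info": 0}


def _tokens(title):
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def _jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe_findings(findings, threshold=0.6):
    # Group by (normalised file, primary CWE); never merge across groups.
    groups = defaultdict(list)
    for f in findings:
        groups[(posixpath.normpath(f["file"]), f["cwe"])].append(f)
    out = []
    for group in groups.values():
        reps = []
        for f in group:
            for i, rep in enumerate(reps):
                if _jaccard(_tokens(f["title"]), _tokens(rep["title"])) >= threshold:
                    # Near-duplicate: keep whichever is more severe.
                    if SEVERITY_RANK.get(f["severity"], 0) > SEVERITY_RANK.get(rep["severity"], 0):
                        reps[i] = f
                    break
            else:
                reps.append(f)  # clearly different title: distinct issue
        out.extend(reps)
    return out


a = {"file": "app/views.py", "cwe": "CWE-89", "title": "SQL injection in login query", "severity": "medium"}
b = {"file": "./app/views.py", "cwe": "CWE-89", "title": "SQL injection in the login query", "severity": "high"}
c = {"file": "app/views.py", "cwe": "CWE-79", "title": "SQL injection in login query", "severity": "low"}
d = {"file": "app/models.py", "cwe": "CWE-89", "title": "SQL injection in login query", "severity": "low"}
deduped = dedupe_findings([a, b, c, d])
print([f["title"] for f in deduped])
```

Here a and b share a file and CWE and their titles overlap on 5 of 6 tokens (about 0.83), so they merge into b; c and d survive because their CWE or file differs.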
…n (QW8/AC)

Gated off by default. When ObservationConfig.track_coverage_gap is enabled
(and the fs tools are built with capture_in_scope=True), the planner-visible
observation block gains an unvisited_in_scope_paths list = (in-scope source
files) - (files already read), bounded to 25 entries with a +N-more marker, to
drive worker breadth/coverage.

The in-scope file set is captured lazily and only when capture_in_scope is
requested (a single memoized, hard-bounded fs walk per worker run); with the
flag off there is no traversal and the projection is absent, so the default is
a byte-for-byte no-op. Reuses the existing file_paths read-set capture and the
fs _iter_all_files walk; the projector is pure and deterministic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
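
The projection itself is a bounded set difference, which could be sketched as follows. The function name project_unvisited, the sorted ordering and the exact text of the "+N more" marker are assumptions; only the 25-entry bound and the set difference come from the commit message.

```python
def project_unvisited(in_scope_paths, read_paths, limit=25):
    """Return in-scope files not yet read, capped at `limit` entries plus a
    trailing marker when more remain (hypothetical helper)."""
    # Sorting keeps the output deterministic regardless of input order.
    unvisited = sorted(set(in_scope_paths) - set(read_paths))
    shown = unvisited[:limit]
    extra = len(unvisited) - len(shown)
    if extra > 0:
        shown.append(f"+{extra} more")
    return shown


in_scope = [f"src/m{i:02d}.py" for i in range(30)]
already_read = ["src/m00.py", "src/m01.py"]
gap = project_unvisited(in_scope, already_read)
print(len(gap), gap[0], gap[-1])
```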
…QW8 live)

Make the coverage-gap observation live for planner-driven workflows. The
TaskRunner spawn path now resolves the effective ObservationConfig and, when
track_coverage_gap is on, passes capture_in_scope=True to the worker builder so
the worker's fs tools run the in-scope walk and unvisited files are computed.

Single wiring point: TaskRunner._spawn_planning_agent via the new
_coverage_gap_kwargs helper. It opts in only when track_coverage_gap is set AND
the builder accepts the kwarg (signature-checked), so builders without fs tools
(oas_linter, http, ...) are never handed an unknown arg. The codereview and swe
factories thread capture_in_scope through ro_file_tools.

Default off = no kwarg passed, no walk, no projection: byte-identical to prior
behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
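
The signature-checked opt-in could be sketched with inspect.signature from the standard library. The helper and the two builder functions below are stand-ins written for the example, not the TaskRunner code, and the sketch only checks for a parameter named capture_in_scope (a builder taking **kwargs would need extra handling).

```python
import inspect


def coverage_gap_kwargs(builder, track_coverage_gap):
    """Return {'capture_in_scope': True} only when the flag is on AND the
    builder declares that parameter; otherwise return no kwargs."""
    if not track_coverage_gap:
        return {}
    try:
        params = inspect.signature(builder).parameters
    except (TypeError, ValueError):
        return {}  # signature not introspectable: stay on the safe path
    if "capture_in_scope" not in params:
        return {}
    return {"capture_in_scope": True}


def build_codereview_worker(name, capture_in_scope=False):  # stand-in builder
    return {"name": name, "capture_in_scope": capture_in_scope}


def build_http_worker(name):  # stand-in builder without fs tools
    return {"name": name}


on = build_codereview_worker("cr", **coverage_gap_kwargs(build_codereview_worker, True))
http = build_http_worker("http", **coverage_gap_kwargs(build_http_worker, True))
print(on, http)
```

Returning an empty dict in the default-off case means the builder call is identical to what it was before the change, which is how the commit describes the no-op default.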
Two harness fixes so current + future XBOW cases run without per-case hacks:

1. ensure_buster_base(): ~10 benchmarks build FROM python:2.7.18-slim (Debian
   buster, EOL) — apt 404s -> build exit 100. up() now rebuilds that image tag
   locally with apt pointed at archive.debian.org (idempotent, best-effort).
   Validated: XBEN-004/010 build + capture after the fix.
2. expose sanitizer: podman-compose rejects docker-compose's
   expose: "host:container" (the ~24 db benchmarks) -> emits a sanitized
   sibling compose (expose -> bare container port). Validated: XBEN-001 (db),
   which previously wedged, now comes up healthy.

scripts/xbow_fix_base.sh provides the base fix standalone too.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts:
#	tests/eval/scoring.py
#	tests/eval/test_vuln_detection_eval.py
…ap path norm, deterministic in-scope cap, docs/test hygiene)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@grauwolf32 grauwolf32 merged commit cd19e29 into main Jun 7, 2026
1 check failed