Upd0607 by grauwolf32 · Pull Request #54 · grauwolf32/contractor

grauwolf32 · 2026-06-07T06:04:43Z

No description provided.

Two harness fixes so current and future XBOW cases run without per-case hacks:

1. ensure_buster_base(): ~10 benchmarks build FROM python:2.7.18-slim (Debian buster, EOL), so apt returns 404s and the build exits with code 100. up() now rebuilds that image tag locally with apt pointed at archive.debian.org (idempotent, best-effort). Validated: XBEN-004/010 build + capture after the fix.
2. expose sanitizer: podman-compose rejects docker-compose's expose: "host:container" (the ~24 db benchmarks), so the harness emits a sanitized sibling compose (expose -> bare container port). Validated: XBEN-001 (db), which previously wedged, now comes up healthy.

scripts/xbow_fix_base.sh provides the base fix standalone too.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
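The expose rewrite in fix 2 can be sketched as a pure transform over an already-parsed compose mapping. This is only an illustration under stated assumptions: `sanitize_expose` is a hypothetical name, the input is assumed to be the dict a YAML loader would produce, and the PR's actual helper may differ.

```python
from typing import Any


def sanitize_expose(compose: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a parsed compose mapping in which every expose
    entry is reduced to its bare container port.

    Hypothetical helper for illustration; not the code from this PR.
    """
    services: dict[str, Any] = {}
    for name, service in compose.get("services", {}).items():
        service = dict(service)  # shallow copy so the input is not mutated
        if "expose" in service:
            # "3306:3306" -> "3306"; a bare "3306" (or int 3306) is kept as "3306".
            service["expose"] = [
                str(port).rsplit(":", 1)[-1] for port in service["expose"]
            ]
        services[name] = service
    return {**compose, "services": services}


original = {"services": {"db": {"image": "mysql", "expose": ["3306:3306"]}}}
sanitized = sanitize_expose(original)
```

Writing `sanitized` to a sibling file, rather than editing the benchmark's own compose file, would match the "sanitized sibling compose" wording above.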

Add optional per-fixture cost measurements to FixtureResult: tokens (int) and latency_ms (float), both defaulting to None. derive_totals sums them into totals['tokens'] and totals['latency_ms'] only when at least one fixture supplies a value; when no fixture measures cost, no new keys are added and the envelope (plus per-fixture dicts) is byte-for-byte unchanged, keeping existing eval/v1 producers and analytics consumers fully back-compatible. This is the instrument that later token A/B comparisons will read.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
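A minimal sketch of the "add keys only when measured" behaviour described above. The `name` and `passed` fields and the `fixtures`/`passed` totals are assumptions made for illustration; only `tokens`, `latency_ms`, and the conditional totals keys come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FixtureResult:
    name: str                           # assumed field, for illustration
    passed: bool                        # assumed field, for illustration
    tokens: Optional[int] = None        # optional cost measurement
    latency_ms: Optional[float] = None  # optional cost measurement


def derive_totals(results: list[FixtureResult]) -> dict:
    # Baseline totals (assumed shape); these exist regardless of cost data.
    totals: dict = {
        "fixtures": len(results),
        "passed": sum(1 for r in results if r.passed),
    }
    for key in ("tokens", "latency_ms"):
        values = [getattr(r, key) for r in results if getattr(r, key) is not None]
        if values:  # add the key only when at least one fixture measured it
            totals[key] = sum(values)
    return totals
```

With no measured fixtures the returned dict carries only the baseline keys, which is what keeps older consumers unaffected.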

…QW1/AC2) Add a deterministic false-positive filter: any reported vuln finding whose file 'place' was never in the set of files the worker actually read/grep'd is treated as a hallucination and dropped before scoring. URL-type and empty places pass through (not file-checkable). Gated by CONTRACTOR_EMITTED_VS_READ, default off; when unset the scoring path is byte-for-byte the prior behavior. Read paths are derived from the harness read_file/grep tool-call args unioned with the fs 'file_paths' session-state key. Pure function partition_findings_by_read lives in tests/eval/scoring.py with a unit test under tests/units/contractor_tests/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
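The partition described above might look roughly like the following. This is a sketch, not the function in tests/eval/scoring.py: the finding shape (a dict with a `place` key), the `"://"` URL test, and the path normalisation are all assumptions.

```python
import posixpath
from typing import Iterable


def _is_file_place(place: str) -> bool:
    # Empty places and URL-like places are not file-checkable (assumed test).
    return bool(place) and "://" not in place


def partition_findings_by_read(
    findings: list[dict], read_paths: Iterable[str]
) -> tuple[list[dict], list[dict]]:
    """Split findings into (kept, dropped) based on the set of read files."""
    read = {posixpath.normpath(p) for p in read_paths}
    kept: list[dict] = []
    dropped: list[dict] = []
    for finding in findings:
        place = finding.get("place") or ""
        if not _is_file_place(place) or posixpath.normpath(place) in read:
            kept.append(finding)     # not checkable, or the file was read
        else:
            dropped.append(finding)  # file never read: treated as hallucinated
    return kept, dropped
```

Returning the dropped list as well as the kept list makes the filter auditable, since a caller can log what was discarded instead of losing it silently.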

…p_budget_chars (QW3/A2) Add Settings.fs_heavy_keep_budget_chars (env: FS_HEAVY_KEEP_BUDGET_CHARS, default 0) and thread it into the FunctionResultsRemovalCallback built in build_worker as keep_budget_chars=, alongside the existing keep_last_n=15. Default 0 is a no-op: the budget axis stays disabled and heavy-tool result retention remains count-only (historical behaviour), so merging is safe. When set > 0, large/stale heavy-tool results are evicted once the cumulative kept-char total would exceed the budget, even if keep_last_n is not reached. An explicit elide_keep_budget_chars kwarg still overrides the setting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
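One way to picture the two retention axes is a newest-first walk that stops at whichever limit is hit first. The sketch below is an assumption about the policy, not the FunctionResultsRemovalCallback implementation: `select_kept` is a hypothetical name, results are modelled as plain strings, and everything older than the first result that would overflow the budget is assumed to be evicted.

```python
def select_kept(
    results: list[str], keep_last_n: int = 15, keep_budget_chars: int = 0
) -> list[bool]:
    """Return one flag per result (oldest first): True means keep it.

    keep_budget_chars == 0 disables the budget axis (count-only retention).
    """
    keep = [False] * len(results)
    kept_count = 0
    kept_chars = 0
    for i in range(len(results) - 1, -1, -1):  # walk newest -> oldest
        if kept_count >= keep_last_n:
            break  # count axis reached
        size = len(results[i])
        if keep_budget_chars > 0 and kept_chars + size > keep_budget_chars:
            break  # budget axis: keeping this result would exceed the budget
        keep[i] = True
        kept_count += 1
        kept_chars += size
    return keep
```

With the default `keep_budget_chars=0` the second check never fires, so the function reduces to "keep the last N", matching the no-op default described above.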

Add dedupe_findings(): a pure, side-effect-free pass that collapses near-duplicate vuln findings before scoring. Groups by (normalised file, primary CWE) and, within a group, merges findings whose titles are near-identical by normalised-token Jaccard (>= 0.6, no external deps), keeping the most severe/specific representative. Distinct issues (different file, different CWE, or clearly different title) are kept; never merges across files. Gated by CONTRACTOR_VULN_DEDUP, default off; when unset the scoring path is byte-for-byte the prior behavior. Wired into the vuln-detection eval right after finding extraction, mirroring the CONTRACTOR_EMITTED_VS_READ wiring style. Unit test under tests/units/contractor_tests/test_vuln_dedup.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
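The grouping and Jaccard merge can be sketched as below. The finding shape (`file`, `cwe`, `title`, `severity` keys), the severity scale, and the tokeniser are assumptions for illustration; only the (file, CWE) grouping and the 0.6 threshold come from the commit message, and the "most specific" tie-break is not modelled.

```python
import posixpath
import re

# Assumed severity scale, for illustration only.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def _tokens(title: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def _jaccard(a: set[str], b: set[str]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _group_key(finding: dict) -> tuple[str, str]:
    return (posixpath.normpath(finding["file"]), finding["cwe"])


def dedupe_findings(findings: list[dict], threshold: float = 0.6) -> list[dict]:
    """Collapse near-duplicate findings; the input list is not modified."""
    kept: list[dict] = []
    for finding in findings:
        for i, other in enumerate(kept):
            if _group_key(finding) != _group_key(other):
                continue  # never merge across files or CWEs
            similarity = _jaccard(_tokens(finding["title"]), _tokens(other["title"]))
            if similarity >= threshold:
                # Keep the more severe representative of the pair.
                new_rank = SEVERITY_RANK.get(finding["severity"], 0)
                old_rank = SEVERITY_RANK.get(other["severity"], 0)
                if new_rank > old_rank:
                    kept[i] = finding
                break
        else:
            kept.append(finding)  # no near-duplicate found: distinct issue
    return kept
```

Running the pass over its own output returns the same list for the cases tried here, which is the idempotence property a later review-fix commit in this PR mentions.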

…n (QW8/AC) Gated off by default. When ObservationConfig.track_coverage_gap is enabled (and the fs tools are built with capture_in_scope=True), the planner-visible observation block gains an unvisited_in_scope_paths list = (in-scope source files) - (files already read), bounded to 25 entries with a +N-more marker, to drive worker breadth/coverage. The in-scope file set is captured lazily and only when capture_in_scope is requested (a single memoized, hard-bounded fs walk per worker run); with the flag off there is no traversal and the projection is absent, so the default is a byte-for-byte no-op. Reuses the existing file_paths read-set capture and the fs _iter_all_files walk; the projector is pure and deterministic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
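The projection itself is a bounded set difference. A minimal sketch, assuming paths are already normalised strings; `project_unvisited` is a hypothetical name and the exact text of the overflow marker is a guess at the "+N-more" format.

```python
from typing import Iterable


def project_unvisited(
    in_scope: Iterable[str], read_paths: Iterable[str], limit: int = 25
) -> list[str]:
    """Return in-scope files not yet read, capped at `limit` entries.

    Sorting makes the output deterministic regardless of walk order.
    """
    unvisited = sorted(set(in_scope) - set(read_paths))
    shown = unvisited[:limit]
    overflow = len(unvisited) - len(shown)
    if overflow > 0:
        shown.append(f"+{overflow} more")  # assumed marker format
    return shown
```

Capping the list keeps the planner-visible observation small even on large repositories, while the marker still signals how much remains unvisited.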

…QW8 live) Make the coverage-gap observation live for planner-driven workflows. The TaskRunner spawn path now resolves the effective ObservationConfig and, when track_coverage_gap is on, passes capture_in_scope=True to the worker builder so the worker's fs tools run the in-scope walk and unvisited files are computed. Single wiring point: TaskRunner._spawn_planning_agent via the new _coverage_gap_kwargs helper. It opts in only when track_coverage_gap is set AND the builder accepts the kwarg (signature-checked), so builders without fs tools (oas_linter, http, ...) are never handed an unknown arg. The codereview and swe factories thread capture_in_scope through ro_file_tools. Default off = no kwarg passed, no walk, no projection: byte-identical to prior behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
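The signature check can be done with the standard library's `inspect.signature`. The sketch below is an assumption about how such a helper could work, not the body of `_coverage_gap_kwargs`; the two builder functions are hypothetical stand-ins, and builders that only accept `**kwargs` are deliberately not treated as opting in.

```python
import inspect
from typing import Any, Callable


def coverage_gap_kwargs(
    builder: Callable[..., Any], track_coverage_gap: bool
) -> dict[str, Any]:
    """Return {"capture_in_scope": True} only when the flag is on and the
    builder declares a parameter with that name; otherwise return {}."""
    if not track_coverage_gap:
        return {}
    params = inspect.signature(builder).parameters
    if "capture_in_scope" in params:
        return {"capture_in_scope": True}
    return {}


# Hypothetical builders, for illustration only.
def build_fs_worker(name: str, capture_in_scope: bool = False) -> dict:
    return {"name": name, "capture_in_scope": capture_in_scope}


def build_http_worker(name: str) -> dict:
    return {"name": name}


fs_worker = build_fs_worker("codereview", **coverage_gap_kwargs(build_fs_worker, True))
http_worker = build_http_worker("http", **coverage_gap_kwargs(build_http_worker, True))
```

Because the helper returns an empty dict in every non-opt-in case, the call site can always splat its result without branching.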

grauwolf32 and others added 23 commits June 6, 2026 16:27

docs(resume): xbow 14/15 final + corrected XBEN-010 status

a50fd4e

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(resume): XBEN-010 confirmed reproducible timeout (900s + 1800s)

7cf2ac9

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(resume): trace lean+paths post-audit rerun — no regression

ff4dcd9

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

QW1: update

20e63db

docs(resume): xbow 14/15 final + corrected XBEN-010 status

0796906

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(resume): XBEN-010 confirmed reproducible timeout (900s + 1800s)

b655420

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(resume): trace lean+paths post-audit rerun — no regression

729438f

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Merge branch 'qw/eval-cost-metric' into quickwins-safe-win

4a36007

Merge branch 'qw/byte-retention' into quickwins-safe-win

a6bde9d

Merge branch 'qw/coverage-obs' into quickwins-safe-win

1dddae5

Merge branch 'qw/emitted-vs-read' into quickwins-safe-win

40996af

fix: restore tests/playground submodule gitlink (drop worktree symlink artifact)

815c9c4

Merge branch 'qw/vuln-dedup' into quickwins-safe-win

e1d9004

# Conflicts:
#   tests/eval/scoring.py
#   tests/eval/test_vuln_detection_eval.py

fix(quickwins): address review findings (idempotent dedup, coverage-gap path norm, deterministic in-scope cap, docs/test hygiene)

b7ae4b5

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge quickwins-safe-win: QW1/QW3/QW4/QW7/QW8 (all gated/default-off) + review fixes

cc3ff3a

grauwolf32 merged commit cd19e29 into main Jun 7, 2026
1 check failed

grauwolf32 merged 23 commits into main from upd0607
