Upd0607#54
Merged
Two harness fixes so current + future XBOW cases run without per-case hacks:

1. ensure_buster_base(): ~10 benchmarks build FROM python:2.7.18-slim (Debian buster, EOL), so apt 404s -> build exit 100. up() now rebuilds that image tag locally with apt pointed at archive.debian.org (idempotent, best-effort). Validated: XBEN-004/010 build + capture after the fix.
2. expose sanitizer: podman-compose rejects docker-compose's expose: "host:container" (the ~24 db benchmarks) -> emits a sanitized sibling compose (expose -> bare container port). Validated: XBEN-001 (db), which previously wedged, now comes up healthy.

scripts/xbow_fix_base.sh provides the base fix standalone too.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
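As a rough illustration of the expose sanitizer described above, the sketch below rewrites each expose entry to its bare container port. It assumes the compose file has already been parsed into a plain dict; the function name sanitize_expose and the dict-in, dict-out shape are hypothetical and not taken from the harness.

```python
import copy


def sanitize_expose(compose: dict) -> dict:
    """Return a copy of a parsed compose mapping with every `expose`
    entry reduced to the bare container port.

    Hypothetical helper: the real harness writes a sibling compose file;
    this sketch only shows the port rewrite on an in-memory dict.
    """
    out = copy.deepcopy(compose)  # never mutate the caller's mapping
    for service in (out.get("services") or {}).values():
        if not isinstance(service, dict) or "expose" not in service:
            continue
        cleaned = []
        for entry in service["expose"] or []:
            # "3306:3306" -> "3306"; a bare "3306" or 3306 is kept as "3306"
            cleaned.append(str(entry).rsplit(":", 1)[-1])
        service["expose"] = cleaned
    return out


original = {"services": {"db": {"image": "mysql", "expose": ["3306:3306"]}}}
sanitized = sanitize_expose(original)
```

Returning a copy mirrors the "sibling compose" idea: the original definition stays untouched and only the sanitized variant is handed to podman-compose.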
Add optional per-fixture cost measurements to FixtureResult: tokens (int) and latency_ms (float), both defaulting to None. derive_totals sums them into totals['tokens'] and totals['latency_ms'] only when at least one fixture supplies a value; when no fixture measures cost, no new keys are added and the envelope (plus per-fixture dicts) is byte-for-byte unchanged, keeping existing eval/v1 producers and analytics consumers fully back-compatible. This is the instrument later token A/Bs read.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
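A minimal sketch of the conditional summing described above. The name and passed fields and the add_cost_totals helper are hypothetical stand-ins for the real FixtureResult and derive_totals, and the sketch assumes each key is added independently when at least one fixture reports that measurement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FixtureResult:
    # `name` and `passed` are placeholder fields for illustration only.
    name: str
    passed: bool
    tokens: Optional[int] = None
    latency_ms: Optional[float] = None


def add_cost_totals(results, totals):
    """Return a copy of `totals`, extended with cost sums only when measured."""
    out = dict(totals)
    tokens = [r.tokens for r in results if r.tokens is not None]
    latencies = [r.latency_ms for r in results if r.latency_ms is not None]
    if tokens:  # no key at all when nothing was measured
        out["tokens"] = sum(tokens)
    if latencies:
        out["latency_ms"] = sum(latencies)
    return out


base = {"passed": 2, "failed": 0}
unmeasured = add_cost_totals(
    [FixtureResult("a", True), FixtureResult("b", True)], base
)
measured = add_cost_totals(
    [FixtureResult("a", True, tokens=120, latency_ms=35.5),
     FixtureResult("b", True, tokens=80)],
    base,
)
```

Omitting the keys, rather than emitting zeros, is what keeps the unmeasured envelope identical to the old one.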
…QW1/AC2)

Add a deterministic false-positive filter: any reported vuln finding whose file 'place' was never in the set of files the worker actually read/grep'd is treated as a hallucination and dropped before scoring. URL-type and empty places pass through (not file-checkable). Gated by CONTRACTOR_EMITTED_VS_READ, default off; when unset the scoring path is byte-for-byte the prior behavior. Read paths are derived from the harness read_file/grep tool-call args unioned with the fs 'file_paths' session-state key. Pure function partition_findings_by_read lives in tests/eval/scoring.py with a unit test under tests/units/contractor_tests/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
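The filter above can be sketched as a pure partition. This is only an illustration: the signature, the dict-shaped findings, and the path normalisation are assumptions, and the real partition_findings_by_read in tests/eval/scoring.py may differ.

```python
import posixpath


def _norm(path: str) -> str:
    # Assumed normalisation: forward slashes, no "./" prefix, no ".." noise.
    return posixpath.normpath(path.replace("\\", "/"))


def _is_file_checkable(place: str) -> bool:
    # Empty places and URL-type places cannot be checked against read files.
    return bool(place) and "://" not in place


def partition_findings_by_read(findings, read_paths):
    """Split findings into (kept, dropped) by whether their file was read."""
    read = {_norm(p) for p in read_paths}
    kept, dropped = [], []
    for finding in findings:
        place = (finding.get("place") or "").strip()
        if not _is_file_checkable(place) or _norm(place) in read:
            kept.append(finding)
        else:
            dropped.append(finding)  # reported file was never read
    return kept, dropped


findings = [
    {"title": "SQLi", "place": "./app/db.py"},
    {"title": "XSS", "place": "app/views.py"},
    {"title": "Open redirect", "place": "http://target/login"},
    {"title": "Misc", "place": ""},
]
kept, dropped = partition_findings_by_read(findings, ["app/db.py"])
```

Returning both halves, instead of only the survivors, makes it easy to log what was dropped when the flag is on.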
…p_budget_chars (QW3/A2)

Add Settings.fs_heavy_keep_budget_chars (env: FS_HEAVY_KEEP_BUDGET_CHARS, default 0) and thread it into the FunctionResultsRemovalCallback built in build_worker as keep_budget_chars=, alongside the existing keep_last_n=15. Default 0 is a no-op: the budget axis stays disabled and heavy-tool result retention remains count-only (historical behaviour), so merging is safe. When set > 0, large/stale heavy-tool results are evicted once the cumulative kept-char total would exceed the budget, even if keep_last_n is not reached. An explicit elide_keep_budget_chars kwarg still overrides the setting.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
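One way to read the two retention axes is sketched below. This is not the FunctionResultsRemovalCallback itself: select_kept is a hypothetical helper, and it assumes results are walked newest-first and that everything older than the first over-budget result is evicted.

```python
def select_kept(results, keep_last_n=15, keep_budget_chars=0):
    """Return indices (oldest to newest) of heavy-tool results to keep.

    `results` is a list of result strings ordered oldest to newest.
    A budget of 0 disables the char axis, leaving count-only retention.
    """
    kept = []
    total_chars = 0
    for idx in range(len(results) - 1, -1, -1):  # newest first
        if len(kept) >= keep_last_n:
            break
        size = len(results[idx])
        if keep_budget_chars > 0 and total_chars + size > keep_budget_chars:
            break  # assumed: everything older than this point is evicted
        kept.append(idx)
        total_chars += size
    return sorted(kept)


history = ["x" * 10 for _ in range(20)]
count_only = select_kept(history)                        # budget disabled
budgeted = select_kept(history, keep_budget_chars=35)    # budget binds first
```

With the default budget of 0 the function reduces to "keep the last 15", which matches the no-op claim in the commit message.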
Add dedupe_findings(): a pure, side-effect-free pass that collapses near-duplicate vuln findings before scoring. Groups by (normalised file, primary CWE) and, within a group, merges findings whose titles are near-identical by normalised-token Jaccard (>= 0.6, no external deps), keeping the most severe/specific representative. Distinct issues (different file, different CWE, or clearly different title) are kept; never merges across files. Gated by CONTRACTOR_VULN_DEDUP, default off; when unset the scoring path is byte-for-byte the prior behavior. Wired into the vuln-detection eval right after finding extraction, mirroring the CONTRACTOR_EMITTED_VS_READ wiring style. Unit test under tests/units/contractor_tests/test_vuln_dedup.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
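A minimal sketch of the grouping and Jaccard merge described above. The finding keys (file, cwe, title, severity), the severity ranking, and the greedy first-match merge are assumptions made for illustration; the real dedupe_findings may choose representatives differently.

```python
import posixpath
import re

# Assumed severity ordering; unknown labels rank lowest.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def _tokens(title: str) -> set:
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def _jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _rank(finding: dict) -> int:
    return SEVERITY_RANK.get(str(finding.get("severity", "")).lower(), 0)


def dedupe_findings(findings, threshold=0.6):
    """Collapse near-duplicate findings within the same (file, CWE) group."""
    reps = []  # each entry: [group key, title tokens, representative finding]
    for finding in findings:
        path = posixpath.normpath(str(finding.get("file", "")).replace("\\", "/"))
        key = (path, finding.get("cwe"))
        toks = _tokens(finding.get("title", ""))
        for rep in reps:
            if rep[0] == key and _jaccard(rep[1], toks) >= threshold:
                if _rank(finding) > _rank(rep[2]):
                    rep[1], rep[2] = toks, finding  # keep the more severe one
                break
        else:
            reps.append([key, toks, finding])
    return [rep[2] for rep in reps]


findings = [
    {"file": "app/db.py", "cwe": "CWE-89", "severity": "medium",
     "title": "SQL injection in login query"},
    {"file": "./app/db.py", "cwe": "CWE-89", "severity": "high",
     "title": "SQL injection in the login query"},
    {"file": "app/admin.py", "cwe": "CWE-89", "severity": "high",
     "title": "SQL injection in login query"},
    {"file": "app/db.py", "cwe": "CWE-798", "severity": "low",
     "title": "Hardcoded database password"},
]
deduped = dedupe_findings(findings)
```

In this example the first two titles share 5 of 6 tokens (Jaccard about 0.83), so they merge and the high-severity entry survives; the other two differ by file or CWE and are kept.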
…n (QW8/AC)

Gated off by default. When ObservationConfig.track_coverage_gap is enabled (and the fs tools are built with capture_in_scope=True), the planner-visible observation block gains an unvisited_in_scope_paths list = (in-scope source files) - (files already read), bounded to 25 entries with a +N-more marker, to drive worker breadth/coverage. The in-scope file set is captured lazily and only when capture_in_scope is requested (a single memoized, hard-bounded fs walk per worker run); with the flag off there is no traversal and the projection is absent, so the default is a byte-for-byte no-op. Reuses the existing file_paths read-set capture and the fs _iter_all_files walk; the projector is pure and deterministic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
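The projection above amounts to a bounded set difference, sketched here. The function name project_unvisited and the exact text of the overflow marker are assumptions; only the set difference and the 25-entry bound come from the commit message.

```python
def project_unvisited(in_scope_paths, read_paths, limit=25):
    """Return up to `limit` unvisited in-scope paths, plus an overflow marker.

    Sorting makes the output deterministic regardless of input order.
    """
    unvisited = sorted(set(in_scope_paths) - set(read_paths))
    shown = unvisited[:limit]
    extra = len(unvisited) - len(shown)
    if extra > 0:
        shown.append(f"+{extra} more")  # marker wording is an assumption
    return shown


in_scope = [f"src/f{i:02d}.py" for i in range(30)]
already_read = ["src/f00.py", "src/f01.py"]
projection = project_unvisited(in_scope, already_read)
```

Here 28 files are unvisited, so the list holds the first 25 paths followed by a single marker for the remaining 3.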
…QW8 live)

Make the coverage-gap observation live for planner-driven workflows. The TaskRunner spawn path now resolves the effective ObservationConfig and, when track_coverage_gap is on, passes capture_in_scope=True to the worker builder so the worker's fs tools run the in-scope walk and unvisited files are computed. Single wiring point: TaskRunner._spawn_planning_agent via the new _coverage_gap_kwargs helper. It opts in only when track_coverage_gap is set AND the builder accepts the kwarg (signature-checked), so builders without fs tools (oas_linter, http, ...) are never handed an unknown arg. The codereview and swe factories thread capture_in_scope through ro_file_tools. Default off = no kwarg passed, no walk, no projection: byte-identical to prior behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
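The signature check described above could look roughly like this. It is a standalone sketch, not the TaskRunner method: the function below and the two example builders are hypothetical, and it conservatively assumes that only an explicitly named capture_in_scope parameter counts as "accepting" the kwarg.

```python
import inspect


def coverage_gap_kwargs(builder, track_coverage_gap: bool) -> dict:
    """Return {'capture_in_scope': True} only when it is safe to pass."""
    if not track_coverage_gap:
        return {}  # default off: no kwarg, no walk, no projection
    try:
        params = inspect.signature(builder).parameters
    except (TypeError, ValueError):
        return {}  # signature not introspectable: do not risk an unknown arg
    if "capture_in_scope" in params:
        return {"capture_in_scope": True}
    return {}


# Hypothetical builders standing in for the real worker factories.
def build_codereview_worker(model, capture_in_scope=False):
    return ("codereview", model, capture_in_scope)


def build_http_worker(model):
    return ("http", model)


with_fs = build_codereview_worker(
    "m", **coverage_gap_kwargs(build_codereview_worker, True)
)
without_fs = build_http_worker("m", **coverage_gap_kwargs(build_http_worker, True))
```

Because the helper returns an empty dict in every non-opt-in case, the call site can always splat it without branching.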
# Conflicts:
#   tests/eval/scoring.py
#   tests/eval/test_vuln_detection_eval.py
…ap path norm, deterministic in-scope cap, docs/test hygiene)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>