Split SARIF broadcasts into a deduplicated finding pool#506
Extract individual findings from SARIF results into a separate Redis pool with fingerprint-based deduplication, and switch seed-gen vuln discovery to consume findings instead of raw SARIF broadcasts. Findings are formatted as compact XML hints in prompts, and sampling probability now scales with pool size. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hbrodin
left a comment
Review: 7 findings (5 P2, 2 P3)
The PR's core design is solid — extracting structured findings from raw SARIF and deduplicating via fingerprints is a good simplification for the seed-gen pipeline. The issues below are mostly around atomicity of the Redis dedup pattern, a consistency bug in the backward-compat fallback, and missing test coverage for the new methods.
See inline comments for details.
```python
if self.redis.sismember(seen_key, finding.fingerprint):
    continue
self.redis.rpush(finding_key, finding.model_dump_json())
self.redis.sadd(seen_key, finding.fingerprint)
```
P2 — Race condition: non-atomic dedup allows duplicate findings.
The sismember → rpush → sadd sequence is three independent Redis commands. Two concurrent store() calls for the same task_id can both pass the sismember check before either executes sadd, causing the same finding to be rpushed twice.
The rest of the codebase uses redis.pipeline() for atomic multi-key writes (e.g. maps.py:67, task_registry.py:99, sets.py:121).
Simplest fix — use the `sadd` return value as the atomic guard:

```python
for finding in findings:
    if not self.redis.sadd(seen_key, finding.fingerprint):
        continue
    self.redis.rpush(finding_key, finding.model_dump_json())
    added += 1
```

`sadd` returns 1 if the member was added (new), 0 if it already existed. This eliminates the TOCTOU window in a single atomic check and also fixes the operation ordering (the guard is set before the list push, so a crash between the two can't leave an unguarded duplicate).
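To make the return-value semantics concrete, here is the guard pattern run against a tiny in-memory stand-in for the two Redis commands (`FakeRedis`, `store_findings`, and the key/fingerprint values are illustrative only; redis-py's real `sadd` has the same 0/1 return behavior):

```python
class FakeRedis:
    """Minimal in-memory stand-in for the two Redis commands used here."""

    def __init__(self):
        self.sets: dict[str, set] = {}
        self.lists: dict[str, list] = {}

    def sadd(self, key, member) -> int:
        s = self.sets.setdefault(key, set())
        if member in s:
            return 0  # already present — matches redis-py semantics
        s.add(member)
        return 1

    def rpush(self, key, value) -> int:
        self.lists.setdefault(key, []).append(value)
        return len(self.lists[key])


def store_findings(r, seen_key, finding_key, fingerprints) -> int:
    added = 0
    for fp in fingerprints:
        # sadd is the single atomic check-and-set: 0 means duplicate, skip
        if not r.sadd(seen_key, fp):
            continue
        r.rpush(finding_key, fp)
        added += 1
    return added


r = FakeRedis()
print(store_findings(r, "seen", "pool", ["a", "b", "a"]))  # 2 (second "a" deduped)
print(store_findings(r, "seen", "pool", ["b", "c"]))       # 1 ("b" deduped)
print(r.lists["pool"])                                     # ['a', 'b', 'c']
```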
```python
if all_findings:
    self._store_findings(task_id, all_findings)

return all_findings
```
P2 — Fallback returns un-deduplicated findings.
The primary path (line 220) returns from the deduplicated Redis pool, but this fallback returns the raw all_findings list. If two stored SARIFs contain findings with the same fingerprint, the first call returns duplicates (while all subsequent calls return the deduplicated pool). This inconsistency propagates to sample_findings(), inflating both the probability calculation (len(self.findings)) and the sampling pool.
Suggested fix — re-read from Redis after storing, or deduplicate before returning:

```python
if all_findings:
    self._store_findings(task_id, all_findings)
    # Return from the deduplicated pool to match the primary path
    return self.get_findings_by_task_id(task_id)
return []
```

(This is safe from infinite recursion because `_store_findings` populates the finding key, so the recursive call hits the primary `if finding_list:` path.)
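If the extra Redis round-trip on the fallback path is undesirable, an order-preserving local dedup would also keep the two paths consistent. A sketch — the `Finding` dataclass and `dedup_by_fingerprint` helper are hypothetical stand-ins; only the `fingerprint` attribute comes from the PR:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Minimal stand-in for the PR's Finding model."""
    fingerprint: str
    rule_id: str


def dedup_by_fingerprint(findings):
    """Keep the first finding per fingerprint, preserving insertion order."""
    seen = {}
    for f in findings:
        seen.setdefault(f.fingerprint, f)  # first occurrence wins
    return list(seen.values())


pool = [
    Finding("fp1", "CWE-79"),
    Finding("fp2", "CWE-787"),
    Finding("fp1", "CWE-79"),  # cross-SARIF duplicate
]
print([f.fingerprint for f in dedup_by_fingerprint(pool)])  # ['fp1', 'fp2']
```

This relies on dicts preserving insertion order (guaranteed since Python 3.7), so the returned list matches what the deduplicated Redis pool would contain.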
| """ | ||
| # Get all SARIF keys in Redis | ||
| if added > 0: | ||
| logger.info("Added %d new findings for task %s (total pool: %d)", added, task_id, added) |
P2 — Log message says "total pool" but reports the batch count.
The third format arg is added again, not the actual pool size. If 2 findings are added to a pool that already has 10, the log says total pool: 2.
| logger.info("Added %d new findings for task %s (total pool: %d)", added, task_id, added) | |
| logger.info("Added %d new findings for task %s (total pool: %d)", added, task_id, self.redis.llen(finding_key)) |
```python
if pool_size == 0:
    return []

probability = min(1.0, self.MIN_FINDING_PROBABILITY + 0.14 * pool_size)
```
P3 — Docstring/formula mismatch; magic constant.
The docstring says probability starts at MIN_FINDING_PROBABILITY (0.3) for 1 finding, but 0.3 + 0.14 * 1 = 0.44. The formula reaches 1.0 at pool_size=5 as intended, but the base case doesn't match the docs.
If the intent is p=0.3 at 1 finding and p=1.0 at 5 findings:

```python
scale = (1.0 - self.MIN_FINDING_PROBABILITY) / 4  # 0.175
probability = min(1.0, self.MIN_FINDING_PROBABILITY + scale * (pool_size - 1))
```

Deriving the coefficient from the constants also eliminates the magic 0.14.
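A quick numeric check of the derived formula, assuming `MIN_FINDING_PROBABILITY` is 0.3 as the docstring states:

```python
MIN_FINDING_PROBABILITY = 0.3  # value stated in the docstring
scale = (1.0 - MIN_FINDING_PROBABILITY) / 4  # 0.175


def probability(pool_size: int) -> float:
    """Linear ramp from 0.3 at pool_size=1 to 1.0 at pool_size=5, clamped."""
    return min(1.0, MIN_FINDING_PROBABILITY + scale * (pool_size - 1))


for n in (1, 3, 5, 10):
    print(n, round(probability(n), 3))  # 1 → 0.3, 3 → 0.65, 5 → 1.0, 10 → 1.0 (clamped)
```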
| logger.info("Using %d SARIFs for challenge %s", len(self.sarifs), self.package_name) | ||
| return self.sarifs | ||
| return [] | ||
| def sample_findings(self) -> list[Finding]: |
P2 — No test coverage for sample_findings().
This method has non-trivial logic (probability scaling, MAX_FINDINGS cap, random.sample) but zero tests. Both vuln_discovery test fixtures pass findings=[], so sample_findings() always returns [] in tests.
Edge cases worth covering: empty pool → [], pool_size=1 (probability=0.44), pool_size > MAX_FINDINGS (capped sampling), pool_size=5+ (probability clamped to 1.0). Use unittest.mock.patch('random.random') and patch('random.sample') to make the tests deterministic.
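A sketch of what those deterministic tests could look like, run against a minimal stand-in for the sampling logic (the real method lives on the store class; `MAX_FINDINGS = 3` and the free-function signature are assumptions here):

```python
import random
from unittest import mock

# Stand-in constants and logic mirroring the review's description of
# sample_findings(); the PR's actual values and signature may differ.
MIN_FINDING_PROBABILITY = 0.3
MAX_FINDINGS = 3  # hypothetical cap


def sample_findings(findings):
    pool_size = len(findings)
    if pool_size == 0:
        return []
    probability = min(1.0, MIN_FINDING_PROBABILITY + 0.14 * pool_size)
    if random.random() > probability:
        return []  # probabilistic skip
    return random.sample(findings, min(pool_size, MAX_FINDINGS))


# Pin both sources of randomness to make the tests deterministic.
with mock.patch("random.random", return_value=0.0), \
     mock.patch("random.sample", side_effect=lambda pop, k: list(pop)[:k]):
    assert sample_findings([]) == []                       # empty pool
    assert sample_findings(["f1"]) == ["f1"]               # pool_size=1 sampled
    assert sample_findings(["f1", "f2", "f3", "f4"]) == ["f1", "f2", "f3"]  # capped

with mock.patch("random.random", return_value=0.99):
    # probability at pool_size=1 is 0.44, so a 0.99 roll skips sampling
    assert sample_findings(["f1"]) == []
```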
```python
def format_sarif_hints(self) -> str:
    """Format SARIF hints for prompts"""
    if not self.sarifs:

def format_finding_hints(self) -> str:
```
P2 — No test coverage for format_finding_hints().
This method builds structured XML from Finding objects and is called by both Full and Delta tasks' _gather_context and _analyze_bug prompts. Since test fixtures pass findings=[], this always returns "" in tests — the XML formatting, multi-line vs single-line display, and field interpolation are never exercised.
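A test here only needs a handful of string assertions. The sketch below uses a hypothetical hint shape and `Finding` stand-in — the PR's actual tags and fields may differ — but it shows the kind of coverage worth adding:

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Minimal stand-in for the PR's Finding model."""
    rule_id: str
    file: str
    line: int
    message: str


# Hypothetical shape of the compact XML hints; only illustrates
# what the assertions in a real test would check.
def format_finding_hints(findings):
    if not findings:
        return ""
    parts = ["<findings>"]
    for f in findings:
        parts.append(
            f'  <finding rule="{f.rule_id}" file="{f.file}" line="{f.line}">'
            f"{f.message}</finding>"
        )
    parts.append("</findings>")
    return "\n".join(parts)


hints = format_finding_hints([Finding("CWE-787", "src/parse.c", 42, "OOB write")])
assert hints.startswith("<findings>") and "CWE-787" in hints
assert format_finding_hints([]) == ""  # empty pool yields no hint block
```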
```python
assert findings[1].rule_id == "CWE-787"


def test_findings_deduplication(sarif_store, sarif_with_findings):
```
P3 — Dedup test only covers same-SARIF case.
This stores the identical SARIF broadcast twice, but the primary real-world dedup scenario is two different SARIFs containing findings with the same fingerprint (same rule+location, different sarif_id). Consider adding a test with a second SARIF broadcast that has an overlapping finding to verify cross-SARIF deduplication.
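The scenario is easy to express once fingerprints ignore `sarif_id`. The sketch below uses a hypothetical fingerprint scheme (rule + location, hashed) to show why findings from two distinct broadcasts should collapse to one pool entry:

```python
import hashlib


# Hypothetical fingerprint scheme (rule id + file + line, sarif_id excluded);
# the PR's actual fingerprint function may differ.
def fingerprint(rule_id: str, file: str, line: int) -> str:
    return hashlib.sha256(f"{rule_id}:{file}:{line}".encode()).hexdigest()


# Two findings from *different* SARIF broadcasts, same rule + location:
a = {"sarif_id": "sarif-1", "fp": fingerprint("CWE-787", "src/parse.c", 42)}
b = {"sarif_id": "sarif-2", "fp": fingerprint("CWE-787", "src/parse.c", 42)}
assert a["fp"] == b["fp"]  # cross-SARIF duplicates share a fingerprint

# A different location must not collide:
c = {"sarif_id": "sarif-2", "fp": fingerprint("CWE-787", "src/parse.c", 43)}
assert a["fp"] != c["fp"]
```

A cross-SARIF test would store both broadcasts and assert the pool length is 2 (one shared finding deduplicated, one unique).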
CC @reytchison