Skip to content

Emit Run 0 SARIF for accepted Type-1 and Type-2 clone pairs#173

Merged
leynos merged 11 commits intomainfrom
emit-sarif-run0-type1-type2-qn437v
Apr 8, 2026
Merged

Emit Run 0 SARIF for accepted Type-1 and Type-2 clone pairs#173
leynos merged 11 commits intomainfrom
emit-sarif-run0-type1-type2-qn437v

Conversation

@leynos
Copy link
Copy Markdown
Owner

@leynos leynos commented Mar 29, 2026

Summary

  • Implements Run 0 emission for accepted Type-1 and Type-2 clone pairs. Adds a new run0 module under crates/whitaker_clones_core/ with accept_candidate_pairs and emit_run0, plus unit tests and BDD coverage, and updates docs and roadmap accordingly.

What’s included

  • New module subtree: crates/whitaker_clones_core/src/run0/ containing:
    • emit.rs (SARIF emission logic)
    • error.rs (typed Run 0 errors)
    • mod.rs (module wiring)
    • score.rs (integer-based Jaccard scoring and thresholds)
    • span.rs (byte-range to SARIF region conversion)
    • types.rs (public input/output types)
    • tests.rs (unit tests for Run 0 logic)
  • Public API exposure in crates/whitaker_clones_core/src/lib.rs for Run 0:
    • AcceptedPair, Run0Error, Run0Result, SimilarityRatio, SimilarityThreshold, TokenFragment, TokenPassConfig, accept_candidate_pairs, emit_run0.
  • Behavioural and integration tests:
    • crates/whitaker_clones_core/src/run0_sarif.feature (BDD feature file)
    • crates/whitaker_clones_core/src/run0_sarif_behaviour.rs (BDD harness)
  • Tests and fixtures for Run 0 emission:
    • crates/whitaker_clones_core/tests/features/run0_sarif.feature (feature area)
    • crates/whitaker_clones_core/tests/run0_sarif_behaviour.rs (behaviour tests)
  • ExecPlan living document updated to reflect implementation:
    • docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md (status COMPLETE)
  • Roadmap and design updates:
    • docs/roadmap.md updated to mark 7.2.3 as done
    • docs/whitaker-clone-detector-design.md updated with dedicated 7.2.3 decisions (in-file section)

Rationale

  • This completes roadmap item 7.2.3 by turning the prior planning into a concrete, test-covered Run 0 emission path. The new run0 module reuses existing SARIF tooling (whitaker_sarif) and keeps emission logic encapsulated within whitaker_clones_core as planned, enabling deterministic results and stable regions for downstream tooling.

Implementation plan (high level)

  • Stage A: Draft ExecPlan (completed)
  • Stage B: Add unit tests for thresholds, span mapping, and deterministic ordering (completed)
  • Stage C: Add BDD coverage for Run 0 emission (completed)
  • Stage D: Implement run0 module skeleton and API surface (completed)
  • Stage E: Implement Run 0 emission logic (completed)
  • Stage F: Documentation updates (## Implementation decisions (7.2.3)) (completed)
  • Stage G: Mark roadmap item done in docs/roadmap.md (completed)
  • Stage H: Validate with all gates (fmt, lint, test, etc.) (completed)
  • Stage I: Finalize ExecPlan post-implementation (completed)

How to review

  • Verify the new Run 0 API surface and the emitted SARIF run shape align with the design goals (one SARIF result per accepted pair, with stable primary/related locations and fingerprints).
  • Confirm deterministic sorting and deduplication behavior for multiple accepted pairs.
  • Check unit and behavioural tests exercise threshold boundaries, multi-line span conversion, and reversed-pair determinism.
  • Review documentation updates for accuracy and consistency with the code changes.

Documentation impact

  • Introduces a new ExecPlan living document updated to COMPLETE status.
  • Roadmap and design docs reflect the implemented Run 0 emission decisions and outcomes.

Acceptance criteria

  • The PR implements Run 0 emission for Type-1 and Type-2 accepted pairs with deterministic results.
  • Unit tests and BDD tests cover happy paths, threshold boundaries, multi-line spans, and reversed inputs.
  • The emitted SARIF run uses the existing whitaker_sarif tooling and carries stable fingerprints and 1-based SARIF regions.
  • Documentation and roadmap reflect completion of 7.2.3.

📎 Task: https://www.devboxer.com/task/81bbb2d8-2581-49d3-a21b-cba40d1d3c53

📎 Task: https://www.devboxer.com/task/3515b49a-8747-49b6-80b3-5e2575dce6dd

Summary by Sourcery

Implement token-pass Run 0 SARIF emission for accepted clone pairs and expose a public API for acceptance and emission in whitaker_clones_core.

New Features:

  • Add a dedicated run0 module in whitaker_clones_core for token-pass acceptance and SARIF Run 0 emission.
  • Expose public types and functions for Run 0, including TokenFragment, TokenPassConfig, AcceptedPair, SimilarityThreshold, accept_candidate_pairs, and emit_run0.
  • Introduce deterministic Jaccard-based scoring and threshold handling for Type-1 and Type-2 clone pairs.

Enhancements:

  • Define stable pair and token hash fingerprints using SHA-256 for SARIF partialFingerprints.
  • Implement byte-range to SARIF region mapping with validation for UTF-8 alignment and bounds.
  • Ensure deterministic sorting and deduplication of emitted SARIF results based on rule, fingerprints, and message.

Documentation:

  • Document implementation decisions for roadmap item 7.2.3 in the clone-detector design and add a detailed ExecPlan for Run 0 emission.
  • Update the roadmap to mark SARIF Run 0 emission for accepted Type-1 and Type-2 pairs as complete.

Tests:

  • Add unit tests for similarity scoring, threshold boundaries, span mapping, acceptance logic, and Run 0 emission behaviour.
  • Add BDD feature scenarios and harness for end-to-end Run 0 SARIF emission, including edge cases for empty inputs, multi-line spans, and reversed pairs.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 72183855-e501-4b9f-9b64-08260f8da88e

📥 Commits

Reviewing files that changed from the base of the PR and between 6204a3f and de1788a.

📒 Files selected for processing (5)
  • crates/whitaker_clones_core/src/run0/mod.rs
  • crates/whitaker_clones_core/src/run0/test_helpers.rs
  • crates/whitaker_clones_core/src/run0/tests.rs
  • crates/whitaker_clones_core/src/run0/tests_emit.rs
  • docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md

Token-Pass Run 0 SARIF Emission for Accepted Type-1 and Type-2 Clone Pairs

Overview

Introduces deterministic SARIF Run 0 emission for accepted Type-1 and Type-2 clone candidate pairs in crates/whitaker_clones_core. A complete run0 module subtree has been implemented with acceptance logic, similarity scoring, span mapping, and SARIF result generation, supported by comprehensive unit and BDD-style integration tests.

New Module: run0/

The implementation introduces a new module subtree under crates/whitaker_clones_core/src/run0/ consisting of:

  • emit.rs: Core logic for accept_candidate_pairs() and emit_run0(). Validates thresholds, resolves fragments, rejects pairs with mixed normalisation profiles or empty fingerprint sets, computes Jaccard similarity, assigns rule profiles, and generates SARIF results with deterministic sorting and deduplication.

  • score.rs: Integer-only similarity scoring with SimilarityRatio (intersection/union holder with decimal-string formatting) and SimilarityThreshold (validated ratio with cross-multiplication for threshold checks). Includes jaccard_similarity() and select_rule_profile() helpers.

  • span.rs: Byte-range to SARIF region mapping with region_for_range(), converting validated ranges to 1-based line/column coordinates with UTF-8 boundary validation and byteOffset/byteLength tracking.

  • types.rs: Public data types—TokenFragment (with ID, profile, URI, source text, and retained fingerprints), TokenPassConfig (tool identity and token-pass parameters with fluent builders), and AcceptedPair (candidate pair with assigned profile and similarity ratio).

  • error.rs: Typed error handling via Run0Error enum covering missing fragments, empty fingerprint sets, mixed profiles, invalid thresholds, invalid ranges, UTF-8 boundary violations, and SARIF construction failures.

  • test_helpers.rs: Shared test utilities (FragmentInput, fingerprint(), fragment(), pair(), config()) reducing duplication across test modules.

Public API Surface

New exports from crates/whitaker_clones_core:

  • Functions: accept_candidate_pairs(), emit_run0()
  • Types: AcceptedPair, Run0Error, Run0Result, SimilarityRatio, SimilarityThreshold, TokenFragment, TokenPassConfig

SARIF Output Characteristics

  • Rule IDs: WHK001 for Type-1, WHK002 for Type-2
  • Stable partial fingerprints: whitakerFragment (ID + profile SHA-256) and tokenHash (sorted, deduplicated hashes)
  • Results include primary location (left fragment) and related locations (peer fragments)
  • Whitaker properties attached: profile, k (shingle size), window (winnow window), class_size, and jaccard (via integer-to-decimal conversion avoiding floating-point)
  • Deterministic sorting and deduplication of results

Testing

  • Unit tests (tests.rs): Exercises acceptance boundaries, similarity scoring, region mapping (single/multi-line spans), and error paths (InvalidUtf8Boundary, MissingFragment, MixedProfiles).
  • BDD integration tests (run0_sarif.feature and run0_sarif_behaviour.rs): Six scenarios validating Type-1/Type-2 acceptance, rejection below thresholds, error conditions, region formatting, and deterministic result ordering.
  • SARIF-specific tests (tests_emit.rs): Validates Jaccard deduplication, empty set handling, rule ID assignment, and WhitakerProperties field extraction.

Dependencies

Added sha2 and whitaker_sarif (both workspace dependencies) to support fingerprint hashing and SARIF result construction.

Documentation

  • New ExecPlan: docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md documents the contract, staged delivery, observable requirements, typed errors, and verification gates. This supersedes the roadmap task 7.2.3.
  • Design document update (docs/whitaker-clone-detector-design.md): New section "Implementation decisions (7.2.3)" specifies profile-specific acceptance, stable fingerprint shapes, integer-based scoring with decimal formatting, byte-range validation rules, and deterministic result ordering with SARIF deduplication.
  • Roadmap update (docs/roadmap.md): Task 7.2.3 marked complete.
  • Minor formatting (docs/adr-002-dylint-expect-attribute-macro.md): Text wrapping and phrasing adjustments.

Code Quality

Refactored test scaffolding to reduce duplication via shared helpers in test_helpers.rs. Complex threshold validation extracted into private predicates. Large assertion blocks collapsed to struct-equality assertions where possible. All review comments addressed with minimal, scoped refactors.

Walkthrough

Summarise: Add a public run0 module that accepts Type‑1/Type‑2 token‑pass candidate pairs, computes integer Jaccard similarity with validated thresholds, enforces profile and fingerprint invariants, constructs deterministic SHA‑256 partial fingerprints, maps byte ranges to 1‑based SARIF regions with UTF‑8 checks, emits SARIF Run 0 results, and deterministically sorts/deduplicates outputs.

Changes

Cohort / File(s) Summary
Core Run0 module
crates/whitaker_clones_core/src/run0/mod.rs, crates/whitaker_clones_core/src/run0/emit.rs, crates/whitaker_clones_core/src/run0/error.rs, crates/whitaker_clones_core/src/run0/score.rs, crates/whitaker_clones_core/src/run0/span.rs, crates/whitaker_clones_core/src/run0/types.rs
Introduce Run 0 implementation: candidate acceptance, integer-only Jaccard scoring and threshold validation, profile enforcement and missing/empty fingerprint errors, deterministic SHA‑256 partial fingerprints, byte-range→SARIF region conversion with UTF‑8 boundary validation, SARIF result construction, deterministic sorting and deduplication, and typed Run0Error/Run0Result.
Run0 re‑exports
crates/whitaker_clones_core/src/lib.rs
Re-export run0 API at crate root: accept_candidate_pairs, emit_run0 and related types (Run0Error, Run0Result, SimilarityRatio, SimilarityThreshold, TokenFragment, TokenPassConfig, AcceptedPair).
Dependencies
crates/whitaker_clones_core/Cargo.toml
Add workspace dependencies: sha2 and whitaker_sarif.
Unit tests
crates/whitaker_clones_core/src/run0/tests.rs, crates/whitaker_clones_core/src/run0/tests_emit.rs, crates/whitaker_clones_core/src/run0/test_helpers.rs
Add unit tests and test helpers covering threshold boundaries, Jaccard behaviour (duplicates and empty sets), region mapping and UTF‑8 boundary errors, SARIF emission shape/content, sorting and deduplication, and expected fingerprints/properties.
BDD integration tests
crates/whitaker_clones_core/tests/run0_sarif_behaviour.rs, crates/whitaker_clones_core/tests/features/run0_sarif.feature
Add BDD scenarios exercising end‑to‑end Run 0: T1/T2 acceptance, below‑threshold rejection, empty fingerprint errors, multiline region mapping, deterministic output, and reversed input handling.
Documentation: execplan & design
docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md, docs/whitaker-clone-detector-design.md
Add execplan and design notes describing Run 0 contract: integer thresholding, SHA‑256 fingerprinting, SARIF contracts, region mapping rules, deterministic ordering and deduplication; update design decisions and implementation plan.
Roadmap
docs/roadmap.md
Mark roadmap item 7.2.3 as completed for SARIF Run 0 emission.
ADR text update
docs/adr-002-dylint-expect-attribute-macro.md
Minor phrasing and line‑wrapping adjustments; no behavioural change.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Fragments as Fragment Store
    participant Candidates as Candidate Queue
    participant Accept as accept_candidate_pairs()
    participant Score as Similarity Scorer
    participant Emit as emit_run0()
    participant SARIF as whitaker_sarif
    participant Results as SARIF Run

    Client->>Fragments: Load TokenFragment instances
    Client->>Candidates: Queue CandidatePair entries

    Client->>Accept: call accept_candidate_pairs(fragments, candidates, config)
    activate Accept
    Accept->>Accept: Validate thresholds and build lookup
    Accept->>Fragments: Resolve CandidatePair -> (left, right)
    Accept->>Accept: Enforce same NormProfile and non-empty retained fingerprints
    loop For each candidate
        Accept->>Score: jaccard_similarity(left_hashes, right_hashes)
        activate Score
        Score-->>Accept: Return SimilarityRatio or None
        deactivate Score
        Accept->>Accept: select_rule_profile(ratio, thresholds)
        Accept->>Accept: Accumulate AcceptedPair if accepted
    end
    Accept->>Accept: Sort & deduplicate AcceptedPair list
    Accept-->>Client: Return Vec<AcceptedPair>
    deactivate Accept

    Client->>Emit: call emit_run0(fragments, accepted_pairs, config)
    activate Emit
    Emit->>Fragments: Resolve pairs and validate fingerprints
    loop For each AcceptedPair
        Emit->>Emit: Compute pair/token SHA‑256 hex fingerprints
        Emit->>Emit: Convert byte ranges -> SARIF Region (validate UTF‑8)
        Emit->>SARIF: build_result with rule id, locations, fingerprints, properties
        SARIF-->>Results: Append SarifResult
    end
    Emit->>Results: Sort and deduplicate SarifResult list
    Emit->>SARIF: Build Run (tool name/version, rules, results)
    Emit-->>Client: Return SARIF Run
    deactivate Emit
Loading

Poem

🔍 Tokens pass through similarity's gaze,
Jaccard counts align in careful ways,
SHA‑256 fingerprints steady and clear,
SARIF maps bytes to lines, precise and near,
Run 0 emits results, sorted and sincere. ✨

🚥 Pre-merge checks | ✅ 5 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Testing ⚠️ Warning The PR introduces substantial new functionality with 11 unit tests + 6 BDD scenarios covering happy paths and many error cases, but several public API items and error paths lack adequate test coverage including SimilarityRatio accessors and Run0Error variants. Add unit tests for SimilarityThreshold::new() with invalid inputs, SimilarityRatio::as_decimal_string(), all public accessor methods, dedicated unit test for Run0Error::EmptyFingerprintSet, and InvalidScore tests using parameterised tests to maximise coverage of boundary conditions.
Developer Documentation ⚠️ Warning The run0/ module's APIs and architectural boundaries lack documentation in docs/developers-guide.md, despite being documented in Rustdoc, design documents, and ExecPlan specifications. Add a new section to docs/developers-guide.md documenting the whitaker_clones_core architecture, covering the run0/ module's purpose, public APIs, and token-pass layer integration with usage examples.
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarises the main change: implementing Run 0 SARIF emission for accepted Type-1 and Type-2 clone pairs, which is the core objective of the PR.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, detailing the new run0 module, public API, tests, documentation updates, and implementation rationale.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
User-Facing Documentation ✅ Passed This PR adds internal infrastructure to the unpublished whitaker_clones_core crate that is not yet exposed to end users through the CLI or Dylint lints.
Module-Level Documentation ✅ Passed All newly-added modules in the run0 subtree carry explicit, clear module-level docstrings explaining their purpose and function within the token-pass acceptance and SARIF Run 0 emission pipeline.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch emit-sarif-run0-type1-type2-qn437v

Comment @coderabbitai help to get the list of available commands and usage tips.

@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented Mar 29, 2026

Reviewer's Guide

Implements a new run0 module in whitaker_clones_core that accepts Type-1/Type-2 candidate pairs, scores them with integer Jaccard logic, and emits deterministic SARIF Run 0 results via whitaker_sarif, including stable fingerprints, span mapping, tests, and docs/roadmap updates to mark roadmap item 7.2.3 complete.

Class diagram for whitaker_clones_core run0 types and functions

classDiagram
    class TokenFragment {
        +FragmentId id()
        +NormProfile profile()
        +&str file_uri()
        +&str source_text()
        +&[Fingerprint] retained_fingerprints()
        +new(id: FragmentId, profile: NormProfile, file_uri: String, source_text: String) TokenFragment
        +with_retained_fingerprints(retained_fingerprints: Vec_Fingerprint_) TokenFragment
    }

    class TokenPassConfig {
        +&str tool_name()
        +&str tool_version()
        +usize shingle_size()
        +usize winnow_window()
        +SimilarityThreshold type1_threshold()
        +SimilarityThreshold type2_threshold()
        +new(tool_name: String, tool_version: String) TokenPassConfig
        +with_shingle_size(shingle_size: usize) TokenPassConfig
        +with_winnow_window(winnow_window: usize) TokenPassConfig
        +with_type1_threshold(threshold: SimilarityThreshold) TokenPassConfig
        +with_type2_threshold(threshold: SimilarityThreshold) TokenPassConfig
    }

    class AcceptedPair {
        +&CandidatePair pair()
        +NormProfile profile()
        +SimilarityRatio score()
        +new(pair: CandidatePair, profile: NormProfile, score: SimilarityRatio) AcceptedPair
    }

    class SimilarityRatio {
        -usize intersection
        -usize union
        +new(intersection: usize, union: usize) SimilarityRatio
        +usize intersection()
        +usize union()
        +String as_decimal_string()
    }

    class SimilarityThreshold {
        -&static str name
        -usize numerator
        -usize denominator
        +new(name: &static str, numerator: usize, denominator: usize) SimilarityThreshold
        +new_unchecked(name: &static str, numerator: usize, denominator: usize) SimilarityThreshold
        +usize numerator()
        +usize denominator()
    }

    class Run0Error {
        <<enum>>
        MissingFragment
        EmptyFingerprintSet
        MixedProfiles
        InvalidThreshold
        InvalidFingerprintRange
        InvalidUtf8Boundary
        Sarif
        InvalidScore
    }

    class run0_mod {
        +accept_candidate_pairs(fragments: &[TokenFragment], candidates: &[CandidatePair], config: &TokenPassConfig) Run0Result_Vec_AcceptedPair__
        +emit_run0(fragments: &[TokenFragment], accepted_pairs: &[AcceptedPair], config: &TokenPassConfig) Run0Result_Run_
    }

    class CandidatePair {
        +FragmentId left()
        +FragmentId right()
    }

    class Fingerprint {
        +u64 hash
        +Range_usize_ range
    }

    class FragmentId
    class NormProfile {
        <<enum>>
        T1
        T2
    }

    class Run {
    }

    class WhitakerSarifBuilders {
        RunBuilder
        ResultBuilder
        LocationBuilder
        RegionBuilder
    }

    %% Relationships
    TokenFragment --> FragmentId
    TokenFragment --> NormProfile
    TokenFragment --> Fingerprint

    AcceptedPair --> CandidatePair
    AcceptedPair --> NormProfile
    AcceptedPair --> SimilarityRatio

    TokenPassConfig --> SimilarityThreshold

    SimilarityThreshold ..> Run0Error
    SimilarityRatio ..> Run0Error

    run0_mod ..> TokenFragment
    run0_mod ..> TokenPassConfig
    run0_mod ..> AcceptedPair
    run0_mod ..> CandidatePair
    run0_mod ..> Run0Error
    run0_mod ..> Run
    run0_mod ..> WhitakerSarifBuilders

    Run0Error ..> Run

    Run ..> WhitakerSarifBuilders
Loading

Flow diagram for byte-range to SARIF region mapping in span module

flowchart TD
    A["Input: fragment_id, source_text, byte range start..end"] --> B["validate_range: check start < end and end <= source_text.len"]
    B -->|invalid| E1["Run0Error::InvalidFingerprintRange"]
    B -->|valid| C["check UTF-8 boundaries at start and end"]
    C -->|invalid| E2["Run0Error::InvalidUtf8Boundary"]
    C -->|valid| D["line_starts: build vector of line start offsets"]
    D --> F["line_and_column for start: find line_index where start_offset <= byte < next_start"]
    F --> G["compute start_line = index+1, start_column = byte - line_start + 1"]
    G --> H["line_and_column for end-1: same logic for end_position"]
    H --> I["compute end_line and end_column"]
    I --> J["build Region with start_line, start_column, end_line, end_column, byteOffset=start, byteLength=end-start"]
    J --> K["Return Region or propagate SarifError as Run0Error::Sarif"]
Loading

File-Level Changes

Change Details Files
Add run0 module implementing token-pass acceptance and SARIF Run 0 emission with deterministic scoring, span mapping, and fingerprints.
  • Introduce TokenFragment, TokenPassConfig, AcceptedPair, SimilarityRatio, SimilarityThreshold and related helpers as the public API for Run 0 inputs/outputs.
  • Implement integer-only Jaccard similarity and threshold handling, including profile-based rule selection and stable decimal formatting for scores.
  • Implement byte-range to SARIF Region mapping with validation of bounds and UTF-8 boundaries, returning typed Run0Error variants on failure.
  • Implement accept_candidate_pairs to resolve CandidatePair to TokenFragment, enforce consistent NormProfile, compute Jaccard scores, apply T1/T2 thresholds, and produce a sorted/deduplicated list of AcceptedPair.
  • Implement emit_run0 to map AcceptedPair into whitaker_sarif::SarifResult instances with primary/related locations, WHK001/WHK002 rule ids, WhitakerProperties, and SHA-256-based partial fingerprints, then build a Run with deduplicate_results.
  • Add focused unit tests covering thresholds, Jaccard set semantics, span mapping (single-line/multi-line/invalid), primary vs related location behaviour, result sorting/deduplication, fingerprint keys, and WhitakerProperties content.
crates/whitaker_clones_core/src/run0/mod.rs
crates/whitaker_clones_core/src/run0/types.rs
crates/whitaker_clones_core/src/run0/score.rs
crates/whitaker_clones_core/src/run0/span.rs
crates/whitaker_clones_core/src/run0/error.rs
crates/whitaker_clones_core/src/run0/emit.rs
crates/whitaker_clones_core/src/run0/tests.rs
Expose the new Run 0 API surface from whitaker_clones_core and wire in required dependencies.
  • Export run0 module from lib.rs and re-export the public Run 0 types and functions so downstream crates can call accept_candidate_pairs and emit_run0.
  • Add sha2 and whitaker_sarif as workspace dependencies for hashing and SARIF modelling/emission.
crates/whitaker_clones_core/src/lib.rs
crates/whitaker_clones_core/Cargo.toml
Add BDD-style integration tests for Run 0 SARIF behaviour using rstest-bdd.
  • Define a Run0World fixture that holds fragments, candidate pairs, config, emitted Run, and possible Run0Error, plus helper accessors for results and WhitakerProperties.
  • Provide step definitions for loading named fragment fixtures, queuing candidate pairs, emitting Run 0 (acceptance + emission), and asserting on result count, rule id, locations, primary region/file, Whitaker profile/k/window, and error messages.
  • Capture behaviour in a feature file with scenarios for accepted T1/T2 pairs, below-threshold pairs, empty fingerprints, multi-line spans, and reversed-pair determinism.
crates/whitaker_clones_core/tests/run0_sarif_behaviour.rs
crates/whitaker_clones_core/tests/features/run0_sarif.feature
Document final 7.2.3 implementation decisions and mark roadmap item complete via an ExecPlan and design/roadmap updates.
  • Add a detailed ExecPlan living document for roadmap item 7.2.3, including constraints, risks, decisions, progress, and validation notes, marked as COMPLETE.
  • Add a new 'Implementation decisions (7.2.3)' section to the clone-detector design describing the dedicated run0 module, profile-specific acceptance, SHA-256-based fingerprints, integer-only thresholds, primary/related location contract, span mapping rules, and updated candidate-pair semantics.
  • Mark roadmap item 7.2.3 as done in docs/roadmap.md to reflect completion of Run 0 emission work.
docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md
docs/whitaker-clone-detector-design.md
docs/roadmap.md

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

codescene-delta-analysis[bot]

This comment was marked as outdated.

codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos leynos changed the title Plan SARIF Run 0 emission for Type-1/Type-2 clones Emit SARIF Run 0 for accepted Type-1/Type-2 clone pairs Mar 30, 2026
@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 2, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/whitaker_clones_core/src/run0/score.rs

Comment on file

    }

    pub(crate) fn validate(self) -> Run0Result<()> {
        if self.numerator == 0 || self.denominator == 0 || self.numerator > self.denominator {

❌ New issue: Complex Conditional
SimilarityThreshold.validate has 1 complex conditionals with 2 branches, threshold = 2

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 2, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/whitaker_clones_core/src/run0/tests.rs

Comment on file

fn boundary_threshold_accepts_type1_pair() {
    let fragments = vec![
        fragment(FragmentInput {
            id: "alpha",
            profile: NormProfile::T1,
            file_uri: "src/a.rs",
            source_text: "fn a() {}\n",
            hashes: &[(1, 0..8), (2, 0..8)],
        }),
        fragment(FragmentInput {
            id: "beta",
            profile: NormProfile::T1,
            file_uri: "src/b.rs",
            source_text: "fn b() {}\n",
            hashes: &[(1, 0..8), (2, 0..8)],
        }),
    ];

    let accepted = accept_candidate_pairs(&fragments, &[pair("alpha", "beta")], &config())
        .unwrap_or_else(|error| panic!("unexpected acceptance error: {error}"));

    assert_eq!(accepted.len(), 1);
    assert_eq!(accepted[0].profile(), NormProfile::T1);
    assert_eq!(accepted[0].score(), SimilarityRatio::new(2, 2));
}

❌ New issue: Code Duplication
The module contains 3 functions with similar structure: boundary_threshold_accepts_type1_pair,boundary_threshold_accepts_type2_pair,just_below_threshold_is_rejected

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 2, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/whitaker_clones_core/src/run0/tests.rs

Comment on file

fn emitted_result_contains_fingerprints_and_properties() {
    let fragments = vec![
        fragment(FragmentInput {
            id: "alpha",
            profile: NormProfile::T2,
            file_uri: "src/a.rs",
            source_text: "fn a(x: i32) {}\n",
            hashes: &[(1, 0..15), (2, 0..15)],
        }),
        fragment(FragmentInput {
            id: "beta",
            profile: NormProfile::T2,
            file_uri: "src/b.rs",
            source_text: "fn b(y: i32) {}\n",
            hashes: &[(1, 0..15), (2, 0..15)],
        }),
    ];
    let accepted = vec![AcceptedPair::new(
        pair("alpha", "beta"),
        NormProfile::T2,
        SimilarityRatio::new(2, 2),
    )];

    let run = emit_run0(&fragments, &accepted, &config())
        .unwrap_or_else(|error| panic!("unexpected emit error: {error}"));
    let [result] = run.results.as_slice() else {
        panic!("expected one result");
    };
    let properties = result
        .properties
        .as_ref()
        .unwrap_or_else(|| panic!("Whitaker properties must be present"));
    let extracted = WhitakerProperties::try_from(properties)
        .unwrap_or_else(|error| panic!("unexpected property extraction error: {error}"));

    assert_eq!(result.rule_id, WHK002_ID);
    assert!(
        result
            .partial_fingerprints
            .contains_key(WHITAKER_FRAGMENT_KEY)
    );
    assert!(result.partial_fingerprints.contains_key("tokenHash"));
    assert_eq!(extracted.profile, "T2");
    assert_eq!(extracted.k, 25);
    assert_eq!(extracted.window, 16);
    assert_eq!(extracted.class_size, 2);
}

❌ New issue: Large Assertion Blocks
The test suite contains 5 assertion blocks with at least 4 assertions, threshold = 4

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 2, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/whitaker_clones_core/tests/run0_sarif_behaviour.rs

Comment on file

fn then_whitaker_profile_is(world: &Run0World, profile: String) {
    with_results(world, |results| {
        let [result] = results else {
            panic!("exactly one result must exist before checking Whitaker properties");
        };
        let properties = match result.properties.as_ref() {
            Some(properties) => properties,
            None => panic!("Whitaker properties must be present"),
        };
        match WhitakerProperties::try_from(properties) {
            Ok(extracted) => assert_eq!(extracted.profile, profile),
            Err(error) => panic!("unexpected property extraction error: {error}"),
        }
    });
}

❌ New issue: Code Duplication
The module contains 3 functions with similar structure: then_whitaker_k_is,then_whitaker_profile_is,then_whitaker_window_is

@coderabbitai

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

@coderabbitai

This comment was marked as resolved.

codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 3, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response.

crates/whitaker_clones_core/src/run0/tests.rs

Comment on file

fn single_line_region_uses_one_based_columns() {
    let region = region_for_range("alpha", "fn a() {}\n", 0..8)
        .unwrap_or_else(|error| panic!("unexpected region error: {error}"));

    assert_eq!(region.start_line, 1);
    assert_eq!(region.start_column, Some(1));
    assert_eq!(region.end_line, Some(1));
    assert_eq!(region.end_column, Some(8));
    assert_eq!(region.byte_offset, Some(0));
    assert_eq!(region.byte_length, Some(8));
}

❌ New issue: Large Assertion Blocks
The test suite contains 4 assertion blocks with at least 4 assertions, threshold = 4

@coderabbitai

This comment was marked as resolved.

codescene-delta-analysis[bot]

This comment was marked as outdated.

@devboxerhub devboxerhub Bot force-pushed the emit-sarif-run0-type1-type2-qn437v branch from 278c73b to eb5f747 Compare April 3, 2026 21:35
codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 3, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix:

Code Duplication

crates/whitaker_clones_core/src/run0/tests.rs:

What lead to degradation?

The module contains 5 functions with similar structure: boundary_threshold_accepts_type1_pair,boundary_threshold_accepts_type2_pair,just_below_threshold_is_rejected,multi_line_region_tracks_trailing_newline and 1 more functions

Why does this problem occur?

Duplicated code often leads to code that's harder to change since the same logical change has to be done in multiple functions. More duplication gives lower code health.

How to fix it?

A certain degree of duplicated code might be acceptable. The problems start when it is the same behavior that is duplicated across the functions in the module, ie. a violation of the Don't Repeat Yourself (DRY) principle. DRY violations lead to code that is changed together in predictable patterns, which is both expensive and risky. DRY violations can be identified using CodeScene's X-Ray analysis to detect clusters of change coupled functions with high code similarity. Read More
Once you have identified the similarities across functions, look to extract and encapsulate the concept that varies into its own function(s). These shared abstractions can then be re-used, which minimizes the amount of duplication and simplifies change.

@coderabbitai

This comment was marked as resolved.

codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos leynos changed the title Emit SARIF Run 0 for accepted Type-1/Type-2 clone pairs Plan and Implement Run 0 SARIF emission for Type-1/Type-2 clone pairs Apr 4, 2026
@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 5, 2026

@coderabbitai Please suggest a fix for this issue and supply a prompt for an AI coding agent to enable it to apply the fix. Include the file and symbol names indicated in the issue at the head of your response. Ensure that this is validated against the current version of the codegraph.

If further refinement to address this finding would be deleterious, please supply a clear explanatory one to two paragraph markdown message I can paste into the CodeScene web ui's diagnostic suppression function so this diagnostic can be silenced.

Code Duplication

crates/whitaker_clones_core/src/run0/tests.rs:

What lead to degradation?

The module contains 5 functions with similar structure: boundary_threshold_accepts_type1_pair,boundary_threshold_accepts_type2_pair,just_below_threshold_is_rejected,multi_line_region_tracks_trailing_newline and 1 more functions

Why does this problem occur?

Duplicated code often leads to code that's harder to change since the same logical change has to be done in multiple functions. More duplication gives lower code health.

How to fix it?

A certain degree of duplicated code might be acceptable. The problems start when it is the same behavior that is duplicated across the functions in the module, ie. a violation of the Don't Repeat Yourself (DRY) principle. DRY violations lead to code that is changed together in predictable patterns, which is both expensive and risky. DRY violations can be identified using CodeScene's X-Ray analysis to detect clusters of change coupled functions with high code similarity. Read More
Once you have identified the similarities across functions, look to extract and encapsulate the concept that varies into its own function(s). These shared abstractions can then be re-used, which minimizes the amount of duplication and simplifies change.

@coderabbitai

This comment was marked as resolved.

codescene-delta-analysis[bot]

This comment was marked as outdated.

leynos and others added 3 commits April 5, 2026 17:35
Introduce a comprehensive new ExecPlan markdown document specifying roadmap item 7.2.3. This document outlines the design, constraints, tolerances, risks, and detailed implementation plan for emitting SARIF Run 0 data for accepted Type-1 and Type-2 clone pairs within `whitaker_clones_core`. It details the expected public API, testing strategy, and integration points with `whitaker_sarif`. This lays the groundwork for upcoming development stages and enforces documentation and quality standards for the feature.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
- Introduce a dedicated `run0/` module subtree for token-pass acceptance and SARIF emission
- Implement acceptance of candidate pairs by profile with Jaccard similarity thresholding
- Emit deterministic SARIF Run 0 results including primary and related locations
- Compute SHA-256 hex digests for pair fingerprints and token hashes
- Add integer-only Jaccard scoring and threshold validation
- Map byte-range fingerprints to SARIF regions with line-column info
- Provide comprehensive typed error handling for emission failures
- Add unit tests and BDD feature scenarios for acceptance and emission
- Update crate public API to expose new token-pass types and methods
- Update docs and roadmap to mark 7.2.3 implementation complete

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
- Added `is_valid` method to `SimilarityThreshold` for clarity and reused it in validation.
- Extracted common test logic for building pairs and accepting them into `build_pair_and_accept` helper.
- Added tests ensuring emitted SARIF results contain correct Whitaker rule ID and fingerprints.
- Refactored test step definitions to use a helper `with_whitaker_properties` for cleaner property assertions.
- Improved test readability and reduced duplication across run0 tests and SARIF behavior tests.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
Extract helper functions `assert_single_accepted` and `assert_region` to reduce duplication and improve clarity in test assertions. Update tests to use these helpers for verifying accepted pairs and region calculations.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
@devboxerhub devboxerhub Bot force-pushed the emit-sarif-run0-type1-type2-qn437v branch from a8c0e86 to 3c98af7 Compare April 5, 2026 17:53
codescene-delta-analysis[bot]

This comment was marked as outdated.

@leynos leynos changed the title Plan and Implement Run 0 SARIF emission for Type-1/Type-2 clone pairs Emit Run 0 SARIF for accepted Type-1 and Type-2 clone pairs Apr 5, 2026
@leynos leynos marked this pull request as ready for review April 5, 2026 19:02
sourcery-ai[bot]

This comment was marked as resolved.

@coderabbitai coderabbitai Bot added the Roadmap label Apr 5, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c98af7edb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/whitaker_clones_core/src/run0/span.rs Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/whitaker_clones_core/src/run0/emit.rs`:
- Around line 48-57: The calls to ensure_non_empty are redundant with
jaccard_similarity's None result; remove the ensure_non_empty(left) and
ensure_non_empty(right) calls and keep the let Some(score) =
jaccard_similarity(left.retained_fingerprints(), right.retained_fingerprints())
else { return Err(Run0Error::EmptyFingerprintSet { fragment_id:
left.id().as_str().to_owned(), }); }; construct so empty fingerprint sets are
handled only by jaccard_similarity returning None (referencing ensure_non_empty,
jaccard_similarity, Run0Error::EmptyFingerprintSet,
left.retained_fingerprints(), right.retained_fingerprints(), left.id()).

In `@crates/whitaker_clones_core/src/run0/score.rs`:
- Around line 159-171: The repeated_division function performs O(quotient) work
by looping subtraction; replace that loop with native integer division and
modulus to get the quotient and remainder in constant time: in
repeated_division, keep the denominator==0 check returning None, then compute
quotient = numerator / denominator and remainder = numerator % denominator and
return Some((quotient, remainder)) instead of the while loop to preserve
determinism and correctness (function name: repeated_division).

In `@crates/whitaker_clones_core/src/run0/span.rs`:
- Around line 59-65: The current line_and_column computes column as a byte
offset which breaks for non-ASCII characters; change its signature to fn
line_and_column(source_text: &str, starts: &[usize], offset: usize) -> (usize,
usize), clamp the offset to source_text.len(), take the substring from the line
start to the clamped offset, and compute the column as the count of UTF-16 code
units (e.g., line_slice.encode_utf16(...).count() + 1); then update every call
site (e.g., the places computing (start_line, start_column) and (end_line,
end_column)) to pass source_text as the first argument. Ensure you still compute
line_index the same way and return 1-based line and column numbers.

In `@crates/whitaker_clones_core/src/run0/tests.rs`:
- Around line 1-401: This file exceeds 400 lines because the T2 emission tests
and their helper run were kept in the same file; extract the T2-specific pieces
into a new test module/file to stay under the limit. Create a new tests_emit.rs
(or tests_emit submodule) and move make_t2_emission_run plus the three tests
emitted_t2_result_has_correct_rule_id,
emitted_t2_result_contains_required_fingerprint_keys, and
emitted_t2_result_properties_match_config into it; ensure the new module can
access helpers fragment(), FragmentInput, config(), fragment(), and any types
(WhitakerProperties, WHK002_ID) by importing with use super::* or by moving
small helper functions (e.g., make_t2_emission_run) as pub(crate) helpers into a
shared test_helpers.rs and have both test files import them; update the parent
tests.rs to declare the new submodule (mod tests_emit;) or include the new
test_helpers module so compilation and symbol resolution for
make_t2_emission_run and the emitted_* tests remains correct.
- Around line 42-45: The test helper function pair uses
CandidatePair::new(...).unwrap_or_else(|| panic!(...)), which is verbose; change
it to call .expect("pair `{left}` and `{right}` must be distinct") on the Result
returned by CandidatePair::new so the failure message remains descriptive but
uses the preferred .expect pattern; update the pair function that constructs
CandidatePair from FragmentId::from(left) and FragmentId::from(right)
accordingly.

In `@crates/whitaker_clones_core/src/run0/types.rs`:
- Around line 218-222: Add SimilarityRatio to the top-level imports (import it
from super::score) and then replace occurrences of super::score::SimilarityRatio
with the imported SimilarityRatio type in the AcceptedPair struct definition and
in the related method signatures that currently reference
super::score::SimilarityRatio; specifically update the AcceptedPair declaration
(pair: CandidatePair, profile: NormProfile, score: SimilarityRatio) and any
functions that take or return that score type so they use the imported symbol
instead of the fully-qualified path.

In `@crates/whitaker_clones_core/tests/run0_sarif_behaviour.rs`:
- Line 31: Replace the hard-coded tool version string in the TokenPassConfig
initialization with the compile-time crate version: locate the config field
where TokenPassConfig::new("whitaker_clones_cli@token", "0.2.1") is used and
change the second argument to env!("CARGO_PKG_VERSION") so the fixture binds to
package metadata at compile time.

In
`@docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md`:
- Around line 523-525: The two long shell command lines (the ones starting with
"set -o pipefail && cargo test -p whitaker_clones_core" and "set -o pipefail &&
cargo clippy -p whitaker_clones_core --all-targets --all-features -- -D
warnings") must be wrapped to respect the 120-column Markdown rule: break each
command with a backslash continuation and indent the wrapped continuation lines
so the code block remains equivalent (use "set -o pipefail && \" then a newline
with two-space indentation before "cargo test ..." and similarly for the "cargo
clippy" command with an additional backslash before "-D warnings" to keep the
long flag on its own wrapped line), preserving the original behavior (pipes to
tee and redirection to logs).

In `@docs/whitaker-clone-detector-design.md`:
- Around line 505-508: Update the wording to clarify error types across layers:
explain that while the index/minhash layer returns
IndexError::EmptyFingerprintSet when sketching an empty retained-fingerprint
list, the Run 0 layer (run0/emit.rs) does not propagate that error and instead
surfaces Run0Error::EmptyFingerprintSet; callers should match on
Run0Error::EmptyFingerprintSet for Run 0 operations or translate it to the index
error if normalizing across layers. Include references to
IndexError::EmptyFingerprintSet and Run0Error::EmptyFingerprintSet so readers
can locate the exact variants.
- Around line 477-483: Replace the article text that reads "a SHA-256 hex
digest" with "an SHA-256 hex digest" for the descriptions of
partialFingerprints["whitakerFragment"] and partialFingerprints["tokenHash"];
update the sentence(s) that mention those symbols so they use "an" before
"SHA-256" to fix the grammar.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 99e47aa3-895a-4b2f-a3f0-a2f0353fea2c

📥 Commits

Reviewing files that changed from the base of the PR and between 502edb6 and 3c98af7.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • crates/whitaker_clones_core/Cargo.toml
  • crates/whitaker_clones_core/src/lib.rs
  • crates/whitaker_clones_core/src/run0/emit.rs
  • crates/whitaker_clones_core/src/run0/error.rs
  • crates/whitaker_clones_core/src/run0/mod.rs
  • crates/whitaker_clones_core/src/run0/score.rs
  • crates/whitaker_clones_core/src/run0/span.rs
  • crates/whitaker_clones_core/src/run0/tests.rs
  • crates/whitaker_clones_core/src/run0/types.rs
  • crates/whitaker_clones_core/tests/features/run0_sarif.feature
  • crates/whitaker_clones_core/tests/run0_sarif_behaviour.rs
  • docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md
  • docs/roadmap.md
  • docs/whitaker-clone-detector-design.md

Comment thread crates/whitaker_clones_core/src/run0/emit.rs Outdated
Comment thread crates/whitaker_clones_core/src/run0/score.rs
Comment thread crates/whitaker_clones_core/src/run0/span.rs Outdated
Comment thread crates/whitaker_clones_core/src/run0/tests.rs
Comment thread crates/whitaker_clones_core/src/run0/tests.rs Outdated
Comment thread crates/whitaker_clones_core/src/run0/types.rs
Comment thread crates/whitaker_clones_core/tests/run0_sarif_behaviour.rs Outdated
Comment thread docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md Outdated
Comment thread docs/whitaker-clone-detector-design.md
Comment thread docs/whitaker-clone-detector-design.md Outdated
- Simplify repeated_division to use native division operators with debug assertion
- Fix column computation to count Unicode characters instead of bytes
- Add test coverage for InvalidUtf8Boundary error path
- Add test coverage for jaccard_similarity with empty fragments
- Add test coverage for MissingFragment and MixedProfiles error paths
- Fix documentation: use 'relatedLocations' instead of 'related_locations'
- Define SARIF and CLI acronyms on first use in execplan

The column computation fix ensures SARIF viewers correctly highlight
non-ASCII source code by counting visual columns rather than byte offsets.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

Code improvements:
- Remove redundant ensure_non_empty calls (jaccard_similarity handles this)
- Fix column computation to use UTF-16 code units for SARIF compatibility
- Split tests.rs into tests.rs and tests_emit.rs (now under 400 lines)
- Replace unwrap_or_else with expect in pair() test helper
- Import SimilarityRatio directly in types.rs (remove super:: prefix)
- Use env!("CARGO_PKG_VERSION") in run0_sarif_behaviour.rs

Documentation improvements:
- Wrap long shell commands in execplan to respect 120-column limit
- Fix article grammar: 'a SHA-256' -> 'an SHA-256' in design doc
- Clarify error types across layers (IndexError vs Run0Error)

The UTF-16 column fix ensures SARIF viewers correctly highlight code
in files with multi-byte UTF-8 characters, as SARIF columns are
UTF-16 code units per the specification.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (1)
crates/whitaker_clones_core/src/run0/tests.rs (1)

1-486: 🛠️ Refactor suggestion | 🟠 Major

Split or parameterise the remaining test matrix.

The helpers improved readability, but the file is still 486 lines long, so it still misses the 400-line cap. Move the T2 emission block into a sibling test module and collapse the remaining acceptance and region permutations into #[rstest(...)] cases to bring the file back under the limit.

As per coding guidelines, “Files must not exceed 400 lines in length”, and “Replace duplicated tests with #[rstest(...)] parameterised cases.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/whitaker_clones_core/src/run0/tests.rs` around lines 1 - 486, This
test file exceeds the 400-line limit and duplicates test permutations; split the
T2 emission tests into a separate sibling test module (move make_t2_emission_run
and the emitted_t2_result_* tests into that module) and convert repeated
acceptance/region cases that call build_pair_and_accept,
boundary_threshold_accepts_type1_pair, boundary_threshold_accepts_type2_pair,
just_below_threshold_is_rejected, duplicate_hashes_do_not_inflate_jaccard_score,
jaccard_returns_none_for_empty_fragments,
single_line_region_uses_one_based_columns,
multi_line_region_tracks_trailing_newline, invalid_range_produces_typed_error,
invalid_utf8_boundary_produces_typed_error,
missing_fragment_produces_typed_error, and mixed_profiles_produces_typed_error
into parameterised #[rstest(...)] cases that drive the same assertions (use the
FragmentInput helper, fingerprint, fragment, pair, config, and
assert_region/assert_single_accepted helpers) so duplicated scenarios are
collapsed and the file drops below 400 lines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/whitaker_clones_core/src/run0/score.rs`:
- Around line 9-42: Add Rustdoc doctest examples for the public types and
methods so consumers can see runnable usage: add a "/// # Examples" block for
SimilarityRatio showing construction via SimilarityRatio::new(intersection,
union) and demonstrating formatting with as_decimal_string() (e.g., a simple
example that prints or asserts the string), and add a similar doctest for
SimilarityThreshold (constructing it and showing its behavior/formatting);
ensure examples are valid Rust code that compile and run under cargo test / make
test and reference the public symbols SimilarityRatio, SimilarityRatio::new,
SimilarityRatio::as_decimal_string, and the SimilarityThreshold type/methods so
documentation tests pass.
- Around line 82-90: The exported constructor new_unchecked on the
Threshold-like type exposes an invariant-bypassing API; restrict its visibility
to the crate by changing its signature from pub const fn new_unchecked(...) to
pub(crate) const fn new_unchecked(...), so only internal trusted defaults can
call it and external callers must use the validated public constructors/creators
(refer to the new_unchecked function in score.rs to locate the change).

In `@crates/whitaker_clones_core/src/run0/span.rs`:
- Around line 15-16: The end-position calculation currently does a raw byte
subtract (range.end.saturating_sub(1)), which can land inside a UTF-8 multibyte
character and cause the slice in line_and_column to panic; update the assignment
to compute end_position by finding the previous character boundary before
range.end (e.g., take source_text[..range.end].char_indices().next_back() and
use its index or fall back to range.start) so line_and_column receives a valid
character boundary; update the code around the end_position binding that
precedes the call to line_and_column(source_text, &starts, end_position) and
ensure errors still propagate as Run0Error where applicable.

In
`@docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md`:
- Around line 88-89: The document introduces the unexplained term "Firecrawl
MCP"; expand the acronym on its first occurrence by replacing or augmenting
"Firecrawl MCP" with "Firecrawl MCP (Machine Communication Protocol)" or the
correct full phrase so readers see the full name on first use; ensure subsequent
mentions can keep the shortened "MCP" form and update any nearby references in
the same paragraph if needed.
- Line 1: The document is missing a top-level heading; update the first line to
be a level-1 heading by prefixing the existing title text "Emit Static Analysis
Results Interchange Format (SARIF) Run 0 for accepted Type-1 and Type-2 clone
pairs (roadmap 7.2.3)" with "# " so the file starts with "# Emit Static Analysis
Results Interchange Format (SARIF) Run 0 for accepted Type-1 and Type-2 clone
pairs (roadmap 7.2.3)".
- Line 141: Update the title-case headings to sentence case: change "Surprises &
Discoveries" to "Surprises & discoveries", "Decision Log" to "Decision log",
"Outcomes & Retrospective" to "Outcomes & retrospective", and convert all Stage
headings from Title Case to sentence case; apply the same normalization to the
other occurrences of these headings referenced in the review so all instances
match the documentation style guide.

---

Duplicate comments:
In `@crates/whitaker_clones_core/src/run0/tests.rs`:
- Around line 1-486: This test file exceeds the 400-line limit and duplicates
test permutations; split the T2 emission tests into a separate sibling test
module (move make_t2_emission_run and the emitted_t2_result_* tests into that
module) and convert repeated acceptance/region cases that call
build_pair_and_accept, boundary_threshold_accepts_type1_pair,
boundary_threshold_accepts_type2_pair, just_below_threshold_is_rejected,
duplicate_hashes_do_not_inflate_jaccard_score,
jaccard_returns_none_for_empty_fragments,
single_line_region_uses_one_based_columns,
multi_line_region_tracks_trailing_newline, invalid_range_produces_typed_error,
invalid_utf8_boundary_produces_typed_error,
missing_fragment_produces_typed_error, and mixed_profiles_produces_typed_error
into parameterised #[rstest(...)] cases that drive the same assertions (use the
FragmentInput helper, fingerprint, fragment, pair, config, and
assert_region/assert_single_accepted helpers) so duplicated scenarios are
collapsed and the file drops below 400 lines.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: bf481d05-c5c5-43f6-9ba7-bd1e66b27ce1

📥 Commits

Reviewing files that changed from the base of the PR and between 3c98af7 and 2f35bc9.

📒 Files selected for processing (4)
  • crates/whitaker_clones_core/src/run0/score.rs
  • crates/whitaker_clones_core/src/run0/span.rs
  • crates/whitaker_clones_core/src/run0/tests.rs
  • docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md

Comment thread crates/whitaker_clones_core/src/run0/score.rs
Comment thread crates/whitaker_clones_core/src/run0/score.rs
Comment thread crates/whitaker_clones_core/src/run0/span.rs Outdated
Comment thread docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md Outdated
Comment thread docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (3)
docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md (2)

88-89: ⚠️ Potential issue | 🟡 Minor

Expand MCP on first use.

Replace the first occurrence with the expanded form, then use the acronym
afterwards.

Triage: [type:docstyle]

As per coding guidelines: “Expand any uncommon acronym on first use in documentation, for example, Continuous Integration (CI).”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md`
around lines 88 - 89, Replace the first occurrence of the acronym "MCP" with its
expanded form (e.g., "Model Clarification Protocol (MCP)" or the correct
expanded term used in this project) in the document so the acronym is defined on
first use, then keep using "MCP" for subsequent mentions; update the sentence
containing "Firecrawl MCP" to read "Firecrawl [Expanded Form] (MCP)" and leave
later references as "MCP".

141-141: ⚠️ Potential issue | 🟡 Minor

Convert headings to sentence case.

Rename these headings to sentence case to keep documentation style consistent.

Triage: [type:docstyle]

As per coding guidelines: “Write headings in sentence case in documentation.”

Also applies to: 169-169, 261-261, 368-368, 388-388, 414-414, 444-444, 480-480, 497-497, 502-502

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md`
at line 141, Change the documentation headings from title case/ampersand form to
sentence case—e.g., rename "Surprises & Discoveries" to "Surprises and
discoveries"—and apply the same conversion to the other flagged headings in this
markdown (replace "&" with "and" where appropriate and only capitalize the first
word and proper nouns). Locate the headings in the file (e.g., the "Surprises &
Discoveries" header and the other flagged headers) and update their text to
sentence case to match the doc style guide.
crates/whitaker_clones_core/src/run0/span.rs (1)

15-16: ⚠️ Potential issue | 🔴 Critical

Replace raw byte subtraction for end-position calculation.

Fix Line 15. range.end.saturating_sub(1) can land inside a multibyte UTF-8
code point and trigger a panic when slicing at Line 65.

Patch end-position to the previous character boundary
-    let end_position = range.end.saturating_sub(1);
+    let end_position = source_text[..range.end]
+        .char_indices()
+        .next_back()
+        .map_or(range.start, |(index, _)| index);
     let (end_line, end_column) = line_and_column(source_text, &starts, end_position);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/whitaker_clones_core/src/run0/span.rs` around lines 15 - 16, The
subtraction end_position = range.end.saturating_sub(1) can land inside a UTF-8
codepoint; instead find the previous UTF-8 character boundary before range.end
in the source_text and pass that byte index to line_and_column. Replace the raw
subtraction with logic that walks back to the nearest char boundary (e.g., using
source_text.is_char_boundary in a loop or taking
source_text[..range.end].char_indices().last().map(|(i,_)| i)) to compute the
safe end_position, keeping references to source_text, starts, and the call to
line_and_column unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/whitaker_clones_core/src/run0/emit.rs`:
- Around line 41-54: The EmptyFingerprintSet error currently always reports
left.id() when jaccard_similarity returns None; change the logic so you check
which fragment has an empty retained_fingerprints() before returning the error
and include that fragment's id in Run0Error::EmptyFingerprintSet. Specifically,
in the code around resolve_pair(...) and the jaccard_similarity call, test
left.retained_fingerprints().is_empty() and
right.retained_fingerprints().is_empty() (or inspect the slices passed to
jaccard_similarity) and return Run0Error::EmptyFingerprintSet with fragment_id
set to the empty fragment's id (left.id().as_str().to_owned() or
right.id().as_str().to_owned()) so the error correctly names the failing
fragment.

In `@crates/whitaker_clones_core/src/run0/tests_emit.rs`:
- Around line 45-47: The test helper config() currently passes a hard-coded
version string to TokenPassConfig::new which will go stale after crate version
bumps; change the second argument to use the crate's package version at compile
time (e.g., via env!("CARGO_PKG_VERSION")) so TokenPassConfig::new in function
config() binds to the crate metadata instead of the literal "0.2.1".

In `@crates/whitaker_clones_core/src/run0/tests.rs`:
- Around line 7-12: The import block containing symbols AcceptedPair, Run0Error,
SimilarityThreshold, TokenFragment, TokenPassConfig, accept_candidate_pairs,
emit_run0, score::SimilarityRatio, and span::region_for_range is not
rustfmt-compliant; run rustfmt (cargo fmt) to reformat this use block so it
meets the project's formatting rules, then re-run cargo fmt --check to verify;
locate the use ... { ... } block that imports those symbols in the tests module
(references: AcceptedPair, Run0Error, SimilarityThreshold, TokenFragment,
TokenPassConfig, accept_candidate_pairs, emit_run0, score::SimilarityRatio,
span::region_for_range) and commit the formatted result.

---

Duplicate comments:
In `@crates/whitaker_clones_core/src/run0/span.rs`:
- Around line 15-16: The subtraction end_position = range.end.saturating_sub(1)
can land inside a UTF-8 codepoint; instead find the previous UTF-8 character
boundary before range.end in the source_text and pass that byte index to
line_and_column. Replace the raw subtraction with logic that walks back to the
nearest char boundary (e.g., using source_text.is_char_boundary in a loop or
taking source_text[..range.end].char_indices().last().map(|(i,_)| i)) to compute
the safe end_position, keeping references to source_text, starts, and the call
to line_and_column unchanged.

In
`@docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md`:
- Around line 88-89: Replace the first occurrence of the acronym "MCP" with its
expanded form (e.g., "Model Clarification Protocol (MCP)" or the correct
expanded term used in this project) in the document so the acronym is defined on
first use, then keep using "MCP" for subsequent mentions; update the sentence
containing "Firecrawl MCP" to read "Firecrawl [Expanded Form] (MCP)" and leave
later references as "MCP".
- Line 141: Change the documentation headings from title case/ampersand form to
sentence case—e.g., rename "Surprises & Discoveries" to "Surprises and
discoveries"—and apply the same conversion to the other flagged headings in this
markdown (replace "&" with "and" where appropriate and only capitalize the first
word and proper nouns). Locate the headings in the file (e.g., the "Surprises &
Discoveries" header and the other flagged headers) and update their text to
sentence case to match the doc style guide.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9f9754b3-0f45-4f24-9391-96f8121586d0

📥 Commits

Reviewing files that changed from the base of the PR and between 2f35bc9 and 498a30c.

📒 Files selected for processing (10)
  • crates/whitaker_clones_core/src/run0/emit.rs
  • crates/whitaker_clones_core/src/run0/mod.rs
  • crates/whitaker_clones_core/src/run0/span.rs
  • crates/whitaker_clones_core/src/run0/tests.rs
  • crates/whitaker_clones_core/src/run0/tests_emit.rs
  • crates/whitaker_clones_core/src/run0/types.rs
  • crates/whitaker_clones_core/tests/run0_sarif_behaviour.rs
  • docs/adr-002-dylint-expect-attribute-macro.md
  • docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md
  • docs/whitaker-clone-detector-design.md

Comment thread crates/whitaker_clones_core/src/run0/emit.rs
Comment thread crates/whitaker_clones_core/src/run0/tests_emit.rs Outdated
Comment thread crates/whitaker_clones_core/src/run0/tests.rs
@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 6, 2026

@coderabbitai Have the following now been resolved?

Annul any requirements that violate the en-GB-oxendict spelling (-ize / -yse / -our) conventions (for example a request to replace "normalize" with "normalise" or "artefact" with "artifact"), or where the requirement unnecessarily increases cyclomatic complexity.

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="crates/whitaker_clones_core/src/run0/score.rs" line_range="159-171" />
<code_context>
+        >= score.union.saturating_mul(threshold.numerator)
+}
+
+fn repeated_division(numerator: usize, denominator: usize) -> Option<(usize, usize)> {
+    if denominator == 0 {
+        return None;
+    }
+
+    let mut quotient = 0usize;
+    let mut remainder = numerator;
+    while remainder >= denominator {
+        remainder = remainder.saturating_sub(denominator);
+        quotient = quotient.saturating_add(1);
+    }
+    Some((quotient, remainder))
+}
</code_context>
<issue_to_address>
**suggestion:** Simplify division to use native operators now that unions are non-zero

Given unions are guaranteed non‑zero at all current call sites, the helper can be replaced with direct `numerator / denominator` and `numerator % denominator`, which is clearer and avoids the manual loop. If you prefer to keep the helper, consider adding a debug assertion that `denominator > 0` to document the invariant.

```suggestion
fn repeated_division(numerator: usize, denominator: usize) -> Option<(usize, usize)> {
    if denominator == 0 {
        debug_assert!(denominator != 0, "denominator must be non-zero");
        return None;
    }

    let quotient = numerator / denominator;
    let remainder = numerator % denominator;

    Some((quotient, remainder))
}
```
</issue_to_address>

### Comment 2
<location path="crates/whitaker_clones_core/src/run0/tests.rs" line_range="305-325" />
<code_context>
+}
+
+#[test]
+fn invalid_range_produces_typed_error() {
+    let error = region_for_range("alpha", "fn a() {}\n", 9..12)
+        .err()
+        .unwrap_or_else(|| panic!("invalid range must error"));
+
+    match error {
+        Run0Error::InvalidFingerprintRange {
+            fragment_id,
+            start,
+            end,
+            source_len,
+        } => {
+            assert_eq!(fragment_id, "alpha");
+            assert_eq!(start, 9);
+            assert_eq!(end, 12);
+            assert_eq!(source_len, 10);
+        }
+        other => panic!("unexpected error: {other}"),
+    }
+}
</code_context>
<issue_to_address>
**suggestion (testing):** Add a unit test for `InvalidUtf8Boundary` to complement the `InvalidFingerprintRange` coverage

`span::region_for_range` can also return `Run0Error::InvalidUtf8Boundary`, but the tests currently only cover `InvalidFingerprintRange` (out-of-bounds range). To cover the UTF-8 boundary path, please add a test along these lines:

```rust
#[test]
fn invalid_utf8_boundary_produces_typed_error() {
    // "á" = 2 bytes in UTF-8
    let source = "aáb";
    let mid = 1; // inside "á" byte sequence
    let err = region_for_range("alpha", source, 0..mid)
        .err()
        .unwrap_or_else(|| panic!("invalid utf-8 boundary must error"));

    match err {
        Run0Error::InvalidUtf8Boundary { fragment_id } => {
            assert_eq!(fragment_id, "alpha");
        }
        other => panic!("unexpected error: {other}"),
    }
}
```

This ensures we lock in the intended behaviour for non-ASCII sources and span computation around UTF-8 boundaries.

```suggestion
#[test]
fn invalid_range_produces_typed_error() {
    let error = region_for_range("alpha", "fn a() {}\n", 9..12)
        .err()
        .unwrap_or_else(|| panic!("invalid range must error"));

    match error {
        Run0Error::InvalidFingerprintRange {
            fragment_id,
            start,
            end,
            source_len,
        } => {
            assert_eq!(fragment_id, "alpha");
            assert_eq!(start, 9);
            assert_eq!(end, 12);
            assert_eq!(source_len, 10);
        }
        other => panic!("unexpected error: {other}"),
    }
}

#[test]
fn invalid_utf8_boundary_produces_typed_error() {
    // "á" = 2 bytes in UTF-8; index 2 is in the middle of that code point
    let source = "aáb";
    let mid = 2; // inside "á" byte sequence
    let error = region_for_range("alpha", source, 0..mid)
        .err()
        .unwrap_or_else(|| panic!("invalid utf-8 boundary must error"));

    match error {
        Run0Error::InvalidUtf8Boundary { fragment_id } => {
            assert_eq!(fragment_id, "alpha");
        }
        other => panic!("unexpected error: {other}"),
    }
}
```
</issue_to_address>

### Comment 3
<location path="crates/whitaker_clones_core/src/run0/tests.rs" line_range="153-162" />
<code_context>
+}
+
+#[test]
+fn duplicate_hashes_do_not_inflate_jaccard_score() {
+    let left = [
+        fingerprint(1, 0..3),
+        fingerprint(1, 3..6),
+        fingerprint(2, 6..9),
+    ];
+    let right = [fingerprint(1, 0..3), fingerprint(2, 3..6)];
+
+    let score = jaccard_similarity(&left, &right)
+        .unwrap_or_else(|| panic!("score should be present for non-empty fragments"));
+
+    assert_eq!(score, SimilarityRatio::new(2, 2));
+}
+
</code_context>
<issue_to_address>
**suggestion (testing):** Also test `jaccard_similarity` behaviour when one side has no fingerprints

The helper returns `None` when either side has no unique hashes, but current tests only cover non-empty inputs and deduplication. Please add a test like:

```rust
#[test]
fn jaccard_returns_none_for_empty_fragments() {
    let non_empty = [fingerprint(1, 0..3)];
    let empty: [Fingerprint; 0] = [];

    assert!(jaccard_similarity(&empty, &non_empty).is_none());
    assert!(jaccard_similarity(&non_empty, &empty).is_none());
}
```

so the `None`-on-empty contract is explicitly covered.

Suggested implementation:

```rust
}
+
+#[test]
+fn duplicate_hashes_do_not_inflate_jaccard_score() {
+    let left = [
+        fingerprint(1, 0..3),
+        fingerprint(1, 3..6),
+        fingerprint(2, 6..9),
+    ];
+    let right = [fingerprint(1, 0..3), fingerprint(2, 3..6)];
+
+    let score = jaccard_similarity(&left, &right)
+        .unwrap_or_else(|| panic!("score should be present for non-empty fragments"));
+
+    assert_eq!(score, SimilarityRatio::new(2, 2));
+}
+
+#[test]
+fn jaccard_returns_none_for_empty_fragments() {
+    let non_empty = [fingerprint(1, 0..3)];
+    let empty: [Fingerprint; 0] = [];
+
+    assert!(jaccard_similarity(&empty, &non_empty).is_none());
+    assert!(jaccard_similarity(&non_empty, &empty).is_none());
+}
+
+#[test]

```

This change assumes:
1. The `fingerprint` helper and `Fingerprint` type are already defined and in scope in this test module (they are referenced by existing tests).
2. `jaccard_similarity` is available from the imported `score` module, which is already used by `duplicate_hashes_do_not_inflate_jaccard_score`.

If the stray `#[test]` before the imports in your snippet is not actually present in the real file, adjust the SEARCH/REPLACE context to match the real surrounding lines (keep the two test functions adjacent and before the next item in the module).
</issue_to_address>

### Comment 4
<location path="crates/whitaker_clones_core/src/run0/tests.rs" line_range="76-85" />
<code_context>
+    );
+}
+
+#[test]
+fn emit_run0_sorts_and_deduplicates_results() {
+    let fragments = vec![
+        fragment(FragmentInput {
</code_context>
<issue_to_address>
**suggestion (testing):** Add unit tests for `MissingFragment` and `MixedProfiles` error paths in acceptance/emission

The current tests for `accept_candidate_pairs` and `emit_run0` cover happy paths and boundaries but don’t exercise the `Run0Error::MissingFragment` or `Run0Error::MixedProfiles` variants.

Please add tests that:
- Call `accept_candidate_pairs` with a candidate referencing a non-existent fragment and assert you get `Run0Error::MissingFragment { fragment_id: "beta" }`.
- Call `accept_candidate_pairs` with fragments using different `NormProfile`s and assert you get `Run0Error::MixedProfiles { left_fragment: "alpha", right_fragment: "beta" }`.

This will ensure the acceptance-stage error modes are covered and remain stable over time.
</issue_to_address>

### Comment 5
<location path="docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md" line_range="28-29" />
<code_context>
+   pass.
+2. Accepted Type-1 pairs produce `WHK001` results and accepted Type-2 pairs
+   produce `WHK002` results.
+3. Each emitted result carries deterministic `message`, `locations`,
+   `related_locations`, `partial_fingerprints`, and Whitaker `properties`.
+4. Unit tests cover happy paths, unhappy paths, threshold boundaries,
+   multi-line span conversion, stable ordering, and duplicate suppression.
</code_context>
<issue_to_address>
**suggestion (typo):** Use `relatedLocations` here for consistency with SARIF naming and the rest of the docs.

Earlier in the design docs this field is called `relatedLocations` / `relatedLocation`, matching the SARIF spec. Using `related_locations` here (in backticks with other SARIF field names) looks like an inconsistent typo rather than an intentional Rust identifier. Please align it to `relatedLocations` for consistency.

```suggestion
3. Each emitted result carries deterministic `message`, `locations`,
   `relatedLocations`, `partial_fingerprints`, and Whitaker `properties`.
```
</issue_to_address>

### Comment 6
<location path="docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md" line_range="1" />
<code_context>
+# Emit SARIF Run 0 for accepted Type-1 and Type-2 clone pairs (roadmap 7.2.3)
+
+This ExecPlan is a living document. The sections `Constraints`, `Tolerances`,
</code_context>
<issue_to_address>
**suggestion (review_instructions):** Define SARIF on first use so the acronym is clear to readers unfamiliar with it.

Please expand SARIF on first use, for example: `Emit Static Analysis Results Interchange Format (SARIF) Run 0 ...`, and then use `SARIF` thereafter.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*.md`

**Instructions:**
Define uncommon acronyms on first use.

</details>
</issue_to_address>

### Comment 7
<location path="docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md" line_range="44" />
<code_context>
+
+## Constraints
+
+- Scope only roadmap item 7.2.3. Do not implement filesystem discovery,
+  grouping into clone classes, abstract syntax tree refinement, HTML output, or
+  CLI wiring in this change.
</code_context>
<issue_to_address>
**suggestion (review_instructions):** CLI should be expanded on first use to satisfy the acronym-definition requirement.

In this section, `CLI wiring` appears without defining CLI. Consider wording such as `command-line interface (CLI) wiring` at first mention, and then `CLI` thereafter.

<details>
<summary>Review instructions:</summary>

**Path patterns:** `**/*.md`

**Instructions:**
Define uncommon acronyms on first use.

</details>
</issue_to_address>

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 6, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== score.rs ==="
fd 'score.rs' crates/whitaker_clones_core/src/run0/ --exec cat {}

echo "=== tests.rs ==="
fd 'tests.rs' crates/whitaker_clones_core/src/run0/ --exec cat {}

echo "=== tests_emit.rs ==="
fd 'tests_emit.rs' crates/whitaker_clones_core/src/run0/ --exec cat {}

echo "=== execplan doc ==="
fd '7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md' --exec cat {}

Length of output: 45370


@leynos — all seven comments are resolved in the current branch. Here is a per-comment status:

# File / Symbol Finding Status
1 score.rsrepeated_division Replace manual loop with / and %; add debug_assert ✅ Done — native operators used, debug_assert!(denominator != 0, …) present
2 tests.rsinvalid_utf8_boundary_produces_typed_error Add UTF-8 boundary error test ✅ Done — test present, uses "aáb" at mid = 2
3 tests_emit.rsjaccard_returns_none_for_empty_fragments Add None-on-empty Jaccard test ✅ Done — both (empty, non_empty) and (non_empty, empty) asserted
4 tests.rsmissing_fragment_produces_typed_error, mixed_profiles_produces_typed_error Add MissingFragment and MixedProfiles error path tests ✅ Done — both tests present and correctly assert the respective Run0Error variants
5 Execplan doc line 28 — `related_locations` Use relatedLocations for SARIF consistency ✅ Done — reads `relatedLocations`
6 Execplan doc line 1 — SARIF acronym Expand SARIF on first use ✅ Done — heading reads "Static Analysis Results Interchange Format (SARIF)"
7 Execplan doc constraints — CLI acronym Expand CLI on first use ✅ Done — reads "command-line interface (CLI) wiring"

Normalized section headers in the 7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md document for consistent styling and readability, such as changing '&' to 'and' and standardizing capitalization across sections like 'Surprises and discoveries' and 'Decision log'. This improves documentation clarity and consistency.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
crates/whitaker_clones_core/src/run0/tests.rs (1)

40-43: ⚠️ Potential issue | 🟡 Minor

Format placeholders in .expect() message are not interpolated.

Same issue as in tests_emit.rs: the {left} and {right} placeholders in the .expect() string at line 42 will not be substituted with variable values.

Proposed fix (static message)
 fn pair(left: &str, right: &str) -> CandidatePair {
     CandidatePair::new(FragmentId::from(left), FragmentId::from(right))
-        .expect("pair `{left}` and `{right}` must be distinct")
+        .expect("pair must be distinct")
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/whitaker_clones_core/src/run0/tests.rs` around lines 40 - 43, The
.expect() message in the pair function isn't interpolating {left} and {right};
update the call around CandidatePair::new in function pair to build the message
with formatting (e.g., use format! or .unwrap_or_else with a formatted panic
message) so the actual left and right values appear (e.g., "pair `{}` and `{}`
must be distinct", left, right).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/whitaker_clones_core/src/run0/score.rs`:
- Around line 180-183: The comparison in meets_threshold (using SimilarityRatio
{ intersection, union } and SimilarityThreshold { numerator, denominator })
risks overflow but currently uses saturating_mul; add a brief comment above the
meets_threshold function documenting the assumption that fingerprint counts
(intersection and union) are many orders of magnitude smaller than usize::MAX in
practical use, explain why saturating_mul is used to avoid panics on
pathological inputs, and note that this choice may produce clamped results only
in extreme/unrealistic cases so callers should validate counts if they may
approach usize::MAX.

In `@crates/whitaker_clones_core/src/run0/tests_emit.rs`:
- Around line 40-43: The expect message in function pair uses `{left}`/`{right}`
placeholders which won't be interpolated; change the call on CandidatePair::new
to either keep a static literal message or use unwrap_or_else with panic! to
include the runtime values (e.g., replace .expect("pair `{left}` and `{right}`
must be distinct") with an unwrap_or_else that calls panic! with formatted
left/right or a static .expect message), referencing the pair function and
CandidatePair::new to locate the edit.

In `@crates/whitaker_clones_core/src/run0/tests.rs`:
- Around line 45-47: The test helper config() currently hard-codes the version
"0.2.1"; update it to use the crate's package version macro like tests_emit.rs
does by replacing the literal in TokenPassConfig::new(...) with
env!("CARGO_PKG_VERSION") so config() returns
TokenPassConfig::new("whitaker_clones_cli@token", env!("CARGO_PKG_VERSION"))
ensuring both tests read the crate metadata version.

In
`@docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md`:
- Around line 26-27: Add a comma after "WHK001 results" in the sentence that
reads "Accepted Type-1 pairs produce `WHK001` results and accepted Type-2 pairs
produce `WHK002` results" so it becomes "Accepted Type-1 pairs produce `WHK001`
results, and accepted Type-2 pairs produce `WHK002` results"; update that
sentence in the document (look for the exact quoted string) to include the comma
for correct grammar.

---

Duplicate comments:
In `@crates/whitaker_clones_core/src/run0/tests.rs`:
- Around line 40-43: The .expect() message in the pair function isn't
interpolating {left} and {right}; update the call around CandidatePair::new in
function pair to build the message with formatting (e.g., use format! or
.unwrap_or_else with a formatted panic message) so the actual left and right
values appear (e.g., "pair `{}` and `{}` must be distinct", left, right).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 34f86f28-7c12-4c34-b944-6759c9f6c04a

📥 Commits

Reviewing files that changed from the base of the PR and between 498a30c and 3ca4110.

📒 Files selected for processing (6)
  • crates/whitaker_clones_core/src/run0/emit.rs
  • crates/whitaker_clones_core/src/run0/score.rs
  • crates/whitaker_clones_core/src/run0/span.rs
  • crates/whitaker_clones_core/src/run0/tests.rs
  • crates/whitaker_clones_core/src/run0/tests_emit.rs
  • docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md

Comment thread crates/whitaker_clones_core/src/run0/score.rs
Comment thread crates/whitaker_clones_core/src/run0/tests_emit.rs Outdated
Comment thread crates/whitaker_clones_core/src/run0/tests.rs Outdated
Comment thread docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md Outdated
…safety

- Introduce `meets_threshold` function to compare Jaccard similarity using cross multiplication
- Use saturating multiplication to avoid panics on pathological large inputs
- Document assumptions on fingerprint counts and guidance for callers

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
codescene-delta-analysis[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/whitaker_clones_core/src/run0/tests_emit.rs`:
- Around line 12-47: Extract the duplicated helpers (fingerprint, FragmentInput,
fragment, pair, config) into a single shared test_helpers module and have both
test modules import them; specifically, move the functions fingerprint,
fragment, pair, config and the FragmentInput struct into a new test_helpers
module with pub(super) visibility for each symbol, update the test modules to
use super::test_helpers::{fingerprint, FragmentInput, fragment, pair, config},
and alternatively consider converting shared setup into rstest fixtures if
preferred to follow the project guideline (ensure names match exactly so
existing calls like TokenFragment::new, FragmentId::from, Fingerprint::new and
TokenPassConfig::new continue to resolve).

In
`@docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md`:
- Line 29: The doc mixes SARIF naming styles: replace the snake_case
`partial_fingerprints` with the SARIF camelCase `partialFingerprints` to match
`relatedLocations` and the usage at line 231; update any mentions of
`partial_fingerprints` in this file (e.g., in the sentence listing
`relatedLocations`, `partial_fingerprints`, and Whitaker `properties`) to
`partialFingerprints` so all SARIF fields use consistent camelCase.
- Around line 460-463: Replace the snake_case SARIF field `partial_fingerprints`
with the camelCase `partialFingerprints` to match the SARIF spec and the other
keys in this doc; update the surrounding examples/descriptions that list
`whitakerFragment` and `tokenHash` so they appear as properties under
`partialFingerprints` (keeping `whitakerFragment` and `tokenHash` names as-is)
and ensure any explanatory text references `partialFingerprints` instead of
`partial_fingerprints`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6b0b2d1e-bffe-48f4-a5a0-001fd8eb1a1d

📥 Commits

Reviewing files that changed from the base of the PR and between 3ca4110 and 6204a3f.

📒 Files selected for processing (4)
  • crates/whitaker_clones_core/src/run0/score.rs
  • crates/whitaker_clones_core/src/run0/tests.rs
  • crates/whitaker_clones_core/src/run0/tests_emit.rs
  • docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md

Comment thread crates/whitaker_clones_core/src/run0/tests_emit.rs Outdated
Comment thread docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md Outdated
Comment thread docs/execplans/7-2-3-emit-sarif-run-0-for-accepted-type-1-and-type-2-pairs.md Outdated
- Move common test helper structs and functions from tests.rs and tests_emit.rs
  into a new test_helpers.rs module under run0.
- Update mod.rs to include test_helpers module conditionally for tests.
- Import helpers from test_helpers in tests.rs and tests_emit.rs,
  reducing code duplication and improving test maintainability.
- Minor docs update to fix casing of emitted SARIF fields (partialFingerprints).

This refactoring improves code organization for run0 acceptance and emission tests.

Co-authored-by: devboxerhub[bot] <devboxerhub[bot]@users.noreply.github.com>
@leynos
Copy link
Copy Markdown
Owner Author

leynos commented Apr 8, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 8, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@leynos leynos merged commit 221fdbd into main Apr 8, 2026
5 checks passed
@leynos leynos deleted the emit-sarif-run0-type1-type2-qn437v branch April 8, 2026 12:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant