Conversation

325c6ec to 8d9a3ba (Compare)
Sanitizer Performance Benchmark
Iterations: 1 warmup + 20 measured
git-pr-chain: feature/race_detector
e54bd1b to 03b2408 (Compare)
# Conflicts:
#   pyproject.toml
#   tests/end_to_end/test_sanitizer.py
#   triton_viz/clients/symbolic_engine.py
```python
if self._phase != "symbolic":
    return True
# End of symbolic phase (block 0 finished)
if self._need_concrete_fallback:
```
[P1] If _need_concrete_fallback becomes true, this path only calls _concretize_block0() and then continues with later blocks. _concretize_block0() records block 0's accesses, but it never re-executes block 0's loads/stores/atomics, so fallback kernels leave block 0's side effects out of memory and synchronization state. That breaks concrete-fallback cases like while/CAS or other data-dependent kernels where out_ptr[0] and shared counters should already have been updated by block 0.
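The gap can be illustrated with a toy model (none of these names exist in triton-viz; this only demonstrates why recording accesses without re-executing them leaves block 0's side effects out of memory):

```python
# Toy memory and block 0's operations; purely illustrative.
memory = {"out_ptr[0]": 0, "counter": 0}
block0_ops = [("store", "out_ptr[0]", 1), ("atomic_add", "counter", 1)]

recorded = []

def record_only(ops):
    # What a record-only _concretize_block0 effectively does:
    # the access log grows, but memory is never touched.
    recorded.extend(ops)

def replay(ops, mem):
    # Record AND apply the side effects, so later blocks observe them.
    recorded.extend(ops)
    for kind, addr, val in ops:
        if kind == "store":
            mem[addr] = val
        elif kind == "atomic_add":
            mem[addr] += val

record_only(block0_ops)
assert memory["counter"] == 0        # later blocks would see a stale counter

replay(block0_ops, memory)
assert memory["counter"] == 1        # block 0's writes are now visible
assert memory["out_ptr[0]"] == 1
```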
```python
    model.evaluate(p, model_completion=True).as_long()
    for p in pid_b
)
if isinstance(addr_a, list):
```
[P2] When addr_a is vector-valued, the reported witness address is always taken from addr_a[0] instead of from the element pair that actually satisfied overlap. If the race is proven on any nonzero lane, address_offset and the recorded offsets for both accesses point at the wrong byte, which makes the symbolic report misleading for vectorized kernels.
```python
    return [substitute(e, *pairs) for e in expr]
return substitute(expr, *pairs)


def _add_mask_constraint(solver: Solver, mask_value) -> None:
```
[P2] On the symbolic path, _add_mask_constraint only enforces that some mask term is true. Because the overlap equation is built independently over all address lanes, the solver can satisfy overlap with a masked-off element as long as another lane in the same vector is active. Kernels with partial predicates or tail masks can therefore report races on bytes that are never actually accessed.
Code review
Found 4 issues:

1. Both triton-viz/triton_viz/clients/race_detector/race_detector.py lines 694 to 698 and lines 902 to 906, and triton-viz/triton_viz/clients/race_detector/data.py lines 17 to 20 (all in d030b00)
2. triton-viz/triton_viz/clients/race_detector/race_detector.py lines 571 to 574 vs concrete path: lines 844 to 846 (in d030b00)
3. The error raised when … (triton-viz/triton_viz/core/trace.py lines 292 to 297 in d030b00)
4. triton-viz/triton_viz/clients/race_detector/race_detector.py lines 454 to 458 vs concrete path: lines 746 to 750 (in d030b00)
…n barriers

While loops with tl.atomic_cas spin infinitely during symbolic execution: the triton interpreter converts all values to tensors, and tensor comparisons (SymbolicExpr) are always truthy, so the loop condition never evaluates to False.

Fix by adding a SymbolicBailout exception mechanism: when a second CAS is encountered during the symbolic phase (indicating a spin loop), bail out and restart the entire grid in concrete mode. A single CAS is still allowed for symbolic barrier markers (add + cas pattern).

Also add a fast path in detect_races to skip addresses where every access is ATOMIC, avoiding an O(n²) inner loop when CAS-spin barriers produce thousands of accesses at the same sync address.
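The bailout mechanism described above can be sketched as follows (class and method names are illustrative, not the real triton-viz API):

```python
class SymbolicBailout(Exception):
    """Abandon symbolic execution and restart the grid in concrete mode."""

class CasCounter:
    """Counts atomic_cas ops seen while symbolically executing block 0."""

    def __init__(self):
        self._cas_seen = 0

    def note_atomic_cas(self, ptr):
        # One CAS is tolerated as a symbolic barrier marker (add + cas);
        # a second CAS implies a spin loop the symbolic phase cannot exit,
        # because SymbolicExpr comparisons are always truthy.
        self._cas_seen += 1
        if self._cas_seen > 1:
            raise SymbolicBailout(f"spin-loop CAS at ptr {ptr}")

tracer = CasCounter()
tracer.note_atomic_cas(0x1000)        # barrier marker: allowed
try:
    tracer.note_atomic_cas(0x1000)    # second CAS: bail to concrete mode
    bailed = False
except SymbolicBailout:
    bailed = True
assert bailed
```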
- Bail immediately to concrete on data-dependent values instead of continuing symbolic execution with stale expressions (soundness)
- Replace loose mask constraints with per-lane-pair predicates so masked-off lanes cannot produce false positives (precision)
- Extract witness address from the satisfied lane pair instead of blindly using addr_a[0] (precision)
- Collapse RAW/WAR into a single RW enum value (API clarity)
- Add tests for concrete fallback side effects, masked non-overlap false positives, and witness address correctness
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c473d1ed82
```python
else:
    self._races = self._check_symbolic_races()
    self._phase = "z3_done"
    return False
```
Replay the kernel concretely after symbolic success
When the symbolic pass succeeds, post_run_callback switches to z3_done and returns False, which stops _grid_executor_call after the first block. Because _op_load_overrider/_op_store_overrider only build SymbolicExpr objects and never invoke the real interpreter memory ops, no concrete stores are replayed for any block in this path. Running a script under triton-race-detector therefore changes program semantics: kernels that stay on the fast symbolic path leave outputs stale/uninitialized, so any later code that reads those tensors or launches dependent kernels will misbehave.
```python
return {
    ptr_key
    for ptr_key, state in stats.items()
    if state["add"] > 0 and state["cas"] > 0
}
```
Require more than add+cas before advancing symbolic epochs
This treats any pointer that sees both an atomic_add and an atomic_cas as a barrier key. But tl.atomic_cas is only compare-and-swap, not a barrier by itself (docs). In the symbolic fast path, a kernel that uses both atomics on the same counter/lock address without actually waiting for all blocks will still advance to a new epoch here, which suppresses real cross-block RW/WAW reports instead of flagging them.
```python
if in_barrier_candidate:
    if seen_add and seen_cas:
        epoch += 1
        in_barrier_candidate = False
        seen_add = False
```
Keep barrier candidates alive across intervening accesses
The concrete epoch logic clears in_barrier_candidate on the first non-atomic access unless both seen_add and seen_cas are already true. That misses valid global-barrier patterns like atomic_add(sync, 1), some ordinary bookkeeping loads/stores, then a waiting atomic_cas(sync, N, N): phase-1 accesses never move to a later epoch and are reported as races against phase-0 accesses. Any phased kernel with work between “arrive” and “wait” will therefore false-positive once the detector falls back to concrete execution.
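A sketch of epoch annotation that keeps the barrier candidate pending across intervening ordinary accesses (toy `(kind, addr)` tuples; the real MemoryAccess records and helper names differ):

```python
def annotate_epochs(accesses):
    """Split an access stream into epochs at arrive/wait barriers.

    A barrier is an atomic_add on a sync address later matched by an
    atomic_cas on the same address; ordinary loads/stores in between
    must not cancel the candidate.
    """
    epoch = 0
    pending = set()               # sync addresses with an unmatched "arrive"
    annotated = []
    for kind, addr in accesses:
        annotated.append((epoch, kind, addr))
        if kind == "atomic_add":
            pending.add(addr)
        elif kind == "atomic_cas" and addr in pending:
            pending.discard(addr)
            epoch += 1            # everything after the "wait" is a new phase
    return annotated

trace = [("store", 100), ("atomic_add", 0), ("load", 200),
         ("atomic_cas", 0), ("store", 100)]
epochs = [e for e, _, _ in annotate_epochs(trace)]
assert epochs == [0, 0, 0, 0, 1]  # the bookkeeping load did not cancel the barrier
```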
…ter test

- _broadcast_mask returns None (→ bail to concrete) on length mismatch, non-bool lanes, or unrecognised mask forms instead of asserting or defaulting to True
- post_run_callback re-checks _need_concrete_fallback after _check_symbolic_races to handle mid-analysis bail
- atomic_rmw overrider: skip trivially-all-True masks the interpreter always supplies, preventing spurious bail in the Z3 analysis phase
- before_load/before_store: use np.asarray() so memoryview data from tensor pointer lowering doesn't crash
- Add tensor pointer fallback E2E test confirming concrete recording works after MakeBlockPtr bail
- Annotate CAS heuristic as intentionally coarse with tighten-later note
Strengthen symbolic barrier validation with two Z3 checks: PID-independence (reject barriers whose pointer depends on PID) and mask-universality (reject barriers where some blocks are masked off). Also fix atomic_rmw mask filtering to detect object-dtype arrays containing SymbolicExprs. Add byte-range aware race detection: expand element start addresses to full byte ranges using elem_size from tensor dtype, enabling detection of overlapping accesses with mixed element sizes (e.g., float32 at byte 0 vs int8 at byte 2).
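The byte-range expansion above can be sketched with a half-open interval test (a minimal illustration, not the detector's real helper):

```python
def byte_ranges_overlap(start_a, elem_size_a, start_b, elem_size_b):
    # Expand element start addresses to half-open [start, start + size)
    # byte ranges; two ranges overlap iff each starts before the other ends.
    return start_a < start_b + elem_size_b and start_b < start_a + elem_size_a

assert byte_ranges_overlap(0, 4, 2, 1)       # float32 @ byte 0 vs int8 @ byte 2
assert not byte_ranges_overlap(0, 4, 4, 4)   # adjacent float32s do not overlap
```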
Symbolic execution only observes block 0, so it cannot validate that all blocks participate in a barrier. Instead of trying to patch this with Z3 checks, bail to concrete mode when add+cas is detected on the same pointer during symbolic execution, letting the concrete path (which sees all blocks) handle epoch splitting.

Key changes:
- Remove _z3_free_vars, _detect_symbolic_barrier_keys, _validate_symbolic_barriers, _apply_symbolic_epoch_annotations
- Add _note_symbolic_atomic to trigger SymbolicBailout on add+cas
- Add atomic_scope/atomic_sem fields to MemoryAccess data classes
- Filter CTA-scoped atomics from global barrier detection
- Rewrite _annotate_block_epochs with per-address pending state (survives intervening load/store between add and cas)
- Replace _is_barrier_atomic/_touches_address with _matched_barrier_addrs
… message

P1: Symbolic success path now replays the grid concretely for side effects. Previously, post_run_callback returned False on symbolic success, terminating the grid loop and leaving output tensors unpopulated. Now it saves the symbolic race results, then raises SymbolicBailout to restart the grid in concrete mode. The saved results are returned by finalize() instead of re-running detection.

P2: Witness address for byte-range overlaps now uses max(a, b), the start of the actual overlap region, instead of block A's start address, which could fall outside block B's range entirely.

Minor: CLI conflict error message now mentions triton-race-detector alongside triton-sanitizer and triton-profiler.
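The max(a, b) witness choice can be sketched as follows (an illustrative helper, not the PR's actual function):

```python
def overlap_start(start_a, size_a, start_b, size_b):
    """Return the first byte of the overlap region, or None if disjoint."""
    lo = max(start_a, start_b)                    # start of the shared bytes
    hi = min(start_a + size_a, start_b + size_b)  # end of the shared bytes
    return lo if lo < hi else None

assert overlap_start(0, 4, 2, 1) == 2    # inside BOTH ranges, unlike start_a
assert overlap_start(0, 4, 8, 4) is None # disjoint ranges have no witness
```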
Summary
Add a RaceDetector client that detects data races (RW, WAW) between thread blocks in Triton kernels. The detector uses a symbolic-first approach: it records block 0's memory accesses as symbolic expressions, then uses Z3 to check whether any two blocks can access overlapping addresses with at least one write. When symbolic analysis is insufficient (e.g., data-dependent pointers, tensor pointer ops, while-loop CAS barriers) it immediately bails out via SymbolicBailout and restarts all blocks in concrete mode.

Key design decisions

- Symbolic-first analysis: block 0 records SymbolicMemoryAccess records with Z3-compatible pointer/mask expressions. If no data-dependent patterns are found, Z3 solves for cross-block races without executing the remaining blocks. Otherwise, all blocks run concretely.
- Immediate bail-out: data-dependent patterns raise SymbolicBailout immediately instead of continuing with stale symbolic expressions. Block 0's concrete side effects are preserved during the grid restart.
- Per-lane-pair mask constraints (mask_a[i] AND mask_b[j] AND addr_a[i] == addr_b[j]) prevent false positives from masked-off lanes that alias.
- Witness addresses are extracted from the satisfied lane pair rather than blindly from addr_a[0].
- Barrier recognition detects global-barrier patterns (atomic_add and atomic_cas) and splits accesses into epochs. Cross-epoch access pairs are not flagged as races, enabling correct analysis of phased algorithms.
- The triton-race-detector CLI wraps user scripts, patching triton.jit to inject the race detector transparently.

Race types
Known limitations
A second atomic_cas during symbolic block 0 always triggers concrete fallback, even if it isn't a spin loop. This prioritizes soundness over symbolic coverage for multi-CAS kernels.

Tests