
[FEAT] Race Detector #271

Draft
mark14wu wants to merge 16 commits into main from feature/race_detector

Conversation


@mark14wu mark14wu commented Feb 12, 2026

Summary

Add a RaceDetector client that detects data races (RW, WAW) between thread blocks in Triton kernels. The detector uses a symbolic-first approach: it records block 0's memory accesses as symbolic expressions, then uses Z3 to check whether any two blocks can access overlapping addresses with at least one write. When symbolic analysis is insufficient (e.g., data-dependent pointers, tensor pointer ops, while-loop CAS barriers) it immediately bails out via SymbolicBailout and restarts all blocks in concrete mode.

Key design decisions

  • Two-phase execution model: Block 0 runs in symbolic mode, producing SymbolicMemoryAccess records with Z3-compatible pointer/mask expressions. If no data-dependent patterns are found, Z3 solves for cross-block races without executing remaining blocks. Otherwise, all blocks run concretely.
  • Immediate bail-to-concrete: Data-dependent pointers/masks, unsupported ops (tensor pointers), and while-loop CAS patterns raise SymbolicBailout immediately instead of continuing with stale symbolic expressions. Block 0's concrete side effects are preserved during the grid restart.
  • Per-lane-pair mask constraints: The Z3 solver binds mask-active and address-overlap predicates to the same lane pair (mask_a[i] AND mask_b[j] AND addr_a[i] == addr_b[j]), preventing false positives from masked-off lanes that alias.
  • Correct witness extraction: The witness address is extracted from the specific lane pair whose predicate the model satisfies, not blindly from addr_a[0].
  • Barrier-aware epoch annotation: The detector recognizes global barriers (addresses where every block performs both atomic_add and atomic_cas) and splits accesses into epochs. Cross-epoch access pairs are not flagged as races, enabling correct analysis of phased algorithms.
  • CLI entry point: triton-race-detector wraps user scripts, patching triton.jit to inject the race detector transparently.

Race types

  • RW — one access is a load and the other is a store/atomic (or vice versa)
  • WAW — both accesses are writes (store or atomic)
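A minimal sketch of this classification, using hypothetical `AccessType`/`RaceType` enums (the detector's real types may differ); stores and atomics both count as writes:

```python
from enum import Enum, auto
from typing import Optional

class AccessType(Enum):
    LOAD = auto()
    STORE = auto()
    ATOMIC = auto()

class RaceType(Enum):
    RW = auto()   # one read, one write (load vs store/atomic)
    WAW = auto()  # two writes

def _is_write(t: AccessType) -> bool:
    return t in (AccessType.STORE, AccessType.ATOMIC)

def classify(ta: AccessType, tb: AccessType) -> Optional[RaceType]:
    if _is_write(ta) and _is_write(tb):
        return RaceType.WAW
    if _is_write(ta) or _is_write(tb):
        return RaceType.RW
    return None  # two loads never race

print(classify(AccessType.LOAD, AccessType.STORE).name)    # RW
print(classify(AccessType.ATOMIC, AccessType.STORE).name)  # WAW
print(classify(AccessType.LOAD, AccessType.LOAD))          # None
```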

Known limitations

  • CAS heuristic is coarse: A second atomic_cas during symbolic block 0 always triggers concrete fallback, even if it isn't a spin loop. This prioritizes soundness over symbolic coverage for multi-CAS kernels.
  • Tensor/block pointer concrete recording: These ops bail to concrete and the interpreter lowers them to regular Load/Store ops (confirmed by E2E test), but this relies on Triton interpreter internals. A dedicated concrete before-callback would be more robust.

Tests

  • 11 end-to-end tests covering: WAW overlapping store, RW+WAW histogram, correct vector_add (no race), correct atomic histogram (no race), while-loop CAS concrete fallback, two-phase barrier (symbolic & concrete), data-dependent pointer concrete fallback with side-effect verification, masked non-overlap false-positive prevention, witness address correctness, tensor pointer fallback race detection
  • 7 unit tests for epoch annotation and barrier detection

@mark14wu mark14wu changed the title from [FEAT] Race Detector Demo to [FEAT] [Race Detector] Demo Feb 12, 2026
@mark14wu mark14wu changed the base branch from main to refactor/symbolic-client February 17, 2026 04:18
@mark14wu mark14wu force-pushed the refactor/symbolic-client branch from 325c6ec to 8d9a3ba February 17, 2026 04:48
Base automatically changed from refactor/symbolic-client to main February 19, 2026 03:25

github-actions bot commented Feb 19, 2026

Sanitizer Performance Benchmark

Benchmark                   main (min)   PR (min)   Change
gemm                        0.183s       0.184s     +0.5%
gemm_oob                    0.191s       0.191s     -0.3%
indirect_load               0.295s       0.295s     +0.1%
nested_loop                 0.371s       0.373s     +0.4%
block_pointer_loop_advance  0.188s       0.185s     -1.8%
liger_jsd                   0.151s       0.152s     +0.7%
flaggems_layernorm          0.460s       0.465s     +1.0%
swiglu                      0.184s       0.186s     +1.0%
cross_entropy               0.174s       0.173s     -0.3%
fused_linear_jsd            0.228s       0.228s     -0.0%
Total                       2.426s       2.432s     +0.2%

Iterations: 1 warmup + 20 measured

if self._phase != "symbolic":
    return True
# End of symbolic phase (block 0 finished)
if self._need_concrete_fallback:

[P1] If _need_concrete_fallback becomes true, this path only calls _concretize_block0() and then continues with later blocks. _concretize_block0() records block 0's accesses, but it never re-executes block 0's loads/stores/atomics, so fallback kernels leave block 0's side effects out of memory and synchronization state. That breaks concrete-fallback cases like while/CAS or other data-dependent kernels where out_ptr[0] and shared counters should already have been updated by block 0.

    model.evaluate(p, model_completion=True).as_long()
    for p in pid_b
)
if isinstance(addr_a, list):

[P2] When addr_a is vector-valued, the reported witness address is always taken from addr_a[0] instead of from the element pair that actually satisfied overlap. If the race is proven on any nonzero lane, address_offset and the recorded offsets for both accesses point at the wrong byte, which makes the symbolic report misleading for vectorized kernels.

        return [substitute(e, *pairs) for e in expr]
    return substitute(expr, *pairs)

def _add_mask_constraint(solver: Solver, mask_value) -> None:

[P2] On the symbolic path, _add_mask_constraint only enforces that some mask term is true. Because the overlap equation is built independently over all address lanes, the solver can satisfy overlap with a masked-off element as long as another lane in the same vector is active. Kernels with partial predicates or tail masks can therefore report races on bytes that are never actually accessed.

@mark14wu

Code review

Found 4 issues:

  1. RaceType.WAR is defined but never emitted -- WAR races are misclassified as RAW

Both _classify_race_type and _classify_race return RaceType.RAW for the LOAD vs STORE case, which is semantically a WAR (Write-After-Read), not RAW. RaceType.WAR (defined in data.py line 19) is dead code. The transpose.py docstring claims "WAR+WAW race" but only WAW will ever be reported. Similarly, inplace_neighbor.py describes its race as "RAW" when the pattern (block N reads, block N+1 writes) is actually WAR.

if is_write_a and tb == AccessType.LOAD:
    return RaceType.RAW
if ta == AccessType.LOAD and is_write_b:
    return RaceType.RAW
return None

elif is_write_a and tb == AccessType.LOAD:
    return RaceType.RAW
elif ta == AccessType.LOAD and is_write_b:
    return RaceType.RAW
return None

class RaceType(Enum):
    RAW = auto()  # Read-After-Write
    WAR = auto()  # Write-After-Read
    WAW = auto()  # Write-After-Write

  2. Symbolic race checker uses range(i, ...) instead of range(i+1, ...), inconsistent with concrete path

_check_symbolic_races iterates with for j in range(i, len(accesses)), allowing self-comparisons (i == j). While the different_blocks Z3 constraint prevents same-block matches, this is inconsistent with the concrete path detect_races which correctly uses range(bi + 1, len(blocks)). The self-comparisons are wasteful (doubling Z3 work) and can produce duplicate race records for scalar stores to fixed addresses.

for i in range(len(accesses)):
    for j in range(i, len(accesses)):
        acc_a = accesses[i]
        acc_b = accesses[j]

vs concrete path:

for bi in range(len(blocks)):
    for bj in range(bi + 1, len(blocks)):
        block_a, block_b = blocks[bi], blocks[bj]

  3. CLI conflict error message does not mention triton-race-detector

The error raised when @triton_viz.trace() conflicts with a CLI wrapper only mentions triton-sanitizer / triton-profiler. This PR adds triton-race-detector as a new CLI tool but does not update the error message, making it confusing for users who trigger the error via the new tool.

raise RuntimeError(
    "@triton_viz.trace() decorator cannot be used together with "
    "CLI wrapper (e.g., triton-sanitizer / triton-profiler). "
    "Please remove the @triton_viz.trace() decorator from your code "
    "when using CLI tools."
)

  4. Symbolic barrier detection uses weaker criterion than concrete path

_detect_symbolic_barrier_keys only requires state["add"] > 0 and state["cas"] > 0 (i.e., block 0 alone performed both ops). The concrete path _detect_global_barrier_addresses requires every block to participate (state["add"] >= blocks and state["cas"] >= blocks). A kernel where only some blocks conditionally hit the CAS path would pass the symbolic check but fail the concrete check. The fallback trigger (_need_concrete_fallback) does not cover this case since it only fires for data-dependent pointers/masks, not conditional barrier participation.

return {
    ptr_key
    for ptr_key, state in stats.items()
    if state["add"] > 0 and state["cas"] > 0
}

vs concrete path:

barrier_addrs: set[int] = set()
for addr, state in stats.items():
    if state["add"] >= blocks and state["cas"] >= blocks:
        barrier_addrs.add(addr)
return barrier_addrs

…n barriers

While loops with tl.atomic_cas spin infinitely during symbolic execution
because the triton interpreter converts all values to tensors, and tensor
comparisons (SymbolicExpr) are always truthy — the loop condition never
evaluates to False.
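A toy stand-in illustrating the truthiness problem (`SymbolicExpr` here is a minimal mock, not the real class, and the real failure additionally involves the interpreter tensorizing values):

```python
class SymbolicExpr:
    # Comparisons build new expressions instead of evaluating them, so
    # the result of `x != y` is an object, and Python objects are truthy
    # by default -- a `while lock != 0:` loop would never terminate.
    def __init__(self, op="var", args=()):
        self.op, self.args = op, args

    def __ne__(self, other):
        return SymbolicExpr("ne", (self, other))

lock = SymbolicExpr()
cond = lock != 0
print(type(cond).__name__, bool(cond))  # SymbolicExpr True
```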

Fix by adding a SymbolicBailout exception mechanism: when a second CAS is
encountered during symbolic phase (indicating a spin loop), bail out and
restart the entire grid in concrete mode. A single CAS is still allowed
for symbolic barrier markers (add + cas pattern).

Also add a fast-path in detect_races to skip addresses where every access
is ATOMIC, avoiding an O(n²) inner loop when CAS-spin barriers produce
thousands of accesses at the same sync address.
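The fast path could look roughly like this (a hypothetical helper with simplified record types; the real `detect_races` operates on richer access objects): group accesses by address and skip any address where every access is atomic, since same-address atomics serialize and need no pairwise check.

```python
from collections import defaultdict

def addresses_needing_pair_check(accesses):
    # accesses: list of (addr, kind), kind in {"load", "store", "atomic"}.
    # An address touched only by atomics (e.g. a CAS-spin barrier word
    # hit thousands of times) is excluded, avoiding an O(n^2) inner loop
    # over those accesses.
    by_addr = defaultdict(list)
    for addr, kind in accesses:
        by_addr[addr].append(kind)
    return {a for a, kinds in by_addr.items()
            if any(k != "atomic" for k in kinds)}

accesses = [(0x10, "atomic")] * 1000 + [(0x20, "store"), (0x20, "load")]
print(sorted(addresses_needing_pair_check(accesses)))  # [32]
```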
- Bail immediately to concrete on data-dependent values instead of
  continuing symbolic execution with stale expressions (soundness)
- Replace loose mask constraints with per-lane-pair predicates so
  masked-off lanes cannot produce false positives (precision)
- Extract witness address from the satisfied lane pair instead of
  blindly using addr_a[0] (precision)
- Collapse RAW/WAR into a single RW enum value (API clarity)
- Add tests for concrete fallback side effects, masked non-overlap
  false positives, and witness address correctness
@mark14wu

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c473d1ed82


Comment on lines +65 to +68
else:
    self._races = self._check_symbolic_races()
    self._phase = "z3_done"
    return False

P1: Replay the kernel concretely after symbolic success

When the symbolic pass succeeds, post_run_callback switches to z3_done and returns False, which stops _grid_executor_call after the first block. Because _op_load_overrider/_op_store_overrider only build SymbolicExpr objects and never invoke the real interpreter memory ops, no concrete stores are replayed for any block in this path. Running a script under triton-race-detector therefore changes program semantics: kernels that stay on the fast symbolic path leave outputs stale/uninitialized, so any later code that reads those tensors or launches dependent kernels will misbehave.


Comment on lines +445 to +449
return {
    ptr_key
    for ptr_key, state in stats.items()
    if state["add"] > 0 and state["cas"] > 0
}

P1: Require more than add+cas before advancing symbolic epochs

This treats any pointer that sees both an atomic_add and an atomic_cas as a barrier key. But tl.atomic_cas is only compare-and-swap, not a barrier by itself (per the Triton docs). In the symbolic fast path, a kernel that uses both atomics on the same counter/lock address without actually waiting for all blocks will still advance to a new epoch here, which suppresses real cross-block RW/WAW reports instead of flagging them.


Comment on lines +794 to +798
if in_barrier_candidate:
    if seen_add and seen_cas:
        epoch += 1
    in_barrier_candidate = False
    seen_add = False
P2: Keep barrier candidates alive across intervening accesses

The concrete epoch logic clears in_barrier_candidate on the first non-atomic access unless both seen_add and seen_cas are already true. That misses valid global-barrier patterns like atomic_add(sync, 1), some ordinary bookkeeping loads/stores, then a waiting atomic_cas(sync, N, N): phase-1 accesses never move to a later epoch and are reported as races against phase-0 accesses. Any phased kernel with work between “arrive” and “wait” will therefore false-positive once the detector falls back to concrete execution.


@mark14wu mark14wu changed the title from [FEAT] [Race Detector] Demo to [FEAT] Race Detector Mar 18, 2026

…ter test

- _broadcast_mask returns None (→ bail to concrete) on length
  mismatch, non-bool lanes, or unrecognised mask forms instead of
  asserting or defaulting to True
- post_run_callback re-checks _need_concrete_fallback after
  _check_symbolic_races to handle mid-analysis bail
- atomic_rmw overrider: skip trivially-all-True masks the interpreter
  always supplies, preventing spurious bail in the Z3 analysis phase
- before_load/before_store: use np.asarray() so memoryview data from
  tensor pointer lowering doesn't crash
- Add tensor pointer fallback E2E test confirming concrete recording
  works after MakeBlockPtr bail
- Annotate CAS heuristic as intentionally coarse with tighten-later note

Strengthen symbolic barrier validation with two Z3 checks: PID-independence
(reject barriers whose pointer depends on PID) and mask-universality (reject
barriers where some blocks are masked off). Also fix atomic_rmw mask filtering
to detect object-dtype arrays containing SymbolicExprs.

Add byte-range aware race detection: expand element start addresses to full
byte ranges using elem_size from tensor dtype, enabling detection of overlapping
accesses with mixed element sizes (e.g., float32 at byte 0 vs int8 at byte 2).

Symbolic execution only observes block 0, so it cannot validate that all
blocks participate in a barrier. Instead of trying to patch this with Z3
checks, bail to concrete mode when add+cas is detected on the same
pointer during symbolic execution, letting the concrete path (which sees
all blocks) handle epoch splitting.

Key changes:
- Remove _z3_free_vars, _detect_symbolic_barrier_keys,
  _validate_symbolic_barriers, _apply_symbolic_epoch_annotations
- Add _note_symbolic_atomic to trigger SymbolicBailout on add+cas
- Add atomic_scope/atomic_sem fields to MemoryAccess data classes
- Filter CTA-scoped atomics from global barrier detection
- Rewrite _annotate_block_epochs with per-address pending state
  (survives intervening load/store between add and cas)
- Replace _is_barrier_atomic/_touches_address with _matched_barrier_addrs

… message

P1: Symbolic success path now replays the grid concretely for side
effects. Previously, post_run_callback returned False on symbolic
success, terminating the grid loop and leaving output tensors
unpopulated. Now it saves symbolic race results, then raises
SymbolicBailout to restart the grid in concrete mode. The saved
results are returned by finalize() instead of re-running detection.

P2: Witness address for byte-range overlaps now uses max(a, b) —
the start of the actual overlap region — instead of block A's start
address, which could fall outside block B's range entirely.
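The byte-range rule described here reduces to half-open interval overlap; a sketch with a hypothetical `byte_overlap` helper:

```python
def byte_overlap(start_a, size_a, start_b, size_b):
    # Half-open byte ranges [start, start + size) overlap iff each
    # starts before the other ends. The witness is the first shared
    # byte, max(start_a, start_b), which by construction lies inside
    # BOTH ranges -- unlike block A's start, which may fall entirely
    # outside block B's range.
    if start_a < start_b + size_b and start_b < start_a + size_a:
        return max(start_a, start_b)
    return None

# float32 write at byte 0 (4 bytes) vs int8 access at byte 2 (1 byte):
print(byte_overlap(0, 4, 2, 1))  # 2
# Disjoint: float32 at byte 0 vs int8 at byte 4:
print(byte_overlap(0, 4, 4, 1))  # None
```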

Minor: CLI conflict error message now mentions triton-race-detector
alongside triton-sanitizer and triton-profiler.