
[FEAT] Race Detector #271

Draft
mark14wu wants to merge 16 commits into main from feature/race_detector

Conversation


@mark14wu mark14wu commented Feb 12, 2026

Summary

Add a RaceDetector client that detects data races (RW, WAW) between thread blocks in Triton kernels. The detector uses a symbolic-first approach: it records block 0's memory accesses as symbolic expressions, then uses Z3 to check whether any two blocks can access overlapping addresses with at least one write. When symbolic analysis is insufficient (e.g., data-dependent pointers, tensor pointer ops, while-loop CAS barriers) it immediately bails out via SymbolicBailout and restarts all blocks in concrete mode.

Key design decisions

  • Two-phase execution model: Block 0 runs in symbolic mode, producing SymbolicMemoryAccess records with Z3-compatible pointer/mask expressions. If no data-dependent patterns are found, Z3 solves for cross-block races without executing remaining blocks. Otherwise, all blocks run concretely.
  • Immediate bail-to-concrete: Data-dependent pointers/masks, unsupported ops (tensor pointers), and while-loop CAS patterns raise SymbolicBailout immediately instead of continuing with stale symbolic expressions. Block 0's concrete side effects are preserved during the grid restart.
  • Per-lane-pair mask constraints: The Z3 solver binds mask-active and address-overlap predicates to the same lane pair (mask_a[i] AND mask_b[j] AND addr_a[i] == addr_b[j]), preventing false positives from masked-off lanes that alias.
  • Correct witness extraction: The witness address is extracted from the specific lane pair whose predicate the model satisfies, not blindly from addr_a[0].
  • Barrier-aware epoch annotation: The detector recognizes global barriers (addresses where every block performs both atomic_add and atomic_cas) and splits accesses into epochs. Cross-epoch access pairs are not flagged as races, enabling correct analysis of phased algorithms.
  • CLI entry point: triton-race-detector wraps user scripts, patching triton.jit to inject the race detector transparently.

Race types

  • RW — one access is a load and the other is a store/atomic (or vice versa)
  • WAW — both accesses are writes (store or atomic)
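A minimal sketch of this classification, using hypothetical `AccessType`/`RaceType` enums (the detector's real types may differ); stores and atomics both count as writes:

```python
from enum import Enum, auto
from typing import Optional

class AccessType(Enum):
    LOAD = auto()
    STORE = auto()
    ATOMIC = auto()

class RaceType(Enum):
    RW = auto()   # one read, one write (load vs store/atomic)
    WAW = auto()  # two writes

def _is_write(t: AccessType) -> bool:
    return t in (AccessType.STORE, AccessType.ATOMIC)

def classify(ta: AccessType, tb: AccessType) -> Optional[RaceType]:
    if _is_write(ta) and _is_write(tb):
        return RaceType.WAW
    if _is_write(ta) or _is_write(tb):
        return RaceType.RW
    return None  # two loads never race

print(classify(AccessType.LOAD, AccessType.STORE).name)    # RW
print(classify(AccessType.ATOMIC, AccessType.STORE).name)  # WAW
print(classify(AccessType.LOAD, AccessType.LOAD))          # None
```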

Known limitations

  • CAS heuristic is coarse: A second atomic_cas during symbolic block 0 always triggers concrete fallback, even if it isn't a spin loop. This prioritizes soundness over symbolic coverage for multi-CAS kernels.
  • Tensor/block pointer concrete recording: These ops bail to concrete and the interpreter lowers them to regular Load/Store ops (confirmed by E2E test), but this relies on Triton interpreter internals. A dedicated concrete before-callback would be more robust.

Tests

  • 11 end-to-end tests covering: WAW overlapping store, RW+WAW histogram, correct vector_add (no race), correct atomic histogram (no race), while-loop CAS concrete fallback, two-phase barrier (symbolic & concrete), data-dependent pointer concrete fallback with side-effect verification, masked non-overlap false-positive prevention, witness address correctness, tensor pointer fallback race detection
  • 7 unit tests for epoch annotation and barrier detection

@mark14wu mark14wu changed the title from [FEAT] Race Detector Demo to [FEAT] [Race Detector] Demo Feb 12, 2026
@mark14wu mark14wu changed the base branch from main to refactor/symbolic-client February 17, 2026 04:18
@mark14wu mark14wu force-pushed the refactor/symbolic-client branch from 325c6ec to 8d9a3ba February 17, 2026 04:48
Base automatically changed from refactor/symbolic-client to main February 19, 2026 03:25

github-actions bot commented Feb 19, 2026

Sanitizer Performance Benchmark

Benchmark                   main (min)   PR (min)   Change
gemm                        0.183s       0.184s     +0.5%
gemm_oob                    0.191s       0.191s     -0.3%
indirect_load               0.295s       0.295s     +0.1%
nested_loop                 0.371s       0.373s     +0.4%
block_pointer_loop_advance  0.188s       0.185s     -1.8%
liger_jsd                   0.151s       0.152s     +0.7%
flaggems_layernorm          0.460s       0.465s     +1.0%
swiglu                      0.184s       0.186s     +1.0%
cross_entropy               0.174s       0.173s     -0.3%
fused_linear_jsd            0.228s       0.228s     -0.0%
Total                       2.426s       2.432s     +0.2%

Iterations: 1 warmup + 20 measured

if self._phase != "symbolic":
    return True
# End of symbolic phase (block 0 finished)
if self._need_concrete_fallback:

[P1] If _need_concrete_fallback becomes true, this path only calls _concretize_block0() and then continues with later blocks. _concretize_block0() records block 0's accesses, but it never re-executes block 0's loads/stores/atomics, so fallback kernels leave block 0's side effects out of memory and synchronization state. That breaks concrete-fallback cases like while/CAS or other data-dependent kernels where out_ptr[0] and shared counters should already have been updated by block 0.

    model.evaluate(p, model_completion=True).as_long()
    for p in pid_b
)
if isinstance(addr_a, list):

[P2] When addr_a is vector-valued, the reported witness address is always taken from addr_a[0] instead of from the element pair that actually satisfied overlap. If the race is proven on any nonzero lane, address_offset and the recorded offsets for both accesses point at the wrong byte, which makes the symbolic report misleading for vectorized kernels.

        return [substitute(e, *pairs) for e in expr]
    return substitute(expr, *pairs)

def _add_mask_constraint(solver: Solver, mask_value) -> None:

[P2] On the symbolic path, _add_mask_constraint only enforces that some mask term is true. Because the overlap equation is built independently over all address lanes, the solver can satisfy overlap with a masked-off element as long as another lane in the same vector is active. Kernels with partial predicates or tail masks can therefore report races on bytes that are never actually accessed.

@mark14wu

Code review

Found 4 issues:

  1. RaceType.WAR is defined but never emitted -- WAR races are misclassified as RAW

Both _classify_race_type and _classify_race return RaceType.RAW for the LOAD vs STORE case, which is semantically a WAR (Write-After-Read), not RAW. RaceType.WAR (defined in data.py line 19) is dead code. The transpose.py docstring claims "WAR+WAW race" but only WAW will ever be reported. Similarly, inplace_neighbor.py describes its race as "RAW" when the pattern (block N reads, block N+1 writes) is actually WAR.

if is_write_a and tb == AccessType.LOAD:
    return RaceType.RAW
if ta == AccessType.LOAD and is_write_b:
    return RaceType.RAW
return None

elif is_write_a and tb == AccessType.LOAD:
    return RaceType.RAW
elif ta == AccessType.LOAD and is_write_b:
    return RaceType.RAW
return None

class RaceType(Enum):
    RAW = auto()  # Read-After-Write
    WAR = auto()  # Write-After-Read
    WAW = auto()  # Write-After-Write

  2. Symbolic race checker uses range(i, ...) instead of range(i+1, ...), inconsistent with concrete path

_check_symbolic_races iterates with for j in range(i, len(accesses)), allowing self-comparisons (i == j). While the different_blocks Z3 constraint prevents same-block matches, this is inconsistent with the concrete path detect_races which correctly uses range(bi + 1, len(blocks)). The self-comparisons are wasteful (doubling Z3 work) and can produce duplicate race records for scalar stores to fixed addresses.

for i in range(len(accesses)):
    for j in range(i, len(accesses)):
        acc_a = accesses[i]
        acc_b = accesses[j]

vs concrete path:

for bi in range(len(blocks)):
    for bj in range(bi + 1, len(blocks)):
        block_a, block_b = blocks[bi], blocks[bj]

  3. CLI conflict error message does not mention triton-race-detector

The error raised when @triton_viz.trace() conflicts with a CLI wrapper only mentions triton-sanitizer / triton-profiler. This PR adds triton-race-detector as a new CLI tool but does not update the error message, making it confusing for users who trigger the error via the new tool.

raise RuntimeError(
    "@triton_viz.trace() decorator cannot be used together with "
    "CLI wrapper (e.g., triton-sanitizer / triton-profiler). "
    "Please remove the @triton_viz.trace() decorator from your code "
    "when using CLI tools."
)

  4. Symbolic barrier detection uses weaker criterion than concrete path

_detect_symbolic_barrier_keys only requires state["add"] > 0 and state["cas"] > 0 (i.e., block 0 alone performed both ops). The concrete path _detect_global_barrier_addresses requires every block to participate (state["add"] >= blocks and state["cas"] >= blocks). A kernel where only some blocks conditionally hit the CAS path would pass the symbolic check but fail the concrete check. The fallback trigger (_need_concrete_fallback) does not cover this case since it only fires for data-dependent pointers/masks, not conditional barrier participation.

return {
    ptr_key
    for ptr_key, state in stats.items()
    if state["add"] > 0 and state["cas"] > 0
}

vs concrete path:

barrier_addrs: set[int] = set()
for addr, state in stats.items():
    if state["add"] >= blocks and state["cas"] >= blocks:
        barrier_addrs.add(addr)
return barrier_addrs

…n barriers

While loops with tl.atomic_cas spin infinitely during symbolic execution
because the triton interpreter converts all values to tensors, and tensor
comparisons (SymbolicExpr) are always truthy — the loop condition never
evaluates to False.
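A toy stand-in illustrating the truthiness problem (`SymbolicExpr` here is a minimal mock, not the real class, and the real failure additionally involves the interpreter tensorizing values):

```python
class SymbolicExpr:
    # Comparisons build new expressions instead of evaluating them, so
    # the result of `x != y` is an object, and Python objects are truthy
    # by default -- a `while lock != 0:` loop would never terminate.
    def __init__(self, op="var", args=()):
        self.op, self.args = op, args

    def __ne__(self, other):
        return SymbolicExpr("ne", (self, other))

lock = SymbolicExpr()
cond = lock != 0
print(type(cond).__name__, bool(cond))  # SymbolicExpr True
```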

Fix by adding a SymbolicBailout exception mechanism: when a second CAS is
encountered during symbolic phase (indicating a spin loop), bail out and
restart the entire grid in concrete mode. A single CAS is still allowed
for symbolic barrier markers (add + cas pattern).

Also add a fast-path in detect_races to skip addresses where every access
is ATOMIC, avoiding an O(n²) inner loop when CAS-spin barriers produce
thousands of accesses at the same sync address.
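The fast path could look roughly like this (a hypothetical helper with simplified record types; the real `detect_races` operates on richer access objects): group accesses by address and skip any address where every access is atomic, since same-address atomics serialize and need no pairwise check.

```python
from collections import defaultdict

def addresses_needing_pair_check(accesses):
    # accesses: list of (addr, kind), kind in {"load", "store", "atomic"}.
    # An address touched only by atomics (e.g. a CAS-spin barrier word
    # hit thousands of times) is excluded, avoiding an O(n^2) inner loop
    # over those accesses.
    by_addr = defaultdict(list)
    for addr, kind in accesses:
        by_addr[addr].append(kind)
    return {a for a, kinds in by_addr.items()
            if any(k != "atomic" for k in kinds)}

accesses = [(0x10, "atomic")] * 1000 + [(0x20, "store"), (0x20, "load")]
print(sorted(addresses_needing_pair_check(accesses)))  # [32]
```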
- Bail immediately to concrete on data-dependent values instead of
  continuing symbolic execution with stale expressions (soundness)
- Replace loose mask constraints with per-lane-pair predicates so
  masked-off lanes cannot produce false positives (precision)
- Extract witness address from the satisfied lane pair instead of
  blindly using addr_a[0] (precision)
- Collapse RAW/WAR into a single RW enum value (API clarity)
- Add tests for concrete fallback side effects, masked non-overlap
  false positives, and witness address correctness
@mark14wu

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c473d1ed82


Comment on lines +65 to +68
else:
    self._races = self._check_symbolic_races()
    self._phase = "z3_done"
    return False

P1: Replay the kernel concretely after symbolic success

When the symbolic pass succeeds, post_run_callback switches to z3_done and returns False, which stops _grid_executor_call after the first block. Because _op_load_overrider/_op_store_overrider only build SymbolicExpr objects and never invoke the real interpreter memory ops, no concrete stores are replayed for any block in this path. Running a script under triton-race-detector therefore changes program semantics: kernels that stay on the fast symbolic path leave outputs stale/uninitialized, so any later code that reads those tensors or launches dependent kernels will misbehave.


Comment on lines +445 to +449
return {
    ptr_key
    for ptr_key, state in stats.items()
    if state["add"] > 0 and state["cas"] > 0
}

P1: Require more than add+cas before advancing symbolic epochs

This treats any pointer that sees both an atomic_add and an atomic_cas as a barrier key. But tl.atomic_cas is only compare-and-swap, not a barrier by itself (per the Triton docs). In the symbolic fast path, a kernel that uses both atomics on the same counter/lock address without actually waiting for all blocks will still advance to a new epoch here, which suppresses real cross-block RW/WAW reports instead of flagging them.


Comment on lines +794 to +798
if in_barrier_candidate:
    if seen_add and seen_cas:
        epoch += 1
    in_barrier_candidate = False
    seen_add = False
P2: Keep barrier candidates alive across intervening accesses

The concrete epoch logic clears in_barrier_candidate on the first non-atomic access unless both seen_add and seen_cas are already true. That misses valid global-barrier patterns like atomic_add(sync, 1), some ordinary bookkeeping loads/stores, then a waiting atomic_cas(sync, N, N): phase-1 accesses never move to a later epoch and are reported as races against phase-0 accesses. Any phased kernel with work between “arrive” and “wait” will therefore false-positive once the detector falls back to concrete execution.


@mark14wu mark14wu changed the title from [FEAT] [Race Detector] Demo to [FEAT] Race Detector Mar 18, 2026

…ter test

- _broadcast_mask returns None (→ bail to concrete) on length
  mismatch, non-bool lanes, or unrecognised mask forms instead of
  asserting or defaulting to True
- post_run_callback re-checks _need_concrete_fallback after
  _check_symbolic_races to handle mid-analysis bail
- atomic_rmw overrider: skip trivially-all-True masks the interpreter
  always supplies, preventing spurious bail in the Z3 analysis phase
- before_load/before_store: use np.asarray() so memoryview data from
  tensor pointer lowering doesn't crash
- Add tensor pointer fallback E2E test confirming concrete recording
  works after MakeBlockPtr bail
- Annotate CAS heuristic as intentionally coarse with tighten-later note

Strengthen symbolic barrier validation with two Z3 checks: PID-independence
(reject barriers whose pointer depends on PID) and mask-universality (reject
barriers where some blocks are masked off). Also fix atomic_rmw mask filtering
to detect object-dtype arrays containing SymbolicExprs.

Add byte-range aware race detection: expand element start addresses to full
byte ranges using elem_size from tensor dtype, enabling detection of overlapping
accesses with mixed element sizes (e.g., float32 at byte 0 vs int8 at byte 2).

Symbolic execution only observes block 0, so it cannot validate that all
blocks participate in a barrier. Instead of trying to patch this with Z3
checks, bail to concrete mode when add+cas is detected on the same
pointer during symbolic execution, letting the concrete path (which sees
all blocks) handle epoch splitting.

Key changes:
- Remove _z3_free_vars, _detect_symbolic_barrier_keys,
  _validate_symbolic_barriers, _apply_symbolic_epoch_annotations
- Add _note_symbolic_atomic to trigger SymbolicBailout on add+cas
- Add atomic_scope/atomic_sem fields to MemoryAccess data classes
- Filter CTA-scoped atomics from global barrier detection
- Rewrite _annotate_block_epochs with per-address pending state
  (survives intervening load/store between add and cas)
- Replace _is_barrier_atomic/_touches_address with _matched_barrier_addrs

… message

P1: Symbolic success path now replays the grid concretely for side
effects. Previously, post_run_callback returned False on symbolic
success, terminating the grid loop and leaving output tensors
unpopulated. Now it saves symbolic race results, then raises
SymbolicBailout to restart the grid in concrete mode. The saved
results are returned by finalize() instead of re-running detection.

P2: Witness address for byte-range overlaps now uses max(a, b) —
the start of the actual overlap region — instead of block A's start
address, which could fall outside block B's range entirely.
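The byte-range rule described here reduces to half-open interval overlap; a sketch with a hypothetical `byte_overlap` helper:

```python
def byte_overlap(start_a, size_a, start_b, size_b):
    # Half-open byte ranges [start, start + size) overlap iff each
    # starts before the other ends. The witness is the first shared
    # byte, max(start_a, start_b), which by construction lies inside
    # BOTH ranges -- unlike block A's start, which may fall entirely
    # outside block B's range.
    if start_a < start_b + size_b and start_b < start_a + size_a:
        return max(start_a, start_b)
    return None

# float32 write at byte 0 (4 bytes) vs int8 access at byte 2 (1 byte):
print(byte_overlap(0, 4, 2, 1))  # 2
# Disjoint: float32 at byte 0 vs int8 at byte 4:
print(byte_overlap(0, 4, 4, 1))  # None
```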

Minor: CLI conflict error message now mentions triton-race-detector
alongside triton-sanitizer and triton-profiler.