[FEAT] Lazy fake-to-real tensor materialization #303
Draft
Start with fake tensors (no CPU copies) by default and lazily retry with real tensors only when indirect loads require concrete data. This avoids expensive host copies for kernels without data dependencies while transparently handling kernels that need them. SANITIZER_ENABLE_FAKE_TENSOR semantics: unset = auto (lazy retry), 1 = force fake, 0 = force real.
Sanitizer Performance Benchmark
Iterations: 1 warmup + 20 measured
2d654b8 to
3245e1b
Replace the coarse-grained retry strategy (NeedRealTensorsError + full kernel re-run) with fine-grained lazy materialization: a TensorMaterializer rebases GPU pointers to on-demand CPU copies only when IndirectSymbolicExprBase.concretize() is called, avoiding unnecessary full-tensor copies for kernels with indirect loads.
- Add TensorMaterializer class to patch.py
- Update concretize() to rebase pointers via materializer
- Remove NeedRealTensorsError, reset_for_retry(), retry logic
- Simplify virtual_memory config back to bool
# Conflicts:
#	triton_viz/core/client.py
I guess a test is missing for this branch.
rebase_pointers() assumed all pointers came from a single GPU storage, breaking when a pointer tensor spans multiple storages. It also crashed on masked-out garbage addresses because concretize() called rebase before computing the mask.
- Make rebase_pointers() mask-aware with per-storage-group rebasing
- Reorder concretize() to compute mask before calling rebase_pointers
- Add regression tests for TensorMaterializer and Config env parsing
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
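The mask-aware, per-storage rebasing described in this commit can be sketched roughly as follows. This is an illustration only, not the code in patch.py: the function name `rebase_by_storage`, the `(gpu_base, nbytes, cpu_base)` tuple layout, and the use of `ValueError` are all hypothetical, and the mask is assumed to already have the same shape as the pointers.

```python
import numpy as np


def rebase_by_storage(ptrs, mask, storages):
    """Rebase active GPU pointers onto CPU copies, one storage at a time.

    `storages` is a hypothetical list of (gpu_base, nbytes, cpu_base) tuples.
    """
    ptrs = np.asarray(ptrs, dtype=np.int64)
    active = np.asarray(mask, dtype=bool)
    out = ptrs.copy()
    unmapped = active.copy()
    for gpu_base, nbytes, cpu_base in storages:
        # Select only active lanes whose address falls inside this storage.
        in_range = active & (ptrs >= gpu_base) & (ptrs < gpu_base + nbytes)
        out[in_range] += cpu_base - gpu_base
        unmapped &= ~in_range
    # Masked-out lanes may hold garbage addresses and are never inspected.
    if unmapped.any():
        raise ValueError("active pointer outside every registered storage")
    return out
```

Because the mask is applied before any range lookup, a garbage address in a masked-out lane neither gets rebased nor triggers the error path.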
Fix mask broadcast bug where scalar/lower-dimensional masks were not expanded to match ptr_data shape, leaving most lanes unrebased. Add np.broadcast_to before flattening.
Fix README documenting wrong default for SANITIZER_ENABLE_FAKE_TENSOR (0 → 1).
Add 6 regression tests for non-zero offset rebasing and broadcast masks.
Rename test classes to match pytest.ini python_classes = *Test pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
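To illustrate the broadcast fix, here is a minimal sketch that expands the mask to the pointer shape with `np.broadcast_to` before flattening. The helper name `rebase_masked` and its single-storage `gpu_base`/`cpu_base` parameters are assumptions made for the example; the real rebase_pointers() signature is not shown in this PR page.

```python
import numpy as np


def rebase_masked(ptrs, mask, gpu_base, cpu_base):
    """Shift active lanes from a GPU base address to a CPU base address."""
    ptrs = np.asarray(ptrs, dtype=np.int64)
    # Expand scalar or lower-dimensional masks to the full pointer shape,
    # then flatten so mask and pointers line up lane by lane.
    flat_mask = np.broadcast_to(np.asarray(mask, dtype=bool), ptrs.shape).ravel()
    flat = ptrs.ravel().copy()
    flat[flat_mask] += cpu_base - gpu_base
    return flat.reshape(ptrs.shape)
```

Without the broadcast step, a scalar `True` mask would flatten to a single element and could not be matched against every lane of a multi-dimensional pointer block.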
…llback
Replace boolean `virtual_memory` config with explicit `TensorMode` enum:
- FORCE_REAL (env=0): always copy tensors to CPU upfront
- LAZY_AUTO (unset/auto): fake tensors + lazy materialization with
eager fallback on unmappable pointers
- FORCE_FAKE (env=1): fake tensors + lazy materialization,
errors on unmappable pointers
Unrecognised env values now emit a warning and default to LAZY_AUTO.
Fix rebase_pointers mask handling: cast to dtype=bool and use .ravel()
so scalar, broadcastable, and same-shape masks all work correctly per
Triton load/store semantics.
Add _eager_materialise_all() fallback in LAZY_AUTO mode: when _find_base
fails, materialise every registered storage to CPU before retrying,
instead of surfacing a raw RuntimeError.
Extract _rebase_core() helper to avoid duplicating fast/slow path logic
between the normal path and the fallback retry.
Tests:
- 7 new config tests (0/1/unset/auto/AUTO/unrecognised + warning)
- scalar False mask, failure-path tests (FORCE_FAKE raises,
LAZY_AUTO fallback materialises all storages)
- Update e2e fixture from _isolate_virtual_memory to _isolate_tensor_mode
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
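The env-value mapping listed in this commit message could be expressed along these lines. This is a sketch under assumptions: the function `parse_tensor_mode`, the enum member values, and the `default` parameter are hypothetical; only the three mode names and the 0 / 1 / unset / auto / unrecognised behaviour come from the commit message.

```python
import warnings
from enum import Enum


class TensorMode(Enum):
    FORCE_REAL = "force_real"
    LAZY_AUTO = "lazy_auto"
    FORCE_FAKE = "force_fake"


def parse_tensor_mode(raw, default=TensorMode.LAZY_AUTO):
    """Map a raw SANITIZER_ENABLE_FAKE_TENSOR value to a TensorMode."""
    if raw is None:  # variable unset
        return default
    value = raw.strip().lower()  # accept "auto" and "AUTO" alike
    if value == "0":
        return TensorMode.FORCE_REAL
    if value == "1":
        return TensorMode.FORCE_FAKE
    if value == "auto":
        return TensorMode.LAZY_AUTO
    # Unrecognised values warn and fall back to the default.
    warnings.warn(f"Unrecognised SANITIZER_ENABLE_FAKE_TENSOR value {raw!r}; using {default.name}")
    return default
```

A caller would typically pass `os.environ.get("SANITIZER_ENABLE_FAKE_TENSOR")`. Keeping the fallback as a parameter means a change of default only touches one argument.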
…Error, e2e tests

Address code-review blockers and high-priority items:

Blocker 1 - Thread safety:
_cpu_offset() now uses double-checked locking so concurrent workers (TRITON_VIZ_NUM_SMS > 1) never materialise the same storage twice. _eager_materialise_all() also runs under the lock.

Blocker 2 - Fallback contract:
Replace catch-all `except RuntimeError` with dedicated UnmappablePointerError. Rename the LAZY_AUTO fallback from "eager-real fallback" to "pre-materialise all storages + retry" in docs, Config docstring, and README. No real-tensor rebuild or kernel re-run happens, so the old wording was misleading.

High - Default flipped to FORCE_REAL:
Unset / "0" → FORCE_REAL (safe default, same as main). "auto" → LAZY_AUTO (opt-in). "1" → FORCE_FAKE. Unrecognised values warn and default to FORCE_REAL. This separates "new mechanism lands" from "default behaviour changes".

Medium - Backwards compat:
config.virtual_memory is now a deprecation shim via __getattr__ / __setattr__. Reads map FORCE_REAL→False, else True. Writes map False→FORCE_REAL, True→LAZY_AUTO. DeprecationWarning on every access.

E2E tests (5 new):
- LAZY_AUTO indirect load: no false OOB
- FORCE_FAKE indirect load: no false OOB
- LAZY_AUTO indirect store: no false OOB
- LAZY_AUTO + num_sms=2 concurrent: no crash, no false OOB
- LAZY_AUTO OOB indirect: sanitizer detects and aborts

Unit tests (7 new):
- virtual_memory deprecation shim (read/write, 5 cases)
- Thread safety: concurrent _cpu_offset / rebase_pointers (2 cases)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
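For readers unfamiliar with double-checked locking, the pattern mentioned under Blocker 1 looks roughly like the sketch below. The class `LazyStorageCache` and its `copy_to_cpu` callback are hypothetical stand-ins; the actual _cpu_offset() implementation is not shown on this page.

```python
import threading


class LazyStorageCache:
    """Materialise each storage at most once, even under concurrent access."""

    def __init__(self, copy_to_cpu):
        self._copy_to_cpu = copy_to_cpu  # hypothetical callback: storage id -> CPU copy
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, storage_id):
        # First check without the lock: the common, already-materialised case.
        if storage_id in self._cache:
            return self._cache[storage_id]
        with self._lock:
            # Second check under the lock: another worker may have won the race.
            if storage_id not in self._cache:
                self._cache[storage_id] = self._copy_to_cpu(storage_id)
            return self._cache[storage_id]
```

The unlocked first check keeps the hot path cheap, while the second check inside the lock guarantees the expensive copy runs only once per storage.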
Summary
- … `IndirectSymbolicExprBase.concretize()` is called (i.e., indirect loads require concrete data)
- `SANITIZER_ENABLE_FAKE_TENSOR` semantics: unset = auto (lazy retry), `1` = force fake, `0` = force real

Fixes #111
Test plan
- … (`uv run pytest tests/ -x`)
- `SANITIZER_ENABLE_FAKE_TENSOR=0` (force real) passes all sanitizer tests
- `SANITIZER_ENABLE_FAKE_TENSOR=1` (force fake) passes all sanitizer tests
- … (`test_gemm_oob_call_stack`)