[FEAT] Lazy fake-to-real tensor materialization#303

Draft
mark14wu wants to merge 12 commits into main from feat/lazy-fake-to-real-tensor

Conversation


@mark14wu mark14wu commented Mar 2, 2026

Summary

  • Start with fake tensors (no CPU copies) by default and lazily retry with real tensors only when IndirectSymbolicExprBase.concretize() is called (i.e., when an indirect load requires concrete data)
  • Avoids expensive host copies for kernels without data dependencies while transparently handling kernels that need them
  • SANITIZER_ENABLE_FAKE_TENSOR semantics: unset = auto (lazy retry), 1 = force fake, 0 = force real
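
A minimal sketch of the three-way env-var semantics above (the helper name and return values are illustrative, not the PR's actual parsing code):

```python
import os

def resolve_tensor_strategy(env=os.environ):
    """Map SANITIZER_ENABLE_FAKE_TENSOR to a strategy string.

    unset -> "auto" (start fake, lazily retry with real tensors)
    "1"   -> "fake" (force fake tensors)
    "0"   -> "real" (force real tensors, upfront CPU copies)
    """
    value = env.get("SANITIZER_ENABLE_FAKE_TENSOR")
    if value is None:
        return "auto"
    if value == "1":
        return "fake"
    if value == "0":
        return "real"
    raise ValueError(f"unrecognised SANITIZER_ENABLE_FAKE_TENSOR={value!r}")
```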

Fixes #111

Test plan

  • All 205 existing tests pass (uv run pytest tests/ -x)
  • Verified SANITIZER_ENABLE_FAKE_TENSOR=0 (force real) passes all sanitizer tests
  • Verified SANITIZER_ENABLE_FAKE_TENSOR=1 (force fake) passes all sanitizer tests
  • Default auto mode correctly retries with real tensors on indirect load kernels (e.g., test_gemm_oob_call_stack)

Start with fake tensors (no CPU copies) by default and lazily retry
with real tensors only when indirect loads require concrete data.
This avoids expensive host copies for kernels without data
dependencies while transparently handling kernels that need them.

SANITIZER_ENABLE_FAKE_TENSOR semantics: unset = auto (lazy retry),
1 = force fake, 0 = force real.

github-actions bot commented Mar 2, 2026

Sanitizer Performance Benchmark

| Benchmark | main (min) | PR (min) | Change |
| --- | --- | --- | --- |
| gemm | 0.167s | 0.169s | +1.4% |
| gemm_oob | 0.175s | 0.175s | +0.2% |
| indirect_load | 0.255s | 0.257s | +0.7% |
| nested_loop | 0.332s | 0.335s | +0.7% |
| block_pointer_loop_advance | 0.162s | 0.160s | -1.3% |
| liger_jsd | 0.138s | 0.140s | +1.7% |
| flaggems_layernorm | 0.414s | 0.417s | +0.7% |
| swiglu | 0.170s | 0.171s | +1.0% |
| cross_entropy | 0.160s | 0.161s | +0.8% |
| fused_linear_jsd | 0.208s | 0.210s | +0.6% |
| **Total** | 2.181s | 2.195s | +0.7% |

Iterations: 1 warmup + 20 measured

@mark14wu mark14wu marked this pull request as draft March 2, 2026 05:53
@mark14wu mark14wu force-pushed the feat/lazy-fake-to-real-tensor branch from 2d654b8 to 3245e1b Compare March 2, 2026 05:54
mark14wu added 3 commits March 3, 2026 00:24
Replace the coarse-grained retry strategy (NeedRealTensorsError + full
kernel re-run) with fine-grained lazy materialization: a TensorMaterializer
rebases GPU pointers to on-demand CPU copies only when
IndirectSymbolicExprBase.concretize() is called, avoiding unnecessary
full-tensor copies for kernels with indirect loads.

- Add TensorMaterializer class to patch.py
- Update concretize() to rebase pointers via materializer
- Remove NeedRealTensorsError, reset_for_retry(), retry logic
- Simplify virtual_memory config back to bool
# Conflicts:
#	triton_viz/core/client.py
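
The fine-grained lazy materialization described in this commit message could be sketched roughly as follows. TensorMaterializer is the class name from the commit; the storage representation, method names, and rebasing details here are illustrative guesses, not the PR's actual code:

```python
class TensorMaterializer:
    """Toy sketch: lazily copy GPU storages to CPU and rebase raw
    pointers into the copies. GPU storages are stood in for by
    (base_addr, size, fetch_fn) tuples."""

    def __init__(self, storages):
        # storages registered upfront; nothing is copied yet
        self._storages = storages
        self._cpu_copies = {}  # base_addr -> materialised CPU buffer

    def _find_base(self, addr):
        for base, size, _ in self._storages:
            if base <= addr < base + size:
                return base
        raise RuntimeError(f"unmappable pointer 0x{addr:x}")

    def materialize(self, base):
        # Copy the storage to CPU only on first use (lazy, on demand).
        if base not in self._cpu_copies:
            fetch = next(f for b, _, f in self._storages if b == base)
            self._cpu_copies[base] = fetch()
        return self._cpu_copies[base]

    def rebase(self, addr):
        # Translate a raw GPU address into (cpu_buffer, offset).
        base = self._find_base(addr)
        return self.materialize(base), addr - base
```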
@mark14wu mark14wu marked this pull request as ready for review March 6, 2026 16:48

mark14wu commented Mar 6, 2026

I think tests are still missing on this branch.

@mark14wu mark14wu marked this pull request as draft March 6, 2026 18:21
mark14wu and others added 6 commits March 15, 2026 19:09
rebase_pointers() assumed all pointers came from a single GPU storage,
breaking when a pointer tensor spans multiple storages. It also crashed
on masked-out garbage addresses because concretize() called rebase
before computing the mask.

- Make rebase_pointers() mask-aware with per-storage-group rebasing
- Reorder concretize() to compute mask before calling rebase_pointers
- Add regression tests for TensorMaterializer and Config env parsing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix mask broadcast bug where scalar/lower-dimensional masks were not
expanded to match ptr_data shape, leaving most lanes unrebased. Add
np.broadcast_to before flattening. Fix README documenting wrong default
for SANITIZER_ENABLE_FAKE_TENSOR (0 → 1). Add 6 regression tests for
non-zero offset rebasing and broadcast masks. Rename test classes to
match pytest.ini python_classes = *Test pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
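A minimal illustration of the np.broadcast_to fix described above (the function name and element-selection semantics are illustrative; the real rebase_pointers rebases pointers rather than selecting elements):

```python
import numpy as np

def apply_mask(ptr_data, mask):
    # Broadcast scalar / lower-dimensional masks to ptr_data's shape
    # BEFORE flattening, so every lane gets a mask bit. The bug was that
    # an unexpanded mask left most lanes untouched after ravel().
    mask = np.broadcast_to(np.asarray(mask, dtype=bool), ptr_data.shape)
    return ptr_data.ravel()[mask.ravel()]
```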
…llback

Replace boolean `virtual_memory` config with explicit `TensorMode` enum:
  - FORCE_REAL  (env=0):      always copy tensors to CPU upfront
  - LAZY_AUTO   (unset/auto): fake tensors + lazy materialization with
                               eager fallback on unmappable pointers
  - FORCE_FAKE  (env=1):      fake tensors + lazy materialization,
                               errors on unmappable pointers

Unrecognised env values now emit a warning and default to LAZY_AUTO.
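
The TensorMode parsing described above might look roughly like this. The enum member names and env semantics follow this commit message; the function name is hypothetical, and note that a later commit in this PR flips the default to FORCE_REAL:

```python
import enum
import os
import warnings

class TensorMode(enum.Enum):
    FORCE_REAL = "force_real"  # env=0: copy tensors to CPU upfront
    LAZY_AUTO = "lazy_auto"    # unset/auto: fake + lazy, eager fallback
    FORCE_FAKE = "force_fake"  # env=1: fake + lazy, errors on unmappable

def parse_tensor_mode(env=os.environ):
    value = env.get("SANITIZER_ENABLE_FAKE_TENSOR")
    if value is None or value.lower() == "auto":
        return TensorMode.LAZY_AUTO
    if value == "0":
        return TensorMode.FORCE_REAL
    if value == "1":
        return TensorMode.FORCE_FAKE
    # Unrecognised values warn and fall back to the default.
    warnings.warn(
        f"unrecognised SANITIZER_ENABLE_FAKE_TENSOR={value!r}; "
        "defaulting to LAZY_AUTO"
    )
    return TensorMode.LAZY_AUTO
```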

Fix rebase_pointers mask handling: cast to dtype=bool and use .ravel()
so scalar, broadcastable, and same-shape masks all work correctly per
Triton load/store semantics.

Add _eager_materialise_all() fallback in LAZY_AUTO mode: when _find_base
fails, materialise every registered storage to CPU before retrying,
instead of surfacing a raw RuntimeError.

Extract _rebase_core() helper to avoid duplicating fast/slow path logic
between the normal path and the fallback retry.

Tests:
  - 7 new config tests (0/1/unset/auto/AUTO/unrecognised + warning)
  - scalar False mask, failure-path tests (FORCE_FAKE raises,
    LAZY_AUTO fallback materialises all storages)
  - Update e2e fixture from _isolate_virtual_memory to _isolate_tensor_mode

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Error, e2e tests

Address code-review blockers and high-priority items:

Blocker 1 — Thread safety:
  _cpu_offset() now uses double-checked locking so concurrent workers
  (TRITON_VIZ_NUM_SMS > 1) never materialise the same storage twice.
  _eager_materialise_all() also runs under the lock.
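
The double-checked locking from Blocker 1, as a toy sketch (the class and the placeholder offset computation are illustrative; only the locking pattern mirrors the commit):

```python
import threading

class Materializer:
    """Concurrent workers check the cache without the lock first, and
    only the first to acquire the lock pays the materialisation cost."""

    def __init__(self):
        self._lock = threading.Lock()
        self._offsets = {}
        self.copies_made = 0  # instrumentation for this sketch

    def _cpu_offset(self, storage_id):
        off = self._offsets.get(storage_id)  # first (unlocked) check
        if off is None:
            with self._lock:
                off = self._offsets.get(storage_id)  # second check
                if off is None:
                    # The expensive copy happens exactly once.
                    self.copies_made += 1
                    off = storage_id * 4096  # placeholder materialisation
                    self._offsets[storage_id] = off
        return off
```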

Blocker 2 — Fallback contract:
  Replace catch-all `except RuntimeError` with dedicated
  UnmappablePointerError.  Rename the LAZY_AUTO fallback from
  "eager-real fallback" to "pre-materialise all storages + retry" in
  docs, Config docstring, and README — no real-tensor rebuild or kernel
  re-run happens, so the old wording was misleading.
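
The fallback contract from Blocker 2, sketched with a stub (StubMaterializer and rebase_with_fallback are hypothetical stand-ins; only the UnmappablePointerError name comes from the commit):

```python
class UnmappablePointerError(RuntimeError):
    """Pointer does not fall inside any registered storage."""

class StubMaterializer:
    # Toy stand-in: rebasing succeeds only after materialize_all().
    def __init__(self):
        self.materialized = False
    def materialize_all(self):
        self.materialized = True
    def rebase(self, addr):
        if not self.materialized:
            raise UnmappablePointerError(hex(addr))
        return ("cpu", addr)

def rebase_with_fallback(m, addr, mode):
    try:
        return m.rebase(addr)
    except UnmappablePointerError:
        if mode != "lazy_auto":
            raise  # FORCE_FAKE: surface the dedicated error
        # LAZY_AUTO: pre-materialise all storages, then retry once.
        m.materialize_all()
        return m.rebase(addr)
```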

High — Default flipped to FORCE_REAL:
  Unset / "0" → FORCE_REAL (safe default, same as main).
  "auto" → LAZY_AUTO (opt-in).  "1" → FORCE_FAKE.
  Unrecognised values warn and default to FORCE_REAL.
  This separates "new mechanism lands" from "default behaviour changes".

Medium — Backwards compat:
  config.virtual_memory is now a deprecation shim via __getattr__ /
  __setattr__.  Reads map FORCE_REAL→False, else True.  Writes map
  False→FORCE_REAL, True→LAZY_AUTO.  DeprecationWarning on every access.
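
The deprecation shim described above could be sketched as follows (this Config is a toy: the attribute names virtual_memory and tensor_mode follow the commit, everything else is illustrative):

```python
import warnings

class Config:
    def __init__(self):
        object.__setattr__(self, "tensor_mode", "force_real")

    def __getattr__(self, name):
        # Only called when normal lookup fails, i.e. for virtual_memory.
        if name == "virtual_memory":
            warnings.warn("virtual_memory is deprecated; use tensor_mode",
                          DeprecationWarning, stacklevel=2)
            # Reads map FORCE_REAL -> False, anything else -> True.
            return self.tensor_mode != "force_real"
        raise AttributeError(name)

    def __setattr__(self, name, value):
        if name == "virtual_memory":
            warnings.warn("virtual_memory is deprecated; use tensor_mode",
                          DeprecationWarning, stacklevel=2)
            # Writes map False -> FORCE_REAL, True -> LAZY_AUTO.
            object.__setattr__(self, "tensor_mode",
                               "lazy_auto" if value else "force_real")
        else:
            object.__setattr__(self, name, value)
```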

E2E tests (5 new):
  - LAZY_AUTO indirect load — no false OOB
  - FORCE_FAKE indirect load — no false OOB
  - LAZY_AUTO indirect store — no false OOB
  - LAZY_AUTO + num_sms=2 concurrent — no crash, no false OOB
  - LAZY_AUTO OOB indirect — sanitizer detects and aborts

Unit tests (7 new):
  - virtual_memory deprecation shim (read/write, 5 cases)
  - Thread safety: concurrent _cpu_offset / rebase_pointers (2 cases)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

[BUG] Virtual memory should not be always on
