Development updates 20260406#6
Merged
jduprat merged 36 commits into facebookresearch:main on Apr 6, 2026
Conversation
When a Layout was passed as the shape, iteration triggered Layout.__iter__() which yields coordinates (not shape components), causing size() to return 0 and a ZeroDivisionError.
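A hedged sketch of the described fix, assuming the library's Layout exposes a `.shape` attribute (the helper name `_as_shape` is illustrative, not the actual code): normalize a Layout argument to its shape before iterating, so `size()` multiplies shape components instead of consuming `Layout.__iter__`.

```python
# Sketch: normalize a Layout argument to its shape tuple before
# iterating, so size() never walks Layout.__iter__ (which yields
# coordinates, not shape components).
def _as_shape(shape_or_layout):
    # Anything exposing .shape (e.g. a Layout) is replaced by that
    # shape; plain ints and tuples pass through unchanged.
    return getattr(shape_or_layout, "shape", shape_or_layout)

def size(shape):
    shape = _as_shape(shape)
    if isinstance(shape, tuple):
        total = 1
        for mode in shape:
            total *= size(mode)  # recurse into hierarchical modes
        return total
    return shape
```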
Previously crd2crd used zip_transform for the tuple-tuple case, which dropped src_shape. This meant hierarchical-to-flat coordinate conversions like crd2crd(((0,1), 0), (6,2), ((2,3), 2)) failed with a structure mismatch. Now when src_shape is a tuple it is zipped alongside crd and dst_shape in the recursion.
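An illustrative sketch of what a hierarchical-to-flat conversion must compute. The actual fix zips src_shape through the per-mode recursion; the version below approximates the same result by round-tripping through a flat column-major index (the function bodies are stand-ins, not the library code).

```python
# crd2idx flattens a (possibly nested) coordinate under src_shape;
# idx2crd re-expands the flat index under dst_shape. Column-major:
# the leftmost mode varies fastest.
def size(shape):
    if isinstance(shape, tuple):
        n = 1
        for s in shape:
            n *= size(s)
        return n
    return shape

def crd2idx(crd, shape):
    if isinstance(shape, tuple):
        idx, stride = 0, 1
        for c, s in zip(crd, shape):
            idx += crd2idx(c, s) * stride
            stride *= size(s)
        return idx
    return crd

def idx2crd(idx, shape):
    if isinstance(shape, tuple):
        crd = []
        for s in shape:
            crd.append(idx2crd(idx % size(s), s))
            idx //= size(s)
        return tuple(crd)
    return idx

def crd2crd(crd, src_shape, dst_shape):
    # Equivalent, for valid coordinates, to the zipped recursion.
    return idx2crd(crd2idx(crd, src_shape), dst_shape)
```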
Controls text inside grid cells:
- True (default) — full detail (offset in flat, row/col/offset in nested)
- "offset" — offset number only (cleaner hierarchical views)
- False — no text (colored grid with boundaries only)
- list/tuple — custom labels indexed by offset value
Threaded through draw_layout, show_layout, draw_composite, and the internal _draw_grid and _draw_hierarchical_grid functions.
When label_hierarchy_levels=False, the hierarchical grid used R0/C0 prefixes while the flat grid used plain 0/1/2. Now both modes use plain integers for consistency.
- examples: runs examples/viz.py
- check: alias for test
- Updated .PHONY for all targets
Modes 0,1 now form the 2D grid (rows, columns) and modes 2+ are the panel axes. Previously modes 0..r-3 were panels and the last two modes were the grid, which disagreed with CuTe's layout ordering (cf. Figure 1 in arXiv:2603.02298v1).
When interleave_colors=True, the 8-color rainbow palette is reordered so consecutive indices share hues (blue, lt blue, green, lt green, ...) instead of cycling through all hues first. This matches the coloring convention used in the CuTe paper (arXiv:2603.02298).
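A small sketch of the reordering, assuming a palette laid out as four base hues followed by their light variants (the color names here are placeholders, not the library's actual palette):

```python
# Base hues first, light variants second; interleave so consecutive
# indices share a hue: blue, lt blue, green, lt green, ...
palette = ["blue", "green", "red", "orange",
           "lt blue", "lt green", "lt red", "lt orange"]
half = len(palette) // 2
interleaved = [palette[i // 2 + half * (i % 2)] for i in range(len(palette))]
```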
T[(2, None)] now works the same as T[2, :], matching the convention used by slice_and_offset and draw_slice.
Tensor.__str__ now prints {offset}∘layout matching the CuTe paper
convention (arXiv:2603.02298). draw_slice default title shows the
slice result in the same notation, with single-mode results unwrapped
for clean display.
Tuple keys like ((0, None), None) were incorrectly treated as fixed coordinates. Now detected via _has_nested_none and delegated to slice_and_offset which handles partial hierarchical slicing.
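A hedged sketch of the detection logic: a key counts as a slice spec rather than a fixed coordinate if any leaf of the nested tuple is None. The real check is `_has_nested_none` in tensor.py; this version only mirrors its intent.

```python
# Recursively scan a (possibly nested) index key for None leaves.
def has_nested_none(key):
    if key is None:
        return True
    if isinstance(key, tuple):
        return any(has_nested_none(k) for k in key)
    return False
```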
slice_modes was flattening results across top-level modes via extend, producing (2, 2, 2) instead of (2, (2, 2)) for hierarchical slices. Now each mode's result is kept as a unit, matching CuTe C++ behavior.
Tensor now accepts an optional `data` parameter (any indexable object). When storage is present, `tensor[i, j]` returns data elements instead of raw offsets, `tensor[i, j] = val` writes through coordinates, and draw_layout auto-labels cells with data values. Slicing produces sub-Tensors that share the parent's storage (view semantics).
- data property is read-write with size validation (len >= cosize)
- __eq__ includes element-wise data comparison
- __setitem__ for scalar writes (all coords fixed)
- Auto-label in _build_layout_figure and _build_composite_figure
- Broadened cell_labels isinstance check for array-like types
- New docs/tensor_api.md; layout_api.md points to it
- New examples/tensor.py; updated viz.ipynb with storage section
- Makefile examples target runs all three example scripts
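The view semantics can be illustrated with a minimal stand-in (the real class is Tensor in tensor.py; `MiniTensor` below models only a rank-1, offset-indexed view over a shared backing store):

```python
# A tensor view: indexing goes coordinate -> layout offset -> store.
# Writes land in the shared store, so sibling views observe them.
class MiniTensor:
    def __init__(self, offsets, data):
        self._offsets = offsets  # stand-in for layout: coord -> offset
        self._data = data        # shared backing store

    def __getitem__(self, i):
        return self._data[self._offsets[i]]

    def __setitem__(self, i, val):
        self._data[self._offsets[i]] = val

store = list(range(10, 18))
t = MiniTensor([0, 2, 4, 6], store)  # stride-2 view of the store
t[1] = 99                            # writes through to store[2]
```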
Replace internal _is_contiguous helpers in test oracles with the public is_contiguous() function from layouts.py.
transpose=True renders rank-1 layouts as N×1 columns instead of the default 1×N rows. Useful when a mode slice represents a column dimension. Ignored for rank >= 2.
…ayout

Implement to_F2_matrix(), which converts power-of-2 layouts to their binary matrix representation over GF(2), where offset_bits = M @ coord_bits (mod 2). Swizzles fold naturally into the matrix as XOR is linear over F2. Validate correctness by comparing atom layouts from atoms_nv against known-good basis vectors from Triton's LinearLayoutConversionsTest.cpp:
- SM80 MMAv2 C accumulator (16×8) — MMAv2_16x16, line 433
- SM90 GMMA C accumulator (64×16) — MMAv3_64x16, line 522
- SM90 GMMA C accumulator (64×32) — MMAv3_4x2Warps, line 575
- SM90 warp-level F64/C64 atoms reuse SM80_16x8_Row
- SM80 C layout shared across FP16/FP32/BF16/INT8/FP8/SM120
- SM100 UMMA col-major identity for all standard M×N sizes
- A/B operand self-consistency checks
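The GF(2) idea can be sketched independently of the library: any XOR-linear index-to-offset map is fully determined by its images of the basis indices 1, 2, 4, ..., so the binary matrix M can be recovered column-by-column by probing. Here `offset_of` is a hypothetical offset function, not the library API, and `to_f2_matrix` is an illustrative reconstruction, not the actual to_F2_matrix().

```python
# Build the GF(2) matrix of an XOR-linear map by probing basis
# vectors: column b of M is the bit pattern of offset_of(1 << b).
def to_f2_matrix(offset_of, num_in_bits, num_out_bits):
    cols = []
    for b in range(num_in_bits):
        off = offset_of(1 << b)  # image of basis vector e_b
        cols.append([(off >> r) & 1 for r in range(num_out_bits)])
    # Rows index output bits, columns index input bits.
    return [[cols[c][r] for c in range(num_in_bits)]
            for r in range(num_out_bits)]

def apply_f2(M, x, num_in_bits):
    # offset_bits = M @ coord_bits over GF(2): AND then XOR-reduce.
    bits = [(x >> b) & 1 for b in range(num_in_bits)]
    out = 0
    for r, row in enumerate(M):
        acc = 0
        for c, m in enumerate(row):
            acc ^= m & bits[c]
        out |= acc << r
    return out
```

A stride-4 layout is a pure bit shift, and a Gray-code swizzle `i ^ (i >> 1)` is XOR-linear, so both round-trip exactly through the matrix.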
Previously, __getitem__ with a single int on a rank-2+ tensor fixed
mode 0 and returned a sub-Tensor. This was inconsistent with
__setitem__ (which already did flat evaluation) and with CuTe C++
Tensor::operator()(int) semantics.
Now tensor[i] decomposes i via idx2crd into the natural coordinate and
computes the offset, enabling the canonical copy algorithm:
    def copy(src, dst):
        for i in range(size(dst.layout)):
            dst[i] = src[i]
Mode-0 slicing is still available via tensor[i, :].
These query functions now delegate to obj.layout when given a Tensor (or any object with a .layout attribute), avoiding the need to write size(t.layout) everywhere. Uses duck typing (hasattr) to avoid circular imports between layouts.py and tensor.py.
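A minimal sketch of the delegation pattern, assuming Layout exposes `.shape` and Tensor exposes `.layout` (the body here is illustrative; the real functions live in layouts.py):

```python
# Unwrap any object carrying a .layout attribute before querying, so
# both Layouts and Tensors are accepted without importing tensor.py.
def size(obj):
    if hasattr(obj, "layout"):  # Tensor, or any layout carrier
        obj = obj.layout
    shape = obj.shape if hasattr(obj, "shape") else obj
    if isinstance(shape, tuple):
        n = 1
        for s in shape:
            n *= size(s)
        return n
    return shape
```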
…me=None The show_* functions were redundant with draw_*(filename=None) and caused double figure display in notebooks — they returned an open Figure that Jupyter rendered twice (once from the inline backend, once from the return value). All draw_* functions already handle filename=None correctly via _save_figure, which displays inline and closes the figure. When displaying inline, _save_figure returns None (not the figure) to prevent Jupyter's auto-display of the cell return value. Tests that inspect figure internals now use _build_*_figure() directly. Also removes demo() from viz.py (redundant with examples/).
Returns a new Tensor sharing the backing store with a different layout. Validates that the new layout's cosize fits within the storage length. This is the Python equivalent of CuTe's make_tensor(t.data(), new_layout).
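A minimal sketch of the fit check behind view(): for a rank-1 layout (shape, stride), cosize is the largest reachable offset plus one, and a reinterpreting view is valid only if that fits in the storage. The real cosize() handles hierarchical layouts; `cosize1` below does not.

```python
# Rank-1 cosize: one past the largest offset the layout can produce.
def cosize1(shape, stride):
    return (shape - 1) * stride + 1 if shape > 0 else 0

def can_view(storage_len, shape, stride):
    # Every offset of the new layout must land inside the store.
    return cosize1(shape, stride) <= storage_len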
isinstance(layout, Tensor) fails when the kernel caches a different class object than viz.py imported, which is common with editable installs. This causes Tensor data labels to silently fall back to showing raw offsets instead of data values. Replace with hasattr checks in both _build_composite_figure and _build_layout_figure.
panel_size defaults to None instead of (4, 4). When None, the figure scans all panel layouts and computes compact dimensions: ~0.55 inches per cell plus padding. This eliminates the need to manually tune panel_size for every call — 1-row layouts get short panels, 4×8 layouts get taller ones. Explicit panel_size still overrides.
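The sizing rule can be sketched as follows; the ~0.55 inches per cell comes from the description above, while the padding constant and function name are assumptions for illustration:

```python
# Auto-compute a panel size from the largest panel grid: roughly
# 0.55 in per cell plus fixed padding (padding value is hypothetical).
CELL_IN = 0.55
PAD_IN = 0.5

def auto_panel_size(panel_shapes):
    max_rows = max(r for r, c in panel_shapes)
    max_cols = max(c for r, c in panel_shapes)
    return (max_cols * CELL_IN + PAD_IN, max_rows * CELL_IN + PAD_IN)
```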
Rendering options (cell_labels, colorize, num_colors, color_layout, flatten_hierarchical, etc.) are no longer enumerated as named parameters. They flow through **kwargs as defaults to every panel, overridable per-panel via (Layout, opts_dict) tuples. This is more future-proof: adding a new rendering option no longer requires touching the draw_composite signature. Example: draw_composite([tensor], cell_labels='offset') forces all panels to show offsets even when Tensor data is present.
Renders GEMM operands in the standard matmul pattern:

             B^T (K×N)
    A (M×K)  C (M×N)
B is automatically transposed so K aligns vertically with A's columns
and N aligns horizontally with C's columns. Accepts Layout (offsets)
or Tensor (data values). Supports cell_labels, colorize, and other
rendering options via **kwargs.
Implements §2.6.1 COPY and §2.6.2 GEMM from Cecka (arXiv:2603.02298v1) using pure layout algebra. The examples illustrate the expressiveness of Layouts. Each algorithm variant has three views:
- Offset maps: layout structure via draw_composite / draw_gemm
- Data maps: element values after the operation
- Physical view: flat backing-store contents via Tensor.view()
COPY variants: 1D, transpose, gather, scatter, broadcast, tensor transpose. GEMM variants: NT, TN, GETT (with hierarchy boxes), 1D/2D CONV (im2col). GEMM figures use draw_gemm for the standard matmul spatial arrangement.
Encourage users to report incorrect mappings or confirm correctness via GitHub issues. Added to atoms_nv.py, atoms_amd.py, and atoms_amx.py.
Wave32 matrix multiply atoms for AMD RDNA architectures, added alongside the existing CDNA MFMA atoms in atoms_amd.py:
- RDNA3 (gfx1100): 16x16x16 — FP16, BF16, INT8, INT4
- RDNA4 (gfx1200): 16x16x32 — FP16, BF16, FP8, INT8; 16x16x64 INT4
Thread-value layout: 32 lanes × 8 values for the 16×16 C tile. Each lane owns one column, split into two 8-row groups. A/B inputs are bijections (no broadcast, unlike CDNA MFMA). 14 atoms total with structural invariant tests.
Subgroup-cooperative matrix multiply atoms for Intel Xe architectures:
- Xe-HPC (Ponte Vecchio): 8x8x8, subgroup_size=8 — FP16, BF16, TF32, INT8
- Xe-HPG (Arc): 8x16x8, subgroup_size=16 — FP16, BF16, INT8
Each subgroup lane owns one column of C and B. A is broadcast (stride 0 on thread dimension): all lanes see the same A tile. 7 atoms total with structural invariant tests, including column-ownership verification for C and B.
…2298v1

Walk through the six key application patterns from §3.3–3.5 of the paper, building from simplest (TV layouts) to most general (logical divide):
- §3.3.4 Partitioning: TV layouts for thread-value data distribution
- §3.3.5 Tiling: logical_divide and its variants (zipped, tiled, flat)
- §3.4.2 Vectorization: upcast/downcast and contiguity analysis
- §3.4.4 Admissibility: max_common_vector, compatible, weakly_congruent
- §3.5.1 Logical Product: blocked_product vs raked_product
- §3.5.2 Logical Divide: complement-based construction and duality
Each section includes CUDA context, step-by-step algebra with explain(), and inline visualizations via draw_tv_layout, draw_layout, draw_composite.
The panel size auto-computation was using the raw layout shape to determine grid dimensions. For TV layouts where grid_rows/grid_cols override the rendered grid shape (e.g., a (32,2)-shaped TV layout rendered on an 8×8 grid), this produced vastly oversized figures with large whitespace gaps. Now checks per-panel and default grid overrides before falling back to the layout shape.
Extend the algorithms notebook with two new topics:
- Grouped GEMM: uniform-size batched GEMM as GETT (fold G into M), with discussion of why fully-independent and variable-size grouped GEMM fall outside the layout algebra paradigm.
- REDUCE: accumulation along a mode, framed as a simplification of GEMM (N=1, B=1). Three worked examples (row sum, column sum, max) demonstrate the same "same function, different layouts" pattern.
Two new sections showing how GEMM and REDUCE compose into higher-level patterns:
- Epilogue Fusion: GEMM + bias (broadcast via stride-0 layout) + ReLU activation, demonstrating that bias add is the COPY broadcast pattern applied additively.
- Online Softmax: two-pass vs one-pass algorithms decomposed into REDUCE(max) + element-wise(exp) + REDUCE(sum) + element-wise(div). Includes per-row softmax on a matrix and its connection to Flash Attention.
Python renders 1-element tuples as ((4, 2),) with a trailing comma. Layout.__str__ used repr() which preserved this, producing ugly display like ((4, 2),) : ((1, 4),) in CuTe notation. Add _fmt_shape() helper that recursively formats without trailing commas. __repr__ keeps the comma for eval-safety. Also fix viz.py _build_layout_figure to use str(tensor) instead of repr(tensor) for default Tensor titles.
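A sketch of the recursive formatter (the real helper is _fmt_shape in layouts.py; this version mirrors the idea of joining modes without Python's 1-tuple trailing comma):

```python
# Format a nested shape tuple in CuTe notation, recursing into modes
# and never emitting the trailing comma repr() adds for 1-tuples.
def fmt_shape(s):
    if isinstance(s, tuple):
        return "(" + ", ".join(fmt_shape(m) for m in s) + ")"
    return str(s)
```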
Thread a `precision` kwarg through draw_layout, draw_composite, and draw_gemm so callers can control significant digits when displaying float data (e.g. softmax probabilities). Without it, str() renders full float precision which makes figures unreadable.
- Move REDUCE (rank-2) before GEMM (rank-3) for bottom-up progression - Reframe REDUCE as standalone; GEMM now "extends REDUCE" - Add rank-N annotations: COPY rank-1, REDUCE rank-2, GEMM rank-3 - Add clickable section links in intro with HTML anchors - Convert all numerical tensor data to float literals - Use precision=3 on softmax figures for readable labels
- Polish TV layout and tiling sections in applications.ipynb - Add intile/oftile coordinate diagram to generate_figures.py - Add generated intile_oftile.png figure
[New] Tensor class now supports storage-backed tensors with coordinate indexing (tensor[i, j]), write-through, view semantics on slicing, None as free-dimension marker, Tensor.view(layout) for same-storage reinterpretation, and __str__ with offset notation
[New] to_F2_matrix() — convert power-of-2 layouts to binary matrix representation over GF(2); validated against known-good basis vectors from Triton's LinearLayoutConversionsTest.cpp
[New] Intel Xe GPU DPAS atom definitions
[New] AMD RDNA3/RDNA4 WMMA atom definitions
[New] draw_gemm() for matmul spatial arrangement of A, B, C operand panels
[New] is_contiguous() as an alias for is_bijective()
[New] cell_labels parameter for draw_layout — user-supplied per-cell text
[New] interleave_colors option for hue-grouped palette
[New] transpose option in draw_layout for rank-1 column vectors
[New] precision parameter for float cell labels in viz
[New] examples/algorithms.ipynb — COPY, GEMM, Grouped GEMM, REDUCE, Epilogue Fusion, and Online Softmax visualized with layout algebra
[New] examples/applications.ipynb — six layout algebra patterns from arXiv:2603.02298v1
[Fix] idx2crd / crd2flat to accept Layout objects as shape argument
[Fix] crd2crd to thread src_shape through per-mode recursion
[Fix] Rank≥3 panel splitting to match CuTe convention
[Fix] Tensor slicing for hierarchical specs with nested Nones
[Fix] slice_modes to preserve hierarchical mode boundaries
[Fix] draw_composite auto-sizing to respect grid_rows/grid_cols overrides
[Fix] Trailing comma in Layout.__str__ for 1-tuple shapes
[Cleanup] Remove show_* viz functions — draw_*(filename=None) handles inline display, fixing double-render in Jupyter
[Cleanup] Auto-compute draw_composite panel_size from layout dimensions
[Cleanup] Pass draw_composite rendering options through **kwargs
[Cleanup] Make size(), rank(), cosize(), depth(), mode(), flatten(), image() accept Tensors transparently
[Cleanup] Duck-type Tensor detection in viz instead of isinstance()
[Cleanup] Use plain integers for axis labels in hierarchical mode
[Docs] Community feedback notice added to all atom definition files
[Docs] Add examples and check targets to Makefile