Development updates 20260409 #8
Merged
jduprat merged 24 commits into facebookresearch:main on Apr 9, 2026
The SW128 swizzle visualization (8×128) was too wide when drawn
side-by-side. Add arrangement parameter to draw_swizzle and
_build_swizzle_figure ("horizontal" default, "vertical" stacks
panels). Use vertical arrangement in the notebook for this cell.
- Add parameterized _generate_im2col() to docs/generate_figures.py (accepts H, W, R, S; defaults to 4×4 input with 2×2 filter)
- Replace ASCII art in algorithms.ipynb cell 65 with generated figure
- Restructure 2D CONV section: explain general N-D mapping first (defining K, C, T, R, S, N, Z, P, Q), then specialize to 2D
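As a rough illustration of the mapping such a figure depicts, the sketch below builds the im2col patch matrix for the default 4×4 input with a 2×2 filter. It assumes unit stride, no padding, and a single channel; `im2col_sketch` is a hypothetical helper written for this note, not the `_generate_im2col()` function itself.

```python
def im2col_sketch(image, R, S):
    """Return one row per output position (p, q); each row lists the R*S
    input values under the filter window. Assumes stride 1 and no padding."""
    H, W = len(image), len(image[0])
    P, Q = H - R + 1, W - S + 1  # output extent in each dimension
    rows = []
    for p in range(P):
        for q in range(Q):
            rows.append([image[p + r][q + s] for r in range(R) for s in range(S)])
    return rows


# 4x4 input whose entries are their own row-major index.
image = [[h * 4 + w for w in range(4)] for h in range(4)]
patches = im2col_sketch(image, R=2, S=2)

print(len(patches), len(patches[0]))  # 9 4
print(patches[0])                     # [0, 1, 4, 5]
print(patches[-1])                    # [10, 11, 14, 15]
```

Each of the P·Q = 9 rows becomes one row of the GEMM operand, and the R·S = 4 columns line up with the flattened filter.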
_logical_divide_by_shape tried to compare Layout objects with <= against integers, causing TypeError when the tiler tuple contained Layout elements like (Layout(4,1), Layout(8,2)). The fix detects Layout elements and dispatches them through the compose/complement path per mode, matching CuTe C++ which treats tiler elements as Layouts (layout.hpp:1562). Found via arXiv:2603.02298v1 §3.5.2 examples.
When (remaining_shape-1)*remaining_stride < curr_shape, all of B fits within the current mode and higher modes are unreachable. The old code raised a divisibility error instead of absorbing B into this mode. This matches CuTe C++ layout.hpp:1077 which allows rest_stride < curr_shape as a valid composition case. The paper (§3.3.3) calls these "apparent violations" resolved by truncation. Example: compose((4,2,8):(3,12,97), 3:3) now correctly returns 3:9 instead of raising ValueError. Found via arXiv:2603.02298v1 §3.3.3 examples, verified against CuTe C++ composition_impl.
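The quoted result can be checked pointwise without the library. The sketch below evaluates A(B(i)) with a small hand-written `evaluate` helper (hypothetical, flat layouts only, colexicographic index decomposition) and compares against 3:9.

```python
def evaluate(shape, stride, index):
    """Map a linear index to an offset for a flat layout shape:stride."""
    offset = 0
    for extent, step in zip(shape, stride):
        offset += (index % extent) * step
        index //= extent
    return offset


A = ((4, 2, 8), (3, 12, 97))
B = ((3,), (3,))

# Pointwise composition: feed B's offsets into A as linear indices.
composed = [evaluate(*A, evaluate(*B, i)) for i in range(3)]
expected = [evaluate((3,), (9,), i) for i in range(3)]

print(composed)  # [0, 9, 18]
print(expected)  # [0, 9, 18]
```

B only ever produces the indices 0, 3, and 6, so the third mode of A (stride 97) is never reached.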
Rewrote left_inverse to match the CuTe C++ algorithm (layout.hpp:1324) instead of using right_inverse(Layout(L, complement(L))), which produced wrong results when complement coalesced away stride information. The C++ algorithm:
1. Coalesce, extract shapes/strides
2. Compute prefix products of shapes
3. Sort modes by stride (ascending)
4. Build inverse: new_shape = stride / result_size_so_far, new_stride = prefix_product[original_mode_index]
5. Append last sorted mode's shape, coalesce
Example: left_inverse((4,8):(1,5)) now correctly returns (5,8):(1,4), matching Table 6 of arXiv:2603.02298v1, instead of the incorrect 4:1. Note: pycute also returns 4:1 for this case (same bug). Our implementation now matches the C++ ground truth and the paper.
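The five steps can be sketched for flat layouts as follows. This is one reading of the steps, not the repository's code: it assumes positive strides that divide cleanly, and it seeds the result strides with a leading 0 so that each computed shape pairs with the previous mode's prefix product (an assumption made here to get shapes and strides to line up). All helper names are hypothetical.

```python
from math import prod


def evaluate(shape, stride, index):
    """Map a linear index to an offset for a flat layout shape:stride."""
    offset = 0
    for extent, step in zip(shape, stride):
        offset += (index % extent) * step
        index //= extent
    return offset


def coalesce(shape, stride):
    """Drop extent-1 modes and merge modes that are contiguous with the previous one."""
    out_shape, out_stride = [], []
    for extent, step in zip(shape, stride):
        if extent == 1:
            continue
        if out_shape and out_shape[-1] * out_stride[-1] == step:
            out_shape[-1] *= extent
        else:
            out_shape.append(extent)
            out_stride.append(step)
    if not out_shape:
        return (1,), (0,)
    return tuple(out_shape), tuple(out_stride)


def left_inverse_sketch(shape, stride):
    shape, stride = coalesce(shape, stride)                       # step 1
    prefix = [1]
    for extent in shape[:-1]:                                     # step 2
        prefix.append(prefix[-1] * extent)
    order = sorted(range(len(shape)), key=lambda m: stride[m])    # step 3
    result_shape, result_stride = [], [0]                         # leading 0: assumption
    for mode in order:                                            # step 4
        result_shape.append(stride[mode] // prod(result_shape))
        result_stride.append(prefix[mode])
    result_shape.append(shape[order[-1]])                         # step 5
    return coalesce(result_shape, result_stride)


L = ((4, 8), (1, 5))
inv = left_inverse_sketch(*L)
print(inv)  # ((5, 8), (1, 4))

# Left-inverse property: inv(L(x)) == x for every x in L's domain.
round_trip = [evaluate(*inv, evaluate(*L, x)) for x in range(32)]
print(round_trip == list(range(32)))  # True
```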
56 tests derived from concrete examples in the CuTe paper:
- Figures 1-3, 5, 10: layout construction, folding, slicing
- Tables 2, 4-7: COPY, partitioning, inverses, complement
- §3.1-3.5.2: concatenation, coalesce, composition, inverse,
complement, logical product, logical divide, zipped divide
Tests cite specific figure/table/equation numbers. Run with --draw
to generate corresponding paper figures as SVGs:
pytest tests/paper_examples.py --draw
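For readers unfamiliar with custom pytest flags, an option like --draw is typically registered in a conftest.py through the standard `pytest_addoption` hook. The sketch below shows that pattern only; the help text and the `draw_enabled` helper are hypothetical and not taken from this repository.

```python
# conftest.py (sketch)

def pytest_addoption(parser):
    """Standard pytest hook: register a boolean --draw command-line flag."""
    parser.addoption(
        "--draw",
        action="store_true",
        default=False,
        help="write SVG figures for the paper examples",  # hypothetical wording
    )


def draw_enabled(config):
    """Hypothetical helper: a test or fixture would pass request.config here."""
    return bool(config.getoption("--draw"))
```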
The old check `len(data) >= cosize(layout)` only works for zero-offset, nonnegative-stride layouts. It silently accepted Tensors that would read/write out of bounds (e.g. offset=10 with a 4-element buffer, or negative strides without a compensating offset). Replace with _address_bounds() which computes the actual min/max storage indices from (offset, layout, swizzle) and validates that the storage covers the full range.
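A minimal sketch of the idea, assuming a flat layout and no swizzle: sum the negative spans into the minimum and the positive spans into the maximum, then require both to land inside the buffer. `address_bounds_sketch` and `fits` are hypothetical names, not the `_address_bounds()` implementation.

```python
def address_bounds_sketch(offset, shape, stride):
    """Smallest and largest storage index a flat layout touches (swizzle ignored)."""
    lo = hi = offset
    for extent, step in zip(shape, stride):
        span = (extent - 1) * step
        if span < 0:
            lo += span   # negative strides extend the range downward
        else:
            hi += span   # positive strides extend it upward
    return lo, hi


def fits(buffer_len, offset, shape, stride):
    lo, hi = address_bounds_sketch(offset, shape, stride)
    return lo >= 0 and hi < buffer_len


print(fits(4, 0, (4,), (1,)))    # True:  indices 0..3
print(fits(4, 10, (4,), (1,)))   # False: indices 10..13
print(fits(4, 0, (4,), (-1,)))   # False: indices -3..0
print(fits(4, 3, (4,), (-1,)))   # True:  indices 0..3, read in reverse
```

The second and third cases are the ones a cosize-only comparison lets through, since cosize is 4 in every row.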
cosize() now uses abs(stride) so negative-stride layouts report the correct memory span. _composition_1d carries the stride sign separately, matching CuTe's signed composition rules. Tensor.view() preserves the parent's base offset, enabling reverse-order views like Layout(4, -1).
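To make the span arithmetic concrete, here is a small sketch for flat layouts. `cosize_abs` is a hypothetical stand-in for the updated cosize(), and the reverse view is emulated with plain list indexing rather than the Tensor class.

```python
def cosize_abs(shape, stride):
    """Memory span of a flat layout, using stride magnitudes."""
    return 1 + sum((extent - 1) * abs(step) for extent, step in zip(shape, stride))


def cosize_signed(shape, stride):
    """The same sum with signed strides, shown only for contrast."""
    return 1 + sum((extent - 1) * step for extent, step in zip(shape, stride))


print(cosize_abs((4,), (-1,)))     # 4
print(cosize_signed((4,), (-1,)))  # -2

# A reverse-order view: base offset 3 combined with the layout 4:-1.
data = [10, 20, 30, 40]
base = 3
reversed_view = [data[base + i * -1] for i in range(4)]
print(reversed_view)  # [40, 30, 20, 10]
```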
Coalescing and segment analyses now rebase accessed offsets to the group's minimum, so reversed dense layouts analyze identically to their forward equivalents. Permutation analysis (cycles, order) rebases dense shifted images to [0, n) before decomposing. to_F2_matrix rejects negative strides (affine, not F2-linear) and fixes swizzle matrix construction for negative shift values.
Layouts with negative strides produce negative output indices which broke cell-label lookup (Python wraparound), TV grid sizing (based on cosize which is always positive), and TV mapping (assumed image starts at 0). Fix by rebasing: _tv_output_bounds() finds the actual min/max offsets, _compute_tv_mapping() shifts all indices so the minimum maps to cell 0, and _lookup_cell_label() guards against negative-index wraparound on user-provided label lists.
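The two ingredients, rebasing and a guarded label lookup, can be sketched in a few lines. The function names below are hypothetical simplifications, not the private helpers named above.

```python
def rebase_offsets(offsets):
    """Shift offsets so the smallest one maps to cell 0."""
    lowest = min(offsets)
    return [offset - lowest for offset in offsets]


def lookup_label(labels, index):
    """Return labels[index] only for in-range, nonnegative indices."""
    if 0 <= index < len(labels):
        return labels[index]
    return None


offsets = [i * -1 for i in range(4)]   # image of the layout 4:-1
print(offsets)                  # [0, -1, -2, -3]
print(rebase_offsets(offsets))  # [3, 2, 1, 0]

labels = ["a", "b", "c", "d"]
print(labels[-1])                # d  (Python wraps to the last element)
print(lookup_label(labels, -1))  # None
```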
Previously t[:, 3] = 99 silently computed an offset from the slice object, writing to an arbitrary location. Now __setitem__ checks for slice/None markers anywhere in the key (including inside hierarchical tuples) and raises TypeError with guidance to write through a sliced sub-Tensor instead.
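The recursive check described here could look roughly like the following. Both function names and the error text are hypothetical; only the behavior (reject a slice or None at any nesting depth with a TypeError) comes from the description above.

```python
def has_free_coordinate(key):
    """True if a slice or None appears anywhere in a (possibly nested) key."""
    if key is None or isinstance(key, slice):
        return True
    if isinstance(key, tuple):
        return any(has_free_coordinate(part) for part in key)
    return False


def check_setitem_key(key):
    """Raise before any offset is computed from a free coordinate."""
    if has_free_coordinate(key):
        raise TypeError(
            "cannot assign through a slice or None; "
            "index a sliced sub-Tensor instead"
        )
    return key


print(has_free_coordinate((slice(None), 3)))  # True   (the t[:, 3] case)
print(has_free_coordinate(((0, None), 1)))    # True   (nested inside a tuple)
print(has_free_coordinate(((0, 1), 2)))       # False  (fully specified)
```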
Nested tuple tilers like ((2, 3), 4) were flattened into Layout((2,3), 1) by the is_pure_shape path, collapsing mode structure that should recurse. Now _normalize_compose_tiler_element preserves tuple nesting so each level composes against its corresponding mode, matching CuTe semantics.
zipped_divide, tiled_divide, and flat_divide were reducing Layout tilers to their shape, silently discarding non-unit strides (e.g. Layout(4, 2) became just 4). Now Layout tilers take CuTe's tile_unzip terminal path, preserving their stride structure. tiled_divide and flat_divide are rebuilt on top of zipped_divide.
The complement bound for Layout tilers should be size(A) * cosize(B), not size(A) * size(B). With non-unit strides (e.g. Layout(3, 2), cosize=5) the old formula underestimated the codomain and produced wrong complement layouts in the explanation output.
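A quick arithmetic check of the two bounds for the example tiler, assuming a hypothetical A of size 4 (the size of A is not given above and is chosen only for illustration):

```python
from math import prod


def size(shape):
    return prod(shape)


def cosize(shape, stride):
    """Span of a flat layout: one past the largest offset it produces."""
    return 1 + sum((extent - 1) * abs(step) for extent, step in zip(shape, stride))


size_A = 4                      # hypothetical size(A)
B_shape, B_stride = (3,), (2,)  # Layout(3, 2): offsets 0, 2, 4

old_bound = size_A * size(B_shape)
new_bound = size_A * cosize(B_shape, B_stride)

print(cosize(B_shape, B_stride))  # 5
print(old_bound, new_bound)       # 12 20
```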
explain(compose, A, (2, 4)) called B(i) on a plain tuple, raising TypeError. Now detects non-Layout tilers and shows the mode-by-mode decomposition instead of the pointwise A(B(i)) trace.
Update pyproject.toml: fix stale layout_algebra paths to tensor_layouts, add analysis.py to wildcard-import exceptions, exclude notebooks from the lint surface. Fix all reported warnings: f-strings with no interpolation, unused imports and variables, ambiguous variable name 'l'. Add Ruff instructions to README.
Two tests shared the name test_draw_swizzle_delegates_to_shared_builder, so pytest silently ran only the second. Rename the second to test_draw_swizzle_saves_figure_from_shared_builder and update the first to properly mock _save_figure through the full delegation path.
CuTe C++ sets stride=0 on any extent-1 tile or rest mode produced by logical_divide (e.g. logical_divide(4:3, 4) = (4,1):(3,0)). Our implementation kept the original stride, producing (4,1):(3,3) which is functionally equivalent but breaks exact-match tests against CuTe.
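The two forms can be shown to agree pointwise while differing as strings. The sketch below uses hypothetical helpers for flat layouts; `canonicalize_unit_modes` illustrates the zero-stride convention and is not the repository's code.

```python
from math import prod


def evaluate(shape, stride, index):
    """Map a linear index to an offset for a flat layout shape:stride."""
    offset = 0
    for extent, step in zip(shape, stride):
        offset += (index % extent) * step
        index //= extent
    return offset


def offsets(shape, stride):
    return [evaluate(shape, stride, i) for i in range(prod(shape))]


def canonicalize_unit_modes(shape, stride):
    """Set the stride of every extent-1 mode to 0; its coordinate is always 0."""
    return shape, tuple(0 if extent == 1 else step for extent, step in zip(shape, stride))


kept = ((4, 1), (3, 3))
canonical = canonicalize_unit_modes(*kept)

print(canonical)                          # ((4, 1), (3, 0))
print(offsets(*kept))                     # [0, 3, 6, 9]
print(offsets(*kept) == offsets(*canonical))  # True
```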
Compiles a small C++ program against CUTLASS/CuTe headers and compares exact layout strings for the cases fixed in recent commits: nested tuple tiler composition, unit-mode stride canonicalization, and oversize tiler division. Skipped automatically when CUTLASS headers are not installed.
When slice_and_offset builds the sublayout for a partial hierarchical slice, it now propagates the parent layout's swizzle to the result. Previously the swizzle was silently dropped, causing incorrect address computations for slices of swizzled layouts. Add tests verifying swizzle preservation for both the low-level slice_and_offset function and Tensor.__getitem__ hierarchical slicing.
Layout.__call__(None) — a bare None rather than a tuple containing None — now returns the layout unchanged, matching CuTe's slice(_, layout) identity operation. Previously bare-None fell through to has_none(), which does not handle a non-tuple None correctly. Add C++ oracle tests verifying slice(_, layout) for both rank-2 and scalar layouts, and add external compatibility tests for regular, scalar, and swizzled layouts.
Tensor.__getitem__ now recognizes a bare slice(None) (i.e. tensor[:]) and returns a new Tensor view with the same layout, offset, and data. This matches the behavior of the explicit tensor[:, :] full slice and provides the natural Python idiom for creating a view of the whole tensor. Previously tensor[:] fell through to _slice_single, which does not handle slice(None) correctly for this purpose. Update docs/tensor_api.md with the tensor[:] entry and add tests for both regular and swizzled tensor full slices.
Add a "Shape Factorization" section to docs/layout_api.md documenting shape_div and shape_mod, including the intentional policy difference from dynamic CuTe C++: this Python implementation requires exact scalar divisibility (b|a or a|b), raising ValueError for pairs like shape_div(6, 4) where CuTe C++ would return ceil_div(6, 4) = 2. Update the shape_div docstring with the same strict-policy explanation and add a ValueError example to the docstring examples.
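A sketch of the strict scalar policy, for illustration only. `shape_div_strict` is a hypothetical stand-in for the library's shape_div; returning 1 when a divides b is an assumption based on CuTe's convention and is not stated in the text above.

```python
def shape_div_strict(a, b):
    """Scalar shape division that insists on exact divisibility."""
    if a % b == 0:
        return a // b
    if b % a == 0:
        return 1  # assumption: a is fully consumed by b
    raise ValueError(f"shape_div({a}, {b}): neither value divides the other")


def ceil_div(a, b):
    return -(-a // b)


print(shape_div_strict(8, 4))  # 2
print(shape_div_strict(2, 8))  # 1
print(ceil_div(6, 4))          # 2  (the dynamic CuTe C++ result cited above)

try:
    shape_div_strict(6, 4)
except ValueError as error:
    print("ValueError:", error)
```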
image, is_injective, is_surjective, is_bijective, is_contiguous, and functionally_equal are O(size) exhaustive enumerations: analysis-tier operations, not core algebra. Moving them to analysis.py keeps the cost model clear: the core layouts module stays efficient, and exhaustive introspection is explicitly opt-in via the analysis module. Updated all imports, tests, examples, notebooks, and documentation.
[New] Negative stride support across Layout, Tensor, analysis, and visualization: cosize() and compose() handle negative strides via magnitude decomposition; Tensor.view() preserves base offset; analysis functions rebase the addressed footprint to a local origin; TV visualization rebases negative offsets and avoids Python negative-index wraparound in explicit cell labels
[New] CuTe C++ oracle test suite: compiles regression cases directly against installed CUTLASS headers for compose(), logical_divide(), zipped_divide(), tiled_divide(), flat_divide(), left_inverse(), and logical_product(); gracefully skips when CUTLASS or a C++ compiler is unavailable
[New] tensor[:] whole-view full slice on Tensor, matching the explicit tensor[:, :] behavior
[New] Vertical arrangement in draw_swizzle for wide layouts: before/after grids stack top-to-bottom when columns exceed a threshold
[New] Paper examples test suite (arXiv:2603.02298v1) with --draw pytest option for visual output
[Fix] left_inverse for non-contiguous (padded) layouts
[Fix] compose() to truncate unreachable modes before the divisibility check
[Fix] logical_divide() to support Layout tilers embedded in by-mode tuples
[Fix] compose() and logical_divide() for nested tuple tilers
[Fix] zipped_divide(), tiled_divide(), flat_divide() to preserve Layout tiler strides instead of silently degrading to shape tilers
[Fix] Canonicalize stride to 0 for unit-extent modes in logical_divide()
[Fix] Layout.__call__(None) as full-slice identity
[Fix] Preserve swizzle attribute in slice_and_offset() sublayout results
[Fix] explain(compose) crash on tuple tilers
[Fix] explain(logical_product) to use cosize(B) for complement bound
[Fix] Duplicate test name shadowing draw_swizzle coverage: one of two identically-named tests was silently never running
[Robustness] Reject free coordinates (slices, None) in Tensor.__setitem__ with a clear TypeError guiding users to the slice-then-index pattern
[Cleanup] Move exhaustive introspection helpers (image(), is_injective(), is_surjective(), is_bijective(), is_contiguous(), functionally_equal()) from layouts.py to analysis.py: keeps the core module efficient, O(size) enumeration is opt-in
[Cleanup] Configure Ruff with correct src/tensor_layouts/ paths, fix lint warnings across the codebase
[Docs] im2col() figure and CONV→GEMM mapping clarification in applications notebook
[Docs] Document shape_div() strict scalar divisibility policy