
Development updates 20260409#8

Merged
jduprat merged 24 commits into facebookresearch:main from jduprat:dev
Apr 9, 2026

@jduprat jduprat commented Apr 9, 2026

[New] Negative stride support across Layout, Tensor, analysis, and visualization — cosize() and compose() handle negative strides via magnitude decomposition; Tensor.view() preserves base offset; analysis functions rebase the addressed footprint to a local origin; TV visualization rebases negative offsets and avoids Python negative-index wraparound in explicit cell labels
[New] CuTe C++ oracle test suite — compiles regression cases directly against installed CUTLASS headers for compose(), logical_divide(), zipped_divide(), tiled_divide(), flat_divide(), left_inverse(), and logical_product(); gracefully skips when CUTLASS or a C++ compiler is unavailable
[New] tensor[:] whole-view full slice on Tensor, matching the explicit tensor[:, :] behavior
[New] Vertical arrangement in draw_swizzle for wide layouts — before/after grids stack top-to-bottom when columns exceed a threshold
[New] Paper examples test suite (arXiv:2603.02298v1) with --draw pytest option for visual output
[Fix] left_inverse for non-contiguous (padded) layouts
[Fix] compose() to truncate unreachable modes before the divisibility check
[Fix] logical_divide() to support Layout tilers embedded in by-mode tuples
[Fix] compose() and logical_divide() for nested tuple tilers
[Fix] zipped_divide(), tiled_divide(), flat_divide() to preserve Layout tiler strides instead of silently degrading to shape tilers
[Fix] Canonicalize stride to 0 for unit-extent modes in logical_divide()
[Fix] Layout.__call__(None) as full-slice identity
[Fix] Preserve swizzle attribute in slice_and_offset() sublayout results
[Fix] explain(compose) crash on tuple tilers
[Fix] explain(logical_product) to use cosize(B) for complement bound
[Fix] Duplicate test name shadowing draw_swizzle coverage — one of two identically-named tests was silently never running
[Robustness] Reject free coordinates (slices, None) in Tensor.__setitem__ with a clear TypeError guiding users to the slice-then-index pattern
[Cleanup] Move exhaustive introspection helpers (image(), is_injective(), is_surjective(), is_bijective(), is_contiguous(), functionally_equal()) from layouts.py to analysis.py: the core module stays efficient and O(size) enumeration becomes opt-in
[Cleanup] Configure Ruff with correct src/tensor_layouts/ paths, fix lint warnings across the codebase
[Docs] im2col() figure and CONV→GEMM mapping clarification in applications notebook
[Docs] Document shape_div() strict scalar divisibility policy

jduprat added 24 commits April 8, 2026 16:20
The SW128 swizzle visualization (8×128) was too wide when drawn
side-by-side. Add arrangement parameter to draw_swizzle and
_build_swizzle_figure ("horizontal" default, "vertical" stacks
panels). Use vertical arrangement in the notebook for this cell.
- Add parameterized _generate_im2col() to docs/generate_figures.py
  (accepts H, W, R, S; defaults to 4×4 input with 2×2 filter)
- Replace ASCII art in algorithms.ipynb cell 65 with generated figure
- Restructure 2D CONV section: explain general N-D mapping first
  (defining K, C, T, R, S, N, Z, P, Q), then specialize to 2D
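As a rough illustration of the CONV→GEMM mapping described above, the sketch below unrolls a single-channel 2D input into an im2col matrix in plain Python. It assumes unit stride, no padding, and no dilation. The name `im2col_2d` and its signature are made up for this example; it is not the `_generate_im2col()` helper from docs/generate_figures.py.

```python
def im2col_2d(image, R, S):
    """Unroll every R x S window of a 2D list into one row (unit stride, no padding)."""
    H, W = len(image), len(image[0])
    P, Q = H - R + 1, W - S + 1  # output height and width
    rows = []
    for p in range(P):
        for q in range(Q):
            # One GEMM row per output pixel, one column per filter tap.
            rows.append([image[p + r][q + s] for r in range(R) for s in range(S)])
    return rows


# 4x4 input holding 0..15 and a 2x2 filter, the default sizes mentioned above.
image = [[4 * h + w for w in range(4)] for h in range(4)]
columns = im2col_2d(image, R=2, S=2)

# The convolution becomes a matrix-vector product against the flattened filter.
flat_filter = [1, 0, 0, 1]
output = [sum(a * b for a, b in zip(row, flat_filter)) for row in columns]
```

Each of the P×Q = 9 rows holds the R×S = 4 input values under one filter position, so the whole convolution reduces to one matrix product.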
_logical_divide_by_shape tried to compare Layout objects with <=
against integers, causing TypeError when the tiler tuple contained
Layout elements like (Layout(4,1), Layout(8,2)).

The fix detects Layout elements and dispatches them through the
compose/complement path per mode, matching CuTe C++ which treats
tiler elements as Layouts (layout.hpp:1562).

Found via arXiv:2603.02298v1 §3.5.2 examples.
When (remaining_shape-1)*remaining_stride < curr_shape, all of B fits
within the current mode and higher modes are unreachable. The old code
raised a divisibility error instead of absorbing B into this mode.

This matches CuTe C++ layout.hpp:1077 which allows rest_stride <
curr_shape as a valid composition case. The paper (§3.3.3) calls these
"apparent violations" resolved by truncation.

Example: compose((4,2,8):(3,12,97), 3:3) now correctly returns 3:9
instead of raising ValueError.

Found via arXiv:2603.02298v1 §3.3.3 examples, verified against CuTe
C++ composition_impl.
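The example result can be checked pointwise without the library. The sketch below uses a hypothetical `layout_offset` helper (not part of tensor_layouts) that evaluates a flat layout with the first mode varying fastest, then compares A(B(i)) against the claimed 3:9.

```python
def layout_offset(shape, stride, index):
    """Map a 1-D index to an offset for a flat layout (first mode fastest)."""
    offset = 0
    for extent, step in zip(shape, stride):
        offset += (index % extent) * step
        index //= extent
    return offset


A = ((4, 2, 8), (3, 12, 97))
B = ((3,), (3,))
claimed = ((3,), (9,))

# B only reaches indices 0, 3, 6 of A, so the third mode (stride 97) is never touched.
composed_offsets = [layout_offset(*A, layout_offset(*B, i)) for i in range(3)]
claimed_offsets = [layout_offset(*claimed, i) for i in range(3)]
```

Both lists come out as 0, 9, 18, which is why the unreachable mode can be truncated before the divisibility check.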
Rewrote left_inverse to match the CuTe C++ algorithm (layout.hpp:1324)
instead of using right_inverse(Layout(L, complement(L))) which produced
wrong results when complement coalesced away stride information.

The C++ algorithm:
1. Coalesce, extract shapes/strides
2. Compute prefix products of shapes
3. Sort modes by stride (ascending)
4. Build inverse: new_shape = stride / result_size_so_far,
   new_stride = prefix_product[original_mode_index]
5. Append last sorted mode's shape, coalesce

Example: left_inverse((4,8):(1,5)) now correctly returns (5,8):(1,4)
matching Table 6 of arXiv:2603.02298v1, instead of the incorrect 4:1.

Note: pycute also returns 4:1 for this case (same bug). Our
implementation now matches the C++ ground truth and the paper.
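A quick way to see why 4:1 is wrong is to test the defining property of a left inverse, R(L(i)) = i for every i in the domain of L. The sketch below does that with a hypothetical flat-layout evaluator; it is not the rewritten left_inverse, only a check of its output. Note that this evaluator wraps indices past the end of the last mode, which is an assumption of the sketch.

```python
def layout_offset(shape, stride, index):
    """Map a 1-D index to an offset for a flat layout (first mode fastest)."""
    offset = 0
    for extent, step in zip(shape, stride):
        offset += (index % extent) * step
        index //= extent
    return offset


L = ((4, 8), (1, 5))       # padded layout: 4 elements per column, pitch 5
fixed = ((5, 8), (1, 4))   # result reported after the rewrite
old = ((4,), (1,))         # result reported before the rewrite

domain = list(range(4 * 8))
round_trip_fixed = [layout_offset(*fixed, layout_offset(*L, i)) for i in domain]
round_trip_old = [layout_offset(*old, layout_offset(*L, i)) for i in domain]

fixed_is_left_inverse = round_trip_fixed == domain
old_is_left_inverse = round_trip_old == domain
```

The fixed result recovers every index because an offset a + 5b with a < 4 splits cleanly into a and b under shape (5, 8); the old result already fails at index 4, whose offset is 5.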
56 tests derived from concrete examples in the CuTe paper:
- Figures 1-3, 5, 10: layout construction, folding, slicing
- Tables 2, 4-7: COPY, partitioning, inverses, complement
- §3.1-3.5.2: concatenation, coalesce, composition, inverse,
  complement, logical product, logical divide, zipped divide

Tests cite specific figure/table/equation numbers. Run with --draw
to generate corresponding paper figures as SVGs:

    pytest tests/paper_examples.py --draw
The old check `len(data) >= cosize(layout)` only works for zero-offset,
nonnegative-stride layouts. It silently accepted Tensors that would
read/write out of bounds (e.g. offset=10 with a 4-element buffer, or
negative strides without a compensating offset).

Replace with _address_bounds() which computes the actual min/max storage
indices from (offset, layout, swizzle) and validates that the storage
covers the full range.
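A minimal sketch of the bounds idea, ignoring swizzles and assuming flat shapes with extents of at least 1. The names `address_bounds` and `storage_covers` are illustrative and are not the library's `_address_bounds()`.

```python
def address_bounds(offset, shape, stride):
    """Return (lowest, highest) storage index reachable from offset + layout."""
    lowest = highest = offset
    for extent, step in zip(shape, stride):
        reach = (extent - 1) * step
        if reach < 0:
            lowest += reach   # negative strides extend the footprint downward
        else:
            highest += reach  # nonnegative strides extend it upward
    return lowest, highest


def storage_covers(storage_len, offset, shape, stride):
    """True when every reachable index lies inside [0, storage_len)."""
    lowest, highest = address_bounds(offset, shape, stride)
    return lowest >= 0 and highest < storage_len


# Cases the old len(data) >= cosize(layout) check accepted with a 4-element buffer:
shifted_ok = storage_covers(4, 10, (4,), (1,))     # offset pushes reads past the end
reversed_ok = storage_covers(4, 0, (4,), (-1,))    # negative stride, no compensating offset
compensated_ok = storage_covers(4, 3, (4,), (-1,)) # offset 3 compensates for the reversal
```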
cosize() now uses abs(stride) so negative-stride layouts report the
correct memory span.  _composition_1d carries the stride sign separately,
matching CuTe's signed composition rules.  Tensor.view() preserves the
parent's base offset, enabling reverse-order views like Layout(4, -1).
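To make the reverse-order case concrete, here is a small library-free sketch. The `cosize` function below is a stand-in written for this example (flat shapes only), and the list indexing stands in for a Tensor view with base offset 3 over Layout(4, -1).

```python
def cosize(shape, stride):
    """Memory span of a flat layout, using stride magnitudes."""
    return 1 + sum((extent - 1) * abs(step) for extent, step in zip(shape, stride))


def signed_span(shape, stride):
    """The same formula without abs(), shown only for comparison."""
    return 1 + sum((extent - 1) * step for extent, step in zip(shape, stride))


data = [10, 20, 30, 40]
base_offset = 3   # start at the last element
stride = -1       # and walk backwards

reversed_view = [data[base_offset + i * stride] for i in range(4)]
```

With the magnitude-based formula the reversed layout spans 4 elements, the same as its forward counterpart, whereas the signed formula yields -2.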
Coalescing and segment analyses now rebase accessed offsets to the
group's minimum, so reversed dense layouts analyze identically to
their forward equivalents.  Permutation analysis (cycles, order)
rebases dense shifted images to [0, n) before decomposing.
to_F2_matrix rejects negative strides (affine, not F2-linear) and
fixes swizzle matrix construction for negative shift values.
Layouts with negative strides produce negative output indices which
broke cell-label lookup (Python wraparound), TV grid sizing (based on
cosize which is always positive), and TV mapping (assumed image starts
at 0).

Fix by rebasing: _tv_output_bounds() finds the actual min/max offsets,
_compute_tv_mapping() shifts all indices so the minimum maps to cell 0,
and _lookup_cell_label() guards against negative-index wraparound on
user-provided label lists.
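The wraparound problem is plain Python list semantics: a negative index silently selects from the end. The sketch below shows the two ideas from this commit in isolation; `rebase` and `lookup_cell_label` are hypothetical stand-ins, not the private helpers named above.

```python
def rebase(offsets):
    """Shift offsets so the smallest one lands on cell 0."""
    lowest = min(offsets)
    return [offset - lowest for offset in offsets]


def lookup_cell_label(labels, index):
    """Return labels[index] only for in-range, nonnegative indices."""
    if 0 <= index < len(labels):
        return labels[index]
    return None  # no label rather than a wrapped-around one


offsets = [-i for i in range(4)]   # Layout(4, -1) produces 0, -1, -2, -3
labels = ["a", "b", "c", "d"]

wrapped = labels[offsets[1]]                      # Python quietly returns the last label
guarded = lookup_cell_label(labels, offsets[1])   # the guard returns None instead
cells = rebase(offsets)                           # grid cells after rebasing
```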
Previously t[:, 3] = 99 silently computed an offset from the slice
object, writing to an arbitrary location.  Now __setitem__ checks for
slice/None markers anywhere in the key (including inside hierarchical
tuples) and raises TypeError with guidance to write through a sliced
sub-Tensor instead.
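One possible shape for such a check is a recursive scan of the key. The sketch below uses a toy column-major `TinyTensor` class and a `has_free_coordinate` helper, both invented for illustration; the real Tensor class and its error message may differ.

```python
def has_free_coordinate(key):
    """True if key contains a slice or None at any nesting depth."""
    if key is None or isinstance(key, slice):
        return True
    if isinstance(key, tuple):
        return any(has_free_coordinate(part) for part in key)
    return False


class TinyTensor:
    """Toy rank-2, column-major tensor used only to demonstrate the guard."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.data = [0] * (rows * cols)

    def __setitem__(self, key, value):
        if has_free_coordinate(key):
            raise TypeError(
                "cannot assign through a slice or None; "
                "slice first, then index the resulting sub-tensor"
            )
        row, col = key
        self.data[row + self.rows * col] = value


t = TinyTensor(4, 4)
t[1, 2] = 5  # fully specified coordinate: allowed
```

Assigning with `t[:, 3] = 99` on this toy class raises TypeError instead of computing an offset from the slice object.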
Nested tuple tilers like ((2, 3), 4) were flattened into Layout((2,3), 1)
by the is_pure_shape path, collapsing mode structure that should recurse.
Now _normalize_compose_tiler_element preserves tuple nesting so each
level composes against its corresponding mode, matching CuTe semantics.
zipped_divide, tiled_divide, and flat_divide were reducing Layout
tilers to their shape, silently discarding non-unit strides (e.g.
Layout(4, 2) became just 4).  Now Layout tilers take CuTe's
tile_unzip terminal path, preserving their stride structure.
tiled_divide and flat_divide are rebuilt on top of zipped_divide.
The complement bound for Layout tilers should be size(A) * cosize(B),
not size(A) * size(B).  With non-unit strides (e.g. Layout(3, 2),
cosize=5) the old formula underestimated the codomain and produced
wrong complement layouts in the explanation output.
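The gap between the two bounds is easy to see numerically. In the sketch below, `size` and `cosize` are small stand-ins for flat layouts with nonnegative strides, and size(A) = 4 is an arbitrary value picked for the example.

```python
from math import prod


def size(shape):
    """Number of elements in the layout's domain."""
    return prod(shape)


def cosize(shape, stride):
    """Span of offsets touched, assuming nonnegative strides."""
    return 1 + sum((extent - 1) * step for extent, step in zip(shape, stride))


B_shape, B_stride = (3,), (2,)   # Layout(3, 2) touches offsets 0, 2, 4
size_A = 4                       # hypothetical size(A) for illustration

old_bound = size_A * size(B_shape)              # ignores the gaps between B's elements
new_bound = size_A * cosize(B_shape, B_stride)  # accounts for B's full span
```

Here the old bound is 12 while the corrected bound is 20, so the old formula leaves part of the codomain uncovered.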
explain(compose, A, (2, 4)) called B(i) on a plain tuple, raising
TypeError.  Now detects non-Layout tilers and shows the mode-by-mode
decomposition instead of the pointwise A(B(i)) trace.
Update pyproject.toml: fix stale layout_algebra paths to tensor_layouts,
add analysis.py to wildcard-import exceptions, exclude notebooks from
the lint surface.  Fix all reported warnings: f-strings with no
interpolation, unused imports and variables, ambiguous variable name
'l'.  Add Ruff instructions to README.
Two tests shared the name test_draw_swizzle_delegates_to_shared_builder,
so pytest silently ran only the second.  Rename the second to
test_draw_swizzle_saves_figure_from_shared_builder and update the first
to properly mock _save_figure through the full delegation path.
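The shadowing happens because a second `def` with the same name rebinds it at module level, so only the later function is left to collect. A small detector along these lines could flag such cases; this is a sketch using the standard `ast` module that inspects top-level functions only, not something taken from the repository.

```python
import ast
from collections import Counter


def duplicate_function_names(source):
    """Return names defined more than once at the top level of a module."""
    tree = ast.parse(source)
    counts = Counter(
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    return sorted(name for name, count in counts.items() if count > 1)


source = '''
def test_same_name():
    raise AssertionError("never runs: the next definition rebinds the name")

def test_same_name():
    pass
'''

duplicates = duplicate_function_names(source)

# Executing the module leaves a single binding, the second definition.
namespace = {}
exec(source, namespace)
result = namespace["test_same_name"]()  # returns None, the first body is unreachable
```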
CuTe C++ sets stride=0 on any extent-1 tile or rest mode produced by
logical_divide (e.g. logical_divide(4:3, 4) = (4,1):(3,0)).  Our
implementation kept the original stride, producing (4,1):(3,3) which
is functionally equivalent but breaks exact-match tests against CuTe.
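The equivalence follows from the fact that an extent-1 mode only ever has coordinate 0, so its stride never contributes to an offset. The sketch below checks this and applies the canonicalization; `layout_offset` and `canonicalize_unit_modes` are illustrative helpers for flat layouts, not library functions.

```python
def layout_offset(shape, stride, index):
    """Map a 1-D index to an offset for a flat layout (first mode fastest)."""
    offset = 0
    for extent, step in zip(shape, stride):
        offset += (index % extent) * step
        index //= extent
    return offset


def canonicalize_unit_modes(shape, stride):
    """Set the stride to 0 wherever the extent is 1."""
    return tuple(0 if extent == 1 else step for extent, step in zip(shape, stride))


shape = (4, 1)
before = (3, 3)
after = canonicalize_unit_modes(shape, before)

offsets_before = [layout_offset(shape, before, i) for i in range(4)]
offsets_after = [layout_offset(shape, after, i) for i in range(4)]
```

Both stride tuples give offsets 0, 3, 6, 9; only the printed form differs, which is what exact-match comparisons against CuTe are sensitive to.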
Compiles a small C++ program against CUTLASS/CuTe headers and compares
exact layout strings for the cases fixed in recent commits: nested
tuple tiler composition, unit-mode stride canonicalization, and
oversize tiler division.  Skipped automatically when CUTLASS headers
are not installed.
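A skip condition of this kind might look roughly like the following. The function name, the compiler list, and the way the include directory is passed in are all assumptions made for the sketch; the only detail taken from CUTLASS itself is that the CuTe headers live under `cute/layout.hpp` in its include tree.

```python
import shutil
from pathlib import Path


def oracle_skip_reason(include_dir, compilers=("c++", "g++", "clang++")):
    """Return a human-readable reason to skip, or None if the oracle can run."""
    if not any(shutil.which(name) for name in compilers):
        return "no C++ compiler found on PATH"
    header = Path(include_dir) / "cute" / "layout.hpp"
    if not header.is_file():
        return f"CuTe headers not found under {include_dir}"
    return None


# Hypothetical location; a real suite would take this from configuration.
reason = oracle_skip_reason("/nonexistent/cutlass/include")
```

In a pytest suite the returned string could be passed to `pytest.skip(reason)` so the cases report as skipped rather than failed.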
When slice_and_offset builds the sublayout for a partial hierarchical
slice, it now propagates the parent layout's swizzle to the result.
Previously the swizzle was silently dropped, causing incorrect address
computations for slices of swizzled layouts.

Add tests verifying swizzle preservation for both the low-level
slice_and_offset function and Tensor.__getitem__ hierarchical slicing.
Layout.__call__(None) — a bare None rather than a tuple containing None —
now returns the layout unchanged, matching CuTe's slice(_, layout)
identity operation. Previously bare-None fell through to has_none(),
which does not handle a non-tuple None correctly.

Add C++ oracle tests verifying slice(_, layout) for both rank-2 and
scalar layouts, and add external compatibility tests for regular,
scalar, and swizzled layouts.
Tensor.__getitem__ now recognizes a bare slice(None) (i.e. tensor[:])
and returns a new Tensor view with the same layout, offset, and data.
This matches the behavior of the explicit tensor[:, :] full slice and
provides the natural Python idiom for creating a view of the whole tensor.

Previously tensor[:] fell through to _slice_single, which does not
handle slice(None) correctly for this purpose.

Update docs/tensor_api.md with the tensor[:] entry and add tests for
both regular and swizzled tensor full slices.
Add a "Shape Factorization" section to docs/layout_api.md documenting
shape_div and shape_mod, including the intentional policy difference from
dynamic CuTe C++: this Python implementation requires exact scalar
divisibility (b|a or a|b), raising ValueError for pairs like
shape_div(6, 4) where CuTe C++ would return ceil_div(6, 4) = 2.

Update the shape_div docstring with the same strict-policy explanation
and add a ValueError example to the docstring examples.
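For scalars, the strict policy can be pictured as follows. This is a sketch for positive integers only, not the library's shape_div; in particular, returning 1 when a divides b is an assumption of this example.

```python
def strict_shape_div(a, b):
    """Scalar shape division that insists on exact divisibility."""
    if a % b == 0:
        return a // b
    if b % a == 0:
        return 1  # assumption: b absorbs all of a
    raise ValueError(f"shape_div({a}, {b}): neither value divides the other")


def ceil_div(a, b):
    """Rounding-up division, the permissive alternative mentioned above."""
    return -(-a // b)


exact = strict_shape_div(8, 4)
permissive = ceil_div(6, 4)

try:
    strict_shape_div(6, 4)
    rejected = False
except ValueError:
    rejected = True
```

The strict version surfaces mismatched tilers as errors at the call site, at the cost of rejecting inputs that the permissive rule would round up.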
image, is_injective, is_surjective, is_bijective, is_contiguous, and
functionally_equal are O(size) exhaustive enumerations — analysis-tier
operations, not core algebra. Moving them to analysis.py keeps the cost
model clear: the core layouts module is efficient, exhaustive introspection
is explicitly opt-in via the analysis module.

Updated all imports, tests, examples, notebooks, and documentation.
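The O(size) cost is inherent to this style of check, since every index has to be evaluated. A minimal sketch of an enumeration-based injectivity test for flat layouts, with names chosen for this example rather than copied from analysis.py:

```python
from math import prod


def layout_offset(shape, stride, index):
    """Map a 1-D index to an offset for a flat layout (first mode fastest)."""
    offset = 0
    for extent, step in zip(shape, stride):
        offset += (index % extent) * step
        index //= extent
    return offset


def enumerate_image(shape, stride):
    """Evaluate the layout at every index: O(size) work by construction."""
    return [layout_offset(shape, stride, i) for i in range(prod(shape))]


def is_injective_by_enumeration(shape, stride):
    """True when no two indices map to the same offset."""
    offsets = enumerate_image(shape, stride)
    return len(set(offsets)) == len(offsets)


padded = is_injective_by_enumeration((4, 8), (1, 5))    # pitch 5 > extent 4: no overlap
aliasing = is_injective_by_enumeration((4, 2), (1, 2))  # pitch 2 < extent 4: columns overlap
```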
@meta-cla meta-cla bot added the CLA Signed label Apr 9, 2026
@jduprat jduprat merged commit c305ec9 into facebookresearch:main Apr 9, 2026
3 of 7 checks passed