ENABLE_SINGLE_DIM_MM_FAMILY: opt-in DTensor strided sharding for mm-family ops by weifengpy · Pull Request #429 · meta-pytorch/autoparallel

weifengpy · 2026-04-21T18:27:35Z

current usage of einsum: AutoParallel rewrites PyTorch's view → mm → view decomposition of nn.Linear into einsum

This PR adds an opt-in toggle ENABLE_SINGLE_DIM_MM_FAMILY (default False) that routes mm/addmm/bmm/baddbmm/_scaled_mm through single-dim path and enumerates _StridedShard variants from observed input split_factors

Default behavior is unchanged. Flip the flag to opt in.

Headline Result

LLaMA3-8B at PR #424-class config (dim=4096, seqlen=8192, 64-rank 8×8 fake-PG mesh, cost_model=nccl, single-H100, fake collectives):

Scale	Solver time	Solver objective (NCCL cost proxy)
LLaMA3-8B 2-layer	NATIVE 45.7s vs EINSUM 76.1s (-40%)	NATIVE 57576 vs EINSUM 57761 (-0.32% cheaper)
LLaMA3-8B 32-layer	NATIVE 29.5 min vs EINSUM >4 h (timed out)	NATIVE 520184, EINSUM unknown

Objectives reproducible across seeds 0 and 1 (solver is deterministic given the graph).
EINSUM's strategy-space-per-node is ~1.5-2× larger (einsum bsk,kn->bsn has 4 axes vs. mm mk,kn->mn with 3), making ILP scaling superlinearly worse at depth.
_StridedShard never appears in the chosen strategies for the LLaMA3-8B configs tested. Phase 1's _StridedShard enumeration is correct when dormant and ready when exercised by other workloads.

Summary: AP currently routes mm/addmm/bmm through the legacy register_op_strategy path (gen_einsum_strategies), which only emits plain Shard/Partial/Replicate placements. This misses the _StridedShard strategies that DTensor's single-dim mm path synthesizes when inputs carry _StridedShard — the natural output of view-flatten on multi-dim-sharded tensors. Changes: - Add _PREFER_SINGLE_DIM_OPS allowlist (mm, addmm, bmm, baddbmm, _scaled_mm) in shardings/dtensor_sharding_helpers.py; get_op_strategy routes those to the upstream single-dim path first. - Extend _try_single_dim_strategy's placeholder resolution to emit both Shard(d) and _StridedShard(d, sf) variants, with sf drawn from split_factors observed on upstream input strategies. Previous plain-Shard-only behavior is preserved when no input carries _StridedShard. - Fix _StridedShard miss in is_shard() call sites in apply_sharding.py (_localize_shape_arg) and cost_models/compute_estimation.py (_get_sharded_shape_stride) — _StridedShard-sharded dims were not being divided by mesh_size, causing over-counted FLOPs and wrong local shapes. Benchmarked on LLaMA3-8B at PR meta-pytorch#424-class config (dim=4096, seqlen=8192, 64-rank 8x8 fake-PG mesh): 2-layer: NATIVE solve 45.7s / objective 57576 vs EINSUM 76.1s / 57761 (-40% solve, -0.32% objective). 32-layer: NATIVE solve 29.5 min / objective 520184; EINSUM did not complete in 4h wall time. Test Plan: Three new unit tests in tests/test_propagation_rules.py: - test_mm_strategy_enumerates_strided_shard - test_mm_strategy_plain_shard_still_present - test_mm_strategy_backward_grad_weight_strided Authored with Claude.

Extends the Phase 1 DTensor strided-shard work by consolidating the _StridedShard-aware sharded check behind a shared is_shard_like() helper in shardings/dtensor_sharding_helpers.py, then uses it at the remaining is_shard() call sites flagged during the plan audit: - propagation_rules.py:177 - strategy-shape validity check - propagation_rules.py:552 - LayerNorm forward reduction-axis check - propagation_rules.py:626 - LayerNorm backward reduction-axis check - propagation_rules.py:702 - aten.pad trailing-dim shard removal - placement_options.py:560 - flex_attention Q/KV dim validity check Also migrates the Phase 1 inline fixes in apply_sharding.py and cost_models/compute_estimation.py to the helper for consistency. These call sites aren't on the Linear view->mm->view critical path, so the fixes are defense-in-depth rather than bugs observed in LLaMA3 benchmarks. However, user code that exercises LayerNorm/pad/flex_attention with _StridedShard-carrying inputs would have silently accepted invalid strategies without these fixes. All existing tests pass with _APPLY_VIEW_MM_VIEW_PATTERN both True and False (tests/test_optimize_placement.py, 11/11 in each configuration; plus the three new tests/test_propagation_rules.py::test_mm_strategy_*). Authored with Claude.

…y torch Under the nightly torch that CI installs (>=2.13.0.dev20260421), DTensor's GraphPipelineStage no longer exposes the underscore-prefixed _validate_fwd_outputs attribute, so mypy flags the call at graph_passes/graph_pp_runner.py:511 with [attr-defined]. The same error appears on remote/main, so this is a pre-existing CI break not caused by the rest of this stack — but it blocks the lint job on every PR until main picks up a fix. Add a narrow `# type: ignore[attr-defined]` to unblock this PR's CI. The real fix (either restoring the attribute upstream or switching to whatever replaced it) is separate work and should happen independently. Authored with Claude.

meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 21, 2026

weifengpy marked this pull request as draft April 21, 2026 18:28

weifengpy force-pushed the dtensor-native-linear-phase1 branch 2 times, most recently from c4d8b71 to d35a9db Compare April 21, 2026 19:54

weifengpy added 2 commits April 21, 2026 13:03

weifengpy force-pushed the dtensor-native-linear-phase1 branch from 3beb495 to dfce245 Compare April 21, 2026 20:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENABLE_SINGLE_DIM_MM_FAMILY: opt-in DTensor strided sharding for mm-family ops#429

ENABLE_SINGLE_DIM_MM_FAMILY: opt-in DTensor strided sharding for mm-family ops#429
weifengpy wants to merge 3 commits into
meta-pytorch:mainfrom
weifengpy:dtensor-native-linear-phase1

weifengpy commented Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

weifengpy commented Apr 21, 2026

Headline Result

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant