Skip to content

Use upstream OpStrategy stringification for clustering and re-enable DS3 local_map#440

Merged
sanketpurandare merged 1 commit into
mainfrom
sanketpurandare/stack/2
May 8, 2026
Merged

Use upstream OpStrategy stringification for clustering and re-enable DS3 local_map#440
sanketpurandare merged 1 commit into
mainfrom
sanketpurandare/stack/2

Conversation

@sanketpurandare

@sanketpurandare sanketpurandare commented May 1, 2026

Copy link
Copy Markdown
Contributor

Stacked PRs:


Use upstream OpStrategy stringification for clustering and re-enable DS3 local_map

This updates graph clustering to rely on PyTorch's fixed OpStrategy stringification again. The previous AutoParallel workaround printed only output placements for input strategies and handled DTensorSpec/None cases locally; now both the node strategy and strategy-bearing inputs use the full OpStrategy string, which keeps the clustering key closer to the original PyTorch graph-region tracker behavior and includes the input strategy detail needed to avoid false matches.

The CUDA workflow now runs examples/example_ds3_local_map.py under torchrun again. That example had been disabled while OpStrategy.str could crash on local_map/HOP tuple outputs and non-tensor getitem strategies containing None specs; with the upstream None-spec handling available, the local workaround can be dropped and the real DS3 local_map path can return to CI coverage.

The clustering test coverage was also tightened around the behavior that matters for repeated-subgraph replacement. The helpers now compute clusters once, track per-layer coverage from the raw clustered regions, assert large same-phase cross-layer clusters in forward and backward, and reject any cluster that mixes forward and backward partitioner phases. The high-coverage test now exercises both the LLaMA graph and the DS3 local_map graph so the formerly fragile local_map/getitem strategy path remains covered.

Authored with Claude.

@sanketpurandare sanketpurandare force-pushed the sanketpurandare/stack/2 branch from 1f1a12c to 0ef0c2d Compare May 1, 2026 02:52
@sanketpurandare sanketpurandare force-pushed the sanketpurandare/stack/1 branch from 84d89de to bbedceb Compare May 1, 2026 02:52
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 1, 2026
@sanketpurandare sanketpurandare marked this pull request as ready for review May 1, 2026 02:55
@sanketpurandare sanketpurandare requested review from aditvenk and fmassa May 1, 2026 02:55
Comment thread autoparallel/graph_passes/graph_clustering.py Outdated
Comment on lines +80 to +81
# Avoid OpStrategy.__str__ which calls mesh_shape, crashing when
# strategies contain None specs (local_map, getitem nodes).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be fixed upstream?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but this is breaking torchtitan CI in autop as well as blocking graph-pp integration torchtitan and autoparalell experiments in graph trainer. Once I make sure everything works properly, I can land an upstream fix in pytorch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanketpurandare sanketpurandare marked this pull request as draft May 4, 2026 03:14
@sanketpurandare sanketpurandare changed the base branch from sanketpurandare/stack/1 to main May 4, 2026 03:14
@sanketpurandare sanketpurandare changed the base branch from main to sanketpurandare/stack/1 May 4, 2026 03:14
@sanketpurandare sanketpurandare marked this pull request as ready for review May 4, 2026 03:14
sanketpurandare added a commit that referenced this pull request May 4, 2026
…_map (#436)

After #432 enabled repeated_subgraphs by default, get_identical_regions
hashes every node's OpStrategy to build cluster keys. Two kinds of nodes
carry None specs — call_local_map (opaque to sharding propagation) and
getitem on non-tensor outputs from multi-return ops like SDPA. Hashing
called str(op_strategy) which invokes OpStrategy.__str__() ->
mesh_shape -> strategies[0].mesh, and that dereferences None.

Fix:
- _prepare_op_strategy: for OpStrategy, join str(s) on each OpSpec
  instead of calling str(op_strategy). OpSpec.__str__ delegates to
  _pretty_print_spec which already handles None. The per-graph-constant
  mesh shape is omitted from the hash key, which doesn't affect
  clustering correctness.
- _print_output_specs: guard for output_specs being None (not just
  individual elements), which causes TypeError on iteration.

Tested with torchrun --nproc-per-node 4 examples/example_ds3_local_map.py.
Clustering finds 16 groups covering 1106/1396 nodes (79%), with the
largest forward group identifying the 4 identical MoE layers as one
cluster (4 regions x 117 nodes).

Validated: pytest tests/ passes (327 tests, 1 xfail), including the
new test_prepare_op_strategy_none_specs regression test.

stack-info: PR: #440, branch: sanketpurandare/stack/2
@sanketpurandare sanketpurandare force-pushed the sanketpurandare/stack/2 branch from 0ef0c2d to a8d2d18 Compare May 4, 2026 03:18
@sanketpurandare sanketpurandare changed the base branch from sanketpurandare/stack/1 to main May 4, 2026 03:18
@sanketpurandare sanketpurandare marked this pull request as draft May 4, 2026 03:32
@sanketpurandare sanketpurandare force-pushed the sanketpurandare/stack/2 branch from a8d2d18 to 584c14c Compare May 4, 2026 03:32
@sanketpurandare sanketpurandare marked this pull request as ready for review May 4, 2026 03:32
@sanketpurandare sanketpurandare marked this pull request as draft May 4, 2026 04:02
@sanketpurandare sanketpurandare force-pushed the sanketpurandare/stack/2 branch 2 times, most recently from 2c58669 to 2176ae2 Compare May 4, 2026 04:07
@sanketpurandare sanketpurandare changed the base branch from main to sanketpurandare/stack/4 May 4, 2026 04:07
@sanketpurandare sanketpurandare marked this pull request as ready for review May 4, 2026 04:08
@sanketpurandare sanketpurandare marked this pull request as draft May 4, 2026 20:00
@sanketpurandare sanketpurandare changed the base branch from sanketpurandare/stack/4 to main May 4, 2026 20:00
@sanketpurandare sanketpurandare force-pushed the sanketpurandare/stack/2 branch from 2176ae2 to 215f0c2 Compare May 4, 2026 20:00
@sanketpurandare sanketpurandare changed the base branch from main to sanketpurandare/stack/4 May 4, 2026 20:00
@sanketpurandare sanketpurandare marked this pull request as ready for review May 4, 2026 20:00
sanketpurandare added a commit that referenced this pull request May 4, 2026
…_map (#436)

After #432 enabled repeated_subgraphs by default, get_identical_regions
hashes every node's OpStrategy to build cluster keys. Two kinds of nodes
carry None specs — call_local_map (opaque to sharding propagation) and
getitem on non-tensor outputs from multi-return ops like SDPA. Hashing
called str(op_strategy) which invokes OpStrategy.__str__() ->
mesh_shape -> strategies[0].mesh, and that dereferences None.

Fix:
- _prepare_op_strategy: for OpStrategy, join str(s) on each OpSpec
  instead of calling str(op_strategy). OpSpec.__str__ delegates to
  _pretty_print_spec which already handles None. The per-graph-constant
  mesh shape is omitted from the hash key, which doesn't affect
  clustering correctness.
- _print_output_specs: guard for output_specs being None (not just
  individual elements), which causes TypeError on iteration.

Tested with torchrun --nproc-per-node 4 examples/example_ds3_local_map.py.
Clustering finds 16 groups covering 1106/1396 nodes (79%), with the
largest forward group identifying the 4 identical MoE layers as one
cluster (4 regions x 117 nodes).

Validated: pytest tests/ passes (327 tests, 1 xfail), including the
new test_prepare_op_strategy_none_specs regression test.

stack-info: PR: #440, branch: sanketpurandare/stack/2
@sanketpurandare sanketpurandare force-pushed the sanketpurandare/stack/2 branch from 215f0c2 to 9308ab5 Compare May 4, 2026 20:28
@sanketpurandare sanketpurandare changed the base branch from sanketpurandare/stack/4 to main May 4, 2026 20:28
@sanketpurandare sanketpurandare marked this pull request as draft May 4, 2026 23:50
@sanketpurandare sanketpurandare force-pushed the sanketpurandare/stack/2 branch from 9308ab5 to 78124cd Compare May 4, 2026 23:50
@sanketpurandare sanketpurandare marked this pull request as ready for review May 4, 2026 23:50
@sanketpurandare sanketpurandare marked this pull request as draft May 8, 2026 00:23
@sanketpurandare sanketpurandare force-pushed the sanketpurandare/stack/2 branch from 78124cd to 05e6055 Compare May 8, 2026 00:23
@sanketpurandare sanketpurandare changed the title Fix graph clustering crash on None specs, re-enable example_ds3_local_map (#436) Use upstream OpStrategy stringification for clustering and re-enable DS3 local_map May 8, 2026
@sanketpurandare sanketpurandare marked this pull request as ready for review May 8, 2026 00:23
…DS3 local_map

This updates graph clustering to rely on PyTorch's fixed OpStrategy stringification again. The previous AutoParallel workaround printed only output placements for input strategies and handled DTensorSpec/None cases locally; now both the node strategy and strategy-bearing inputs use the full OpStrategy string, which keeps the clustering key closer to the original PyTorch graph-region tracker behavior and includes the input strategy detail needed to avoid false matches.

The CUDA workflow now runs examples/example_ds3_local_map.py under torchrun again. That example had been disabled while OpStrategy.__str__ could crash on local_map/HOP tuple outputs and non-tensor getitem strategies containing None specs; with the upstream None-spec handling available, the local workaround can be dropped and the real DS3 local_map path can return to CI coverage.

The clustering test coverage was also tightened around the behavior that matters for repeated-subgraph replacement. The helpers now compute clusters once, track per-layer coverage from the raw clustered regions, assert large same-phase cross-layer clusters in forward and backward, and reject any cluster that mixes forward and backward partitioner phases. The high-coverage test now exercises both the LLaMA graph and the DS3 local_map graph so the formerly fragile local_map/getitem strategy path remains covered.

Authored with Claude.

stack-info: PR: #440, branch: sanketpurandare/stack/2
@sanketpurandare sanketpurandare marked this pull request as draft May 8, 2026 00:30
@sanketpurandare sanketpurandare force-pushed the sanketpurandare/stack/2 branch from 05e6055 to ca7ecbf Compare May 8, 2026 00:30
@sanketpurandare sanketpurandare marked this pull request as ready for review May 8, 2026 00:30
@sanketpurandare sanketpurandare merged commit af1a402 into main May 8, 2026
7 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Re-enable example_ds3_local_map in CI after PyTorch OpStrategy.str fix

4 participants