Use upstream OpStrategy stringification for clustering and re-enable DS3 local_map#440
Merged
Merged
Conversation
1f1a12c to
0ef0c2d
Compare
84d89de to
bbedceb
Compare
This was referenced May 1, 2026
xmfan
approved these changes
May 1, 2026
aditvenk
reviewed
May 1, 2026
aditvenk
approved these changes
May 1, 2026
fmassa
reviewed
May 1, 2026
Comment on lines
+80
to
+81
| # Avoid OpStrategy.__str__ which calls mesh_shape, crashing when | ||
| # strategies contain None specs (local_map, getitem nodes). |
Contributor
There was a problem hiding this comment.
Shouldn't this be fixed upstream?
Contributor
Author
There was a problem hiding this comment.
Yes, but this is breaking torchtitan CI in autop as well as blocking graph-pp integration torchtitan and autoparalell experiments in graph trainer. Once I make sure everything works properly, I can land an upstream fix in pytorch.
Contributor
Author
There was a problem hiding this comment.
@fmassa Here is the tracking infor for the upstream Fixes:
- Original Issue caught in autoparallel: Re-enable example_ds3_local_map in CI after PyTorch OpStrategy.str fix #436
- @aditvenk's upstream fix that partially solves the issue: [DTensor] Fix OpSpec.mesh crash when specs contain None entries pytorch/pytorch#181541
- Remaining issue filed upstream: DTensor OpStrategy.__str__ still crashes for all-None non-tensor strategies pytorch/pytorch#182370
- PR that fixes it: [DTensor] Make DTensor OpStrategy stringification handle missing mesh pytorch/pytorch#182371
This was referenced May 4, 2026
sanketpurandare
added a commit
that referenced
this pull request
May 4, 2026
…_map (#436) After #432 enabled repeated_subgraphs by default, get_identical_regions hashes every node's OpStrategy to build cluster keys. Two kinds of nodes carry None specs — call_local_map (opaque to sharding propagation) and getitem on non-tensor outputs from multi-return ops like SDPA. Hashing called str(op_strategy) which invokes OpStrategy.__str__() -> mesh_shape -> strategies[0].mesh, and that dereferences None. Fix: - _prepare_op_strategy: for OpStrategy, join str(s) on each OpSpec instead of calling str(op_strategy). OpSpec.__str__ delegates to _pretty_print_spec which already handles None. The per-graph-constant mesh shape is omitted from the hash key, which doesn't affect clustering correctness. - _print_output_specs: guard for output_specs being None (not just individual elements), which causes TypeError on iteration. Tested with torchrun --nproc-per-node 4 examples/example_ds3_local_map.py. Clustering finds 16 groups covering 1106/1396 nodes (79%), with the largest forward group identifying the 4 identical MoE layers as one cluster (4 regions x 117 nodes). Validated: pytest tests/ passes (327 tests, 1 xfail), including the new test_prepare_op_strategy_none_specs regression test. stack-info: PR: #440, branch: sanketpurandare/stack/2
0ef0c2d to
a8d2d18
Compare
a8d2d18 to
584c14c
Compare
2c58669 to
2176ae2
Compare
xmfan
approved these changes
May 4, 2026
2176ae2 to
215f0c2
Compare
sanketpurandare
added a commit
that referenced
this pull request
May 4, 2026
…_map (#436) After #432 enabled repeated_subgraphs by default, get_identical_regions hashes every node's OpStrategy to build cluster keys. Two kinds of nodes carry None specs — call_local_map (opaque to sharding propagation) and getitem on non-tensor outputs from multi-return ops like SDPA. Hashing called str(op_strategy) which invokes OpStrategy.__str__() -> mesh_shape -> strategies[0].mesh, and that dereferences None. Fix: - _prepare_op_strategy: for OpStrategy, join str(s) on each OpSpec instead of calling str(op_strategy). OpSpec.__str__ delegates to _pretty_print_spec which already handles None. The per-graph-constant mesh shape is omitted from the hash key, which doesn't affect clustering correctness. - _print_output_specs: guard for output_specs being None (not just individual elements), which causes TypeError on iteration. Tested with torchrun --nproc-per-node 4 examples/example_ds3_local_map.py. Clustering finds 16 groups covering 1106/1396 nodes (79%), with the largest forward group identifying the 4 identical MoE layers as one cluster (4 regions x 117 nodes). Validated: pytest tests/ passes (327 tests, 1 xfail), including the new test_prepare_op_strategy_none_specs regression test. stack-info: PR: #440, branch: sanketpurandare/stack/2
215f0c2 to
9308ab5
Compare
9308ab5 to
78124cd
Compare
78124cd to
05e6055
Compare
…DS3 local_map This updates graph clustering to rely on PyTorch's fixed OpStrategy stringification again. The previous AutoParallel workaround printed only output placements for input strategies and handled DTensorSpec/None cases locally; now both the node strategy and strategy-bearing inputs use the full OpStrategy string, which keeps the clustering key closer to the original PyTorch graph-region tracker behavior and includes the input strategy detail needed to avoid false matches. The CUDA workflow now runs examples/example_ds3_local_map.py under torchrun again. That example had been disabled while OpStrategy.__str__ could crash on local_map/HOP tuple outputs and non-tensor getitem strategies containing None specs; with the upstream None-spec handling available, the local workaround can be dropped and the real DS3 local_map path can return to CI coverage. The clustering test coverage was also tightened around the behavior that matters for repeated-subgraph replacement. The helpers now compute clusters once, track per-layer coverage from the raw clustered regions, assert large same-phase cross-layer clusters in forward and backward, and reject any cluster that mixes forward and backward partitioner phases. The high-coverage test now exercises both the LLaMA graph and the DS3 local_map graph so the formerly fragile local_map/getitem strategy path remains covered. Authored with Claude. stack-info: PR: #440, branch: sanketpurandare/stack/2
05e6055 to
ca7ecbf
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked PRs:
Use upstream OpStrategy stringification for clustering and re-enable DS3 local_map
This updates graph clustering to rely on PyTorch's fixed OpStrategy stringification again. The previous AutoParallel workaround printed only output placements for input strategies and handled DTensorSpec/None cases locally; now both the node strategy and strategy-bearing inputs use the full OpStrategy string, which keeps the clustering key closer to the original PyTorch graph-region tracker behavior and includes the input strategy detail needed to avoid false matches.
The CUDA workflow now runs examples/example_ds3_local_map.py under torchrun again. That example had been disabled while OpStrategy.str could crash on local_map/HOP tuple outputs and non-tensor getitem strategies containing None specs; with the upstream None-spec handling available, the local workaround can be dropped and the real DS3 local_map path can return to CI coverage.
The clustering test coverage was also tightened around the behavior that matters for repeated-subgraph replacement. The helpers now compute clusters once, track per-layer coverage from the raw clustered regions, assert large same-phase cross-layer clusters in forward and backward, and reject any cluster that mixes forward and backward partitioner phases. The high-coverage test now exercises both the LLaMA graph and the DS3 local_map graph so the formerly fragile local_map/getitem strategy path remains covered.
Authored with Claude.