Revert getitem sibling clustering (#445) and support kwargs inputs matching TorchTitan to fix CI #451

Merged: sanketpurandare merged 1 commit into main from sanketpurandare/stack/7 on May 8, 2026
sanketpurandare (Contributor) commented on May 8, 2026

Stacked PRs:


Revert getitem sibling clustering and support kwargs inputs

Revert the graph clustering behavior introduced by 51f2f67 because the bridge-group recovery can reuse an already cluster-linked node as the root of another cluster group. The sharding optimizer treats cluster-linked nodes as not owning PuLP variables, so later flow constraints can resolve a linked key through cluster_links to a root key that was never materialized in pulp_variables. DeepSeek V3 placement then fails before solving with KeyError on the resolved cluster key.
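The failure mode described above can be reproduced in miniature. The following is only a sketch under assumptions: `build_variables`, `resolve`, and `linked_roots` are hypothetical helpers, plain strings stand in for PuLP variables, and the single-hop lookup is an assumed simplification of how `cluster_links` is consulted. It is not the AutoParallel implementation.

```python
# Simplified model of the failure mode; all helper names are illustrative.

def build_variables(nodes, cluster_links):
    """Create a placeholder variable for every node that is not cluster-linked."""
    return {node: f"var_{node}" for node in nodes if node not in cluster_links}


def resolve(key, cluster_links):
    """Map a linked key to the root it points at (a single hop)."""
    return cluster_links.get(key, key)


def linked_roots(cluster_links):
    """Return roots that are themselves linked, i.e. roots that own no variable."""
    return sorted({root for root in cluster_links.values() if root in cluster_links})


nodes = ["a", "b", "c"]
# A first pass links "b" to root "a"; a later pass reuses "b" as the root of "c".
cluster_links = {"b": "a", "c": "b"}
variables = build_variables(nodes, cluster_links)

# "b" resolves to "a", which owns a variable, so this lookup succeeds.
print(variables[resolve("b", cluster_links)])

try:
    # "c" resolves to "b", which is linked and therefore owns no variable.
    variables[resolve("c", cluster_links)]
except KeyError as err:
    print("KeyError on resolved cluster key:", err)

# A validation pass like this would flag the bad root before constraints are built.
print(linked_roots(cluster_links))
```

A check in the spirit of `linked_roots` is one way a future reintroduction of getitem sibling recovery could guard against roots that are already cluster-linked.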

Keep the later DeepSeek V3 clustering coverage, but remove the getitem-specific expectations that depended on the reverted bridge-group behavior. Make both the DS3 local_map example and DS3 clustering coverage use the TorchTitan debug shape, so a future reintroduction of getitem sibling recovery has to handle the graph that exposed the bug.

Also allow AutoParallel generated forward wrappers to accept kwargs by flattening (args, kwargs) when positional flattening does not match the traced input arity. TorchTitan GraphTrainer can pass model inputs by keyword, so the generated wrapper needs to preserve that call shape while keeping the positional-only path unchanged.
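The kwargs fallback can be sketched as follows. This is a minimal illustration under assumptions, not the generated wrapper itself: `flatten` is a hand-written stand-in for whatever pytree utility the real code uses, `make_forward` and `num_traced_inputs` are hypothetical names, and keyword arguments are assumed to arrive in the traced order.

```python
def flatten(obj):
    """Stand-in pytree flatten: tuples, lists, and dicts (in insertion order)."""
    if isinstance(obj, (tuple, list)):
        return [leaf for item in obj for leaf in flatten(item)]
    if isinstance(obj, dict):
        return [leaf for value in obj.values() for leaf in flatten(value)]
    return [obj]


def make_forward(graph_fn, num_traced_inputs):
    """Wrap a flat-input function so it accepts positional or keyword inputs."""

    def forward(*args, **kwargs):
        # Positional-only path: unchanged when the arity already matches.
        flat = flatten(args)
        if len(flat) != num_traced_inputs:
            # Fallback: flatten (args, kwargs) together.
            flat = flatten((args, kwargs))
        if len(flat) != num_traced_inputs:
            raise TypeError(
                f"expected {num_traced_inputs} flat inputs, got {len(flat)}"
            )
        return graph_fn(*flat)

    return forward


# Hypothetical traced function that takes two flat inputs.
forward = make_forward(lambda tokens, mask: (tokens, mask), num_traced_inputs=2)

print(forward(1, 2))              # positional call
print(forward(1, mask=2))         # mixed call
print(forward(tokens=1, mask=2))  # keyword-only call
```

Because the fallback only runs when the positional arity does not match, existing positional callers take exactly the same path as before.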

sanketpurandare force-pushed the sanketpurandare/stack/7 branch from 115bb07 to ed31dca on May 8, 2026 01:45
meta-cla (bot) added the CLA Signed label on May 8, 2026
stack-info: PR: #451, branch: sanketpurandare/stack/7
sanketpurandare marked this pull request as draft on May 8, 2026 02:00
sanketpurandare force-pushed the sanketpurandare/stack/7 branch from ed31dca to 68e6513 on May 8, 2026 02:00
sanketpurandare marked this pull request as ready for review on May 8, 2026 02:00
sanketpurandare requested review from aditvenk and xmfan on May 8, 2026 02:01
sanketpurandare changed the title from "Revert getitem sibling clustering and support kwargs inputs" to "Revert getitem sibling clustering (#445) and support kwargs inputs matching TorchTitan" on May 8, 2026
sanketpurandare changed the title from "Revert getitem sibling clustering (#445) and support kwargs inputs matching TorchTitan" to "Revert getitem sibling clustering (#445) and support kwargs inputs matching TorchTitan to fix CI" on May 8, 2026
sanketpurandare requested a review from fmassa on May 8, 2026 02:03
sanketpurandare merged commit c7a206d into main on May 8, 2026
11 checks passed