Move FSDP recompute tagging to the placement compile path #443
Merged
This was referenced May 4, 2026
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #443, branch: sanketpurandare/stack/5
aditvenk approved these changes May 4, 2026
sanketpurandare added a commit that referenced this pull request May 8, 2026
This moves mark_fsdp_all_gather_recomputation out of _apply_placement_common and into apply_placement, after the sharded graph has been cleaned up, traced, converted from view to reshape, functionalized for fresh index_put_ mutations, written back to joint descriptors, and prepared for AOT compilation. The common placement helper now only builds and normalizes the parallel graph, while the training compile path applies the FSDP all-gather recomputation tags immediately before invoking aot_compile_joint_with_descriptors.
Keeping the tag insertion at the apply_placement boundary makes the graph mutation order explicit: graph rewrites that affect structure happen first, descriptor state is refreshed, wait_tensor DCE behavior is installed, and then recompute metadata is added to the graph that the joint compiler consumes. This avoids mixing placement graph construction with compile-time recompute metadata and keeps the common helper usable for future placement flows that should not eagerly stamp FSDP recompute tags.
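The split between graph construction and tagging can be sketched with a toy graph. This is only an illustration under stated assumptions, not the code in this PR: ToyGraph, build_parallel_graph, tag_recompute, and compile_joint are hypothetical stand-ins for the parallel graph, _apply_placement_common, mark_fsdp_all_gather_recomputation, and aot_compile_joint_with_descriptors, and the "recompute" metadata key is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    op: str
    meta: dict = field(default_factory=dict)


@dataclass
class ToyGraph:
    nodes: list


def build_parallel_graph(ops):
    """Hypothetical common helper: builds and normalizes only, adds no tags."""
    graph = ToyGraph([ToyNode(op) for op in ops])
    # Structural rewrite (here: view -> reshape) runs before any tagging.
    for node in graph.nodes:
        if node.op == "view":
            node.op = "reshape"
    return graph


def tag_recompute(graph):
    """Hypothetical tagging pass: adds metadata, never changes structure."""
    for node in graph.nodes:
        if node.op == "all_gather":
            node.meta["recompute"] = True


def compile_joint(graph):
    """Hypothetical joint compiler: reports which ops arrived tagged."""
    return [n.op for n in graph.nodes if n.meta.get("recompute")]


def apply_placement_training(ops):
    graph = build_parallel_graph(ops)  # 1. structure first
    tag_recompute(graph)               # 2. metadata last, right before compile
    return compile_joint(graph)


def apply_placement_untagged(ops):
    # A hypothetical flow that reuses the helper without stamping tags.
    return compile_joint(build_parallel_graph(ops))


ops = ["all_gather", "view", "matmul"]
print(apply_placement_training(ops))
print(apply_placement_untagged(ops))
```

In this sketch the helper stays free of recompute metadata, so a second caller can reuse it unchanged, while the training path is the only place where tags are stamped.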
The compile backend behavior is otherwise unchanged, but the Inductor overlap-scheduling patch set is now centralized in _INDUCTOR_OVERLAP_PATCHES and selected directly when overlap_scheduling is enabled. That keeps autoparallel_backend focused on installing optional functorch AC and Inductor overlap config patches around compile_fx without rebuilding the same overlap dictionary on each backend construction.
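The centralization pattern can be sketched as follows. This is a hedged, self-contained illustration: toy_config, _TOY_OVERLAP_PATCHES, patch_config, make_backend, and toy_compile are hypothetical names, and the flag names and values are placeholders, not the actual contents of _INDUCTOR_OVERLAP_PATCHES or real Inductor options.

```python
import contextlib
from types import SimpleNamespace

# Hypothetical stand-in for an Inductor-style config object.
toy_config = SimpleNamespace(flag_a=False, flag_b=1)

# Built once at import time; keys and values are placeholders.
_TOY_OVERLAP_PATCHES = {"flag_a": True, "flag_b": 4}


@contextlib.contextmanager
def patch_config(config, patches):
    """Temporarily apply patches to config and restore the old values on exit."""
    saved = {key: getattr(config, key) for key in patches}
    for key, value in patches.items():
        setattr(config, key, value)
    try:
        yield
    finally:
        for key, value in saved.items():
            setattr(config, key, value)


def make_backend(compile_fn, overlap_scheduling=False):
    # Select the shared dictionary directly instead of rebuilding it per backend.
    patches = _TOY_OVERLAP_PATCHES if overlap_scheduling else {}

    def backend(graph):
        with patch_config(toy_config, patches):
            return compile_fn(graph)

    return backend


def toy_compile(graph):
    """Hypothetical compile step that just reports the config it observed."""
    return (graph, toy_config.flag_a, toy_config.flag_b)


print(make_backend(toy_compile, overlap_scheduling=True)("g"))
print(make_backend(toy_compile)("g"))
```

Every backend built with overlap_scheduling enabled shares the same module-level dictionary, and the config returns to its previous values once the compile call finishes.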
Authored with Claude.