Add TorchTitan DSv3 equivalence coverage #444
Open
sanketpurandare wants to merge 1 commit into
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
711e30c to dbdb5ec
33e84e1 to 9800603
This was referenced May 4, 2026
9800603 to 7e33971
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
dbdb5ec to 22c5479
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
22c5479 to 2409c89
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
2409c89 to 9cf7a08
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
9cf7a08 to 5e45e0a
aditvenk reviewed May 4, 2026
aditvenk approved these changes May 4, 2026
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
5e45e0a to 78b50ee
529bae8 to 1625996
48ac620 to 70241c6
sanketpurandare added a commit that referenced this pull request May 8, 2026
This replaces the old TorchTitan run_train-based smoke check with an explicit four-rank DeepSeek V3 equivalence script that compares AutoParallel's local_map DS3 debug shape against TorchTitan's parallelized DeepSeek V3 implementation. The script initializes NCCL with one rank per CUDA device, builds matching 4-layer debug configs, uses the same token batch on both paths, initializes weights with the same seed, and compares both distributed cross-entropy loss and global gradient norm after backward.
The AutoParallel side constructs the local_map DS3 model from make_dsv3_config(), runs it through AutoParallel on a 2D dp/ep mesh with input and output sharded across both dimensions, applies bfloat16 parameter compute with float32 reductions, and initializes the sharded module on-device.
The TorchTitan side builds an equivalent hierarchical config, applies data_parallel_shard_degree=4 and expert_parallel_degree=2 with loss parallel disabled, disables compile and activation checkpointing, and runs the same loss/gradient reduction logic so the comparison is about model and sharding numerics rather than trainer machinery.
The new script is intentionally named outside pytest's test_*.py collection pattern and is invoked directly from the TorchTitan workflow with torchrun. After the numerics check, the workflow still runs TorchTitan's graph_trainer_autoparallel integration suite in a temporary output directory, so CI covers both the direct DS3 numerical comparison and the native/backend GraphTrainer integration tests.
Authored with Claude.
stack-info: PR: #444, branch: sanketpurandare/stack/6
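The "global gradient norm" comparison described in this commit message can be illustrated without GPUs. The following is a minimal sketch, not the script's actual code: it assumes the global norm is the L2 norm over all gradient elements, simulates four ranks with plain Python lists instead of `torch.distributed` collectives, and uses made-up gradient values and a hypothetical tolerance.

```python
import math

def local_sq_norm(shard):
    """Squared L2 norm of the gradient elements held by one rank."""
    return sum(g * g for g in shard)

def global_grad_norm(local_sq_norms):
    """Combine per-rank squared norms into one global L2 norm.

    In a real distributed run the sum would be an all-reduce; here it is
    a plain Python sum over a list that stands in for the ranks.
    """
    return math.sqrt(sum(local_sq_norms))

def assert_close(name, actual, expected, rel_tol=1e-6):
    """Fail loudly when two scalars differ beyond a relative tolerance.

    The tolerance is a hypothetical placeholder, not the value used in the PR.
    """
    if not math.isclose(actual, expected, rel_tol=rel_tol):
        raise AssertionError(f"{name}: {actual} != {expected} (rel_tol={rel_tol})")

# Illustrative gradient values only; they do not come from any real model.
full_grad = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, 3.0]

# Simulate four ranks, each holding a strided shard of the full gradient.
shards = [full_grad[rank::4] for rank in range(4)]

sharded_norm = global_grad_norm([local_sq_norm(s) for s in shards])
reference_norm = math.sqrt(sum(g * g for g in full_grad))

assert_close("grad_norm", sharded_norm, reference_norm)
```

Because every rank contributes a squared norm and the square root is taken only once after the sum, the result does not depend on how the elements are sharded, which is what makes a single scalar comparable across two different sharding implementations.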
b9ec3e2 to 3b0baca
sanketpurandare added a commit that referenced this pull request May 8, 2026
3b0baca to 5eccc2d
sanketpurandare added a commit that referenced this pull request May 8, 2026
5eccc2d to a5edf95
sanketpurandare added a commit that referenced this pull request May 8, 2026
a5edf95 to 7deb704
sanketpurandare added a commit that referenced this pull request May 8, 2026
Keep the four-rank DeepSeek V3 eager-equivalence script in CI so AutoParallel's local_map model definition is checked directly against TorchTitan's parallelized DeepSeek V3 implementation before placement and compiler behavior enter the picture.
Also run the two TorchTitan GraphTrainer AutoParallel integration tests explicitly: Llama3 FSDP+TP and DeepSeek V3 EFSDP+EP. Running them by test name makes the intended coverage visible in the workflow.
Finally run both GraphTrainer AutoParallel numerics tests from TorchTitan: Llama3 versus eager and DeepSeek V3 versus eager. The DeepSeek V3 numerics command disables NCCL NVLS to match the stable TorchTitan H100 numerics setup.
Authored with Claude.
stack-info: PR: #444, branch: sanketpurandare/stack/6
7deb704 to cd533f1
sanketpurandare added a commit that referenced this pull request May 8, 2026
cd533f1 to 4a9ecaf
sanketpurandare added a commit that referenced this pull request May 8, 2026
4a9ecaf to 27538dc
sanketpurandare added a commit that referenced this pull request May 8, 2026
27538dc to 702a78a
Stacked PRs:
Add TorchTitan DSv3 equivalence coverage
Add a four-rank TorchTitan DeepSeek V3 equivalence script that loads one shared full seed state into AutoParallel and TorchTitan distributed models, checks full state equality, and compares one forward/backward step.
Align the AutoParallel DSv3 helper with TorchTitan semantics needed by the equivalence check: routing order, attention backend priority, initialization order, and global expert-token accounting.
Run the equivalence script explicitly from the TorchTitan integration workflow. The script is intentionally not named test_*.py, so generic pytest jobs do not collect this torchrun-only distributed check.
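To make the naming choice concrete, here is a small sketch of why a file that does not match `test_*.py` is skipped by a generic pytest job. It assumes pytest's default `python_files` patterns (`test_*.py` and `*_test.py`) and no project-level override; the script name `run_dsv3_equivalence.py` is hypothetical and is not taken from this PR.

```python
from fnmatch import fnmatch

# pytest's default file-name patterns for test collection
# (assumes the project does not override `python_files`).
DEFAULT_PYTHON_FILES = ("test_*.py", "*_test.py")

def collected_by_default(filename):
    """Return True if a default pytest run would collect this file name."""
    return any(fnmatch(filename, pattern) for pattern in DEFAULT_PYTHON_FILES)

# Hypothetical file names, for illustration only.
candidates = [
    "test_dsv3.py",              # matches test_*.py
    "dsv3_equivalence_test.py",  # matches *_test.py
    "run_dsv3_equivalence.py",   # matches neither pattern
]

results = {name: collected_by_default(name) for name in candidates}
for name, collected in results.items():
    print(f"{name}: {'collected' if collected else 'skipped'}")
```

A script named like the third entry has to be launched explicitly (here, by the workflow with torchrun), which keeps a multi-rank check out of single-process pytest jobs.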