Add TorchTitan DSv3 equivalence coverage by sanketpurandare · Pull Request #444 · meta-pytorch/autoparallel

sanketpurandare · 2026-05-04T03:14:27Z

Stacked PRs:

Add TorchTitan DSv3 equivalence coverage

Add a four-rank TorchTitan DeepSeek V3 equivalence script that loads one shared full seed state into AutoParallel and TorchTitan distributed models, checks full state equality, and compares one forward/backward step.
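The check described above (load one shared seed state, verify state equality, compare one forward/backward step) can be illustrated with a minimal single-process sketch. This is not the PR's script: it assumes a toy CPU model in place of DeepSeek V3, and the helper names `states_equal` and `one_step` are hypothetical.

```python
import copy

import torch
import torch.nn as nn


def states_equal(a, b):
    """Return True when both state dicts share keys and hold bitwise-equal tensors."""
    return a.keys() == b.keys() and all(torch.equal(a[k], b[k]) for k in a)


def one_step(model, tokens, targets):
    """Run one forward/backward step and return (loss, gradient norm)."""
    model.zero_grad(set_to_none=True)
    logits = model(tokens)
    loss = nn.functional.cross_entropy(logits, targets)
    loss.backward()
    sq_sum = sum(p.grad.float().pow(2).sum() for p in model.parameters())
    return loss.detach(), torch.sqrt(sq_sum)


def make_toy_model():
    # Toy stand-in (assumption): the real script builds DeepSeek V3 models.
    return nn.Sequential(nn.Embedding(32, 16), nn.Linear(16, 32))


torch.manual_seed(0)
seed_state = copy.deepcopy(make_toy_model().state_dict())

# Two independently constructed models receive the same full seed state.
model_a, model_b = make_toy_model(), make_toy_model()
model_a.load_state_dict(seed_state)
model_b.load_state_dict(seed_state)

# Both paths see the same token batch.
tokens = torch.randint(0, 32, (8,))
targets = torch.randint(0, 32, (8,))

loss_a, norm_a = one_step(model_a, tokens, targets)
loss_b, norm_b = one_step(model_b, tokens, targets)

print(states_equal(model_a.state_dict(), model_b.state_dict()))
torch.testing.assert_close(loss_a, loss_b)
torch.testing.assert_close(norm_a, norm_b)
```

Checking state equality before the step separates initialization mismatches from numerical differences in the forward and backward passes.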

Align the AutoParallel DSv3 helper with TorchTitan semantics needed by the equivalence check: routing order, attention backend priority, initialization order, and global expert-token accounting.

Run the equivalence script explicitly from the TorchTitan integration workflow. The script is intentionally not named test_*.py, so generic pytest jobs do not collect this torchrun-only distributed check.
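The naming choice can be illustrated with a small sketch of filename matching. It assumes pytest's default `python_files` patterns (`test_*.py` and `*_test.py`) and uses hypothetical file names, since the script's actual name is not stated here.

```python
from fnmatch import fnmatch

# pytest's default python_files patterns (assumes no project-level override).
DEFAULT_PYTEST_FILE_PATTERNS = ("test_*.py", "*_test.py")


def would_be_collected(filename, patterns=DEFAULT_PYTEST_FILE_PATTERNS):
    """Return True if the file name matches any collection pattern."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Hypothetical file names, for illustration only.
print(would_be_collected("test_dsv3.py"))         # True
print(would_be_collected("dsv3_equivalence.py"))  # False
```

A script that needs `torchrun` and four ranks would fail or hang under a plain single-process pytest run, so keeping it out of collection avoids spurious failures in generic jobs.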

stack-info: PR: #444, branch: sanketpurandare/stack/6

This replaces the old TorchTitan run_train-based smoke check with an explicit four-rank DeepSeek V3 equivalence script that compares AutoParallel's local_map DS3 debug shape against TorchTitan's parallelized DeepSeek V3 implementation. The script initializes NCCL with one rank per CUDA device, builds matching 4-layer debug configs, uses the same token batch on both paths, initializes weights with the same seed, and compares both distributed cross-entropy loss and global gradient norm after backward.

The AutoParallel side constructs the local_map DS3 model from make_dsv3_config(), runs it through AutoParallel on a 2D dp/ep mesh with input and output sharded across both dimensions, applies bfloat16 parameter compute with float32 reductions, and initializes the sharded module on-device.

The TorchTitan side builds an equivalent hierarchical config, applies data_parallel_shard_degree=4 and expert_parallel_degree=2 with loss parallel disabled, disables compile and activation checkpointing, and runs the same loss/gradient reduction logic so the comparison is about model and sharding numerics rather than trainer machinery.

The new script is intentionally named outside pytest's test_*.py collection pattern and is invoked directly from the TorchTitan workflow with torchrun. After the numerics check, the workflow still runs TorchTitan's graph_trainer_autoparallel integration suite in a temporary output directory, so CI covers both the direct DS3 numerical comparison and the native/backend GraphTrainer integration tests.

Authored with Claude.

stack-info: PR: #444, branch: sanketpurandare/stack/6
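The global gradient norm comparison can be sketched without any distributed runtime. The example below assumes each rank holds a disjoint shard of the gradient and that a plain Python `sum` stands in for an all-reduce with a SUM op. The function names and values are hypothetical, and parameters replicated across ranks would need to be counted once rather than per rank.

```python
import math


def local_sum_of_squares(shard):
    """Each rank reduces its own gradient shard to a single scalar."""
    return sum(g * g for g in shard)


def global_grad_norm(shards_per_rank):
    """Combine per-rank scalars; the sum stands in for all_reduce(SUM)."""
    total = sum(local_sum_of_squares(shard) for shard in shards_per_rank)
    return math.sqrt(total)


# Illustrative gradient split into four disjoint shards (one per rank).
full_grad = [1.0, 2.0, 2.0, 4.0, 0.0, 0.0, 0.0, 0.0]
shards = [full_grad[i:i + 2] for i in range(0, len(full_grad), 2)]

reference = math.sqrt(sum(g * g for g in full_grad))
print(global_grad_norm(shards), reference)  # 5.0 5.0
```

Reducing sums of squares rather than per-rank norms keeps the result independent of how the gradient is partitioned, which is what lets two differently sharded implementations be compared on one number.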

Keep the four-rank DeepSeek V3 eager-equivalence script in CI so AutoParallel's local_map model definition is checked directly against TorchTitan's parallelized DeepSeek V3 implementation before placement and compiler behavior enter the picture.

Also run the two TorchTitan GraphTrainer AutoParallel integration tests explicitly: Llama3 FSDP+TP and DeepSeek V3 EFSDP+EP. Running them by test name makes the intended coverage visible in the workflow.

Finally, run both GraphTrainer AutoParallel numerics tests from TorchTitan: Llama3 versus eager and DeepSeek V3 versus eager. The DeepSeek V3 numerics command disables NCCL NVLS to match the stable TorchTitan H100 numerics setup.

Authored with Claude.

stack-info: PR: #444, branch: sanketpurandare/stack/6
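One way to disable NVLS for a single command is through NCCL's `NCCL_NVLS_ENABLE` environment variable. The sketch below only builds the environment; the helper name `env_without_nvls` is hypothetical, and the workflow's actual test command is not reproduced here.

```python
import os


def env_without_nvls(base=None):
    """Return a copy of the environment with NCCL NVLS disabled."""
    env = dict(os.environ if base is None else base)
    env["NCCL_NVLS_ENABLE"] = "0"  # NCCL reads this variable at initialization
    return env


env = env_without_nvls({"PATH": "/usr/bin"})
print(env["NCCL_NVLS_ENABLE"])  # 0
```

The returned mapping can be passed as `env=` to `subprocess.run` so the setting applies to one command without changing the environment of other CI steps.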

sanketpurandare added a commit that referenced this pull request May 4, 2026

Update TorchTitan CI integration tests

dbdb5ec

stack-info: PR: #444, branch: sanketpurandare/stack/6

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 711e30c to dbdb5ec May 4, 2026 03:14

sanketpurandare force-pushed the sanketpurandare/stack/5 branch from 33e84e1 to 9800603 May 4, 2026 03:14

meta-cla Bot added the CLA Signed label May 4, 2026

sanketpurandare force-pushed the sanketpurandare/stack/5 branch from 9800603 to 7e33971 May 4, 2026 03:18

sanketpurandare added a commit that referenced this pull request May 4, 2026

Update TorchTitan CI integration tests

22c5479

stack-info: PR: #444, branch: sanketpurandare/stack/6

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from dbdb5ec to 22c5479 May 4, 2026 03:18

sanketpurandare marked this pull request as draft May 4, 2026 03:21

sanketpurandare changed the base branch from sanketpurandare/stack/5 to main May 4, 2026 03:32

sanketpurandare added a commit that referenced this pull request May 4, 2026

Update TorchTitan CI integration tests

2409c89

stack-info: PR: #444, branch: sanketpurandare/stack/6

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 22c5479 to 2409c89 May 4, 2026 03:32

sanketpurandare changed the base branch from main to sanketpurandare/stack/5 May 4, 2026 03:32

sanketpurandare changed the base branch from sanketpurandare/stack/5 to main May 4, 2026 04:02

sanketpurandare added a commit that referenced this pull request May 4, 2026

Update TorchTitan CI integration tests

9cf7a08

stack-info: PR: #444, branch: sanketpurandare/stack/6

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 2409c89 to 9cf7a08 May 4, 2026 04:02

sanketpurandare added a commit that referenced this pull request May 4, 2026

Update TorchTitan CI integration tests

5e45e0a

stack-info: PR: #444, branch: sanketpurandare/stack/6

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 9cf7a08 to 5e45e0a May 4, 2026 04:07

sanketpurandare changed the base branch from main to sanketpurandare/stack/5 May 4, 2026 04:08

aditvenk reviewed May 4, 2026

Comment thread .github/workflows/test_torchtitan.yml Outdated

aditvenk approved these changes May 4, 2026

sanketpurandare changed the base branch from sanketpurandare/stack/5 to main May 4, 2026 20:00

sanketpurandare added a commit that referenced this pull request May 4, 2026

Update TorchTitan CI integration tests

78b50ee

stack-info: PR: #444, branch: sanketpurandare/stack/6

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 5e45e0a to 78b50ee May 4, 2026 20:00

sanketpurandare changed the base branch from main to sanketpurandare/stack/5 May 4, 2026 20:00

sanketpurandare force-pushed the sanketpurandare/stack/5 branch from 529bae8 to 1625996 May 4, 2026 20:28

sanketpurandare changed the base branch from main to sanketpurandare/stack/5 May 8, 2026 00:34

sanketpurandare force-pushed the sanketpurandare/stack/5 branch from 48ac620 to 70241c6 May 8, 2026 00:34

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from b9ec3e2 to 3b0baca May 8, 2026 00:34

sanketpurandare changed the base branch from sanketpurandare/stack/5 to main May 8, 2026 00:35

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 3b0baca to 5eccc2d May 8, 2026 00:35

sanketpurandare changed the base branch from main to sanketpurandare/stack/5 May 8, 2026 00:35

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 5eccc2d to a5edf95 May 8, 2026 00:35

sanketpurandare changed the base branch from sanketpurandare/stack/5 to main May 8, 2026 00:35

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from a5edf95 to 7deb704 May 8, 2026 00:36

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 7deb704 to cd533f1 May 8, 2026 01:44

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from cd533f1 to 4a9ecaf May 8, 2026 01:45

sanketpurandare mentioned this pull request May 8, 2026

Revert getitem sibling clustering (#445) and support kwargs inputs matching TorchTitan to fix CI #451

Merged

sanketpurandare changed the title ~~Add TorchTitan DeepSeek V3 equivalence coverage to CI~~ Add TorchTitan AutoParallel coverage to CI May 8, 2026

sanketpurandare changed the base branch from main to sanketpurandare/stack/7 May 8, 2026 01:45

sanketpurandare changed the base branch from sanketpurandare/stack/7 to main May 8, 2026 02:00

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 4a9ecaf to 27538dc May 8, 2026 02:00

sanketpurandare changed the base branch from main to sanketpurandare/stack/7 May 8, 2026 02:00

sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 27538dc to 702a78a May 8, 2026 03:04

sanketpurandare changed the base branch from sanketpurandare/stack/7 to main May 8, 2026 03:04

sanketpurandare mentioned this pull request May 8, 2026

Run TorchTitan GraphTrainer AutoParallel CI #452

Draft

github-advanced-security AI found potential problems May 11, 2026

Comment thread .github/workflows/test_torchtitan.yml Outdated

Comment on lines 16 to 53

sanketpurandare wants to merge 1 commit into main from sanketpurandare/stack/6

3 participants
