Add TorchTitan DSv3 equivalence coverage #444

Open
sanketpurandare wants to merge 1 commit into main from sanketpurandare/stack/6

Conversation

sanketpurandare (Contributor) commented May 4, 2026

Stacked PRs:


Add TorchTitan DSv3 equivalence coverage

Add a four-rank TorchTitan DeepSeek V3 equivalence script that loads one shared full seed state into AutoParallel and TorchTitan distributed models, checks full state equality, and compares one forward/backward step.

Align the AutoParallel DSv3 helper with TorchTitan semantics needed by the equivalence check: routing order, attention backend priority, initialization order, and global expert-token accounting.

Run the equivalence script explicitly from the TorchTitan integration workflow. The script is intentionally not named test_*.py, so generic pytest jobs do not collect this torchrun-only distributed check.
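
The naming point can be illustrated with a minimal sketch. It assumes the repository keeps pytest's default `python_files` patterns (`test_*.py` and `*_test.py`); the file names below are hypothetical and are not taken from this PR.

```python
from fnmatch import fnmatch

# pytest's default `python_files` patterns for test-module collection.
# Assumption: the repository does not override them in its pytest config.
DEFAULT_PYTHON_FILES = ("test_*.py", "*_test.py")


def is_collected_by_default(filename: str) -> bool:
    """Return True if the default patterns would match this file name."""
    return any(fnmatch(filename, pattern) for pattern in DEFAULT_PYTHON_FILES)


# Hypothetical file names, used only for illustration.
print(is_collected_by_default("test_dsv3_equivalence.py"))
print(is_collected_by_default("run_dsv3_equivalence.py"))
```

A script whose name misses both patterns is skipped when pytest walks a directory, so a check that needs `torchrun` only runs when the workflow invokes it by path.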

sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 711e30c to dbdb5ec May 4, 2026 03:14
sanketpurandare force-pushed the sanketpurandare/stack/5 branch from 33e84e1 to 9800603 May 4, 2026 03:14
meta-cla (bot) added the CLA Signed label May 4, 2026
sanketpurandare force-pushed the sanketpurandare/stack/5 branch from 9800603 to 7e33971 May 4, 2026 03:18
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from dbdb5ec to 22c5479 May 4, 2026 03:18
sanketpurandare marked this pull request as draft May 4, 2026 03:21
sanketpurandare changed the base branch from sanketpurandare/stack/5 to main May 4, 2026 03:32
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 22c5479 to 2409c89 May 4, 2026 03:32
sanketpurandare changed the base branch from main to sanketpurandare/stack/5 May 4, 2026 03:32
sanketpurandare changed the base branch from sanketpurandare/stack/5 to main May 4, 2026 04:02
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 2409c89 to 9cf7a08 May 4, 2026 04:02
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 9cf7a08 to 5e45e0a May 4, 2026 04:07
sanketpurandare changed the base branch from main to sanketpurandare/stack/5 May 4, 2026 04:08
Comment thread .github/workflows/test_torchtitan.yml Outdated
sanketpurandare changed the base branch from sanketpurandare/stack/5 to main May 4, 2026 20:00
sanketpurandare added a commit that referenced this pull request May 4, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 5e45e0a to 78b50ee May 4, 2026 20:00
sanketpurandare changed the base branch from main to sanketpurandare/stack/5 May 4, 2026 20:00
sanketpurandare force-pushed the sanketpurandare/stack/5 branch from 529bae8 to 1625996 May 4, 2026 20:28
sanketpurandare changed the base branch from main to sanketpurandare/stack/5 May 8, 2026 00:34
sanketpurandare force-pushed the sanketpurandare/stack/5 branch from 48ac620 to 70241c6 May 8, 2026 00:34
sanketpurandare added a commit that referenced this pull request May 8, 2026
This replaces the old TorchTitan run_train-based smoke check with an explicit four-rank DeepSeek V3 equivalence script that compares AutoParallel's local_map DS3 debug shape against TorchTitan's parallelized DeepSeek V3 implementation. The script initializes NCCL with one rank per CUDA device, builds matching 4-layer debug configs, uses the same token batch on both paths, initializes weights with the same seed, and compares both distributed cross-entropy loss and global gradient norm after backward.
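
A minimal sketch of the comparison described above, using plain Python numbers instead of tensors. The per-rank values and the tolerance are made up for illustration, and summing a list stands in for an all-reduce across ranks; none of these names come from the actual script.

```python
import math


def global_grad_norm(per_rank_sq_sums):
    """Combine per-rank sums of squared gradient entries into one L2 norm.

    Summing the list stands in for an all-reduce(SUM) across ranks.
    """
    return math.sqrt(sum(per_rank_sq_sums))


def matches(a, b, rel_tol=1e-2):
    """Relative-tolerance comparison; the tolerance is an assumption."""
    return math.isclose(a, b, rel_tol=rel_tol)


# Made-up values: each path shards gradients differently across four ranks,
# so local contributions differ while the global norm should agree.
path_a = {"loss": 10.8125, "sq_sums": [4.0, 1.0, 2.25, 1.75]}
path_b = {"loss": 10.8125, "sq_sums": [3.0, 2.0, 2.0, 2.0]}

norm_a = global_grad_norm(path_a["sq_sums"])
norm_b = global_grad_norm(path_b["sq_sums"])
print(matches(path_a["loss"], path_b["loss"]), matches(norm_a, norm_b))
```

Comparing a single global norm rather than per-rank values keeps the check independent of how each implementation lays out its shards.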

The AutoParallel side constructs the local_map DS3 model from make_dsv3_config(), runs it through AutoParallel on a 2D dp/ep mesh with input and output sharded across both dimensions, applies bfloat16 parameter compute with float32 reductions, and initializes the sharded module on-device. The TorchTitan side builds an equivalent hierarchical config, applies data_parallel_shard_degree=4 and expert_parallel_degree=2 with loss parallel disabled, disables compile and activation checkpointing, and runs the same loss/gradient reduction logic so the comparison is about model and sharding numerics rather than trainer machinery.
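
To make the 2D sharding concrete, here is a hedged sketch of how four ranks could map onto a dp/ep mesh and how a batch split across both dimensions would land on each rank. It assumes a 2 x 2 row-major layout and a global batch of 8; neither value is stated in the commit message, and the helper names are hypothetical.

```python
def mesh_coords(rank, ep_size=2):
    """Map a flat rank to (dp_index, ep_index), assuming row-major order."""
    return divmod(rank, ep_size)


def local_batch_slice(rank, global_batch, dp_size=2, ep_size=2):
    """Return the [start, end) rows a rank owns when the batch dimension
    is sharded first across dp and then across ep."""
    dp_idx, ep_idx = mesh_coords(rank, ep_size)
    per_dp = global_batch // dp_size
    per_rank = per_dp // ep_size
    start = dp_idx * per_dp + ep_idx * per_rank
    return start, start + per_rank


for rank in range(4):
    print(rank, mesh_coords(rank), local_batch_slice(rank, global_batch=8))
```

Under these assumptions every rank sees a distinct quarter of the batch, which is why the loss and gradient norm must be reduced across all four ranks before the two paths can be compared.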

The new script is intentionally named outside pytest's test_*.py collection pattern and is invoked directly from the TorchTitan workflow with torchrun. After the numerics check, the workflow still runs TorchTitan's graph_trainer_autoparallel integration suite in a temporary output directory, so CI covers both the direct DS3 numerical comparison and the native/backend GraphTrainer integration tests.
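
A sketch of how a four-rank `torchrun` launch could be assembled from Python. The script path is a placeholder, and whether the workflow passes `--standalone` is an assumption; only the use of `torchrun` with four ranks comes from the text above.

```python
def torchrun_command(script, nproc=4):
    """Build a single-node torchrun command line for `nproc` ranks."""
    return ["torchrun", "--standalone", f"--nproc_per_node={nproc}", script]


# Placeholder path; the real script name is not given here.
cmd = torchrun_command("path/to/equivalence_script.py")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` so that a non-zero exit from any rank fails the CI step.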

Authored with Claude.

stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from b9ec3e2 to 3b0baca May 8, 2026 00:34
sanketpurandare changed the base branch from sanketpurandare/stack/5 to main May 8, 2026 00:35
sanketpurandare added a commit that referenced this pull request May 8, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 3b0baca to 5eccc2d May 8, 2026 00:35
sanketpurandare changed the base branch from main to sanketpurandare/stack/5 May 8, 2026 00:35
sanketpurandare added a commit that referenced this pull request May 8, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 5eccc2d to a5edf95 May 8, 2026 00:35
sanketpurandare changed the base branch from sanketpurandare/stack/5 to main May 8, 2026 00:35
sanketpurandare added a commit that referenced this pull request May 8, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from a5edf95 to 7deb704 May 8, 2026 00:36
sanketpurandare added a commit that referenced this pull request May 8, 2026
Keep the four-rank DeepSeek V3 eager-equivalence script in CI so AutoParallel's local_map model definition is checked directly against TorchTitan's parallelized DeepSeek V3 implementation before placement and compiler behavior enter the picture.

Also run the two TorchTitan GraphTrainer AutoParallel integration tests explicitly: Llama3 FSDP+TP and DeepSeek V3 EFSDP+EP. Running them by test name makes the intended coverage visible in the workflow.

Finally run both GraphTrainer AutoParallel numerics tests from TorchTitan: Llama3 versus eager and DeepSeek V3 versus eager. The DeepSeek V3 numerics command disables NCCL NVLS to match the stable TorchTitan H100 numerics setup.
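
One way to disable NVLS for a single command is NCCL's `NCCL_NVLS_ENABLE` environment variable. The sketch below assumes that mechanism; the commit message does not say how the workflow actually turns NVLS off, and the helper name is hypothetical.

```python
import os


def env_without_nvls(base_env=None):
    """Return a copy of the environment with NCCL NVLS disabled.

    Assumption: the numerics command honors NCCL_NVLS_ENABLE=0.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_NVLS_ENABLE"] = "0"
    return env


# Example with an explicit base environment so the result is predictable.
print(env_without_nvls({"PATH": "/usr/bin"}))
```

Passing the returned mapping as `env=` to `subprocess.run` scopes the setting to that one command and leaves the other CI steps on NCCL's defaults.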

Authored with Claude.

stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 7deb704 to cd533f1 May 8, 2026 01:44
sanketpurandare added a commit that referenced this pull request May 8, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from cd533f1 to 4a9ecaf May 8, 2026 01:45
sanketpurandare changed the title from "Add TorchTitan DeepSeek V3 equivalence coverage to CI" to "Add TorchTitan AutoParallel coverage to CI" May 8, 2026
sanketpurandare changed the base branch from main to sanketpurandare/stack/7 May 8, 2026 01:45
sanketpurandare changed the base branch from sanketpurandare/stack/7 to main May 8, 2026 02:00
sanketpurandare added a commit that referenced this pull request May 8, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 4a9ecaf to 27538dc May 8, 2026 02:00
sanketpurandare changed the base branch from main to sanketpurandare/stack/7 May 8, 2026 02:00
sanketpurandare added a commit that referenced this pull request May 8, 2026
stack-info: PR: #444, branch: sanketpurandare/stack/6
sanketpurandare force-pushed the sanketpurandare/stack/6 branch from 27538dc to 702a78a May 8, 2026 03:04
sanketpurandare changed the base branch from sanketpurandare/stack/7 to main May 8, 2026 03:04
Comment thread .github/workflows/test_torchtitan.yml Outdated
Comment on lines 16 to 53
Labels

CLA Signed

3 participants