Align DeepSeek V3 test config with TorchTitan shape #441
Merged
sanketpurandare added a commit that referenced this pull request May 1, 2026
The flat DeepSeekV3ModelArgs and MoEArgs dataclasses are replaced by a tree of small dataclasses (DeepSeekV3Config -> LayerConfig -> AttentionConfig / MoEConfig / ...) whose attribute paths match torchtitan's DeepSeekV3Model.Config. Because the model reads config attributes via duck typing (no torchtitan import), either autoparallel's own DeepSeekV3Config or torchtitan's Config can be passed in.

Concrete changes in dsv3.py:
- Deleted DeepSeekV3ModelArgs and MoEArgs.
- Added config dataclasses: DeepSeekV3Config, LayerConfig, AttentionConfig, MoEConfig, RoPEConfig, NormConfig, etc.
- Added make_dsv3_config() factory that builds the config tree from scalar hyperparameters (same role as torchtitan's _debugmodel()).
- MoE.__init__ now takes explicit keyword args instead of MoEArgs. The DeviceMesh (needed by local_map) is a constructor parameter threaded through DeepSeekV3Model -> TransformerBlock -> MoE.
- Attention.__init__ takes (attn_config, model_config) and derives use_flex_attn from inner_attention type name instead of a flag.
- precompute_freqs_cis reads from config.rope.*.

example_ds3_local_map.py is updated to use make_dsv3_config().

Validated: pytest tests/ passes (327 tests, 1 xfail). Model construction verified with both autoparallel's DeepSeekV3Config and torchtitan's Config via duck typing.

stack-info: PR: #441, branch: sanketpurandare/stack/3
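For readers unfamiliar with the pattern, here is a minimal sketch of a nested config tree read through duck typing. Every class, field, and function name in it (ToyModelConfig, make_toy_config, describe, and so on) is hypothetical and chosen for illustration; it is not the contents of dsv3.py or of torchtitan's Config.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


# All names below are illustrative assumptions, not the PR's actual definitions.
@dataclass
class ToyRoPEConfig:
    dim: int = 8
    theta: float = 10000.0


@dataclass
class ToyFlexAttention:
    """Marker object standing in for an inner-attention config."""


@dataclass
class ToyAttentionConfig:
    n_heads: int = 2
    inner_attention: object = field(default_factory=ToyFlexAttention)


@dataclass
class ToyLayerConfig:
    attention: ToyAttentionConfig = field(default_factory=ToyAttentionConfig)


@dataclass
class ToyModelConfig:
    dim: int = 16
    layers: list = field(default_factory=list)
    rope: ToyRoPEConfig = field(default_factory=ToyRoPEConfig)


def make_toy_config(dim: int, n_layers: int, n_heads: int) -> ToyModelConfig:
    """Build the nested tree from a few scalar hyperparameters."""
    layers = [
        ToyLayerConfig(ToyAttentionConfig(n_heads=n_heads)) for _ in range(n_layers)
    ]
    return ToyModelConfig(dim=dim, layers=layers, rope=ToyRoPEConfig(dim=dim // n_heads))


def describe(config) -> dict:
    """Read attributes by path only; the config's class is never checked."""
    first_attn = config.layers[0].attention
    return {
        "n_layers": len(config.layers),
        "rope_dim": config.rope.dim,
        # Derive the flag from the inner-attention type name, not a boolean field.
        "use_flex_attn": "Flex" in type(first_attn.inner_attention).__name__,
    }


class ForeignSDPA:
    """Stand-in for an inner-attention type owned by another library."""


own = make_toy_config(dim=16, n_layers=2, n_heads=2)

# A foreign object exposing the same attribute paths is accepted as well.
foreign = SimpleNamespace(
    layers=[SimpleNamespace(attention=SimpleNamespace(inner_attention=ForeignSDPA()))],
    rope=SimpleNamespace(dim=4),
)

print(describe(own))
print(describe(foreign))
```

The consumer only depends on attribute paths such as config.rope.dim, so any object with that shape can be passed in without importing the library that defines it.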
fmassa approved these changes May 4, 2026
Yes, it does, and we can now pass the config directly from TorchTitan with no patching needed; the same goes for annotations. Also, by adding a compute dtype we don't have to force the entire model to be bfloat16, and it interfaces nicely with the optimizer in torchtitan as well.
sanketpurandare added a commit that referenced this pull request May 8, 2026
Align DeepSeek V3 test config with TorchTitan shape
This refactors AutoParallel's DeepSeek V3 test model around a hierarchical config shape that mirrors TorchTitan's DeepSeek V3 configuration while keeping make_dsv3_config() as the lightweight constructor for tests and examples. The model now reads layer, attention, RoPE, norm, FFN, and MoE settings through config objects, uses lm_head naming, accepts an optional mesh and compute_dtype, and preserves graph-trainer annotations for module FQNs and expert-parallel regions.
The DS3 implementation now performs linear, RMSNorm, attention, dense FFN, shared expert, and output computations through explicit compute-dtype helpers so the debug model can run in the same bfloat16 style expected by the TorchTitan comparison path. The old FORCE_BALANCED_ROUTING and CPU fill-index path are removed, expert execution unwraps DTensor weights locally where needed, and the local_map example uses MixedPrecisionPolicy plus make_dsv3_config() instead of maintaining a separate flat DeepSeekV3ModelArgs/MoEArgs construction path.
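As a rough illustration of what a compute-dtype helper can look like, the sketch below casts operands per call and leaves the stored parameters alone. The function name linear_in_compute_dtype and its signature are assumptions made for this example, not the helpers added in the PR.

```python
import torch
import torch.nn.functional as F


def linear_in_compute_dtype(x, weight, bias=None, compute_dtype=torch.bfloat16):
    """Hypothetical helper: cast operands for this call only.

    The stored parameters keep their original dtype, so an optimizer can
    continue to update them in float32.
    """
    x = x.to(compute_dtype)
    w = weight.to(compute_dtype)
    b = bias.to(compute_dtype) if bias is not None else None
    return F.linear(x, w, b)


layer = torch.nn.Linear(8, 4)  # parameters are created in float32
x = torch.randn(2, 8)
y = linear_in_compute_dtype(x, layer.weight, layer.bias)

print(y.dtype, layer.weight.dtype)
```

Casting at the call site rather than converting the module is what lets the model compute in bfloat16 without forcing every parameter to bfloat16.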
The local_map example also now binds each rank to its CUDA device before mesh/DTensor work, seeds DTensor RNG state explicitly, initializes NCCL with device_id, runs backward with autograd multithreading disabled, initializes weights on the rank device, and uses a device-specific final barrier. Those changes clean up the DTensor RNG sync, CUDA context/cuBLAS, and NCCL barrier warnings while keeping the example aligned with the real 2D dp/ep sharding constraints.
Two supporting graph/module fixes are included because they are required by the updated DS3 path. AutoParallel functionalizes index_put_ mutations when the mutation target is a fresh non-input tensor before AOT compilation, with tests that ensure input mutations are left alone. Parallel module construction now preserves non-persistent buffer registration when rebuilding sharded modules, so aliased RoPE buffers such as freqs_cis and rope.cache do not reappear in state_dict().
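To make the buffer fix concrete, here is a small sketch of rebuilding a module while keeping each buffer's persistence flag. ToyRope and rebuild are hypothetical names, and the sketch relies on nn.Module's private _non_persistent_buffers_set attribute, which is an implementation detail rather than a public API; AutoParallel's actual rebuild logic may differ.

```python
import torch
from torch import nn


class ToyRope(nn.Module):
    def __init__(self):
        super().__init__()
        # Non-persistent: recomputable cache that should stay out of state_dict().
        self.register_buffer("freqs_cis", torch.zeros(4), persistent=False)
        # Persistent by default.
        self.register_buffer("scale", torch.ones(1))


def rebuild(src: nn.Module) -> nn.Module:
    """Copy buffers into a fresh module, preserving each persistence flag."""
    dst = nn.Module()
    for name, buf in src.named_buffers(recurse=False):
        # _non_persistent_buffers_set is private to nn.Module (assumption: stable enough for a sketch).
        persistent = name not in src._non_persistent_buffers_set
        dst.register_buffer(name, buf.clone(), persistent=persistent)
    return dst


rebuilt = rebuild(ToyRope())
print(sorted(rebuilt.state_dict().keys()))
```

Re-registering every buffer with the default persistent=True is the failure mode being avoided: the cache would then reappear in state_dict() after the module is rebuilt.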
Authored with Claude.