Megatron-Core currently does not support Context Parallelism (CP) with DSAttention. This blocks long-context training and post-training for models that use the experimental DSA path, such as GLM-style sparse attention models. It also affects Megatron Bridge and NeMo-RL usage, since the Megatron Bridge GLM recipe uses experimental_attention_variant == "dsa".
Current Behavior
When experimental_attention_variant == "dsa", transformer_config.py asserts that context_parallel_size == 1:
assert self.context_parallel_size == 1, "Currently context parallelism is not supported by DSAttention!"
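To make the failure mode concrete, here is a minimal stand-in for the check described above. `ConfigSketch` and `try_build` are hypothetical names used only for illustration; they are not the real Megatron-Core config class, and only the two fields and the assertion message are taken from this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in holding only the two fields relevant to this issue."""

    experimental_attention_variant: Optional[str] = None
    context_parallel_size: int = 1

    def __post_init__(self):
        # Mirrors the guard quoted above: DSA plus CP > 1 is rejected up front.
        if self.experimental_attention_variant == "dsa":
            assert self.context_parallel_size == 1, (
                "Currently context parallelism is not supported by DSAttention!"
            )


def try_build(variant, cp_size):
    """Return 'ok' if the config is accepted, else the rejection message."""
    try:
        ConfigSketch(variant, cp_size)
        return "ok"
    except AssertionError as exc:
        return f"rejected: {exc}"


print(try_build("dsa", 1))
print(try_build("dsa", 2))
```

Any recipe that combines the DSA variant with context_parallel_size > 1 therefore fails at config construction, before any model code runs.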
Possible Solution / Reference
One possible reference is the Slime RL PR that added GLM 5.1 support:
THUDM/slime#1599
From a preliminary read, Slime does not use mcore DSAttention directly for that recipe. Instead, it builds mcore GPT/MLA layers but replaces the attention path with a custom GLM5 DSA implementation. The PR also adds a Slime-specific --allgather-cp flag to support Context Parallelism.
This is only a possible design reference, not necessarily a request to implement the Slime approach directly.
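For readers unfamiliar with the general idea, below is a single-process sketch of all-gather-style context parallelism. It assumes, based only on the flag name, that --allgather-cp means each CP rank keeps its local query chunk and all-gathers keys and values from the other ranks; this has not been checked against the Slime code. The sketch uses dense causal attention and leaves out DSA's sparse key selection entirely, and every function name in it is hypothetical.

```python
import math


def causal_attention(q, k, v, q_offset=0):
    """Dense causal softmax attention; q[i] sits at global position q_offset + i."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        limit = q_offset + i + 1  # number of keys visible to this query
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k[:limit]]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # zip stops at len(weights), so only the visible values contribute.
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def split(rows, cp_size):
    """Shard the sequence into contiguous, equally sized chunks, one per rank."""
    chunk = len(rows) // cp_size
    return [rows[r * chunk:(r + 1) * chunk] for r in range(cp_size)]


def all_gather(shards):
    """Simulated all-gather: every rank would receive the full concatenation."""
    return [row for shard in shards for row in shard]


def cp_attention(q, k, v, cp_size):
    """Each simulated rank attends from its local queries to gathered K/V."""
    chunk = len(q) // cp_size
    k_full = all_gather(split(k, cp_size))
    v_full = all_gather(split(v, cp_size))
    out = []
    for rank, q_local in enumerate(split(q, cp_size)):
        out.extend(causal_attention(q_local, k_full, v_full, q_offset=rank * chunk))
    return out


def make_rows(seq_len, dim, phase):
    """Deterministic toy data so the comparison is reproducible."""
    return [[math.sin(phase + i * dim + j) for j in range(dim)] for i in range(seq_len)]


q, k, v = make_rows(8, 4, 0.0), make_rows(8, 4, 1.0), make_rows(8, 4, 2.0)
reference = causal_attention(q, k, v)
sharded = cp_attention(q, k, v, cp_size=4)
max_diff = max(abs(a - b) for ra, rb in zip(reference, sharded) for a, b in zip(ra, rb))
print(max_diff)
```

The point of the comparison is that sharding queries while gathering K/V reproduces the unsharded result as long as the causal mask uses global positions. A real implementation would replace the simulated gather with a collective over the CP group and would have to handle DSA's key selection, which this sketch does not attempt.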