[Feature-request] Context Parallel support for DSAttention #4878

Megatron-Core currently does not support Context Parallelism (CP) with `DSAttention`. This blocks long-context training and post-training for models that use the experimental DSA path, such as GLM-style sparse attention models. It also affects Megatron Bridge and NeMo-RL usage, since the Megatron Bridge GLM recipe sets `experimental_attention_variant` to `"dsa"`.

### Current Behavior

When `experimental_attention_variant == "dsa"`, [transformer_config.py](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/transformer_config.py#L2452) asserts that `context_parallel_size == 1`:

```python
assert self.context_parallel_size == 1, "Currently context parallelism is not supported by DSAttention!"
```
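
For anyone who wants to see the failure mode without a full Megatron setup, here is a minimal standalone sketch of the check. `SketchConfig` and `try_build` are hypothetical names used only for illustration, and running the check in `__post_init__` is an assumption about where the real config validates itself. Only the two field names and the assertion message come from the snippet above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for the real config; only the two relevant fields."""

    experimental_attention_variant: Optional[str] = None
    context_parallel_size: int = 1

    def __post_init__(self):
        # Mirrors the assertion quoted above: DSA is rejected whenever CP > 1.
        if self.experimental_attention_variant == "dsa":
            assert self.context_parallel_size == 1, (
                "Currently context parallelism is not supported by DSAttention!"
            )


def try_build(cp_size: int) -> str:
    """Return 'ok' if the config is accepted, else the rejection message."""
    try:
        SketchConfig(experimental_attention_variant="dsa", context_parallel_size=cp_size)
        return "ok"
    except AssertionError as err:
        return f"rejected: {err}"


if __name__ == "__main__":
    for cp in (1, 2):
        print(cp, try_build(cp))
```

In practice this means any recipe that selects the DSA variant fails at config construction as soon as `context_parallel_size` is raised above 1, before any model code runs.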

### Possible Solution / Reference
One possible reference is the Slime RL PR that added GLM 5.1 support:
https://github.com/THUDM/slime/pull/1599

From a preliminary read, Slime does not use mcore `DSAttention` directly for that recipe. Instead, it builds mcore GPT/MLA layers but replaces the attention path with a custom GLM5 DSA implementation. The PR also adds a Slime-specific [--allgather-cp](https://github.com/THUDM/slime/blob/main/slime/backends/megatron_utils/data.py#L78) flag to support Context Parallelism.

This is only a possible design reference, not necessarily a request to implement the Slime approach directly.
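
To make the discussion concrete, here is a rough sketch of what a generic all-gather style of context parallelism looks like. It is not Slime's implementation and not a proposal for the mcore API. It simulates CP ranks in a single process with plain Python lists, assumes contiguous sequence sharding and dense causal attention, and uses hypothetical function names throughout (`all_gather` stands in for a real collective). The idea is that each rank keeps only its local queries, gathers the full K/V, and applies the causal mask using global token positions.

```python
import math


def causal_attention(q_rows, q_positions, keys, values):
    """Softmax attention where the query at global position p sees keys 0..p."""
    out = []
    dim = len(values[0])
    for q, pos in zip(q_rows, q_positions):
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in range(pos + 1)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * values[j][d] for j, w in enumerate(weights)) / total for d in range(dim)]
        )
    return out


def all_gather(shards):
    """Stand-in for a collective: concatenate per-rank shards in rank order."""
    return [row for shard in shards for row in shard]


def allgather_cp_attention(q, k, v, cp_size):
    """Simulate cp_size ranks, each holding one contiguous slice of the sequence."""
    seq_len = len(q)
    assert seq_len % cp_size == 0, "sequence length must divide evenly across ranks"
    chunk = seq_len // cp_size
    k_shards = [k[r * chunk:(r + 1) * chunk] for r in range(cp_size)]
    v_shards = [v[r * chunk:(r + 1) * chunk] for r in range(cp_size)]

    # Every rank ends up with the full K/V after the gather.
    full_k = all_gather(k_shards)
    full_v = all_gather(v_shards)

    out = []
    for rank in range(cp_size):
        start = rank * chunk
        local_q = q[start:start + chunk]
        positions = list(range(start, start + chunk))  # global positions for masking
        out.extend(causal_attention(local_q, positions, full_k, full_v))
    return out
```

Concatenating the per-rank outputs should match single-device causal attention on the unsharded sequence. A sparse variant would presumably run its key selection over the gathered keys on each rank, but that part is speculation and is not covered by this sketch.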

