[Feature-request] Context Parallel support for DSAttention #4878

Description

@slikhite-1

Megatron-Core currently does not support Context Parallelism (CP) with DSAttention. This blocks long-context training and post-training for models that use the experimental DSA path, such as GLM-style sparse attention models. It also affects Megatron Bridge and NeMo-RL usage, since the Megatron Bridge GLM recipe uses experimental_attention_variant == "dsa".

Current Behavior

When experimental_attention_variant == "dsa", transformer_config.py asserts that context_parallel_size == 1:

assert self.context_parallel_size == 1, "Currently context parallelism is not supported by DSAttention!"
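To make the failure mode concrete, here is a minimal sketch of how this check behaves. `SketchTransformerConfig` and `try_build` are hypothetical stand-ins written for this issue, not Megatron-Core classes; the sketch assumes the assertion fires when the config object is constructed and models only the two fields discussed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchTransformerConfig:
    """Hypothetical stand-in for the real config (assumption, not mcore code)."""

    experimental_attention_variant: Optional[str] = None
    context_parallel_size: int = 1

    def __post_init__(self):
        # Mirrors the assertion quoted above: DSA currently requires CP size 1.
        if self.experimental_attention_variant == "dsa":
            assert self.context_parallel_size == 1, (
                "Currently context parallelism is not supported by DSAttention!"
            )


def try_build(cp_size: int) -> str:
    """Build a DSA config with the given CP size; return 'ok' or the assertion text."""
    try:
        SketchTransformerConfig(
            experimental_attention_variant="dsa",
            context_parallel_size=cp_size,
        )
        return "ok"
    except AssertionError as exc:
        return str(exc)
```

With this sketch, `try_build(1)` succeeds, while any `cp_size > 1` combined with the `"dsa"` variant fails at construction time, before any training step runs.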

Possible Solution / Reference

One possible reference is the Slime RL PR that added GLM 5.1 support:
THUDM/slime#1599

From a preliminary read, Slime does not use mcore DSAttention directly for that recipe. Instead, it builds mcore GPT/MLA layers but replaces the attention path with a custom GLM5 DSA implementation. The PR also adds a Slime-specific --allgather-cp flag to support Context Parallelism.

This is only a possible design reference, not necessarily a request to implement the Slime approach directly.
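For discussion, here is a rough single-process sketch of what an all-gather style CP scheme could look like for sparse attention. It is not taken from the Slime PR or from mcore. All function names are hypothetical, the CP ranks are simulated with plain Python lists, the sequence is sharded contiguously, and the top-k key selection reuses the attention score as a stand-in for the real DSA indexer.

```python
import math


def sparse_causal_attention(q_rows, q_positions, keys, values, top_k):
    """Each query attends to its top_k best keys at or before its global position."""
    dim = len(values[0])
    out = []
    for q, pos in zip(q_rows, q_positions):
        # Score only the causally visible keys (global positions 0..pos).
        scores = [(sum(a * b for a, b in zip(q, keys[j])), j) for j in range(pos + 1)]
        # Sparse selection (stand-in for a learned indexer): keep the top_k scores.
        kept = sorted(scores, reverse=True)[:top_k]
        # Numerically stable softmax over the kept keys only.
        peak = max(s for s, _ in kept)
        weights = [(math.exp(s - peak), j) for s, j in kept]
        total = sum(w for w, _ in weights)
        out.append([sum(w * values[j][d] for w, j in weights) / total for d in range(dim)])
    return out


def allgather_cp_attention(q, k, v, cp_size, top_k):
    """Simulate CP: queries stay sharded, keys and values are gathered on every rank."""
    seq_len = len(q)
    assert seq_len % cp_size == 0, "sketch assumes an evenly divisible sequence"
    chunk = seq_len // cp_size
    shards = [(r * chunk, (r + 1) * chunk) for r in range(cp_size)]
    # "All-gather": rebuild the full K and V from every rank's shard.
    gathered_k = [row for lo, hi in shards for row in k[lo:hi]]
    gathered_v = [row for lo, hi in shards for row in v[lo:hi]]
    out = []
    for lo, hi in shards:
        # Each rank handles only its local queries, but with global positions,
        # so causal masking and top-k selection match the unsharded result.
        out.extend(
            sparse_causal_attention(q[lo:hi], list(range(lo, hi)), gathered_k, gathered_v, top_k)
        )
    return out
```

The point of the sketch is that only the queries remain sharded, so the per-query top-k selection sees the same candidate keys as in the unsharded case. It ignores communication cost, memory for the gathered K/V, and load balancing across ranks under a causal mask, all of which a real implementation would need to address.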

Labels

enhancement