[trainer] feat: support sequence parallel #45

Open
zhtmike wants to merge 11 commits into verl-omni from verl-omni-dev-sp

Conversation


@zhtmike commented Mar 5, 2026

What does this PR do?

  • Support sequence parallel for the Diffusers SDPA attention backend
  • Requires diffusers >= 0.38
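Conceptually, Ulysses SP shards the sequence dimension across SP ranks, and all-to-all collectives around the attention call regroup the shards so each rank attends over the full sequence for a subset of heads. A toy, collective-free sketch of the sharding step (the helper name is hypothetical, not code from this PR):

```python
def ulysses_shard(seq, sp_size, rank):
    """Return the contiguous slice of `seq` held by `rank` under Ulysses SP.

    Toy sketch: real Ulysses exchanges these shards via all-to-all around
    the attention call, so each rank sees the full sequence for a subset
    of attention heads.
    """
    assert len(seq) % sp_size == 0, "sequence length must be divisible by sp_size"
    chunk = len(seq) // sp_size
    return seq[rank * chunk : (rank + 1) * chunk]
```

Concatenating every rank's shard recovers the original sequence, which is what the distributed equivalence tests below rely on.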

Memory:
Screenshot 2026-03-20 9:45:27 AM

Val Reward:
Screenshot 2026-03-20 9:46:04 AM

Performance:
Screenshot 2026-03-20 9:46:47 AM

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementations or new model support), validate by experiment and show results such as training curve plots or evaluation results.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@zhtmike marked this pull request as ready for review March 19, 2026 09:06
@zhtmike requested a review from Copilot March 19, 2026 09:06

Copilot AI left a comment


Pull request overview

Adds Ulysses Sequence Parallel (SP) support for Diffusers-based training in the FSDP engine, including runtime monkey-patches to work around current upstream Diffusers limitations.

Changes:

  • Enable Ulysses SP device mesh initialization and Diffusers context-parallel enablement in DiffusersFSDPEngine.
  • Introduce Diffusers monkey-patches (mesh shape, attention mask shape, attention backward) plus a helper to fix hook mesh wiring.
  • Add distributed equivalence tests (fwd, fwd+bwd, and FSDP-wrapped) and an example training script enabling ulysses_sequence_parallel_size=2.
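The device-mesh initialization mentioned above factorizes the world size into a (dp, ring, ulysses) shape, with data parallelism absorbing whatever ranks remain. A minimal sketch of that computation (an illustrative helper, not the PR's code; ring degree defaults to 1):

```python
def ulysses_mesh_shape(world_size: int, sp_size: int, ring_size: int = 1) -> tuple:
    """Shape of a (dp, ring, ulysses) device mesh: the data-parallel degree
    is whatever is left of the world after the sequence-parallel dimensions
    are carved out."""
    assert world_size % (sp_size * ring_size) == 0, "world size must factor evenly"
    return (world_size // (sp_size * ring_size), ring_size, sp_size)
```

In the engine, a shape like this would presumably feed `torch.distributed.device_mesh.init_device_mesh` with matching mesh dimension names.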

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • verl/workers/engine/fsdp/diffusers_impl.py: Turns on the Ulysses SP path in the Diffusers FSDP engine, pads sequence lengths for SP alignment, and wires Diffusers context-parallel config/hooks.
  • verl/models/diffusers/monkey_patch.py: Adds the monkey-patches and mesh-fix utilities needed for Diffusers Ulysses SP to work.
  • verl/models/diffusers/__init__.py: Exposes the monkey-patch entrypoint as a package API.
  • tests/models/test_diffusers_ulysses.py: Adds multi-GPU distributed tests checking SP vs non-SP forward/backward equivalence (plus an FSDP-wrapped case).
  • examples/flowgrpo_trainer/run_flowgrpo_sp2.sh: Provides a runnable example config for training with ulysses_sequence_parallel_size=2.



Copilot AI left a comment


Pull request overview

Adds Ulysses sequence parallel (SP) support for Diffusers-based training, including device-mesh setup and input sequence alignment, plus accompanying distributed tests and an example launch script.

Changes:

  • Enable Ulysses SP in DiffusersFSDPEngine by creating a (dp, ring, ulysses) device mesh, wiring the SP process group, and padding text embeddings to SP-aligned lengths.
  • Update vLLM-Omni async server to use OmniEngineArgs for engine arg construction.
  • Add distributed tests for Diffusers Ulysses SP forward/backward equivalence and an example script enabling ulysses_sequence_parallel_size=2.
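Padding text embeddings to SP-aligned lengths amounts to rounding the sequence length up to the next multiple of the SP degree, so each rank receives an equal shard. A sketch of the length computation (hypothetical helper name, not the PR's code):

```python
def sp_aligned_len(seq_len: int, sp_size: int) -> int:
    """Smallest length >= seq_len divisible by sp_size, so the sequence
    splits into equal shards across SP ranks."""
    # -(-a // b) is ceiling division using only floor division
    return -(-seq_len // sp_size) * sp_size
```

For example, a 77-token CLIP text embedding would be padded to 78 tokens at SP degree 2 and to 80 tokens at SP degree 4.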

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Summary per file:

  • verl/workers/rollout/vllm_rollout/vllm_omni_async_server.py: Switches the vLLM-Omni engine args type used to build the async engine config.
  • verl/workers/engine/fsdp/diffusers_impl.py: Implements Diffusers Ulysses SP wiring (mesh + group), enables parallelism, and pads text inputs to SP-friendly lengths.
  • tests/models/test_diffusers_ulysses.py: Adds distributed tests validating SP vs non-SP forward/backward behavior (including an FSDP-wrapped case).
  • examples/flowgrpo_trainer/run_flowgrpo_sp2.sh: Example FlowGRPO training invocation with the Diffusers Ulysses SP size set to 2.


Comment on lines +218 to +220
module_sp = AutoModel.from_config(cfg, torch_dtype=torch.bfloat16)
module_sp.enable_parallelism(config=ContextParallelConfig(ulysses_degree=sp_size, mesh=ulysses_device_mesh))
module_sp = module_sp.to(device=device, dtype=torch.bfloat16)

Copilot AI Mar 30, 2026


Same issue as above: ContextParallelConfig(..., mesh=ulysses_device_mesh) is not compatible with current released diffusers APIs and is likely to raise TypeError. It would be better to share a single helper to build the config in a version-tolerant way so all tests stay in sync with the production engine behavior.

Comment on lines +367 to +369
module_sp = AutoModel.from_config(cfg, torch_dtype=torch.bfloat16)
module_sp.enable_parallelism(config=ContextParallelConfig(ulysses_degree=sp_size, mesh=ulysses_device_mesh))
module_sp = module_sp.to(device=device, dtype=torch.bfloat16)

Copilot AI Mar 30, 2026


Same issue as above: ContextParallelConfig(..., mesh=ulysses_device_mesh) may not be a valid constructor call in released diffusers versions and can fail at import/runtime. Consider adding a small compatibility helper (or reusing the engine’s logic) so the test works across supported diffusers versions.

Comment on lines +249 to +251
module.enable_parallelism(
    config=ContextParallelConfig(ulysses_degree=sp_size, mesh=self.ulysses_device_mesh)
)

Copilot AI Mar 30, 2026


ContextParallelConfig construction here passes mesh=..., but in current diffusers releases the public constructor does not accept a mesh kwarg (the mesh is handled internally / via different API). This is likely to raise a TypeError at runtime. To keep compatibility, consider using a version/feature-gated path (e.g., try constructing without mesh and set the mesh via the supported attribute/method, or rely on enable_parallelism() to create the mesh when possible).

Suggested change
- module.enable_parallelism(
-     config=ContextParallelConfig(ulysses_degree=sp_size, mesh=self.ulysses_device_mesh)
- )
+ cp_config = ContextParallelConfig(ulysses_degree=sp_size)
+ # Some diffusers versions do not accept `mesh` in the constructor;
+ # set it via attribute when available for compatibility.
+ if hasattr(cp_config, "mesh"):
+     cp_config.mesh = self.ulysses_device_mesh
+ module.enable_parallelism(config=cp_config)

Comment on lines +110 to +112
module_sp = AutoModel.from_config(cfg, torch_dtype=torch.bfloat16)
module_sp.enable_parallelism(config=ContextParallelConfig(ulysses_degree=sp_size, mesh=ulysses_device_mesh))
module_sp = module_sp.to(device=device, dtype=torch.bfloat16)

Copilot AI Mar 30, 2026


ContextParallelConfig(ulysses_degree=..., mesh=...) assumes the constructor accepts a mesh kwarg, which is not true for current diffusers releases. This will make the test error before it can run. Consider version/feature-gating this and constructing the config using the API supported by released diffusers (or mirroring whatever compatibility shim is used in the engine).
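The version-tolerant construction these comments ask for can be sketched as a small builder that only passes `mesh` when the constructor accepts it, and otherwise falls back to attribute assignment. This is an illustrative helper, not code from the PR; `config_cls` stands in for diffusers' ContextParallelConfig:

```python
import inspect


def build_cp_config(config_cls, ulysses_degree, mesh=None):
    """Build a ContextParallelConfig-like object across library versions.

    Pass `mesh` to the constructor only when its signature accepts it;
    otherwise construct without it and set the attribute when present.
    (Illustrative compatibility shim; `config_cls` is a stand-in for
    diffusers' ContextParallelConfig.)
    """
    params = inspect.signature(config_cls).parameters
    if "mesh" in params:
        return config_cls(ulysses_degree=ulysses_degree, mesh=mesh)
    cfg = config_cls(ulysses_degree=ulysses_degree)
    if mesh is not None and hasattr(cfg, "mesh"):
        cfg.mesh = mesh
    return cfg
```

Sharing one such helper between the engine and the tests would keep both on the same code path, which is the reviewer's suggestion here.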

@zhtmike zhtmike force-pushed the verl-omni branch 6 times, most recently from eb393e7 to 0cbbe96 Compare April 2, 2026 01:48
@niehen6174

I wasn’t able to find your contact email, so I’m reaching out here.

I’m interested in learning more about the technical details and discussing contributions.

My email is 1639206518@qq.com. I’d greatly appreciate it if you could reach out. I’d be happy to connect and discuss further.

