Fix GRPO weight-sync hangs and HCCL resource exhaustion on NPU #102
addsubmuldiv merged 11 commits into modelscope:main from
Conversation
Summary of Changes

This pull request addresses several critical stability issues in the GRPO training flow's model-to-vLLM sampler weight synchronization path, particularly on NPU platforms. The changes aim to eliminate intermittent hangs, reduce HCCL initialization failures due to resource exhaustion, resolve shared memory cleanup errors, and fix shutdown warnings. By refactoring the metadata communication protocol, reusing HCCL communicators, and hardening IPC and memory management, the PR significantly improves the reliability, observability, and debuggability of the weight-sync pipeline.
Code Review
This pull request significantly improves the stability and robustness of the weight synchronization process for NPU environments by replacing the unreliable ZMQ PUB/SUB protocol with a more robust REQ/REP handshake, adding timeouts, and enhancing the vLLM IPC mechanism. However, it introduces critical security vulnerabilities due to insecure deserialization; specifically, the use of ZMQ's recv_pyobj() (which uses Python's pickle module) for inter-process communication over TCP and IPC sockets poses a significant risk of Remote Code Execution (RCE). This should be addressed by switching to a secure serialization format like JSON for metadata exchange. Additionally, an inconsistency was found in hccl_checkpoint_engine.py where the receiver supports chunking but the sender does not, which needs to be aligned.
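The review's suggested fix is to exchange metadata as JSON rather than pickled Python objects. As a minimal, hypothetical sketch (the field names `name`/`dtype`/`shape`/`bucket_id` are illustrative, not the PR's actual schema), the helpers below show why JSON closes the RCE hole: `json.loads` only builds plain data types, whereas `recv_pyobj()` calls `pickle.loads`, which can execute arbitrary code from a malicious peer.

```python
import json

# Hedged sketch: serializing weight-sync metadata as JSON instead of pickle.
# The schema below is illustrative, not the PR's actual metadata format.

def encode_metadata(meta: dict) -> bytes:
    # JSON restricts the payload to plain data types (dict/list/str/int/...),
    # so the wire format cannot smuggle in objects that run code on load.
    return json.dumps(meta).encode("utf-8")

def decode_metadata(payload: bytes) -> dict:
    # Unlike ZMQ's recv_pyobj() (which calls pickle.loads), json.loads
    # never constructs arbitrary Python objects.
    return json.loads(payload.decode("utf-8"))
```

With a ZMQ socket this would be used as `sock.send(encode_metadata(meta))` on one side and `decode_metadata(sock.recv())` on the other, replacing `send_pyobj`/`recv_pyobj`.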
Pull request overview
This PR fixes multiple stability issues in the GRPO weight-sync pipeline, specifically targeting the model → vLLM sampler weight transfer path on Ascend NPU (HCCL) and the vLLM IPC path:
Changes:
- Refactored the HCCL checkpoint engine's metadata control plane from unreliable ZMQ PUB/SUB to strict REQ/REP handshakes with per-bucket synchronization, preventing message loss and bucket-step desynchronization across ranks.
- Added weight chunking support to handle weights larger than the bucket size, replacing the previous hard error for oversized weights, and set the NPU default to `rebuild_group=False` to avoid HCCL resource exhaustion from repeated communicator rebuilds.
- Improved vLLM IPC robustness with per-sync unique ZMQ endpoints (avoiding cross-actor collisions), configurable timeouts, hardened SHM cleanup, and proper device binding with `local_rank`.
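The per-bucket REQ/REP synchronization in the first change can be illustrated with a small sketch. This is not the PR's code: the transport is modeled with `queue.Queue` where the real engine uses ZMQ REQ/REP sockets, and the function names are invented. The point is the protocol shape: each bucket transfer is preceded by an explicit request keyed by `bucket_id`, so a lost or stale message surfaces as a timeout or mismatch instead of silently desynchronizing ranks.

```python
import queue

# Hedged sketch of a per-bucket REQ/REP handshake. Transport is modeled with
# queue.Queue; the actual implementation uses ZMQ sockets.

class BucketDesyncError(RuntimeError):
    pass

def serve_bucket(req_q: queue.Queue, rep_q: queue.Queue, bucket_id: int, meta: dict) -> None:
    # Sender side: wait for the receiver to ask for exactly this bucket.
    request = req_q.get(timeout=5.0)
    if request["bucket_id"] != bucket_id:
        raise BucketDesyncError(
            f"peer asked for bucket {request['bucket_id']}, serving {bucket_id}")
    rep_q.put({"bucket_id": bucket_id, "meta": meta})

def fetch_bucket(req_q: queue.Queue, rep_q: queue.Queue,
                 bucket_id: int, timeout_s: float = 5.0) -> dict:
    # Receiver side: request the bucket, then block (bounded) for the reply.
    req_q.put({"bucket_id": bucket_id})
    try:
        reply = rep_q.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(f"no reply for bucket {bucket_id} within {timeout_s}s")
    if reply["bucket_id"] != bucket_id:
        raise BucketDesyncError(
            f"expected bucket {bucket_id}, got {reply['bucket_id']}")
    return reply["meta"]
```

Contrast with PUB/SUB: a subscriber that joins a beat late simply misses the message; here the receiver either gets its bucket, a desync error, or a timeout it can log.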
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| `src/twinkle/utils/framework.py` | Fixes import path from `.platform` to `.platforms` for GPU/NPU device classes |
| `src/twinkle/checkpoint_engine/mixin.py` | Sets NPU default `rebuild_group=False` with env-var rollback (`TWINKLE_CKPT_HCCL_REBUILD_GROUP`) |
| `src/twinkle/checkpoint_engine/hccl_checkpoint_engine.py` | Major refactor: PUB/SUB → REQ/REP metadata protocol, chunked weight transfer, timeout-based error handling, ZMQ context lifecycle management |
| `src/twinkle/sampler/vllm_sampler/vllm_engine.py` | Per-sync unique IPC endpoints, configurable IPC timeouts, chunked weight transfer, IPC socket cleanup |
| `src/twinkle/sampler/vllm_sampler/vllm_worker_extension.py` | Fixed device binding, chunk reassembly, SHM cleanup hardening, timeout-based error handling |
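Several of the files above add chunked weight transfer for weights larger than the staging bucket. A minimal sketch of that idea (names are illustrative; the real code works with device buffers, not plain bytes): the sender splits the payload into pieces no larger than the bucket size, and the receiver reassembles them in order, validating the total length before rebuilding the tensor.

```python
# Hedged sketch of chunked weight transfer: split a payload into
# <= bucket_size pieces and reassemble with a length check, instead of
# raising a hard error for oversized weights. Names are illustrative.

def split_into_chunks(payload: bytes, bucket_size: int) -> list[bytes]:
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    return [payload[i:i + bucket_size] for i in range(0, len(payload), bucket_size)]

def reassemble(chunks: list[bytes], expected_len: int) -> bytes:
    out = b"".join(chunks)
    if len(out) != expected_len:
        # Catches dropped or duplicated chunks before the tensor is rebuilt.
        raise RuntimeError(f"reassembled {len(out)} bytes, expected {expected_len}")
    return out
```

The review comment about sender/receiver asymmetry matters precisely here: if only the receiver understands chunking, an oversized weight still fails on the send side.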
PR type
Problem Statement

In the GRPO training flow, the model -> vLLM sampler weight-sync path had multiple stability issues:

- Intermittent hangs in the metadata exchange on `recv_pyobj()`.
- HCCL initialization failures (`not found specific resource` / `link establish timeout`).
- Shared memory cleanup errors (`BufferError: cannot close exported pointers exist`).
- Warnings during shutdown.

Root Causes

- The metadata control plane used ZMQ PUB/SUB, which can lose messages under process timing jitter, causing bucket-step desynchronization across ranks.

Key Changes
HCCL metadata protocol refactor (control plane)
- Replaced the PUB/SUB metadata flow with a strict REQ/REP handshake keyed by `bucket_id`.

File:
Default communicator reuse on NPU
- Defaults to `rebuild_group=False` (with env-based rollback).

File:
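A hypothetical sketch of how such an env-gated default could look (the helper name and the assumption that non-NPU platforms keep rebuilding are mine, not the PR's code): the `TWINKLE_CKPT_HCCL_REBUILD_GROUP` variable overrides the platform default, giving an escape hatch if communicator reuse misbehaves.

```python
import os

# Hedged sketch of the rollback switch: reuse the HCCL communicator on NPU by
# default; TWINKLE_CKPT_HCCL_REBUILD_GROUP=1 restores rebuild-per-sync.
# Helper name and the non-NPU default are assumptions for illustration.

def should_rebuild_group(is_npu: bool) -> bool:
    env = os.environ.get("TWINKLE_CKPT_HCCL_REBUILD_GROUP")
    if env is not None:
        # Explicit override wins on any platform.
        return env == "1"
    # Reusing the communicator on NPU avoids exhausting HCCL link resources
    # through repeated init/teardown cycles.
    return not is_npu
```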
vLLM IPC robustness improvements
- Generates a per-sync unique `zmq_handle` to avoid endpoint collisions across actors.

File:
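One way to get collision-free endpoints is to derive them from a fresh UUID per sync. The endpoint format below is illustrative only; the PR's actual `zmq_handle` scheme may differ.

```python
import uuid

# Hedged sketch: a per-sync, per-actor IPC endpoint. A fixed path would let
# two actors (or two consecutive syncs) bind the same socket and hang.

def make_zmq_handle(actor_id: str) -> str:
    # uuid4 makes collisions across actors and across retries effectively
    # impossible; the /tmp path and naming are illustrative.
    return f"ipc:///tmp/twinkle_sync_{actor_id}_{uuid.uuid4().hex}.sock"
```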
Worker-side receive and memory handling fixes
- Binds each worker to its device via `local_rank` (e.g. `torch.get_device(local_rank)`).
- Hardened SHM cleanup to eliminate the `BufferError` risk.

File:
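The `BufferError: cannot close exported pointers exist` mentioned in the problem statement is what CPython raises when a shared-memory segment is closed while views into its buffer are still alive. A stdlib-only sketch of the hardening pattern (function name and flow are illustrative, not the PR's code): release every exported view before calling `close()`.

```python
from multiprocessing import shared_memory

# Hedged sketch of SHM cleanup hardening: release all memoryviews into the
# shared buffer before close(), otherwise close() raises
# "BufferError: cannot close exported pointers exist".

def consume_and_release(shm_name: str, nbytes: int) -> bytes:
    shm = shared_memory.SharedMemory(name=shm_name)  # attach, don't create
    view = memoryview(shm.buf)[:nbytes]
    try:
        data = bytes(view)   # copy the payload out while the view is alive
    finally:
        view.release()       # drop the exported pointer first...
        shm.close()          # ...so close() cannot raise BufferError
    # unlink() is left to the segment's creator.
    return data
```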
Compatibility & Config
- `TWINKLE_CKPT_HCCL_REBUILD_GROUP=1`: roll back to rebuilding the HCCL communicator group on each sync.
- `TWINKLE_CKPT_HCCL_META_TIMEOUT_S`: timeout (seconds) for the HCCL metadata handshake.
- `TWINKLE_VLLM_IPC_TIMEOUT_S`: timeout (seconds) for vLLM IPC operations.

Outcome