
Fix GRPO weight-sync hangs and HCCL resource exhaustion on NPU#102

Merged
addsubmuldiv merged 11 commits into modelscope:main from addsubmuldiv:npu_grpo_fix_0303
Mar 5, 2026

Conversation

@addsubmuldiv (Collaborator) commented Mar 5, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

Problem Statement

In the GRPO training flow, the model -> vLLM sampler weight-sync path had multiple stability issues:

  1. Intermittent hangs, often stuck at worker-side recv_pyobj().
  2. HCCL init failures in long runs (errors such as "not found specific resource" and "link establish timeout").
  3. Occasional SHM cleanup error: BufferError: cannot close exported pointers exist.
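The BufferError in item 3 is Python's standard complaint when a shared-memory mapping is closed while a buffer view exported from it is still alive. A minimal stdlib reproduction (independent of this PR's code) shows the failure mode and the fix order:

```python
import mmap

# Minimal repro of the SHM cleanup failure: closing a mapping while a
# memoryview still references it raises BufferError.
m = mmap.mmap(-1, 16)    # anonymous mapping, stand-in for the SHM tensor buffer
view = memoryview(m)     # an "exported pointer", e.g. a tensor view kept alive

try:
    m.close()            # fails: the view still exports the buffer
except BufferError as e:
    print(e)             # cannot close exported pointers exist

view.release()           # drop the export first...
m.close()                # ...then close succeeds
```

Hardened cleanup therefore means releasing (or garbage-collecting) every view of the shared buffer before closing it.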

Root Causes

  1. Metadata control plane used PUB/SUB, which can lose messages under process timing jitter, causing bucket-step desynchronization across ranks.
  2. NPU path rebuilt HCCL communicator every sync; repeated stream/channel allocation eventually exhausted driver resources.
  3. SHM tensor lifecycle/reference cleanup was not strict enough, leaving exported pointers alive on close.

Key Changes

  1. HCCL metadata protocol refactor (control plane)

    • Replaced PUB/SUB metadata flow with strict REQ/REP handshake keyed by bucket_id.
    • Added explicit timeout and error context to avoid silent infinite blocking.
    • File: hccl_checkpoint_engine.py
  2. Default communicator reuse on NPU

    • Set NPU default to rebuild_group=False (with env-based rollback).
    • File: mixin.py
  3. vLLM IPC robustness improvements

    • Use per-sync unique zmq_handle to avoid endpoint collisions across actors.
    • Add IPC timeout config.
    • File: vllm_engine.py
  4. Worker-side receive and memory handling fixes

    • Fix device binding (torch.get_device(local_rank)).
    • Support chunk metadata reassembly.
    • Keep SHM path on CPU (avoid accidental extra NPU memory spikes).
    • Harden SHM cleanup to reduce BufferError risk.
    • File: vllm_worker_extension.py
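The strict per-bucket handshake in change 1 can be sketched with stdlib queues standing in for the ZMQ REQ/REP sockets. This is a model of the protocol, not the PR's code; names such as META_TIMEOUT_S, the ack convention, and the None sentinel are illustrative:

```python
import queue
import threading

META_TIMEOUT_S = 5.0  # stand-in for a TWINKLE_CKPT_HCCL_META_TIMEOUT_S setting

def send_metadata(req_q, rep_q, buckets):
    """Sender side: each bucket's metadata must be acked before the next send."""
    for bucket_id, meta in buckets:
        req_q.put((bucket_id, meta))              # REQ: ship bucket metadata
        ack = rep_q.get(timeout=META_TIMEOUT_S)   # REP: block for the ack
        if ack != bucket_id:                      # lockstep check: no silent skew
            raise RuntimeError(f"bucket desync: sent {bucket_id}, acked {ack}")

def recv_metadata(req_q, rep_q, received):
    """Receiver side: ack exactly the bucket_id that was received."""
    while True:
        bucket_id, meta = req_q.get(timeout=META_TIMEOUT_S)
        rep_q.put(bucket_id)
        if meta is None:                          # sentinel marks the last bucket
            return
        received.append((bucket_id, meta))

req_q, rep_q, received = queue.Queue(), queue.Queue(), []
t = threading.Thread(target=recv_metadata, args=(req_q, rep_q, received))
t.start()
send_metadata(req_q, rep_q, [(0, {"shape": [4096]}), (1, {"shape": [8192]}), (2, None)])
t.join()
```

Unlike PUB/SUB, where a message dropped before the subscriber connects is simply gone, a lost message here surfaces as a timeout exception after META_TIMEOUT_S instead of a permanent hang.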

Compatibility & Config

  • Behavior change: NPU checkpoint HCCL group is reused by default.
  • Roll back to old behavior (rebuild every sync) via:
    • TWINKLE_CKPT_HCCL_REBUILD_GROUP=1
  • Optional timeout configs:
    • TWINKLE_CKPT_HCCL_META_TIMEOUT_S
    • TWINKLE_VLLM_IPC_TIMEOUT_S

Outcome

  • Eliminates silent hangs caused by metadata loss/desync.
  • Significantly reduces long-run HCCL failures caused by resource exhaustion.
  • Fixes or reduces SHM close exceptions.
  • Improves observability and debuggability of the GRPO weight-sync pipeline.

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses several critical stability issues in the GRPO training flow's model-to-vLLM sampler weight synchronization path, particularly on NPU platforms. The changes aim to eliminate intermittent hangs, reduce HCCL initialization failures due to resource exhaustion, resolve shared memory cleanup errors, and fix shutdown warnings. By refactoring the metadata communication protocol, reusing HCCL communicators, and hardening IPC and memory management, the PR significantly improves the reliability, observability, and debuggability of the weight-sync pipeline.

Highlights

  • HCCL Metadata Protocol Refactor: The metadata control plane for HCCL was refactored from an unreliable PUB/SUB pattern to a strict REQ/REP handshake, keyed by bucket ID, to prevent message loss and desynchronization.
  • NPU Communicator Reuse: The default behavior for NPU checkpoint HCCL groups was changed to reuse communicators across syncs, significantly reducing resource exhaustion and init failures in long-running processes. This can be overridden via an environment variable.
  • vLLM IPC Robustness Improvements: Inter-process communication (IPC) for vLLM was enhanced by using unique ZMQ handles per sync to avoid endpoint collisions, adding configurable IPC timeouts, and ensuring shutdown compatibility with both synchronous and asynchronous vLLM implementations.
  • Worker-side Memory and Device Handling: Fixed device binding for NPU workers, added support for reassembling chunked metadata, ensured shared memory (SHM) operations remain on CPU to avoid NPU memory spikes, and hardened SHM cleanup to reduce BufferError risks.
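The "unique ZMQ handles per sync" idea amounts to deriving a fresh ipc:// endpoint for every synchronization round. A hypothetical sketch (the path layout, naming scheme, and cleanup helper are illustrative, not the PR's actual functions):

```python
import os
import uuid

def make_ipc_endpoint(tmp_dir: str = "/tmp") -> str:
    # Fresh endpoint per sync: pid plus a random suffix means two actors (or
    # two consecutive syncs of one actor) can never bind the same socket file.
    handle = f"twinkle_ipc_{os.getpid()}_{uuid.uuid4().hex}"
    return f"ipc://{os.path.join(tmp_dir, handle)}.sock"

def cleanup_ipc_endpoint(endpoint: str) -> None:
    # Best-effort removal of the socket file once the sync completes.
    path = endpoint[len("ipc://"):]
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass
```

With a fixed endpoint name, a second actor's bind either fails or silently talks to the wrong peer; per-sync handles sidestep both, at the cost of needing the explicit cleanup the PR also adds.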


Changelog
  • src/twinkle/checkpoint_engine/hccl_checkpoint_engine.py
    • Replaced ZMQ PUB/SUB with REQ/REP for metadata exchange, adding explicit timeouts and error context.
    • Removed the BroadcastOperation class, integrating its logic directly into send_weights and receive_weights.
    • Updated send_weights and receive_weights to handle chunked metadata and use the new REQ/REP protocol.
    • Improved finalize method to properly terminate the ZMQ context.
  • src/twinkle/checkpoint_engine/mixin.py
    • Configured HCCLCheckpointEngine to reuse the communicator by default on NPU, with an environment variable (TWINKLE_CKPT_HCCL_REBUILD_GROUP) to allow overriding this behavior.
  • src/twinkle/infra/_ray/resource_manager.py
    • Enhanced resource assertion error messages to provide more context and hints, especially when NPU resources are available but GPU is requested.
  • src/twinkle/sampler/vllm_sampler/vllm_engine.py
    • Added TWINKLE_VLLM_IPC_BUCKET_MB environment variable for configurable bucket size.
    • Implemented per-sync unique ZMQ IPC handles to prevent collisions across actors.
    • Configured ZMQ sockets with send/receive timeouts using TWINKLE_VLLM_IPC_TIMEOUT_S.
    • Modified weight streaming to support chunking of large tensors across buckets.
    • Added cleanup logic for IPC socket files upon completion.
  • src/twinkle/sampler/vllm_sampler/vllm_worker_extension.py
    • Corrected NPU device binding logic to use torch.get_device(local_rank) for accurate device assignment.
    • Integrated ZMQ timeouts for IPC communication.
    • Modified _broadcast_obj to handle chunked weight metadata and reassemble tensors.
    • Improved shared memory (shm) cleanup with retries and warnings for BufferError.
  • src/twinkle/utils/framework.py
    • Updated module import paths for GPU and NPU platforms from twinkle.utils.platform to twinkle.utils.platforms.
@gemini-code-assist bot left a comment

Code Review

This pull request significantly improves the stability and robustness of the weight synchronization process for NPU environments by replacing the unreliable ZMQ PUB/SUB protocol with a more robust REQ/REP handshake, adding timeouts, and enhancing the vLLM IPC mechanism. However, it introduces critical security vulnerabilities due to insecure deserialization; specifically, the use of ZMQ's recv_pyobj() (which uses Python's pickle module) for inter-process communication over TCP and IPC sockets poses a significant risk of Remote Code Execution (RCE). This should be addressed by switching to a secure serialization format like JSON for metadata exchange. Additionally, an inconsistency was found in hccl_checkpoint_engine.py where the receiver supports chunking but the sender does not, which needs to be aligned.
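The reviewer's core concern is that recv_pyobj() deserializes with pickle, which can execute arbitrary code when fed a crafted payload; bucket metadata is plain data, so it can round-trip through JSON instead. A minimal sketch of the suggested direction (field names here are illustrative, not the PR's schema):

```python
import json

# Metadata for a weight bucket is plain data, so JSON round-trips it safely;
# loading a crafted pickle, by contrast, can run attacker code during load.
meta = {"bucket_id": 3, "dtype": "bfloat16", "shape": [4096, 4096]}

wire = json.dumps(meta).encode("utf-8")   # the bytes a socket send would carry
decoded = json.loads(wire)                # safe: parsing never executes code
assert decoded == meta
```

With pyzmq specifically, Socket.send_json()/recv_json() wrap exactly this pattern as a drop-in replacement for send_pyobj()/recv_pyobj() when the payload is JSON-serializable.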

@hjh0119 hjh0119 self-assigned this Mar 5, 2026
@addsubmuldiv addsubmuldiv marked this pull request as ready for review March 5, 2026 02:19
Copilot AI review requested due to automatic review settings March 5, 2026 02:19

Copilot AI left a comment

Pull request overview

This PR fixes multiple stability issues in the GRPO weight-sync pipeline, specifically targeting the model → vLLM sampler weight transfer path on Ascend NPU (HCCL) and the vLLM IPC path:

Changes:

  • Refactored the HCCL checkpoint engine's metadata control plane from unreliable ZMQ PUB/SUB to strict REQ/REP handshakes with per-bucket synchronization, preventing message loss and bucket-step desynchronization across ranks.
  • Added weight chunking support to handle weights larger than the bucket size, replacing the previous hard error for oversized weights, and set NPU default to rebuild_group=False to avoid HCCL resource exhaustion from repeated communicator rebuilds.
  • Improved vLLM IPC robustness with per-sync unique ZMQ endpoints (avoiding cross-actor collisions), configurable timeouts, hardened SHM cleanup, and proper device binding with local_rank.
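The chunking behavior described above (split weights larger than the bucket size on the sender, reassemble on the worker) can be sketched as follows; the field names chunk_idx/num_chunks are illustrative, not the PR's wire format:

```python
def chunk_payload(data: bytes, bucket_size: int):
    # Split an oversized weight into bucket-sized pieces carrying enough
    # metadata to reassemble them, instead of hard-erroring on size.
    num_chunks = max(1, -(-len(data) // bucket_size))  # ceiling division
    for i in range(num_chunks):
        yield {"chunk_idx": i, "num_chunks": num_chunks,
               "payload": data[i * bucket_size:(i + 1) * bucket_size]}

def reassemble(chunks):
    # Worker side: order by index, verify completeness, then concatenate.
    chunks = sorted(chunks, key=lambda c: c["chunk_idx"])
    if len(chunks) != chunks[0]["num_chunks"]:
        raise ValueError("missing chunk(s) during reassembly")
    return b"".join(c["payload"] for c in chunks)

blob = bytes(range(256)) * 100            # 25,600 bytes of fake weight data
assert reassemble(chunk_payload(blob, bucket_size=4096)) == blob
```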

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/twinkle/utils/framework.py Fixes import path from .platform to .platforms for GPU/NPU device classes
src/twinkle/checkpoint_engine/mixin.py Sets NPU default rebuild_group=False with env-var rollback (TWINKLE_CKPT_HCCL_REBUILD_GROUP)
src/twinkle/checkpoint_engine/hccl_checkpoint_engine.py Major refactor: PUB/SUB → REQ/REP metadata protocol, chunked weight transfer, timeout-based error handling, ZMQ context lifecycle management
src/twinkle/sampler/vllm_sampler/vllm_engine.py Per-sync unique IPC endpoints, configurable IPC timeouts, chunked weight transfer, IPC socket cleanup
src/twinkle/sampler/vllm_sampler/vllm_worker_extension.py Fixed device binding, chunk reassembly, SHM cleanup hardening, timeout-based error handling

@addsubmuldiv addsubmuldiv merged commit 3d5d60c into modelscope:main Mar 5, 2026
1 of 3 checks passed
